CLBlast

Commit Graph

Author	SHA1	Message	Date
Cedric Nugteren	99a4df88a6	Implemented the in-direct version of the strided-batched GEMM kernel	2018-01-08 21:07:01 +01:00
Cedric Nugteren	13f0f6fc6e	Implemented direct version of strided-batched GEMM kernel	2018-01-07 14:58:45 +01:00
Cedric Nugteren	7f893a85d9	Revert "Added options to disable parts of the invert kernel to find out where the AMD compiler crashes" This reverts commit `407ed52cec`.	2017-12-31 16:10:40 +01:00
Cedric Nugteren	69226ae828	Changed the invert kernel slightly; added part1a/part1b disable-defines	2017-12-31 14:07:08 +01:00
Cedric Nugteren	7ce415b927	Fixed ifdef's into ifndef's	2017-12-30 21:17:31 +01:00
Cedric Nugteren	407ed52cec	Added options to disable parts of the invert kernel to find out where the AMD compiler crashes	2017-12-30 21:07:50 +01:00
Cedric Nugteren	2b9bf3a9aa	Simplified invert kernel a little	2017-12-27 17:03:06 +01:00
Cedric Nugteren	736399e528	Split the invert kernel in two parts to prevent error C1091 in MSVC 2013	2017-12-23 14:18:07 +01:00
Cedric Nugteren	07a7012b0d	Added skeleton for a tuner for the invert kernel	2017-12-19 21:10:48 +01:00
Cedric Nugteren	b4d3a50f19	Split GEMM kernel in 4 files instead of 3 due to MSVC 2013 string length limit	2017-12-10 16:09:09 +01:00
Cedric Nugteren	9f02fb542c	Completed kernel modifications for pre-processor of all other kernels	2017-12-09 20:44:21 +01:00
Cedric Nugteren	02c0d64037	Modified the direct GEMM kernel to support array-to-register promotion	2017-12-09 14:53:10 +01:00
Cedric Nugteren	23e3a85f2c	Reformatted GEMM kernel to support array-to-register promotion	2017-12-09 14:09:13 +01:00
Cedric Nugteren	d9df62b794	Fixed defines parsing and substituting in pre-processor; fixed some variable names in kernels	2017-12-09 10:49:55 +01:00
Cedric Nugteren	540896476d	Added register promotion to the main GEMM kernel	2017-12-07 22:05:29 +01:00
Cedric Nugteren	0f9637bbac	Improved array-to-register promotion, now handling function calls as well	2017-12-05 20:39:49 +01:00
Cedric Nugteren	cf4555d1f4	Added GEMM (direct and in-direct) to the pre-processor testing; modified the loops in kernel accordingly	2017-12-03 16:40:36 +01:00
Cedric Nugteren	0a1a3de58a	Added basic bracket parsing in defines and loop expressions	2017-12-03 16:39:22 +01:00
Cedric Nugteren	60312e5878	Reformatted transpose kernels for the pre-processor; extended the amount of tests	2017-12-03 12:00:37 +01:00
Cedric Nugteren	93ffb876c6	Reformatted unrollable kernel loops and added the new promote_to_registers pragma for several kernels	2017-11-29 20:21:08 +01:00
Cedric Nugteren	69aa3b35ed	Implemented first simple pre-processor: defines parser and loop unrolling based on assumptions	2017-11-25 17:46:01 +01:00
Cedric Nugteren	f349731d54	CUDA kernel compilation fixes	2017-10-17 19:53:09 +02:00
Cedric Nugteren	d62823f067	Added a missing OpenCL-to-CUDA function translation	2017-10-15 19:53:52 +02:00
Cedric Nugteren	7663cba234	Fixes for the CUDA API: first tests pass and the client runs	2017-10-15 17:43:20 +02:00
Cedric Nugteren	55a802c63d	Fixed a kernel/attribute order bug in the direct GEMM kernels	2017-10-14 17:21:34 +02:00
Cedric Nugteren	b06bc01da9	Make local memory pointers a define in OpenCL; some fixes to the recently changed transpose kernel code	2017-10-14 17:13:54 +02:00
Cedric Nugteren	d9456306e0	Made transpose kernel struct init proper according to the C standard	2017-10-14 16:48:06 +02:00
Cedric Nugteren	313fc796b2	Fixed several (not all) CUDA kernel compilation issues	2017-10-14 16:01:12 +02:00
Cedric Nugteren	54d0c440ce	Various fixes to make the host code and sample compile with the CUDA API	2017-10-14 11:43:57 +02:00
Cedric Nugteren	2d7b648a24	Added OpenCL to CUDA translation header for the kernels	2017-10-14 10:49:25 +02:00
Cedric Nugteren	375193fe4e	Gemm in-direct implementation now uses only 1 larger instead of max 3 optional temporary buffers	2017-10-03 21:55:21 +02:00
Cedric Nugteren	8905da259d	Fixed a modulo and division issue manifesting on Apple OpenCL for im2col	2017-09-05 18:49:23 +02:00
Cedric Nugteren	297159d5b9	Fixed a bug in im2col: process only valid channel IDs	2017-08-31 21:58:12 +02:00
Cedric Nugteren	6194d43efb	Fixed a bug in im2col confusing first and second workgroup size; made im2col kernel 2d instead of 3d	2017-08-31 20:34:10 +02:00
Cedric Nugteren	4d9d03ba51	Completed im2col implementation	2017-08-24 21:11:12 +02:00
Cedric Nugteren	803ca781f9	First version of im2col kernel, unoptimized but working	2017-08-19 18:25:13 +02:00
Cedric Nugteren	442c31dd50	Made the inline keyword in kernels optional; currently only enabled for NVIDIA and ARM GPUs	2017-07-08 17:12:16 +02:00
Cedric Nugteren	4cf516cfec	Fixed an if-statement in the direct GEMM kernel causing a bug with specific sets of input parameters	2017-06-30 21:57:41 +02:00
Cedric Nugteren	512b83dbad	Fixed a missing synchronization barrier in the invert kernel; fixes TRSM tests	2017-05-14 20:27:35 +02:00
Cedric Nugteren	f151e56daa	Added the IxAMIN routines: absolute minimum version of IxAMAX	2017-05-12 20:01:33 -07:00
Cedric Nugteren	10205d773e	Added a new Xaxpy kernel in between the regular and fast version in	2017-04-14 20:16:10 +02:00
Cedric Nugteren	d28ee082b0	Uses float2 and double2 for base complex data-types instead of a custom struct; fixes bug on Apple OpenCL	2017-04-07 07:35:15 +02:00
Cedric Nugteren	c27d2f0c1e	Added an (optional) non-direct implementation of the batched GEMM routine	2017-03-19 16:04:04 +01:00
Cedric Nugteren	2fd04dae83	Added batched versions of the pad/copy/transpose kernels	2017-03-19 15:57:44 +01:00
Cedric Nugteren	7b8f8fce68	Added initial naive version of the batched GEMM routine based on the direct GEMM kernel	2017-03-11 16:02:45 +01:00
Cedric Nugteren	d754586b49	Added proper testing of the alpha parameter; finalized the batched AXPY implementation	2017-03-10 20:49:59 +01:00
Cedric Nugteren	878d93e7dc	Implemented a batched version of the AXPY kernel	2017-03-08 20:36:35 +01:00
Cedric Nugteren	fa0a9c689f	Make batched routines based on offsets instead of a vector of cl_mem objects - undoing many earlier changes	2017-03-08 20:10:20 +01:00
Cedric Nugteren	e993ee077b	Added a proper data-preparation function for the TRSM tests	2017-03-04 15:21:33 +01:00
Cedric Nugteren	df7638c305	Fixed an out-of-bounds memory access when filling a matrix with a constant	2017-02-26 14:31:05 +01:00
Cedric Nugteren	a433987441	Fixes division in the kernel for inversion of complex numbers	2017-02-26 10:18:45 +01:00
Cedric Nugteren	e47d95887c	Added PrepareData function for TRSM to create proper test input	2017-02-25 12:23:04 +01:00
Cedric Nugteren	c248f900c0	Merge branch 'development' into triangular_solvers	2017-02-05 22:18:59 +01:00
Cedric Nugteren	e7cbb5915a	Fixed complex version of the TRSV kernel	2017-02-05 14:36:31 +01:00
Cedric Nugteren	c209dd7af9	Improved substitution kernels a bit; added complex support	2017-02-04 22:48:06 +01:00
Cedric Nugteren	fec8c1a806	Completed a first STRSV implementation	2017-02-04 16:04:19 +01:00
Cedric Nugteren	7c73ceb095	Added first (incomplete) version of TRSV routine	2017-01-29 17:02:00 +01:00
Cedric Nugteren	df9a77d74d	Added first version of the TRSM routine based on the diagonal invert kernel	2017-01-18 21:29:59 +01:00
Cedric Nugteren	4b3ffd9989	Added a first version of the diagonal block invert routine in preparation of TRSM	2017-01-15 17:30:00 +01:00
Cedric Nugteren	69ca271a8c	Always enables cl_khr_fp64 when running double-precision, not just for OpenCL 1.1 or lower	2017-01-07 13:31:29 +01:00
Cedric Nugteren	6b533dda1c	Fixed a bug when using offsets in the direct GEMM kernels	2016-12-18 11:54:32 +01:00
Cedric Nugteren	9b596820d2	Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters (2)	2016-10-22 10:50:12 +02:00
Cedric Nugteren	db17b1fbe9	Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters	2016-10-22 10:41:02 +02:00
Cedric Nugteren	7052a00a3e	Fixed a const-correctness issue with complex conjugation in the GEMM direct kernel	2016-10-03 20:13:19 +02:00
Cedric Nugteren	ca0c075de2	Added functions to load from off-chip to local memory without vector loads for the GEMM direct kernels	2016-10-03 20:09:15 +02:00
Cedric Nugteren	c1c4bc5d20	Re-organised GEMM direct kernel and added faster fall-back version for incomplete rectangles	2016-10-03 19:32:01 +02:00
Cedric Nugteren	d8827e908c	Specialised the GEMM direct kernel in four ways for transposing/non-transposing: NN, NT, TN, TT	2016-10-02 17:59:05 +02:00
Cedric Nugteren	61f489e370	Split the GEMM direct kernel into two files; set the default tuning target to 256-256-256	2016-10-02 15:06:59 +02:00
Cedric Nugteren	a459920105	Added padding to the local memory of the GEMM direct kernel	2016-10-01 16:58:53 +02:00
Cedric Nugteren	73d135c2ce	Added a first version of a tuner for the GEMM direct kernel; collapsed MWGD, NWGD and KWGD into one WGD parameter	2016-09-25 14:48:34 +02:00
Cedric Nugteren	669f43aed6	Separated the tuning parameters of the new direct GEMM kernel from the indirect version	2016-09-25 13:52:08 +02:00
Cedric Nugteren	140dc12854	Added a first version of the direct version of GEMM with local memory	2016-09-25 11:38:35 +02:00
Cedric Nugteren	6aa652d6ea	Merge branch 'development' into gemm_direct	2016-09-21 21:32:18 +02:00
Cedric Nugteren	4ce584a014	Split the XGEMM kernel further up: now in 3 parts. This is done because MSVC can't handle long strings	2016-09-12 22:13:16 +02:00
Cedric Nugteren	b30b26b89e	The GEMM kernel no longer adds beta*C in case beta is zero; this would cause problems if C contains NaNs	2016-09-04 17:21:16 +02:00
Cedric Nugteren	6eca53ee23	Merge branch 'master' of https://github.com/dvasschemacq/CLBlast into dvasschemacq-master. Conflicts: src/kernels/level1/xaxpy.opencl, src/kernels/level2/xgemv.opencl, src/kernels/level2/xgemv_fast.opencl, src/kernels/level2/xger.opencl, src/kernels/level2/xher.opencl, src/kernels/level2/xher2.opencl, src/kernels/level3/xgemm_part2.opencl	2016-08-20 12:50:31 +02:00
D. Van Assche	57f1aa7685	Adapt opencl files for 1.1 OpenCL. In OpenCL 1.1 __kernel has to be before __attribute__, at least with Vivante compiler.	2016-08-18 17:33:13 +02:00
Cedric Nugteren	5053f6ebc6	Merge branch 'development' into gemm_direct	2016-07-26 20:53:31 +02:00
Cedric Nugteren	40a72259eb	Fixed a bug in the new XgemvFastRot kernel related to local memory size	2016-07-23 16:58:11 +02:00
Cedric Nugteren	7a4f963763	Further improvements to the XgemvFastRot kernel, properly enables coalescing now	2016-07-23 14:52:32 +02:00
Cedric Nugteren	75fe8235f7	Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance	2016-07-23 10:20:11 +02:00
Cedric Nugteren	798d32edad	Improved the GEMM direct kernel by adding register blocking. Still not fast though	2016-07-17 14:36:51 +02:00
Cedric Nugteren	eaa348735e	Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder	2016-07-16 15:18:28 +02:00
Cedric Nugteren	c87e877bf2	Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel	2016-07-10 20:32:01 +02:00
Cedric Nugteren	52ccaf5b25	Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing	2016-06-16 18:07:46 +02:00
Cedric Nugteren	b894611ad1	Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately	2016-06-14 18:17:58 +02:00
Cedric Nugteren	6925003e45	Added global memory synchronisation for better cache performance on ARM Mali GPUs	2016-06-08 10:13:37 +02:00
Cedric Nugteren	c8ff3f143f	Prepared the GER kernels and tuner for half-precision support	2016-05-22 16:18:08 +02:00
Cedric Nugteren	88551b4005	Prepared the GEMV kernels and tuner for half-precision support	2016-05-22 15:22:54 +02:00
Cedric Nugteren	489c5d76cf	Merged in latest changes from 0.7.1 release	2016-05-18 21:32:56 +02:00
Cedric Nugteren	af2ac62212	Prepared GEMM and supporting kernels and tuners for half-precision support	2016-05-16 12:37:24 +02:00
Cedric Nugteren	5e1b2e021f	Set kernel arguments for AXPY as constant memory buffers, making it possible to transfer half-precision values as well	2016-05-14 18:06:00 +02:00
Cedric Nugteren	120c31a30f	Initial experimental version of the half-precision HAXPY routine	2016-05-13 20:49:34 +02:00
Cedric Nugteren	f2ba75890c	Initial changes in preparation for half-precision fp16 support	2016-05-12 19:56:21 +02:00
cnugteren	25a25dbd6f	Fixed errors in xAXPY and xSCAL tests on AMD hardware	2016-05-08 17:30:31 +02:00
Cedric Nugteren	e113ff0852	Added non-absolute minimum counter-part IxMIN of the BLAS routine IxAMAX	2016-04-30 09:49:39 +02:00
Cedric Nugteren	d7ddbdeb1f	Added non-absolute counter-parts xSUM and IxMAX of the BLAS routines xASUM and IxAMAX	2016-04-27 18:07:30 +02:00
cnugteren	16a048f1ac	Added support for the iSAMAX/iDAMAX/iCAMAX/iZAMAX routines	2016-04-20 22:12:51 -06:00
cnugteren	8be99de82d	Added support for the SASUM/DASUM/ScASUM/DzASUM routines	2016-04-14 19:58:26 -06:00
cnugteren	5409f349a1	Fixed the nrm2 kernel for complex data-types	2016-03-30 21:32:04 -07:00

196 Commits (6e2ab6ee967c4a9b3350c7ce4e7d7b736c9e45f6)