Cedric Nugteren
|
99a4df88a6
|
Implemented the in-direct version of the strided-batched GEMM kernel
|
2018-01-08 21:07:01 +01:00 |
Cedric Nugteren
|
13f0f6fc6e
|
Implemented direct version of strided-batched GEMM kernel
|
2018-01-07 14:58:45 +01:00 |
Cedric Nugteren
|
7f893a85d9
|
Revert "Added options to disable parts of the invert kernel to find out where the AMD compiler crashes"
This reverts commit 407ed52cec.
|
2017-12-31 16:10:40 +01:00 |
Cedric Nugteren
|
69226ae828
|
Changed the invert kernel slightly; added part1a/part1b disable-defines
|
2017-12-31 14:07:08 +01:00 |
Cedric Nugteren
|
7ce415b927
|
Fixed ifdef's into ifndef's
|
2017-12-30 21:17:31 +01:00 |
Cedric Nugteren
|
407ed52cec
|
Added options to disable parts of the invert kernel to find out where the AMD compiler crashes
|
2017-12-30 21:07:50 +01:00 |
Cedric Nugteren
|
2b9bf3a9aa
|
Simplified invert kernel a little
|
2017-12-27 17:03:06 +01:00 |
Cedric Nugteren
|
736399e528
|
Split the invert kernel in two parts to prevent error C1091 in MSVC 2013
|
2017-12-23 14:18:07 +01:00 |
Cedric Nugteren
|
07a7012b0d
|
Added skeleton for a tuner for the invert kernel
|
2017-12-19 21:10:48 +01:00 |
Cedric Nugteren
|
b4d3a50f19
|
Split GEMM kernel in 4 files instead of 3 due to MSVC 2013 string length limit
|
2017-12-10 16:09:09 +01:00 |
Cedric Nugteren
|
9f02fb542c
|
Completed kernel modifications for pre-processor of all other kernels
|
2017-12-09 20:44:21 +01:00 |
Cedric Nugteren
|
02c0d64037
|
Modified the direct GEMM kernel to support array-to-register promotion
|
2017-12-09 14:53:10 +01:00 |
Cedric Nugteren
|
23e3a85f2c
|
Reformatted GEMM kernel to support array-to-register promotion
|
2017-12-09 14:09:13 +01:00 |
Cedric Nugteren
|
d9df62b794
|
Fixed defines parsing and substituting in pre-processor; fixed some variable names in kernels
|
2017-12-09 10:49:55 +01:00 |
Cedric Nugteren
|
540896476d
|
Added register promotion to the main GEMM kernel
|
2017-12-07 22:05:29 +01:00 |
Cedric Nugteren
|
0f9637bbac
|
Improved array-to-register promotion, now handling function calls as well
|
2017-12-05 20:39:49 +01:00 |
Cedric Nugteren
|
cf4555d1f4
|
Added GEMM (direct and in-direct) to the pre-processor testing; modified the loops in kernel accordingly
|
2017-12-03 16:40:36 +01:00 |
Cedric Nugteren
|
0a1a3de58a
|
Added basic bracket parsing in defines and loop expressions
|
2017-12-03 16:39:22 +01:00 |
Cedric Nugteren
|
60312e5878
|
Reformatted transpose kernels for the pre-processor; extended the amount of tests
|
2017-12-03 12:00:37 +01:00 |
Cedric Nugteren
|
93ffb876c6
|
Reformatted unrollable kernel loops and added the new promote_to_registers pragma for several kernels
|
2017-11-29 20:21:08 +01:00 |
Cedric Nugteren
|
69aa3b35ed
|
Implemented first simple pre-processor: defines parser and loop unrolling based on assumptions
|
2017-11-25 17:46:01 +01:00 |
Cedric Nugteren
|
f349731d54
|
CUDA kernel compilation fixes
|
2017-10-17 19:53:09 +02:00 |
Cedric Nugteren
|
d62823f067
|
Added a missing OpenCL-to-CUDA function translation
|
2017-10-15 19:53:52 +02:00 |
Cedric Nugteren
|
7663cba234
|
Fixes for the CUDA API: first tests pass and the client runs
|
2017-10-15 17:43:20 +02:00 |
Cedric Nugteren
|
55a802c63d
|
Fixed a kernel/attribute order bug in the direct GEMM kernels
|
2017-10-14 17:21:34 +02:00 |
Cedric Nugteren
|
b06bc01da9
|
Make local memory pointers a define in OpenCL; some fixes to the recently changed transpose kernel code
|
2017-10-14 17:13:54 +02:00 |
Cedric Nugteren
|
d9456306e0
|
Made transpose kernel struct init proper according to the C standard
|
2017-10-14 16:48:06 +02:00 |
Cedric Nugteren
|
313fc796b2
|
Fixed several (not all) CUDA kernel compilation issues
|
2017-10-14 16:01:12 +02:00 |
Cedric Nugteren
|
54d0c440ce
|
Various fixes to make the host code and sample compile with the CUDA API
|
2017-10-14 11:43:57 +02:00 |
Cedric Nugteren
|
2d7b648a24
|
Added OpenCL to CUDA translation header for the kernels
|
2017-10-14 10:49:25 +02:00 |
Cedric Nugteren
|
375193fe4e
|
GEMM in-direct implementation now uses only 1 larger temporary buffer instead of up to 3 optional ones
|
2017-10-03 21:55:21 +02:00 |
Cedric Nugteren
|
8905da259d
|
Fixed a modulo and division issue manifesting on Apple OpenCL for im2col
|
2017-09-05 18:49:23 +02:00 |
Cedric Nugteren
|
297159d5b9
|
Fixed a bug in im2col: process only valid channel IDs
|
2017-08-31 21:58:12 +02:00 |
Cedric Nugteren
|
6194d43efb
|
Fixed a bug in im2col confusing first and second workgroup size; made im2col kernel 2d instead of 3d
|
2017-08-31 20:34:10 +02:00 |
Cedric Nugteren
|
4d9d03ba51
|
Completed im2col implementation
|
2017-08-24 21:11:12 +02:00 |
Cedric Nugteren
|
803ca781f9
|
First version of im2col kernel, unoptimized but working
|
2017-08-19 18:25:13 +02:00 |
Cedric Nugteren
|
442c31dd50
|
Made the inline keyword in kernels optional; currently only enabled for NVIDIA and ARM GPUs
|
2017-07-08 17:12:16 +02:00 |
Cedric Nugteren
|
4cf516cfec
|
Fixed an if-statement in the direct GEMM kernel causing a bug with specific sets of input parameters
|
2017-06-30 21:57:41 +02:00 |
Cedric Nugteren
|
512b83dbad
|
Fixed a missing synchronization barrier in the invert kernel; fixes TRSM tests
|
2017-05-14 20:27:35 +02:00 |
Cedric Nugteren
|
f151e56daa
|
Added the IxAMIN routines: absolute minimum version of IxAMAX
|
2017-05-12 20:01:33 -07:00 |
Cedric Nugteren
|
10205d773e
|
Added a new Xaxpy kernel in between the regular and fast version in
|
2017-04-14 20:16:10 +02:00 |
Cedric Nugteren
|
d28ee082b0
|
Uses float2 and double2 for base complex data-types instead of a custom struct; fixes bug on Apple OpenCL
|
2017-04-07 07:35:15 +02:00 |
Cedric Nugteren
|
c27d2f0c1e
|
Added an (optional) non-direct implementation of the batched GEMM routine
|
2017-03-19 16:04:04 +01:00 |
Cedric Nugteren
|
2fd04dae83
|
Added batched versions of the pad/copy/transpose kernels
|
2017-03-19 15:57:44 +01:00 |
Cedric Nugteren
|
7b8f8fce68
|
Added initial naive version of the batched GEMM routine based on the direct GEMM kernel
|
2017-03-11 16:02:45 +01:00 |
Cedric Nugteren
|
d754586b49
|
Added proper testing of the alpha parameter; finalized the batched AXPY implementation
|
2017-03-10 20:49:59 +01:00 |
Cedric Nugteren
|
878d93e7dc
|
Implemented a batched version of the AXPY kernel
|
2017-03-08 20:36:35 +01:00 |
Cedric Nugteren
|
fa0a9c689f
|
Make batched routines based on offsets instead of a vector of cl_mem objects - undoing many earlier changes
|
2017-03-08 20:10:20 +01:00 |
Cedric Nugteren
|
e993ee077b
|
Added a proper data-preparation function for the TRSM tests
|
2017-03-04 15:21:33 +01:00 |
Cedric Nugteren
|
df7638c305
|
Fixed an out-of-bounds memory access when filling a matrix with a constant
|
2017-02-26 14:31:05 +01:00 |
Cedric Nugteren
|
a433987441
|
Fixes division in the kernel for inversion of complex numbers
|
2017-02-26 10:18:45 +01:00 |
Cedric Nugteren
|
e47d95887c
|
Added PrepareData function for TRSM to create proper test input
|
2017-02-25 12:23:04 +01:00 |
Cedric Nugteren
|
c248f900c0
|
Merge branch 'development' into triangular_solvers
|
2017-02-05 22:18:59 +01:00 |
Cedric Nugteren
|
e7cbb5915a
|
Fixed complex version of the TRSV kernel
|
2017-02-05 14:36:31 +01:00 |
Cedric Nugteren
|
c209dd7af9
|
Improved substitution kernels a bit; added complex support
|
2017-02-04 22:48:06 +01:00 |
Cedric Nugteren
|
fec8c1a806
|
Completed a first STRSV implementation
|
2017-02-04 16:04:19 +01:00 |
Cedric Nugteren
|
7c73ceb095
|
Added first (incomplete) version of TRSV routine
|
2017-01-29 17:02:00 +01:00 |
Cedric Nugteren
|
df9a77d74d
|
Added first version of the TRSM routine based on the diagonal invert kernel
|
2017-01-18 21:29:59 +01:00 |
Cedric Nugteren
|
4b3ffd9989
|
Added a first version of the diagonal block invert routine in preparation of TRSM
|
2017-01-15 17:30:00 +01:00 |
Cedric Nugteren
|
69ca271a8c
|
Always enables cl_khr_fp64 when running double-precision, not just for OpenCL 1.1 or lower
|
2017-01-07 13:31:29 +01:00 |
Cedric Nugteren
|
6b533dda1c
|
Fixed a bug when using offsets in the direct GEMM kernels
|
2016-12-18 11:54:32 +01:00 |
Cedric Nugteren
|
9b596820d2
|
Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters (2)
|
2016-10-22 10:50:12 +02:00 |
Cedric Nugteren
|
db17b1fbe9
|
Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
|
2016-10-22 10:41:02 +02:00 |
Cedric Nugteren
|
7052a00a3e
|
Fixed a const-correctness issue with complex conjugation in the GEMM direct kernel
|
2016-10-03 20:13:19 +02:00 |
Cedric Nugteren
|
ca0c075de2
|
Added functions to load from off-chip to local memory without vector loads for the GEMM direct kernels
|
2016-10-03 20:09:15 +02:00 |
Cedric Nugteren
|
c1c4bc5d20
|
Re-organised GEMM direct kernel and added faster fall-back version for incomplete rectangles
|
2016-10-03 19:32:01 +02:00 |
Cedric Nugteren
|
d8827e908c
|
Specialised the GEMM direct kernel in four ways for transposing/non-transposing: NN, NT, TN, TT
|
2016-10-02 17:59:05 +02:00 |
Cedric Nugteren
|
61f489e370
|
Split the GEMM direct kernel into two files; set the default tuning target to 256-256-256
|
2016-10-02 15:06:59 +02:00 |
Cedric Nugteren
|
a459920105
|
Added padding to the local memory of the GEMM direct kernel
|
2016-10-01 16:58:53 +02:00 |
Cedric Nugteren
|
73d135c2ce
|
Added a first version of a tuner for the GEMM direct kernel; collapsed MWGD, NWGD and KWGD into one WGD parameter
|
2016-09-25 14:48:34 +02:00 |
Cedric Nugteren
|
669f43aed6
|
Separated the tuning parameters of the new direct GEMM kernel from the indirect version
|
2016-09-25 13:52:08 +02:00 |
Cedric Nugteren
|
140dc12854
|
Added a first version of the direct version of GEMM with local memory
|
2016-09-25 11:38:35 +02:00 |
Cedric Nugteren
|
6aa652d6ea
|
Merge branch 'development' into gemm_direct
|
2016-09-21 21:32:18 +02:00 |
Cedric Nugteren
|
4ce584a014
|
Split the XGEMM kernel further up: now in 3 parts. This is done because MSVC can't handle long strings
|
2016-09-12 22:13:16 +02:00 |
Cedric Nugteren
|
b30b26b89e
|
The GEMM kernel no longer adds beta*C in case beta is zero; this would cause problems if C contains NaNs
|
2016-09-04 17:21:16 +02:00 |
Cedric Nugteren
|
6eca53ee23
|
Merge branch 'master' of https://github.com/dvasschemacq/CLBlast into dvasschemacq-master
Conflicts:
src/kernels/level1/xaxpy.opencl
src/kernels/level2/xgemv.opencl
src/kernels/level2/xgemv_fast.opencl
src/kernels/level2/xger.opencl
src/kernels/level2/xher.opencl
src/kernels/level2/xher2.opencl
src/kernels/level3/xgemm_part2.opencl
|
2016-08-20 12:50:31 +02:00 |
D. Van Assche
|
57f1aa7685
|
Adapt OpenCL files for OpenCL 1.1
In OpenCL 1.1, __kernel has to come before __attribute__, at least with
the Vivante compiler.
|
2016-08-18 17:33:13 +02:00 |
Cedric Nugteren
|
5053f6ebc6
|
Merge branch 'development' into gemm_direct
|
2016-07-26 20:53:31 +02:00 |
Cedric Nugteren
|
40a72259eb
|
Fixed a bug in the new XgemvFastRot kernel related to local memory size
|
2016-07-23 16:58:11 +02:00 |
Cedric Nugteren
|
7a4f963763
|
Further improvements to the XgemvFastRot kernel, properly enables coalescing now
|
2016-07-23 14:52:32 +02:00 |
Cedric Nugteren
|
75fe8235f7
|
Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance
|
2016-07-23 10:20:11 +02:00 |
Cedric Nugteren
|
798d32edad
|
Improved the GEMM direct kernel by adding register blocking. Still not fast though
|
2016-07-17 14:36:51 +02:00 |
Cedric Nugteren
|
eaa348735e
|
Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder
|
2016-07-16 15:18:28 +02:00 |
Cedric Nugteren
|
c87e877bf2
|
Now passing alpha/beta to the kernel as arguments, as before fp16 support; in the fp16 case, arguments are cast on the host and in the kernel
|
2016-07-10 20:32:01 +02:00 |
Cedric Nugteren
|
52ccaf5b25
|
Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing
|
2016-06-16 18:07:46 +02:00 |
Cedric Nugteren
|
b894611ad1
|
Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately
|
2016-06-14 18:17:58 +02:00 |
Cedric Nugteren
|
6925003e45
|
Added global memory synchronisation for better cache performance on ARM Mali GPUs
|
2016-06-08 10:13:37 +02:00 |
Cedric Nugteren
|
c8ff3f143f
|
Prepared the GER kernels and tuner for half-precision support
|
2016-05-22 16:18:08 +02:00 |
Cedric Nugteren
|
88551b4005
|
Prepared the GEMV kernels and tuner for half-precision support
|
2016-05-22 15:22:54 +02:00 |
Cedric Nugteren
|
489c5d76cf
|
Merged in latest changes from 0.7.1 release
|
2016-05-18 21:32:56 +02:00 |
Cedric Nugteren
|
af2ac62212
|
Prepared GEMM and supporting kernels and tuners for half-precision support
|
2016-05-16 12:37:24 +02:00 |
Cedric Nugteren
|
5e1b2e021f
|
Set kernel arguments for AXPY as constant memory buffers, making it possible to transfer half-precision values as well
|
2016-05-14 18:06:00 +02:00 |
Cedric Nugteren
|
120c31a30f
|
Initial experimental version of the half-precision HAXPY routine
|
2016-05-13 20:49:34 +02:00 |
Cedric Nugteren
|
f2ba75890c
|
Initial changes in preparation for half-precision fp16 support
|
2016-05-12 19:56:21 +02:00 |
cnugteren
|
25a25dbd6f
|
Fixed errors in xAXPY and xSCAL tests on AMD hardware
|
2016-05-08 17:30:31 +02:00 |
Cedric Nugteren
|
e113ff0852
|
Added non-absolute minimum counter-part IxMIN of the BLAS routine IxAMAX
|
2016-04-30 09:49:39 +02:00 |
Cedric Nugteren
|
d7ddbdeb1f
|
Added non-absolute counter-parts xSUM and IxMAX of the BLAS routines xASUM and IxAMAX
|
2016-04-27 18:07:30 +02:00 |
cnugteren
|
16a048f1ac
|
Added support for the iSAMAX/iDAMAX/iCAMAX/iZAMAX routines
|
2016-04-20 22:12:51 -06:00 |
cnugteren
|
8be99de82d
|
Added support for the SASUM/DASUM/ScASUM/DzASUM routines
|
2016-04-14 19:58:26 -06:00 |
cnugteren
|
5409f349a1
|
Fixed the nrm2 kernel for complex data-types
|
2016-03-30 21:32:04 -07:00 |