Cedric Nugteren
3d0c227fa5
AMAX/AMIN integer testing and bug fixes (#457)
* Fixed a bug in XAMAX/XAMIN routines that caused the increment and offset to be included in the result
* Perform proper integer-output testing in XAMAX tests
* A few changes towards getting it ready for a PR
* Also fix compilation for clBLAS and cuBLAS references
* Fix a bug that would only use the real part of complex numbers in the amax/amin routines
* A few small fixes related to the AMAX tests
2023-05-07 20:02:52 +02:00
Angus, Alexander
73f49e9b3d
Updated according to feedback from CNugteren
2023-01-17 08:35:29 -08:00
Angus, Alexander
4f394608a2
implemented changes to boost Adreno performance according to https://jira-dc.qualcomm.com/jira/browse/OSR-8731
2023-01-03 10:56:04 -08:00
Cedric Nugteren
38fa34b432
Fix typo in comment
Resolves https://github.com/CNugteren/CLBlast/issues/440
2022-06-24 09:32:47 +02:00
Justin Graham
ba254d2f50
sum fix
2022-04-22 11:39:38 -05:00
Cedric Nugteren
b46853660e
Made it more likely (but no guarantees) for amax/amin to return the first index
2020-03-08 11:26:49 +01:00
etomzak
9560193a9e
Fix out-of-bounds read/write in XhadFaster
Fix an error in XhadFaster where data would be written beyond the end of zgm.
The kernel loop assumed that there was always enough work for each thread to
process WPT items, but this was not enforced. It's possible to detect the
overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be
~500 (much larger than the normal 127).
This commit may improve the performance of XhadFaster, since the kernel was
performing 2x work in some cases (once over real data, once over garbage).
Courtesy of Codeplay Software Ltd.
2019-09-04 12:55:25 +01:00
Cedric Nugteren
3f9d7bca22
Fixed a bug in the absolute-min index kernel
2019-05-19 14:00:18 +02:00
Cedric Nugteren
9cbffc9b7c
Changed back to cl_intel_subgroups as suggested
2019-05-08 22:01:56 +02:00
Cedric Nugteren
c6ba86cdc3
Enabled avc_motion_estimation extension for Intel subgroup shuffling
2019-05-07 20:47:31 +02:00
Koichi Akabe
301dc280df
Fix xconvgemm kernel and enable ConvGemmMethod::kSingleKernel
2018-12-18 13:56:00 +09:00
Koichi Akabe
a646d6ca46
Remove unnecessary attribute of inline function
2018-11-19 13:03:50 +09:00
Koichi Akabe
032e3b0cc0
Add kernel_mode option to im2col, col2im, and convgemm functions
2018-11-12 10:12:07 +09:00
Cedric Nugteren
6f67525ea6
Changed col2im to append to the existing im-buffer
2018-11-07 19:45:07 +01:00
Cedric Nugteren
2d32a23293
Added new col2im routine to the documentation
2018-11-01 21:46:19 +01:00
Koichi Akabe
0b3d04f709
Fix col2im implementation
2018-10-30 14:54:55 +09:00
Cedric Nugteren
d45911b61d
Added groundwork for col2im algorithm plus first non-working version of kernel and test
2018-10-23 20:52:25 +02:00
Cedric Nugteren
9a1454496d
Fixed a bug with the pre-processing and the AXPY kernel
2018-10-17 21:15:53 +02:00
Cedric Nugteren
664a238adf
Fixed a bug in the XaxpyFaster kernel for specific parameters
2018-10-15 20:08:29 +02:00
Cedric Nugteren
634b2bc75c
Merge pull request #319 from CNugteren/convgemm_multi_kernel
First im2col+GEMM implementation of convolution
2018-10-14 17:27:45 +02:00
Cedric Nugteren
1736c0cef4
Fixed pre-processor warnings related to the subgroup shuffling
2018-10-10 19:12:42 +02:00
Cedric Nugteren
83ba3d4b7b
Merge branch 'master' into convgemm_multi_kernel
2018-09-16 20:01:18 +02:00
Cedric Nugteren
0f6dd01e51
Fixed an MSVC compilation error due to large strings
2018-09-15 19:58:07 +02:00
Cedric Nugteren
51cc346751
Fixed issues with GEMMK=1 kernel and the pre-processor
2018-09-15 16:50:34 +02:00
Cedric Nugteren
c788e040f7
Added xCONVGEMM as im2col plus a batched GEMM kernel
2018-09-07 22:02:44 +02:00
Cedric Nugteren
5903820ba2
Merge branch 'master' into CLBlast-267-convgemm
2018-07-29 10:26:34 +02:00
Cedric Nugteren
0f0baa561b
Disabled the use of staggered indices on AMD GPUs for the new GEMMK == 1 kernels to improve performance
2018-07-28 14:36:33 +02:00
Cedric Nugteren
03bed8633e
Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
2018-07-27 23:08:49 +02:00
Tyler Sorensen
0772d63498
moved a two-line macro to a single line
2018-07-16 20:12:30 -04:00
Tyler Sorensen
7709a7308b
Applied feedback from Cedric from first pull request
2018-07-14 19:50:47 -04:00
Tyler Sorensen
7f2e98a140
added inline ptx to support shuffle on Nvidia GPUs
2018-07-11 15:12:22 -04:00
Cedric Nugteren
1c9a741470
Merge branch 'master' into CLBlast-267-convgemm
2018-06-03 15:53:27 +02:00
Cedric Nugteren
e609220393
Some potential fixes for error -54 when launching TRSV and TRSM kernels
2018-05-31 20:09:49 +02:00
Cedric Nugteren
838422fbb1
Further implemented single-kernel approach of convgemm; extended test to capture other parts of the kernel code
2018-05-21 11:47:16 +02:00
Cedric Nugteren
5d87abf780
Added method selection option to switch between im2col and single-kernel approach for convgemm
2018-05-21 11:28:11 +02:00
Cedric Nugteren
37cabd4f1f
Moved new convgemm kernel to levelx kernel folder
2018-05-19 21:05:45 +02:00
Cedric Nugteren
27b52ac2c8
Second version of direct reading from image tensor for convgemm: also with local memory support now
2018-05-19 21:02:44 +02:00
Cedric Nugteren
e057a9186a
First version of direct reading from image tensor for convgemm: only for edge cases now
2018-05-17 09:23:28 +01:00
Cedric Nugteren
0cb9580042
Created a dedicated convgemm GEMM kernel as a copy of the batched direct gemm kernel
2018-05-13 22:10:21 +02:00
Cedric Nugteren
ad8f1027ab
Plugged in the code of strided-batched-gemm into convgemm in preparation for a new kernel
2018-05-13 21:01:46 +02:00
Cedric Nugteren
2965b87dda
Added Intel subgroup shuffle support to the 2D register caching GEMM kernel
2018-04-24 21:32:42 +02:00
Cedric Nugteren
a93fec1026
Fixed issues with the pre-processor
2018-04-08 18:02:44 +02:00
Cedric Nugteren
3519d32ac4
Extended the GEMM tuner to be able to tune the new 'kernel 1'
2018-04-07 17:05:44 +02:00
Cedric Nugteren
381f1fe67a
Fixed a compilation issue for complex datatypes and vload
2018-04-07 16:57:36 +02:00
Cedric Nugteren
2a29dc061c
Fixed a compilation issue for complex datatypes and vload
2018-04-06 21:06:13 +02:00
Cedric Nugteren
eae25f5727
Added first version of 2D register tiling kernel with A and C transposed as well
2018-04-03 21:18:40 +02:00
Cedric Nugteren
1cbe2ea301
Removed arrays as function argument from GEMM kernels for Vivante OpenCL compiler
2018-03-23 20:29:20 +01:00
Cedric Nugteren
52791bf355
Fixed a failing TRSM test using a CPU with Apple OpenCL
2018-03-15 21:09:52 +01:00
Cedric Nugteren
7a756cbce7
Fixed a failing TRSV test using a CPU with Apple OpenCL
2018-03-15 20:58:42 +01:00
Cedric Nugteren
69ed46c8da
Implemented the XHAD Hadamard product routine
2018-02-02 21:18:37 +01:00