Cedric Nugteren
0ee39af5ed
Add tuning results for TITAN RTX
2020-10-10 13:01:12 +02:00
Cedric Nugteren
481d86665f
Add tuning results for Radeon RX Vega
2020-10-10 12:56:28 +02:00
Pradeep Garigipati
dff65e9217
Add a cautionary note in Program::GetIR and mention the fix in CHANGELOG
2020-06-07 21:13:33 +05:30
Pradeep Garigipati
aec71699f8
Fix Program::GetIR to handle programs with multiple devices
2020-06-05 12:00:45 +05:30
Cedric Nugteren
c369cf1a16
Increase display width of the local/global sizes
2020-05-11 20:26:33 +02:00
Cedric Nugteren
4a6c7c37a3
Made sure that the global workgroup size is a multiple of the local size in the tuners
2020-05-10 20:28:23 +02:00
Cedric Nugteren
69a4b4d4b0
Added logging of local/global workgroup sizes when running the tuners
2020-05-10 20:08:28 +02:00
Cedric Nugteren
0870e76fba
Updated PyCLBlast version number
2020-05-10 14:55:03 +02:00
Cedric Nugteren
0b7ce8033c
Added a sample to demonstrate a batched routine
2020-05-10 14:54:50 +02:00
Cedric Nugteren
b94e81af10
Added pyclblast bindings for the 3 batched routines
2020-05-10 12:26:25 +02:00
Cedric Nugteren
bbb2031bf3
Move queue creation out of the tuner loop
2020-05-03 20:30:55 +02:00
Cedric Nugteren
b46853660e
Made it more likely (but no guarantees) for amax/amin to return the first index
2020-03-08 11:26:49 +01:00
Cedric Nugteren
e3ce88154a
Silenced a new OpenCL warning message
2020-03-08 10:14:59 +01:00
Cedric Nugteren
49eb490ee1
Catches all exceptions of the tuners
2020-02-17 22:07:51 +01:00
Tarmo Räntilä
21b66ca761
Reduce TestMatrix calls for xgemmstridedbatched.
Replace the looped test by a single one with the offset of the last batch.
2019-12-09 22:17:24 +02:00
Tarmo Räntilä
bf50c4e53e
Reduce TestMatrix calls for xgemmbatched.
Replace the looped test by a single one with the maximal found offset.
2019-12-09 22:13:52 +02:00
etomzak
9560193a9e
Fix out-of-bounds read/write in XhadFaster
Fix an error in XhadFaster where data would be written beyond the end of zgm.
The kernel loop assumed that there was always enough work for each thread to
process WPT items, but this was not enforced. It's possible to detect the
overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be
~500 (much larger than the normal 127).
This commit may improve the performance of XhadFaster, since the kernel was
performing 2x work in some cases (once over real data, once over garbage).
Courtesy of Codeplay Software Ltd.
2019-09-04 12:55:25 +01:00
Cedric Nugteren
3f9d7bca22
Fixed a bug in the absolute-min index kernel
2019-05-19 14:00:18 +02:00
Cedric Nugteren
af6a9eedd1
Added a function to set the OpenCL kernel standard, either 1.1 or 1.2
2019-05-11 20:39:00 +02:00
Cedric Nugteren
9cbffc9b7c
Changed back to cl_intel_subgroups as suggested
2019-05-08 22:01:56 +02:00
Cedric Nugteren
c5a82f6978
Added a host-code check to make sure the avc_motion_estimation extension is available
2019-05-07 20:47:50 +02:00
Cedric Nugteren
c6ba86cdc3
Enabled avc_motion_estimation extension for Intel subgroup shuffling
2019-05-07 20:47:31 +02:00
Umar Arshad
cf4907942c
Remove assert for extension not available in macOS
The cl_nv_device_attribute_query extension is not available on the
Apple platform. This caused failures during debug builds at runtime.
2019-05-03 23:28:07 -04:00
Cedric Nugteren
7084311e45
Added tuning parameters for Tesla P100 16GB
2019-02-09 16:31:48 +01:00
Cedric Nugteren
1035e533cd
Added tuning parameters for Xeon E5-2630 v3 and v4
2019-02-09 16:29:30 +01:00
Cedric Nugteren
e0541c41a1
Added fp32 to fp16 conversion function in Python to make haxpy example work
2019-01-23 19:52:01 +01:00
Cedric Nugteren
347f0df32f
Added a (non-working) sample of half precision AXPY in Python
2019-01-22 21:14:43 +01:00
Cedric Nugteren
23b9f655fa
Updated pyclblast README, updated to 1.2.0 for half-precision support
2019-01-22 21:14:02 +01:00
Cedric Nugteren
3937efdcda
Added experimental support for half-precision in pyclblast
2019-01-22 21:13:41 +01:00
Cedric Nugteren
9a9c24e811
Merge pull request #345 from CNugteren/convolution-fixes-and-tuner
Convolution with single kernel
2019-01-19 17:56:05 +01:00
Cedric Nugteren
c42e48068b
Added a few more initial Intel tuning parameters for convgemm
2019-01-19 15:32:35 +01:00
Cedric Nugteren
afcf5dc6eb
Added a check to prevent the stride of matrix C being set to 0 for the strided-batched-GEMM routine
2019-01-05 10:56:35 +01:00
Cedric Nugteren
560f7a40f6
Added convgemm to the CLBlast database, added initial parameters for Skylake GPU
2018-12-31 19:05:34 +01:00
Cedric Nugteren
d929525039
Added support for the convgemm tuner in the tuner database
2018-12-31 18:49:12 +01:00
Cedric Nugteren
153ac06cf2
Added the forgotten batch dimension to the tuner to get correct kernel executions
2018-12-31 13:19:58 +01:00
Koichi Akabe
a8e6f813dd
Fix the xconvgemm tuner
2018-12-18 14:05:25 +09:00
Cedric Nugteren
1f0cd61824
Added first version of a tuner for the ConvGemm direct kernel
2018-12-18 13:59:26 +09:00
Koichi Akabe
301dc280df
Fix xconvgemm kernel and enable ConvGemmMethod::kSingleKernel
2018-12-18 13:56:00 +09:00
Cedric Nugteren
c0e41b87cb
Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
2018-11-30 20:23:26 +01:00
Koichi Akabe
a646d6ca46
Remove unnecessary attribute of inline function
2018-11-19 13:03:50 +09:00
Koichi Akabe
032e3b0cc0
Add kernel_mode option to im2col, col2im, and convgemm functions
2018-11-12 10:12:07 +09:00
Cedric Nugteren
6f67525ea6
Changed col2im to append to the existing im-buffer
2018-11-07 19:45:07 +01:00
Cedric Nugteren
2d32a23293
Added new col2im routine to the documentation
2018-11-01 21:46:19 +01:00
Koichi Akabe
0b3d04f709
Fix col2im implementation
2018-10-30 14:54:55 +09:00
Cedric Nugteren
d45911b61d
Added groundwork for col2im algorithm plus first non-working version of kernel and test
2018-10-23 20:52:25 +02:00
Cedric Nugteren
44b630fc22
Some name changes in im2col code
2018-10-22 22:12:58 +02:00
Cedric Nugteren
9a1454496d
Fixed a bug with the pre-processing and the AXPY kernel
2018-10-17 21:15:53 +02:00
Cedric Nugteren
664a238adf
Fixed a bug in the XaxpyFaster kernel for specific parameters
2018-10-15 20:08:29 +02:00
Cedric Nugteren
634b2bc75c
Merge pull request #319 from CNugteren/convgemm_multi_kernel
First im2col+GEMM implementation of convolution
2018-10-14 17:27:45 +02:00
Cedric Nugteren
46c50cdd7e
Made tuning API more flexible: disregards any extra parameter values
2018-10-13 17:47:29 +02:00
Cedric Nugteren
1736c0cef4
Fixed pre-processor warnings related to the subgroup shuffling
2018-10-10 19:12:42 +02:00
Cedric Nugteren
83ba3d4b7b
Merge branch 'master' into convgemm_multi_kernel
2018-09-16 20:01:18 +02:00
Cedric Nugteren
0f6dd01e51
Fixed an MSVC compilation error due to large strings
2018-09-15 19:58:07 +02:00
Cedric Nugteren
9bedaa752d
Fixed an MSVC compilation error due to large strings
2018-09-15 17:35:26 +02:00
Cedric Nugteren
8ac39fa331
Disabled Intel subgroup shuffling for double-precision
2018-09-15 16:53:09 +02:00
Cedric Nugteren
51cc346751
Fixed issues with GEMMK=1 kernel and the pre-processor
2018-09-15 16:50:34 +02:00
Cedric Nugteren
c788e040f7
Added xCONVGEMM as im2col plus a batched GEMM kernel
2018-09-07 22:02:44 +02:00
Cedric Nugteren
bf43dbb4ee
Made last operation in TRSV and TRSM asynchronous, making the events not null
2018-08-13 22:58:44 +02:00
Cedric Nugteren
3115c15db5
Small refactoring of events in TRSV substitution routine
2018-08-13 22:58:01 +02:00
Cedric Nugteren
9d9f09fce9
Name change of setting to NETLIB_PERSISTENT_OPENCL
2018-08-07 22:41:06 +02:00
Cedric Nugteren
fe639455bd
Added an option to compile the Netlib API with static OpenCL device and context
2018-08-05 21:12:39 +02:00
Cedric Nugteren
2bea758165
Merge pull request #309 from CNugteren/CLBlast-306-omatcopy-conjugate
Fixes bug in conjugate transpose not being executed
2018-08-02 08:35:32 +02:00
Cedric Nugteren
bed10d2731
Merge pull request #308 from CNugteren/CLBlast-301-weird-AMD-Hainan-bug
Added workaround for AMD Southern Islands GPU issue
2018-07-31 21:49:53 +02:00
Cedric Nugteren
503ab74f02
Fixed an issue where complex conjugation was not performed in certain cases when transposing
2018-07-31 21:49:37 +02:00
Cedric Nugteren
bf24421a34
Updated the tuning results for Intel IvyBridge M GT2
2018-07-31 20:49:41 +02:00
Cedric Nugteren
2b76bfee97
Fixed a wrong event issue causing error -57
2018-07-29 22:16:27 +02:00
Cedric Nugteren
2dd539f911
Removed complex numbers support for CONVGEMM
2018-07-29 10:37:14 +02:00
Cedric Nugteren
5903820ba2
Merge branch 'master' into CLBlast-267-convgemm
2018-07-29 10:26:34 +02:00
Cedric Nugteren
bc47e7e7cc
Added print statements to indicate the 4 stages of GEMM tuning
2018-07-28 16:08:22 +02:00
Cedric Nugteren
fa84ac36f2
The tuners now also check for valid local thread configurations and skip invalid ones completely, saving compilation time
2018-07-28 16:01:03 +02:00
Cedric Nugteren
0f0baa561b
Disabled the use of staggered indices on AMD GPUs for the new GEMMK == 1 kernels to improve performance
2018-07-28 14:36:33 +02:00
Cedric Nugteren
03bed8633e
Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
2018-07-27 23:08:49 +02:00
Cedric Nugteren
429ff070f8
Fixed a bug: forgot to initialize the shared pointer for the null kernel
2018-07-27 20:53:24 +02:00
Cedric Nugteren
f84036948b
Renamed AMD SI workaround defines
2018-07-27 20:38:01 +02:00
Cedric Nugteren
e8dea34fce
Added workaround for weird AMD SI Hainan bug
2018-07-25 22:59:36 +02:00
Cedric Nugteren
6a8b9e24f2
Added code to report the average tuning results
2018-07-25 22:28:44 +02:00
Cedric Nugteren
f8fb707fa4
Merge pull request #297 from tyler-utah/master
inline PTX to support subgroup shuffle for Nvidia GPUs
2018-07-23 19:43:03 +02:00
Tyler Sorensen
0772d63498
moved a two-line macro to a single line
2018-07-16 20:12:30 -04:00
Tyler Sorensen
f4e5b1c14c
forgot to add test cases back in, oops
2018-07-14 22:47:39 -04:00
Tyler Sorensen
7709a7308b
Applied feedback from Cedric from first pull request
2018-07-14 19:50:47 -04:00
Cedric Nugteren
f72620f474
Added tuning results for Intel i5-4970S
2018-07-13 21:25:21 +02:00
Cedric Nugteren
3621639b63
Added device-name removal code to handle POCL naming convention
2018-07-13 21:20:27 +02:00
Cedric Nugteren
08b1417956
Added tuning results for GeForce GTX 1070 Ti
2018-07-13 21:07:32 +02:00
Cedric Nugteren
c459582c4f
Added tuning results for HD Graphics 6000 Broadwell GT3
2018-07-13 21:05:43 +02:00
Tyler Sorensen
36093429fd
restored some of the changed tuning files for xgemm
2018-07-11 15:31:51 -04:00
Tyler Sorensen
7f2e98a140
added inline ptx to support shuffle on Nvidia GPUs
2018-07-11 15:12:22 -04:00
Alastair Murray
25661b2d6f
Eliminate a temporary Program object
This was causing a crash for me because the temporary Program's destructor
called clReleaseProgram on its cl_program, and then clBuildProgram was called
on that same cl_program (it belongs to the Program owned by the shared_ptr,
but it is the same underlying cl_program).
2018-07-06 12:58:20 +01:00
Cedric Nugteren
e3eedacbcc
Disabled calls to clReleaseProgram under Windows to avoid segfaults when the OpenCL driver unloads first
2018-06-28 20:35:18 +09:00
Cedric Nugteren
1c9a741470
Merge branch 'master' into CLBlast-267-convgemm
2018-06-03 15:53:27 +02:00
Cedric Nugteren
bd1715aff9
Fixes for CUDA version of CLBlast
2018-06-03 10:41:57 +02:00
Cedric Nugteren
7c3431a72a
Fixes for Apple OpenCL CPU implementation which requires a LWGS of 1 when barriers are present
2018-06-01 20:59:44 +02:00
Cedric Nugteren
5702bff5ad
Added error-checking for half-empty local work group sizes; fixed a minor TRSV global worksize issue
2018-05-31 22:37:06 +02:00
Cedric Nugteren
e609220393
Some potential fixes for error -54 when launching TRSV and TRSM kernels
2018-05-31 20:09:49 +02:00
Cedric Nugteren
ff4d5558a6
Widened Apple OpenCL check, added way to debug too-large-workgroups issue
2018-05-30 22:59:04 +02:00
Cedric Nugteren
a8bb0c9f3c
Added Apple OpenCL TRSV block size override; removed failing old Intel GPU test from README
2018-05-29 21:29:12 +02:00
Cedric Nugteren
6616a59774
Merge pull request #287 from CNugteren/apple-opencl-limitations-fixes
Apple OpenCL limitations for TRSV/TRSM now return not-implemented status
2018-05-27 20:54:27 +02:00
Cedric Nugteren
01d254c0b0
Added a check to return 'NotImplemented' error code in case of systems with < 16 LWGS for TRSV and TRSM
2018-05-27 18:38:47 +02:00
Cedric Nugteren
53198121ac
Made FillMatrix and FillVector functions take a configurable local workgroup size
2018-05-27 12:03:32 +02:00
Cedric Nugteren
c85c385aaf
Added an option in the clients to output timing statistics: minimum, mean, and standard-deviation
2018-05-23 22:36:38 +02:00
Cedric Nugteren
838422fbb1
Further implemented single-kernel approach of convgemm; extended test to capture other parts of the kernel code
2018-05-21 11:47:16 +02:00