780 Commits (6e2ab6ee967c4a9b3350c7ce4e7d7b736c9e45f6)

Author SHA1 Message Date
Cedric Nugteren 0ee39af5ed Add tuning results for TITAN RTX 2020-10-10 13:01:12 +02:00
Cedric Nugteren 481d86665f Add tuning results for Radeon RX Vega 2020-10-10 12:56:28 +02:00
Pradeep Garigipati dff65e9217 Add a cautionary note in Program::GetIR and mention the fix in CHANGELOG 2020-06-07 21:13:33 +05:30
Pradeep Garigipati aec71699f8 Fix Program::GetIR to handle programs with multiple devices 2020-06-05 12:00:45 +05:30
Cedric Nugteren c369cf1a16 Increase display width of the local/global sizes 2020-05-11 20:26:33 +02:00
Cedric Nugteren 4a6c7c37a3 Made sure that the global workgroup size is a multiple of the local size in the tuners 2020-05-10 20:28:23 +02:00
Cedric Nugteren 69a4b4d4b0 Added logging of local/global workgroup sizes when run the tuners 2020-05-10 20:08:28 +02:00
Cedric Nugteren 0870e76fba Updated PyCLBlast version number 2020-05-10 14:55:03 +02:00
Cedric Nugteren 0b7ce8033c Added a sample to demonstrate a batched routine 2020-05-10 14:54:50 +02:00
Cedric Nugteren b94e81af10 Added pyclblast bindings for the 3 batched routines 2020-05-10 12:26:25 +02:00
Cedric Nugteren bbb2031bf3 Move queue creation out of the tuner loop 2020-05-03 20:30:55 +02:00
Cedric Nugteren b46853660e Made it more likely (but no guarantees) for amax/amin to return the first index 2020-03-08 11:26:49 +01:00
Cedric Nugteren e3ce88154a Silenced a new OpenCL warning message 2020-03-08 10:14:59 +01:00
Cedric Nugteren 49eb490ee1 Catches all exceptions of the tuners 2020-02-17 22:07:51 +01:00
Tarmo Räntilä 21b66ca761 Reduce TestMatrix calls for xgemmstridedbatched.
Replace the looped test by a single one with the offset of the last batch.
2019-12-09 22:17:24 +02:00
Tarmo Räntilä bf50c4e53e Reduce TestMatrix calls for xgemmbatched.
Replace the looped test by a single one with the maximal found offset.
2019-12-09 22:13:52 +02:00
etomzak 9560193a9e Fix out-of-bounds read/write in XhadFaster
Fix an error in XhadFaster where data would be written beyond the end of zgm.
The kernel loop assumed that there was always enough work for each thread to
process WPT items, but this was not enforced. It's possible to detect the
overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be
~500 (much larger than the normal 127).

This commit may improve the performance of XhadFaster, since the kernel was
performing 2x work in some cases (once over real data, once over garbage).

Courtesy of Codeplay Software Ltd.
2019-09-04 12:55:25 +01:00
Cedric Nugteren 3f9d7bca22 Fixed a bug in the absolute-min index kernel 2019-05-19 14:00:18 +02:00
Cedric Nugteren af6a9eedd1 Added a function to set the OpenCL kernel standard, either 1.1 or 1.2 2019-05-11 20:39:00 +02:00
Cedric Nugteren 9cbffc9b7c Changed back to cl_intel_subgroups as suggested 2019-05-08 22:01:56 +02:00
Cedric Nugteren c5a82f6978 Added a host-code check to make sure the avc_motion_estimation is available 2019-05-07 20:47:50 +02:00
Cedric Nugteren c6ba86cdc3 Enabled avc_motion_estimation extension for Intel subgroup shuffling 2019-05-07 20:47:31 +02:00
Umar Arshad cf4907942c Remove assert for extention not available in macOS
The cl_nv_device_attribute_query extention is not available on the
Apple platform. This caused failures during debug builds at runtime.
2019-05-03 23:28:07 -04:00
Cedric Nugteren 7084311e45 Added tuning parameters for Tesla P100 16GB 2019-02-09 16:31:48 +01:00
Cedric Nugteren 1035e533cd Added tuning parameters for Xeon E5-2630 v3 and v4 2019-02-09 16:29:30 +01:00
Cedric Nugteren e0541c41a1 Added fp32 to fp16 conversion function in Python to make haxpy example work 2019-01-23 19:52:01 +01:00
Cedric Nugteren 347f0df32f Added a (non-working) sample of half precision AXPY in Python 2019-01-22 21:14:43 +01:00
Cedric Nugteren 23b9f655fa Updated pyclblast README, updated to 1.2.0 for half-precision support 2019-01-22 21:14:02 +01:00
Cedric Nugteren 3937efdcda Added experimental support for half-precision in pyclblast 2019-01-22 21:13:41 +01:00
Cedric Nugteren 9a9c24e811 Merge pull request #345 from CNugteren/convolution-fixes-and-tuner
Convolution with single kernel
2019-01-19 17:56:05 +01:00
Cedric Nugteren c42e48068b Added a few more initial Intel tuning parameters for convgemm 2019-01-19 15:32:35 +01:00
Cedric Nugteren afcf5dc6eb Added a check to prevent the stride of matrix C being set to 0 for the strided-batched-GEMM routine 2019-01-05 10:56:35 +01:00
Cedric Nugteren 560f7a40f6 Added convgemm to the CLBlast database, added initial parameters for Skylake GPU 2018-12-31 19:05:34 +01:00
Cedric Nugteren d929525039 Added support for the convgemm tuner in the tuner database 2018-12-31 18:49:12 +01:00
Cedric Nugteren 153ac06cf2 Added the forgotten batch dimension to the tuner to get correct kernel executions 2018-12-31 13:19:58 +01:00
Koichi Akabe a8e6f813dd Fix the xconvgemm tuner 2018-12-18 14:05:25 +09:00
Cedric Nugteren 1f0cd61824 Added first version of a tuner for the ConvGemm direct kernel 2018-12-18 13:59:26 +09:00
Koichi Akabe 301dc280df Fix xconvgemm kernel and enable ConvGemmMethod::kSingleKernel 2018-12-18 13:56:00 +09:00
Cedric Nugteren c0e41b87cb Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel 2018-11-30 20:23:26 +01:00
Koichi Akabe a646d6ca46 Remove unnecessary attribute of inline function 2018-11-19 13:03:50 +09:00
Koichi Akabe 032e3b0cc0 Add kernel_mode option to im2col, col2im, and convgemm functions 2018-11-12 10:12:07 +09:00
Cedric Nugteren 6f67525ea6 Changed col2im to append to the existing im-buffer 2018-11-07 19:45:07 +01:00
Cedric Nugteren 2d32a23293 Added new col2im routine to the documentation 2018-11-01 21:46:19 +01:00
Koichi Akabe 0b3d04f709 Fix col2im implementation 2018-10-30 14:54:55 +09:00
Cedric Nugteren d45911b61d Added groundwork for col2im algorithm plus first non-working version of kernel and test 2018-10-23 20:52:25 +02:00
Cedric Nugteren 44b630fc22 Some name changes in im2col code 2018-10-22 22:12:58 +02:00
Cedric Nugteren 9a1454496d Fixed a bug with the pre-processing and the AXPY kernel 2018-10-17 21:15:53 +02:00
Cedric Nugteren 664a238adf Fixed a bug in the XaxpyFaster kernel for specific parameters 2018-10-15 20:08:29 +02:00
Cedric Nugteren 634b2bc75c Merge pull request #319 from CNugteren/convgemm_multi_kernel
First im2col+GEMM implementation of convolution
2018-10-14 17:27:45 +02:00
Cedric Nugteren 46c50cdd7e Made tuning API more flexible: disregards any extra parameter values 2018-10-13 17:47:29 +02:00
Cedric Nugteren 1736c0cef4 Fixed pre-processor warnings related to the subgroup shuffling 2018-10-10 19:12:42 +02:00
Cedric Nugteren 83ba3d4b7b Merge branch 'master' into convgemm_multi_kernel 2018-09-16 20:01:18 +02:00
Cedric Nugteren 0f6dd01e51 Fixed an MSVC compilation error due to large strings 2018-09-15 19:58:07 +02:00
Cedric Nugteren 9bedaa752d Fixed an MSVC compilation error due to large strings 2018-09-15 17:35:26 +02:00
Cedric Nugteren 8ac39fa331 Disabled Intel subgroup shuffling for double-precision 2018-09-15 16:53:09 +02:00
Cedric Nugteren 51cc346751 Fixed issues with GEMMK=1 kernel and the pre-processor 2018-09-15 16:50:34 +02:00
Cedric Nugteren c788e040f7 Added xCONVGEMM as im2col plus a batched GEMM kernel 2018-09-07 22:02:44 +02:00
Cedric Nugteren bf43dbb4ee Made last operation in TRSV and TRSM asynchronous, making the events not null 2018-08-13 22:58:44 +02:00
Cedric Nugteren 3115c15db5 Small refactoring of events in TRSV substitution routine 2018-08-13 22:58:01 +02:00
Cedric Nugteren 9d9f09fce9 Name change of setting to NETLIB_PERSISTENT_OPENCL 2018-08-07 22:41:06 +02:00
Cedric Nugteren fe639455bd Added an option to compile the Netlib API with static OpenCL device and context 2018-08-05 21:12:39 +02:00
Cedric Nugteren 2bea758165 Merge pull request #309 from CNugteren/CLBlast-306-omatcopy-conjugate
Fixes bug in conjugate transpose not being executed
2018-08-02 08:35:32 +02:00
Cedric Nugteren bed10d2731 Merge pull request #308 from CNugteren/CLBlast-301-weird-AMD-Hainan-bug
Added workaround for AMD Southern Islands GPU issue
2018-07-31 21:49:53 +02:00
Cedric Nugteren 503ab74f02 Fixed issue with not performing complex conjugation under certain cases when transposing 2018-07-31 21:49:37 +02:00
Cedric Nugteren bf24421a34 Updated the tuning results for Intel IvyBridge M GT2 2018-07-31 20:49:41 +02:00
Cedric Nugteren 2b76bfee97 Fixed a wrong event issue causing error -57 2018-07-29 22:16:27 +02:00
Cedric Nugteren 2dd539f911 Removed complex numbers support for CONVGEMM 2018-07-29 10:37:14 +02:00
Cedric Nugteren 5903820ba2 Merge branch 'master' into CLBlast-267-convgemm 2018-07-29 10:26:34 +02:00
Cedric Nugteren bc47e7e7cc Added print statements to indicate the 4 stages of GEMM tuning 2018-07-28 16:08:22 +02:00
Cedric Nugteren fa84ac36f2 The tuners now also check for valid local thread configurations and skip invalid ones completely, saving compilation time 2018-07-28 16:01:03 +02:00
Cedric Nugteren 0f0baa561b Disabled the use of staggered indices on AMD GPUs for the new GEMMK == 1 kernels to improve performance 2018-07-28 14:36:33 +02:00
Cedric Nugteren 03bed8633e Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel 2018-07-27 23:08:49 +02:00
Cedric Nugteren 429ff070f8 Fixed a bug: forgot to initialize the shared pointer for the null kernel 2018-07-27 20:53:24 +02:00
Cedric Nugteren f84036948b Renamed AMD SI workaround defines 2018-07-27 20:38:01 +02:00
Cedric Nugteren e8dea34fce Added workaround for weird AMD SI Hainan bug 2018-07-25 22:59:36 +02:00
Cedric Nugteren 6a8b9e24f2 Added code to report the average tuning results 2018-07-25 22:28:44 +02:00
Cedric Nugteren f8fb707fa4 Merge pull request #297 from tyler-utah/master
inline PTX to support subgroup shuffle for Nvidia GPUs
2018-07-23 19:43:03 +02:00
Tyler Sorensen 0772d63498 moved a two-line macro to a single line 2018-07-16 20:12:30 -04:00
Tyler Sorensen f4e5b1c14c forgot to add test cases back in, oops 2018-07-14 22:47:39 -04:00
Tyler Sorensen 7709a7308b Applied feedback from Cedric from first pull request 2018-07-14 19:50:47 -04:00
Cedric Nugteren f72620f474 Added tuning results for Intel i5-4970S 2018-07-13 21:25:21 +02:00
Cedric Nugteren 3621639b63 Added device-name removal code to handle POCL naming convention 2018-07-13 21:20:27 +02:00
Cedric Nugteren 08b1417956 Added tuning results for GeForce GTX 1070 Ti 2018-07-13 21:07:32 +02:00
Cedric Nugteren c459582c4f Added tuning results for HD Graphics 6000 Broadwell GT3 2018-07-13 21:05:43 +02:00
Tyler Sorensen 36093429fd restored some of the changed tuning files for xgemm 2018-07-11 15:31:51 -04:00
Tyler Sorensen 7f2e98a140 added inline ptx to support shuffle on Nvidia GPUs 2018-07-11 15:12:22 -04:00
Alastair Murray 25661b2d6f Eliminate a temporary Program object
This was causing a crash for me because the temporary Program destructor called
clReleaseProgram on the cl_program with Program, and then clBuildProgram was
called on the same cl_program (belonging to the Program owned by the
shared_ptr, but it's the same cl_program).
2018-07-06 12:58:20 +01:00
Cedric Nugteren e3eedacbcc Disabled calls to clReleaseProgram under Windows to avoid segfaults when the OpenCL driver unloads first 2018-06-28 20:35:18 +09:00
Cedric Nugteren 1c9a741470 Merge branch 'master' into CLBlast-267-convgemm 2018-06-03 15:53:27 +02:00
Cedric Nugteren bd1715aff9 Fixes for CUDA version of CLBlast 2018-06-03 10:41:57 +02:00
Cedric Nugteren 7c3431a72a Fixes for Apple OpenCL CPU implementation which requires a LWGS of 1 when barriers are present 2018-06-01 20:59:44 +02:00
Cedric Nugteren 5702bff5ad Added error-checking for half-empty local work group sizes; fixed a minor TRSV global worksize issue 2018-05-31 22:37:06 +02:00
Cedric Nugteren e609220393 Some potential fixes for error -54 when launching TRSV and TRSM kernels 2018-05-31 20:09:49 +02:00
Cedric Nugteren ff4d5558a6 Widened Apple OpenCL check, added way to debug too-large-workgroups issue 2018-05-30 22:59:04 +02:00
Cedric Nugteren a8bb0c9f3c Added Apple OpenCL TRSV block size override; removed failing old Intel GPU test from README 2018-05-29 21:29:12 +02:00
Cedric Nugteren 6616a59774 Merge pull request #287 from CNugteren/apple-opencl-limitations-fixes
Apple opencl limitations for TRSV/TRSM now return not-implemented status
2018-05-27 20:54:27 +02:00
Cedric Nugteren 01d254c0b0 Added a check to return 'NotImplemented' error code in case of systems with < 16 LWGS for TSRV and TRSM 2018-05-27 18:38:47 +02:00
Cedric Nugteren 53198121ac Made FillMatrix and FillVector functions take a configurable local workgroup size 2018-05-27 12:03:32 +02:00
Cedric Nugteren c85c385aaf Added an option in the clients to output timing statistics: minimum, mean, and standard-deviation 2018-05-23 22:36:38 +02:00
Cedric Nugteren 838422fbb1 Further implemented single-kernel approach of convgemm; extended test to capture other parts of the kernel code 2018-05-21 11:47:16 +02:00