Commit graph

1398 commits

Author SHA1 Message Date
Cedric Nugteren c2c1e5fa95 Merge pull request #312 from CNugteren/CLBlast-311-missing-event-in-trsv-trsm: Missing events in TRSV and TRSM 2018-08-14 22:52:36 +02:00
Cedric Nugteren bf43dbb4ee Made last operation in TRSV and TRSM asynchronous, making the events not null 2018-08-13 22:58:44 +02:00
Cedric Nugteren 3115c15db5 Small refactoring of events in TRSV substitution routine 2018-08-13 22:58:01 +02:00
Cedric Nugteren dd1fa7cc81 Merge pull request #310 from CNugteren/CLBlast-307-netlib-api-static-opencl-vars: Netlib API with optional static OpenCL variables 2018-08-09 21:37:47 +02:00
Cedric Nugteren 9d9f09fce9 Name change of setting to NETLIB_PERSISTENT_OPENCL 2018-08-07 22:41:06 +02:00
Cedric Nugteren fe639455bd Added an option to compile the Netlib API with static OpenCL device and context 2018-08-05 21:12:39 +02:00
Cedric Nugteren 2bea758165 Merge pull request #309 from CNugteren/CLBlast-306-omatcopy-conjugate: Fixes bug in conjugate transpose not being executed 2018-08-02 08:35:32 +02:00
Cedric Nugteren bed10d2731 Merge pull request #308 from CNugteren/CLBlast-301-weird-AMD-Hainan-bug: Added workaround for AMD Southern Islands GPU issue 2018-07-31 21:49:53 +02:00
Cedric Nugteren 503ab74f02 Fixed issue with not performing complex conjugation under certain cases when transposing 2018-07-31 21:49:37 +02:00
Cedric Nugteren 391e5757bd Fixed the tests of OMATCOPY to include proper complex conjugation 2018-07-31 21:44:28 +02:00
Cedric Nugteren 713d0f96b3 Fixed an error reporting issue related to the canary region 2018-07-31 21:24:21 +02:00
Cedric Nugteren d749c4af72 Added note about AMD southern islands GPU issue and the required workaround 2018-07-31 20:55:56 +02:00
Cedric Nugteren 123f38a8ab Added Beignet 1.2.1 requirement to the README for IvyBridge GPUs 2018-07-31 20:52:00 +02:00
Cedric Nugteren bf24421a34 Updated the tuning results for Intel IvyBridge M GT2 2018-07-31 20:49:41 +02:00
Cedric Nugteren 38bdb248cd Merge pull request #305 from CNugteren/CLBlast-303-tuner-check-local-size: Tuners now check for valid local thread size 2018-07-30 21:13:30 +02:00
Cedric Nugteren 2b76bfee97 Fixed a wrong event issue causing error -57 2018-07-29 22:16:27 +02:00
Cedric Nugteren 2dd539f911 Removed complex numbers support for CONVGEMM 2018-07-29 10:37:14 +02:00
Cedric Nugteren 5903820ba2 Merge branch 'master' into CLBlast-267-convgemm 2018-07-29 10:26:34 +02:00
Cedric Nugteren bc47e7e7cc Added print statements to indicate the 4 stages of GEMM tuning 2018-07-28 16:08:22 +02:00
Cedric Nugteren fa84ac36f2 The tuners now also check for valid local thread configurations and skip invalid ones completely, saving compilation time 2018-07-28 16:01:03 +02:00
Cedric Nugteren dda1e567f8 Merge pull request #304 from CNugteren/CLBlast-300-fix-staggered-indices-AMD-GEMMK1: Fix staggered indices on AMD GPUs for GEMMK == 1 kernel 2018-07-28 15:29:16 +02:00
Cedric Nugteren 0f0baa561b Disabled the use of staggered indices on AMD GPUs for the new GEMMK == 1 kernels to improve performance 2018-07-28 14:36:33 +02:00
Cedric Nugteren 03bed8633e Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel 2018-07-27 23:08:49 +02:00
Cedric Nugteren 429ff070f8 Fixed a bug: forgot to initialize the shared pointer for the null kernel 2018-07-27 20:53:24 +02:00
Cedric Nugteren f84036948b Renamed AMD SI workaround defines 2018-07-27 20:38:01 +02:00
Cedric Nugteren e8dea34fce Added workaround for weird AMD SI Hainan bug 2018-07-25 22:59:36 +02:00
Cedric Nugteren 6a8b9e24f2 Added code to report the average tuning results 2018-07-25 22:28:44 +02:00
Cedric Nugteren f8fb707fa4 Merge pull request #297 from tyler-utah/master: inline PTX to support subgroup shuffle for Nvidia GPUs 2018-07-23 19:43:03 +02:00
Tyler Sorensen 0772d63498 moved a two-line macro to a single line 2018-07-16 20:12:30 -04:00
Tyler Sorensen f4e5b1c14c forgot to add test cases back in, oops 2018-07-14 22:47:39 -04:00
Tyler Sorensen 7709a7308b Applied feedback from Cedric from first pull request 2018-07-14 19:50:47 -04:00
Cedric Nugteren db179a1e40 Updated to CLBlast version 1.4.1 2018-07-14 12:29:06 +02:00
Cedric Nugteren f72620f474 Added tuning results for Intel i5-4970S 2018-07-13 21:25:21 +02:00
Cedric Nugteren 3621639b63 Added device-name removal code to handle POCL naming convention 2018-07-13 21:20:27 +02:00
Cedric Nugteren 08b1417956 Added tuning results for GeForce GTX 1070 Ti 2018-07-13 21:07:32 +02:00
Cedric Nugteren c459582c4f Added tuning results for HD Graphics 6000 Broadwell GT3 2018-07-13 21:05:43 +02:00
Tyler Sorensen 36093429fd restored some of the changed tuning files for xgemm 2018-07-11 15:31:51 -04:00
Tyler Sorensen 7f2e98a140 added inline ptx to support shuffle on Nvidia GPUs 2018-07-11 15:12:22 -04:00
Cedric Nugteren 7bae54f61f Updated changelog 2018-07-06 19:39:46 +02:00
Cedric Nugteren 49e06d20ab Merge pull request #296 from alycm/CLBlast-291-eliminate-temporary-program: Eliminate a temporary Program object 2018-07-06 19:35:10 +02:00
Alastair Murray 25661b2d6f Eliminate a temporary Program object
This was causing a crash for me because the temporary Program destructor called
clReleaseProgram on the cl_program with Program, and then clBuildProgram was
called on the same cl_program (belonging to the Program owned by the
shared_ptr, but it's the same cl_program).
2018-07-06 12:58:20 +01:00
Cedric Nugteren 43e3f27254 Merge pull request #295 from CNugteren/CLBlast-292-no-cl-program-release-windows: Disabled calls to clReleaseProgram under Windows 2018-06-28 21:22:12 +09:00
Cedric Nugteren e3eedacbcc Disabled calls to clReleaseProgram under Windows to avoid segfaults when the OpenCL driver unloads first 2018-06-28 20:35:18 +09:00
Cedric Nugteren 1c9a741470 Merge branch 'master' into CLBlast-267-convgemm 2018-06-03 15:53:27 +02:00
Cedric Nugteren 4471b67735 Updated to CLBlast version 1.4.0 2018-06-03 13:18:05 +02:00
Cedric Nugteren fee8df153c Added list of tuners to be run by 'alltuners' target 2018-06-03 10:42:15 +02:00
Cedric Nugteren bd1715aff9 Fixes for CUDA version of CLBlast 2018-06-03 10:41:57 +02:00
Cedric Nugteren 4f594e3931 Added MKL as an alternative for CBLAS for correctness and performance comparisons 2018-06-02 17:57:45 +02:00
Cedric Nugteren 7c3431a72a Fixes for Apple OpenCL CPU implementation which requires a LWGS of 1 when barriers are present 2018-06-01 20:59:44 +02:00
Cedric Nugteren 5702bff5ad Added error-checking for half-empty local work group sizes; fixed a minor TRSV global worksize issue 2018-05-31 22:37:06 +02:00