CLBlast

Commit Graph

Author	SHA1	Message	Date
Cedric Nugteren	0ee39af5ed	Add tuning results for TITAN RTX	2020-10-10 13:01:12 +02:00
Cedric Nugteren	481d86665f	Add tuning results for Radeon RX Vega	2020-10-10 12:56:28 +02:00
Pradeep Garigipati	dff65e9217	Add a cautionary note in Program::GetIR and mention the fix in CHANGELOG	2020-06-07 21:13:33 +05:30
Pradeep Garigipati	aec71699f8	Fix Program::GetIR to handle programs with multiple devices	2020-06-05 12:00:45 +05:30
Cedric Nugteren	c369cf1a16	Increase display width of the local/global sizes	2020-05-11 20:26:33 +02:00
Cedric Nugteren	4a6c7c37a3	Made sure that the global workgroup size is a multiple of the local size in the tuners	2020-05-10 20:28:23 +02:00
Cedric Nugteren	69a4b4d4b0	Added logging of local/global workgroup sizes when run the tuners	2020-05-10 20:08:28 +02:00
Cedric Nugteren	0870e76fba	Updated PyCLBlast version number	2020-05-10 14:55:03 +02:00
Cedric Nugteren	0b7ce8033c	Added a sample to demonstrate a batched routine	2020-05-10 14:54:50 +02:00
Cedric Nugteren	b94e81af10	Added pyclblast bindings for the 3 batched routines	2020-05-10 12:26:25 +02:00
Cedric Nugteren	bbb2031bf3	Move queue creation out of the tuner loop	2020-05-03 20:30:55 +02:00
Cedric Nugteren	b46853660e	Made it more likely (but no guarantees) for amax/amin to return the first index	2020-03-08 11:26:49 +01:00
Cedric Nugteren	e3ce88154a	Silenced a new OpenCL warning message	2020-03-08 10:14:59 +01:00
Cedric Nugteren	49eb490ee1	Catches all exceptions of the tuners	2020-02-17 22:07:51 +01:00
Tarmo Räntilä	21b66ca761	Reduce TestMatrix calls for xgemmstridedbatched. Replace the looped test by a single one with the offset of the last batch.	2019-12-09 22:17:24 +02:00
Tarmo Räntilä	bf50c4e53e	Reduce TestMatrix calls for xgemmbatched. Replace the looped test by a single one with the maximal found offset.	2019-12-09 22:13:52 +02:00
etomzak	9560193a9e	Fix out-of-bounds read/write in XhadFaster Fix an error in XhadFaster where data would be written beyond the end of zgm. The kernel loop assumed that there was always enough work for each thread to process WPT items, but this was not enforced. It's possible to detect the overflow with the "canary" buffer regions, but for SHAD, kCanarySize must be ~500 (much larger than the normal 127). This commit may improve the performance of XhadFaster, since the kernel was performing 2x work in some cases (once over real data, once over garbage). Courtesy of Codeplay Software Ltd.	2019-09-04 12:55:25 +01:00
Cedric Nugteren	3f9d7bca22	Fixed a bug in the absolute-min index kernel	2019-05-19 14:00:18 +02:00
Cedric Nugteren	af6a9eedd1	Added a function to set the OpenCL kernel standard, either 1.1 or 1.2	2019-05-11 20:39:00 +02:00
Cedric Nugteren	9cbffc9b7c	Changed back to cl_intel_subgroups as suggested	2019-05-08 22:01:56 +02:00
Cedric Nugteren	c5a82f6978	Added a host-code check to make sure the avc_motion_estimation is available	2019-05-07 20:47:50 +02:00
Cedric Nugteren	c6ba86cdc3	Enabled avc_motion_estimation extension for Intel subgroup shuffling	2019-05-07 20:47:31 +02:00
Umar Arshad	cf4907942c	Remove assert for extention not available in macOS The cl_nv_device_attribute_query extention is not available on the Apple platform. This caused failures during debug builds at runtime.	2019-05-03 23:28:07 -04:00
Cedric Nugteren	7084311e45	Added tuning parameters for Tesla P100 16GB	2019-02-09 16:31:48 +01:00
Cedric Nugteren	1035e533cd	Added tuning parameters for Xeon E5-2630 v3 and v4	2019-02-09 16:29:30 +01:00
Cedric Nugteren	e0541c41a1	Added fp32 to fp16 conversion function in Python to make haxpy example work	2019-01-23 19:52:01 +01:00
Cedric Nugteren	347f0df32f	Added a (non-working) sample of half precision AXPY in Python	2019-01-22 21:14:43 +01:00
Cedric Nugteren	23b9f655fa	Updated pyclblast README, updated to 1.2.0 for half-precision support	2019-01-22 21:14:02 +01:00
Cedric Nugteren	3937efdcda	Added experimental support for half-precision in pyclblast	2019-01-22 21:13:41 +01:00
Cedric Nugteren	9a9c24e811	Merge pull request #345 from CNugteren/convolution-fixes-and-tuner Convolution with single kernel	2019-01-19 17:56:05 +01:00
Cedric Nugteren	c42e48068b	Added a few more initial Intel tuning parameters for convgemm	2019-01-19 15:32:35 +01:00
Cedric Nugteren	afcf5dc6eb	Added a check to prevent the stride of matrix C being set to 0 for the strided-batched-GEMM routine	2019-01-05 10:56:35 +01:00
Cedric Nugteren	560f7a40f6	Added convgemm to the CLBlast database, added initial parameters for Skylake GPU	2018-12-31 19:05:34 +01:00
Cedric Nugteren	d929525039	Added support for the convgemm tuner in the tuner database	2018-12-31 18:49:12 +01:00
Cedric Nugteren	153ac06cf2	Added the forgotten batch dimension to the tuner to get correct kernel executions	2018-12-31 13:19:58 +01:00
Koichi Akabe	a8e6f813dd	Fix the xconvgemm tuner	2018-12-18 14:05:25 +09:00
Cedric Nugteren	1f0cd61824	Added first version of a tuner for the ConvGemm direct kernel	2018-12-18 13:59:26 +09:00
Koichi Akabe	301dc280df	Fix xconvgemm kernel and enable ConvGemmMethod::kSingleKernel	2018-12-18 13:56:00 +09:00
Cedric Nugteren	c0e41b87cb	Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel	2018-11-30 20:23:26 +01:00
Koichi Akabe	a646d6ca46	Remove unnecessary attribute of inline function	2018-11-19 13:03:50 +09:00
Koichi Akabe	032e3b0cc0	Add kernel_mode option to im2col, col2im, and convgemm functions	2018-11-12 10:12:07 +09:00
Cedric Nugteren	6f67525ea6	Changed col2im to append to the existing im-buffer	2018-11-07 19:45:07 +01:00
Cedric Nugteren	2d32a23293	Added new col2im routine to the documentation	2018-11-01 21:46:19 +01:00
Koichi Akabe	0b3d04f709	Fix col2im implementation	2018-10-30 14:54:55 +09:00
Cedric Nugteren	d45911b61d	Added groundwork for col2im algorithm plus first non-working version of kernel and test	2018-10-23 20:52:25 +02:00
Cedric Nugteren	44b630fc22	Some name changes in im2col code	2018-10-22 22:12:58 +02:00
Cedric Nugteren	9a1454496d	Fixed a bug with the pre-processing and the AXPY kernel	2018-10-17 21:15:53 +02:00
Cedric Nugteren	664a238adf	Fixed a bug in the XaxpyFaster kernel for specific parameters	2018-10-15 20:08:29 +02:00
Cedric Nugteren	634b2bc75c	Merge pull request #319 from CNugteren/convgemm_multi_kernel First im2col+GEMM implementation of convolution	2018-10-14 17:27:45 +02:00
Cedric Nugteren	46c50cdd7e	Made tuning API more flexible: disregards any extra parameter values	2018-10-13 17:47:29 +02:00
Cedric Nugteren	1736c0cef4	Fixed pre-processor warnings related to the subgroup shuffling	2018-10-10 19:12:42 +02:00
Cedric Nugteren	83ba3d4b7b	Merge branch 'master' into convgemm_multi_kernel	2018-09-16 20:01:18 +02:00
Cedric Nugteren	0f6dd01e51	Fixed an MSVC compilation error due to large strings	2018-09-15 19:58:07 +02:00
Cedric Nugteren	9bedaa752d	Fixed an MSVC compilation error due to large strings	2018-09-15 17:35:26 +02:00
Cedric Nugteren	8ac39fa331	Disabled Intel subgroup shuffling for double-precision	2018-09-15 16:53:09 +02:00
Cedric Nugteren	51cc346751	Fixed issues with GEMMK=1 kernel and the pre-processor	2018-09-15 16:50:34 +02:00
Cedric Nugteren	c788e040f7	Added xCONVGEMM as im2col plus a batched GEMM kernel	2018-09-07 22:02:44 +02:00
Cedric Nugteren	bf43dbb4ee	Made last operation in TRSV and TRSM asynchronous, making the events not null	2018-08-13 22:58:44 +02:00
Cedric Nugteren	3115c15db5	Small refactoring of events in TRSV substitution routine	2018-08-13 22:58:01 +02:00
Cedric Nugteren	9d9f09fce9	Name change of setting to NETLIB_PERSISTENT_OPENCL	2018-08-07 22:41:06 +02:00
Cedric Nugteren	fe639455bd	Added an option to compile the Netlib API with static OpenCL device and context	2018-08-05 21:12:39 +02:00
Cedric Nugteren	2bea758165	Merge pull request #309 from CNugteren/CLBlast-306-omatcopy-conjugate Fixes bug in conjugate transpose not being executed	2018-08-02 08:35:32 +02:00
Cedric Nugteren	bed10d2731	Merge pull request #308 from CNugteren/CLBlast-301-weird-AMD-Hainan-bug Added workaround for AMD Southern Islands GPU issue	2018-07-31 21:49:53 +02:00
Cedric Nugteren	503ab74f02	Fixed issue with not performing complex conjugation under certain cases when transposing	2018-07-31 21:49:37 +02:00
Cedric Nugteren	bf24421a34	Updated the tuning results for Intel IvyBridge M GT2	2018-07-31 20:49:41 +02:00
Cedric Nugteren	2b76bfee97	Fixed a wrong event issue causing error -57	2018-07-29 22:16:27 +02:00
Cedric Nugteren	2dd539f911	Removed complex numbers support for CONVGEMM	2018-07-29 10:37:14 +02:00
Cedric Nugteren	5903820ba2	Merge branch 'master' into CLBlast-267-convgemm	2018-07-29 10:26:34 +02:00
Cedric Nugteren	bc47e7e7cc	Added print statements to indicate the 4 stages of GEMM tuning	2018-07-28 16:08:22 +02:00
Cedric Nugteren	fa84ac36f2	The tuners now also check for valid local thread configurations and skip invalid ones completely, saving compilation time	2018-07-28 16:01:03 +02:00
Cedric Nugteren	0f0baa561b	Disabled the use of staggered indices on AMD GPUs for the new GEMMK == 1 kernels to improve performance	2018-07-28 14:36:33 +02:00
Cedric Nugteren	03bed8633e	Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel	2018-07-27 23:08:49 +02:00
Cedric Nugteren	429ff070f8	Fixed a bug: forgot to initialize the shared pointer for the null kernel	2018-07-27 20:53:24 +02:00
Cedric Nugteren	f84036948b	Renamed AMD SI workaround defines	2018-07-27 20:38:01 +02:00
Cedric Nugteren	e8dea34fce	Added workaround for weird AMD SI Hainan bug	2018-07-25 22:59:36 +02:00
Cedric Nugteren	6a8b9e24f2	Added code to report the average tuning results	2018-07-25 22:28:44 +02:00
Cedric Nugteren	f8fb707fa4	Merge pull request #297 from tyler-utah/master inline PTX to support subgroup shuffle for Nvidia GPUs	2018-07-23 19:43:03 +02:00
Tyler Sorensen	0772d63498	moved a two-line macro to a single line	2018-07-16 20:12:30 -04:00
Tyler Sorensen	f4e5b1c14c	forgot to add test cases back in, oops	2018-07-14 22:47:39 -04:00
Tyler Sorensen	7709a7308b	Applied feedback from Cedric from first pull request	2018-07-14 19:50:47 -04:00
Cedric Nugteren	f72620f474	Added tuning results for Intel i5-4970S	2018-07-13 21:25:21 +02:00
Cedric Nugteren	3621639b63	Added device-name removal code to handle POCL naming convention	2018-07-13 21:20:27 +02:00
Cedric Nugteren	08b1417956	Added tuning results for GeForce GTX 1070 Ti	2018-07-13 21:07:32 +02:00
Cedric Nugteren	c459582c4f	Added tuning results for HD Graphics 6000 Broadwell GT3	2018-07-13 21:05:43 +02:00
Tyler Sorensen	36093429fd	restored some of the changed tuning files for xgemm	2018-07-11 15:31:51 -04:00
Tyler Sorensen	7f2e98a140	added inline ptx to support shuffle on Nvidia GPUs	2018-07-11 15:12:22 -04:00
Alastair Murray	25661b2d6f	Eliminate a temporary Program object This was causing a crash for me because the temporary Program destructor called clReleaseProgram on the cl_program with Program, and then clBuildProgram was called on the same cl_program (belonging to the Program owned by the shared_ptr, but it's the same cl_program).	2018-07-06 12:58:20 +01:00
Cedric Nugteren	e3eedacbcc	Disabled calls to clReleaseProgram under Windows to avoid segfaults when the OpenCL driver unloads first	2018-06-28 20:35:18 +09:00
Cedric Nugteren	1c9a741470	Merge branch 'master' into CLBlast-267-convgemm	2018-06-03 15:53:27 +02:00
Cedric Nugteren	bd1715aff9	Fixes for CUDA version of CLBlast	2018-06-03 10:41:57 +02:00
Cedric Nugteren	7c3431a72a	Fixes for Apple OpenCL CPU implementation which requires a LWGS of 1 when barriers are present	2018-06-01 20:59:44 +02:00
Cedric Nugteren	5702bff5ad	Added error-checking for half-empty local work group sizes; fixed a minor TRSV global worksize issue	2018-05-31 22:37:06 +02:00
Cedric Nugteren	e609220393	Some potential fixes for error -54 when launching TRSV and TRSM kernels	2018-05-31 20:09:49 +02:00
Cedric Nugteren	ff4d5558a6	Widened Apple OpenCL check, added way to debug too-large-workgroups issue	2018-05-30 22:59:04 +02:00
Cedric Nugteren	a8bb0c9f3c	Added Apple OpenCL TRSV block size override; removed failing old Intel GPU test from README	2018-05-29 21:29:12 +02:00
Cedric Nugteren	6616a59774	Merge pull request #287 from CNugteren/apple-opencl-limitations-fixes Apple opencl limitations for TRSV/TRSM now return not-implemented status	2018-05-27 20:54:27 +02:00
Cedric Nugteren	01d254c0b0	Added a check to return 'NotImplemented' error code in case of systems with < 16 LWGS for TSRV and TRSM	2018-05-27 18:38:47 +02:00
Cedric Nugteren	53198121ac	Made FillMatrix and FillVector functions take a configurable local workgroup size	2018-05-27 12:03:32 +02:00
Cedric Nugteren	c85c385aaf	Added an option in the clients to output timing statistics: minimum, mean, and standard-deviation	2018-05-23 22:36:38 +02:00
Cedric Nugteren	838422fbb1	Further implemented single-kernel approach of convgemm; extended test to capture other parts of the kernel code	2018-05-21 11:47:16 +02:00

780 Commits (6e2ab6ee967c4a9b3350c7ce4e7d7b736c9e45f6)