CLBlast

mirror of https://github.com/CNugteren/CLBlast.git synced 2024-08-21 20:42:28 +02:00

Author	SHA1	Message	Date
Cedric Nugteren	066af4069b	Removed an unused variable from the copy-transpose-pad function	2016-07-16 10:56:37 +02:00
Cedric Nugteren	c87e877bf2	Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel	2016-07-10 20:32:01 +02:00
Cedric Nugteren	57f09178d8	Added tuning results for AMD Oland and for Intel Graphics HD 530	2016-07-10 11:46:44 +02:00
Cedric Nugteren	39e9b1238f	Fixed a bug related to the cache and retrieval of programs based on the OpenCL context	2016-07-10 11:24:36 +02:00
Cedric Nugteren	9caa7ca5b9	Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache	2016-07-08 20:57:58 +02:00
Cedric Nugteren	27854070b4	Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen	2016-07-06 21:50:12 +02:00
Cedric Nugteren	77325b8974	Added an option to the performance clients to do a warm-up run before timing	2016-07-06 21:25:55 +02:00
Cedric Nugteren	9683b50c55	Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp)	2016-07-03 20:30:47 +02:00
Gian-Carlo Pascutto	7424532859	Ensure clGetKernelWorkGroupInfo return value fits. In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo to get the "bytes" amount needed to store the result from CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an "auto result = size_t", which in 32-bit mode is 4 bytes, regardless of the previous return value. The spec describes that it will actually be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure we are in fact passing a cl_ulong. Also adjust all callers to take the changed type into account.	2016-07-02 21:14:36 +02:00
Cedric Nugteren	7cf2f8c268	Fixed some memory leaks related to events not properly cleaned-up	2016-07-02 15:34:55 +02:00
Cedric Nugteren	b330ab0866	Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library	2016-06-30 10:49:17 +02:00
Cedric Nugteren	cd74aaac52	Updated to version 6.0 of the CLCudaAPI header	2016-06-29 19:42:49 +02:00
CNugteren	871b576c06	Made it possible to build the clients and tests on Windows using Visual Studio	2016-06-28 16:38:45 +02:00
Cedric Nugteren	76b20cfe0c	Fixes for the AppVeyor Windows build	2016-06-27 14:44:08 +02:00
Cedric Nugteren	66908ef5cd	Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes)	2016-06-19 14:59:50 +02:00
Cedric Nugteren	61203453aa	Renamed all C++ source files to .cpp to match the .hpp extension better	2016-06-19 13:55:49 +02:00
Cedric Nugteren	f726fbdc9f	Moved all headers into the source tree, changed headers to .hpp extension	2016-06-18 20:20:13 +02:00
Cedric Nugteren	bacb5d2bb2	Clean-up of the routine class, moved RunKernel to the routine/common file	2016-06-18 18:16:14 +02:00
Cedric Nugteren	7b4c0e1cf0	Removed the template from the Routine base-class	2016-06-18 14:56:55 +02:00
Cedric Nugteren	f9947b4d7f	Removed the precision argument from the routines in favor of a single templated function	2016-06-17 14:30:37 +02:00
Cedric Nugteren	536b7fe4bc	Removed the interface to the cache functions from the Routine class, calls them directly now	2016-06-17 13:57:50 +02:00
Cedric Nugteren	98a95c89fc	Moved the RunKernel and PadCopyTransposeMatrix functions out of the Routine class	2016-06-17 12:32:06 +02:00
Cedric Nugteren	afe8852eaa	Moved the test-for-valid-buffers function from the Routine class to separate functions in a separate file	2016-06-17 11:29:07 +02:00
Cedric Nugteren	52ccaf5b25	Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing	2016-06-16 18:07:46 +02:00
Cedric Nugteren	39b7dbc5e3	Added some constness to variables related to the GEMM routines	2016-06-15 12:34:05 +02:00
Cedric Nugteren	b894611ad1	Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately	2016-06-14 18:17:58 +02:00
Cedric Nugteren	3e78a99355	Moved device vendor and type checks to a common header	2016-06-14 14:30:22 +02:00
Cedric Nugteren	6e2017c67d	Added support for FP16 on ARM Mali-T628 (officially not supported)	2016-06-14 14:29:53 +02:00
Cedric Nugteren	6925003e45	Added global memory synchronisation for better cache performance on ARM Mali GPUs	2016-06-08 10:13:37 +02:00
Cedric Nugteren	03182f9d07	Added half-precision tests for the clBLAS reference through conversion to single-precision	2016-05-26 23:36:19 +02:00
Cedric Nugteren	9f87455070	Added level-3 half-precision routines HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM	2016-05-25 13:29:53 +02:00
Cedric Nugteren	ac1575056e	Added proper argument handling and displaying for half-precision data-types	2016-05-24 14:06:16 +02:00
Cedric Nugteren	3e9a07f00a	Added level-2 half-precision routines HGER/HSYR/HSPR/HSYR2/HSPR2	2016-05-22 16:59:14 +02:00
Cedric Nugteren	f0cb3fdc81	Fixed tuning results for half-precision; added first results for the xGER kernels	2016-05-22 16:46:05 +02:00
Cedric Nugteren	c8ff3f143f	Prepared the GER kernels and tuner for half-precision support	2016-05-22 16:18:08 +02:00
Cedric Nugteren	95b828da12	Added level-2 half-precision routines HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV	2016-05-22 15:38:26 +02:00
Cedric Nugteren	b6268d0c22	Added first tuning results for the half-precision xGEMV kernels	2016-05-22 15:29:05 +02:00
Cedric Nugteren	88551b4005	Prepared the GEMV kernels and tuner for half-precision support	2016-05-22 15:22:54 +02:00
Cedric Nugteren	803aaf3070	Added level-1 half-precision routines HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN	2016-05-22 14:47:14 +02:00
Cedric Nugteren	3c9e63c054	Added first tuning results for the half-precision xDOT kernels	2016-05-22 14:43:25 +02:00
Cedric Nugteren	f70ded34f3	Added half-precision support for all level 1 routines	2016-05-22 14:26:19 +02:00
Cedric Nugteren	489c5d76cf	Merged in latest changes from 0.7.1 release	2016-05-18 21:32:56 +02:00
Cedric Nugteren	7a3b695db7	Added half precision tuning results for supporting kernels (pad, copy, transpose, padtranspose)	2016-05-16 12:45:10 +02:00
Cedric Nugteren	af2ac62212	Prepared GEMM and supporting kernels and tuners for half-precision support	2016-05-16 12:37:24 +02:00
Cedric Nugteren	4b6bdd83a2	Added header with conversions from and to half-precision floating-point	2016-05-15 20:13:57 +02:00
Cedric Nugteren	5e1b2e021f	Set kernel arguments for AXPY as constant memory buffers, making it possible to transfer half-precision values as well	2016-05-14 18:06:00 +02:00
Cedric Nugteren	120c31a30f	Initial experimental version of the half-precision HAXPY routine	2016-05-13 20:49:34 +02:00
Cedric Nugteren	f2ba75890c	Initial changes in preparation for half-precision fp16 support	2016-05-12 19:56:21 +02:00
cnugteren	25a25dbd6f	Fixed errors in xAXPY and xSCAL tests on AMD hardware	2016-05-08 17:30:31 +02:00
Cedric Nugteren	a8f109296c	Fixed the calculation of the required buffer sizes in case of subvectors and submatrices	2016-05-02 20:04:55 +02:00
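The 7424532859 fix concerns LocalMemUsage() and CL_KERNEL_LOCAL_MEM_SIZE. The first query reports 8 bytes, because the value is a cl_ulong. The second call then passed that size together with a pointer to a size_t. In 32-bit builds a size_t is only 4 bytes, so the callee wrote 8 bytes into 4 bytes of stack.

The following is a minimal sketch of the corrected pattern, not CLBlast's actual code. It deliberately does not link against OpenCL: `MockGetLocalMemSize` and `mock_ulong` are hypothetical stand-ins for `clGetKernelWorkGroupInfo` and `cl_ulong`, and the returned 4096 is a made-up value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for cl_ulong, which OpenCL defines as a 64-bit unsigned integer.
using mock_ulong = std::uint64_t;

// The value OpenCL uses for CL_INVALID_VALUE.
constexpr int kMockInvalidValue = -30;

// Hypothetical mock of clGetKernelWorkGroupInfo(..., CL_KERNEL_LOCAL_MEM_SIZE, ...).
// Like the real call, it reports the required size via param_value_size_ret and
// trusts param_value_size: it cannot know how large param_value really is.
int MockGetLocalMemSize(std::size_t param_value_size, void* param_value,
                        std::size_t* param_value_size_ret) {
  const mock_ulong local_mem_bytes = 4096;  // made-up value for illustration
  if (param_value_size_ret != nullptr) {
    *param_value_size_ret = sizeof(mock_ulong);
  }
  if (param_value != nullptr) {
    if (param_value_size < sizeof(mock_ulong)) {
      return kMockInvalidValue;  // declared buffer too small: refuse to write
    }
    std::memcpy(param_value, &local_mem_bytes, sizeof(mock_ulong));
  }
  return 0;
}

// Corrected pattern: receive the result in a cl_ulong-sized variable, so the
// 8 bytes written by the callee always fit, even in 32-bit builds.
mock_ulong LocalMemUsage() {
  std::size_t bytes = 0;
  if (MockGetLocalMemSize(0, nullptr, &bytes) != 0) { return 0; }
  mock_ulong result = 0;
  if (bytes != sizeof(result)) { return 0; }  // unexpected size: bail out instead of overflowing
  if (MockGetLocalMemSize(bytes, &result, nullptr) != 0) { return 0; }
  return result;
}
```

With this mock, `LocalMemUsage()` returns 4096. The explicit `bytes != sizeof(result)` check is a defensive extra in this sketch. It guards against writing past the end of `result` if an implementation ever reported a different size.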


171 commits