CLBlast

Commit Graph

Author	SHA1	Message	Date
Cedric Nugteren	ebce82e650	Merge pull request #222 from CNugteren/override_params_from_json Override params in clients from tuner JSON	2017-11-25 09:48:27 +01:00
Cedric Nugteren	abb4d5ab32	Added tuning results for ARM Mali T760 GPU	2017-11-24 21:16:54 +01:00
Cedric Nugteren	9527c89c30	Made parameter override in the clients a command-line argument and added support for multi-kernel routines	2017-11-22 20:53:20 +01:00
Cedric Nugteren	0f080bbc6e	Potentially fixed an MSVC 2013 issue with a copy-constructor not being generated	2017-11-20 20:54:18 +01:00
Cedric Nugteren	e0f3484084	Fixes some displaying issues in the GEMM routine tuner	2017-11-20 20:29:52 +01:00
Cedric Nugteren	5467c0cac5	Fixed a variety of warnings and an error for MSVC2013 compilation	2017-11-19 21:09:24 +01:00
Cedric Nugteren	4e0d08c3bc	Added compilation timing and better compilation error reporting	2017-11-19 16:58:13 +01:00
Cedric Nugteren	a3a8b44f59	Some fixes for the new auto-tuner to be compatible with the Python scripts	2017-11-19 16:31:08 +01:00
Cedric Nugteren	76d2b7f0b6	Revived the GEMM routine tuner; minor formatting changes	2017-11-19 12:59:52 +01:00
Cedric Nugteren	7a54494577	Modified the kernel tuners to use the newly integrated auto-tuner	2017-11-19 12:58:41 +01:00
Cedric Nugteren	8a5a5e031e	Moved some tuning functions from .hpp to .cpp	2017-11-17 20:58:36 +01:00
Cedric Nugteren	f94d498a37	Moved compilation function to separate file; removed dependency of tuners of the CLBlast library	2017-11-17 20:57:46 +01:00
Cedric Nugteren	2b8ad70b63	Added printing of the best parameters for the new tuner	2017-11-16 21:18:29 +01:00
Cedric Nugteren	1b2b46f2f0	Added first version of integrated and re-written auto-tuner	2017-11-15 22:49:35 +01:00
Cedric Nugteren	0cd78bb6f9	Added kernel timing functionality to the utilities	2017-11-15 22:47:06 +01:00
Cedric Nugteren	b337bffbaf	Added exception handle with catch-all	2017-11-15 22:44:44 +01:00
Cedric Nugteren	03ebf14b97	Made the exception dispatch function optionally silent	2017-11-13 21:11:31 +01:00
Cedric Nugteren	4bac1287f2	Moved square-difference utility function for use in the tuners	2017-11-13 21:10:44 +01:00
Cedric Nugteren	677afd3b96	Factored out the creation of the OpenCL header and the program compilation	2017-11-11 16:14:43 +01:00
Cedric Nugteren	c41d219ea4	Added tuning results for the GeForce GTX750Ti	2017-11-09 21:19:21 +01:00
Cedric Nugteren	b18cc9d3f1	Merge pull request #212 from CNugteren/kernel_selection_tuner GEMM kernel selection tuner	2017-11-07 22:20:13 +01:00
Cedric Nugteren	3ec0be6fb8	Added various GEMM routine tuning results	2017-11-07 21:34:54 +01:00
Cedric Nugteren	33ac2b0175	Improved the way the database defaults are computed	2017-11-06 21:59:45 +01:00
Cedric Nugteren	34a33b54cf	Changed GEMM routine tuner's scoring to use L2 measure instead for better averaging	2017-11-06 20:50:36 +01:00
Cedric Nugteren	9b0a435fb0	Integrated the GEMM routine tuner for kernel selection; added first tuning results	2017-11-02 21:47:14 +01:00
Cedric Nugteren	73272ab97d	Fixed a bug in database compression/decompression	2017-11-02 21:19:18 +01:00
Cedric Nugteren	5c90577dfd	Added collecting and printing of scores for the kernel-selection tuner	2017-10-30 20:39:21 +01:00
Cedric Nugteren	ac5a58cfe5	Added platform ID to the binary program cache to prevent issues with multi-platform systems	2017-10-29 20:01:30 +01:00
Cedric Nugteren	319762f150	Added Android support using the GNU C++ STL library and the GCC toolchain	2017-10-29 12:07:07 +01:00
Cedric Nugteren	12b08ae491	Merge branch 'master' into android_support	2017-10-28 17:32:37 +02:00
Cedric Nugteren	334a26eb12	Added initial version of a GEMM kernel selection tuner	2017-10-28 17:30:29 +02:00
Cedric Nugteren	bd57dfa435	Moved timing function to a separate file	2017-10-28 14:12:05 +02:00
Cedric Nugteren	fa6e5e67f5	Fixed a bug when using the matrix A-offset argument for the TRSM routine	2017-10-27 22:12:30 +02:00
Cedric Nugteren	449577cf07	Reduced TRSM block-size for better numerical stability	2017-10-27 22:07:43 +02:00
Cedric Nugteren	44f7fa628a	Added GEMV synchronisation for the TRSV routine: similar bug as in TRSM	2017-10-27 22:01:15 +02:00
Cedric Nugteren	d49aae236e	Fixed a bug in TRSM routine due to missing event synchronisations after GEMM calls	2017-10-25 20:35:39 +02:00
Cedric Nugteren	472f90501c	Added tuning parameters for GeForce GTX 580, GeForce GTX 1080Ti, and Core i5-4570	2017-10-20 18:06:12 +02:00
Cedric Nugteren	363568787e	Moved CUmodule code from Kernel to Program class to not require re-compilation every time	2017-10-18 18:17:30 +02:00
Cedric Nugteren	9d879c949a	Fix an incompatibility with CUDA's FP16 definition	2017-10-17 20:29:23 +02:00
Cedric Nugteren	b1270f04b8	Made buffers of batched routines read/write (was: read-only)	2017-10-17 19:56:47 +02:00
Cedric Nugteren	f349731d54	CUDA kernel compilation fixes	2017-10-17 19:53:09 +02:00
Cedric Nugteren	0719f14486	Made all CUDA kernel launches synchronous; removed exception raising	2017-10-16 21:54:42 +02:00
Cedric Nugteren	d62823f067	Added a missing OpenCL-to-CUDA function translation	2017-10-15 19:53:52 +02:00
Cedric Nugteren	7663cba234	Fixes for the CUDA API: first tests pass and the client runs	2017-10-15 17:43:20 +02:00
Cedric Nugteren	71049e8d39	Added the SM-compute-arch version to the nv compile options	2017-10-15 17:41:44 +02:00
Cedric Nugteren	7408da174c	Various fixes to make the first CUDA examples work	2017-10-15 12:17:35 +02:00
Cedric Nugteren	55a802c63d	Fixed a kernel/attribute order bug in the direct GEMM kernels	2017-10-14 17:21:34 +02:00
Cedric Nugteren	b06bc01da9	Make local memory pointers a define in OpenCL; some fixes to the recently changed transpose kernel code	2017-10-14 17:13:54 +02:00
Cedric Nugteren	d9456306e0	Made transpose kernel struct init proper according to the C standard	2017-10-14 16:48:06 +02:00
Cedric Nugteren	313fc796b2	Fixed several (not all) CUDA kernel compilation issues	2017-10-14 16:01:12 +02:00
Cedric Nugteren	54d0c440ce	Various fixes to make the host code and sample compile with the CUDA API	2017-10-14 11:43:57 +02:00
Cedric Nugteren	2d7b648a24	Added OpenCL to CUDA translation header for the kernels	2017-10-14 10:49:25 +02:00
Cedric Nugteren	cc5b475425	CUDA API now takes context and device in instead of stream	2017-10-12 12:20:43 +02:00
Cedric Nugteren	b901809345	Added first (untested) version of a CUDA API	2017-10-11 23:16:57 +02:00
Cedric Nugteren	44246053a5	Removed include of clpp11.hpp in places other than utilities.hpp	2017-10-09 19:41:40 +02:00
Cedric Nugteren	df3c9f4a8a	Moved non-routine-specific API functions and includes to separate files	2017-10-08 21:52:02 +02:00
Cedric Nugteren	3598762029	Moved the remaining OpenCL specific host code to the clpp11.h header where it belongs	2017-10-08 10:29:47 +02:00
Cedric Nugteren	6d3e1212f0	Synchronizes clpp11.h with CLCudaAPI 9.0	2017-10-07 18:43:29 +02:00
Cedric Nugteren	86b80cdc98	Fixed a small typo	2017-10-07 18:39:32 +02:00
Cedric Nugteren	375193fe4e	Gemm in-direct implementation now uses only 1 larger instead of max 3 optional temporary buffers	2017-10-03 21:55:21 +02:00
Cedric Nugteren	6b226028d5	Allow OverrideParameters function to work before a kernel was first used	2017-10-01 20:32:39 +02:00
Cedric Nugteren	1009303717	Merge branch 'additional_tuners'	2017-09-30 21:04:32 +02:00
Cedric Nugteren	c151ab1325	Refactored the tuning architecture: less duplicate now; more defaults	2017-09-30 20:26:26 +02:00
Cedric Nugteren	00b5771477	Added Android header for compilation with gnustl STL	2017-09-26 21:20:01 +02:00
Cedric Nugteren	21af690472	Added missing headers	2017-09-26 21:17:55 +02:00
Cedric Nugteren	ed980a1df1	Updated database override function to work with the new database storage format	2017-09-24 15:44:14 +02:00
Cedric Nugteren	255f09843c	Made program and binary databases dependent on the routine parameters on top of the name	2017-09-23 20:40:38 +02:00
Cedric Nugteren	890281f3e8	Made database-caching no longer dependent on device name but on device/platform IDs	2017-09-23 17:50:44 +02:00
Cedric Nugteren	ae1eeb4d1f	Fixed type conversion warnings under MSVC 2013	2017-09-19 19:44:34 +02:00
Cedric Nugteren	1d2ee29cb9	Fixed compilation issues of the database for MSVC 2013	2017-09-19 19:44:05 +02:00
Cedric Nugteren	a23cd8d13a	Updated README with proper AMD device names; fixed device look-up for names of length 50+	2017-09-16 21:26:38 +02:00
Cedric Nugteren	0802e3d84c	Added tuning results for Intel Core i7 6770HQ	2017-09-16 21:19:06 +02:00
Cedric Nugteren	bcf39eb79a	Fixed a compilation error and warning under MacOS	2017-09-16 18:34:11 +02:00
Cedric Nugteren	163474e171	Fixed an issue with the NVIDIA compute capability not being retrieved properly	2017-09-16 18:25:23 +02:00
Cedric Nugteren	4e317f5e85	Improved compilation time of the tuner database	2017-09-16 18:02:37 +02:00
Cedric Nugteren	c21878ecce	Added a guard against missing AMD and NVIDIA extensions	2017-09-14 21:58:08 +02:00
Cedric Nugteren	0d13d814c2	Added architecture layer in the tuning database for better performance on unseen devices	2017-09-14 21:27:33 +02:00
Cedric Nugteren	76382ff6c1	Added the new vendor-architecture-name hierarchy to the tuners as well	2017-09-10 16:34:54 +02:00
Cedric Nugteren	91ea7fcde2	Introduced the notion of a device-architecture for the database and added device and architecture name mappings	2017-09-08 21:09:05 +02:00
Cedric Nugteren	20da5e33a8	Split the database files over multiple directories and files; first step towards separate compilation	2017-09-06 21:50:42 +02:00
Cedric Nugteren	8905da259d	Fixed a modulo and division issue manifesting on Apple OpenCL for im2col	2017-09-05 18:49:23 +02:00
Cedric Nugteren	28462aa050	Removed an assumption that the 'default' tuning parameters have to be stored last; this is no longer needed	2017-09-04 17:39:57 +02:00
Cedric Nugteren	297159d5b9	Fixed a bug in im2col: process only valid channel IDs	2017-08-31 21:58:12 +02:00
Cedric Nugteren	6194d43efb	Fixed a bug in im2col confusing first and second workgroup size; made im2col kernel 2d instead of 3d	2017-08-31 20:34:10 +02:00
Cedric Nugteren	54e160cd88	Fixed some things in the tuner: bugs, style, and defaults to random search	2017-08-31 20:28:01 +02:00
Cedric Nugteren	161fd8514d	Merge branch 'master' into im_to_col	2017-08-24 21:15:14 +02:00
Cedric Nugteren	4d9d03ba51	Completed im2col implementation	2017-08-24 21:11:12 +02:00
Cedric Nugteren	a8c26594d9	Made the im2col client properly handle the arguments	2017-08-23 19:54:09 +02:00
Cedric Nugteren	da28cc5e93	Minor updates after merging in the PSO addition to the tuners	2017-08-21 20:14:02 +02:00
Cedric Nugteren	e5eb6b1d3a	Merge pull request #173 from mcian/PSO_params Add PSO parameters support and search strategy selection from command…	2017-08-21 20:06:29 +02:00
mcian	dfd332524a	Remove multistrategy and related functions	2017-08-21 14:09:11 +02:00
Cedric Nugteren	803ca781f9	First version of im2col kernel, unoptimized but working	2017-08-19 18:25:13 +02:00
Cedric Nugteren	777681dcbd	Merge branch 'master' into im_to_col	2017-08-12 20:50:00 +02:00
Cedric Nugteren	0a63621579	Moved functions from the header to the .cpp file to prevent compiling the same code multiple times	2017-08-12 15:59:14 +02:00
Cedric Nugteren	844e68853e	Moved some utility functions to a test-specific utility compilation-unit	2017-08-12 15:38:17 +02:00
mcian	4adee60884	Revert the xgemm strategy to default. If the user wants to use multistrategy, they can simply call the function TestHeuristic from the main	2017-08-09 16:58:46 +02:00
mcian	0b4aa109f8	Use cltune::SearchMethod enum instead of int values	2017-08-09 16:05:25 +02:00
mcian	99afdcd908	Restore direct GEMM to previous version	2017-07-31 14:06:23 +02:00
Cedric Nugteren	18d832e149	Added tuning results for the Qualcomm Adreno 330 GPU	2017-07-30 18:18:02 +02:00
Cedric Nugteren	0ea16a0e63	Minor optimization for the direct GEMM kernel: don't ceil m and n unnecessarily high	2017-07-25 20:53:12 +02:00
Cedric Nugteren	55861c40ff	Merge branch 'relax_gemmbatched_ld_requirements'	2017-07-23 21:04:17 +02:00
mcian	473e814718	Code refactoring	2017-07-23 14:48:13 +02:00
Cedric Nugteren	2d52f9b1d3	Merge pull request #176 from CNugteren/inline_keyword_optional Made the inline keyword in kernels optional	2017-07-22 10:44:08 +02:00
mcian	a36283aaec	Add new threshold for ARM	2017-07-17 12:20:46 +02:00
mcian	8131e68664	Add PSO parameters support and search strategy selection from command line	2017-07-17 12:00:25 +02:00
Cedric Nugteren	97bcf77d4b	First step towards supporting im2col in the test infrastructure	2017-07-16 22:33:49 +02:00
Cedric Nugteren	f77b48692b	Relaxed requirement on a_ld and b_ld for batched GEMM	2017-07-12 21:53:39 +02:00
Cedric Nugteren	442c31dd50	Made the inline keyword in kernels optional currently only enabled for NVIDIA and ARM GPUs	2017-07-08 17:12:16 +02:00
Cedric Nugteren	84ec50e29d	Added interface and stubs for the im2col routine	2017-07-02 12:10:22 +02:00
Cedric Nugteren	4cf516cfec	Fixed an if-statement in the direct GEMM kernel causing a bug with specific sets of input parameters	2017-06-30 21:57:41 +02:00
Cedric Nugteren	1a8ed48a35	Fixed some Clang and MSVC warnings	2017-06-25 11:50:36 +02:00
Cedric Nugteren	615a7fdc81	Fixes some compilation issues related to the database structure change	2017-06-21 23:07:47 +02:00
Cedric Nugteren	e44feb8576	Changed the structure of the database to reduce compilation time and save memory	2017-06-20 21:19:26 +02:00
Cedric Nugteren	48f2682eb7	Added tuning results for the Core i7-920 CPU	2017-06-18 20:53:59 +02:00
Cedric Nugteren	3070b502b5	Fixed an overflow bug on 32-bit systems when choosing a GEMM kernel	2017-06-18 20:51:11 +02:00
Cedric Nugteren	33ed1e5a06	Added tuning results for GeForce GT 650M (thanks to bzcheeseman)	2017-06-01 22:52:08 +02:00
Cedric Nugteren	f57e209aab	Merge pull request #158 from CNugteren/msvc_compilation_fixes MSVC compilation fixes	2017-05-27 17:53:30 +02:00
Kirill Mavreshko	64ba590279	Fixed comment describing the order of program cache fields	2017-05-27 10:30:09 +05:00
Cedric Nugteren	f7a16d427c	Fixed a compilation issue under MSVC 2013	2017-05-26 22:10:56 +02:00
Kirill Mavreshko	628e1e8cce	Fixes inability to run GEMM on multiple identical GPUs (issue #155)	2017-05-26 15:04:19 +05:00
Cedric Nugteren	8400ee3a09	Fixed a TRSM issue caused by incorrect block size calculation	2017-05-15 22:04:55 +02:00
Cedric Nugteren	512b83dbad	Fixed a missing synchronization barrier in the invert kernel; fixes TRSM tests	2017-05-14 20:27:35 +02:00
Cedric Nugteren	f151e56daa	Added the IxAMIN routines: absolute minimum version of IxAMAX	2017-05-12 20:01:33 -07:00
Cedric Nugteren	86e8df60f1	Fixed a bug in the TRSM routine; tests now pass	2017-05-12 17:43:56 -07:00
Cedric Nugteren	71933c3411	Added tuning results for the AMD Radeon Fiji GPU	2017-05-11 22:53:52 -07:00
Cedric Nugteren	1df28a15fc	Re-added random tuning for GEMM after accidental removal	2017-05-11 22:12:38 -07:00
Cedric Nugteren	1c33af6eab	Re-added Titan X (Pascal) tuning results based on more averaging when tuning	2017-04-23 17:58:56 +02:00
Cedric Nugteren	3eea8dc998	Increased the default number of runs for the tuner from 2 up to 10 for fast kernels	2017-04-22 13:56:07 +02:00
Cedric Nugteren	192199c9cb	Fixed the direct vs indirect setting for NVIDIA GPUs	2017-04-22 13:43:27 +02:00
Cedric Nugteren	e41d204856	Increased the default number of runs for GEMV tuning; updated GEMV tuning results for Iris Pro	2017-04-21 22:12:20 +02:00
Cedric Nugteren	d7314d4f8e	Tuned the direct versus indirect GEMM kernel trade-off point for NVIDIA GPUs	2017-04-20 22:19:09 +02:00
Cedric Nugteren	409a5a2ad0	Fixed a namespace clash with CUDA FP16 for the half-datatype	2017-04-17 16:47:15 +02:00
Cedric Nugteren	2673f50518	Merge branch 'development' into benchmarking	2017-04-16 19:41:14 +02:00
Cedric Nugteren	10205d773e	Added a new Xaxpy kernel in between the regular and fast version in	2017-04-14 20:16:10 +02:00
Cedric Nugteren	f7f8ec644f	Fixed CUDA malloc and cuBLAS handles: cuBLAS as a performance-reference now works	2017-04-13 21:31:27 +02:00
Cedric Nugteren	22b3ea9256	Merge branch 'development' into cublas_reference. Conflicts: scripts/generator/generator.py	2017-04-10 20:11:45 +02:00
Cedric Nugteren	7374c37e2e	Fixed a compilation issue under MSVC and GCC	2017-04-10 08:38:24 +02:00
Cedric Nugteren	2d45c37676	Removed const-vector-of-const-objects from the database class to remain according to the C++11 standard	2017-04-10 07:40:27 +02:00
Cedric Nugteren	fb6c78ea07	Added a special override database for the Apple CPU implementation on OS X: this makes the test work, it does not focus on good performance	2017-04-07 07:37:30 +02:00
Cedric Nugteren	d28ee082b0	Uses float2 and double2 for base complex data-types instead of a custom struct; fixes bug on Apple OpenCL	2017-04-07 07:35:15 +02:00
Cedric Nugteren	ce369702d8	Added some missing const-ness	2017-04-07 07:34:32 +02:00
Cedric Nugteren	b24d364743	Laid the groundwork for cuBLAS comparisons in the clients	2017-04-02 18:06:15 +02:00
Cedric Nugteren	b84d2296b8	Separated host-device and device-host memory copies from execution of the CBLAS reference code; for fair timing and code de-duplication	2017-04-01 13:36:24 +02:00
Cedric Nugteren	c27d2f0c1e	Added an (optional) non-direct implementation of the batched GEMM routine	2017-03-19 16:04:04 +01:00
Cedric Nugteren	2fd04dae83	Added batched versions of the pad/copy/transpose kernels	2017-03-19 15:57:44 +01:00
Cedric Nugteren	11bb30e72b	Added the possibility to tune batched kernels	2017-03-14 20:29:51 +01:00
Cedric Nugteren	7b8f8fce68	Added initial naive version of the batched GEMM routine based on the direct GEMM kernel	2017-03-11 16:02:45 +01:00
Cedric Nugteren	49e04c7fce	Added API and test infrastructure for the batched GEMM routine	2017-03-10 21:24:35 +01:00
Cedric Nugteren	d754586b49	Added proper testing of the alpha parameter; finalized the batched AXPY implementation	2017-03-10 20:49:59 +01:00
Cedric Nugteren	92a657290a	Fixed a small compilation bug for MSVC related to a floating-point constant	2017-03-10 20:30:10 +01:00
Cedric Nugteren	878d93e7dc	Implemented a batched version of the AXPY kernel	2017-03-08 20:36:35 +01:00
Cedric Nugteren	fa0a9c689f	Make batched routines based on offsets instead of a vector of cl_mem objects - undoing many earlier changes	2017-03-08 20:10:20 +01:00
Cedric Nugteren	6aba0bbae7	Minor fixes to the client w.r.t. the addition of the batch count	2017-03-05 16:44:16 +01:00
Cedric Nugteren	b114ea49a9	Added first naive version of the batched AXPY routine	2017-03-05 15:06:14 +01:00
Cedric Nugteren	cdf354f895	Adjusted the test-infrastructure to support testing of batched-versions of routines	2017-03-05 15:04:16 +01:00
Cedric Nugteren	7f14b11f1e	Changed the way the test-data is generated: now using a single MT generator and distribution for all data	2017-03-05 11:13:47 +01:00
Cedric Nugteren	f9a520b3af	Prepared generator for batched routines; added batched AXPY routine interface	2017-03-05 10:38:38 +01:00
Cedric Nugteren	e9ef037549	Added tuning results for the Radeon HD6750M GPU (Apple OpenCL)	2017-03-04 15:24:55 +01:00
Cedric Nugteren	e993ee077b	Added a proper data-preparation function for the TRSM tests	2017-03-04 15:21:33 +01:00
Cedric Nugteren	3fc73851f7	Added proper support for the b_offset argument in TRSM	2017-03-01 21:23:33 +01:00
Cedric Nugteren	00281dad26	Fixed half-precision bugs in HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM related to incorrect constants	2017-02-27 21:00:04 +01:00
Cedric Nugteren	e09c26c706	Split the GEMM kernel further up to prevent C1091 in MSVC	2017-02-26 15:03:12 +01:00
Cedric Nugteren	ea6790665d	Merge branch 'development' into triangular_solvers	2017-02-26 14:51:45 +01:00
Cedric Nugteren	df7638c305	Fixed an out-of-bounds memory access when filling a matrix with a constant	2017-02-26 14:31:05 +01:00
Cedric Nugteren	b7310036ed	Removed half-precision support from the TRSM routine; too unstable	2017-02-26 12:56:21 +01:00
Cedric Nugteren	a433987441	Fixes division in the kernel for inversion of complex numbers	2017-02-26 10:18:45 +01:00
Cedric Nugteren	e47d95887c	Added PrepareData function for TRSM to create proper test input	2017-02-25 12:23:04 +01:00
Cedric Nugteren	2f2a510c38	Implemented a simple row-major to col-major problem conversion for TRSM	2017-02-24 21:08:44 +01:00
Cedric Nugteren	1e5b5157bc	Fixed a few issues with the TRSM routine; some tests still failing	2017-02-22 20:31:33 +01:00
Cedric Nugteren	133ebfc834	Added data-preparation function for the TRSV tests and special nan/inf checks in the error checking to make the tests pass	2017-02-19 17:43:26 +01:00
Cedric Nugteren	0643a29af5	Added tuning parameters for the AMD RX480 GPU (Ellesmere)	2017-02-18 13:59:10 +01:00
Cedric Nugteren	d6538dfc25	Fixed the naming of the C API of OverrideParameters and fixed the description	2017-02-18 10:59:38 +01:00
Cedric Nugteren	cda449a5c3	Added a C interface to the OverrideParameters function; added some in-line comments to the API	2017-02-16 21:14:48 +01:00
Cedric Nugteren	08bfb75a9d	Added input-sanity checks for the OverrideParameters function	2017-02-16 21:12:50 +01:00
Cedric Nugteren	cdb3bb7166	Added first version of the OverrideParameters function	2017-02-13 20:53:06 +01:00
Cedric Nugteren	00eb55a2d4	Fixed a small bug in GEMV: unused kernel in parameter list	2017-02-13 20:48:32 +01:00
Cedric Nugteren	345a5feb9a	Split the database into several smaller cached per-kernel databases (in preparation of per-kernel database overrides)	2017-02-12 12:02:39 +01:00
Cedric Nugteren	faa842b927	Made RemoveBySubset from the cache work with references to keys	2017-02-12 11:58:20 +01:00
Cedric Nugteren	36b942a698	Added an option to remove items from the caches, optionally by a subset of 2 specific key-values only	2017-02-11 14:05:38 +01:00
Cedric Nugteren	dc93523204	Added tuning results for Titan X (Pascal version)	2017-02-08 21:14:38 +01:00
Cedric Nugteren	c248f900c0	Merge branch 'development' into triangular_solvers	2017-02-05 22:18:59 +01:00
Cedric Nugteren	e7cbb5915a	Fixed complex version of the TRSV kernel	2017-02-05 14:36:31 +01:00
Cedric Nugteren	c209dd7af9	Improved substitution kernels a bit; added complex support	2017-02-04 22:48:06 +01:00
Cedric Nugteren	fec8c1a806	Completed a first STRSV implementation	2017-02-04 16:04:19 +01:00
Cedric Nugteren	a6ba6470aa	Added row-major support for TRSV	2017-02-04 14:25:27 +01:00
Cedric Nugteren	7c73ceb095	Added first (incomplete) version of TRSV routine	2017-01-29 17:02:00 +01:00
Ivan Shapovalov	5fb1da1a0f	Database: pass Device instead of Queue for clarity	2017-01-24 12:18:14 +03:00
Ivan Shapovalov	50e758a007	Routine: cache the database instance as well. This does not change much, but will become useful in next commits when plugin support is introduced.	2017-01-24 11:56:15 +03:00
Ivan Shapovalov	6dc18c1c57	Database: ref-count the internal map for caching	2017-01-24 11:56:15 +03:00
Ivan Shapovalov	5bcd92f297	Routine, Cache: generalize, reduce amount of copying in fast path. Implement a generalized Cache<K, V>. Two variants are provided: the first one is based on std::map, using C++14-specific transparent std::less<> and generalized std::map::find() to allow searching by tuple of references. The second one is based on std::vector and O(n) lookup, but remains C++11-compliant.	2017-01-24 11:56:15 +03:00
Ivan Shapovalov	1b8e816333	FillCache: perform compilation for each precision separately. Thus do not prevent filling cache for float if the device does not support e.g. double.	2017-01-24 02:43:00 +03:00
Ivan Shapovalov	6ad11665a1	Routine: fix semi-warm routine construction (when binary is in cache). There was a missing return statement in the semi-warm path that made CLBlast continue to the cold path after a cache hit.	2017-01-24 02:43:00 +03:00
Ivan Shapovalov	a9914ee3a8	src/clpp11.hpp: check pointers before clRelease*(). This is to avoid spurious "induced" errors on destruction, if construction failed for some reason.	2017-01-24 02:42:59 +03:00
Ivan Shapovalov	8e1c084c93	src/clpp11.hpp: do not store program source/binary in Program. The stored source/binary does not seem to serve any purpose, yet its presence makes Program a heavy (not pure refcounted) object, which is undesired esp. because it is copied from the cache in the hot path.	2017-01-24 02:42:59 +03:00
Ivan Shapovalov	1a1e863ab3	treewide: include clpp11.hpp first to silence deprecation warnings. Otherwise, cl.h gets included through clblast.h before clpp11.hpp.	2017-01-20 17:32:42 +03:00
Ivan Shapovalov	43c7707173	Routine: use PrecisionSupported<>() instead of duplicating the check	2017-01-20 17:20:45 +03:00
Cedric Nugteren	a5fd2323b6	Added prototype for the TRSV routine	2017-01-20 11:30:32 +01:00
Cedric Nugteren	a2c0a9c551	Set number of decimals for floating-point printing for error reporting	2017-01-20 11:13:44 +01:00
Cedric Nugteren	2e4f6e1609	Added tuning results for NVIDIA GTX 1080 and Intel Core i7-4790K	2017-01-19 19:42:31 +01:00
Cedric Nugteren	df9a77d74d	Added first version of the TRSM routine based on the diagonal invert kernel	2017-01-18 21:29:59 +01:00
Cedric Nugteren	4b3ffd9989	Added a first version of the diagonal block invert routine in preparation of TRSM	2017-01-15 17:30:00 +01:00
Cedric Nugteren	4a4be0c3a5	Prints additional information in verbose/debug mode	2017-01-15 17:17:40 +01:00
Cedric Nugteren	69ca271a8c	Always enables cl_khr_fp64 when running double-precision, not just for OpenCL 1.1 or lower	2017-01-07 13:31:29 +01:00
Cedric Nugteren	32b850b12b	Added tuning results for the AMD Turks GPU and the Intel Core i7-2670QM CPU	2017-01-03 20:30:56 +01:00
Cedric Nugteren	681a465b35	Prepared for the addition of the TRSM triangular solver kernel	2016-12-18 12:30:16 +01:00
Cedric Nugteren	6b533dda1c	Fixed a bug when using offsets in the direct GEMM kernels	2016-12-18 11:54:32 +01:00
Cedric Nugteren	26e0177431	Made Intel GPUs always use the indirect version of the GEMM kernel	2016-11-29 20:47:20 +01:00
Cedric Nugteren	39c49bf4f9	Made it possible to use the command-line environmental variables for each executable and without re-running CMake	2016-11-27 11:00:29 +01:00
Cedric Nugteren	080e1be684	Improved the default parameters for cases with non-common parameters across all devices	2016-11-26 16:38:17 +01:00
Cedric Nugteren	cb398f0e42	Merge pull request #125 from CNugteren/netlib_blas_api Netlib CBLAS API for CLBlast	2016-11-24 19:35:59 +01:00
Cedric Nugteren	792cc8359f	Fixed a vector-size related bug in the CLBlast Netlib API	2016-11-23 22:00:20 +01:00
Cedric Nugteren	654b41bb2b	Fixed a bug in the HSCAL routine	2016-11-23 21:29:16 +01:00
Cedric Nugteren	26ca071480	Minor changes to ensure full compatibility with the Netlib CBLAS API	2016-11-22 08:41:52 +01:00
Cedric Nugteren	eefe0df435	Made functions with scalar-buffers as output properly return values	2016-11-20 21:36:57 +01:00
Cedric Nugteren	d8af24e388	Now correctly tests for validity of the B matrix in the TRMM routine	2016-11-20 16:27:54 +01:00
Cedric Nugteren	90eb8738c4	Forced OpenCL 1.1 compilation and disabled a deprecation warning	2016-11-20 16:27:02 +01:00
Cedric Nugteren	2f0697564f	Fixed a bug in the TRMM routine caused by overwriting input data before consuming everything	2016-11-20 15:05:42 +01:00
Cedric Nugteren	6eeb1180fd	Changed the GEMM kernel selection parameters for Skylake GPUs to always favour the regular kernel	2016-11-19 22:15:33 +01:00
Cedric Nugteren	746d688e07	Updated the tuning results for the Intel Skylake ULT GT2 GPU	2016-11-15 22:42:04 +01:00
Cedric Nugteren	8ae8ab06a2	Renamed the include and source files of the Netlib CBLAS API	2016-10-25 20:33:10 +02:00
Cedric Nugteren	140121ef91	Removed the clblast namespace from the Netlib C API source file to ensure proper linking	2016-10-25 20:21:50 +02:00
Cedric Nugteren	729862e873	Fixed some issues preventing the Netlib CBLAS API from linking correctly	2016-10-25 19:56:42 +02:00
Cedric Nugteren	926aca53a0	Made the Netlib CBLAS API use the same enums with prefixes as the regular C API of CLBlast	2016-10-25 19:45:57 +02:00
Cedric Nugteren	59183b7d79	Sets the proper sizes for the buffers for the Netlib CBLAS API	2016-10-25 19:21:49 +02:00
Cedric Nugteren	f96fd372bc	Added initial version of a Netlib CBLAS implementation. TODO: Set correct buffer sizes	2016-10-25 14:28:52 +02:00
Cedric Nugteren	ec687afa75	Added tuning results for GeForce GTX TITAN Black	2016-10-24 19:49:10 +02:00
Cedric Nugteren	76d5d2ccfc	Fixed a bug in the transpose-matrix function	2016-10-23 20:49:55 +02:00
Cedric Nugteren	b8d4a9b9d0	Removed PUBLIC_API from the C++ exception classes	2016-10-23 16:09:59 +02:00
Cedric Nugteren	66f5c9d9b8	Added a fix for compilation under Visual Studio 2013 related to the new exception classes	2016-10-23 15:55:03 +02:00
Cedric Nugteren	c925fe463f	Added tuning results for the AMD Tonga GPU	2016-10-22 16:25:31 +02:00
Cedric Nugteren	a670c4c4bf	All enums in the C API are now prefixed with CLBlast to avoid potential name clashes with other projects	2016-10-22 16:14:56 +02:00
Cedric Nugteren	b0ff11acf0	Moved files around a bit; created a utilities subfolder	2016-10-22 15:36:48 +02:00
Cedric Nugteren	9afbbc9ef9	Added documentation for the better exception handling	2016-10-22 15:23:18 +02:00
Cedric Nugteren	280698d076	Merge pull request #117 from intelfx/exceptions Convert to use C++ exceptions internally	2016-10-22 15:05:12 +02:00
Cedric Nugteren	9b596820d2	Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters (2)	2016-10-22 10:50:12 +02:00
Cedric Nugteren	db17b1fbe9	Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters	2016-10-22 10:41:02 +02:00
Ivan Shapovalov	56f300607b	Routine: get rid of ::SetUp(). Since we now use C++ exceptions inside the implementation (and exceptions can be thrown from constructors), there is no need for a separate Routine::SetUp() function. For this, we also change the way how the kernel source string is constructed. The kernel-specific source code is now passed to the Routine ctor via an initializer_list of C strings to avoid unnecessary data copying while also working around C1091 of MSVC 2013.	2016-10-22 08:45:27 +03:00
Ivan Shapovalov	b98af44fcf	treewide: use C++ exceptions properly. Since the codebase is designed around proper C++ idioms such as RAII, it makes sense to only use C++ exceptions internally instead of mixing exceptions and error codes. The exceptions are now caught at top level to preserve compatibility with the existing error code-based API. Note that we deliberately do not catch C++ runtime errors (such as `std::bad_alloc`) nor logic errors (aka failed assertions) because no actual handling can ever happen for such errors. However, in the C interface we do catch _all_ exceptions (...) and convert them into a wild-card error code.	2016-10-22 08:45:25 +03:00
Ivan Shapovalov	5d03d48f7a	src/clpp11.hpp: avoid throwing exceptions from std::shared_ptr's Deleter	2016-10-22 07:25:16 +03:00
Ivan Shapovalov	6ac7edd2da	src/clpp11.hpp: GetInfoString: avoid reallocation	2016-10-22 07:25:16 +03:00
Ivan Shapovalov	106565fa9a	src/clpp11.hpp: reinstate error checking on clGetEventProfilingInfo()	2016-10-22 07:25:15 +03:00
Cedric Nugteren	597974b40d	Merge pull request #118 from matze/add-pkg-config Generate and install pkg-config description	2016-10-21 21:00:07 +02:00
Matthias Vogelgesang	3797d144cc	Generate and install pkg-config description	2016-10-21 09:38:25 +02:00
Cedric Nugteren	0f9311d46a	Fixed an issue with a growing database: the database is now a global variable in a namespace and its container uses const-pointers to the actual data	2016-10-14 20:56:32 +02:00
Cedric Nugteren	ebb505b783	Added tuning results for Intel HD Graphics IvyBridge GPU	2016-10-13 12:18:28 +02:00
Cedric Nugteren	c60f6715f8	Removed a spurious #ifdef	2016-10-12 21:49:59 +02:00
Cedric Nugteren	ad2b6ecea2	Fixed missing line ending	2016-10-12 21:10:22 +02:00
Cedric Nugteren	8a9d3cdf37	Added support for compiling the library, the client, and the samples under MSVC 2013	2016-10-10 22:45:39 +02:00
Cedric Nugteren	f88c50522d	Fixed an issue with const members of structs in the database	2016-10-10 22:24:05 +02:00
Cedric Nugteren	de77f00e8c	Fixed an issue with the length of the GEMM OpenCL string for both MSVC 2013 and 2015	2016-10-10 22:23:33 +02:00
Cedric Nugteren	fcac81bfef	First fixes towards compilation on Visual Studio 2013	2016-10-10 20:37:45 +02:00
Cedric Nugteren	08ee57f494	Updated the tuning results for the GTX 750 Ti GPU	2016-10-10 16:41:41 +02:00
Cedric Nugteren	7c228f6a67	Changed the thresholds for the direct/indirect GEMM kernels for NVIDIA and Intel GPUs	2016-10-10 16:01:02 +02:00
Cedric Nugteren	7baac46e72	Fixed a performance bug for Intel Iris Pro GPUs due to incorrect tuning results	2016-10-08 21:56:06 +02:00
Cedric Nugteren	b698e45478	Added first tuning results for the single-kernel direct GEMM implementation	2016-10-06 21:13:14 +02:00
Cedric Nugteren	a3e67f2be2	Added a kernel selection database to select between the direct and indirect GEMM kernels	2016-10-06 19:51:12 +02:00
Cedric Nugteren	7052a00a3e	Fixed a const-correctness issue with complex conjugation in the GEMM direct kernel	2016-10-03 20:13:19 +02:00
Cedric Nugteren	ca0c075de2	Added functions to load from off-chip to local memory without vector loads for the GEMM direct kernels	2016-10-03 20:09:15 +02:00
Cedric Nugteren	c1c4bc5d20	Re-organised GEMM direct kernel and added faster fall-back version for incomplete rectangles	2016-10-03 19:32:01 +02:00
Cedric Nugteren	243cef73db	Set the default number of runs for all kernels to at least 2 runs	2016-10-02 21:23:23 +02:00
Cedric Nugteren	d8827e908c	Specialised the GEMM direct kernel in four ways for transposing/non-transposing: NN, NT, TN, TT	2016-10-02 17:59:05 +02:00
Cedric Nugteren	61f489e370	Split the GEMM direct kernel into two files; set the default tuning target to 256-256-256	2016-10-02 15:06:59 +02:00
Cedric Nugteren	a459920105	Added padding to the local memory of the GEMM direct kernel	2016-10-01 16:58:53 +02:00
Cedric Nugteren	ecc704cc76	Added default num-runs to the tuner adding averaging over 10 runs as a default for the GEMM direct kernel	2016-10-01 16:55:21 +02:00
Cedric Nugteren	a9d35cf04c	Merge branch 'development' into gemm_direct	2016-10-01 13:45:08 +02:00
Cedric Nugteren	d59e5c570b	Added an option to run tuned kernels multiple times to average execution times; requires CLTune 2.5.0	2016-09-27 21:03:24 +02:00
Cedric Nugteren	db5772e521	Updated to version 8.0 of the CLCudaAPI header	2016-09-27 20:56:49 +02:00
Cedric Nugteren	adc058440c	Fixed the local memory size computation for the GEMM tuners	2016-09-27 20:03:55 +02:00
Cedric Nugteren	6178fcd584	Now generates test/client/tuner data using a fixed seed to enable reproducibility of results	2016-09-27 19:55:21 +02:00
Cedric Nugteren	73d135c2ce	Added a first version of a tuner for the GEMM direct kernel; collapsed MWGD, NWGD and KWGD into one WGD parameter	2016-09-25 14:48:34 +02:00
Cedric Nugteren	669f43aed6	Separated the tuning parameters of the new direct GEMM kernel from the indirect version	2016-09-25 13:52:08 +02:00
Cedric Nugteren	140dc12854	Added a first version of the direct version of GEMM with local memory	2016-09-25 11:38:35 +02:00
Cedric Nugteren	6aa652d6ea	Merge branch 'development' into gemm_direct	2016-09-21 21:32:18 +02:00
Cedric Nugteren	b1929d8ce7	It is now possible to set the OpenCL compiler options through an environmental variable	2016-09-21 21:22:16 +02:00
Cedric Nugteren	4ce584a014	Split the XGEMM kernel further up: now in 3 parts. This is done because MSVC can't handle long strings	2016-09-12 22:13:16 +02:00
Cedric Nugteren	aa3dffe356	Added XgemvFastRot and Xgemm 16-bit tuning results: just defaults which are now automatically taken from 32-bit if there are no entries at all	2016-09-12 20:13:38 +02:00
Cedric Nugteren	b5a67f86ec	Complete re-write of the database script. Changed Pandas for the much faster and convenient plain JSON/dict data-type	2016-09-11 21:29:28 +02:00
Cedric Nugteren	e21f32bc99	Updated database based on exhaustive tuning results for GEMM for the R9 M370X GPU	2016-09-10 14:00:43 +02:00
Cedric Nugteren	3daba70997	Updated the database script to remove duplicate entries: keeps only the best-performing cases for a specific parameters combination	2016-09-10 11:12:09 +02:00
Cedric Nugteren	55038d3c91	Split GEMM tuning in two parts: a small set of tuning parameters which is explored exhaustively and a larger set which is explored randomly	2016-09-06 20:30:06 +02:00
Cedric Nugteren	b30b26b89e	The GEMM kernel no longer adds beta*C in case beta is zero; this would cause problems if C contains NaNs	2016-09-04 17:21:16 +02:00
Cedric Nugteren	521bf6cdfc	Added tuning results for Intel Broadwell 5500 GT2 GPU	2016-09-03 16:43:23 +02:00
Cedric Nugteren	19574b2519	Updated tuning results for Haswell GT2 Mobile GPU; fixed database script to handle duplicate entries of different runs	2016-09-03 12:45:11 +02:00
Ivan Shapovalov	ea43936e94	test/correctness: read platform and device from environment Support passing environment variables CLBLAST_PLATFORM and CLBLAST_DEVICE instead of -platform and -device arguments to test executables. This is for `ctest`.	2016-08-27 05:37:26 +03:00
Cedric Nugteren	8d6a6a5bbf	Merge branch 'database_defaults' into development	2016-08-22 19:31:36 +02:00
Cedric Nugteren	0c0f0ac7f9	Also changed the default-default for unknown device types to use the same method as for known device groups	2016-08-21 20:35:20 +02:00
Cedric Nugteren	84db8958d1	Increased the ratio of GEMM tuning results to explore; reduced the tuning search space to have a better chance to evaluate more likely parameter combinations	2016-08-21 20:28:02 +02:00
Cedric Nugteren	6eca53ee23	Merge branch 'master' of https://github.com/dvasschemacq/CLBlast into dvasschemacq-master Conflicts: src/kernels/level1/xaxpy.opencl src/kernels/level2/xgemv.opencl src/kernels/level2/xgemv_fast.opencl src/kernels/level2/xger.opencl src/kernels/level2/xher.opencl src/kernels/level2/xher2.opencl src/kernels/level3/xgemm_part2.opencl	2016-08-20 12:50:31 +02:00
D. Van Assche	57f1aa7685	Adapt opencl files for 1.1 OpenCL In OpenCL 1.1 __kernel has to be before __attribute__, at least with Vivante compiler.	2016-08-18 17:33:13 +02:00
Cedric Nugteren	7d5631b7e4	Updated the database script to calculate the relative best performance of tuning results common for a device/vendor type	2016-08-15 21:01:07 +02:00
Cedric Nugteren	5004a435ff	Fixed issues related to the recent changes in the Xgemm infrastructure	2016-07-26 20:59:59 +02:00
Cedric Nugteren	5053f6ebc6	Merge branch 'development' into gemm_direct	2016-07-26 20:53:31 +02:00
Cedric Nugteren	de1afe168d	Removed all old tuning results for the XgemvFastRot kernel; re-added for a couple of devices	2016-07-25 22:57:23 +02:00
Cedric Nugteren	2582f0290a	Moved the XgemvFast and XgemvFastRot tuning database into a separate file	2016-07-25 22:43:49 +02:00
Cedric Nugteren	0252df731a	Merge branch 'development' into gemv_performance	2016-07-24 17:06:27 +02:00
Cedric Nugteren	ffa35c623a	Minor improvements after merging in groundwork for custom tuning parameters and kernels	2016-07-24 17:00:21 +02:00
Cedric Nugteren	40a72259eb	Fixed a bug in the new XgemvFastRot kernel related to local memory size	2016-07-23 16:58:11 +02:00
Cedric Nugteren	7a4f963763	Further improvements to the XgemvFastRot kernel, properly enables coalescing now	2016-07-23 14:52:32 +02:00
Cedric Nugteren	75fe8235f7	Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance	2016-07-23 10:20:11 +02:00
Ivan Shapovalov	e4e1f05079	clblast::Database, clblast::Routine: implement "database overlays" provided by routine implementation	2016-07-22 11:15:52 +03:00
Ivan Shapovalov	ae3299da30	clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS	2016-07-22 11:15:52 +03:00
Ivan Shapovalov	5502c5eec4	cl::Kernel: skip NULL entries in waitForEvents	2016-07-22 11:15:52 +03:00
Ivan Shapovalov	2dd5ee3f75	clblast::RunKernel, cl::Kernel: take const vector as waitForEvents	2016-07-22 11:15:52 +03:00
Ivan Shapovalov	1ae71614ac	xgemm: do not hardcode kernel requirements for internal matrix layout Do not hardcode the knowledge about "A and C col-major, B row-major". This allows for easier reuse of the DoGemm() routine with different kernels.	2016-07-22 11:15:52 +03:00
Cedric Nugteren	798d32edad	Improved the GEMM direct kernel by adding register blocking. Still not fast though	2016-07-17 14:36:51 +02:00
Cedric Nugteren	eaa348735e	Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder	2016-07-16 15:18:28 +02:00
Cedric Nugteren	b33bec4a59	Fixed some more types and type conversions in the clpp11 interface to OpenCL	2016-07-16 11:13:23 +02:00
Cedric Nugteren	bee9b959f4	Merge pull request #80 from gcp/getdevinfo_fixes Make sure the passed types are large enough.	2016-07-16 10:59:51 +02:00
Cedric Nugteren	066af4069b	Removed an unused variable from the copy-transpose-pad function	2016-07-16 10:56:37 +02:00
Gian-Carlo Pascutto	e0ba59c0ac	Make sure the passed types are large enough. Make sure all out parameters that are passed to functions such as clGetDeviceInfo are large enough to contain the replies.	2016-07-13 15:59:02 +02:00
Cedric Nugteren	c87e877bf2	Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel	2016-07-10 20:32:01 +02:00
Cedric Nugteren	57f09178d8	Added tuning results for AMD Oland and for Intel Graphics HD 530	2016-07-10 11:46:44 +02:00
Cedric Nugteren	39e9b1238f	Fixed a bug related to the cache and retrieval of programs based on the OpenCL context	2016-07-10 11:24:36 +02:00
Cedric Nugteren	9caa7ca5b9	Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache	2016-07-08 20:57:58 +02:00
Cedric Nugteren	27854070b4	Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen	2016-07-06 21:50:12 +02:00
Cedric Nugteren	77325b8974	Added an option to the performance clients to do a warm-up run before timing	2016-07-06 21:25:55 +02:00
Cedric Nugteren	9683b50c55	Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp)	2016-07-03 20:30:47 +02:00
Gian-Carlo Pascutto	7424532859	Ensure clGetKernelWorkGroupInfo return value fits. In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo to get the "bytes" amount needed to store the result from CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an "auto result = size_t", which in 32-bit mode is 4 bytes, regardless of the previous return value. The spec describes that it will actually be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure we are in fact passing a cl_ulong. Also adjust all callers to take the changed type into account.	2016-07-02 21:14:36 +02:00
Cedric Nugteren	7cf2f8c268	Fixed some memory leaks related to events not properly cleaned-up	2016-07-02 15:34:55 +02:00
Cedric Nugteren	b330ab0866	Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library	2016-06-30 10:49:17 +02:00
Cedric Nugteren	cd74aaac52	Updated to version 6.0 of the CLCudaAPI header	2016-06-29 19:42:49 +02:00
CNugteren	871b576c06	Made it possible to build the clients and tests on Windows using Visual Studio	2016-06-28 16:38:45 +02:00
Cedric Nugteren	76b20cfe0c	Fixes for the AppVeyor Windows build	2016-06-27 14:44:08 +02:00
Cedric Nugteren	66908ef5cd	Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes)	2016-06-19 14:59:50 +02:00
Cedric Nugteren	61203453aa	Renamed all C++ source files to .cpp to match the .hpp extension better	2016-06-19 13:55:49 +02:00
Cedric Nugteren	f726fbdc9f	Moved all headers into the source tree, changed headers to .hpp extension	2016-06-18 20:20:13 +02:00
Cedric Nugteren	bacb5d2bb2	Clean-up of the routine class, moved RunKernel to the routine/common file	2016-06-18 18:16:14 +02:00
Cedric Nugteren	7b4c0e1cf0	Removed the template from the Routine base-class	2016-06-18 14:56:55 +02:00
Cedric Nugteren	f9947b4d7f	Removed the precision argument from the routines in favor of a single templated function	2016-06-17 14:30:37 +02:00
Cedric Nugteren	536b7fe4bc	Removed the interface to the cache functions from the Routine class, calls them directly now	2016-06-17 13:57:50 +02:00
Cedric Nugteren	98a95c89fc	Moved the RunKernel and PadCopyTransposeMatrix functions out of the Routine class	2016-06-17 12:32:06 +02:00
Cedric Nugteren	afe8852eaa	Moved the test-for-valid-buffers function from the Routine class to separate functions in a separate file	2016-06-17 11:29:07 +02:00
Cedric Nugteren	52ccaf5b25	Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing	2016-06-16 18:07:46 +02:00
Cedric Nugteren	39b7dbc5e3	Added some constness to variables related to the GEMM routines	2016-06-15 12:34:05 +02:00
Cedric Nugteren	b894611ad1	Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately	2016-06-14 18:17:58 +02:00
Cedric Nugteren	3e78a99355	Moved device vendor and type checks to a common header	2016-06-14 14:30:22 +02:00
Cedric Nugteren	6e2017c67d	Added support for FP16 on ARM Mali-T628 (officially not supported)	2016-06-14 14:29:53 +02:00
Cedric Nugteren	6925003e45	Added global memory synchronisation for better cache performance on ARM Mali GPUs	2016-06-08 10:13:37 +02:00
Cedric Nugteren	03182f9d07	Added half-precision tests for the clBLAS reference through conversion to single-precision	2016-05-26 23:36:19 +02:00
Cedric Nugteren	9f87455070	Added level-3 half-precision routines HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM	2016-05-25 13:29:53 +02:00
Cedric Nugteren	ac1575056e	Added proper argument handling and displaying for half-precision data-types	2016-05-24 14:06:16 +02:00
Cedric Nugteren	3e9a07f00a	Added level-2 half-precision routines HGER/HSYR/HSPR/HSYR2/HSPR2	2016-05-22 16:59:14 +02:00
Cedric Nugteren	f0cb3fdc81	Fixed tuning results for half-precision; added first results for the xGER kernels	2016-05-22 16:46:05 +02:00
Cedric Nugteren	c8ff3f143f	Prepared the GER kernels and tuner for half-precision support	2016-05-22 16:18:08 +02:00
Cedric Nugteren	95b828da12	Added level-2 half-precision routines HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV	2016-05-22 15:38:26 +02:00
Cedric Nugteren	b6268d0c22	Added first tuning results for the half-precision xGEMV kernels	2016-05-22 15:29:05 +02:00
Cedric Nugteren	88551b4005	Prepared the GEMV kernels and tuner for half-precision support	2016-05-22 15:22:54 +02:00
Cedric Nugteren	803aaf3070	Added level-1 half-precision routines HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN	2016-05-22 14:47:14 +02:00
Cedric Nugteren	3c9e63c054	Added first tuning results for the half-precision xDOT kernels	2016-05-22 14:43:25 +02:00
Cedric Nugteren	f70ded34f3	Added half-precision support for all level 1 routines	2016-05-22 14:26:19 +02:00

780 Commits (6e2ab6ee967c4a9b3350c7ce4e7d7b736c9e45f6)