Commit graph

272 commits

Author SHA1 Message Date
Cedric Nugteren c1c4bc5d20 Re-organised GEMM direct kernel and added faster fall-back version for incomplete rectangles 2016-10-03 19:32:01 +02:00
Cedric Nugteren 243cef73db Set the default number of runs for all kernels to at least 2 runs 2016-10-02 21:23:23 +02:00
Cedric Nugteren d8827e908c Specialised the GEMM direct kernel in four ways for transposing/non-transposing: NN, NT, TN, TT 2016-10-02 17:59:05 +02:00
Cedric Nugteren 61f489e370 Split the GEMM direct kernel into two files; set the default tuning target to 256-256-256 2016-10-02 15:06:59 +02:00
Cedric Nugteren a459920105 Added padding to the local memory of the GEMM direct kernel 2016-10-01 16:58:53 +02:00
Cedric Nugteren ecc704cc76 Added default num-runs to the tuner adding averaging over 10 runs as a default for the GEMM direct kernel 2016-10-01 16:55:21 +02:00
Cedric Nugteren a9d35cf04c Merge branch 'development' into gemm_direct 2016-10-01 13:45:08 +02:00
Cedric Nugteren d59e5c570b Added an option to run tuned kernels multiple times to average execution times; requires CLTune 2.5.0 2016-09-27 21:03:24 +02:00
Cedric Nugteren db5772e521 Updated to version 8.0 of the CLCudaAPI header 2016-09-27 20:56:49 +02:00
Cedric Nugteren adc058440c Fixed the local memory size computation for the GEMM tuners 2016-09-27 20:03:55 +02:00
Cedric Nugteren 6178fcd584 Now generates test/client/tuner data using a fixed seed to enable reproducability of results 2016-09-27 19:55:21 +02:00
Cedric Nugteren 73d135c2ce Added a first version of a tuner for the GEMM direct kernel; collapsed MWGD, NWGD and KWGD into one WGD parameter 2016-09-25 14:48:34 +02:00
Cedric Nugteren 669f43aed6 Separated the tuning parameters of the new direct GEMM kernel from the indirect version 2016-09-25 13:52:08 +02:00
Cedric Nugteren 140dc12854 Added a first version of the direct version of GEMM with local memory 2016-09-25 11:38:35 +02:00
Cedric Nugteren 6aa652d6ea Merge branch 'development' into gemm_direct 2016-09-21 21:32:18 +02:00
Cedric Nugteren b1929d8ce7 It is now possible to set the OpenCL compiler options through an environmental variable 2016-09-21 21:22:16 +02:00
Cedric Nugteren 4ce584a014 Split the XGEMM kernel further up: now in 3 parts. This is done because MSVC can't handle long strings 2016-09-12 22:13:16 +02:00
Cedric Nugteren aa3dffe356 Added XgemvFastRot and Xgemm 16-bit tuning results: just defaults which are now automatically taken from 32-bit if there are no entries at all 2016-09-12 20:13:38 +02:00
Cedric Nugteren b5a67f86ec Complete re-write of the database script. Changed Pandas for the much faster and convienient plain JSON/dict data-type 2016-09-11 21:29:28 +02:00
Cedric Nugteren e21f32bc99 Updated database based on exhaustive tuning results for GEMM for the R9 M370X GPU 2016-09-10 14:00:43 +02:00
Cedric Nugteren 3daba70997 Updated the database script to remove duplicate entries: keeps only the best-performing cases for a specific parameters combination 2016-09-10 11:12:09 +02:00
Cedric Nugteren 55038d3c91 Split GEMM tuning in two parts: a small set of tuning parameters which is explored exhaustively and a larger set which is explored randomly 2016-09-06 20:30:06 +02:00
Cedric Nugteren b30b26b89e The GEMM kernel no longer adds beta*C in case beta is zero; this would cause problems if C contains NaNs 2016-09-04 17:21:16 +02:00
Cedric Nugteren 521bf6cdfc Added tuning results for Intel Broadwell 5500 GT2 GPU 2016-09-03 16:43:23 +02:00
Cedric Nugteren 19574b2519 Updated tuning results for Haswell GT2 Mobile GPU; fixed database script to handle duplicate entries of different runs 2016-09-03 12:45:11 +02:00
Ivan Shapovalov ea43936e94 test/correctness: read platform and device from environment
Support passing environment variables CLBLAST_PLATFORM and CLBLAST_DEVICE
instead of -platform and -device arguments to test executables.

This is for `ctest`.
2016-08-27 05:37:26 +03:00
Cedric Nugteren 8d6a6a5bbf Merge branch 'database_defaults' into development 2016-08-22 19:31:36 +02:00
Cedric Nugteren 0c0f0ac7f9 Also changed the default-default for unknown device types to use the same method as for known device groups 2016-08-21 20:35:20 +02:00
Cedric Nugteren 84db8958d1 Increased the ratio of GEMM tuning results to explore; reduced the tuning search space to have a better chance to evaluate more likely parameter combinations 2016-08-21 20:28:02 +02:00
Cedric Nugteren 6eca53ee23 Merge branch 'master' of https://github.com/dvasschemacq/CLBlast into dvasschemacq-master
Conflicts:
	src/kernels/level1/xaxpy.opencl
	src/kernels/level2/xgemv.opencl
	src/kernels/level2/xgemv_fast.opencl
	src/kernels/level2/xger.opencl
	src/kernels/level2/xher.opencl
	src/kernels/level2/xher2.opencl
	src/kernels/level3/xgemm_part2.opencl
2016-08-20 12:50:31 +02:00
D. Van Assche 57f1aa7685 Adapt opencl files for 1.1 OpenCL
In OpenCL 1.1 __kernel has to be before __attribute__, at least with
Vivante compiler.
2016-08-18 17:33:13 +02:00
Cedric Nugteren 7d5631b7e4 Updated the database script to calculate the relative best performance of tuning results common for a device/vendor type 2016-08-15 21:01:07 +02:00
Cedric Nugteren 5004a435ff Fixed issues related to the recent changes in the Xgemm infrastructure 2016-07-26 20:59:59 +02:00
Cedric Nugteren 5053f6ebc6 Merge branch 'development' into gemm_direct 2016-07-26 20:53:31 +02:00
Cedric Nugteren de1afe168d Removed all old tuning results for the XgemvFastRot kernel; re-added for a couple of devices 2016-07-25 22:57:23 +02:00
Cedric Nugteren 2582f0290a Moved the XgemvFast and XgemvFastRot tuning database into a separate file 2016-07-25 22:43:49 +02:00
Cedric Nugteren 0252df731a Merge branch 'development' into gemv_performance 2016-07-24 17:06:27 +02:00
Cedric Nugteren ffa35c623a Minor improvements after merging in groundwork for custom tuning parameters and kernels 2016-07-24 17:00:21 +02:00
Cedric Nugteren 40a72259eb Fixe a bug in the new XgemvFastRot kernel related to local memory size 2016-07-23 16:58:11 +02:00
Cedric Nugteren 7a4f963763 Further improvements to the XgemvFastRot kernel, properly enables coalescing now 2016-07-23 14:52:32 +02:00
Cedric Nugteren 75fe8235f7 Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance 2016-07-23 10:20:11 +02:00
Ivan Shapovalov e4e1f05079 clblast::Database, clblast::Routine: implement "database overlays" provided by routine implementation 2016-07-22 11:15:52 +03:00
Ivan Shapovalov ae3299da30 clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 5502c5eec4 cl::Kernel: skip NULL entries in waitForEvents 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 2dd5ee3f75 clblast::RunKernel, cl::Kernel: take const vector as waitForEvents 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 1ae71614ac xgemm: do not hardcode kernel requirements for internal matrix layout
Do not hardcode the knowledge about "A and C col-major, B row-major".

This allows for easier reuse of the DoGemm() routine with different
kernels.
2016-07-22 11:15:52 +03:00
Cedric Nugteren 798d32edad Improved the GEMM direct kernel by adding register blocking. Still not fast though 2016-07-17 14:36:51 +02:00
Cedric Nugteren eaa348735e Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder 2016-07-16 15:18:28 +02:00
Cedric Nugteren b33bec4a59 Fixed some more types and type conversions in the clpp11 interface to OpenCL 2016-07-16 11:13:23 +02:00
Cedric Nugteren bee9b959f4 Merge pull request #80 from gcp/getdevinfo_fixes
Make sure the passed types are large enough.
2016-07-16 10:59:51 +02:00
Cedric Nugteren 066af4069b Removed an unused variable from the copy-transpose-pad function 2016-07-16 10:56:37 +02:00
Gian-Carlo Pascutto e0ba59c0ac Make sure the passed types are large enough.
Make sure all out parameters that are passed to functions such
as clGetDeviceInfo are large enough to contain the replies.
2016-07-13 15:59:02 +02:00
Cedric Nugteren c87e877bf2 Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel 2016-07-10 20:32:01 +02:00
Cedric Nugteren 57f09178d8 Added tuning results for AMD Oland and for Intel Graphics HD 530 2016-07-10 11:46:44 +02:00
Cedric Nugteren 39e9b1238f Fixed a bug related to the cache and retrieval of programs based on the OpenCL context 2016-07-10 11:24:36 +02:00
Cedric Nugteren 9caa7ca5b9 Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache 2016-07-08 20:57:58 +02:00
Cedric Nugteren 27854070b4 Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen 2016-07-06 21:50:12 +02:00
Cedric Nugteren 77325b8974 Added an option to the performance clients to do a warm-up run before timing 2016-07-06 21:25:55 +02:00
Cedric Nugteren 9683b50c55 Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp) 2016-07-03 20:30:47 +02:00
Gian-Carlo Pascutto 7424532859 Ensure clGetKernelWorkGroupInfo return value fits.
In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo
to get the "bytes" amount needed to store the result from
CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an
"auto result = size_t", which in 32-bit mode is 4 bytes, regardless
of the previous return value. The spec describes that it will actually
be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure
we are in fact passing a cl_ulong.

Also adjust all callers to take the changed type into account.
2016-07-02 21:14:36 +02:00
Cedric Nugteren 7cf2f8c268 Fixed some memory leaks related to events not properly cleaned-up 2016-07-02 15:34:55 +02:00
Cedric Nugteren b330ab0866 Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library 2016-06-30 10:49:17 +02:00
Cedric Nugteren cd74aaac52 Updated to version 6.0 of the CLCudaAPI header 2016-06-29 19:42:49 +02:00
CNugteren 871b576c06 Made it possible to build the clients and tests on Windows using Visual Studio 2016-06-28 16:38:45 +02:00
Cedric Nugteren 76b20cfe0c Fixes for the AppVeyor Windows build 2016-06-27 14:44:08 +02:00
Cedric Nugteren 66908ef5cd Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes) 2016-06-19 14:59:50 +02:00
Cedric Nugteren 61203453aa Renamed all C++ source files to .cpp to match the .hpp extension better 2016-06-19 13:55:49 +02:00
Cedric Nugteren f726fbdc9f Moved all headers into the source tree, changed headers to .hpp extension 2016-06-18 20:20:13 +02:00
Cedric Nugteren bacb5d2bb2 Clean-up of the routine class, moved RunKernel to the routine/common file 2016-06-18 18:16:14 +02:00
Cedric Nugteren 7b4c0e1cf0 Removed the template from the Routine base-class 2016-06-18 14:56:55 +02:00
Cedric Nugteren f9947b4d7f Removed the precision argument from the routines in favor of a single templated function 2016-06-17 14:30:37 +02:00
Cedric Nugteren 536b7fe4bc Removed the interface to the cache functions from the Routine class, calls them directly now 2016-06-17 13:57:50 +02:00
Cedric Nugteren 98a95c89fc Moved the RunKernel and PadCopyTransposeMatrix functions out of the Routine class 2016-06-17 12:32:06 +02:00
Cedric Nugteren afe8852eaa Moved the test-for-valid-buffers function from the Routine class to separate functions in a separate file 2016-06-17 11:29:07 +02:00
Cedric Nugteren 52ccaf5b25 Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing 2016-06-16 18:07:46 +02:00
Cedric Nugteren 39b7dbc5e3 Added some constness to variables related to the GEMM routines 2016-06-15 12:34:05 +02:00
Cedric Nugteren b894611ad1 Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately 2016-06-14 18:17:58 +02:00
Cedric Nugteren 3e78a99355 Moved device vendor and type checks to a common header 2016-06-14 14:30:22 +02:00
Cedric Nugteren 6e2017c67d Added support for FP16 on ARM Mali-T628 (officially not supported) 2016-06-14 14:29:53 +02:00
Cedric Nugteren 6925003e45 Added global memory synchronisation for better cache performance on ARM Mali GPUs 2016-06-08 10:13:37 +02:00
Cedric Nugteren 03182f9d07 Added half-precision tests for the clBLAS reference through conversion to single-precision 2016-05-26 23:36:19 +02:00
Cedric Nugteren 9f87455070 Added level-3 half-precision routines HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM 2016-05-25 13:29:53 +02:00
Cedric Nugteren ac1575056e Added proper argument handling and displaying for half-precision data-types 2016-05-24 14:06:16 +02:00
Cedric Nugteren 3e9a07f00a Added level-2 half-precision routines HGER/HSYR/HSPR/HSYR2/HSPR2 2016-05-22 16:59:14 +02:00
Cedric Nugteren f0cb3fdc81 Fixed tuning results for half-precision; added first results for the xGER kernels 2016-05-22 16:46:05 +02:00
Cedric Nugteren c8ff3f143f Prepared the GER kernels and tuner for half-precision support 2016-05-22 16:18:08 +02:00
Cedric Nugteren 95b828da12 Added level-2 half-precision routines HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV 2016-05-22 15:38:26 +02:00
Cedric Nugteren b6268d0c22 Added first tuning results for the half-precision xGEMV kernels 2016-05-22 15:29:05 +02:00
Cedric Nugteren 88551b4005 Prepared the GEMV kernels and tuner for half-precision support 2016-05-22 15:22:54 +02:00
Cedric Nugteren 803aaf3070 Added level-1 half-precision routines HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN 2016-05-22 14:47:14 +02:00
Cedric Nugteren 3c9e63c054 Added first tuning results for the half-precision xDOT kernels 2016-05-22 14:43:25 +02:00
Cedric Nugteren f70ded34f3 Added half-precision support for all level 1 routines 2016-05-22 14:26:19 +02:00
Cedric Nugteren 489c5d76cf Merged in latest changes from 0.7.1 release 2016-05-18 21:32:56 +02:00
Cedric Nugteren 7a3b695db7 Added half precision tuning results for supporting kernels (pad, copy, transpose, padtranspose) 2016-05-16 12:45:10 +02:00
Cedric Nugteren af2ac62212 Prepared GEMM and supporting kernels and tuners for half-precision support 2016-05-16 12:37:24 +02:00
Cedric Nugteren 4b6bdd83a2 Added header with conversions from and to half-precision floating-point 2016-05-15 20:13:57 +02:00
Cedric Nugteren 5e1b2e021f Set kernel arguments for AXPY as constant memory buffers, making it possible to transfer half-precision values as well 2016-05-14 18:06:00 +02:00
Cedric Nugteren 120c31a30f Initial experimental version of the half-precision HAXPY routine 2016-05-13 20:49:34 +02:00
Cedric Nugteren f2ba75890c Initial changes in preparation for half-precision fp16 support 2016-05-12 19:56:21 +02:00
cnugteren 25a25dbd6f Fixed errors in xAXPY and xSCAL tests on AMD hardware 2016-05-08 17:30:31 +02:00