Commit Graph

780 Commits (6e2ab6ee967c4a9b3350c7ce4e7d7b736c9e45f6)

Author SHA1 Message Date
Cedric Nugteren 7a756cbce7 Fixed a failing TRSV test using a CPU with Apple OpenCL 2018-03-15 20:58:42 +01:00
Cedric Nugteren 9ff6cd7547 Added queue-finish commands to PyCLBlast samples and tests 2018-03-15 20:37:48 +01:00
Cedric Nugteren 934893972e
Merge pull request #262 from CNugteren/CLBlast-237-tuning-api
CLBlast #237: Tuning API
2018-03-11 15:38:33 +01:00
Cedric Nugteren bcf1208431 Added basic tests for PyCLBlast 2018-03-11 15:32:36 +01:00
Cedric Nugteren 903deaf368 Fixed an issue for DLL linking under Windows 2018-03-10 16:45:31 +01:00
Cedric Nugteren 3d2ef9331b Fixed a few things for the new tuning API 2018-03-10 14:35:11 +01:00
Cedric Nugteren 0bdc51e47c Completed the API for all tuneable kernels 2018-03-10 10:54:44 +01:00
kodonell c6056da0c8 ok, device id working 2018-03-10 22:21:30 +13:00
Cedric Nugteren 6397e61746 Added several more tuner API functions 2018-03-09 21:40:22 +01:00
kodonell 54a4b871b3 initial add of override parameters to pyclblast - cython not complaining, but segfault 2018-03-09 15:27:33 +13:00
Cedric Nugteren 49cc8b31ff Fixed compilation issue in Xger tuner 2018-03-06 20:59:23 +01:00
Cedric Nugteren 0e1a152023 First version of the tuning API, added interface for copy-kernel, added sample 2018-03-06 20:52:12 +01:00
Cedric Nugteren a1cedf36e3 Separate kernel tuners in .cpp with main and .hpp with settings 2018-03-03 16:37:31 +01:00
Cedric Nugteren bff64917bd Fixed some small issues regarding PR#253 2018-03-03 10:43:12 +01:00
sivagnanamn 1433dc67f1 Added C API for getting GEMM temp buffer size 2018-03-03 03:00:17 +09:00
Cedric Nugteren 11f765c16c Generated function signatures/inspect for PyCLBlast 2018-02-25 15:31:38 +01:00
Cedric Nugteren 13dc26e63d Generated PyCLBlast docstrings 2018-02-25 15:30:57 +01:00
Cedric Nugteren 0557694d39 Fixed several issues in the new invert tuner 2018-02-20 20:53:13 +01:00
Cedric Nugteren fc10a4baca Set initial pyclblast to be version 1.0.0 2018-02-18 20:19:19 +01:00
Cedric Nugteren ce5e2a1e00 Prepared PyCLBlast for release as a package on PyPi 2018-02-18 18:01:02 +01:00
Cedric Nugteren 76c21a95c2 Added PyCLBlast samples 2018-02-18 17:59:43 +01:00
Cedric Nugteren a66e24a009 Added all other level 1/2/3 routines to pyclblast 2018-02-18 17:34:10 +01:00
Cedric Nugteren e1bfb40827 Added GEMM to the Python wrapper 2018-02-18 16:33:20 +01:00
Cedric Nugteren eb85f6b514 First agenerated version (clblastXswap only for now) of the pyclblast wrapper 2018-02-14 20:50:47 +01:00
Cedric Nugteren 61b8c771ed Added skeleton for Python interface using Cython 2018-02-13 21:42:32 +01:00
Cedric Nugteren 70d0fe89c6 Fixed a minor typo 2018-02-11 15:31:08 +01:00
Cedric Nugteren 69ed46c8da Implemented the XHAD Hadamard product routine 2018-02-02 21:18:37 +01:00
Cedric Nugteren ef5008f5e4 Created the API and stubs for the HAD (hadamard-product) routines 2018-01-31 20:41:02 +01:00
Cedric Nugteren caebe8a9d5 Fixed an event synchronisation issue in the batched gemm routines 2018-01-26 20:37:04 +01:00
Cedric Nugteren 19fd263fb2 Moved some constants from global scope to a function; removed unnecessary includes 2018-01-25 20:00:43 +01:00
Cedric Nugteren 6a9d6b5da2 Changed the default number of runs for the GEMV tuner to fix issues for FP16 2018-01-25 19:57:36 +01:00
Cedric Nugteren c3f9371d16 Made GEMM routine tuning a bit more generic in preparation of possible separate batched tuning arguments 2018-01-18 19:41:59 +01:00
Cedric Nugteren bc54411d19 Made the batched routines also chose direct/indirect kernel like the main GEMM routine 2018-01-18 19:41:02 +01:00
Cedric Nugteren 0e5eaa6eb9 Factored out the generic parts of the GEMM routine tuner 2018-01-15 21:32:51 +01:00
Cedric Nugteren a500f537d8 Added a RetrieveParameters function to inspect tuning parameters 2018-01-11 20:32:06 +01:00
Cedric Nugteren 99a4df88a6 Implemented the in-direct version of the strided-batched GEMM kernel 2018-01-08 21:07:01 +01:00
Cedric Nugteren 13f0f6fc6e Implemented direct version of strided-batched GEMM kernel 2018-01-07 14:58:45 +01:00
Cedric Nugteren 9fb2c61b25 Added API and tests for new GemmStridedBatched routine 2018-01-07 14:27:15 +01:00
Cedric Nugteren f1e3b35541 Reduced duplicate code in the batched GEMM implementation 2018-01-06 19:26:11 +01:00
Cedric Nugteren c9b5d614e2 Fixed a vendor naming bug in the tuners and in the database 2018-01-06 17:02:58 +01:00
Cedric Nugteren a7ccce1969
Merge pull request #238 from CNugteren/gemm_api_with_temp_buffer
GEMM API with optional temp buffer
2018-01-06 16:08:27 +01:00
Cedric Nugteren ad197da08d Fixed the CUDA interface: replaced nullptr with 0 2018-01-06 13:38:44 +01:00
Cedric Nugteren e71c037304 Fixed a performance overhead in database creation: it is again a static variable now as it was before 2018-01-06 11:28:04 +01:00
Cedric Nugteren ce069545d4 Added CUDA interface to get temporary-buffer size for GEMM routine 2018-01-06 10:05:28 +01:00
Cedric Nugteren 44431daecc Added a CUDA version of the GEMM temp-buffer optional argument 2018-01-04 19:33:51 +01:00
Cedric Nugteren af14fff1e9 Updated the generator script to automatically generate the temp-buffer code 2018-01-04 19:31:57 +01:00
Cedric Nugteren 7f893a85d9 Revert "Added options to disable parts of the invert kernel to find out where the AMD compiler crashes"
This reverts commit 407ed52cec.
2017-12-31 16:10:40 +01:00
Cedric Nugteren 69226ae828 Changed the invert kernel slightly; added part1a/part1b disable-defines 2017-12-31 14:07:08 +01:00
Cedric Nugteren 7ce415b927 Fixed ifdef's into ifndef's 2017-12-30 21:17:31 +01:00
Cedric Nugteren 407ed52cec Added options to disable parts of the invert kernel to find out where the AMD compiler crashes 2017-12-30 21:07:50 +01:00
Cedric Nugteren ad1227c4f2 Added optional temp-buffer argument to C++ interface of GEMM 2017-12-30 18:45:06 +01:00
Cedric Nugteren 6d1e30e61f Added interface to compute the required temporary buffer size for GEMM 2017-12-28 14:46:45 +01:00
Cedric Nugteren aaea9474a1 Factored out argument processing from the GEMM routine 2017-12-28 13:56:18 +01:00
Cedric Nugteren 74792ce96c Refactored GEMM code in preparation of separate temp-buffer computation 2017-12-28 11:08:10 +01:00
Cedric Nugteren 2b9bf3a9aa Simplified invert kernel a little 2017-12-27 17:03:06 +01:00
Cedric Nugteren 1e738db6dd Split the database into multiple small compilation units 2017-12-27 12:04:22 +01:00
Cedric Nugteren 4a2fc4aa98 Made the database-vector a non-static member 2017-12-26 11:32:05 +01:00
Cedric Nugteren bd540829ea Fixes for the CUDA backend of CLBlast 2017-12-24 12:10:55 +01:00
Cedric Nugteren ef71d8e9b5 Fixed unused variable warnings showing up with Clang 2017-12-23 16:07:26 +01:00
Cedric Nugteren 7aabeb44cc Updated the tuning results for the IvyBridge M GT2 GPU 2017-12-23 15:46:41 +01:00
Cedric Nugteren 2b020d59f9 Added defines to disable OpenCL deprecation warnings 2017-12-23 15:32:22 +01:00
Cedric Nugteren 04bf5437bc Fixed a warning under MSVC 2017-12-23 15:30:08 +01:00
Cedric Nugteren 288766debb Now calling main TRSV routine again to fix compilation in MSVC 2017-12-23 14:49:21 +01:00
Cedric Nugteren 736399e528 Split the invert kernel in two parts to prevent error C1091 in MSVC 2013 2017-12-23 14:18:07 +01:00
Cedric Nugteren b1f52f130c Updated the database to use the new TRSV and Invert tuners 2017-12-23 13:55:22 +01:00
Cedric Nugteren aa7db4f987 Added TRSV block-size tuner 2017-12-23 13:34:57 +01:00
Cedric Nugteren 9dec53ff52 Merge branch 'master' into feature/more_tuners 2017-12-21 20:18:05 +01:00
Cedric Nugteren 0ee81e27b9 Added tuning results for Apple AMD Radeon Pro 580 2017-12-20 19:59:31 +01:00
Cedric Nugteren 07a7012b0d Added skeleton for a tuner for the invert kernel 2017-12-19 21:10:48 +01:00
Cedric Nugteren 249bdaa8e9 Reformatted tuning code to make compilation faster 2017-12-18 21:34:07 +01:00
Cedric Nugteren e2f8068459 Fixed an issue with the tuner: it was using platform vendor rather than device vendor 2017-12-17 17:58:06 +01:00
Cedric Nugteren 69f6591564 Removed all ARM Mali tuning results; re-added Mali-T760 and Mali-T628 results based on kernel pre-processor 2017-12-17 16:59:08 +01:00
Cedric Nugteren 7408f6e6eb Fixed an unnecessary overflow issue on 32-bit systems 2017-12-17 16:42:54 +01:00
Cedric Nugteren 4a58efc130 Fixed for error C1091 in MSVC 2013 2017-12-10 16:40:59 +01:00
Cedric Nugteren b4d3a50f19 Split GEMM kernel in 4 files instead of 3 due to MSVC 2013 string length limit 2017-12-10 16:09:09 +01:00
Cedric Nugteren 82467b64c4 Fixed a missing include 2017-12-10 14:49:38 +01:00
Cedric Nugteren c2f08fa346 Fixed an issue in the tuners to prevent error -14 from persisting (CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST) 2017-12-10 14:48:13 +01:00
Cedric Nugteren 9112e587ae Fixed an Android compilation issue 2017-12-10 13:31:57 +01:00
Cedric Nugteren 9f02fb542c Completed kernel modifications for pre-processor of all other kernels 2017-12-09 20:44:21 +01:00
Cedric Nugteren ca5dbcd2bd Made the pre-processor run by default for ARM and Qualcomm GPUs 2017-12-09 15:16:53 +01:00
Cedric Nugteren 02c0d64037 Modified the direct GEMM kernel to support array-to-register promotion 2017-12-09 14:53:10 +01:00
Cedric Nugteren 23e3a85f2c Reformatted GEMM kernel to support array-to-register promotion 2017-12-09 14:09:13 +01:00
Cedric Nugteren d9df62b794 Fixed defines parsing and substituting in pre-processor; fixed some variable names in kernels 2017-12-09 10:49:55 +01:00
Cedric Nugteren 540896476d Added register promotion to the main GEMM kernel 2017-12-07 22:05:29 +01:00
Cedric Nugteren 0f9637bbac Improved array-to-register promotion, now handling function calls as well 2017-12-05 20:39:49 +01:00
Cedric Nugteren cf4555d1f4 Added GEMM (direct and in-direct) to the pre-processor testing; modified the loops in kernel accordingly 2017-12-03 16:40:36 +01:00
Cedric Nugteren 0a1a3de58a Added basic bracket parsing in defines and loop expressions 2017-12-03 16:39:22 +01:00
Cedric Nugteren 60312e5878 Reformated transpose kernels for the pre-processor; extended the amount of tests 2017-12-03 12:00:37 +01:00
Cedric Nugteren 92842024b0 Improved array to register promotion in the pre-processor 2017-12-03 11:59:38 +01:00
Cedric Nugteren bf7aeb8d5b Improved the pre-processor's handling of defines; added a special nested defines test 2017-11-30 21:43:16 +01:00
Cedric Nugteren 13eb772343 Integrated pre-processor in compilation flow, default is still disabled 2017-11-30 21:32:47 +01:00
Cedric Nugteren 93ffb876c6 Reformatted unrollable kernel loops and added the new promote_to_registers pragma for several kernels 2017-11-29 20:21:08 +01:00
Cedric Nugteren 0dde6af703 Extended the preprocessor tests to include CopyFast and CopyPad 2017-11-29 20:18:36 +01:00
Cedric Nugteren 1d35f65cea Improves the array-to-register promotion in the pre-processor 2017-11-29 19:53:50 +01:00
Cedric Nugteren 14047861ce Improved the kernel pre-processor in various ways 2017-11-28 20:52:08 +01:00
Cedric Nugteren 35956f9db1 Added simple implementation of array-to-register promotion 2017-11-27 20:26:30 +01:00
Cedric Nugteren 9c643b293c Improved the for-loop pre-processing 2017-11-26 13:32:48 +01:00
Cedric Nugteren 69aa3b35ed Implemented first simple pre-processor: defines parser and loop unrolling based on assumptions 2017-11-25 17:46:01 +01:00
Cedric Nugteren f01bcded1e Moved string splitting functions; added string character removal function 2017-11-25 17:44:21 +01:00
Cedric Nugteren c0c6d00b12 Added stub for a preprocessor and a corresponding compilation test 2017-11-25 10:24:05 +01:00
Cedric Nugteren ebce82e650
Merge pull request #222 from CNugteren/override_params_from_json
Override params in clients from tuner JSON
2017-11-25 09:48:27 +01:00
Cedric Nugteren abb4d5ab32 Added tuning results for ARM Mali T760 GPU 2017-11-24 21:16:54 +01:00
Cedric Nugteren 9527c89c30 Made parameter override in the clients a command-line argument and added support for multi-kernel routines 2017-11-22 20:53:20 +01:00
Cedric Nugteren 0f080bbc6e Potentially fixed an MSVC 2013 issue with a copy-constructor not being generated 2017-11-20 20:54:18 +01:00
Cedric Nugteren e0f3484084 Fixes some displaying issues in the GEMM routine tuner 2017-11-20 20:29:52 +01:00
Cedric Nugteren 5467c0cac5 Fixed a variety of warnings and an error for MSVC2013 compilation 2017-11-19 21:09:24 +01:00
Cedric Nugteren 4e0d08c3bc Added compilation timing and better compilation error reporting 2017-11-19 16:58:13 +01:00
Cedric Nugteren a3a8b44f59 Some fixed for the new auto-tuner to be compatible with the Python scripts 2017-11-19 16:31:08 +01:00
Cedric Nugteren 76d2b7f0b6 Revived the GEMM routine tuner; minor formatting changes 2017-11-19 12:59:52 +01:00
Cedric Nugteren 7a54494577 Modified the kernel tuners to use the newly integrated auto-tuner 2017-11-19 12:58:41 +01:00
Cedric Nugteren 8a5a5e031e Moved some tuning functions from .hpp to .cpp 2017-11-17 20:58:36 +01:00
Cedric Nugteren f94d498a37 Moved compilation function to separate file; removed dependency of tuners of the CLBlast library 2017-11-17 20:57:46 +01:00
Cedric Nugteren 2b8ad70b63 Added printing of the best parameters for the new tuner 2017-11-16 21:18:29 +01:00
Cedric Nugteren 1b2b46f2f0 Added first version of integrated and re-written auto-tuner 2017-11-15 22:49:35 +01:00
Cedric Nugteren 0cd78bb6f9 Added kernel timing functionality to the utilities 2017-11-15 22:47:06 +01:00
Cedric Nugteren b337bffbaf Added exception handle with catch-all 2017-11-15 22:44:44 +01:00
Cedric Nugteren 03ebf14b97 Made the exception dispatch function optionally silent 2017-11-13 21:11:31 +01:00
Cedric Nugteren 4bac1287f2 Moved square-difference utility function for use in the tuners 2017-11-13 21:10:44 +01:00
Cedric Nugteren 677afd3b96 Factored out the creation of the OpenCL header and the program compilation 2017-11-11 16:14:43 +01:00
Cedric Nugteren c41d219ea4 Added tuning results for the GeForce GTX750Ti 2017-11-09 21:19:21 +01:00
Cedric Nugteren b18cc9d3f1
Merge pull request #212 from CNugteren/kernel_selection_tuner
GEMM kernel selection tuner
2017-11-07 22:20:13 +01:00
Cedric Nugteren 3ec0be6fb8 Added various GEMM routine tuning results 2017-11-07 21:34:54 +01:00
Cedric Nugteren 33ac2b0175 Improved the way the database defaults are computed 2017-11-06 21:59:45 +01:00
Cedric Nugteren 34a33b54cf Changed GEMM routine tuner's scoring to use L2 measure instead for better averaging 2017-11-06 20:50:36 +01:00
Cedric Nugteren 9b0a435fb0 Integrated the GEMM routine tuner for kernel selection; added first tuning results 2017-11-02 21:47:14 +01:00
Cedric Nugteren 73272ab97d Fixed a bug in database compression/decompression 2017-11-02 21:19:18 +01:00
Cedric Nugteren 5c90577dfd Added collecting and printing of scores for the kernel-selection tuner 2017-10-30 20:39:21 +01:00
Cedric Nugteren ac5a58cfe5 Added platform ID to the binary program cache to prevent issues with multi-platform systems 2017-10-29 20:01:30 +01:00
Cedric Nugteren 319762f150 Added Android support using the GNU C++ STL library and the GCC toolchain 2017-10-29 12:07:07 +01:00
Cedric Nugteren 12b08ae491 Merge branch 'master' into android_support 2017-10-28 17:32:37 +02:00
Cedric Nugteren 334a26eb12 Added initial version of a GEMM kernel selection tuner 2017-10-28 17:30:29 +02:00
Cedric Nugteren bd57dfa435 Moved timing function to a separate file 2017-10-28 14:12:05 +02:00
Cedric Nugteren fa6e5e67f5 Fixed a bug when using the matrix A-offset argument for the TRSM routine 2017-10-27 22:12:30 +02:00
Cedric Nugteren 449577cf07 Reduced TRSM block-size for better numerical stability 2017-10-27 22:07:43 +02:00
Cedric Nugteren 44f7fa628a Added GEMV synchronisation for the TRSV routine: similar bug as in TRSM 2017-10-27 22:01:15 +02:00
Cedric Nugteren d49aae236e Fixed a bug in TRSM routine due to missing event synchronisations after GEMM calls 2017-10-25 20:35:39 +02:00
Cedric Nugteren 472f90501c Added tuning parameters for GeForce GTX 580, GeForce GTX 1080Ti, and Core i5-4570 2017-10-20 18:06:12 +02:00
Cedric Nugteren 363568787e Moved CUmodule code from Kernel to Program class to not require re-compilation every time 2017-10-18 18:17:30 +02:00
Cedric Nugteren 9d879c949a Fix an incompatibility with CUDA's FP16 definition 2017-10-17 20:29:23 +02:00
Cedric Nugteren b1270f04b8 Made buffers of batched routines read/write (was: read-only) 2017-10-17 19:56:47 +02:00
Cedric Nugteren f349731d54 CUDA kernel compilation fixes 2017-10-17 19:53:09 +02:00
Cedric Nugteren 0719f14486 Made all CUDA kernel launches synchronous; removed exception raising 2017-10-16 21:54:42 +02:00
Cedric Nugteren d62823f067 Added a missing OpenCL-to-CUDA function translation 2017-10-15 19:53:52 +02:00
Cedric Nugteren 7663cba234 Fixes for the CUDA API: first tests pass and the client runs 2017-10-15 17:43:20 +02:00
Cedric Nugteren 71049e8d39 Added the SM-compute-arch version to the nv compile options 2017-10-15 17:41:44 +02:00
Cedric Nugteren 7408da174c Various fixes to make the first CUDA examples work 2017-10-15 12:17:35 +02:00
Cedric Nugteren 55a802c63d Fixed a kernel/attribute order bug in the direct GEMM kernels 2017-10-14 17:21:34 +02:00
Cedric Nugteren b06bc01da9 Make local memory pointers a define in OpenCL; some fixes to the recently changed transpose kernel code 2017-10-14 17:13:54 +02:00
Cedric Nugteren d9456306e0 Made transpose kernel struct init proper according to the C standard 2017-10-14 16:48:06 +02:00
Cedric Nugteren 313fc796b2 Fixed several (not all) CUDA kernel compilation issues 2017-10-14 16:01:12 +02:00
Cedric Nugteren 54d0c440ce Various fixes to make the host code and sample compile with the CUDA API 2017-10-14 11:43:57 +02:00
Cedric Nugteren 2d7b648a24 Added OpenCL to CUDA translation header for the kernels 2017-10-14 10:49:25 +02:00
Cedric Nugteren cc5b475425 CUDA API now takes context and device in instead of stream 2017-10-12 12:20:43 +02:00
Cedric Nugteren b901809345 Added first (untested) version of a CUDA API 2017-10-11 23:16:57 +02:00
Cedric Nugteren 44246053a5 Removed include of clpp11.hpp in places other than utilities.hpp 2017-10-09 19:41:40 +02:00
Cedric Nugteren df3c9f4a8a Moved non-routine-specific API functions and includes to separate files 2017-10-08 21:52:02 +02:00
Cedric Nugteren 3598762029 Moved the remaining OpenCL specific host code to the clpp11.h header where it belongs 2017-10-08 10:29:47 +02:00
Cedric Nugteren 6d3e1212f0 Synchronizes clpp11.h with CLCudaAPI 9.0 2017-10-07 18:43:29 +02:00
Cedric Nugteren 86b80cdc98 Fixed a small typo 2017-10-07 18:39:32 +02:00
Cedric Nugteren 375193fe4e Gemm in-direct implementation now uses only 1 larger instead of max 3 optional temporary buffers 2017-10-03 21:55:21 +02:00
Cedric Nugteren 6b226028d5 Allow OverrideParameters function to work before a kernel was first used 2017-10-01 20:32:39 +02:00
Cedric Nugteren 1009303717 Merge branch 'additional_tuners' 2017-09-30 21:04:32 +02:00
Cedric Nugteren c151ab1325 Refactored the tuning architecture: less duplicate now; more defaults 2017-09-30 20:26:26 +02:00
Cedric Nugteren 00b5771477 Added Android header for compilation with gnustl STL 2017-09-26 21:20:01 +02:00
Cedric Nugteren 21af690472 Added missing headers 2017-09-26 21:17:55 +02:00
Cedric Nugteren ed980a1df1 Updated database override function to work with the new database storage format 2017-09-24 15:44:14 +02:00
Cedric Nugteren 255f09843c Made program and binary databases dependent on the routine parameters on top of the name 2017-09-23 20:40:38 +02:00
Cedric Nugteren 890281f3e8 Made database-caching no longer dependent on device name but on device/platform IDs 2017-09-23 17:50:44 +02:00
Cedric Nugteren ae1eeb4d1f Fixed type conversion warnings under MSVC 2013 2017-09-19 19:44:34 +02:00
Cedric Nugteren 1d2ee29cb9 Fixed compilation issues of the database for MSVC 2013 2017-09-19 19:44:05 +02:00
Cedric Nugteren a23cd8d13a Updated README with proper AMD device names; fixed device look-up for names of length 50+ 2017-09-16 21:26:38 +02:00
Cedric Nugteren 0802e3d84c Added tuning results for Intel Core i7 6770HQ 2017-09-16 21:19:06 +02:00
Cedric Nugteren bcf39eb79a Fixed a compilation error and warning under MacOS 2017-09-16 18:34:11 +02:00
Cedric Nugteren 163474e171 Fixed an issue with the NVIDIA compute capability not being retrieved properly 2017-09-16 18:25:23 +02:00
Cedric Nugteren 4e317f5e85 Improved compilation time of the tuner database 2017-09-16 18:02:37 +02:00
Cedric Nugteren c21878ecce Added a guard against missing AMD and NVIDIA extensions 2017-09-14 21:58:08 +02:00
Cedric Nugteren 0d13d814c2 Added architecture layer in the tuning database for better performance on unseen devices 2017-09-14 21:27:33 +02:00
Cedric Nugteren 76382ff6c1 Added the new vendor-architecture-name hierarchy to the tuners as well 2017-09-10 16:34:54 +02:00
Cedric Nugteren 91ea7fcde2 Introduced the notion of a device-architecture for the database and added device and architecture name mappings 2017-09-08 21:09:05 +02:00
Cedric Nugteren 20da5e33a8 Split the database files over multiple directories and files; first step towards separate compilation 2017-09-06 21:50:42 +02:00
Cedric Nugteren 8905da259d Fixed a modulo and division issue manifesting on Apple OpenCL for im2col 2017-09-05 18:49:23 +02:00
Cedric Nugteren 28462aa050 Removed an assumption that the 'default' tuning parameters have to be stored last; this is no longer needed 2017-09-04 17:39:57 +02:00
Cedric Nugteren 297159d5b9 Fixed a bug in im2col: process only valid channel IDs 2017-08-31 21:58:12 +02:00
Cedric Nugteren 6194d43efb Fixed a bug in im2col confusing first and second workgroup size; made im2col kernel 2d instead of 3d 2017-08-31 20:34:10 +02:00
Cedric Nugteren 54e160cd88 Fixed some things in the tuner: bugs, style, and defaults to random search 2017-08-31 20:28:01 +02:00
Cedric Nugteren 161fd8514d Merge branch 'master' into im_to_col 2017-08-24 21:15:14 +02:00
Cedric Nugteren 4d9d03ba51 Completed im2col implementation 2017-08-24 21:11:12 +02:00
Cedric Nugteren a8c26594d9 Made the im2col client properly handle the arguments 2017-08-23 19:54:09 +02:00
Cedric Nugteren da28cc5e93 Minor updates after merging in the PSO addition to the tuners 2017-08-21 20:14:02 +02:00
Cedric Nugteren e5eb6b1d3a Merge pull request #173 from mcian/PSO_params
Add PSO parameters support and search strategy selection from command…
2017-08-21 20:06:29 +02:00
mcian dfd332524a Remove multistrategy and related functions 2017-08-21 14:09:11 +02:00
Cedric Nugteren 803ca781f9 First version of im2col kernel, unoptimized but working 2017-08-19 18:25:13 +02:00
Cedric Nugteren 777681dcbd Merge branch 'master' into im_to_col 2017-08-12 20:50:00 +02:00
Cedric Nugteren 0a63621579 Moved functions from the header to the .cpp file to prevent compiling the same code multiple times 2017-08-12 15:59:14 +02:00
Cedric Nugteren 844e68853e Moved some utility functions to a test-specific utility compilation-unit 2017-08-12 15:38:17 +02:00
mcian 4adee60884 Revert the xgemm strategy to default. If user wants to use multistrategy can simple call the function TestHeuristic from the main 2017-08-09 16:58:46 +02:00
mcian 0b4aa109f8 Use cltune::SearchMethod enum instead of int values 2017-08-09 16:05:25 +02:00
mcian 99afdcd908 Restore direct GEMM to previous version 2017-07-31 14:06:23 +02:00
Cedric Nugteren 18d832e149 Added tuning results for the Qualcomm Adreno 330 GPU 2017-07-30 18:18:02 +02:00
Cedric Nugteren 0ea16a0e63 Minor optimization for the direct GEMM kernel: don't ceil m and n unnecessarily high 2017-07-25 20:53:12 +02:00
Cedric Nugteren 55861c40ff Merge branch 'relax_gemmbatched_ld_requirements' 2017-07-23 21:04:17 +02:00
mcian 473e814718 Code refactoring 2017-07-23 14:48:13 +02:00
Cedric Nugteren 2d52f9b1d3 Merge pull request #176 from CNugteren/inline_keyword_optional
Made the inline keyword in kernels optional
2017-07-22 10:44:08 +02:00
mcian a36283aaec Add new threshold for ARM 2017-07-17 12:20:46 +02:00
mcian 8131e68664 Add PSO parameters support and search strategy selection from command line 2017-07-17 12:00:25 +02:00
Cedric Nugteren 97bcf77d4b First step towards supporting im2col in the test infrastructure 2017-07-16 22:33:49 +02:00
Cedric Nugteren f77b48692b Relaxed requirement on a_ld and b_ld for batched GEMM 2017-07-12 21:53:39 +02:00
Cedric Nugteren 442c31dd50 Made the inline keyword in kernels optional currently only enabled for NVIDIA and ARM GPUs 2017-07-08 17:12:16 +02:00
Cedric Nugteren 84ec50e29d Added interface and stubs for the im2col routine 2017-07-02 12:10:22 +02:00
Cedric Nugteren 4cf516cfec Fixed an if-statement in the direct GEMM kernel causing a bug with specific sets of input parameters 2017-06-30 21:57:41 +02:00
Cedric Nugteren 1a8ed48a35 Fixed some Clang and MSVC warnings 2017-06-25 11:50:36 +02:00
Cedric Nugteren 615a7fdc81 Fixes some compilation issues related to the database structure change 2017-06-21 23:07:47 +02:00
Cedric Nugteren e44feb8576 Changed the structure of the database to reduce compilation time and save memory 2017-06-20 21:19:26 +02:00
Cedric Nugteren 48f2682eb7 Added tuning results for the Core i7-920 CPU 2017-06-18 20:53:59 +02:00
Cedric Nugteren 3070b502b5 Fixed an overflow bug on 32-bit systems when chosing a GEMM kernel 2017-06-18 20:51:11 +02:00
Cedric Nugteren 33ed1e5a06 Added tuning results for GeForce GT 650M (thanks to bzcheeseman) 2017-06-01 22:52:08 +02:00
Cedric Nugteren f57e209aab Merge pull request #158 from CNugteren/msvc_compilation_fixes
MSVC compilation fixes
2017-05-27 17:53:30 +02:00
Kirill Mavreshko 64ba590279 Fixed comment decribing the order of program cache fields 2017-05-27 10:30:09 +05:00
Cedric Nugteren f7a16d427c Fixed a compilation issue under MSVC 2013 2017-05-26 22:10:56 +02:00
Kirill Mavreshko 628e1e8cce Fixes inability to run GEMM on multiple identical GPUs (issue #155) 2017-05-26 15:04:19 +05:00
Cedric Nugteren 8400ee3a09 Fixed an TRSM issue caused by incorrect block size calculation 2017-05-15 22:04:55 +02:00
Cedric Nugteren 512b83dbad Fixed a missing synchronization barrier in the invert kernel; fixes TRSM tests 2017-05-14 20:27:35 +02:00
Cedric Nugteren f151e56daa Added the IxAMIN routines: absolute minimum version of IxAMAX 2017-05-12 20:01:33 -07:00
Cedric Nugteren 86e8df60f1 Fixed a bug in the TRSM routine; tests now pass 2017-05-12 17:43:56 -07:00
Cedric Nugteren 71933c3411 Added tuning results for the AMD Radeon Fiji GPU 2017-05-11 22:53:52 -07:00
Cedric Nugteren 1df28a15fc Re-added random tuning for GEMM after accidental removal 2017-05-11 22:12:38 -07:00
Cedric Nugteren 1c33af6eab Re-added Titan X (Pascal) tuning results based on more averaging when tuning 2017-04-23 17:58:56 +02:00
Cedric Nugteren 3eea8dc998 Increased the default number of runs for the tuner from 2 up to 10 for fast kernels 2017-04-22 13:56:07 +02:00
Cedric Nugteren 192199c9cb Fixed the direct vs indirect setting for NVIDIA GPUs 2017-04-22 13:43:27 +02:00
Cedric Nugteren e41d204856 Increased the default number of runs for GEMV tuning; updated GEMV tuning results for Iris Pro 2017-04-21 22:12:20 +02:00
Cedric Nugteren d7314d4f8e Tuned the direct versus indirect GEMM kernel trade-off point for NVIDIA GPUs 2017-04-20 22:19:09 +02:00
Cedric Nugteren 409a5a2ad0 Fixed a namespace clash with CUDA FP16 for the half-datatype 2017-04-17 16:47:15 +02:00
Cedric Nugteren 2673f50518 Merge branch 'development' into benchmarking 2017-04-16 19:41:14 +02:00
Cedric Nugteren 10205d773e Added a new Xaxpy kernel in between the regular and fast version in 2017-04-14 20:16:10 +02:00
Cedric Nugteren f7f8ec644f Fixed CUDA malloc and cuBLAS handles: cuBLAS as a performance-reference now works 2017-04-13 21:31:27 +02:00
Cedric Nugteren 22b3ea9256 Merge branch 'development' into cublas_reference
Conflicts:
	scripts/generator/generator.py
2017-04-10 20:11:45 +02:00
Cedric Nugteren 7374c37e2e Fixed a compilation issue under MSVC and GCC 2017-04-10 08:38:24 +02:00
Cedric Nugteren 2d45c37676 Removed const-vector-of-const-objects from the database class to remain according to the C++11 standard 2017-04-10 07:40:27 +02:00
Cedric Nugteren fb6c78ea07 Added a special override database for the Apple CPU implementation on OS X: this makes the test work, it does not focus on good performance 2017-04-07 07:37:30 +02:00
Cedric Nugteren d28ee082b0 Uses float2 and double2 for base complex data-types instead of a custom struct; fixes bug on Apple OpenCL 2017-04-07 07:35:15 +02:00
Cedric Nugteren ce369702d8 Added some missing const-ness 2017-04-07 07:34:32 +02:00
Cedric Nugteren b24d364743 Layed the groundwork for cuBLAS comparisons in the clients 2017-04-02 18:06:15 +02:00
Cedric Nugteren b84d2296b8 Separated host-device and device-host memory copies from execution of the CBLAS reference code; for fair timing and code de-duplication 2017-04-01 13:36:24 +02:00
Cedric Nugteren c27d2f0c1e Added an (optional) non-direct implementation of the batched GEMM routine 2017-03-19 16:04:04 +01:00
Cedric Nugteren 2fd04dae83 Added batched versions of the pad/copy/transpose kernels 2017-03-19 15:57:44 +01:00
Cedric Nugteren 11bb30e72b Added the possibility to tune batched kernels 2017-03-14 20:29:51 +01:00
Cedric Nugteren 7b8f8fce68 Added initial naive version of the batched GEMM routine based on the direct GEMM kernel 2017-03-11 16:02:45 +01:00
Cedric Nugteren 49e04c7fce Added API and test infrastructure for the batched GEMM routine 2017-03-10 21:24:35 +01:00
Cedric Nugteren d754586b49 Added proper testing of the alpha parameter; finalized the batched AXPY implementation 2017-03-10 20:49:59 +01:00
Cedric Nugteren 92a657290a Fixed a small compilation bug for MSVC related to a floating-point constant 2017-03-10 20:30:10 +01:00