Cedric Nugteren
ebce82e650
Merge pull request #222 from CNugteren/override_params_from_json
...
Override params in clients from tuner JSON
2017-11-25 09:48:27 +01:00
Cedric Nugteren
abb4d5ab32
Added tuning results for ARM Mali T760 GPU
2017-11-24 21:16:54 +01:00
Cedric Nugteren
9527c89c30
Made parameter override in the clients a command-line argument and added support for multi-kernel routines
2017-11-22 20:53:20 +01:00
Cedric Nugteren
0f080bbc6e
Potentially fixed an MSVC 2013 issue with a copy-constructor not being generated
2017-11-20 20:54:18 +01:00
Cedric Nugteren
e0f3484084
Fixes some display issues in the GEMM routine tuner
2017-11-20 20:29:52 +01:00
Cedric Nugteren
5467c0cac5
Fixed a variety of warnings and an error for MSVC2013 compilation
2017-11-19 21:09:24 +01:00
Cedric Nugteren
4e0d08c3bc
Added compilation timing and better compilation error reporting
2017-11-19 16:58:13 +01:00
Cedric Nugteren
a3a8b44f59
Some fixes for the new auto-tuner to be compatible with the Python scripts
2017-11-19 16:31:08 +01:00
Cedric Nugteren
76d2b7f0b6
Revived the GEMM routine tuner; minor formatting changes
2017-11-19 12:59:52 +01:00
Cedric Nugteren
7a54494577
Modified the kernel tuners to use the newly integrated auto-tuner
2017-11-19 12:58:41 +01:00
Cedric Nugteren
8a5a5e031e
Moved some tuning functions from .hpp to .cpp
2017-11-17 20:58:36 +01:00
Cedric Nugteren
f94d498a37
Moved compilation function to separate file; removed dependency of tuners of the CLBlast library
2017-11-17 20:57:46 +01:00
Cedric Nugteren
2b8ad70b63
Added printing of the best parameters for the new tuner
2017-11-16 21:18:29 +01:00
Cedric Nugteren
1b2b46f2f0
Added first version of integrated and re-written auto-tuner
2017-11-15 22:49:35 +01:00
Cedric Nugteren
0cd78bb6f9
Added kernel timing functionality to the utilities
2017-11-15 22:47:06 +01:00
Cedric Nugteren
b337bffbaf
Added exception handle with catch-all
2017-11-15 22:44:44 +01:00
Cedric Nugteren
03ebf14b97
Made the exception dispatch function optionally silent
2017-11-13 21:11:31 +01:00
Cedric Nugteren
4bac1287f2
Moved square-difference utility function for use in the tuners
2017-11-13 21:10:44 +01:00
Cedric Nugteren
677afd3b96
Factored out the creation of the OpenCL header and the program compilation
2017-11-11 16:14:43 +01:00
Cedric Nugteren
c41d219ea4
Added tuning results for the GeForce GTX750Ti
2017-11-09 21:19:21 +01:00
Cedric Nugteren
b18cc9d3f1
Merge pull request #212 from CNugteren/kernel_selection_tuner
...
GEMM kernel selection tuner
2017-11-07 22:20:13 +01:00
Cedric Nugteren
3ec0be6fb8
Added various GEMM routine tuning results
2017-11-07 21:34:54 +01:00
Cedric Nugteren
33ac2b0175
Improved the way the database defaults are computed
2017-11-06 21:59:45 +01:00
Cedric Nugteren
34a33b54cf
Changed GEMM routine tuner's scoring to use L2 measure instead for better averaging
2017-11-06 20:50:36 +01:00
Cedric Nugteren
9b0a435fb0
Integrated the GEMM routine tuner for kernel selection; added first tuning results
2017-11-02 21:47:14 +01:00
Cedric Nugteren
73272ab97d
Fixed a bug in database compression/decompression
2017-11-02 21:19:18 +01:00
Cedric Nugteren
5c90577dfd
Added collecting and printing of scores for the kernel-selection tuner
2017-10-30 20:39:21 +01:00
Cedric Nugteren
ac5a58cfe5
Added platform ID to the binary program cache to prevent issues with multi-platform systems
2017-10-29 20:01:30 +01:00
Cedric Nugteren
319762f150
Added Android support using the GNU C++ STL library and the GCC toolchain
2017-10-29 12:07:07 +01:00
Cedric Nugteren
12b08ae491
Merge branch 'master' into android_support
2017-10-28 17:32:37 +02:00
Cedric Nugteren
334a26eb12
Added initial version of a GEMM kernel selection tuner
2017-10-28 17:30:29 +02:00
Cedric Nugteren
bd57dfa435
Moved timing function to a separate file
2017-10-28 14:12:05 +02:00
Cedric Nugteren
fa6e5e67f5
Fixed a bug when using the matrix A-offset argument for the TRSM routine
2017-10-27 22:12:30 +02:00
Cedric Nugteren
449577cf07
Reduced TRSM block-size for better numerical stability
2017-10-27 22:07:43 +02:00
Cedric Nugteren
44f7fa628a
Added GEMV synchronisation for the TRSV routine: similar bug as in TRSM
2017-10-27 22:01:15 +02:00
Cedric Nugteren
d49aae236e
Fixed a bug in TRSM routine due to missing event synchronisations after GEMM calls
2017-10-25 20:35:39 +02:00
Cedric Nugteren
472f90501c
Added tuning parameters for GeForce GTX 580, GeForce GTX 1080Ti, and Core i5-4570
2017-10-20 18:06:12 +02:00
Cedric Nugteren
363568787e
Moved CUmodule code from Kernel to Program class to not require re-compilation every time
2017-10-18 18:17:30 +02:00
Cedric Nugteren
9d879c949a
Fix an incompatibility with CUDA's FP16 definition
2017-10-17 20:29:23 +02:00
Cedric Nugteren
b1270f04b8
Made buffers of batched routines read/write (was: read-only)
2017-10-17 19:56:47 +02:00
Cedric Nugteren
f349731d54
CUDA kernel compilation fixes
2017-10-17 19:53:09 +02:00
Cedric Nugteren
0719f14486
Made all CUDA kernel launches synchronous; removed exception raising
2017-10-16 21:54:42 +02:00
Cedric Nugteren
d62823f067
Added a missing OpenCL-to-CUDA function translation
2017-10-15 19:53:52 +02:00
Cedric Nugteren
7663cba234
Fixes for the CUDA API: first tests pass and the client runs
2017-10-15 17:43:20 +02:00
Cedric Nugteren
71049e8d39
Added the SM-compute-arch version to the nv compile options
2017-10-15 17:41:44 +02:00
Cedric Nugteren
7408da174c
Various fixes to make the first CUDA examples work
2017-10-15 12:17:35 +02:00
Cedric Nugteren
55a802c63d
Fixed a kernel/attribute order bug in the direct GEMM kernels
2017-10-14 17:21:34 +02:00
Cedric Nugteren
b06bc01da9
Make local memory pointers a define in OpenCL; some fixes to the recently changed transpose kernel code
2017-10-14 17:13:54 +02:00
Cedric Nugteren
d9456306e0
Made transpose kernel struct init proper according to the C standard
2017-10-14 16:48:06 +02:00
Cedric Nugteren
313fc796b2
Fixed several (not all) CUDA kernel compilation issues
2017-10-14 16:01:12 +02:00
Cedric Nugteren
54d0c440ce
Various fixes to make the host code and sample compile with the CUDA API
2017-10-14 11:43:57 +02:00
Cedric Nugteren
2d7b648a24
Added OpenCL to CUDA translation header for the kernels
2017-10-14 10:49:25 +02:00
Cedric Nugteren
cc5b475425
CUDA API now takes context and device in instead of stream
2017-10-12 12:20:43 +02:00
Cedric Nugteren
b901809345
Added first (untested) version of a CUDA API
2017-10-11 23:16:57 +02:00
Cedric Nugteren
44246053a5
Removed include of clpp11.hpp in places other than utilities.hpp
2017-10-09 19:41:40 +02:00
Cedric Nugteren
df3c9f4a8a
Moved non-routine-specific API functions and includes to separate files
2017-10-08 21:52:02 +02:00
Cedric Nugteren
3598762029
Moved the remaining OpenCL specific host code to the clpp11.h header where it belongs
2017-10-08 10:29:47 +02:00
Cedric Nugteren
6d3e1212f0
Synchronizes clpp11.h with CLCudaAPI 9.0
2017-10-07 18:43:29 +02:00
Cedric Nugteren
86b80cdc98
Fixed a small typo
2017-10-07 18:39:32 +02:00
Cedric Nugteren
375193fe4e
GEMM indirect implementation now uses only 1 larger temporary buffer instead of up to 3 optional ones
2017-10-03 21:55:21 +02:00
Cedric Nugteren
6b226028d5
Allow OverrideParameters function to work before a kernel was first used
2017-10-01 20:32:39 +02:00
Cedric Nugteren
1009303717
Merge branch 'additional_tuners'
2017-09-30 21:04:32 +02:00
Cedric Nugteren
c151ab1325
Refactored the tuning architecture: less duplication now; more defaults
2017-09-30 20:26:26 +02:00
Cedric Nugteren
00b5771477
Added Android header for compilation with gnustl STL
2017-09-26 21:20:01 +02:00
Cedric Nugteren
21af690472
Added missing headers
2017-09-26 21:17:55 +02:00
Cedric Nugteren
ed980a1df1
Updated database override function to work with the new database storage format
2017-09-24 15:44:14 +02:00
Cedric Nugteren
255f09843c
Made program and binary databases dependent on the routine parameters on top of the name
2017-09-23 20:40:38 +02:00
Cedric Nugteren
890281f3e8
Made database-caching no longer dependent on device name but on device/platform IDs
2017-09-23 17:50:44 +02:00
Cedric Nugteren
ae1eeb4d1f
Fixed type conversion warnings under MSVC 2013
2017-09-19 19:44:34 +02:00
Cedric Nugteren
1d2ee29cb9
Fixed compilation issues of the database for MSVC 2013
2017-09-19 19:44:05 +02:00
Cedric Nugteren
a23cd8d13a
Updated README with proper AMD device names; fixed device look-up for names of length 50+
2017-09-16 21:26:38 +02:00
Cedric Nugteren
0802e3d84c
Added tuning results for Intel Core i7 6770HQ
2017-09-16 21:19:06 +02:00
Cedric Nugteren
bcf39eb79a
Fixed a compilation error and warning under MacOS
2017-09-16 18:34:11 +02:00
Cedric Nugteren
163474e171
Fixed an issue with the NVIDIA compute capability not being retrieved properly
2017-09-16 18:25:23 +02:00
Cedric Nugteren
4e317f5e85
Improved compilation time of the tuner database
2017-09-16 18:02:37 +02:00
Cedric Nugteren
c21878ecce
Added a guard against missing AMD and NVIDIA extensions
2017-09-14 21:58:08 +02:00
Cedric Nugteren
0d13d814c2
Added architecture layer in the tuning database for better performance on unseen devices
2017-09-14 21:27:33 +02:00
Cedric Nugteren
76382ff6c1
Added the new vendor-architecture-name hierarchy to the tuners as well
2017-09-10 16:34:54 +02:00
Cedric Nugteren
91ea7fcde2
Introduced the notion of a device-architecture for the database and added device and architecture name mappings
2017-09-08 21:09:05 +02:00
Cedric Nugteren
20da5e33a8
Split the database files over multiple directories and files; first step towards separate compilation
2017-09-06 21:50:42 +02:00
Cedric Nugteren
8905da259d
Fixed a modulo and division issue manifesting on Apple OpenCL for im2col
2017-09-05 18:49:23 +02:00
Cedric Nugteren
28462aa050
Removed an assumption that the 'default' tuning parameters have to be stored last; this is no longer needed
2017-09-04 17:39:57 +02:00
Cedric Nugteren
297159d5b9
Fixed a bug in im2col: process only valid channel IDs
2017-08-31 21:58:12 +02:00
Cedric Nugteren
6194d43efb
Fixed a bug in im2col confusing first and second workgroup size; made im2col kernel 2d instead of 3d
2017-08-31 20:34:10 +02:00
Cedric Nugteren
54e160cd88
Fixed some things in the tuner: bugs, style, and defaults to random search
2017-08-31 20:28:01 +02:00
Cedric Nugteren
161fd8514d
Merge branch 'master' into im_to_col
2017-08-24 21:15:14 +02:00
Cedric Nugteren
4d9d03ba51
Completed im2col implementation
2017-08-24 21:11:12 +02:00
Cedric Nugteren
a8c26594d9
Made the im2col client properly handle the arguments
2017-08-23 19:54:09 +02:00
Cedric Nugteren
da28cc5e93
Minor updates after merging in the PSO addition to the tuners
2017-08-21 20:14:02 +02:00
Cedric Nugteren
e5eb6b1d3a
Merge pull request #173 from mcian/PSO_params
...
Add PSO parameters support and search strategy selection from command…
2017-08-21 20:06:29 +02:00
mcian
dfd332524a
Remove multistrategy and related functions
2017-08-21 14:09:11 +02:00
Cedric Nugteren
803ca781f9
First version of im2col kernel, unoptimized but working
2017-08-19 18:25:13 +02:00
Cedric Nugteren
777681dcbd
Merge branch 'master' into im_to_col
2017-08-12 20:50:00 +02:00
Cedric Nugteren
0a63621579
Moved functions from the header to the .cpp file to prevent compiling the same code multiple times
2017-08-12 15:59:14 +02:00
Cedric Nugteren
844e68853e
Moved some utility functions to a test-specific utility compilation-unit
2017-08-12 15:38:17 +02:00
mcian
4adee60884
Revert the xgemm strategy to default. A user who wants to use multistrategy can simply call the function TestHeuristic from main
2017-08-09 16:58:46 +02:00
mcian
0b4aa109f8
Use cltune::SearchMethod enum instead of int values
2017-08-09 16:05:25 +02:00
mcian
99afdcd908
Restore direct GEMM to previous version
2017-07-31 14:06:23 +02:00
Cedric Nugteren
18d832e149
Added tuning results for the Qualcomm Adreno 330 GPU
2017-07-30 18:18:02 +02:00
Cedric Nugteren
0ea16a0e63
Minor optimization for the direct GEMM kernel: don't ceil m and n unnecessarily high
2017-07-25 20:53:12 +02:00
Cedric Nugteren
55861c40ff
Merge branch 'relax_gemmbatched_ld_requirements'
2017-07-23 21:04:17 +02:00
mcian
473e814718
Code refactoring
2017-07-23 14:48:13 +02:00
Cedric Nugteren
2d52f9b1d3
Merge pull request #176 from CNugteren/inline_keyword_optional
...
Made the inline keyword in kernels optional
2017-07-22 10:44:08 +02:00
mcian
a36283aaec
Add new threshold for ARM
2017-07-17 12:20:46 +02:00
mcian
8131e68664
Add PSO parameters support and search strategy selection from command line
2017-07-17 12:00:25 +02:00
Cedric Nugteren
97bcf77d4b
First step towards supporting im2col in the test infrastructure
2017-07-16 22:33:49 +02:00
Cedric Nugteren
f77b48692b
Relaxed requirement on a_ld and b_ld for batched GEMM
2017-07-12 21:53:39 +02:00
Cedric Nugteren
442c31dd50
Made the inline keyword in kernels optional; currently only enabled for NVIDIA and ARM GPUs
2017-07-08 17:12:16 +02:00
Cedric Nugteren
84ec50e29d
Added interface and stubs for the im2col routine
2017-07-02 12:10:22 +02:00
Cedric Nugteren
4cf516cfec
Fixed an if-statement in the direct GEMM kernel causing a bug with specific sets of input parameters
2017-06-30 21:57:41 +02:00
Cedric Nugteren
1a8ed48a35
Fixed some Clang and MSVC warnings
2017-06-25 11:50:36 +02:00
Cedric Nugteren
615a7fdc81
Fixes some compilation issues related to the database structure change
2017-06-21 23:07:47 +02:00
Cedric Nugteren
e44feb8576
Changed the structure of the database to reduce compilation time and save memory
2017-06-20 21:19:26 +02:00
Cedric Nugteren
48f2682eb7
Added tuning results for the Core i7-920 CPU
2017-06-18 20:53:59 +02:00
Cedric Nugteren
3070b502b5
Fixed an overflow bug on 32-bit systems when choosing a GEMM kernel
2017-06-18 20:51:11 +02:00
Cedric Nugteren
33ed1e5a06
Added tuning results for GeForce GT 650M (thanks to bzcheeseman)
2017-06-01 22:52:08 +02:00
Cedric Nugteren
f57e209aab
Merge pull request #158 from CNugteren/msvc_compilation_fixes
...
MSVC compilation fixes
2017-05-27 17:53:30 +02:00
Kirill Mavreshko
64ba590279
Fixed comment describing the order of program cache fields
2017-05-27 10:30:09 +05:00
Cedric Nugteren
f7a16d427c
Fixed a compilation issue under MSVC 2013
2017-05-26 22:10:56 +02:00
Kirill Mavreshko
628e1e8cce
Fixes inability to run GEMM on multiple identical GPUs (issue #155)
2017-05-26 15:04:19 +05:00
Cedric Nugteren
8400ee3a09
Fixed a TRSM issue caused by incorrect block size calculation
2017-05-15 22:04:55 +02:00
Cedric Nugteren
512b83dbad
Fixed a missing synchronization barrier in the invert kernel; fixes TRSM tests
2017-05-14 20:27:35 +02:00
Cedric Nugteren
f151e56daa
Added the IxAMIN routines: absolute minimum version of IxAMAX
2017-05-12 20:01:33 -07:00
Cedric Nugteren
86e8df60f1
Fixed a bug in the TRSM routine; tests now pass
2017-05-12 17:43:56 -07:00
Cedric Nugteren
71933c3411
Added tuning results for the AMD Radeon Fiji GPU
2017-05-11 22:53:52 -07:00
Cedric Nugteren
1df28a15fc
Re-added random tuning for GEMM after accidental removal
2017-05-11 22:12:38 -07:00
Cedric Nugteren
1c33af6eab
Re-added Titan X (Pascal) tuning results based on more averaging when tuning
2017-04-23 17:58:56 +02:00
Cedric Nugteren
3eea8dc998
Increased the default number of runs for the tuner from 2 up to 10 for fast kernels
2017-04-22 13:56:07 +02:00
Cedric Nugteren
192199c9cb
Fixed the direct vs indirect setting for NVIDIA GPUs
2017-04-22 13:43:27 +02:00
Cedric Nugteren
e41d204856
Increased the default number of runs for GEMV tuning; updated GEMV tuning results for Iris Pro
2017-04-21 22:12:20 +02:00
Cedric Nugteren
d7314d4f8e
Tuned the direct versus indirect GEMM kernel trade-off point for NVIDIA GPUs
2017-04-20 22:19:09 +02:00
Cedric Nugteren
409a5a2ad0
Fixed a namespace clash with CUDA FP16 for the half-datatype
2017-04-17 16:47:15 +02:00
Cedric Nugteren
2673f50518
Merge branch 'development' into benchmarking
2017-04-16 19:41:14 +02:00
Cedric Nugteren
10205d773e
Added a new Xaxpy kernel in between the regular and fast version in
2017-04-14 20:16:10 +02:00
Cedric Nugteren
f7f8ec644f
Fixed CUDA malloc and cuBLAS handles: cuBLAS as a performance-reference now works
2017-04-13 21:31:27 +02:00
Cedric Nugteren
22b3ea9256
Merge branch 'development' into cublas_reference
...
Conflicts:
scripts/generator/generator.py
2017-04-10 20:11:45 +02:00
Cedric Nugteren
7374c37e2e
Fixed a compilation issue under MSVC and GCC
2017-04-10 08:38:24 +02:00
Cedric Nugteren
2d45c37676
Removed const-vector-of-const-objects from the database class to remain compliant with the C++11 standard
2017-04-10 07:40:27 +02:00
Cedric Nugteren
fb6c78ea07
Added a special override database for the Apple CPU implementation on OS X: this makes the test work; it does not focus on good performance
2017-04-07 07:37:30 +02:00
Cedric Nugteren
d28ee082b0
Uses float2 and double2 for base complex data-types instead of a custom struct; fixes bug on Apple OpenCL
2017-04-07 07:35:15 +02:00
Cedric Nugteren
ce369702d8
Added some missing const-ness
2017-04-07 07:34:32 +02:00
Cedric Nugteren
b24d364743
Laid the groundwork for cuBLAS comparisons in the clients
2017-04-02 18:06:15 +02:00
Cedric Nugteren
b84d2296b8
Separated host-device and device-host memory copies from execution of the CBLAS reference code; for fair timing and code de-duplication
2017-04-01 13:36:24 +02:00
Cedric Nugteren
c27d2f0c1e
Added an (optional) non-direct implementation of the batched GEMM routine
2017-03-19 16:04:04 +01:00
Cedric Nugteren
2fd04dae83
Added batched versions of the pad/copy/transpose kernels
2017-03-19 15:57:44 +01:00
Cedric Nugteren
11bb30e72b
Added the possibility to tune batched kernels
2017-03-14 20:29:51 +01:00
Cedric Nugteren
7b8f8fce68
Added initial naive version of the batched GEMM routine based on the direct GEMM kernel
2017-03-11 16:02:45 +01:00
Cedric Nugteren
49e04c7fce
Added API and test infrastructure for the batched GEMM routine
2017-03-10 21:24:35 +01:00
Cedric Nugteren
d754586b49
Added proper testing of the alpha parameter; finalized the batched AXPY implementation
2017-03-10 20:49:59 +01:00
Cedric Nugteren
92a657290a
Fixed a small compilation bug for MSVC related to a floating-point constant
2017-03-10 20:30:10 +01:00
Cedric Nugteren
878d93e7dc
Implemented a batched version of the AXPY kernel
2017-03-08 20:36:35 +01:00
Cedric Nugteren
fa0a9c689f
Make batched routines based on offsets instead of a vector of cl_mem objects - undoing many earlier changes
2017-03-08 20:10:20 +01:00
Cedric Nugteren
6aba0bbae7
Minor fixes to the client w.r.t. the addition of the batch count
2017-03-05 16:44:16 +01:00
Cedric Nugteren
b114ea49a9
Added first naive version of the batched AXPY routine
2017-03-05 15:06:14 +01:00
Cedric Nugteren
cdf354f895
Adjusted the test-infrastructure to support testing of batched-versions of routines
2017-03-05 15:04:16 +01:00
Cedric Nugteren
7f14b11f1e
Changed the way the test-data is generated: now using a single MT generator and distribution for all data
2017-03-05 11:13:47 +01:00
Cedric Nugteren
f9a520b3af
Prepared generator for batched routines; added batched AXPY routine interface
2017-03-05 10:38:38 +01:00
Cedric Nugteren
e9ef037549
Added tuning results for the Radeon HD6750M GPU (Apple OpenCL)
2017-03-04 15:24:55 +01:00
Cedric Nugteren
e993ee077b
Added a proper data-preparation function for the TRSM tests
2017-03-04 15:21:33 +01:00
Cedric Nugteren
3fc73851f7
Added proper support for the b_offset argument in TRSM
2017-03-01 21:23:33 +01:00
Cedric Nugteren
00281dad26
Fixed half-precision bugs in HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM related to incorrect constants
2017-02-27 21:00:04 +01:00
Cedric Nugteren
e09c26c706
Split the GEMM kernel further up to prevent C1091 in MSVC
2017-02-26 15:03:12 +01:00
Cedric Nugteren
ea6790665d
Merge branch 'development' into triangular_solvers
2017-02-26 14:51:45 +01:00
Cedric Nugteren
df7638c305
Fixed an out-of-bounds memory access when filling a matrix with a constant
2017-02-26 14:31:05 +01:00
Cedric Nugteren
b7310036ed
Removed half-precision support from the TRSM routine; too unstable
2017-02-26 12:56:21 +01:00
Cedric Nugteren
a433987441
Fixes division in the kernel for inversion of complex numbers
2017-02-26 10:18:45 +01:00
Cedric Nugteren
e47d95887c
Added PrepareData function for TRSM to create proper test input
2017-02-25 12:23:04 +01:00
Cedric Nugteren
2f2a510c38
Implemented a simple row-major to col-major problem conversion for TRSM
2017-02-24 21:08:44 +01:00
Cedric Nugteren
1e5b5157bc
Fixed a few issues with the TRSM routine; some tests still failing
2017-02-22 20:31:33 +01:00
Cedric Nugteren
133ebfc834
Added data-preparation function for the TRSV tests and special nan/inf checks in the error checking to make the tests pass
2017-02-19 17:43:26 +01:00
Cedric Nugteren
0643a29af5
Added tuning parameters for the AMD RX480 GPU (Ellesmere)
2017-02-18 13:59:10 +01:00
Cedric Nugteren
d6538dfc25
Fixed the naming of the C API of OverrideParameters and fixed the description
2017-02-18 10:59:38 +01:00
Cedric Nugteren
cda449a5c3
Added a C interface to the OverrideParameters function; added some in-line comments to the API
2017-02-16 21:14:48 +01:00
Cedric Nugteren
08bfb75a9d
Added input-sanity checks for the OverrideParameters function
2017-02-16 21:12:50 +01:00
Cedric Nugteren
cdb3bb7166
Added first version of the OverrideParameters function
2017-02-13 20:53:06 +01:00
Cedric Nugteren
00eb55a2d4
Fixed a small bug in GEMV: unused kernel in parameter list
2017-02-13 20:48:32 +01:00
Cedric Nugteren
345a5feb9a
Split the database into several smaller cached per-kernel databases (in preparation of per-kernel database overrides)
2017-02-12 12:02:39 +01:00
Cedric Nugteren
faa842b927
Made RemoveBySubset from the cache work with references to keys
2017-02-12 11:58:20 +01:00
Cedric Nugteren
36b942a698
Added an option to remove items from the caches, optionally by a subset of 2 specific key-values only
2017-02-11 14:05:38 +01:00
Cedric Nugteren
dc93523204
Added tuning results for Titan X (Pascal version)
2017-02-08 21:14:38 +01:00
Cedric Nugteren
c248f900c0
Merge branch 'development' into triangular_solvers
2017-02-05 22:18:59 +01:00
Cedric Nugteren
e7cbb5915a
Fixed complex version of the TRSV kernel
2017-02-05 14:36:31 +01:00
Cedric Nugteren
c209dd7af9
Improved substitution kernels a bit; added complex support
2017-02-04 22:48:06 +01:00
Cedric Nugteren
fec8c1a806
Completed a first STRSV implementation
2017-02-04 16:04:19 +01:00
Cedric Nugteren
a6ba6470aa
Added row-major support for TRSV
2017-02-04 14:25:27 +01:00
Cedric Nugteren
7c73ceb095
Added first (incomplete) version of TRSV routine
2017-01-29 17:02:00 +01:00
Ivan Shapovalov
5fb1da1a0f
Database: pass Device instead of Queue for clarity
2017-01-24 12:18:14 +03:00
Ivan Shapovalov
50e758a007
Routine: cache the database instance as well
...
This does not change much, but will become useful in the next commits when
plugin support is introduced.
2017-01-24 11:56:15 +03:00
Ivan Shapovalov
6dc18c1c57
Database: ref-count the internal map for caching
2017-01-24 11:56:15 +03:00
Ivan Shapovalov
5bcd92f297
Routine, Cache: generalize, reduce amount of copying in fast path
...
Implement a generalized Cache<K, V>. Two variants are provided: the
first one is based on std::map, using C++14-specific transparent
std::less<> and generalized std::map::find() to allow searching by tuple
of references. The second one is based on std::vector and O(n) lookup,
but remains C++11-compliant.
2017-01-24 11:56:15 +03:00
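The std::map variant described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: the names `Cache`, `Store`, and `Find` are hypothetical. It relies only on standard C++14 features: a transparent `std::less<>` comparator, the templated `std::map::find()` overload, and the heterogeneous comparison operators of `std::tuple`.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical generalized cache (names are illustrative, not from CLBlast).
// The transparent comparator std::less<> enables the templated find() overload,
// so a lookup key may have a different type than the stored key.
template <typename Key, typename Value>
class Cache {
 public:
  // Stores a value; keeps the existing entry if the key is already present.
  void Store(Key key, Value value) {
    map_.emplace(std::move(key), std::move(value));
  }

  // Looks up by any type comparable with Key, e.g. a tuple of references,
  // which avoids copying the key components on the fast path.
  template <typename LookupKey>
  const Value* Find(const LookupKey& key) const {
    const auto it = map_.find(key);
    return it == map_.end() ? nullptr : &it->second;
  }

 private:
  std::map<Key, Value, std::less<>> map_;
};

// Small self-check: an owning tuple is stored, a tuple of references is used to search.
inline void CacheExample() {
  Cache<std::tuple<int, std::string>, int> cache;
  cache.Store(std::make_tuple(1, std::string("Xgemm")), 42);
  const int device_id = 1;
  const std::string routine = "Xgemm";
  const int* hit = cache.Find(std::tie(device_id, routine));  // no string copy
  assert(hit != nullptr && *hit == 42);
}
```

The vector-based variant mentioned in the commit would trade this logarithmic lookup for a linear scan in exchange for C++11 compliance.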
Ivan Shapovalov
1b8e816333
FillCache: perform compilation for each precision separately
...
Thus do not prevent filling the cache for float if the device does not support,
e.g., double.
2017-01-24 02:43:00 +03:00
Ivan Shapovalov
6ad11665a1
Routine: fix semi-warm routine construction (when binary is in cache)
...
There was a missing return statement in the semi-warm path that made
CLBlast continue to the cold path after a cache hit.
2017-01-24 02:43:00 +03:00
Ivan Shapovalov
a9914ee3a8
src/clpp11.hpp: check pointers before clRelease*()
...
This is to avoid spurious "induced" errors on destruction, if construction
failed for some reason.
2017-01-24 02:42:59 +03:00
Ivan Shapovalov
8e1c084c93
src/clpp11.hpp: do not store program source/binary in Program
...
The stored source/binary does not seem to serve any purpose, yet its
presence makes Program a heavy (not pure refcounted) object, which is
undesired esp. because it is copied from the cache in the hot path.
2017-01-24 02:42:59 +03:00
Ivan Shapovalov
1a1e863ab3
treewide: include clpp11.hpp first to silence deprecation warnings
...
Otherwise, cl.h gets included through clblast.h before clpp11.hpp.
2017-01-20 17:32:42 +03:00
Ivan Shapovalov
43c7707173
Routine: use PrecisionSupported<>() instead of duplicating the check
2017-01-20 17:20:45 +03:00
Cedric Nugteren
a5fd2323b6
Added prototype for the TRSV routine
2017-01-20 11:30:32 +01:00
Cedric Nugteren
a2c0a9c551
Set number of decimals for floating-point printing for error reporting
2017-01-20 11:13:44 +01:00
Cedric Nugteren
2e4f6e1609
Added tuning results for NVIDIA GTX 1080 and Intel Core i7-4790K
2017-01-19 19:42:31 +01:00
Cedric Nugteren
df9a77d74d
Added first version of the TRSM routine based on the diagonal invert kernel
2017-01-18 21:29:59 +01:00
Cedric Nugteren
4b3ffd9989
Added a first version of the diagonal block invert routine in preparation of TRSM
2017-01-15 17:30:00 +01:00
Cedric Nugteren
4a4be0c3a5
Prints additional information in verbose/debug mode
2017-01-15 17:17:40 +01:00
Cedric Nugteren
69ca271a8c
Always enables cl_khr_fp64 when running double-precision, not just for OpenCL 1.1 or lower
2017-01-07 13:31:29 +01:00
Cedric Nugteren
32b850b12b
Added tuning results for the AMD Turks GPU and the Intel Core i7-2670QM CPU
2017-01-03 20:30:56 +01:00
Cedric Nugteren
681a465b35
Prepared for the addition of the TRSM triangular solver kernel
2016-12-18 12:30:16 +01:00
Cedric Nugteren
6b533dda1c
Fixed a bug when using offsets in the direct GEMM kernels
2016-12-18 11:54:32 +01:00
Cedric Nugteren
26e0177431
Made Intel GPUs always use the indirect version of the GEMM kernel
2016-11-29 20:47:20 +01:00
Cedric Nugteren
39c49bf4f9
Made it possible to use the command-line environmental variables for each executable and without re-running CMake
2016-11-27 11:00:29 +01:00
Cedric Nugteren
080e1be684
Improved the default parameters for cases with non-common parameters across all devices
2016-11-26 16:38:17 +01:00
Cedric Nugteren
cb398f0e42
Merge pull request #125 from CNugteren/netlib_blas_api
...
Netlib CBLAS API for CLBlast
2016-11-24 19:35:59 +01:00
Cedric Nugteren
792cc8359f
Fixed a vector-size related bug in the CLBlast Netlib API
2016-11-23 22:00:20 +01:00
Cedric Nugteren
654b41bb2b
Fixed a bug in the HSCAL routine
2016-11-23 21:29:16 +01:00
Cedric Nugteren
26ca071480
Minor changes to ensure full compatibility with the Netlib CBLAS API
2016-11-22 08:41:52 +01:00
Cedric Nugteren
eefe0df435
Made functions with scalar-buffers as output properly return values
2016-11-20 21:36:57 +01:00
Cedric Nugteren
d8af24e388
Now correctly tests for validity of the B matrix in the TRMM routine
2016-11-20 16:27:54 +01:00
Cedric Nugteren
90eb8738c4
Forced OpenCL 1.1 compilation and disabled a deprecation warning
2016-11-20 16:27:02 +01:00
Cedric Nugteren
2f0697564f
Fixed a bug in the TRMM routine caused by overwriting input data before consuming everything
2016-11-20 15:05:42 +01:00
Cedric Nugteren
6eeb1180fd
Changed the GEMM kernel selection parameters for Skylake GPUs to always favour the regular kernel
2016-11-19 22:15:33 +01:00
Cedric Nugteren
746d688e07
Updated the tuning results for the Intel Skylake ULT GT2 GPU
2016-11-15 22:42:04 +01:00
Cedric Nugteren
8ae8ab06a2
Renamed the include and source files of the Netlib CBLAS API
2016-10-25 20:33:10 +02:00
Cedric Nugteren
140121ef91
Removed the clblast namespace from the Netlib C API source file to ensure proper linking
2016-10-25 20:21:50 +02:00
Cedric Nugteren
729862e873
Fixed some issues preventing the Netlib CBLAS API from linking correctly
2016-10-25 19:56:42 +02:00
Cedric Nugteren
926aca53a0
Made the Netlib CBLAS API use the same enums with prefixes as the regular C API of CLBlast
2016-10-25 19:45:57 +02:00
Cedric Nugteren
59183b7d79
Sets the proper sizes for the buffers for the Netlib CBLAS API
2016-10-25 19:21:49 +02:00
Cedric Nugteren
f96fd372bc
Added initial version of a Netlib CBLAS implementation. TODO: Set correct buffer sizes
2016-10-25 14:28:52 +02:00
Cedric Nugteren
ec687afa75
Added tuning results for GeForce GTX TITAN Black
2016-10-24 19:49:10 +02:00
Cedric Nugteren
76d5d2ccfc
Fixed a bug in the transpose-matrix function
2016-10-23 20:49:55 +02:00
Cedric Nugteren
b8d4a9b9d0
Removed PUBLIC_API from the C++ exception classes
2016-10-23 16:09:59 +02:00
Cedric Nugteren
66f5c9d9b8
Added a fix for compilation under Visual Studio 2013 related to the new exception classes
2016-10-23 15:55:03 +02:00
Cedric Nugteren
c925fe463f
Added tuning results for the AMD Tonga GPU
2016-10-22 16:25:31 +02:00
Cedric Nugteren
a670c4c4bf
All enums in the C API are now prefixed with CLBlast to avoid potential name clashes with other projects
2016-10-22 16:14:56 +02:00
Cedric Nugteren
b0ff11acf0
Moved files around a bit; created a utilities subfolder
2016-10-22 15:36:48 +02:00
Cedric Nugteren
9afbbc9ef9
Added documentation for the better exception handling
2016-10-22 15:23:18 +02:00
Cedric Nugteren
280698d076
Merge pull request #117 from intelfx/exceptions
...
Convert to use C++ exceptions internally
2016-10-22 15:05:12 +02:00
Cedric Nugteren
9b596820d2
Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters (2)
2016-10-22 10:50:12 +02:00
Cedric Nugteren
db17b1fbe9
Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
2016-10-22 10:41:02 +02:00
Ivan Shapovalov
56f300607b
Routine: get rid of ::SetUp()
...
Since we now use C++ exceptions inside the implementation (and exceptions
can be thrown from constructors), there is no need for a separate
Routine::SetUp() function.
For this, we also change the way the kernel source string is constructed.
The kernel-specific source code is now passed to the Routine ctor via
an initializer_list of C strings to avoid unnecessary data copying
while also working around C1091 of MSVC 2013.
2016-10-22 08:45:27 +03:00
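The idea of passing kernel source fragments as an initializer_list of C strings, as described in the commit above, can be sketched like this. The helper name `JoinSource` is hypothetical and the real Routine constructor may assemble the source differently; the sketch only shows how several shorter string literals can be joined into one source string with a single allocation, so that no individual literal has to be very long.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <initializer_list>
#include <string>

// Hypothetical helper (not from CLBlast): concatenates source fragments.
// Each fragment is a plain C string, so nothing is copied until the final
// string is assembled, and each literal stays below compiler length limits.
inline std::string JoinSource(std::initializer_list<const char*> fragments) {
  std::size_t total = 0;
  for (const char* fragment : fragments) { total += std::strlen(fragment); }
  std::string source;
  source.reserve(total);  // one allocation for the whole kernel source
  for (const char* fragment : fragments) { source += fragment; }
  return source;
}

// Small self-check with two made-up fragments.
inline void JoinSourceExample() {
  const std::string source = JoinSource({"#define WGS 64\n", "// kernel body\n"});
  assert(source == "#define WGS 64\n// kernel body\n");
}
```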
Ivan Shapovalov
b98af44fcf
treewide: use C++ exceptions properly
...
Since the codebase is designed around proper C++ idioms such as RAII, it
makes sense to only use C++ exceptions internally instead of mixing
exceptions and error codes. The exceptions are now caught at top level
to preserve compatibility with the existing error code-based API.
Note that we deliberately do not catch C++ runtime errors (such as
`std::bad_alloc`) nor logic errors (aka failed assertions) because no
actual handling can ever happen for such errors.
However, in the C interface we do catch _all_ exceptions (...) and
convert them into a wild-card error code.
2016-10-22 08:45:25 +03:00
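The two catch levels described in the commit above can be sketched as follows. All names and status values here (`RoutineError`, `CallCpp`, `CallC`, `kSuccess`, `kUnexpectedError`) are hypothetical and chosen for illustration; they are not CLBlast's actual classes or error codes. The C++ entry point translates only the library's own exception type into a status code and lets everything else propagate, while the C entry point additionally catches all exceptions and maps them to a wild-card code.

```cpp
#include <cassert>
#include <exception>
#include <new>
#include <stdexcept>

// Hypothetical status values, for illustration only.
constexpr int kSuccess = 0;
constexpr int kUnexpectedError = -9999;

// Hypothetical library exception carrying an error code.
class RoutineError : public std::exception {
 public:
  explicit RoutineError(int status) : status_(status) {}
  int status() const noexcept { return status_; }
  const char* what() const noexcept override { return "routine error"; }
 private:
  int status_;
};

// C++ API level: convert only the library's own exceptions into a status code.
// std::bad_alloc, std::logic_error and the like deliberately propagate.
template <typename Routine>
int CallCpp(Routine routine) {
  try {
    routine();
    return kSuccess;
  } catch (const RoutineError& e) {
    return e.status();
  }
}

// C API level: nothing may escape, so every remaining exception
// becomes a single wild-card error code.
template <typename Routine>
int CallC(Routine routine) noexcept {
  try {
    return CallCpp(routine);
  } catch (...) {
    return kUnexpectedError;
  }
}

// Small self-check of both levels.
inline void DispatchExample() {
  assert(CallCpp([] { throw RoutineError(-30); }) == -30);
  assert(CallC([] { throw std::bad_alloc(); }) == kUnexpectedError);
}
```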
Ivan Shapovalov
5d03d48f7a
src/clpp11.hpp: avoid throwing exceptions from std::shared_ptr's Deleter
2016-10-22 07:25:16 +03:00
Ivan Shapovalov
6ac7edd2da
src/clpp11.hpp: GetInfoString: avoid reallocation
2016-10-22 07:25:16 +03:00
Ivan Shapovalov
106565fa9a
src/clpp11.hpp: reinstate error checking on clGetEventProfilingInfo()
2016-10-22 07:25:15 +03:00
Cedric Nugteren
597974b40d
Merge pull request #118 from matze/add-pkg-config
...
Generate and install pkg-config description
2016-10-21 21:00:07 +02:00
Matthias Vogelgesang
3797d144cc
Generate and install pkg-config description
2016-10-21 09:38:25 +02:00
Cedric Nugteren
0f9311d46a
Fixed an issue with a growing database: the database is now a global variable in a namespace and its container uses const-pointers to the actual data
2016-10-14 20:56:32 +02:00
Cedric Nugteren
ebb505b783
Added tuning results for Intel HD Graphics IvyBridge GPU
2016-10-13 12:18:28 +02:00
Cedric Nugteren
c60f6715f8
Removed a spurious #ifdef
2016-10-12 21:49:59 +02:00
Cedric Nugteren
ad2b6ecea2
Fixed missing line ending
2016-10-12 21:10:22 +02:00
Cedric Nugteren
8a9d3cdf37
Added support for compiling the library, the client, and the samples under MSVC 2013
2016-10-10 22:45:39 +02:00
Cedric Nugteren
f88c50522d
Fixed an issue with const members of structs in the database
2016-10-10 22:24:05 +02:00
Cedric Nugteren
de77f00e8c
Fixed an issue with the length of the GEMM OpenCL string for both MSVC 2013 and 2015
2016-10-10 22:23:33 +02:00
Cedric Nugteren
fcac81bfef
First fixes towards compilation on Visual Studio 2013
2016-10-10 20:37:45 +02:00
Cedric Nugteren
08ee57f494
Updated the tuning results for the GTX 750 Ti GPU
2016-10-10 16:41:41 +02:00
Cedric Nugteren
7c228f6a67
Changed the thresholds for the direct/indirect GEMM kernels for NVIDIA and Intel GPUs
2016-10-10 16:01:02 +02:00
Cedric Nugteren
7baac46e72
Fixed a performance bug for Intel Iris Pro GPUs due to incorrect tuning results
2016-10-08 21:56:06 +02:00
Cedric Nugteren
b698e45478
Added first tuning results for the single-kernel direct GEMM implementation
2016-10-06 21:13:14 +02:00
Cedric Nugteren
a3e67f2be2
Added a kernel selection database to select between the direct and indirect GEMM kernels
2016-10-06 19:51:12 +02:00
Cedric Nugteren
7052a00a3e
Fixed a const-correctness issue with complex conjugation in the GEMM direct kernel
2016-10-03 20:13:19 +02:00
Cedric Nugteren
ca0c075de2
Added functions to load from off-chip to local memory without vector loads for the GEMM direct kernels
2016-10-03 20:09:15 +02:00
Cedric Nugteren
c1c4bc5d20
Re-organised GEMM direct kernel and added faster fall-back version for incomplete rectangles
2016-10-03 19:32:01 +02:00
Cedric Nugteren
243cef73db
Set the default number of runs for all kernels to at least 2 runs
2016-10-02 21:23:23 +02:00
Cedric Nugteren
d8827e908c
Specialised the GEMM direct kernel in four ways for transposing/non-transposing: NN, NT, TN, TT
2016-10-02 17:59:05 +02:00
Cedric Nugteren
61f489e370
Split the GEMM direct kernel into two files; set the default tuning target to 256-256-256
2016-10-02 15:06:59 +02:00
Cedric Nugteren
a459920105
Added padding to the local memory of the GEMM direct kernel
2016-10-01 16:58:53 +02:00
Cedric Nugteren
ecc704cc76
Added default num-runs to the tuner adding averaging over 10 runs as a default for the GEMM direct kernel
2016-10-01 16:55:21 +02:00
Cedric Nugteren
a9d35cf04c
Merge branch 'development' into gemm_direct
2016-10-01 13:45:08 +02:00
Cedric Nugteren
d59e5c570b
Added an option to run tuned kernels multiple times to average execution times; requires CLTune 2.5.0
2016-09-27 21:03:24 +02:00
Cedric Nugteren
db5772e521
Updated to version 8.0 of the CLCudaAPI header
2016-09-27 20:56:49 +02:00
Cedric Nugteren
adc058440c
Fixed the local memory size computation for the GEMM tuners
2016-09-27 20:03:55 +02:00
Cedric Nugteren
6178fcd584
Now generates test/client/tuner data using a fixed seed to enable reproducibility of results
2016-09-27 19:55:21 +02:00
Cedric Nugteren
73d135c2ce
Added a first version of a tuner for the GEMM direct kernel; collapsed MWGD, NWGD and KWGD into one WGD parameter
2016-09-25 14:48:34 +02:00
Cedric Nugteren
669f43aed6
Separated the tuning parameters of the new direct GEMM kernel from the indirect version
2016-09-25 13:52:08 +02:00
Cedric Nugteren
140dc12854
Added a first version of the direct version of GEMM with local memory
2016-09-25 11:38:35 +02:00
Cedric Nugteren
6aa652d6ea
Merge branch 'development' into gemm_direct
2016-09-21 21:32:18 +02:00
Cedric Nugteren
b1929d8ce7
It is now possible to set the OpenCL compiler options through an environmental variable
2016-09-21 21:22:16 +02:00
Cedric Nugteren
4ce584a014
Split the XGEMM kernel further up: now in 3 parts. This is done because MSVC can't handle long strings
2016-09-12 22:13:16 +02:00
Cedric Nugteren
aa3dffe356
Added XgemvFastRot and Xgemm 16-bit tuning results: just defaults which are now automatically taken from 32-bit if there are no entries at all
2016-09-12 20:13:38 +02:00
Cedric Nugteren
b5a67f86ec
Complete re-write of the database script. Replaced Pandas with the much faster and more convenient plain JSON/dict data-type
2016-09-11 21:29:28 +02:00
Cedric Nugteren
e21f32bc99
Updated database based on exhaustive tuning results for GEMM for the R9 M370X GPU
2016-09-10 14:00:43 +02:00
Cedric Nugteren
3daba70997
Updated the database script to remove duplicate entries: keeps only the best-performing cases for a specific parameters combination
2016-09-10 11:12:09 +02:00
Cedric Nugteren
55038d3c91
Split GEMM tuning in two parts: a small set of tuning parameters which is explored exhaustively and a larger set which is explored randomly
2016-09-06 20:30:06 +02:00
Cedric Nugteren
b30b26b89e
The GEMM kernel no longer adds beta*C in case beta is zero; this would cause problems if C contains NaNs
2016-09-04 17:21:16 +02:00
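The reason for the beta-is-zero change above is that `0 * NaN` is still `NaN` in IEEE arithmetic, so an uninitialised C would poison the result even when it is meant to be ignored. A scalar host-side sketch of the idea follows; the function name `UpdateElement` is hypothetical and this is not the OpenCL kernel code.

```cpp
#include <cmath>

// Computes one element of C = alpha * (A*B) + beta * C, where `ab` stands for
// the already accumulated dot product. When beta is zero the old value of C
// is not read at all, so NaNs stored in C cannot leak into the result.
float UpdateElement(float alpha, float ab, float beta, float c_old) {
  if (beta == 0.0f) {
    return alpha * ab;
  }
  return alpha * ab + beta * c_old;
}
```

Without the branch, `UpdateElement(1.0f, 2.0f, 0.0f, NAN)` would evaluate `2 + 0 * NaN` and return NaN.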
Cedric Nugteren
521bf6cdfc
Added tuning results for Intel Broadwell 5500 GT2 GPU
2016-09-03 16:43:23 +02:00
Cedric Nugteren
19574b2519
Updated tuning results for Haswell GT2 Mobile GPU; fixed database script to handle duplicate entries of different runs
2016-09-03 12:45:11 +02:00
Ivan Shapovalov
ea43936e94
test/correctness: read platform and device from environment
...
Support passing environment variables CLBLAST_PLATFORM and CLBLAST_DEVICE
instead of -platform and -device arguments to test executables.
This is for `ctest`.
2016-08-27 05:37:26 +03:00
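Reading a platform or device index from the environment, as the commit above describes, might look roughly like this. Only the variable names CLBLAST_PLATFORM and CLBLAST_DEVICE come from the commit message; the helper `IndexFromEnvironment` and the fall-back behaviour for unset or malformed values are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

// Returns the non-negative integer stored in the given environment variable,
// or `fallback` when the variable is unset, empty, or not a plain number.
std::size_t IndexFromEnvironment(const char* variable, std::size_t fallback) {
  const char* value = std::getenv(variable);
  if (value == nullptr || *value == '\0') {
    return fallback;
  }
  char* end = nullptr;
  const unsigned long parsed = std::strtoul(value, &end, 10);
  if (end == value || *end != '\0') {
    return fallback;  // trailing garbage or no digits at all
  }
  return static_cast<std::size_t>(parsed);
}
```

A test executable could then call `IndexFromEnvironment("CLBLAST_PLATFORM", 0)` to obtain its platform index; how the real tests combine this with the command-line arguments is not stated in the commit.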
Cedric Nugteren
8d6a6a5bbf
Merge branch 'database_defaults' into development
2016-08-22 19:31:36 +02:00
Cedric Nugteren
0c0f0ac7f9
Also changed the default-default for unknown device types to use the same method as for known device groups
2016-08-21 20:35:20 +02:00
Cedric Nugteren
84db8958d1
Increased the ratio of GEMM tuning results to explore; reduced the tuning search space to have a better chance to evaluate more likely parameter combinations
2016-08-21 20:28:02 +02:00
Cedric Nugteren
6eca53ee23
Merge branch 'master' of https://github.com/dvasschemacq/CLBlast into dvasschemacq-master
...
Conflicts:
src/kernels/level1/xaxpy.opencl
src/kernels/level2/xgemv.opencl
src/kernels/level2/xgemv_fast.opencl
src/kernels/level2/xger.opencl
src/kernels/level2/xher.opencl
src/kernels/level2/xher2.opencl
src/kernels/level3/xgemm_part2.opencl
2016-08-20 12:50:31 +02:00
D. Van Assche
57f1aa7685
Adapt opencl files for 1.1 OpenCL
...
In OpenCL 1.1 __kernel has to be before __attribute__, at least with
Vivante compiler.
2016-08-18 17:33:13 +02:00
Cedric Nugteren
7d5631b7e4
Updated the database script to calculate the relative best performance of tuning results common for a device/vendor type
2016-08-15 21:01:07 +02:00
Cedric Nugteren
5004a435ff
Fixed issues related to the recent changes in the Xgemm infrastructure
2016-07-26 20:59:59 +02:00
Cedric Nugteren
5053f6ebc6
Merge branch 'development' into gemm_direct
2016-07-26 20:53:31 +02:00
Cedric Nugteren
de1afe168d
Removed all old tuning results for the XgemvFastRot kernel; re-added for a couple of devices
2016-07-25 22:57:23 +02:00
Cedric Nugteren
2582f0290a
Moved the XgemvFast and XgemvFastRot tuning database into a separate file
2016-07-25 22:43:49 +02:00
Cedric Nugteren
0252df731a
Merge branch 'development' into gemv_performance
2016-07-24 17:06:27 +02:00
Cedric Nugteren
ffa35c623a
Minor improvements after merging in groundwork for custom tuning parameters and kernels
2016-07-24 17:00:21 +02:00
Cedric Nugteren
40a72259eb
Fixed a bug in the new XgemvFastRot kernel related to local memory size
2016-07-23 16:58:11 +02:00
Cedric Nugteren
7a4f963763
Further improvements to the XgemvFastRot kernel, properly enables coalescing now
2016-07-23 14:52:32 +02:00
Cedric Nugteren
75fe8235f7
Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance
2016-07-23 10:20:11 +02:00
Ivan Shapovalov
e4e1f05079
clblast::Database, clblast::Routine: implement "database overlays" provided by routine implementation
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
ae3299da30
clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
5502c5eec4
cl::Kernel: skip NULL entries in waitForEvents
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
2dd5ee3f75
clblast::RunKernel, cl::Kernel: take const vector as waitForEvents
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
1ae71614ac
xgemm: do not hardcode kernel requirements for internal matrix layout
...
Do not hardcode the knowledge about "A and C col-major, B row-major".
This allows for easier reuse of the DoGemm() routine with different
kernels.
2016-07-22 11:15:52 +03:00
Cedric Nugteren
798d32edad
Improved the GEMM direct kernel by adding register blocking. Still not fast though
2016-07-17 14:36:51 +02:00
Cedric Nugteren
eaa348735e
Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder
2016-07-16 15:18:28 +02:00
Cedric Nugteren
b33bec4a59
Fixed some more types and type conversions in the clpp11 interface to OpenCL
2016-07-16 11:13:23 +02:00
Cedric Nugteren
bee9b959f4
Merge pull request #80 from gcp/getdevinfo_fixes
...
Make sure the passed types are large enough.
2016-07-16 10:59:51 +02:00
Cedric Nugteren
066af4069b
Removed an unused variable from the copy-transpose-pad function
2016-07-16 10:56:37 +02:00
Gian-Carlo Pascutto
e0ba59c0ac
Make sure the passed types are large enough.
...
Make sure all out parameters that are passed to functions such
as clGetDeviceInfo are large enough to contain the replies.
2016-07-13 15:59:02 +02:00
Cedric Nugteren
c87e877bf2
Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel
2016-07-10 20:32:01 +02:00
Cedric Nugteren
57f09178d8
Added tuning results for AMD Oland and for Intel Graphics HD 530
2016-07-10 11:46:44 +02:00
Cedric Nugteren
39e9b1238f
Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
2016-07-10 11:24:36 +02:00
Cedric Nugteren
9caa7ca5b9
Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache
2016-07-08 20:57:58 +02:00
Cedric Nugteren
27854070b4
Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen
2016-07-06 21:50:12 +02:00
Cedric Nugteren
77325b8974
Added an option to the performance clients to do a warm-up run before timing
2016-07-06 21:25:55 +02:00
Cedric Nugteren
9683b50c55
Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp)
2016-07-03 20:30:47 +02:00
Gian-Carlo Pascutto
7424532859
Ensure clGetKernelWorkGroupInfo return value fits.
...
In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo
to get the "bytes" amount needed to store the result from
CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an
"auto result = size_t", which in 32-bit mode is 4 bytes, regardless
of the previous return value. The spec describes that it will actually
be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure
we are in fact passing a cl_ulong.
Also adjust all callers to take the changed type into account.
2016-07-02 21:14:36 +02:00
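The problem fixed in the commit above is a general hazard of "query the size, then fetch the value" APIs: the second call writes as many bytes as the first call reported, so the destination must be at least that large. The sketch below imitates the pattern without OpenCL. `FakeGetInfo` is a hypothetical stand-in for a query such as clGetKernelWorkGroupInfo, and `std::uint64_t` stands in for the 8-byte `cl_ulong`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

using FakeUlong = std::uint64_t;  // stand-in for cl_ulong (8 bytes)

// Hypothetical query following the two-call convention: it reports the reply
// size through `size_ret` and copies the reply into `value` if one is given.
int FakeGetInfo(std::size_t value_size, void* value, std::size_t* size_ret) {
  const FakeUlong local_mem_size = 4096;
  if (size_ret != nullptr) {
    *size_ret = sizeof(local_mem_size);
  }
  if (value != nullptr) {
    if (value_size < sizeof(local_mem_size)) {
      return -1;  // caller announced a buffer that is too small
    }
    std::memcpy(value, &local_mem_size, sizeof(local_mem_size));
  }
  return 0;
}

// Fetches the value into a destination of the documented 8-byte type and
// refuses to continue if the reported reply size would not fit.
FakeUlong LocalMemUsageSketch() {
  std::size_t bytes = 0;
  if (FakeGetInfo(0, nullptr, &bytes) != 0) {
    throw std::runtime_error("size query failed");
  }
  FakeUlong result = 0;
  if (bytes > sizeof(result)) {
    throw std::runtime_error("reply does not fit in destination");
  }
  if (FakeGetInfo(bytes, &result, nullptr) != 0) {
    throw std::runtime_error("value query failed");
  }
  return result;
}
```

Had `result` been declared as a 4-byte type while `bytes` was still passed as 8, the copy would have overrun the destination, which is the stack corruption the commit message describes for 32-bit builds.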
Cedric Nugteren
7cf2f8c268
Fixed some memory leaks related to events not properly cleaned-up
2016-07-02 15:34:55 +02:00
Cedric Nugteren
b330ab0866
Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library
2016-06-30 10:49:17 +02:00
Cedric Nugteren
cd74aaac52
Updated to version 6.0 of the CLCudaAPI header
2016-06-29 19:42:49 +02:00
CNugteren
871b576c06
Made it possible to build the clients and tests on Windows using Visual Studio
2016-06-28 16:38:45 +02:00
Cedric Nugteren
76b20cfe0c
Fixes for the AppVeyor Windows build
2016-06-27 14:44:08 +02:00
Cedric Nugteren
66908ef5cd
Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes)
2016-06-19 14:59:50 +02:00
Cedric Nugteren
61203453aa
Renamed all C++ source files to .cpp to match the .hpp extension better
2016-06-19 13:55:49 +02:00
Cedric Nugteren
f726fbdc9f
Moved all headers into the source tree, changed headers to .hpp extension
2016-06-18 20:20:13 +02:00
Cedric Nugteren
bacb5d2bb2
Clean-up of the routine class, moved RunKernel to the routine/common file
2016-06-18 18:16:14 +02:00
Cedric Nugteren
7b4c0e1cf0
Removed the template from the Routine base-class
2016-06-18 14:56:55 +02:00
Cedric Nugteren
f9947b4d7f
Removed the precision argument from the routines in favor of a single templated function
2016-06-17 14:30:37 +02:00
Cedric Nugteren
536b7fe4bc
Removed the interface to the cache functions from the Routine class, calls them directly now
2016-06-17 13:57:50 +02:00
Cedric Nugteren
98a95c89fc
Moved the RunKernel and PadCopyTransposeMatrix functions out of the Routine class
2016-06-17 12:32:06 +02:00
Cedric Nugteren
afe8852eaa
Moved the test-for-valid-buffers function from the Routine class to separate functions in a separate file
2016-06-17 11:29:07 +02:00
Cedric Nugteren
52ccaf5b25
Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing
2016-06-16 18:07:46 +02:00
Cedric Nugteren
39b7dbc5e3
Added some constness to variables related to the GEMM routines
2016-06-15 12:34:05 +02:00
Cedric Nugteren
b894611ad1
Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately
2016-06-14 18:17:58 +02:00
Cedric Nugteren
3e78a99355
Moved device vendor and type checks to a common header
2016-06-14 14:30:22 +02:00
Cedric Nugteren
6e2017c67d
Added support for FP16 on ARM Mali-T628 (officially not supported)
2016-06-14 14:29:53 +02:00
Cedric Nugteren
6925003e45
Added global memory synchronisation for better cache performance on ARM Mali GPUs
2016-06-08 10:13:37 +02:00
Cedric Nugteren
03182f9d07
Added half-precision tests for the clBLAS reference through conversion to single-precision
2016-05-26 23:36:19 +02:00
Cedric Nugteren
9f87455070
Added level-3 half-precision routines HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
2016-05-25 13:29:53 +02:00
Cedric Nugteren
ac1575056e
Added proper argument handling and displaying for half-precision data-types
2016-05-24 14:06:16 +02:00
Cedric Nugteren
3e9a07f00a
Added level-2 half-precision routines HGER/HSYR/HSPR/HSYR2/HSPR2
2016-05-22 16:59:14 +02:00
Cedric Nugteren
f0cb3fdc81
Fixed tuning results for half-precision; added first results for the xGER kernels
2016-05-22 16:46:05 +02:00
Cedric Nugteren
c8ff3f143f
Prepared the GER kernels and tuner for half-precision support
2016-05-22 16:18:08 +02:00
Cedric Nugteren
95b828da12
Added level-2 half-precision routines HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV
2016-05-22 15:38:26 +02:00
Cedric Nugteren
b6268d0c22
Added first tuning results for the half-precision xGEMV kernels
2016-05-22 15:29:05 +02:00
Cedric Nugteren
88551b4005
Prepared the GEMV kernels and tuner for half-precision support
2016-05-22 15:22:54 +02:00
Cedric Nugteren
803aaf3070
Added level-1 half-precision routines HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
2016-05-22 14:47:14 +02:00
Cedric Nugteren
3c9e63c054
Added first tuning results for the half-precision xDOT kernels
2016-05-22 14:43:25 +02:00
Cedric Nugteren
f70ded34f3
Added half-precision support for all level 1 routines
2016-05-22 14:26:19 +02:00