Ivan Shapovalov
ae3299da30
clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
5502c5eec4
cl::Kernel: skip NULL entries in waitForEvents
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
2dd5ee3f75
clblast::RunKernel, cl::Kernel: take const vector as waitForEvents
2016-07-22 11:15:52 +03:00
Ivan Shapovalov
1ae71614ac
xgemm: do not hardcode kernel requirements for internal matrix layout
Do not hardcode the knowledge about "A and C col-major, B row-major".
This allows for easier reuse of the DoGemm() routine with different
kernels.
2016-07-22 11:15:52 +03:00
Cedric Nugteren
798d32edad
Improved the GEMM direct kernel by adding register blocking. Still not fast though
2016-07-17 14:36:51 +02:00
Cedric Nugteren
eaa348735e
Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder
2016-07-16 15:18:28 +02:00
Cedric Nugteren
b33bec4a59
Fixed some more types and type conversions in the clpp11 interface to OpenCL
2016-07-16 11:13:23 +02:00
Cedric Nugteren
bee9b959f4
Merge pull request #80 from gcp/getdevinfo_fixes
Make sure the passed types are large enough.
2016-07-16 10:59:51 +02:00
Cedric Nugteren
066af4069b
Removed an unused variable from the copy-transpose-pad function
2016-07-16 10:56:37 +02:00
Gian-Carlo Pascutto
e0ba59c0ac
Make sure the passed types are large enough.
Make sure all out parameters that are passed to functions such
as clGetDeviceInfo are large enough to contain the replies.
2016-07-13 15:59:02 +02:00
Cedric Nugteren
c87e877bf2
Now passing alpha/beta to the kernel as arguments again, as before fp16 support; for fp16, the arguments are cast on the host and in the kernel
2016-07-10 20:32:01 +02:00
Cedric Nugteren
57f09178d8
Added tuning results for AMD Oland and for Intel Graphics HD 530
2016-07-10 11:46:44 +02:00
Cedric Nugteren
39e9b1238f
Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
2016-07-10 11:24:36 +02:00
Cedric Nugteren
9caa7ca5b9
Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache
2016-07-08 20:57:58 +02:00
Cedric Nugteren
27854070b4
Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen
2016-07-06 21:50:12 +02:00
Cedric Nugteren
77325b8974
Added an option to the performance clients to do a warm-up run before timing
2016-07-06 21:25:55 +02:00
Cedric Nugteren
9683b50c55
Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp)
2016-07-03 20:30:47 +02:00
Gian-Carlo Pascutto
7424532859
Ensure clGetKernelWorkGroupInfo return value fits.
In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo
to get the "bytes" amount needed to store the result from
CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an
"auto result = size_t", which in 32-bit mode is 4 bytes, regardless
of the previous return value. The spec describes that it will actually
be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure
we are in fact passing a cl_ulong.
Also adjust all callers to take the changed type into account.
2016-07-02 21:14:36 +02:00
Cedric Nugteren
7cf2f8c268
Fixed some memory leaks related to events not properly cleaned-up
2016-07-02 15:34:55 +02:00
Cedric Nugteren
b330ab0866
Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library
2016-06-30 10:49:17 +02:00
Cedric Nugteren
cd74aaac52
Updated to version 6.0 of the CLCudaAPI header
2016-06-29 19:42:49 +02:00
CNugteren
871b576c06
Made it possible to build the clients and tests on Windows using Visual Studio
2016-06-28 16:38:45 +02:00
Cedric Nugteren
76b20cfe0c
Fixes for the AppVeyor Windows build
2016-06-27 14:44:08 +02:00
Cedric Nugteren
66908ef5cd
Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes)
2016-06-19 14:59:50 +02:00
Cedric Nugteren
61203453aa
Renamed all C++ source files to .cpp to match the .hpp extension better
2016-06-19 13:55:49 +02:00
Cedric Nugteren
f726fbdc9f
Moved all headers into the source tree, changed headers to .hpp extension
2016-06-18 20:20:13 +02:00
Cedric Nugteren
bacb5d2bb2
Clean-up of the routine class, moved RunKernel to the routine/common file
2016-06-18 18:16:14 +02:00
Cedric Nugteren
7b4c0e1cf0
Removed the template from the Routine base-class
2016-06-18 14:56:55 +02:00
Cedric Nugteren
f9947b4d7f
Removed the precision argument from the routines in favor of a single templated function
2016-06-17 14:30:37 +02:00
Cedric Nugteren
536b7fe4bc
Removed the interface to the cache functions from the Routine class, calls them directly now
2016-06-17 13:57:50 +02:00
Cedric Nugteren
98a95c89fc
Moved the RunKernel and PadCopyTransposeMatrix functions out of the Routine class
2016-06-17 12:32:06 +02:00
Cedric Nugteren
afe8852eaa
Moved the test-for-valid-buffers function from the Routine class to separate functions in a separate file
2016-06-17 11:29:07 +02:00
Cedric Nugteren
52ccaf5b25
Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing
2016-06-16 18:07:46 +02:00
Cedric Nugteren
39b7dbc5e3
Added some constness to variables related to the GEMM routines
2016-06-15 12:34:05 +02:00
Cedric Nugteren
b894611ad1
Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately
2016-06-14 18:17:58 +02:00
Cedric Nugteren
3e78a99355
Moved device vendor and type checks to a common header
2016-06-14 14:30:22 +02:00
Cedric Nugteren
6e2017c67d
Added support for FP16 on ARM Mali-T628 (officially not supported)
2016-06-14 14:29:53 +02:00
Cedric Nugteren
6925003e45
Added global memory synchronisation for better cache performance on ARM Mali GPUs
2016-06-08 10:13:37 +02:00
Cedric Nugteren
03182f9d07
Added half-precision tests for the clBLAS reference through conversion to single-precision
2016-05-26 23:36:19 +02:00
Cedric Nugteren
9f87455070
Added level-3 half-precision routines HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
2016-05-25 13:29:53 +02:00
Cedric Nugteren
ac1575056e
Added proper argument handling and displaying for half-precision data-types
2016-05-24 14:06:16 +02:00
Cedric Nugteren
3e9a07f00a
Added level-2 half-precision routines HGER/HSYR/HSPR/HSYR2/HSPR2
2016-05-22 16:59:14 +02:00
Cedric Nugteren
f0cb3fdc81
Fixed tuning results for half-precision; added first results for the xGER kernels
2016-05-22 16:46:05 +02:00
Cedric Nugteren
c8ff3f143f
Prepared the GER kernels and tuner for half-precision support
2016-05-22 16:18:08 +02:00
Cedric Nugteren
95b828da12
Added level-2 half-precision routines HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV
2016-05-22 15:38:26 +02:00
Cedric Nugteren
b6268d0c22
Added first tuning results for the half-precision xGEMV kernels
2016-05-22 15:29:05 +02:00
Cedric Nugteren
88551b4005
Prepared the GEMV kernels and tuner for half-precision support
2016-05-22 15:22:54 +02:00
Cedric Nugteren
803aaf3070
Added level-1 half-precision routines HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
2016-05-22 14:47:14 +02:00
Cedric Nugteren
3c9e63c054
Added first tuning results for the half-precision xDOT kernels
2016-05-22 14:43:25 +02:00
Cedric Nugteren
f70ded34f3
Added half-precision support for all level 1 routines
2016-05-22 14:26:19 +02:00
Cedric Nugteren
489c5d76cf
Merged in latest changes from 0.7.1 release
2016-05-18 21:32:56 +02:00
Cedric Nugteren
7a3b695db7
Added half precision tuning results for supporting kernels (pad, copy, transpose, padtranspose)
2016-05-16 12:45:10 +02:00
Cedric Nugteren
af2ac62212
Prepared GEMM and supporting kernels and tuners for half-precision support
2016-05-16 12:37:24 +02:00
Cedric Nugteren
4b6bdd83a2
Added header with conversions from and to half-precision floating-point
2016-05-15 20:13:57 +02:00
Cedric Nugteren
5e1b2e021f
Set kernel arguments for AXPY as constant memory buffers, making it possible to transfer half-precision values as well
2016-05-14 18:06:00 +02:00
Cedric Nugteren
120c31a30f
Initial experimental version of the half-precision HAXPY routine
2016-05-13 20:49:34 +02:00
Cedric Nugteren
f2ba75890c
Initial changes in preparation for half-precision fp16 support
2016-05-12 19:56:21 +02:00
cnugteren
25a25dbd6f
Fixed errors in xAXPY and xSCAL tests on AMD hardware
2016-05-08 17:30:31 +02:00
Cedric Nugteren
a8f109296c
Fixed the calculation of the required buffer sizes in case of subvectors and submatrices
2016-05-02 20:04:55 +02:00
Cedric Nugteren
b9317d7d0c
Made the default xDOT tuning size smaller
2016-05-01 14:39:44 +02:00
Cedric Nugteren
bee2f943ec
Changed the index buffer of IxAMAX routines to unsigned int for proper buffersize checking
2016-05-01 14:03:37 +02:00
Cedric Nugteren
9602c150aa
Added a program cache (per-context) next to the per-device binary cache
2016-05-01 12:56:08 +02:00
Cedric Nugteren
e113ff0852
Added non-absolute minimum counter-part IxMIN of the BLAS routine IxAMAX
2016-04-30 09:49:39 +02:00
Cedric Nugteren
877aad693f
Added FillCache: a function to pre-compile all kernels for a specific device
2016-04-29 23:33:12 +02:00
Cedric Nugteren
d9b21d7f49
Fixed the cache to store binaries instead of OpenCL programs
2016-04-28 21:14:17 +02:00
Cedric Nugteren
d7ddbdeb1f
Added non-absolute counter-parts xSUM and IxMAX of the BLAS routines xASUM and IxAMAX
2016-04-27 18:07:30 +02:00
Cedric Nugteren
8075934ca7
Added prototypes for non-BLAS routines: xSUM and IxMAX (non-absolute counterparts of xASUM and IxAMAX)
2016-04-27 17:06:19 +02:00
Cedric Nugteren
82be8f211c
Moved all cache-related functions to a separate file; added a ClearCompiledProgramCache function to clear the cache
2016-04-27 16:02:13 +02:00
cnugteren
16a048f1ac
Added support for the iSAMAX/iDAMAX/iCAMAX/iZAMAX routines
2016-04-20 22:12:51 -06:00
cnugteren
894983fc3c
Added prototype for ixAMAX routines
2016-04-20 21:11:33 -06:00
cnugteren
5a4f8217be
Updated the reduction-kernel tuner to also tune the epilogue
2016-04-14 21:37:52 -06:00
cnugteren
8be99de82d
Added support for the SASUM/DASUM/ScASUM/DzASUM routines
2016-04-14 19:58:26 -06:00
cnugteren
e0497807e2
Added prototype for xASUM routines
2016-04-13 21:44:49 -06:00
cnugteren
1d3d38a261
Events are now properly implemented using an event waiting list and asking the user to wait for event completion
2016-04-09 22:22:24 -06:00
cnugteren
90e237b97a
Removed redundant queue synchronisation statements
2016-04-04 08:38:31 -07:00
cnugteren
5c83217cf2
Added a wrapper for CBLAS libraries for performance/correctness testing
2016-04-01 22:36:39 -07:00
cnugteren
8c3c6db7d0
Merge branch 'level1_routines' into development
2016-03-30 21:37:56 -07:00
cnugteren
5409f349a1
Fixed the nrm2 kernel for complex data-types
2016-03-30 21:32:04 -07:00
Cedric Nugteren
c1df786764
Added prototypes for the xROTM and xROTMG routines
2016-03-30 16:13:37 -07:00
Cedric Nugteren
6ecc0d089c
Added prototypes for the xROT and xROTG functions
2016-03-30 16:13:32 -07:00
Cedric Nugteren
2429ad5025
Fixed the passing of OpenCL events to CLBlast functions
2016-03-30 16:12:53 -07:00
Cedric Nugteren
aaa687ca98
Added preliminary support for the xNRM2 routines
2016-03-28 23:00:44 +02:00
Cedric Nugteren
1d5a702d9d
Added prototypes for ScNRM2/DzNRM2 routines
2016-03-25 10:30:38 +01:00
Cedric Nugteren
3876096c30
Added prototypes for SNRM2/DNRM2 routines
2016-03-25 10:00:40 +01:00
Cedric Nugteren
49822c8ead
Fixed the C-api export to be able to properly build a DLL on Windows
2016-03-23 20:49:28 +01:00
Cedric Nugteren
d935695417
Added __declspec(dllexport) to create a DLL on Windows
2016-03-19 11:09:09 +01:00
Cedric Nugteren
918797735d
Made the library thread-safe by guarding the kernel cache with a mutex
2016-03-14 22:55:22 +01:00
Cedric Nugteren
f4c09220c1
Fixed a bug in the GER-family of routines due to incorrect division of the workgroup size
2016-03-06 16:43:28 +01:00
Cedric Nugteren
306bf67660
Added preliminary support for xHPR2 and xSPR2 routines
2016-03-06 15:48:11 +01:00
Cedric Nugteren
60da54da5d
Added preliminary support for xHER2 and xSYR2 routines
2016-03-02 21:18:01 +01:00
Cedric Nugteren
4a56822dcc
Fixed a couple of correctness bugs in the Xher kernels
2016-02-28 15:49:59 +01:00
Cedric Nugteren
e3545215a5
Added support for xHER, xHPR, xSYR, and xSPR routines
2016-02-28 14:16:48 +01:00
Cedric Nugteren
9f682aa66b
Set a proper default precision for the CLBlast clients
2016-02-20 14:41:53 +01:00
Cedric Nugteren
6dc44da07b
Added support for xGERU and xGERC routines
2016-02-20 14:15:41 +01:00
Cedric Nugteren
8854a73127
Added XGER routine, kernel, and tuner
2016-02-20 12:40:01 +01:00
Cedric Nugteren
bf84463ab2
Separated the GEMM kernel in two parts to reduce string length for MSVC
2016-02-08 20:06:02 +01:00
Cedric Nugteren
38c56bbde2
Split-up the XGEMV kernel in two parts
2016-02-08 19:43:34 +01:00
Cedric Nugteren
00be6f7530
Added dictionary with short and long OpenCL vendor names to fix issues with Intel having multiple names
2016-02-07 11:59:30 +01:00
CNugteren
b7900652b2
Reduced the maximum workgroup-size for GEMV kernels further
2016-02-06 13:07:19 +01:00
CNugteren
40346bb3a5
Reduced unrolling factor in xgemv kernel to reduce compilation times
2016-02-06 12:09:21 +01:00
CNugteren
9622d3be22
Fixes for compilation under Visual Studio
2016-01-30 14:57:49 +01:00
Cedric Nugteren
276e772a2c
Added first auto-generated database headers from the Python database; only K40 and Iris supported now
2016-01-30 11:43:21 +01:00
CNugteren
c0d469718a
Now sets local memory size in xgemv tuner properly
2015-10-28 21:19:59 +01:00
CNugteren
179ad0666d
Fixed an arguments-related bug in the GEMV tuner
2015-10-25 16:48:26 +01:00
CNugteren
a2d5d7770e
Moved the tuner database script to a separate folder
2015-10-25 16:27:14 +01:00
CNugteren
0d4091fdfb
Added guards for routine-specific level-3 pad kernels
2015-10-13 08:29:45 +02:00
CNugteren
f74c9a5640
Routine names are now all default arguments defined in the header
2015-10-12 08:35:58 +02:00
CNugteren
54a8723f8c
Moved level3 kernel files to a subfolder
2015-10-12 08:28:40 +02:00
CNugteren
2b56c2c603
Added TRMV/TBMV/TPMV routines
2015-09-26 16:58:03 +02:00
CNugteren
de6547a92b
Added SBMV and SPMV routines
2015-09-19 18:01:19 +02:00
CNugteren
80da67d28b
Added the HPMV routine
2015-09-19 17:40:38 +02:00
CNugteren
c32c4a9739
Added infrastructure for packed matrices
2015-09-19 17:37:42 +02:00
CNugteren
aebd156869
Added the HBMV routine
2015-09-19 11:11:34 +02:00
CNugteren
93dddda63e
Improved the organization and performance of level 2 routines
2015-09-18 17:46:41 +02:00
CNugteren
4507ba4997
Added first version of banded matrix-vector multiplication
2015-09-18 15:25:20 +02:00
CNugteren
6105ad6f5b
Added interface of all level 2 routines
2015-09-17 17:05:45 +02:00
CNugteren
6307d2e5db
Added script to generate API interface and implementation automatically
2015-09-17 10:14:33 +02:00
CNugteren
a2e726d3bd
Added xDOT/xDOTU/xDOTC dot-product routines
2015-09-14 16:57:00 +02:00
CNugteren
2a383f3450
Added extra temporary buffer to tuners in preparation of Xdot routines
2015-09-14 15:53:34 +02:00
CNugteren
e0c5312abb
Added support for the dot buffer and offset argument
2015-09-14 12:28:50 +02:00
CNugteren
ff0c54c386
Added the XSWAP, XSCAL and XCOPY level-1 routines
2015-08-22 17:11:20 +02:00
CNugteren
75517353d5
Re-organized level1 xaxpy kernel
2015-08-22 14:33:48 +02:00
Cedric Nugteren
cf168fca70
Merge pull request #23 from CNugteren/tuner_database
Added initial version of a tuner-database
2015-08-20 08:38:18 +02:00
CNugteren
15db2bcc20
Added initial version of tuner-database Python script
2015-08-20 08:30:51 +02:00
CNugteren
b46de22433
Moved precision tester to utilities
2015-08-19 19:34:29 +02:00
CNugteren
cbd25bffea
Added hotfix 8eeb7f721f
2015-08-19 11:12:16 +02:00
Cedric Nugteren
4f6e42d052
Merge pull request #21 from CNugteren/c_api
Added a plain C API
2015-08-13 18:02:03 +02:00
CNugteren
603e389545
Added all supported routines to the C API
2015-08-13 17:58:46 +02:00
CNugteren
8eeb7f721f
Fixed a complex data-type bug in the transpose kernel
2015-08-13 14:33:42 +02:00
CNugteren
8617195ac5
Added initial version of C API with just one routine
2015-08-13 13:46:13 +02:00
CNugteren
dbdb58c600
Refactored the tuners, added JSON output
2015-08-09 15:50:41 +02:00
CNugteren
75b4d92ac3
Added distinguished names for GEMV inherited HEMV/SYMV
2015-08-04 08:15:39 +02:00
CNugteren
d1a7cf18ec
Abstracted loading of matrix A for GEMV kernel
2015-08-03 07:37:14 +02:00
CNugteren
938ca2707f
Added HEMV routine
2015-07-31 17:35:42 +02:00
CNugteren
b89517a2e7
Added SYMV routine
2015-07-31 17:13:41 +02:00
CNugteren
f7199b831f
Now using the new Claduc C++11 OpenCL header
2015-07-27 07:18:06 +02:00
CNugteren
4dcecfe934
Added workgroup shuffle option to transpose kernel for AMD GPUs
2015-07-22 07:31:16 +02:00
CNugteren
d93efa3169
Transpose kernel now uses vectorized local memory loads and stores
2015-07-21 08:22:18 +02:00
CNugteren
a0f0f6c8ce
Triangular GEMM kernels are only compiled when needed
2015-07-19 16:36:12 +02:00
CNugteren
48e2e96f1b
Kernel caching is now based on a routine's name
2015-07-19 16:24:14 +02:00
CNugteren
4e499a67c1
The kernel source string is now a routine's member variable
2015-07-19 13:44:37 +02:00
CNugteren
9300261bd4
Fixed a bug when using the Xgemm kernel without local memory
2015-07-16 22:49:55 +02:00
CNugteren
0157d6d4ea
Using mad() instruction for AMD devices like clBLAS does
2015-07-16 22:42:02 +02:00
CNugteren
b526623fc7
Skips pre/post processing kernels if not needed
2015-07-15 22:12:38 +02:00
CNugteren
0dc85845f7
Updated interface of the PadCopyTransposeMatrix method
2015-07-13 08:41:26 +02:00
CNugteren
aa852bbe67
Added subfolders for the level1/2/3 routines
2015-07-12 16:57:09 +02:00
CNugteren
b5d39d9d0c
Added the HEMM routine, tester, and client
2015-07-12 15:11:50 +02:00
CNugteren
9a929f3fb2
Disabled prototype of TRSM
2015-07-10 21:08:18 +02:00
CNugteren
b02876d6e9
Added the HER2K routine, tester, and client
2015-07-10 20:59:20 +02:00
CNugteren
919bba3eaf
Added the HERK routine, tester, and client
2015-07-10 07:19:59 +02:00
CNugteren
5578d5ab28
Added option to set the imaginary part of the diagonal to zero
2015-07-08 07:25:18 +02:00
CNugteren
599f9a70a6
Added option to set the imaginary part of the diagonal to zero
2015-07-07 07:34:36 +02:00
CNugteren
d9ea0c47c6
Added the TRMM routine, tester, and client
2015-07-02 07:16:04 +02:00
CNugteren
d879eb3abf
Added a set-to-one function for kernels
2015-07-02 07:11:27 +02:00
CNugteren
e3dd35f91b
Added the unit/non-unit diagonal enum
2015-07-01 09:39:41 +02:00
CNugteren
b8d81a60d6
Fixed typos in SYMM
2015-07-01 09:38:04 +02:00
CNugteren
8574f72d46
Added the TRMM and TRSM interface
2015-06-30 07:36:11 +02:00
CNugteren
7c8d16147a
Added the SYR2K routine, tester, and client
2015-06-26 08:12:56 +02:00
CNugteren
57c705dbf2
Clarified comment
2015-06-25 20:38:34 +02:00
CNugteren
60a88aac86
Added the SYRK routine, tester, and client
2015-06-24 07:50:18 +02:00
CNugteren
9fc38cdf5e
Added a lower/upper triangular version of the GEMM kernel
2015-06-23 17:58:51 +02:00
CNugteren
20eb3506d6
Added a condition to update only lower/upper triangular parts in the un-pad kernels
2015-06-23 08:09:07 +02:00
CNugteren
e3829c1067
Added prototypes of SYRK and SYR2K
2015-06-21 12:44:03 +02:00
CNugteren
3ea3ba2bee
Distinguish between a short smoke test and a full test
2015-06-20 13:33:50 +02:00
CNugteren
e26742c629
Added additional absolute error checking when testing
2015-06-20 10:58:21 +02:00
CNugteren
682c01a80c
Now returns program from database by reference
2015-06-18 18:44:14 +02:00
CNugteren
7e176ccac9
Added support for conjugate transpose in GEMV
2015-06-16 08:42:52 +02:00
CNugteren
af78a04eca
Updated the tuners to set the conjugate argument
2015-06-16 07:50:45 +02:00
CNugteren
e03582a112
Added support for CGEMM/ZGEMM and CSYMM/ZSYMM
2015-06-16 07:45:09 +02:00
CNugteren
8f01c644b5
Added support for complex conjugate transpose
2015-06-16 07:43:19 +02:00
CNugteren
01726197ab
Fixed a bug in AXPBY defines for complex data-types
2015-06-15 08:38:24 +02:00
CNugteren
294a3e3d41
Split the three variations of the GEMV kernel for maximal tuning freedom
2015-06-14 11:15:53 +02:00
CNugteren
ab0064dab7
Fixed number of threads launched for GEMV
2015-06-14 10:08:56 +02:00
CNugteren
9aa2989447
Fixed number of threads launched for AXPY
2015-06-14 10:08:23 +02:00
CNugteren
4b3e3dcfe0
Added a fast GEMV kernel with vector loads, no tail, and fewer if-statements
2015-06-13 20:46:01 +02:00
CNugteren
6662f5d8e9
Refactored the GEMV kernel
2015-06-13 17:07:31 +02:00
CNugteren
9b66883e9c
Improved GEMV kernel with local memory and a tunable WPT
2015-06-13 14:10:07 +02:00
CNugteren
e522d1a74e
Added initial version of GEMV including tester and performance client
2015-06-13 11:01:20 +02:00
CNugteren
85c1db9322
Added initial naive version of Xgemv kernel
2015-06-10 08:44:30 +02:00
CNugteren
bc5a341dfe
Initial commit of preview version
2015-05-30 12:30:43 +02:00