780 Commits (6e2ab6ee967c4a9b3350c7ce4e7d7b736c9e45f6)

Author SHA1 Message Date
Ivan Shapovalov ae3299da30 clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 5502c5eec4 cl::Kernel: skip NULL entries in waitForEvents 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 2dd5ee3f75 clblast::RunKernel, cl::Kernel: take const vector as waitForEvents 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 1ae71614ac xgemm: do not hardcode kernel requirements for internal matrix layout
Do not hardcode the knowledge about "A and C col-major, B row-major".

This allows for easier reuse of the DoGemm() routine with different
kernels.
2016-07-22 11:15:52 +03:00
Cedric Nugteren 798d32edad Improved the GEMM direct kernel by adding register blocking. Still not fast though 2016-07-17 14:36:51 +02:00
Cedric Nugteren eaa348735e Created infrastructure to support a direct GEMM kernel; added correct but slow reference kernel as a place-holder 2016-07-16 15:18:28 +02:00
Cedric Nugteren b33bec4a59 Fixed some more types and type conversions in the clpp11 interface to OpenCL 2016-07-16 11:13:23 +02:00
Cedric Nugteren bee9b959f4 Merge pull request #80 from gcp/getdevinfo_fixes
Make sure the passed types are large enough.
2016-07-16 10:59:51 +02:00
Cedric Nugteren 066af4069b Removed an unused variable from the copy-transpose-pad function 2016-07-16 10:56:37 +02:00
Gian-Carlo Pascutto e0ba59c0ac Make sure the passed types are large enough.
Make sure all out parameters that are passed to functions such
as clGetDeviceInfo are large enough to contain the replies.
2016-07-13 15:59:02 +02:00
Cedric Nugteren c87e877bf2 Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel 2016-07-10 20:32:01 +02:00
Cedric Nugteren 57f09178d8 Added tuning results for AMD Oland and for Intel Graphics HD 530 2016-07-10 11:46:44 +02:00
Cedric Nugteren 39e9b1238f Fixed a bug related to the cache and retrieval of programs based on the OpenCL context 2016-07-10 11:24:36 +02:00
Cedric Nugteren 9caa7ca5b9 Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache 2016-07-08 20:57:58 +02:00
Cedric Nugteren 27854070b4 Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen 2016-07-06 21:50:12 +02:00
Cedric Nugteren 77325b8974 Added an option to the performance clients to do a warm-up run before timing 2016-07-06 21:25:55 +02:00
Cedric Nugteren 9683b50c55 Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp) 2016-07-03 20:30:47 +02:00
Gian-Carlo Pascutto 7424532859 Ensure clGetKernelWorkGroupInfo return value fits.
In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo
to get the "bytes" amount needed to store the result from
CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an
"auto result = size_t", which in 32-bit mode is 4 bytes, regardless
of the previous return value. The spec describes that it will actually
be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure
we are in fact passing a cl_ulong.

Also adjust all callers to take the changed type into account.
2016-07-02 21:14:36 +02:00
Cedric Nugteren 7cf2f8c268 Fixed some memory leaks related to events not properly cleaned-up 2016-07-02 15:34:55 +02:00
Cedric Nugteren b330ab0866 Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library 2016-06-30 10:49:17 +02:00
Cedric Nugteren cd74aaac52 Updated to version 6.0 of the CLCudaAPI header 2016-06-29 19:42:49 +02:00
CNugteren 871b576c06 Made it possible to build the clients and tests on Windows using Visual Studio 2016-06-28 16:38:45 +02:00
Cedric Nugteren 76b20cfe0c Fixes for the AppVeyor Windows build 2016-06-27 14:44:08 +02:00
Cedric Nugteren 66908ef5cd Added tuning results for 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (thanks to OursDesCavernes) 2016-06-19 14:59:50 +02:00
Cedric Nugteren 61203453aa Renamed all C++ source files to .cpp to match the .hpp extension better 2016-06-19 13:55:49 +02:00
Cedric Nugteren f726fbdc9f Moved all headers into the source tree, changed headers to .hpp extension 2016-06-18 20:20:13 +02:00
Cedric Nugteren bacb5d2bb2 Clean-up of the routine class, moved RunKernel to the routine/common file 2016-06-18 18:16:14 +02:00
Cedric Nugteren 7b4c0e1cf0 Removed the template from the Routine base-class 2016-06-18 14:56:55 +02:00
Cedric Nugteren f9947b4d7f Removed the precision argument from the routines in favor of a single templated function 2016-06-17 14:30:37 +02:00
Cedric Nugteren 536b7fe4bc Removed the interface to the cache functions from the Routine class, calls them directly now 2016-06-17 13:57:50 +02:00
Cedric Nugteren 98a95c89fc Moved the RunKernel and PadCopyTransposeMatrix functions out of the Routine class 2016-06-17 12:32:06 +02:00
Cedric Nugteren afe8852eaa Moved the test-for-valid-buffers function from the Routine class to separate functions in a separate file 2016-06-17 11:29:07 +02:00
Cedric Nugteren 52ccaf5b25 Added XOMATCOPY routines to perform out-of-place matrix scaling, copying, and/or transposing 2016-06-16 18:07:46 +02:00
Cedric Nugteren 39b7dbc5e3 Added some constness to variables related to the GEMM routines 2016-06-15 12:34:05 +02:00
Cedric Nugteren b894611ad1 Re-organised the level-3 supporting kernels (copy, pad, transpose, convert) and renamed files and functions appropriately 2016-06-14 18:17:58 +02:00
Cedric Nugteren 3e78a99355 Moved device vendor and type checks to a common header 2016-06-14 14:30:22 +02:00
Cedric Nugteren 6e2017c67d Added support for FP16 on ARM Mali-T628 (officially not supported) 2016-06-14 14:29:53 +02:00
Cedric Nugteren 6925003e45 Added global memory synchronisation for better cache performance on ARM Mali GPUs 2016-06-08 10:13:37 +02:00
Cedric Nugteren 03182f9d07 Added half-precision tests for the clBLAS reference through conversion to single-precision 2016-05-26 23:36:19 +02:00
Cedric Nugteren 9f87455070 Added level-3 half-precision routines HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM 2016-05-25 13:29:53 +02:00
Cedric Nugteren ac1575056e Added proper argument handling and displaying for half-precision data-types 2016-05-24 14:06:16 +02:00
Cedric Nugteren 3e9a07f00a Added level-2 half-precision routines HGER/HSYR/HSPR/HSYR2/HSPR2 2016-05-22 16:59:14 +02:00
Cedric Nugteren f0cb3fdc81 Fixed tuning results for half-precision; added first results for the xGER kernels 2016-05-22 16:46:05 +02:00
Cedric Nugteren c8ff3f143f Prepared the GER kernels and tuner for half-precision support 2016-05-22 16:18:08 +02:00
Cedric Nugteren 95b828da12 Added level-2 half-precision routines HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV 2016-05-22 15:38:26 +02:00
Cedric Nugteren b6268d0c22 Added first tuning results for the half-precision xGEMV kernels 2016-05-22 15:29:05 +02:00
Cedric Nugteren 88551b4005 Prepared the GEMV kernels and tuner for half-precision support 2016-05-22 15:22:54 +02:00
Cedric Nugteren 803aaf3070 Added level-1 half-precision routines HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN 2016-05-22 14:47:14 +02:00
Cedric Nugteren 3c9e63c054 Added first tuning results for the half-precision xDOT kernels 2016-05-22 14:43:25 +02:00
Cedric Nugteren f70ded34f3 Added half-precision support for all level 1 routines 2016-05-22 14:26:19 +02:00
Cedric Nugteren 489c5d76cf Merged in latest changes from 0.7.1 release 2016-05-18 21:32:56 +02:00
Cedric Nugteren 7a3b695db7 Added half precision tuning results for supporting kernels (pad, copy, transpose, padtranspose) 2016-05-16 12:45:10 +02:00
Cedric Nugteren af2ac62212 Prepared GEMM and supporting kernels and tuners for half-precision support 2016-05-16 12:37:24 +02:00
Cedric Nugteren 4b6bdd83a2 Added header with conversions from and to half-precision floating-point 2016-05-15 20:13:57 +02:00
Cedric Nugteren 5e1b2e021f Set kernel arguments for AXPY as constant memory buffers, making it possible to transfer half-precision values as well 2016-05-14 18:06:00 +02:00
Cedric Nugteren 120c31a30f Initial experimental version of the half-precision HAXPY routine 2016-05-13 20:49:34 +02:00
Cedric Nugteren f2ba75890c Initial changes in preparation for half-precision fp16 support 2016-05-12 19:56:21 +02:00
cnugteren 25a25dbd6f Fixed errors in xAXPY and xSCAL tests on AMD hardware 2016-05-08 17:30:31 +02:00
Cedric Nugteren a8f109296c Fixed the calculation of the required buffer sizes in case of subvectors and submatrices 2016-05-02 20:04:55 +02:00
Cedric Nugteren b9317d7d0c Made the default xDOT tuning size smaller 2016-05-01 14:39:44 +02:00
Cedric Nugteren bee2f943ec Changed the index buffer of IxAMAX routines to unsigned int for proper buffersize checking 2016-05-01 14:03:37 +02:00
Cedric Nugteren 9602c150aa Added a program cache (per-context) next to the per-device binary cache 2016-05-01 12:56:08 +02:00
Cedric Nugteren e113ff0852 Added non-absolute minimum counter-part IxMIN of the BLAS routine IxAMAX 2016-04-30 09:49:39 +02:00
Cedric Nugteren 877aad693f Added FillCache: a function to pre-compile all kernels for a specific device 2016-04-29 23:33:12 +02:00
Cedric Nugteren d9b21d7f49 Fixed the cache to store binaries instead of OpenCL programs 2016-04-28 21:14:17 +02:00
Cedric Nugteren d7ddbdeb1f Added non-absolute counter-parts xSUM and IxMAX of the BLAS routines xASUM and IxAMAX 2016-04-27 18:07:30 +02:00
Cedric Nugteren 8075934ca7 Added prototypes for non-BLAS routines: xSUM and IxMAX (non-absolute counterparts of xASUM and IxAMAX) 2016-04-27 17:06:19 +02:00
Cedric Nugteren 82be8f211c Moved all cache-related functions to a separate file; added a ClearCompiledProgramCache function to clear the cache 2016-04-27 16:02:13 +02:00
cnugteren 16a048f1ac Added support for the iSAMAX/iDAMAX/iCAMAX/iZAMAX routines 2016-04-20 22:12:51 -06:00
cnugteren 894983fc3c Added prototype for ixAMAX routines 2016-04-20 21:11:33 -06:00
cnugteren 5a4f8217be Updated the reduction-kernel tuner to also tune the epilogue 2016-04-14 21:37:52 -06:00
cnugteren 8be99de82d Added support for the SASUM/DASUM/ScASUM/DzASUM routines 2016-04-14 19:58:26 -06:00
cnugteren e0497807e2 Added prototype for xASUM routines 2016-04-13 21:44:49 -06:00
cnugteren 1d3d38a261 Events are now properly implemented using event waiting list and asking the user to wait for event completion 2016-04-09 22:22:24 -06:00
cnugteren 90e237b97a Removed redundant queue synchronisation statements 2016-04-04 08:38:31 -07:00
cnugteren 5c83217cf2 Added a wrapper for CBLAS libraries for performance/correctness testing 2016-04-01 22:36:39 -07:00
cnugteren 8c3c6db7d0 Merge branch 'level1_routines' into development 2016-03-30 21:37:56 -07:00
cnugteren 5409f349a1 Fixed the nrm2 kernel for complex data-types 2016-03-30 21:32:04 -07:00
Cedric Nugteren c1df786764 Added prototypes for the xROTM and xROTMG routines 2016-03-30 16:13:37 -07:00
Cedric Nugteren 6ecc0d089c Added prototypes for the xROT and xROTG functions 2016-03-30 16:13:32 -07:00
Cedric Nugteren 2429ad5025 Fixed properly passing of OpenCL events to CLBlast functions 2016-03-30 16:12:53 -07:00
Cedric Nugteren aaa687ca98 Added preliminary support for the xNRM2 routines 2016-03-28 23:00:44 +02:00
Cedric Nugteren 1d5a702d9d Added prototypes for ScNRM2/DzNRM2 routines 2016-03-25 10:30:38 +01:00
Cedric Nugteren 3876096c30 Added prototypes for SNRM2/DNRM2 routines 2016-03-25 10:00:40 +01:00
Cedric Nugteren 49822c8ead Fixed the C-api export to be able to properly build a DLL on Windows 2016-03-23 20:49:28 +01:00
Cedric Nugteren d935695417 Added __declspec(dllexport) to create a DLL on Windows 2016-03-19 11:09:09 +01:00
Cedric Nugteren 918797735d Made the library thread-safe by guarding the kernel cache with a mutex 2016-03-14 22:55:22 +01:00
Cedric Nugteren f4c09220c1 Fixed a bug in the GER-family of routines due to incorrect division of the workgroup size 2016-03-06 16:43:28 +01:00
Cedric Nugteren 306bf67660 Added preliminary support for xHPR2 and xSPR2 routines 2016-03-06 15:48:11 +01:00
Cedric Nugteren 60da54da5d Added preliminary support for xHER2 and xSYR2 routines 2016-03-02 21:18:01 +01:00
Cedric Nugteren 4a56822dcc Fixed a couple of correctness bugs in the Xher kernels 2016-02-28 15:49:59 +01:00
Cedric Nugteren e3545215a5 Added support for xHER, xHPR, xSYR, and xSPR routines 2016-02-28 14:16:48 +01:00
Cedric Nugteren 9f682aa66b Set a proper default precision for the CLBlast clients 2016-02-20 14:41:53 +01:00
Cedric Nugteren 6dc44da07b Added support for xGERU and xGERC routines 2016-02-20 14:15:41 +01:00
Cedric Nugteren 8854a73127 Added XGER routine, kernel, and tuner 2016-02-20 12:40:01 +01:00
Cedric Nugteren bf84463ab2 Separated the GEMM kernel in two parts to reduce string length for MSVC 2016-02-08 20:06:02 +01:00
Cedric Nugteren 38c56bbde2 Split-up the XGEMV kernel in two parts 2016-02-08 19:43:34 +01:00
Cedric Nugteren 00be6f7530 Added dictionary with short and long OpenCL vendor names to fix issues with Intel having multiple names 2016-02-07 11:59:30 +01:00
CNugteren b7900652b2 Reduced the maximum workgroup-size for GEMV kernels further 2016-02-06 13:07:19 +01:00
CNugteren 40346bb3a5 Reduced unrolling factor in xgemv kernel to reduce compilation times 2016-02-06 12:09:21 +01:00
CNugteren 9622d3be22 Fixes for compilation under Visual Studio 2016-01-30 14:57:49 +01:00
Cedric Nugteren 276e772a2c Added first auto-generated database headers from the Python database; only K40 and Iris supported now 2016-01-30 11:43:21 +01:00
CNugteren c0d469718a Now sets local memory size in xgemv tuner properly 2015-10-28 21:19:59 +01:00
CNugteren 179ad0666d Fixed an arguments-related bug in the GEMV tuner 2015-10-25 16:48:26 +01:00
CNugteren a2d5d7770e Moved the tuner database script to a separate folder 2015-10-25 16:27:14 +01:00
CNugteren 0d4091fdfb Added guards for routine-specific level-3 pad kernels 2015-10-13 08:29:45 +02:00
CNugteren f74c9a5640 Routine names are now all default arguments defined in the header 2015-10-12 08:35:58 +02:00
CNugteren 54a8723f8c Moved level3 kernel files to a subfolder 2015-10-12 08:28:40 +02:00
CNugteren 2b56c2c603 Added TRMV/TBMV/TPMV routines 2015-09-26 16:58:03 +02:00
CNugteren de6547a92b Added SBMV and SPMV routines 2015-09-19 18:01:19 +02:00
CNugteren 80da67d28b Added the HPMV routine 2015-09-19 17:40:38 +02:00
CNugteren c32c4a9739 Added infrastructure for packed matrices 2015-09-19 17:37:42 +02:00
CNugteren aebd156869 Added the HBMV routine 2015-09-19 11:11:34 +02:00
CNugteren 93dddda63e Improved the organization and performance of level 2 routines 2015-09-18 17:46:41 +02:00
CNugteren 4507ba4997 Added first version of banded matrix-vector multiplication 2015-09-18 15:25:20 +02:00
CNugteren 6105ad6f5b Added interface of all level 2 routines 2015-09-17 17:05:45 +02:00
CNugteren 6307d2e5db Added script to generate API interface and implementation automatically 2015-09-17 10:14:33 +02:00
CNugteren a2e726d3bd Added xDOT/xDOTU/xDOTC dot-product routines 2015-09-14 16:57:00 +02:00
CNugteren 2a383f3450 Added extra temporary buffer to tuners in preparation of Xdot routines 2015-09-14 15:53:34 +02:00
CNugteren e0c5312abb Added support for the dot buffer and offset argument 2015-09-14 12:28:50 +02:00
CNugteren ff0c54c386 Added the XSWAP, XSCAL and XCOPY level-1 routines 2015-08-22 17:11:20 +02:00
CNugteren 75517353d5 Re-organized level1 xaxpy kernel 2015-08-22 14:33:48 +02:00
Cedric Nugteren cf168fca70 Merge pull request #23 from CNugteren/tuner_database
Added initial version of a tuner-database
2015-08-20 08:38:18 +02:00
CNugteren 15db2bcc20 Added initial version of tuner-database Python script 2015-08-20 08:30:51 +02:00
CNugteren b46de22433 Moved precision tester to utilities 2015-08-19 19:34:29 +02:00
CNugteren cbd25bffea Added hotfix 8eeb7f721f 2015-08-19 11:12:16 +02:00
Cedric Nugteren 4f6e42d052 Merge pull request #21 from CNugteren/c_api
Added a plain C API
2015-08-13 18:02:03 +02:00
CNugteren 603e389545 Added all supported routines to the C API 2015-08-13 17:58:46 +02:00
CNugteren 8eeb7f721f Fixed a complex data-type bug in the transpose kernel 2015-08-13 14:33:42 +02:00
CNugteren 8617195ac5 Added initial version of C API with just one routine 2015-08-13 13:46:13 +02:00
CNugteren dbdb58c600 Refactored the tuners, added JSON output 2015-08-09 15:50:41 +02:00
CNugteren 75b4d92ac3 Added distinguished names for GEMV inherited HEMV/SYMV 2015-08-04 08:15:39 +02:00
CNugteren d1a7cf18ec Abstracted loading of matrix A for GEMV kernel 2015-08-03 07:37:14 +02:00
CNugteren 938ca2707f Added HEMV routine 2015-07-31 17:35:42 +02:00
CNugteren b89517a2e7 Added SYMV routine 2015-07-31 17:13:41 +02:00
CNugteren f7199b831f Now using the new Claduc C++11 OpenCL header 2015-07-27 07:18:06 +02:00
CNugteren 4dcecfe934 Added workgroup shuffle option to transpose kernel for AMD GPUs 2015-07-22 07:31:16 +02:00
CNugteren d93efa3169 Transpose kernel now uses vectorized local memory loads and stores 2015-07-21 08:22:18 +02:00
CNugteren a0f0f6c8ce Triangular GEMM kernels are only compiled when needed 2015-07-19 16:36:12 +02:00
CNugteren 48e2e96f1b Kernel caching is now based on a routine's name 2015-07-19 16:24:14 +02:00
CNugteren 4e499a67c1 The kernel source string is now a routine's member variable 2015-07-19 13:44:37 +02:00
CNugteren 9300261bd4 Fixed a bug when using the Xgemm kernel without local memory 2015-07-16 22:49:55 +02:00
CNugteren 0157d6d4ea Using mad() instruction for AMD devices like clBLAS does 2015-07-16 22:42:02 +02:00
CNugteren b526623fc7 Skips pre/post processing kernels if not needed 2015-07-15 22:12:38 +02:00
CNugteren 0dc85845f7 Updated interface of the PadCopyTransposeMatrix method 2015-07-13 08:41:26 +02:00
CNugteren aa852bbe67 Added subfolders for the level1/2/3 routines 2015-07-12 16:57:09 +02:00
CNugteren b5d39d9d0c Added the HEMM routine, tester, and client 2015-07-12 15:11:50 +02:00
CNugteren 9a929f3fb2 Disabled prototype of TRSM 2015-07-10 21:08:18 +02:00
CNugteren b02876d6e9 Added the HER2K routine, tester, and client 2015-07-10 20:59:20 +02:00
CNugteren 919bba3eaf Added the HERK routine, tester, and client 2015-07-10 07:19:59 +02:00
CNugteren 5578d5ab28 Added option to set the imaginary part of the diagonal to zero 2015-07-08 07:25:18 +02:00
CNugteren 599f9a70a6 Added option to set the imaginary part of the diagonal to zero 2015-07-07 07:34:36 +02:00
CNugteren d9ea0c47c6 Added the TRMM routine, tester, and client 2015-07-02 07:16:04 +02:00
CNugteren d879eb3abf Added a set-to-one function for kernels 2015-07-02 07:11:27 +02:00
CNugteren e3dd35f91b Added the unit/non-unit diagonal enum 2015-07-01 09:39:41 +02:00
CNugteren b8d81a60d6 Fixed typos in SYMM 2015-07-01 09:38:04 +02:00
CNugteren 8574f72d46 Added the TRMM and TRSM interface 2015-06-30 07:36:11 +02:00
CNugteren 7c8d16147a Added the SYR2K routine, tester, and client 2015-06-26 08:12:56 +02:00
CNugteren 57c705dbf2 Clarified comment 2015-06-25 20:38:34 +02:00
CNugteren 60a88aac86 Added the SYRK routine, tester, and client 2015-06-24 07:50:18 +02:00
CNugteren 9fc38cdf5e Added a lower/upper triangular version of the GEMM kernel 2015-06-23 17:58:51 +02:00
CNugteren 20eb3506d6 Added a condition to update only lower/upper triangular parts in the un-pad kernels 2015-06-23 08:09:07 +02:00
CNugteren e3829c1067 Added prototypes of SYRK and SYR2K 2015-06-21 12:44:03 +02:00
CNugteren 3ea3ba2bee Distinguish between a short smoke test and a full test 2015-06-20 13:33:50 +02:00
CNugteren e26742c629 Added additional absolute error checking when testing 2015-06-20 10:58:21 +02:00
CNugteren 682c01a80c Now returns program from database by reference 2015-06-18 18:44:14 +02:00
CNugteren 7e176ccac9 Added support for conjugate transpose in GEMV 2015-06-16 08:42:52 +02:00
CNugteren af78a04eca Updated the tuners to set the conjugate argument 2015-06-16 07:50:45 +02:00
CNugteren e03582a112 Added support for CGEMM/ZGEMM and CSYMM/ZSYMM 2015-06-16 07:45:09 +02:00
CNugteren 8f01c644b5 Added support for complex conjugate transpose 2015-06-16 07:43:19 +02:00
CNugteren 01726197ab Fixed a bug in AXPBY defines for complex data-types 2015-06-15 08:38:24 +02:00
CNugteren 294a3e3d41 Split the three variations of the GEMV kernel for maximal tuning freedom 2015-06-14 11:15:53 +02:00
CNugteren ab0064dab7 Fixed number of threads launched for GEMV 2015-06-14 10:08:56 +02:00
CNugteren 9aa2989447 Fixed number of threads launched for AXPY 2015-06-14 10:08:23 +02:00
CNugteren 4b3e3dcfe0 Added a fast GEMV kernel with vector loads, no tail, and fewer if-statements 2015-06-13 20:46:01 +02:00
CNugteren 6662f5d8e9 Refactored the GEMV kernel 2015-06-13 17:07:31 +02:00
CNugteren 9b66883e9c Improved GEMV kernel with local memory and a tunable WPT 2015-06-13 14:10:07 +02:00
CNugteren e522d1a74e Added initial version of GEMV including tester and performance client 2015-06-13 11:01:20 +02:00
CNugteren 85c1db9322 Added initial naive version of Xgemv kernel 2015-06-10 08:44:30 +02:00
CNugteren bc5a341dfe Initial commit of preview version 2015-05-30 12:30:43 +02:00
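
Note on commit 7424532859: its message describes a two-step info query in which the second call wrote an 8-byte cl_ulong into a size_t variable that is only 4 bytes wide in 32-bit mode. The sketch below illustrates the pattern with a correctly sized result type. It does not call OpenCL and is not the code from the commit: MockGetInfo, mock_ulong, QueryLocalMemUsage, and the value 4096 are hypothetical stand-ins chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Plays the role of cl_ulong (assumption: an unsigned 64-bit integer).
using mock_ulong = std::uint64_t;

// Hypothetical stand-in for an OpenCL-style info query. It reports the size
// of its reply through size_ret and copies the reply into value, rejecting
// buffers that are declared smaller than the reply.
int MockGetInfo(std::size_t value_size, void* value, std::size_t* size_ret) {
  assert(value != nullptr || size_ret != nullptr);  // one output is required
  const mock_ulong local_mem_bytes = 4096;          // made-up reply
  if (size_ret != nullptr) { *size_ret = sizeof(local_mem_bytes); }
  if (value != nullptr) {
    if (value_size < sizeof(local_mem_bytes)) { return -1; }  // too small
    std::memcpy(value, &local_mem_bytes, sizeof(local_mem_bytes));
  }
  return 0;
}

// Two-step query: ask for the reply size first, then read the reply into a
// variable whose type is as wide as that reply.
bool QueryLocalMemUsage(mock_ulong& out) {
  std::size_t bytes = 0;
  if (MockGetInfo(0, nullptr, &bytes) != 0) { return false; }

  mock_ulong result = 0;
  // Refuse to continue if the reported size does not match the variable;
  // passing a larger size with a smaller variable would overwrite the stack.
  if (bytes != sizeof(result)) { return false; }
  if (MockGetInfo(bytes, &result, nullptr) != 0) { return false; }

  out = result;
  return true;
}
```

The size comparison before the second call is the point of the sketch: a 4-byte variable combined with the reported 8-byte size is the mismatch the commit message warns about.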