427 commits

Author SHA1 Message Date
Cedric Nugteren 0c0f0ac7f9 Also changed the default-default for unknown device types to use the same method as for known device groups 2016-08-21 20:35:20 +02:00
Cedric Nugteren 84db8958d1 Increased the ratio of GEMM tuning results to explore; reduced the tuning search space to have a better chance to evaluate more likely parameter combinations 2016-08-21 20:28:02 +02:00
Cedric Nugteren 00979faab4 Updated the changelog; refactored the database-get-bests code a bit 2016-08-21 20:16:06 +02:00
Cedric Nugteren 7d5631b7e4 Updated the database script to calculate the relative best performance of tuning results common for a device/vendor type 2016-08-15 21:01:07 +02:00
Cedric Nugteren 7da6492b36 Improved the speed of the new common-best defaults method for the database generation 2016-08-09 21:06:04 +02:00
Cedric Nugteren 3f5401d4c8 Added a first version of the database's common-best default calculation 2016-08-07 16:25:38 +02:00
Cedric Nugteren 35623cd98d Minor update regarding the previous CMake export/install target changes 2016-07-28 20:45:09 +02:00
Cedric Nugteren c3712f5b36 Merge pull request #86 from intelfx/cmake
CMakeLists.txt: provide a find_package() config for dependent projects
2016-07-28 20:17:13 +02:00
Ivan Shapovalov 227374deba .appveyor.yml: move {OPENCL,CLBLAST}_ROOT out of source tree
Reasoning is the same as in previous commit: CMake does not like having
OpenCL header path inside of the source tree. CLBLAST_ROOT is moved for
uniformity.
2016-07-28 19:09:30 +03:00
Ivan Shapovalov 6c11fdc12c .travis.yml: use OpenCL ICD Loader and headers shipped by distro
Using our own headers causes problems with CMake which does not like having
OpenCL header path inside of the source tree. While at it, use distro's
universal OpenCL loader as well.
2016-07-28 19:09:29 +03:00
Ivan Shapovalov b5d7b58393 CMakeLists.txt: use target_include_directories() 2016-07-28 19:09:29 +03:00
Ivan Shapovalov 570cbcffa7 CMakeLists.txt: provide a find_package() config for dependent projects 2016-07-28 19:09:29 +03:00
Cedric Nugteren 1ec21421d7 Merge branch 'gemv_performance' into development 2016-07-26 20:02:14 +02:00
Cedric Nugteren de1afe168d Removed all old tuning results for the XgemvFastRot kernel; re-added for a couple of devices 2016-07-25 22:57:23 +02:00
Cedric Nugteren 2582f0290a Moved the XgemvFast and XgemvFastRot tuning database into a separate file 2016-07-25 22:43:49 +02:00
Cedric Nugteren 0252df731a Merge branch 'development' into gemv_performance 2016-07-24 17:06:27 +02:00
Cedric Nugteren ffa35c623a Minor improvements after merging in groundwork for custom tuning parameters and kernels 2016-07-24 17:00:21 +02:00
Cedric Nugteren d4ffa6395e Merge pull request #84 from intelfx/device-specific-kernels
Groundwork for device-specific routines
2016-07-24 16:48:20 +02:00
Cedric Nugteren 622682ffe3 Refactored the Python database script: separated functionality into modules, now complies with the PEP8 style, added proper command-line argument parsing, and cleaned up 2016-07-24 16:41:01 +02:00
Cedric Nugteren 40a72259eb Fixed a bug in the new XgemvFastRot kernel related to local memory size 2016-07-23 16:58:11 +02:00
Cedric Nugteren 7a4f963763 Further improvements to the XgemvFastRot kernel, properly enables coalescing now 2016-07-23 14:52:32 +02:00
Cedric Nugteren 75fe8235f7 Improved the XgemvFastRot kernel by tiled loading of the input matrix A, enabling better memory performance 2016-07-23 10:20:11 +02:00
Ivan Shapovalov e4e1f05079 clblast::Database, clblast::Routine: implement "database overlays" provided by routine implementation 2016-07-22 11:15:52 +03:00
Ivan Shapovalov ae3299da30 clblast::RunKernel, cl::Kernel: unify variants with/without waitForEvents, support empty LWS 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 5502c5eec4 cl::Kernel: skip NULL entries in waitForEvents 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 2dd5ee3f75 clblast::RunKernel, cl::Kernel: take const vector as waitForEvents 2016-07-22 11:15:52 +03:00
Ivan Shapovalov 1ae71614ac xgemm: do not hardcode kernel requirements for internal matrix layout
Do not hardcode the knowledge about "A and C col-major, B row-major".

This allows for easier reuse of the DoGemm() routine with different
kernels.
2016-07-22 11:15:52 +03:00
Ivan Shapovalov a1d80e7402 CMakeLists.txt: use ${clblast_SOURCE_DIR} instead of ${CMAKE_SOURCE_DIR} 2016-07-22 11:15:52 +03:00
Cedric Nugteren b33bec4a59 Fixed some more types and type conversions in the clpp11 interface to OpenCL 2016-07-16 11:13:23 +02:00
Cedric Nugteren bee9b959f4 Merge pull request #80 from gcp/getdevinfo_fixes
Make sure the passed types are large enough.
2016-07-16 10:59:51 +02:00
Cedric Nugteren 066af4069b Removed an unused variable from the copy-transpose-pad function 2016-07-16 10:56:37 +02:00
Gian-Carlo Pascutto e0ba59c0ac Make sure the passed types are large enough.
Make sure all out parameters that are passed to functions such
as clGetDeviceInfo are large enough to contain the replies.
2016-07-13 15:59:02 +02:00
Cedric Nugteren c87e877bf2 Now passing alpha/beta to the kernel as arguments as before fp16 support; in case of fp16 arguments are cast on host and in kernel 2016-07-10 20:32:01 +02:00
Cedric Nugteren 57f09178d8 Added tuning results for AMD Oland and for Intel Graphics HD 530 2016-07-10 11:46:44 +02:00
Cedric Nugteren 39e9b1238f Fixed a bug related to the cache and retrieval of programs based on the OpenCL context 2016-07-10 11:24:36 +02:00
Cedric Nugteren 9caa7ca5b9 Cache now compares cl_context instead of a pointer to a context; added verbose print statements to the cache 2016-07-08 20:57:58 +02:00
Cedric Nugteren 27854070b4 Added a VERBOSE mode to debug performance: now prints details about compilation and kernel execution to screen 2016-07-06 21:50:12 +02:00
Cedric Nugteren 77325b8974 Added an option to the performance clients to do a warm-up run before timing 2016-07-06 21:25:55 +02:00
CNugteren 2d665099ef Fixed a linking issue with the tuners on Visual Studio 2016-07-04 19:46:14 +02:00
Cedric Nugteren 9683b50c55 Added tuning results for GTX670, GTX750, and GTX1070 (thanks to gcp) 2016-07-03 20:30:47 +02:00
Cedric Nugteren 4105a79598 Merge pull request #76 from gcp/fix_local_mem_size
Fixes clGetKernelWorkGroupInfo to work well with both 32-bit and 64-bit systems
2016-07-03 16:34:44 +02:00
Gian-Carlo Pascutto 7424532859 Ensure clGetKernelWorkGroupInfo return value fits.
In LocalMemUsage(), there's a first call to clGetKernelWorkGroupInfo
to get the "bytes" amount needed to store the result from
CL_KERNEL_LOCAL_MEM_SIZE. However, the actual value passed is an
"auto result = size_t", which in 32-bit mode is 4 bytes, regardless
of the previous return value. The spec describes that it will actually
be a cl_ulong which is 8 bytes. To prevent stack corruption, make sure
we are in fact passing a cl_ulong.

Also adjust all callers to take the changed type into account.
2016-07-02 21:14:36 +02:00
Cedric Nugteren 5a690f4e36 Prints the current pandas version and reports the minimum required version 2016-07-02 16:44:13 +02:00
Cedric Nugteren 7cf2f8c268 Fixed some memory leaks related to events not properly cleaned-up 2016-07-02 15:34:55 +02:00
Cedric Nugteren b330ab0866 Added declspec(dllexport) to ClearCache and FillCache, and added declspec(dllimport) when not building the library 2016-06-30 10:49:17 +02:00
Cedric Nugteren cd74aaac52 Updated to version 6.0 of the CLCudaAPI header 2016-06-29 19:42:49 +02:00
Cedric Nugteren 56483347e8 Prepared the changelog for the next release 2016-06-28 22:33:13 +02:00
Cedric Nugteren 577f0ee117 Updated to version 0.8.0 2016-06-28 21:32:00 +02:00
Cedric Nugteren 33dddd3ff1 Changed the AppVeyor buildscript to use nmake instead of 'cmake --build' (2) 2016-06-28 20:56:49 +02:00
Cedric Nugteren a003cc2f2c Changed the AppVeyor buildscript to use nmake instead of 'cmake --build' 2016-06-28 20:48:23 +02:00