Development version (next version)
-

Version 1.6.0
- Modifications to improve performance on Qualcomm Adreno GPUs:
  * Unique database entries for specific Adreno devices
  * Toggle OpenCL kernel compilation options for Adreno
  * New preprocessor directive RELAX_WORKGROUP_SIZE
- Fixed a bug in handling of #undef in CLBlast loop unrolling and array-to-register mapping functions
- Fixed a bug in XAMAX/XAMIN routines related to inadvertently including the increment and offset in the result
- Fixed a bug in XAMAX/XAMIN routines that would cause only the real part of a complex number to be taken into account
- Fixed a bug that caused tests to not properly do integer-output testing (for XAMAX/XAMIN)
- Fixed a minor issue with the expected input buffer size in the TRMV/TBMV/TPMV/TRSV routines
- Fixed an issue with crashes on Android related to calling clReleaseProgram
- Fixed two small issues in the plotting script
- Fixed a documentation bug in the 'ld' requirements
- Enabled GitHub Actions CI builds for testing and releasing
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.5.3
- Fixed a correctness issue with DGEMM on SM 7.5 Turing GPUs
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
- Updated cl.hpp to the new opencl.hpp header in the samples
- Changed the complex sum routine to return the complex sum instead of the absolute complex sum

Version 1.5.2
- Changed XAMAX/XAMIN to more likely return the first rather than the last min/max index; updated the API docs
- Added batched routines to pyclblast
- Added CLBLAST_VERSION_MAJOR/MINOR/PATCH defines in headers to store version numbering
- Several small improvements to the benchmark script (thanks to 'baryluk')
- Fixed a bug in the caching when using a context with multiple devices
- Fixed a bug in the tuners related to the global workgroup size not being a multiple of the local workgroup size
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.5.1
- Implemented single-kernel version of convolution as GEMM
- Now catches all exceptions thrown by the tuners
- Fixed a bug in the ISAMIN kernel
- Fixed an out-of-bounds read/write in the XHAD routine (thanks to 'etomzak')
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.5.0
- Added support for shuffle instructions for NVIDIA GPUs (thanks to 'tyler-utah')
- Added an option to compile the Netlib API with static OpenCL device and context (-DNETLIB_PERSISTENT_OPENCL=ON)
- Added a FAQ page to the documentation
- The tuners now check beforehand for invalid local thread sizes and skip those completely
- Made the tuning API (OverrideParameters) more flexible, disregarding superfluous parameters
- Fixed an issue with conjugate transpose not being executed in certain cases for, among others, XOMATCOPY
- Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
- Fixed an issue with the preprocessor and the new GEMMK == 1 kernel
- Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
- Fixed an issue for certain parameters for AXPY's 'XaxpyFaster' kernel
- Various minor fixes and enhancements
- Added non-BLAS routines:
  * SCONVGEMM/DCONVGEMM/HCONVGEMM (convolution as im2col followed by batched GEMM)
  * SCOL2IM/DCOL2IM/CCOL2IM/ZCOL2IM/HCOL2IM (col2im transform as used in machine learning)

Version 1.4.1
- Fixed an access violation under Windows upon releasing the OpenCL program when the driver is already unloaded
- Fixed an issue with double cl_program release in the CLBlast caching system
- Added tuned parameters for various devices (see doc/tuning.md)

Version 1.4.0
- Added Python interface to CLBlast 'PyCLBlast'
- Added CLBlast to Ubuntu PPA and macOS Homebrew package managers
- Added an API to run the tuners programmatically without any I/O
- Improved the performance potential by adding a second tunable GEMM kernel with 2D register tiling
- Added support for Intel-specific subgroup shuffling extensions for faster GEMM on Intel GPUs
- Re-added a local memory size constraint to the tuners
- The routine tuners now automatically pick up tuning results from disk from the kernel tuners
- Updated and reorganised the CLBlast documentation
- Added a 'canary' region to check for overflows in the tuner and tests (inspired by clARMOR)
- Added an option to test against and compare performance with Intel's MKL
- Fixed an access violation when compiled with Visual Studio upon releasing the OpenCL program
- Fixed incorrect releasing of the OpenCL program resulting in segfaults / access violations
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
- Added non-BLAS level-1 routines:
  * SHAD/DHAD/CHAD/ZHAD/HHAD (Hadamard element-wise vector-vector product)

Version 1.3.0
- Re-designed and integrated the auto-tuner, no more dependency on CLTune
- Made it possible to override the tuning parameters in the clients straight from JSON tuning files
- Added OpenCL pre-processor to unroll loops and perform array-to-register promotions for compilers
  which don't do this themselves (ARM Mali) - greatly improves performance on these platforms
- Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
- Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
- Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
- Improved compilation time by splitting the tuning database into multiple compilation units
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added the RetrieveParameters function to the API to be able to inspect the tuning parameters
- Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared
  to the existing xGEMMBATCHED routines:
  * SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED

Version 1.2.0
- Fixed a bug in the TRSM/TRSV routines due to missing synchronisations after GEMM/GEMV calls
- Fixed a bug in TRSM when using the a-offset argument
- Added a CUDA API to CLBlast:
  * The library and kernels can be compiled with the CUDA driver API and NVRTC (requires CUDA 7.5)
  * Two CUDA API sample programs are added: SGEMM and DAXPY
  * All correctness tests and performance clients work on CUDA like they did for OpenCL
- Kernels are now cached based on their tuning parameters: fits the use-case of 'OverrideParameters'
- Cross-compiling for Android is now supported using CMake; instructions are added to the README
- Improved performance for small GEMM problems by going from 3 to 1 optional temporary buffers
- GEMM kernel selection (direct vs indirect) is now done automatically using a new tuner
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 1.1.0
- The tuning database now has defaults per architecture (e.g. NVIDIA Kepler SM3.5, AMD Fiji)
- The tuning database now has a dictionary to translate vendor/device names to a common set
- The tuners can now distinguish between different AMD GPU board names of the same architecture
- The tuners can now use particle-swarm optimisation to search more efficiently (thanks to 'mcian')
- Improved performance for small problems on NVIDIA hardware by caching the device name
- Further improved compilation time of database.cpp
- Added a small diagnostics helper executable
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added non-BLAS routines:
  * SIM2COL/DIM2COL/CIM2COL/ZIM2COL/HIM2COL (im2col transform as used to express convolution as GEMM)

Version 1.0.1
- Fixed a bug in the direct version of the GEMM kernel

Version 1.0.0
- Fixed a bug in the TRSM routine for alpha != 1
- Fixed a bug in the cache related to multi-device contexts (thanks to 'kpot')
- Fixed a bug in the direct version of the GEMM kernel
- Fixed several warnings for MSVC and Clang
- Added support for Mesa Clover and AMD's ROCm by making the inline keyword optional in kernels
- Performance reports are now external at https://cnugteren.github.io/clblast
- Greatly improved compilation time of database.cpp
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added non-BLAS level-1 routines:
  * iSAMIN/iDAMIN/iCAMIN/iZAMIN (absolute minimum version of the ixAMAX BLAS routines)

Version 0.11.0
- Improved the internal program source and binary caches for scalability and speed (thanks to 'intelfx')
- Fixed a bug that required re-creating the binary even if it was in the cache
- Fixed a bug when using offsets in the direct version of the GEMM kernels
- Fixed a missing cl_khr_fp64 when running double-precision on Intel CPUs
- Fixed tests on Apple's CPU OpenCL implementation; still not fast but correct at least
- Fixed bugs in the half-precision routines HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM
- Tests now also exit with an error code when OpenCL errors or compilation errors occur
- Tests now also check for the L2 error in case of half-precision
- Clients can now test against cuBLAS on NVIDIA systems for performance comparisons (-DCUBLAS=ON)
- Replaced the R graph scripts with Python/Matplotlib scripts
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added the OverrideParameters function to the API to be able to supply custom tuning parameters
- Added triangular solver (level-2 & level-3) routines:
  * STRSV/DTRSV/CTRSV/ZTRSV (experimental, un-optimized)
  * STRSM/DTRSM/CTRSM/ZTRSM (experimental, un-optimized)
- Added batched (not part of the BLAS standard) routines:
  * SAXPYBATCHED/DAXPYBATCHED/CAXPYBATCHED/ZAXPYBATCHED/HAXPYBATCHED (batched version of AXPY)
  * SGEMMBATCHED/DGEMMBATCHED/CGEMMBATCHED/ZGEMMBATCHED/HGEMMBATCHED (batched version of GEMM)

Version 0.10.0
- Updated to version 8.0 of the CLCudaAPI C++11 OpenCL header
- Changed the enums in the C API to avoid potential name clashes with external code
- Added a Netlib CBLAS compatible API (not recommended for full control over performance)
- Greatly improved the way exceptions are handled in the library (thanks to 'intelfx')
- Improved performance of GEMM kernels for small sizes by using a direct single-kernel implementation
- Fixed a bug in the tests and samples related to waiting for an invalid event
- Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
- Fixed a bug in the TRMM routine that would overwrite input data before consuming everything
- Added support for compilation under Visual Studio 2013 (MSVC++ 12.0)
- Added an option to set OpenCL compiler options through the environment variable CLBLAST_BUILD_OPTIONS
- Added an option to run tuned kernels multiple times to average execution times
- Added an option to build a static version of the library
- Made it possible to use the command-line environment variables everywhere and without re-running CMake
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.9.0
- Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header
- Significantly improved performance of rotated GEMV computations
- Improved performance on unseen/un-tuned devices through better default tuning parameter selection
- Fixed MSVC dllimport and dllexport declarations
- Fixed memory leaks related to events not being released
- Fixed a bug with a size_t and cl_ulong mismatch on 32-bit systems
- Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
- Fixed a performance issue (caused by fp16 support) by optimizing alpha/beta parameter passing to kernels
- Fixed a bug in the OpenCL kernels: now placing __kernel before __attribute__
- Fixed a bug in level-3 routines when beta is zero and matrix C contains NaNs
- Added an option (-warm_up) to do a warm-up run before timing in the performance clients
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.8.0
- Added support for half-precision floating-point (fp16) in the library
- Made it possible to compile the performance tests (clients) separately from the correctness tests
- Made a reference BLAS and head-to-head performance comparison optional in the clients
- Increased the verbosity of the "-verbose" option in the correctness tests
- Refactored the host code for better compilation times and fewer lines of code
- Added Appveyor continuous integration and increased coverage of the Travis builds
- Improved the API documentation
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added half-precision routines:
  * Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
  * Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
  * Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
- Added non-BLAS routines:
  * SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)

Version 0.7.1
- Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
- Fixed a bug in the xGEMM routine related to the event incorrectly set
- Made MSVC link the run-time libraries statically

Version 0.7.0
- Added exports to be able to create a DLL on Windows (thanks to Marco Hutter)
- Made the library thread-safe
- Performance and correctness tests can now (on top of clBLAS) be performed against CPU BLAS libraries
- Fixed the use of events within the library
- Changed the enum parameters to match the raw values of the CBLAS standard
- Fixed the cache of previously compiled binaries and added a function to fill or clear it
- Various minor fixes and enhancements
- Added a preliminary version of the API documentation
- Added additional sample programs
- Added tuned parameters for various devices (see README)
- Added level-1 routines:
  * SNRM2/DNRM2/ScNRM2/DzNRM2
  * SASUM/DASUM/ScASUM/DzASUM
  * SSUM/DSUM/ScSUM/DzSUM (non-absolute version of the above xASUM BLAS routines)
  * iSAMAX/iDAMAX/iCAMAX/iZAMAX
  * iSMAX/iDMAX/iCMAX/iZMAX (non-absolute version of the above ixAMAX BLAS routines)
  * iSMIN/iDMIN/iCMIN/iZMIN (non-absolute minimum version of the above ixAMAX BLAS routines)

Version 0.6.0
- Added support for MSVC (Visual Studio) 2015
- Added tuned parameters for various devices (see README)
- Now automatically generates C++ code from JSON tuning results
- Added level-2 routines:
  * SGER/DGER
  * CGERU/ZGERU
  * CGERC/ZGERC
  * CHER/ZHER
  * CHPR/ZHPR
  * CHER2/ZHER2
  * CHPR2/ZHPR2
  * CSYR/ZSYR
  * CSPR/ZSPR
  * CSYR2/ZSYR2
  * CSPR2/ZSPR2

Version 0.5.0
- Improved structure and performance of level-2 routines (xSYMV/xHEMV)
- Reduced compilation time of level-3 OpenCL kernels
- Added level-1 routines:
  * SSWAP/DSWAP/CSWAP/ZSWAP
  * SSCAL/DSCAL/CSCAL/ZSCAL
  * SCOPY/DCOPY/CCOPY/ZCOPY
  * SDOT/DDOT
  * CDOTU/ZDOTU
  * CDOTC/ZDOTC
- Added level-2 routines:
  * SGBMV/DGBMV/CGBMV/ZGBMV
  * CHBMV/ZHBMV
  * CHPMV/ZHPMV
  * SSBMV/DSBMV
  * SSPMV/DSPMV
  * STRMV/DTRMV/CTRMV/ZTRMV
  * STBMV/DTBMV/CTBMV/ZTBMV
  * STPMV/DTPMV/CTPMV/ZTPMV

Version 0.4.0
- Now using the Claduc C++11 interface to OpenCL
- Added plain C API for increased compatibility (clblast_c.h)
- Re-organized tuner infrastructure and added JSON output
- Removed the clBLAS sources; clBLAS should now be installed separately for testing
- Added Travis continuous integration
- Added level-2 routines:
  * CHEMV/ZHEMV
  * SSYMV/DSYMV

Version 0.3.0
- Re-organized test/client infrastructure to avoid code duplication
- Added an optional bypass for pre/post-processing kernels in level-3 routines
- Significantly improved performance of level-3 routines on AMD GPUs
- Added level-3 routines:
  * CHEMM/ZHEMM
  * SSYRK/DSYRK/CSYRK/ZSYRK
  * CHERK/ZHERK
  * SSYR2K/DSYR2K/CSYR2K/ZSYR2K
  * CHER2K/ZHER2K
  * STRMM/DTRMM/CTRMM/ZTRMM

Version 0.2.0
- Added support for complex conjugate transpose
- Several host-code performance improvements
- Improved testing infrastructure and coverage
- Added level-2 routines:
  * SGEMV/DGEMV/CGEMV/ZGEMV
- Added level-3 routines:
  * CGEMM/ZGEMM
  * CSYMM/ZSYMM

Version 0.1.0
- Initial preview version release to GitHub
- Supported level-1 routines:
  * SAXPY/DAXPY/CAXPY/ZAXPY
- Supported level-3 routines:
  * SGEMM/DGEMM
  * SSYMM/DSYMM