mirror of
https://github.com/CNugteren/CLBlast.git
synced 2024-08-21 04:22:27 +02:00
Development version (next release)
- Fixed a bug when using offsets in the direct version of the GEMM kernels

Version 0.10.0
- Updated to version 8.0 of the CLCudaAPI C++11 OpenCL header
- Changed the enums in the C API to avoid potential name clashes with external code
- Added a Netlib CBLAS compatible API (not recommended for full control over performance)
- Greatly improved the way exceptions are handled in the library (thanks to 'intelfx')
- Improved performance of GEMM kernels for small sizes by using a direct single-kernel implementation
- Fixed a bug in the tests and samples related to waiting for an invalid event
- Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
- Fixed a bug in the TRMM routine that would overwrite input data before consuming everything
- Added support for compilation under Visual Studio 2013 (MSVC++ 12.0)
- Added an option to set OpenCL compiler options through the environment variable CLBLAST_BUILD_OPTIONS
- Added an option to run tuned kernels multiple times to average execution times
- Added an option to build a static version of the library
- Made it possible to use the command-line environment variables everywhere and without re-running CMake
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.9.0
- Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header
- Significantly improved performance of rotated GEMV computations
- Improved performance on unseen/untuned devices through better default tuning parameter selection
- Fixed the MSVC dllimport and dllexport declarations
- Fixed memory leaks related to events not being released
- Fixed a bug with a size_t and cl_ulong mismatch on 32-bit systems
- Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
- Fixed a performance issue (caused by fp16 support) by optimizing alpha/beta parameter passing to kernels
- Fixed a bug in the OpenCL kernels: now placing __kernel before __attribute__
- Fixed a bug in level-3 routines when beta is zero and matrix C contains NaNs
- Added an option (-warm_up) to do a warm-up run before timing in the performance clients
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)

Version 0.8.0
- Added support for half-precision floating-point (fp16) in the library
- Made it possible to compile the performance tests (clients) separately from the correctness tests
- Made a reference BLAS and head-to-head performance comparison optional in the clients
- Increased the verbosity of the "-verbose" option in the correctness tests
- Refactored the host code for better compilation times and fewer lines of code
- Added Appveyor continuous integration and increased coverage of the Travis builds
- Improved the API documentation
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added half-precision routines:
  * Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
  * Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
  * Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
- Added non-BLAS routines:
  * SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)

Version 0.7.1
- Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
- Fixed a bug in the xGEMM routine related to an incorrectly set event
- Made MSVC link the run-time libraries statically

Version 0.7.0
- Added exports to be able to create a DLL on Windows (thanks to Marco Hutter)
- Made the library thread-safe
- Performance and correctness tests can now also be run against CPU BLAS libraries (in addition to clBLAS)
- Fixed the use of events within the library
- Changed the enum parameters to match the raw values of the cblas standard
- Fixed the cache of previously compiled binaries and added a function to fill or clear it
- Various minor fixes and enhancements
- Added a preliminary version of the API documentation
- Added additional sample programs
- Added tuned parameters for various devices (see README)
- Added level-1 routines:
  * SNRM2/DNRM2/ScNRM2/DzNRM2
  * SASUM/DASUM/ScASUM/DzASUM
  * SSUM/DSUM/ScSUM/DzSUM (non-absolute version of the above xASUM BLAS routines)
  * iSAMAX/iDAMAX/iCAMAX/iZAMAX
  * iSMAX/iDMAX/iCMAX/iZMAX (non-absolute version of the above ixAMAX BLAS routines)
  * iSMIN/iDMIN/iCMIN/iZMIN (non-absolute minimum version of the above ixAMAX BLAS routines)

Version 0.6.0
- Added support for MSVC (Visual Studio) 2015
- Added tuned parameters for various devices (see README)
- Now automatically generates C++ code from JSON tuning results
- Added level-2 routines:
  * SGER/DGER
  * CGERU/ZGERU
  * CGERC/ZGERC
  * CHER/ZHER
  * CHPR/ZHPR
  * CHER2/ZHER2
  * CHPR2/ZHPR2
  * CSYR/ZSYR
  * CSPR/ZSPR
  * CSYR2/ZSYR2
  * CSPR2/ZSPR2

Version 0.5.0
- Improved structure and performance of level-2 routines (xSYMV/xHEMV)
- Reduced compilation time of level-3 OpenCL kernels
- Added level-1 routines:
  * SSWAP/DSWAP/CSWAP/ZSWAP
  * SSCAL/DSCAL/CSCAL/ZSCAL
  * SCOPY/DCOPY/CCOPY/ZCOPY
  * SDOT/DDOT
  * CDOTU/ZDOTU
  * CDOTC/ZDOTC
- Added level-2 routines:
  * SGBMV/DGBMV/CGBMV/ZGBMV
  * CHBMV/ZHBMV
  * CHPMV/ZHPMV
  * SSBMV/DSBMV
  * SSPMV/DSPMV
  * STRMV/DTRMV/CTRMV/ZTRMV
  * STBMV/DTBMV/CTBMV/ZTBMV
  * STPMV/DTPMV/CTPMV/ZTPMV

Version 0.4.0
- Now using the Claduc C++11 interface to OpenCL
- Added plain C API for increased compatibility (clblast_c.h)
- Re-organized tuner infrastructure and added JSON output
- Removed clBLAS sources; clBLAS should now be installed separately for testing
- Added Travis continuous integration
- Added level-2 routines:
  * CHEMV/ZHEMV
  * SSYMV/DSYMV

Version 0.3.0
- Re-organized test/client infrastructure to avoid code duplication
- Added an optional bypass for pre/post-processing kernels in level-3 routines
- Significantly improved performance of level-3 routines on AMD GPUs
- Added level-3 routines:
  * CHEMM/ZHEMM
  * SSYRK/DSYRK/CSYRK/ZSYRK
  * CHERK/ZHERK
  * SSYR2K/DSYR2K/CSYR2K/ZSYR2K
  * CHER2K/ZHER2K
  * STRMM/DTRMM/CTRMM/ZTRMM

Version 0.2.0
- Added support for complex conjugate transpose
- Several host-code performance improvements
- Improved testing infrastructure and coverage
- Added level-2 routines:
  * SGEMV/DGEMV/CGEMV/ZGEMV
- Added level-3 routines:
  * CGEMM/ZGEMM
  * CSYMM/ZSYMM

Version 0.1.0
- Initial preview version release to GitHub
- Supported level-1 routines:
  * SAXPY/DAXPY/CAXPY/ZAXPY
- Supported level-3 routines:
  * SGEMM/DGEMM
  * SSYMM/DSYMM