Cedric Nugteren
|
f7a16d427c
|
Fixed a compilation issue under MSVC 2013
|
2017-05-26 22:10:56 +02:00 |
|
Cedric Nugteren
|
8400ee3a09
|
Fixed a TRSM issue caused by incorrect block size calculation
|
2017-05-15 22:04:55 +02:00 |
|
Cedric Nugteren
|
512b83dbad
|
Fixed a missing synchronization barrier in the invert kernel; fixes TRSM tests
|
2017-05-14 20:27:35 +02:00 |
|
Cedric Nugteren
|
f151e56daa
|
Added the IxAMIN routines: absolute minimum version of IxAMAX
|
2017-05-12 20:01:33 -07:00 |
|
Cedric Nugteren
|
86e8df60f1
|
Fixed a bug in the TRSM routine; tests now pass
|
2017-05-12 17:43:56 -07:00 |
|
Cedric Nugteren
|
71933c3411
|
Added tuning results for the AMD Radeon Fiji GPU
|
2017-05-11 22:53:52 -07:00 |
|
Cedric Nugteren
|
1df28a15fc
|
Re-added random tuning for GEMM after accidental removal
|
2017-05-11 22:12:38 -07:00 |
|
Cedric Nugteren
|
1c33af6eab
|
Re-added Titan X (Pascal) tuning results based on more averaging when tuning
|
2017-04-23 17:58:56 +02:00 |
|
Cedric Nugteren
|
3eea8dc998
|
Increased the default number of runs for the tuner from 2 up to 10 for fast kernels
|
2017-04-22 13:56:07 +02:00 |
|
Cedric Nugteren
|
192199c9cb
|
Fixed the direct vs indirect setting for NVIDIA GPUs
|
2017-04-22 13:43:27 +02:00 |
|
Cedric Nugteren
|
e41d204856
|
Increased the default number of runs for GEMV tuning; updated GEMV tuning results for Iris Pro
|
2017-04-21 22:12:20 +02:00 |
|
Cedric Nugteren
|
d7314d4f8e
|
Tuned the direct versus indirect GEMM kernel trade-off point for NVIDIA GPUs
|
2017-04-20 22:19:09 +02:00 |
|
Cedric Nugteren
|
409a5a2ad0
|
Fixed a namespace clash with CUDA FP16 for the half-datatype
|
2017-04-17 16:47:15 +02:00 |
|
Cedric Nugteren
|
2673f50518
|
Merge branch 'development' into benchmarking
|
2017-04-16 19:41:14 +02:00 |
|
Cedric Nugteren
|
10205d773e
|
Added a new Xaxpy kernel in between the regular and fast version in
|
2017-04-14 20:16:10 +02:00 |
|
Cedric Nugteren
|
f7f8ec644f
|
Fixed CUDA malloc and cuBLAS handles: cuBLAS as a performance-reference now works
|
2017-04-13 21:31:27 +02:00 |
|
Cedric Nugteren
|
22b3ea9256
|
Merge branch 'development' into cublas_reference
Conflicts:
scripts/generator/generator.py
|
2017-04-10 20:11:45 +02:00 |
|
Cedric Nugteren
|
7374c37e2e
|
Fixed a compilation issue under MSVC and GCC
|
2017-04-10 08:38:24 +02:00 |
|
Cedric Nugteren
|
2d45c37676
|
Removed const-vector-of-const-objects from the database class to remain compliant with the C++11 standard
|
2017-04-10 07:40:27 +02:00 |
|
Cedric Nugteren
|
fb6c78ea07
|
Added a special override database for the Apple CPU implementation on OS X: this makes the test work but does not focus on good performance
|
2017-04-07 07:37:30 +02:00 |
|
Cedric Nugteren
|
d28ee082b0
|
Uses float2 and double2 for base complex data-types instead of a custom struct; fixes bug on Apple OpenCL
|
2017-04-07 07:35:15 +02:00 |
|
Cedric Nugteren
|
ce369702d8
|
Added some missing const-ness
|
2017-04-07 07:34:32 +02:00 |
|
Cedric Nugteren
|
b24d364743
|
Laid the groundwork for cuBLAS comparisons in the clients
|
2017-04-02 18:06:15 +02:00 |
|
Cedric Nugteren
|
b84d2296b8
|
Separated host-device and device-host memory copies from execution of the CBLAS reference code; for fair timing and code de-duplication
|
2017-04-01 13:36:24 +02:00 |
|
Cedric Nugteren
|
c27d2f0c1e
|
Added an (optional) non-direct implementation of the batched GEMM routine
|
2017-03-19 16:04:04 +01:00 |
|
Cedric Nugteren
|
2fd04dae83
|
Added batched versions of the pad/copy/transpose kernels
|
2017-03-19 15:57:44 +01:00 |
|
Cedric Nugteren
|
11bb30e72b
|
Added the possibility to tune batched kernels
|
2017-03-14 20:29:51 +01:00 |
|
Cedric Nugteren
|
7b8f8fce68
|
Added initial naive version of the batched GEMM routine based on the direct GEMM kernel
|
2017-03-11 16:02:45 +01:00 |
|
Cedric Nugteren
|
49e04c7fce
|
Added API and test infrastructure for the batched GEMM routine
|
2017-03-10 21:24:35 +01:00 |
|
Cedric Nugteren
|
d754586b49
|
Added proper testing of the alpha parameter; finalized the batched AXPY implementation
|
2017-03-10 20:49:59 +01:00 |
|
Cedric Nugteren
|
92a657290a
|
Fixed a small compilation bug for MSVC related to a floating-point constant
|
2017-03-10 20:30:10 +01:00 |
|
Cedric Nugteren
|
878d93e7dc
|
Implemented a batched version of the AXPY kernel
|
2017-03-08 20:36:35 +01:00 |
|
Cedric Nugteren
|
fa0a9c689f
|
Make batched routines based on offsets instead of a vector of cl_mem objects - undoing many earlier changes
|
2017-03-08 20:10:20 +01:00 |
|
Cedric Nugteren
|
6aba0bbae7
|
Minor fixes to the client w.r.t. the addition of the batch count
|
2017-03-05 16:44:16 +01:00 |
|
Cedric Nugteren
|
b114ea49a9
|
Added first naive version of the batched AXPY routine
|
2017-03-05 15:06:14 +01:00 |
|
Cedric Nugteren
|
cdf354f895
|
Adjusted the test-infrastructure to support testing of batched-versions of routines
|
2017-03-05 15:04:16 +01:00 |
|
Cedric Nugteren
|
7f14b11f1e
|
Changed the way the test-data is generated: now using a single MT generator and distribution for all data
|
2017-03-05 11:13:47 +01:00 |
|
Cedric Nugteren
|
f9a520b3af
|
Prepared generator for batched routines; added batched AXPY routine interface
|
2017-03-05 10:38:38 +01:00 |
|
Cedric Nugteren
|
e9ef037549
|
Added tuning results for the Radeon HD6750M GPU (Apple OpenCL)
|
2017-03-04 15:24:55 +01:00 |
|
Cedric Nugteren
|
e993ee077b
|
Added a proper data-preparation function for the TRSM tests
|
2017-03-04 15:21:33 +01:00 |
|
Cedric Nugteren
|
3fc73851f7
|
Added proper support for the b_offset argument in TRSM
|
2017-03-01 21:23:33 +01:00 |
|
Cedric Nugteren
|
00281dad26
|
Fixed half-precision bugs in HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM related to incorrect constants
|
2017-02-27 21:00:04 +01:00 |
|
Cedric Nugteren
|
e09c26c706
|
Split up the GEMM kernel further to prevent C1091 in MSVC
|
2017-02-26 15:03:12 +01:00 |
|
Cedric Nugteren
|
ea6790665d
|
Merge branch 'development' into triangular_solvers
|
2017-02-26 14:51:45 +01:00 |
|
Cedric Nugteren
|
df7638c305
|
Fixed an out-of-bounds memory access when filling a matrix with a constant
|
2017-02-26 14:31:05 +01:00 |
|
Cedric Nugteren
|
b7310036ed
|
Removed half-precision support from the TRSM routine; too unstable
|
2017-02-26 12:56:21 +01:00 |
|
Cedric Nugteren
|
a433987441
|
Fixes division in the kernel for inversion of complex numbers
|
2017-02-26 10:18:45 +01:00 |
|
Cedric Nugteren
|
e47d95887c
|
Added PrepareData function for TRSM to create proper test input
|
2017-02-25 12:23:04 +01:00 |
|
Cedric Nugteren
|
2f2a510c38
|
Implemented a simple row-major to col-major problem conversion for TRSM
|
2017-02-24 21:08:44 +01:00 |
|
Cedric Nugteren
|
1e5b5157bc
|
Fixed a few issues with the TRSM routine; some tests still failing
|
2017-02-22 20:31:33 +01:00 |
|