whisper.cpp

Commit Graph

Author	SHA1	Message	Date
slaren	0878ab7c15	cuda : fix tensor size calculation for non-split buffer (llama/5145)	2024-01-27 17:19:52 +02:00
slaren	c65edd5b64	ggml-alloc : add 10% margin to the buffer sizes (llama/5149)	2024-01-27 17:19:52 +02:00
snadampal	3c8d14e9c5	ggml : update softmax n_task calculation (llama/5126) updated the n_task calculation to use max number of threads possible. This has improved the prompt eval performance by around 5% for DOT kernels and by around 10% for MMLA kernels on AWS Graviton3.	2024-01-27 17:19:52 +02:00
Paul Tsochantaris	c3977cb2ce	metal : remove unused `n_buffers` and `buffers` (llama/5129)	2024-01-27 17:19:52 +02:00
Georgi Gerganov	6da1661bc2	metal : show compile log messages	2024-01-27 17:19:51 +02:00
Engininja2	cc56540661	cuda : fix 2-bit quants on amd hip (llama/5105) * cuda : fix 2-bit quants on amd hip * use __low2float intrinsic function for new quants	2024-01-27 17:19:51 +02:00
slaren	94c1ae8668	llama : pre-allocate input tensors in a separate buffer (llama/5100)	2024-01-27 17:19:51 +02:00
Georgi Gerganov	55d54359e0	metal : disable support for MUL_MAT F32 x F16	2024-01-27 17:19:51 +02:00
Johannes Gäßler	d33c2ad354	CUDA: more info when no device code (llama/5088)	2024-01-27 17:19:51 +02:00
Georgi Gerganov	9afa7ff624	minor : clean-up some warnings and style (llama/5094) * minor : clean-up some warnings and style ggml-ci * ggml : add comment	2024-01-27 17:19:51 +02:00
Reinforce-II	0649289f02	ggml : parallelize FP32 conversion when using BLAS (llama/5045) * make GGML_TASK_INIT phase can be run in multithread * multithreaded dequantize in mul_mat when using blas library * minor fixes * update outdated comment * fix coding style * simplify code Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-27 17:19:51 +02:00
XiaotaoChen	aaeaa43878	llava : MobileVLM support (llama/4954) * MobileVLM native implementation * delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake * move android script to example/llava directory * Fix the editor config checks --------- Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>	2024-01-27 17:19:51 +02:00
slaren	078b8e23bf	llama : run all KQV ops on the CPU with no KV offload (llama/5049) ggml-ci	2024-01-27 17:19:51 +02:00
Kylin	74da3e1757	cuda : fix compile error in jetson platform (llama/4975) * cuda: fix compile error in jetson platform * cuda: update comment in ggml-cuda.cu * cuda: update ggml-cuda.cu comment	2024-01-27 17:19:50 +02:00
Judd	2d2c93a798	ggml : check ggml_add src1 type (ggml/708) Co-authored-by: Judd <foldl@boxvest.com>	2024-01-27 17:19:50 +02:00
Michael Rienstra	4bbb60efce	docs : make model options / model install methods clearer (#1806) * Make models more "discoverable" * Clean up code block language identifiers * make 3 options clearer * undo Prettier formatter change * docs: `$` shell prompt, consistently * docs: minor changes	2024-01-26 17:39:54 +02:00
trixirt	1cf679dec4	cmake : make libwhisper.so position independent (#1792) This is similar to how libllama.so is built. Signed-off-by: Tom Rix <trix@redhat.com>	2024-01-22 15:02:35 +02:00
Georgi Gerganov	41026c1e4b	cmake : temporary remove VLA check (#1795)	2024-01-22 14:51:42 +02:00
Neuman Vong	d6b9be21d7	whisper.android : return output from benchmarks (#1785) Benchmarks are failing because JNI expects a jstring and the benchmarks are missing a return statement (i.e., returning null). The functions actually build a jstring but don't return it, so this seems to have been an oversight. This patch returns the jstring and now the benchmarks run successfully. Fixes #1783.	2024-01-19 16:17:38 +02:00
Ryan Hitchman	c0329acde8	server : implement "verbose_json" format with token details (#1781) * examples/server: implement "verbose_json" format with token details. This is intended to mirror the format of openai's Python whisper.transcribe() return values. * server: don't write WAV to a temporary file if not converting * server: use std::lock_guard instead of manual lock/unlock	2024-01-18 22:58:42 +02:00
Georgi Gerganov	fb466b3417	ggml : sync ggml-metal.m	2024-01-18 11:03:13 +02:00
Georgi Gerganov	1f50a7d29f	sync : llama.cpp	2024-01-17 21:23:33 +02:00
Georgi Gerganov	1de21b913d	sync : ggml	2024-01-17 21:22:38 +02:00
Georgi Gerganov	4aea058e5a	ggml : add IQ2 to test-backend-ops + refactoring (llama/4990) * ggml : add IQ2 to test-backend-ops + refactoring ggml-ci * cuda : update supports_op for IQ2 ggml-ci * ci : enable LLAMA_CUBLAS=1 for CUDA nodes ggml-ci * cuda : fix out-of-bounds-access in `mul_mat_vec_q` ggml-ci * tests : avoid creating RNGs for each Q tensor ggml-ci * tests : avoid creating RNGs for each tensor ggml-ci	2024-01-17 21:21:10 +02:00
Georgi Gerganov	fd10234363	imatrix : offload to GPU support (llama/4957) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * imatrix : offload to GPU support * imatrix : fix ggml_mul_mat_id hanlding ggml-ci * ci : add imatrix test ggml-ci * ci : rearrange output ggml-ci	2024-01-17 21:21:10 +02:00
Georgi Gerganov	8fb5c6a409	backend : add eval callback (llama/4935) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * simple : no need for ggml_is_contiguous + fix bool parse * llama : fix callback placement in llama_context_params * backend : avoid double-ask callback calls * simple : restore examples, imatrix will serve as a demo	2024-01-17 21:21:10 +02:00
Georgi Gerganov	2fe5fbfcc2	metal : create autorelease pool during library build (llama/4970) * metal : create autorelease pool during library build ggml-ci * test : simplify ggml-ci	2024-01-17 21:21:10 +02:00
Kawrakow	01637e1a4c	ggml : importance matrix support for legacy quants (llama/4969) * imatrix: adding support for legacy quants * imatrix: guard Q4_0/Q5_0 against ffn_down craziness --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 21:21:10 +02:00
Alex Azarov	1b349eb1f9	metal : log `recommendedMaxWorkingSetSize` on iOS 16+ (llama/4936) * metal: Log `recommendedMaxWorkingSetSize` on iOS 16+ * Only log on iOS and macOS, ignoring tvOS and other platforms * Check for Xcode version before using recommendedMaxWorkingSetSize --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-17 21:21:10 +02:00
Justine Tunney	138eaebead	ggml : introduce GGML_CALL function annotation (llama/4850) This change makes it possible to build ggml-cuda.cu and ggml-metal.m as independent dynamic shared objects, that may be conditionally linked at runtime in a multiplatform binary. It introduces a GGML_CALL annotation that documents which functions have a cyclic call relationship, between the application code and GPU modules. This change does nothing, unless the build defines -DGGML_MULTIPLATFORM which causes back-references and function pointers to conform to MS ABI which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms	2024-01-17 21:21:09 +02:00
Georgi Gerganov	61b9192f27	cuda : fix dequantize kernel names (llama/4938)	2024-01-17 21:21:09 +02:00
Kawrakow	161b51d91a	CUDA: faster dequantize kernels for Q4_0 and Q4_1 (llama/4938) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 21:21:09 +02:00
Kawrakow	f904b31a7d	Add ability to use importance matrix for all k-quants (llama/4930) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 21:21:09 +02:00
Benjamin Heiniger	f6614155e4	talk-llama : optional wake-up command and audio confirmation (#1765) * talk-llama: add optional wake-word detection from command * talk-llama: add optional audio confirmation before generating answer * talk-llama: fix small formatting issue in output * talk-llama.cpp: fix Windows build	2024-01-16 15:52:01 +02:00
Przemysław Pawełczyk	f5f159c320	server : fix building and simplify lib deps on Windows (#1772) * make : fix server example building on MSYS2 environments (Windows) It was not working since commit `eff3570f78` when server was introduced. * cmake : simplify server example lib deps on Windows server uses httplib::Server, not httplib::SSLServer, so there is no need to mention cryptographic libraries in target_link_libraries. Winsock (ws2_32) suffices here. Also use plain library names like we use in other places.	2024-01-15 15:48:13 +02:00
Georgi Gerganov	6ebba525f1	talk-llama : sync llama.cpp	2024-01-14 18:08:20 +02:00
Georgi Gerganov	2a5874441d	talk-llama : llama.cpp	2024-01-14 11:06:28 +02:00
Georgi Gerganov	d08445c9ad	sync : ggml	2024-01-14 10:55:18 +02:00
Alex Azarov	4a945696cb	metal : correctly set SIMD support flags on iOS (llama/4923) * Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad * log a little bit more info on iOS	2024-01-14 10:54:09 +02:00
Kawrakow	dabc964d83	2-bit quantizations (llama/4897) * imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 10:54:09 +02:00
Georgi Gerganov	654baf693d	scripts : sync-ggml-am.sh add option to skip commits	2024-01-14 10:53:19 +02:00
Georgi Gerganov	f001a3b7b6	talk-llama : sync llama.cpp	2024-01-14 00:13:17 +02:00
Georgi Gerganov	c615f2c335	sync : ggml	2024-01-14 00:12:17 +02:00
Georgi Gerganov	d839dd0242	examples : adapt to metal API	2024-01-14 00:11:45 +02:00
Johannes Gäßler	435847891c	ggml: cache sin/cos for RoPE (llama/4908)	2024-01-14 00:11:45 +02:00
Georgi Gerganov	182f290808	metal : remove old API (llama/4919) ggml-ci	2024-01-14 00:11:45 +02:00
Georgi Gerganov	447dfc11fc	metal : disable log for loaded kernels (llama/4794)	2024-01-14 00:11:45 +02:00
texmex76	9aa9f3b84e	gguf : fix potential infinite for-loop (llama/4600) Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>	2024-01-14 00:11:44 +02:00
Georgi Gerganov	396ebd1e80	metal : refactor kernel loading code (llama/4794) * metal : detect more GPU families * metal : refactor kernel loading * metal : set kernel family requirements * metal : fix kernel init + fix compile options * metal : take into account simdgroup reduction support * metal : print only skipped kernels * metal : fix check for simdgroup reduction support * metal : check for Metal 3 * metal : free allocations * metal : normalize encoder:setComputePipelineStatus calls ggml-ci * metal : fix Metal3 family check ggml-ci * metal : check for simdgroup matrix mul. feature ggml-ci	2024-01-14 00:11:44 +02:00
Johannes Gäßler	12490f4398	CUDA: faster q8_0 -> f16 dequantization (llama/4895)	2024-01-14 00:11:44 +02:00

1110 Commits (9a0b59d990be319952a4a02b9164b3b2327cd454)