Commit Graph

1110 Commits (9a0b59d990be319952a4a02b9164b3b2327cd454)

Author SHA1 Message Date
RhinoDevel db078a9ba8
talk-llama : add optional CLI arg to set the bot name (#1764) 2024-01-13 20:51:35 +02:00
james wolf a13a7da5ad
examples : add python example for transcription (#1744)
* rebase and add simple python interface

* moved python files to examples/python
2024-01-13 19:37:18 +02:00
Georgi Gerganov 519f8e8684
whisper : load the model into multiple buffers of max size 1GB (#1763) 2024-01-13 17:47:40 +02:00
Georgi Gerganov 40ae0962f4
talk-llama : sync llama.cpp 2024-01-12 22:04:51 +02:00
Georgi Gerganov 1560288048
sync : ggml 2024-01-12 21:56:50 +02:00
slaren 1ad6fafd91
backend_sched : fix assignments
ggml-ci
2024-01-12 21:55:42 +02:00
slaren 70840aed5f
llama : ggml-backend integration (llama/4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (llama/4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (llama/4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 21:55:42 +02:00
Johannes Gäßler b24d18feb9
CUDA: fix softmax compile for old CUDA versions (llama/4862) 2024-01-12 21:55:41 +02:00
Kawrakow 3fa98f4395
Importance Matrix calculation (llama/4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 21:55:41 +02:00
Sơn Phan Trung d05b7ee90e
models : make all scripts POSIX-compliant (#1725)
* download-coreml-model: make it POSIX-compliant

* download-ggml-model: posix compliant (2nd)

* minor edit

* forgot to add newline

* generate-coreml-interface: far more straightforward

* generate-coreml-model: done with the posix thingy

* typo

* Update download-ggml-model.sh

* fix

* fix typo

* another fix

* Update download-coreml-model.sh

* Update download-ggml-model.sh

* Update download-coreml-model.sh
2024-01-12 14:11:04 +02:00
Georgi Gerganov 6dcee35129
ggml : fix 32-bit ARM compat for IQ2_XS (#1758)
* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix
2024-01-12 14:02:30 +02:00
Boris Bliznioukov 5cb345f5e9
go : add SetInitialPrompt method to bindings (#1753) 2024-01-12 13:44:50 +02:00
George Hindle fbcb52d3cd
server : add more parameters to server api (#1754)
* feat(server): add more parameters to server api

* fix(server): reset params to original parsed values for each request
2024-01-12 13:42:52 +02:00
Georgi Gerganov 6b01e3fedd
whisper : fix segment length with params.no_timestamps == true 2024-01-12 13:37:38 +02:00
George Hindle f7908f9bb8
params : don't compute timestamps when not printing them (#1755) 2024-01-12 13:24:38 +02:00
Georgi Gerganov 00b7a4be02
talk-llama : sync llama.cpp 2024-01-11 22:10:10 +02:00
Georgi Gerganov 04b0a768b8
swift : remove local ggml.h reference 2024-01-11 22:00:12 +02:00
Georgi Gerganov 87670425f2
swift : track ggml release branch 2024-01-11 21:57:40 +02:00
Georgi Gerganov 32e71a1861
sync : ggml 2024-01-11 21:54:17 +02:00
Georgi Gerganov 9c857cf280
sync : llama.cpp 2024-01-11 21:50:01 +02:00
Kawrakow 97b12212dd
ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dot product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:01 +02:00
Paul Tsochantaris 9fa34d79ec
metal : put encoder debug group behind a define (llama/4873) 2024-01-11 21:50:01 +02:00
Georgi Gerganov a0a64a19dd
metal : improve dequantize precision to match CPU (llama/4836)
ggml-ci
2024-01-11 21:50:01 +02:00
Georgi Gerganov bbc23611fa
ggml : fix vld1q_s8_x4 32-bit compat (llama/4828)
* ggml : fix vld1q_s8_x4 32-bit compat

ggml-ci

* ggml : fix 32-bit ARM compat (cont)

ggml-ci
2024-01-11 21:50:01 +02:00
Johannes Gäßler e9783a1fb4
CUDA: faster softmax via shared memory + fp16 math (llama/4742) 2024-01-11 21:50:01 +02:00
Georgi Gerganov 9e0cc28792
metal : fix deprecation warning (ggml/690) 2024-01-11 21:50:00 +02:00
Timothy Cronin 73072a7c73
ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693) 2024-01-11 21:50:00 +02:00
Jack Mousseau a8ba1262ff
metal : wrap each operation in debug group (ggml/690) 2024-01-11 21:50:00 +02:00
leejet e66a9a7806
ggml : change GGML_MAX_NAME at compile time (ggml/682)
* change GGML_MAX_NAME to 128

* allow controlling the value of GGML_MAX_NAME through external macro definitions
2024-01-11 21:50:00 +02:00
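
A minimal sketch of what this change enables on the consumer side (the value 256 and the surrounding translation unit are assumptions for illustration, not part of the commit): since the header's default can now be overridden externally, a project can define GGML_MAX_NAME before including ggml.h, or pass an equivalent -D flag to the compiler.

```c
// Hedged sketch, not the commit itself: override the tensor-name limit before
// ggml.h is pulled in. The value 256 is an arbitrary example; the same effect
// can be achieved with -DGGML_MAX_NAME=256 on the compiler command line.
#define GGML_MAX_NAME 256
#include "ggml.h"
```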
Halalaluyafail3 338442d773
Fix execlp call (ggml/689)
NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined, because a value of the wrong type is passed to the variadic arguments.
2024-01-11 21:50:00 +02:00
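
For readers unfamiliar with the pitfall, here is a minimal sketch of the correct pattern; the command being executed is illustrative and is not the call changed by this commit. Because execlp() is variadic, the terminating sentinel must be an explicit (char *) null pointer rather than a bare NULL.

```c
#include <unistd.h>

int main(void) {
    // The argument list of execlp() must end with a null pointer of type
    // char *. Writing a bare NULL is undefined behavior on platforms where
    // NULL expands to the integer constant 0, since the variadic callee
    // reads the sentinel as a pointer-sized value.
    execlp("ls", "ls", "-l", (char *)NULL);
    return 1; // reached only if execlp() failed
}
```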
Kawrakow 10651bddf6
SOTA 2-bit quants (llama/4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change it later; for now this is what we have (see the
sketch after this entry).

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slightly faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call in the wrong place.

* Fix bug in quantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:00 +02:00
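
The Q8_K range note above ("-127...127") amounts to symmetric int8 quantization that never produces -128, so the magnitude range is identical on both sides. A rough, self-contained sketch of that idea follows; this is illustrative code, not the actual ggml Q8_K implementation, and the function name is made up.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical helper: quantize n floats to int8 with a single scale,
// clamping to [-127, 127] so the value -128 is never emitted.
static void quantize_symmetric_q8(const float *x, int8_t *q, int n, float *out_scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (int i = 0; i < n; ++i) {
        int v = (int)roundf(x[i] / scale);
        if (v >  127) v =  127;
        if (v < -127) v = -127; // keep the range symmetric: no -128
        q[i] = (int8_t)v;
    }
    *out_scale = scale;
}
```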
Johannes Gäßler 53d4d0b30d
CUDA: fixed redundant value dequantization (llama/4809) 2024-01-11 21:50:00 +02:00
Konstantin Zhuravlyov 2865e4710b
ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787) 2024-01-11 21:50:00 +02:00
Georgi Gerganov c46a74a19d
ggml : do not sched_yield when calling BLAS (llama/4761)
* ggml : do not sched_yield when calling BLAS

ggml-ci

* ggml : fix do_yield logic

ggml-ci

* ggml : simplify do_yield logic

ggml-ci
2024-01-11 21:50:00 +02:00
Georgi Gerganov 46dc49a6a1
ggml : include stdlib.h before intrin.h (llama/4736) 2024-01-11 21:49:59 +02:00
Alexandru Mariuti cc7f872131
swift : checkout ggml commit instead of branch (#1750) 2024-01-10 18:12:06 +02:00
RhinoDevel bcc1658cd0
talk-llama : add optional Piper TTS support (#1749)
Add a commented-out command to optionally use Piper (https://github.com/rhasspy/piper) as the text-to-speech solution for the talk-llama example. Piper voices sound almost like real people, which is a big improvement over something like espeak.
2024-01-10 16:15:28 +02:00
Emmanuel Schmidbauer c46886f599
server : add request path option (#1741) 2024-01-08 22:39:51 +00:00
Georgi Gerganov 29f78392c1
main : add cli option to disable system prints (#1740) 2024-01-08 16:41:28 +02:00
Georgi Gerganov 022756a872
server : fix server temperature + add temperature_inc (#1729)
* server : fix server temperature + add temperature_inc

* server : change dashes to underscores in parameter names
2024-01-07 13:35:14 +02:00
Georgi Gerganov 3b8c2dff57
talk-llama : sync latest llama.cpp 2024-01-06 17:22:57 +02:00
Georgi Gerganov 0b9af32a8b
release : v1.5.4 2024-01-05 17:11:27 +02:00
Erik Scholz 11b1b63b14
fix : cuda order of synchronization when setting a buffer (ggml/679)
* fix : cuda order of synchronization when setting a buffer

* also sync before memcpy

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-05 17:01:59 +02:00
Georgi Gerganov 0e26a6c92e
metal : switch back to default.metallib (ggml/681)
ggml-ci
2024-01-05 16:31:30 +02:00
Georgi Gerganov 66d8f0b7f1
ggml : fix q2_k bpw in comments (ggml/680) 2024-01-05 16:31:20 +02:00
Yajing Tang ba5bcde874
coreml : fix ANE optimized encoder (#1716) 2024-01-04 16:28:30 +02:00
Georgi Gerganov ab0a8593c5
whisper.swiftui : add .gitignore 2024-01-04 15:00:27 +02:00
Georgi Gerganov 668ffc9b23
whisper : reset the "batched" timings (#1721) 2024-01-04 13:38:39 +02:00
Georgi Gerganov 9962371f71
release : v1.5.3 2024-01-03 19:36:33 +02:00
Ashraful Islam 993acb5d41
swift : update Package.swift to use ggml as package dependency (#1701)
* updates Package.swift to use ggml as dependency

* cleans up the Package.swift file by removing redundant source files

* updates ggml url src to ggerganov
2024-01-03 19:30:26 +02:00