Commit Graph

895 Commits (00b7a4be02ca82d53ac69dd2dd438c16e2af7658)

Author SHA1 Message Date
Georgi Gerganov 00b7a4be02
talk-llama : sync llama.cpp 2024-01-11 22:10:10 +02:00
Georgi Gerganov 04b0a768b8
swift : remove local ggml.h reference 2024-01-11 22:00:12 +02:00
Georgi Gerganov 87670425f2
swift : track ggml release branch 2024-01-11 21:57:40 +02:00
Georgi Gerganov 32e71a1861
sync : ggml 2024-01-11 21:54:17 +02:00
Georgi Gerganov 9c857cf280
sync : llama.cpp 2024-01-11 21:50:01 +02:00
Kawrakow 97b12212dd
ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:01 +02:00
Paul Tsochantaris 9fa34d79ec
metal : put encoder debug group behind a define (llama/4873) 2024-01-11 21:50:01 +02:00
Georgi Gerganov a0a64a19dd
metal : improve dequantize precision to match CPU (llama/4836)
ggml-ci
2024-01-11 21:50:01 +02:00
Georgi Gerganov bbc23611fa
ggml : fix vld1q_s8_x4 32-bit compat (llama/4828)
* ggml : fix vld1q_s8_x4 32-bit compat

ggml-ci

* ggml : fix 32-bit ARM compat (cont)

ggml-ci
2024-01-11 21:50:01 +02:00
Johannes Gäßler e9783a1fb4
CUDA: faster softmax via shared memory + fp16 math (llama/4742) 2024-01-11 21:50:01 +02:00
Georgi Gerganov 9e0cc28792
metal : fix deprecation warning (ggml/690) 2024-01-11 21:50:00 +02:00
Timothy Cronin 73072a7c73
ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693) 2024-01-11 21:50:00 +02:00
Jack Mousseau a8ba1262ff
metal : wrap each operation in debug group (ggml/690) 2024-01-11 21:50:00 +02:00
leejet e66a9a7806
ggml : change GGML_MAX_NAME at compile time (ggml/682)
* change GGML_MAX_NAME to 128

* allow controlling the value of GGML_MAX_NAME through external macro definitions
2024-01-11 21:50:00 +02:00
Halalaluyafail3 338442d773
Fix execlp call (ggml/689)
NULL can be an integer constant expression with the value zero, in this case the behavior would be undefined because of an incorrect type being passed to the variable arguments.
2024-01-11 21:50:00 +02:00
Kawrakow 10651bddf6
SOTA 2-bit quants (llama/4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slighty faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slighty faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:00 +02:00
Johannes Gäßler 53d4d0b30d
CUDA: fixed redundant value dequantization (llama/4809) 2024-01-11 21:50:00 +02:00
Konstantin Zhuravlyov 2865e4710b
ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787) 2024-01-11 21:50:00 +02:00
Georgi Gerganov c46a74a19d
ggml : do not sched_yield when calling BLAS (llama/4761)
* ggml : do not sched_yield when calling BLAS

ggml-ci

* ggml : fix do_yield logic

ggml-ci

* ggml : simplify do_yield logic

ggml-ci
2024-01-11 21:50:00 +02:00
Georgi Gerganov 46dc49a6a1
ggml : include stdlib.h before intrin.h (llama/4736) 2024-01-11 21:49:59 +02:00
Alexandru Mariuti cc7f872131
swift : checkout ggml commit instead of branch (#1750) 2024-01-10 18:12:06 +02:00
RhinoDevel bcc1658cd0
talk-llama : add optional Piper TTS support (#1749)
Add commented-out command to optionally use Piper (https://github.com/rhasspy/piper) as text-to-speech solution for the talk-llama example. Piper voices sound almost like real people which is a big improvement (e.g.) from something like espeak.
2024-01-10 16:15:28 +02:00
Emmanuel Schmidbauer c46886f599
server : add request path option(#1741) 2024-01-08 22:39:51 +00:00
Georgi Gerganov 29f78392c1
main : add cli option to disable system prints (#1740) 2024-01-08 16:41:28 +02:00
Georgi Gerganov 022756a872
server : fix server temperature + add temperature_inc (#1729)
* server : fix server temperature + add temperature_inc

* server : change dashes to underscores in parameter names
2024-01-07 13:35:14 +02:00
Georgi Gerganov 3b8c2dff57
talk-llama : sync latest llama.cpp 2024-01-06 17:22:57 +02:00
Georgi Gerganov 0b9af32a8b
release : v1.5.4 2024-01-05 17:11:27 +02:00
Erik Scholz 11b1b63b14
fix : cuda order of synchronization when setting a buffer (ggml/679)
* fix : cuda order of synchronization when setting a buffer

* also sync before memcpy

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-05 17:01:59 +02:00
Georgi Gerganov 0e26a6c92e
metal : switch back to default.metallib (ggml/681)
ggml-ci
2024-01-05 16:31:30 +02:00
Georgi Gerganov 66d8f0b7f1
ggml : fix q2_k bpw in comments (ggml/680) 2024-01-05 16:31:20 +02:00
Yajing Tang ba5bcde874
coreml : fix ANE optimized encoder (#1716) 2024-01-04 16:28:30 +02:00
Georgi Gerganov ab0a8593c5
whisper.swiftui : add .gitignore 2024-01-04 15:00:27 +02:00
Georgi Gerganov 668ffc9b23
whispser : reset the "batched" timings (#1721) 2024-01-04 13:38:39 +02:00
Georgi Gerganov 9962371f71
release : v1.5.3 2024-01-03 19:36:33 +02:00
Ashraful Islam 993acb5d41
swift : update Package.swift to use ggml as package dependency (#1701)
* updates Package.swift to use ggml as dependency

* cleans up the Package.swift file by removing redundant source files

* updates ggml url src to ggerganov
2024-01-03 19:30:26 +02:00
Finn Voorhees a3d0aa73d1
ggml : add error handling to graph_compute (#1714) 2024-01-03 15:39:43 +02:00
Georgi Gerganov 14c57952f7 cuda : simplify expression
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-03 14:43:51 +02:00
Georgi Gerganov 6c369d6788 cuda : mark I16 and I32 ops as unsupported
ggml-ci
2024-01-03 14:43:51 +02:00
Georgi Gerganov 4cdd9aad9b metal : add kernel_get_rows_i32
ggml-ci
2024-01-03 14:43:51 +02:00
Georgi Gerganov f38c057503 metal : optimize ggml_mul_mat_id (faster Mixtral PP) (llama/4725)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (llama/4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

* metal : optimizing ggml_mul_mat_id (wip)

* metal : minor fix

* metal : opt mul_mm_id
2024-01-03 14:43:51 +02:00
Georgi Gerganov 1e5544b39b metal : enable shader debugging (cmake option) (llama/4705)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (llama/4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci
2024-01-03 14:43:51 +02:00
Georgi Gerganov d5673af79f ggml : add ggml_vdotq_s32 alias (llama/4715)
ggml-ci
2024-01-03 14:43:51 +02:00
Johannes Gäßler a28dacec65 CUDA: fixed tensor cores not being used on RDNA3 (llama/4697) 2024-01-03 14:43:51 +02:00
automaticcat dbe29d4e33 ggml : add ggml_cpu_has_avx_vnni() (llama/4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation upgate

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
Johannes Gäßler fe3a67c546 CUDA: fix tensor core logic for Pascal and HIP (llama/4682) 2024-01-03 14:43:51 +02:00
hydai b138ff2be3 cuda: fix vmm oom issue on NVIDIA AGX Orin (llama/4687)
Signed-off-by: hydai <hydai@secondstate.io>
2024-01-03 14:43:51 +02:00
Guillaume Wenzek cf6f1e4181 ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)
* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
Georgi Gerganov 620a223814 scripts : fix sync order + metal sed 2024-01-03 14:43:51 +02:00
Andreu Huguet f39f9690ec
examples : fix WASM Stack Overflow (#1713)
Fix for problem:

"""
RuntimeError: Aborted(Stack overflow! Stack cookie has been overwritten at 0x12be2b10, expected hex dwords 0x89BACDFE and 0x2135467, but received 0x00000000 0x00000000)
"""

That appears when executing the WASM example with the newer versions.
2024-01-02 16:50:04 +00:00
bobqianic f9ca90256b
docker : fix the publishing of the CUDA Docker image (#1704) 2023-12-30 23:12:31 +02:00