Commit graph

1760 commits

Author SHA1 Message Date
Daniel Bevenius 23b5e12eb5
simple : update error message for KV cache check (#4324)
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-04 18:04:21 +02:00
Miwa / Ensan d208995c6d
swift : fix concatenation method to avoid invalid UTF8 stringfication (#4325) 2023-12-04 18:03:49 +02:00
Miwa / Ensan 5c9f90cba1
swift : fix prompt tokenization logic (#4321) 2023-12-04 15:43:45 +02:00
Ikko Eltociear Ashimine 4fa44e84ad
grammar-parser : fix typo (#4318)
preceeding -> preceding
2023-12-04 09:57:35 +02:00
Georgi Gerganov fbbc42827b
ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308)
* ggml : fix soft max out-of-bounds access

ggml-ci

* ggml : reuse ggml_get_n_tasks() in ggml_graph_plan()

ggml-ci
2023-12-03 15:56:35 +02:00
Georgi Gerganov adf3de4f69
ggml : fix soft max out-of-bounds access (#4307)
ggml-ci
2023-12-03 15:56:22 +02:00
Ed Lee 33e171d1e9
server : fix OpenAI API stop field to be optional (#4299)
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)
2023-12-03 11:10:43 +02:00
Rickard Edén 6949b50df5
py : add grammar to oai like api (#4294) 2023-12-03 11:03:25 +02:00
Georgi Gerganov d7b800b8bc
llama : pad KV cache size (#4280)
* llama : pad KV cache size to 32

* metal : try to improve batched decoding
2023-12-03 10:58:16 +02:00
Georgi Gerganov 5a7d3125e7
llama : avoid using "optional" keyword (#4283) 2023-12-01 20:39:12 +02:00
Georgi Gerganov d5a1cbde60
llama : support optional tensors (#4283) 2023-12-01 20:35:47 +02:00
Miwa / Ensan b220222a64
swift : fix token_to_piece implementation (#4278)
* Fix token_to_piece implementation in Swift

* Fix errors
2023-12-01 20:19:45 +02:00
Jared Van Bortel 511f52c334
build : enable libstdc++ assertions for debug builds (#4275) 2023-12-01 20:18:35 +02:00
CausalLM 03562f3a86
llama : support attention bias on LLaMA architecture (#4283)
* Support attention_bias on LLaMA architecture

QKVO bias, should fix InternLM (https://github.com/ggerganov/llama.cpp/issues/3133) and works for LLaMAfied Qwen models (https://github.com/ggerganov/llama.cpp/pull/3743#issuecomment-1825923608).

* check existence of qkvo bias while loading llama models

Tested on LLaMA2, CUDA and CPU.

* Update llama.cpp
2023-12-01 20:17:06 +02:00
Shijie 37c746d687
llama : add Qwen support (#4281)
* enable qwen to llama.cpp

* llama : do not GPU split bias tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-01 20:16:31 +02:00
Georgi Gerganov 880f57973b
llama : fix integer overflow during quantization (#4284)
happens with multi-threaded quantization of Qwen-72B

ggml-ci
2023-12-01 18:42:11 +02:00
Daniel Bevenius 8d6d9f033b
py : add requirements file for convert-hf-to-gguf.py (#4277)
This commit adds a requirements file for the convert-hf-to-gguf.py
script, and also add the torch and transformers packages to it.

The motivation for this is that currently running convert-hf-to-gguf.py
will produce the following error:
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
Collecting numpy==1.24.4
Collecting sentencepiece==0.1.98
Collecting gguf>=0.1.0
Installing collected packages: sentencepiece, numpy, gguf
Successfully installed gguf-0.5.1 numpy-1.24.4 sentencepiece-0.1.98

(venv) $ python convert-hf-to-gguf.py --help
Traceback (most recent call last):
  File "llama.cpp/convert-hf-to-gguf.py", line 16, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```
With this commit, and using requirements-hf-to-gguf.txt instead of
requirements.txt, the script can be run and shows the help output.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-01 11:41:56 +02:00
Georgi Gerganov ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
* metal : implement soft_max_ext

* cuda : implement soft_max_ext

* ggml : implement soft_max_ext (CPU)

* batched-bench : print threads

ggml-ci

* metal : simplify soft_max encoding

ggml-ci

* cuda : use 512 threads for soft_max instead of 32

* ggml : update soft max cpu

* cuda : do warp-based block reduce

* cuda : increase max block size to 1024

* cuda : fix warp reduction initialization of shared mem

* metal : warp-based reduction for soft max kernel

* metal : warp-based reduce for rms_norm

* metal : simplify soft max kernel

ggml-ci

* alloc : fix build with debug
2023-12-01 10:51:24 +02:00
Ziad Ben Hadj-Alouane 1d144112c0
server : add --log-disable to disable logging to file (#4260)
* * add --log-disable to disable logging to file in the server example

* * typo fix
2023-12-01 00:25:49 +02:00
Ziad Ben Hadj-Alouane f43f09366d
server : add single-client multi-prompt support (#4232)
* * add multiprompt support

* * cleanup

* * more cleanup

* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests

* * remove all references to mutex_multitasks

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* * change to set

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-12-01 00:25:04 +02:00
WillCorticesAI d2809a3ba2
make : fix Apple clang determination bug (#4272)
Co-authored-by: Will Findley <findley@gmail.com>
2023-12-01 00:23:44 +02:00
Jared Van Bortel 15f5d96037
build : fix build info generation and cleanup Makefile (#3920)
* cmake : fix joining of REAL_GIT_DIR

* fix includes with help from include-what-you-use

* make : remove unneeded deps and add test-rope target

* fix C includes in C++ source files

* Revert "fix includes with help from include-what-you-use"

This reverts commit 635e9fadfd.
2023-12-01 00:23:08 +02:00
John 33c9892af5
llava : ShareGPT4V compatibility (vision encoder only loading) (#4172)
* ShareGPT4 compatibility (vision encoder only loading)

Load only a CLIP vision encoder (as supplied by ShareGPT finetunes)
Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access)
Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them

* Update convert-image-encoder-to-gguf.py
2023-11-30 23:11:14 +01:00
Andrew Godfrey 8efa0f6ebe
main : pass LOG_TEE callback to llama.cpp log (#4033)
* main : Call llama_log_set to use LOG_TEE

* tabs to spaces
2023-11-30 23:56:19 +02:00
vodkaslime 524907aa76
readme : fix (#4135)
* fix: readme

* chore: resolve comments

* chore: resolve comments
2023-11-30 23:49:21 +02:00
Juraj Bednar 3bd2c7ce1b
docker : add finetune option (#4211) 2023-11-30 23:46:01 +02:00
Miwa / Ensan bde629bb53
batched.swift : update README.md (#4214)
docs: update how to run
2023-11-30 23:45:17 +02:00
Li Tan f7f9e06212
cmake : fix the metal file foder path (#4217) 2023-11-30 23:44:11 +02:00
Dawid Wysocki 74daabae69
readme : fix typo (#4253)
llama.cpp uses GitHub Actions, not Gitlab Actions.
2023-11-30 23:43:32 +02:00
Daniel Bevenius b18c66ca6e
llama : fix alignment of general.name in print meta (#4254)
* llama: fix alignment of general.name in print meta

This commit fixes the alignment of the general.name field in the
llm_load_print_meta function.

Currently the output looks like this:
```console
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name   = LLaMA v2
```
And with this commit it looks like this:
```console
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name     = LLaMA v2
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama: fix alignment of special tokens

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-11-30 23:43:08 +02:00
slaren f4d973cecb
convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#4258) 2023-11-30 23:42:23 +02:00
tarcey 954e22858c
llama : fix typical sampling (#4261)
Typical sampling was broken because after copying new_candidates into canditates, the "sorted" bool is left at "true", but the new data is no longer sorted according to probability. Patch to set "sorted" to false.

Test: Generating with temp=0.0001 (approx. argmax)  should generate the same sequence at typical>=1.0 and typical=0.9999 (approx. disabled, but enters the typical sampling codepath).
2023-11-30 23:40:23 +02:00
rhjdvsgsgks e2bd725f4b
py : fix oai proxy (#3972)
* fix oai proxy

fix generation not stoped while bot stop talking in chat mode

fix possible `slot_id` not exist

response for cors (and pre flight)

* oai proxy: workaround for some client (such as Chatbox)

* use stop as separator to replace hardcoded `\n`
2023-11-30 22:50:40 +02:00
Georgi Gerganov 1f5cd83275
examples : add readme files 2023-11-29 11:00:17 +02:00
Peter Sugihara 4fea3420ee
readme : add FreeChat (#4248) 2023-11-29 09:16:34 +02:00
Jared Van Bortel 64e64aa255
ggml : restore abort() in GGML_ASSERT (#4242) 2023-11-28 11:51:11 +02:00
Georgi Gerganov 8406b0924b
ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full offload checks in llama.cpp (#4240)
* ggml : use blas even if src0 is not F32

* llama : use n_threads_batch only when n_tokens >= 32

ggml-ci

* llama : revert n_threads_batch logic

ggml-ci
2023-11-28 10:32:03 +02:00
bandoti b38a16dfcf
cmake : fix issue with version info not getting baked into LlamaConfig.cmake (#3970)
* Split CPP generation from build-info query

* Remove blank lines

* Add BUILD_SHARED_LIBS option
2023-11-27 21:25:42 +02:00
Kasumi 0dab8cd7cc
readme : add Amica to UI list (#4230) 2023-11-27 19:39:42 +02:00
Bailey Chittle bb03290c17
examples : iOS example with swift ui (#4159)
* copy to llama.cpp as subdir

* attempt enabling metal, fails

* ggml metal compiles!

* Update README.md

* initial conversion to new format, utf8 errors?

* bug fixes, but now has an invalid memory access :(

* added O3, now has insufficient memory access

* begin sync with master

* update to match latest code, new errors

* fixed it!

* fix for loop conditionals, increase result size

* fix current workflow errors

* attempt a llama.swiftui workflow

* Update .github/workflows/build.yml

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-27 16:56:52 +02:00
Jared Van Bortel f3b269813f
ggml : fix -Warray-bounds warning with gcc (#4231) 2023-11-26 22:58:43 -05:00
Georgi Gerganov 3e73d31d9c
lookahead : support -n -1 infinite generation 2023-11-26 21:52:23 +02:00
Georgi Gerganov 9656026b53
readme : update hot topics 2023-11-26 20:42:51 +02:00
Georgi Gerganov 922754a8d6
lookahead : add example for lookahead decoding (#4207)
* lookahead : init

* lookahead : generate and store n-grams

* lookahead : use loop instead recursion to generate n-grams

* lookahead : initial working implementation

* lookahead : filter repeating n-grams

* lookahead : use deterministic init

* lookahead : add to Makefile

* lookahead : fix a bug in the seq_id of the lookahead tokens

* lookahead : add comments

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-11-26 20:33:07 +02:00
Xiao-Yong Jin 22da05536f
metal : fix yarn (#4220)
get the correct n_orig_ctx in metal
2023-11-26 10:30:02 +02:00
Galunid 1ddb52ec38
scripts : Use mmap in torch load (#4202)
* Use mmap in torch load, prefer .bin files when loading

* Revert .bin > .safetensors preference
2023-11-25 22:45:02 +01:00
Marcus Dunn f837c3a992
llama : grammar reserve space in decode_utf8 (#4210)
* reserve space for codepoints

* improvement for the appended 0
2023-11-25 18:58:23 +02:00
crasm 3014b5415d
Update docs for yarn_ext_factor <0.0 as unspecified instead of NaN (#4189) 2023-11-25 10:47:07 -05:00
Georgi Gerganov 04814e718e
readme : update hot topics 2023-11-25 12:02:13 +02:00
Georgi Gerganov af19d35734
server : OAI API compatibility (#4198)
* Add openai-compatible POST /v1/chat/completions API endpoint to server example

* fix code style

* Update server README.md

* Improve server README.md

* Fix server.cpp code style according to review

* server : some style changes

* server : indentation

* server : enable special tokens during tokenization by default

* server : minor code style

* server : change random string generator

* straightforward /v1/models endpoint

---------

Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
2023-11-25 11:29:06 +02:00