Commit graph

472 commits

Author SHA1 Message Date
Justin Parker f2eb19bd8b
server : throw an error when slot unavailable (#4741) 2024-01-03 10:43:19 +02:00
Phil H 0ef3ca2ac6
server : add token counts to html footer (#4738)
* server: add token counts to stats

* server: generate hpp

---------

Co-authored-by: phiharri <ph@got-root.co.uk>
2024-01-02 17:48:49 +02:00
Georgi Gerganov 32866c5edd
editorconfig : fix whitespace and indentation #4710 2024-01-02 13:28:15 +02:00
minarchist 5d7002d437
server : add --override-kv parameter (#4710)
* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
2024-01-02 12:38:15 +02:00
Daniel Bevenius 775ac8712a
finetune: fix typo in README.md (#4733)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-02 10:16:55 +01:00
Georgi Gerganov 9fbda719de
clip : refactor + bug fixes (#4696)
* clip : refactor + bug fixes

ggml-ci

* server : add log message
2023-12-30 23:24:42 +02:00
Georgi Gerganov 0235b9b571
clip : use ggml_backend_buffer_is_host (#4205) 2023-12-29 18:53:34 +02:00
Steward Garcia ce18d727a4
clip : enable gpu backend (#4205)
* clip: enable CUDA backend

* add missing kernels

* add enough padding for alignment

* remove ggml_repeat of clip.cpp

* add metal backend

* llava : fixes

- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-29 18:52:15 +02:00
Cuong Trinh Manh 97bbca6e85
cmake : fix ld warning duplicate libraries libllama.a (#4671)
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"

* fix warning in example.
2023-12-29 16:39:15 +02:00
Justine Tunney 4af4801566
llava-cli : refactor to use sampling library (#4669)
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup deleting a long
standing TODO comment.

This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.

See Mozilla-Ocho/llamafile@1cd334f
2023-12-29 16:38:38 +02:00
Justine Tunney db49ff8ed7
server : replace sleep with condition variables (#4673)
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available cpu. Having the explicit notify() / wait() code
also helps aid in the readability of the server code.

See mozilla-Ocho/llamafile@711344b
2023-12-29 16:24:12 +02:00
SakuraUmi 60f55e888c
server : fix OpenAI server sampling w.r.t. penalty. (#4675) 2023-12-29 16:22:44 +02:00
Karthik Sethuraman b93edd22f5
server : allow to generate multimodal embeddings (#4681) 2023-12-29 16:22:10 +02:00
andrijdavid 82d6eab224
main-cmake-pkg : fix build issue (#4665)
* Fix main-cmake-pkg compilation

* Use glob to load common files

* cmake : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-29 16:18:20 +02:00
Peter Sugihara afd997ab60
llama.swiftui : fix infinite loop, ouput timings, buff UI (#4674)
* fix infinite loop

* slight UI simplification, clearer UX

* clearer UI text, add timings to completion log
2023-12-29 15:58:56 +02:00
Justine Tunney 65e5f6dadb
Fix OpenAI server sampling w.r.t. temp and seed (#4668)
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.

The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.

See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
2023-12-28 15:20:00 -04:00
Daniel Bevenius 879b690a9e
finetune : fix output formatting in print_params (#4653)
This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab:   32000
print_params: n_ctx:     128
print_params: n_embd:    4096
print_params: n_ff:      11008
print_params: n_head:    32
print_params: n_head_kv: 32
print_params: n_layer:   32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```
With this comit the output will look like this:
```console
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 4096
print_params: n_ff                  : 11008
print_params: n_head                : 32
print_params: n_head_kv             : 32
print_params: n_layer               : 32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-27 16:16:55 +02:00
Alexey Parfenov 6123979952
server : allow to specify custom prompt for penalty calculation (#3727) 2023-12-23 11:31:49 +02:00
LeonEricsson 7082d24cec
lookup : add prompt lookup decoding example (#4484)
* initial commit, going through initializations

* main loop finished, starting to debug

* BUG: generates gibberish/repeating tokens after a while

* kv_cache management

* Added colors to distinguish drafted tokens (--color). Updated README

* lookup : fix token positions in the draft batch

* lookup : use n_draft from CLI params

* lookup : final touches

---------

Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 18:05:56 +02:00
Georgi Gerganov afefa319f1
ggml : change ggml_scale to take a float instead of tensor (#4573)
* ggml : change ggml_scale to take a float instead of tensor

* ggml : fix CPU implementation

* tests : fix test-grad0

ggml-ci
2023-12-21 23:20:49 +02:00
Georgi Gerganov 32259b2dad
gguf : simplify example dependencies 2023-12-21 23:08:14 +02:00
Georgi Gerganov 0e18b2e7d0
llama.swiftui : add tinyllama 1.1B F16 2023-12-18 20:17:43 +02:00
Georgi Gerganov 6ff39b129d
llama.swiftui : add more models 2023-12-18 20:05:12 +02:00
Georgi Gerganov 800a489e4a
llama.swiftui : add bench functionality (#4483)
* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <developer@jhen.me>
2023-12-17 19:38:41 +02:00
slaren 45668633fd
finetune : keep allocs alive until all allocations are done (#4486) 2023-12-17 16:05:56 +01:00
olexiyb 0ffc92d2d2
server : disable llm logs if SERVER_VERBOSE is off (#3792) 2023-12-17 17:02:16 +02:00
AdithyanI 8edd2b40fd
server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
2023-12-17 16:57:56 +02:00
Alexey Parfenov eb16dae7e7
server : fix possible ambiguity in content type charset (#4501) 2023-12-17 16:56:09 +02:00
mzcu 62bd52b7bf
server : allow requests larger than 8K (#4500) 2023-12-17 16:54:37 +02:00
ShadovvBeast 88ae8952b6
server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 13:49:01 +02:00
slaren cafcd4f895
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
2023-12-14 16:52:08 +01:00
LostRuins 20a68a7030
ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-14 14:13:33 +02:00
shibe2 948ff137ec
server : fix handling of characters that span multiple tokens when streaming (#4446) 2023-12-13 21:57:15 +02:00
kalomaze fecac45658
server : tweak default sampling parameters (#4367)
* Set a more typical Top P setting as the default

* Update temp max
2023-12-12 12:12:35 +02:00
Richard Kiss 9494d7c477
english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
Vladimir Zorin d9d4cfef64
server : fix local model name in server (#4420) 2023-12-12 11:25:29 +02:00
Yueh-Po Peng 8a7b2fa528
Update README.md (#4388)
Fix small typo.
2023-12-10 23:27:38 +01:00
Georgi Gerganov bcc0eb4591
llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
Hongyu Ouyang 81bc9214a3
train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4) xaedes added

    ggml_allocr * alloc = NULL;

    ... (many lines in between)

    if (alloc) {
        ggml_allocr_free(alloc);
    }

Which is correct, but it's easy to lose context after many lines in between.

On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set to NULL, and many lines below,
we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
2023-12-07 12:25:22 +02:00
Georgi Gerganov 05cd6e5036
server : recognize cache_prompt parameter in OAI API (#4347) 2023-12-06 20:21:59 +02:00
stduhpf da5eaef1f3
speculative : support --color (#4343)
* speculative: add some colors

* minor : add braces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-06 10:08:17 +02:00
MaggotHATE 52c8bc3cf3
sampling : custom samplers order (#4285)
* Samplers sequence order w parameter

* Cleaned commented code

* Fixed formatting

* Rewrote with unordered_map

* Revert and rewrite, too many problems and safeguards would be needed

* Fixed code style

* Code style fixes according to review

* More readable samplers input string, fixed help

* Style fix in sampler_queue

* Formatting fixes

* Fixing whitespaces
2023-12-05 12:05:51 +02:00
Daniel Bevenius 23b5e12eb5
simple : update error message for KV cache check (#4324)
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-04 18:04:21 +02:00
Miwa / Ensan d208995c6d
swift : fix concatenation method to avoid invalid UTF8 stringfication (#4325) 2023-12-04 18:03:49 +02:00
Miwa / Ensan 5c9f90cba1
swift : fix prompt tokenization logic (#4321) 2023-12-04 15:43:45 +02:00
Ed Lee 33e171d1e9
server : fix OpenAI API stop field to be optional (#4299)
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)
2023-12-03 11:10:43 +02:00
Rickard Edén 6949b50df5
py : add grammar to oai like api (#4294) 2023-12-03 11:03:25 +02:00
Georgi Gerganov d5a1cbde60
llama : support optional tensors (#4283) 2023-12-01 20:35:47 +02:00
Miwa / Ensan b220222a64
swift : fix token_to_piece implementation (#4278)
* Fix token_to_piece implementation in Swift

* Fix errors
2023-12-01 20:19:45 +02:00
Georgi Gerganov ef47ec18da
ggml : add ggml_soft_max_ext (#4256)
* metal : implement soft_max_ext

* cuda : implement soft_max_ext

* ggml : implement soft_max_ext (CPU)

* batched-bench : print threads

ggml-ci

* metal : simplify soft_max encoding

ggml-ci

* cuda : use 512 threads for soft_max instead of 32

* ggml : update soft max cpu

* cuda : do warp-based block reduce

* cuda : increase max block size to 1024

* cuda : fix warp reduction initialization of shared mem

* metal : warp-based reduction for soft max kernel

* metal : warp-based reduce for rms_norm

* metal : simplify soft max kernel

ggml-ci

* alloc : fix build with debug
2023-12-01 10:51:24 +02:00