Commit graph

2454 commits

Author SHA1 Message Date
slaren 2bf8d0f7c4
backend : offload large batches to GPU (#6083)
* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-03-18 11:03:04 +01:00
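
A rough sketch of the idea in the commit above (illustrative Python pseudocode, not the actual ggml scheduler; the threshold value is an assumption):

    # Copying CPU-resident weights to the GPU only pays off when the transfer
    # is amortized over enough tokens, so large prompt batches are offloaded
    # while small (e.g. single-token decode) batches stay on the CPU.
    MIN_BATCH_FOR_OFFLOAD = 32   # assumed threshold, purely for illustration

    def choose_backend(n_tokens: int, gpu_available: bool) -> str:
        if gpu_available and n_tokens >= MIN_BATCH_FOR_OFFLOAD:
            return "gpu"   # copy the weights over and run the matmul there
        return "cpu"       # the weight copy would dominate for small batches

    print(choose_backend(512, True), choose_backend(1, True))   # gpu cpu
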
DAN™ 496bc79bc2
common : tidy-up argument parsing (#6105)
* Tidy-up argument parsing.

* Missing ref.

* common : minor

* common : add static classifier

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-18 10:27:44 +02:00
Thérence 9b03719ad7
convert : add support for CamembertModel architecture (#6119)
Adding support for the CamembertModel architecture used by:
https://huggingface.co/dangvantuan/sentence-camembert-large
2024-03-18 10:17:00 +02:00
Romain D 3a6efdd03c
convert : use f32 outtype for bf16 tensors (#6106)
The old behaviour was to use f16, but bf16 to f16 is not a lossless conversion.
Change the outtype to f32 so that the default conversion is lossless.
2024-03-18 10:04:41 +02:00
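
A minimal NumPy sketch of why the commit above makes f32 the default for bf16 tensors (using f32 values as a stand-in, since bf16 shares f32's 8-bit exponent while f16 has only 5 exponent bits):

    import numpy as np

    # Values representable in bf16 can overflow the f16 range (max ~65504),
    # whereas every bf16 value is exactly representable in f32.
    x = np.float32(1e30)
    print(x.astype(np.float16))   # inf   -> lossy conversion
    print(x.astype(np.float32))   # 1e+30 -> lossless, hence the f32 default
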
Pierrick Hymbert d01b3c4c32
common: llama_load_model_from_url using --model-url (#6098)
* common: llama_load_model_from_url with libcurl dependency

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-17 19:12:37 +01:00
Georgi Gerganov cd776c37c9
ci : close all stale issues at once (#6115) 2024-03-17 18:51:57 +01:00
GainLee dc0f612548
ggml : fix finding transfer queue family index error (#6094)
Co-authored-by: GainLee <ligen@meizu.com>
2024-03-17 18:12:22 +01:00
AmirAli Mirian c47cf414ef
ggml : add AVX512F SIMD (#6088) 2024-03-16 17:52:02 +02:00
Daniel Bevenius b5f4ae09c3
gritlm : add initial README.md (#6086)
* gritlm: add initial README.md to examples/gritlm

This commit adds a suggestion for an initial README.md for the gritlm
example.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! gritlm: add initial README.md to examples/gritlm

Use the `scripts/hf.sh` script to download the model file.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! gritlm: add initial README.md to examples/gritlm

Fix editorconfig-checker error in examples/gritlm/README.md.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-03-16 17:46:29 +02:00
Xuan Son Nguyen dfbfdd60f9
readme : add wllama as a wasm binding (#6100) 2024-03-16 17:42:08 +02:00
DAN™ 15961ec04d
common : refactor nested if causing error C1061 on MSVC (#6101)
* Refactor nested if causing error C1061 on MSVC.

* Revert back and remove the else branches.

* Add flag to track found arguments.
2024-03-16 17:39:15 +02:00
Pierrick Hymbert a56d09a440
ci : close inactive issue with workflow (#6053)
* issues: ci - close inactive issue with workflow

* ci: close issue, change workflow schedule time
2024-03-16 14:20:53 +02:00
slaren d84c48505f
llama : fix Baichuan2 13B (#6092) 2024-03-15 23:14:16 +02:00
Theia Vogel 877b4d0c62
llama : add support for control vectors (#5970)
* control vector api and implementation

* control-vectors : minor code style updates

* disable control vector when data == nullptr

use -1 for disabled range (also on init) in case we ever support controlling layer 0 (embeddings)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-15 22:43:02 +02:00
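
Conceptually, a control vector is a per-layer direction added to the hidden state; a minimal sketch of the idea (illustrative only, not the API added by this commit):

    import numpy as np

    # `control_vectors` maps layer index -> steering vector; layers without a
    # vector are left untouched (i.e. disabled). `strength` is a user-chosen scale.
    def apply_control_vectors(hidden_states, control_vectors, strength=1.0):
        for layer, h in enumerate(hidden_states):
            cvec = control_vectors.get(layer)
            if cvec is not None:
                h += strength * cvec   # steer the representation along cvec
        return hidden_states

    states = [np.zeros(8) for _ in range(4)]   # toy hidden states
    apply_control_vectors(states, {2: np.ones(8)}, strength=0.5)
    print(states[2])                           # [0.5 0.5 ... 0.5]
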
Andrew Canis 12247f4c69
llama : add Command-R support (#6033)
Information about the Command-R 35B model (128k context) can be found at:
	https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:

1) New hyperparameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transformer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used

Find GGUF files here:
	https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert model to GGUF format yourself:

1) Download Command-R Hugging Face safetensors:
	git lfs install
	git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01

2) Run:
	python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
2024-03-15 22:41:22 +02:00
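
A toy, self-contained sketch of the layer structure described in points 1-3 above (random stand-in projections; only the dataflow mirrors the description, and none of these names are llama.cpp symbols):

    import numpy as np

    d, logit_scale = 16, 0.0625
    rng = np.random.default_rng(0)
    w_attn, w_ffn, w_out = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

    def layer_norm(x, eps=1e-5):   # plain LayerNorm, not RMSNorm
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def command_r_layer(x):
        h = layer_norm(x)                  # one shared LayerNorm per layer
        return x + h @ w_attn + h @ w_ffn  # "attention" and FFN branches in parallel,
                                           # no post-attention LayerNorm

    hidden = command_r_layer(rng.standard_normal((4, d)))
    logits = logit_scale * (hidden @ w_out)   # output logits scaled by logit_scale
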
Ting Lou 4e9a7f7f7f
llava : change API to pure C style for Rust FFI bindgen (#6079)
Co-authored-by: Lou Ting <louting.t@alibaba-inc.com>
2024-03-15 16:31:05 +02:00
slaren 3020327f6c
cuda : disable unused cudaLaunchHostFunc code (#6078) 2024-03-15 14:24:03 +02:00
Neo Zhang Jianyu 46acb36767
fix set main gpu error (#6073) 2024-03-15 18:53:53 +08:00
Georgi Gerganov 131b058409
make : ggml-metal.o depends on ggml.h 2024-03-15 11:38:40 +02:00
AidanBeltonS 753e36f650
[SYCL] Fix non-intel device selection (#6042)
* Fix non-intel device selection

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-03-15 14:56:20 +05:30
Ondřej Čertík 7ce2c77f88
gguf : add support for I64 and F64 arrays (#6062)
* gguf : add support for I64 and F64 arrays

GGML currently does not support I64 or F64 arrays, and they are not often
used in machine learning. However, in case the need arises in the future, it
is nice to add them now, so that the types sit next to the other types
I8, I16, I32 in the enums, and their type numbers are reserved.

Furthermore, with this addition the GGUF format becomes very usable for
most computational applications of NumPy (being compatible with the most
common NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster
and more versatile alternative to the `npz` format, and a simpler
alternative to the `hdf5` format.

The change in this PR seems small and does not significantly increase the
maintenance burden. I tested this from Python using GGUFWriter/Reader
and `gguf-dump`, as well as from C; everything seems to work.

* Fix compiler warnings
2024-03-15 10:46:51 +02:00
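
A rough sketch of the npz-style usage mentioned above, assuming the gguf-py GGUFWriter/GGUFReader API (method names taken from the gguf Python package; treat the details as an assumption):

    import numpy as np
    import gguf

    writer = gguf.GGUFWriter("arrays.gguf", "example")
    writer.add_tensor("counts",  np.arange(8, dtype=np.int64))
    writer.add_tensor("weights", np.linspace(0.0, 1.0, 8, dtype=np.float64))
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()

    reader = gguf.GGUFReader("arrays.gguf")
    for t in reader.tensors:
        print(t.name, t.data.dtype, t.data.shape)
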
Xuan Son Nguyen aab606a11f
llama : add Orion chat template (#6066) 2024-03-15 10:44:57 +02:00
slaren b0bc9f4a9d
llama-bench : use random tokens to improve accuracy with mixtral (#6069) 2024-03-15 10:22:24 +02:00
Georgi Gerganov 4755afd1cb
llama : fix integer overflow during quantization (#6063) 2024-03-14 22:58:41 +02:00
Steve Grubb 6e0438da3c
gguf : fix resource leaks (#6061)
There are several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also, on Linux, llama-bench was missing an
fclose.
2024-03-14 20:29:32 +02:00
Ondřej Čertík 727107707a
gguf-py : bump version to 0.8.0 (#6060) 2024-03-14 19:57:31 +02:00
Michael Podvitskiy 69ff61397d
llama : support models without vocabulary (#5798)
* additional methods to read model and ctx parameters

* vocab size as part of the model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert
2024-03-14 18:21:56 +02:00
Georgi Gerganov 044ec4b2a5
embedding : add EOS token if not present (#899) 2024-03-14 15:14:14 +02:00
Georgi Gerganov 77178eedc8
gguf-py : fix dtype check (#6045) 2024-03-14 13:32:14 +02:00
Jian Liao 15a333260a
readme : improve readme for Llava-1.6 example (#6044)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14 13:18:23 +02:00
Pierrick Hymbert 43241adf22
server: disable debug release type sanitizer, simplify trigger (#6047)
- increase timeout for server
- do not fail fast
2024-03-14 13:15:39 +02:00
Georgi Gerganov a44bc969e4
llama : fix typo 2024-03-14 13:13:06 +02:00
Michael Podvitskiy 2c4fb69246
llama : optimize defrag moves + fix fragmentation calculation (#6037)
* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14 12:56:48 +02:00
Ondřej Čertík 3ca23481dd
gguf-py : add support for I8, I16 and I32 (#6045)
* Refactor dtype handling to be extensible

This code is equivalent to before, but now it is prepared to easily add
more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
2024-03-14 12:40:14 +02:00
Georgi Gerganov 3fe8d7a17f
ggml : designate enum vals for integer types (#6050) 2024-03-14 12:38:37 +02:00
Georgi Gerganov 68265ebfc6
embedding : print all resulting embeddings (#899) 2024-03-14 12:37:20 +02:00
Georgi Gerganov 381da2d9f0
metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
Georgi Gerganov 0fd6c1f015
embedding : print cosine similarity (#899) 2024-03-14 10:12:29 +02:00
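
For reference, the quantity printed is the standard cosine similarity between two embedding vectors (sketch, not the commit's code):

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ~0.707
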
Linwei Wang 19885d205e
readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov 76a936c893
readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Clint Herron 463628372d
grammar : handle missing "root" node (#6004) 2024-03-13 20:10:40 +02:00
slaren f30ea47a87
llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
increase the default to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
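
A conceptual sketch of the n_batch / n_ubatch split introduced above (illustrative only, not the scheduler code):

    # A logical batch of up to n_batch tokens is cut into micro-batches of
    # n_ubatch tokens; with pipeline parallelism several micro-batches can be
    # in flight on different GPUs at once (bounded by LLAMA_SCHED_MAX_COPIES).
    def split_into_ubatches(tokens, n_ubatch):
        return [tokens[i:i + n_ubatch] for i in range(0, len(tokens), n_ubatch)]

    n_batch, n_ubatch = 2048, 512
    ubatches = split_into_ubatches(list(range(n_batch)), n_ubatch)
    print(len(ubatches), "micro-batches of", n_ubatch, "tokens")   # 4 micro-batches
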
slaren d8fd0ccf6a
test-backend-ops : skip CPU backend by default (#6028) 2024-03-13 15:58:30 +02:00
AidanBeltonS b3d978600f
Update get version (#6025) 2024-03-13 18:47:54 +05:30
Xuan Son Nguyen 99b71c068f
Server: Use multi-task for embeddings endpoint (#6001)
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
slaren 306d34be7a
ci : remove tidy-review (#6021) 2024-03-12 17:55:19 +02:00
Georgi Gerganov 8030da7afe
ggml : reuse quantum structs across backends (#5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-12 14:27:20 +02:00
Georgi Gerganov 184215e783
ggml : fix UB in IQ2_S and IQ3_S (#6012) 2024-03-12 13:49:55 +02:00
Georgi Gerganov 48358b2e5b
sycl : update IQ1_S kernels (WIP - not working!) (#5995)
* sycl : try to fix after IQ1_S changes

* sycl : iq1s_grid -> iq1s_grid_gpu

* sycl : fix grid type
2024-03-12 11:15:05 +02:00
gliptic 5cdb371731
grammar : fix unnecessarily retained pointer to rules (#6003) 2024-03-11 21:59:03 +02:00