llama.cpp

Author	SHA1	Message	Date
slaren	d232aca5a7	llama : initial ggml-backend integration (#4520 ) * llama : initial ggml-backend integration * add ggml-metal * cuda backend can be used though ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST access all tensor data with ggml_backend_tensor_get/set * add ggml_backend_buffer_clear zero-init KV cache buffer * add ggml_backend_buffer_is_hos, used to avoid copies if possible when accesing tensor data * disable gpu backends with ngl 0 * more accurate mlock * unmap offloaded part of the model * use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap * update quantize and lora * update session copy/set to use ggml-backend ggml-ci * use posix_fadvise instead of posix_fadvise64 * ggml_backend_alloc_ctx_tensors_from_buft : remove old print * llama_mmap::align_offset : use pointers instead of references for out parameters * restore progress_callback behavior * move final progress_callback call to load_all_data * cuda : fix fprintf format string (minor) * do not offload scales * llama_mmap : avoid unmapping the same fragments again in the destructor * remove unnecessary unmap * metal : add default log function that prints to stderr, cleanup code ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 21:07:46 +01:00
Marcus Dunn	31f27758fa	llama : allow getting n_batch from llama_context in c api (#4540 ) * allowed getting n_batch from llama_context in c api * changed to use `uint32_t` instead of `int` * changed to use `uint32_t` instead of `int` in `llama_n_ctx` * Update llama.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 21:57:48 +02:00
Finn Voorhees	56fa50819f	metal : fix `ggml_metal_log` vargs (#4373 )	2023-12-21 21:55:02 +02:00
Erik Garrison	0f630fbc92	cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449 ) * AMD ROCm: handle UMA memory VRAM expansions This resolves #2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU. Without this, AMD ROCm users with shared CPU/GPU memory usually are stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers. Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later. * clarify build process for ROCm on linux with cmake * avoid using deprecated ROCm hipMallocHost * keep simplifying the change required for UMA * cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON	2023-12-21 21:45:32 +02:00
arlo-phoenix	562cf222b5	ggml-cuda: Fix HIP build by adding define for __trap (#4569 ) Regression of `1398823922` HIP doesn't have trap, only abort	2023-12-21 20:13:25 +01:00
Jared Van Bortel	8fe03ffdda	common : remove incorrect --model-draft default (#4568 )	2023-12-21 19:55:34 +02:00
Johannes Gäßler	9154494808	CUDA: mul_mat_id always on GPU for batches >= 32 (#4553 )	2023-12-21 18:42:59 +01:00
Georgi Gerganov	c083718c89	readme : update coding guidelines	2023-12-21 19:27:14 +02:00
howlger	880e352277	py : open merges file as 'utf-8' (#4566 ) Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error: ``` Traceback (most recent call last): File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module> model_instance.set_vocab() File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab self._set_vocab_gpt2() File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2 special_vocab = gguf.SpecialVocab(dir_model, load_merges=True) File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__ self._load(Path(path)) File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load self._try_load_merges_txt(path) File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt for line in fp: File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined> ```	2023-12-21 19:07:34 +02:00
bobqianic	66f35a2f48	cuda : better error message for ggml_get_rows (#4561 ) * Update ggml-cuda.cu * Update ggml-cuda.cu * Update ggml-cuda.cu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-21 19:06:44 +02:00
slaren	1398823922	cuda : replace asserts in wrong architecture checks with __trap (#4556 ) * cuda : replace asserts in wrong architecture checks with __trap * make bad_arch noreturn, remove returns	2023-12-21 18:02:30 +01:00
Johannes Gäßler	d3223afdad	llama : disable per-tensor info prints on model load (#4562 )	2023-12-21 18:34:17 +02:00
LoganDark	1d7a1912ce	Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554 )	2023-12-21 10:59:27 +01:00
Johannes Gäßler	799fc22689	CUDA: Faster Mixtral prompt processing (#4538 ) * CUDA: make MoE tensors contiguous for batch size>1 * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-20 15:41:22 +01:00
Eric Sommerlade	328b83de23	ggml : fixed check for _MSC_VER (#4535 ) Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>	2023-12-19 18:17:01 +02:00
arlo-phoenix	a7aee47b98	ggml-cuda: Fix HIP build (#4528 ) regression of #4490 Adds defines for two new datatypes cublasComputeType_t, cudaDataType_t. Currently using deprecated hipblasDatatype_t since newer ones very recent.	2023-12-18 22:33:45 +01:00
Georgi Gerganov	0e18b2e7d0	llama.swiftui : add tinyllama 1.1B F16	2023-12-18 20:17:43 +02:00
Georgi Gerganov	6ff39b129d	llama.swiftui : add more models	2023-12-18 20:05:12 +02:00
Ebey Abraham	b9e74f9bca	llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490 ) * phi2 implementation * fix breaking change * phi-2 : various fixes * phi-2 : use layer norm eps * py : whitespaces * llama : fix meta KV override bug * convert : phi don't add BOS token * convert : revert "added_tokens_decoder" change * phi-2 : scale Q instead of KQ for better precision * ggml : fix NeoX rope to rotate just first n_dims * cuda : less diff in the rope_neox kernel * ggml : add ggml_mul_mat_set_prec ggml-ci * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> * cuda : ggml_cuda_op_mul_mat_cublas support F32 precision * cuda : remove oboslete comment --------- Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2023-12-18 19:27:47 +02:00
hankcs	3c04bf6da8	llama : fix try_override for bool_value which always return true (#4519 )	2023-12-18 15:14:58 +02:00
Jared Van Bortel	2994f0c5a2	decode : fix logits_valid for legacy API (#4516 )	2023-12-17 19:39:02 -05:00
Georgi Gerganov	b1306c4394	readme : update hot topics	2023-12-17 20:16:23 +02:00
Georgi Gerganov	800a489e4a	llama.swiftui : add bench functionality (#4483 ) * llama.swiftui : add bench button * llama.swiftui : initial bench functionality * force to use n_gpu_layers on simulator * add download buttons & expose llamaState.loadModel * update project.pbxproj * comment #Preview & fix editorconfig check * gitignore : xcode stuff * llama.swiftui : UX improvements * llama.swiftui : avoid data copy via "downloadTask" * llama.swiftui : remove model from project * llama : remove "mostly" from model infos * llama.swiftui : improve bench --------- Co-authored-by: jhen <developer@jhen.me>	2023-12-17 19:38:41 +02:00
Jared Van Bortel	f7f468a97d	gguf-py : fail fast on nonsensical special token IDs (#4489 )	2023-12-17 10:45:46 -05:00
Matheus Gabriel Alves Silva	919c40660f	build : Check the ROCm installation location (#4485 ) * build : Check the ROCm installation location * more generic approach * fixup! It was returning the path instead of the command output * fixup! Trailing whitespace	2023-12-17 17:23:33 +02:00
slaren	45668633fd	finetune : keep allocs alive until all allocations are done (#4486 )	2023-12-17 16:05:56 +01:00
olexiyb	0ffc92d2d2	server : disable llm logs if SERVER_VERBOSE is off (#3792 )	2023-12-17 17:02:16 +02:00
AdithyanI	8edd2b40fd	server : fix grammar being ignored (#4494 ) Fix bug in identifying the grammar.	2023-12-17 16:57:56 +02:00
Alexey Parfenov	eb16dae7e7	server : fix possible ambiguity in content type charset (#4501 )	2023-12-17 16:56:09 +02:00
mzcu	62bd52b7bf	server : allow requests larger than 8K (#4500 )	2023-12-17 16:54:37 +02:00
Bach Le	5daa5f54fd	Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506 )	2023-12-17 11:57:33 +01:00
slaren	c6c4fc081c	lora : add support for non-llama models (#3333 ) * lora : add support for non-llama models ggml-ci * avoid leaking ggml_context on failure cleanup ggml-ci * lora : allow 1d tensors * lora : include embd and output layers in size calculation * fix style	2023-12-16 18:58:46 +01:00
Jared Van Bortel	8a5be3bd58	llama : sanity checks for access to logits (#4274 ) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-15 22:16:15 -05:00
ShadovvBeast	88ae8952b6	server : add optional API Key Authentication example (#4441 ) * Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-15 13:49:01 +02:00
slaren	ee4725a686	ggml : group mul_mat_id rows by matrix (cpu only) (#4480 ) * ggml : group mul_mat_id rows by matrix (cpu only) * remove mmid parameters from mm forward * store row groups in wdata and calculate only once in GGML_TASK_INIT ggml-ci	2023-12-15 12:45:50 +01:00
slaren	6744dbe924	ggml : use ggml_row_size where possible (#4472 ) * ggml : use ggml_row_size where possible ggml-ci * ggml : move ggml_nbytes_split to ggml-cuda.cu	2023-12-14 20:05:21 +01:00
slaren	cafcd4f895	ggml : remove n_dims from ggml_tensor (#4469 ) ggml-ci	2023-12-14 16:52:08 +01:00
wonjun Jang	c50e400163	py : add protobuf dependency (#4466 )	2023-12-14 14:44:49 +02:00
LostRuins	20a68a7030	ggml : add ggml_row_size() (fixes llama out of space) (#4461 ) * Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values * do not cast to size_t, instead just use doubles * ggml : add ggml_row_size(), deprecate ggml_type_sizef() * ggml : fix row size compute to avoid overflows * tests : fix sizey -> sizez --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-14 14:13:33 +02:00
Georgi Gerganov	55e87c3749	ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453 )	2023-12-14 10:35:29 +02:00
wonjun Jang	873637afc7	convert : support loading vocab from fast tokenizer config (#3633 ) * Add HFVocab into convert.py * Update convert.py * Update convert.py * add bytes_to_unicode function * change add_meta_vocab fucntion * remove debug code * remove byte_encoder * Add newline between classes * Check tokenizer.json when tokenizer.model is not exist. * Move transformers dependency to local code * Add error context with 'raise from' * Add fast tokenizer option to BpeVocab * Update convert.py * Add VocabLoader and remove Vocab class Add transformers dependency * remove added tokens and check newline token to decide spm or bpe * Update convert.py * Add special token type * Update convert.py * Update convert.py * Update convert.py * Fix typo in convert.py * Fix when params.n_vocab < tokenizer vocab size * update vocab class * change funtion name * Remove unused variable/functions, add types to class variable and methods, delete blank liens * fix flake8 warnings * code style cleanup * make mypy happy * change exception --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2023-12-14 10:09:34 +02:00
BarfingLemurs	0353a18401	readme : update supported model list (#4457 )	2023-12-14 09:38:49 +02:00
shibe2	948ff137ec	server : fix handling of characters that span multiple tokens when streaming (#4446 )	2023-12-13 21:57:15 +02:00
Georgi Gerganov	4d98d9a656	sync : ggml (SD ops, tests, kernels) (#4444 ) * sync : ggml (SD ops, tests, kernels) ggml-ci * cuda : restore im2col ggml-ci * metal : fix accuracy of dequantization kernels ggml-ci * cuda : restore correct im2col ggml-ci * metal : try to fix moe test by reducing expert size ggml-ci * cuda : fix bin bcast when src1 and dst have different types ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-13 21:54:54 +02:00
Jared Van Bortel	70f806b821	build : detect host compiler and cuda compiler separately (#4414 )	2023-12-13 12:10:10 -05:00
Siwen Yu	9fb13f9584	common : add `--version` option to show build info in CLI (#4433 )	2023-12-13 14:50:14 +02:00
Georgi Gerganov	113f9942fc	readme : update hot topics	2023-12-13 14:05:38 +02:00
slaren	799a1cb13b	llama : add Mixtral support (#4406 ) * convert : support Mixtral as LLAMA arch * convert : fix n_ff typo * llama : model loading * ggml : sync latest ggml_mul_mat_id * llama : update graph to support MoE * llama : fix cur -> cur_expert * llama : first working version * llama : fix expert weighting in the FFN * ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) * ggml : add n_as argument to ggml_mul_mat_id * ggml : fix ggml_get_rows to take into account ne02 / ne11 * metal : add more general support for ggml_get_rows + tests * llama : add basic support for offloading moe with CUDA * metal : add/mul/div use general kernel when src1 not cont * metal : reduce the kernel launches for ggml_mul_mat_id * ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D * ggml : update get_rows f16 and q * cuda : support non-contiguous src1 in get_rows * llama : offload missing ffn_moe_silu * metal : fix ggml_get_rows to work with non-cont src1 * metal : add indirect mat-vec kernels for all quantization types * llama : do not quantize expert gating tensors * llama : add n_expert and n_expert_used to hparams + change quants * test-backend-ops : add moe test * cuda : fix get_rows when ncols is odd * convert : determine n_ctx correctly * metal : fix ggml_mul_mat_id for F32 * test-backend-ops : make experts more evenly probable (test_moe) * test-backend-ops : cleanup, add moe test for batches * test-backend-ops : add cpy from f32 -> all types test * test-backend-ops : fix dequantize block offset * llama : fix hard-coded number of experts * test-backend-ops : simplify and disable slow tests to avoid CI timeout * test-backend-ops : disable MOE test with thread sanitizer * cuda : fix mul_mat_id with multi gpu * convert : use 1e6 rope_freq_base for mixtral * convert : fix style * convert : support safetensors format * gguf-py : bump version * metal : add cpy f16 -> f32 kernel * metal : fix binary ops for ne10 % 4 != 0 * test-backend-ops : add one more sum_rows test * ggml : do not use BLAS with ggml_mul_mat_id * convert-hf : support for mixtral-instruct (#4428) * convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct * convert : use sentencepiece tokenizer for Mixtral-instruct * convert : make flake8 happy * metal : fix soft_max kernels ref: `1914017863` * metal : limit kernels to not use more than the allowed threads --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Radek Pilar <github@mrkva.eu>	2023-12-13 14:04:25 +02:00
kalomaze	fecac45658	server : tweak default sampling parameters (#4367 ) * Set a more typical Top P setting as the default * Update temp max	2023-12-12 12:12:35 +02:00
Richard Kiss	9494d7c477	english : use `typos` to fix comments and logs (#4354 )	2023-12-12 11:53:36 +02:00

1676 commits