llama.cpp

Author	SHA1	Message	Date
klosax	b5fe67f8c6	Perplexity: Compute scores correlated to HellaSwag (#2312 ) * Add parameter --perplexity-lines to perplexity.cpp	2023-07-22 14:21:24 +02:00
whoreson	24baa54ac1	examples : basic VIM plugin VIM plugin for server exe	2023-07-22 13:34:51 +03:00
Richard Roberson	7d5f18468c	examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311 ) * Resync my fork with new llama.cpp commits * examples : rename to use dash instead of underscore --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-21 22:01:10 +03:00
Ikko Eltociear Ashimine	03e566977b	examples : fix typo in minigpt4.py (#2298 ) promt -> prompt	2023-07-21 14:53:07 +03:00
Georgi Gerganov	513f861953	ggml : fix rope args order + assert (#2054 )	2023-07-21 14:51:34 +03:00
Guillaume "Vermeille" Sanchez	ab0e26bdfb	llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280 )	2023-07-21 13:58:36 +03:00
Jose Maldonado	73643f5fb1	gitignore : changes for Poetry users + chat examples (#2284 ) A fix in Makefile for FreeBSD users. In the platform x86_64 is amd64. This fix resolves compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native Add two examples for interactive mode using Llama2 models (thx TheBloke for models) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-21 13:53:27 +03:00
Georgi Gerganov	ae178ab46b	llama : make tensor_split ptr instead of array (#2272 )	2023-07-21 13:10:51 +03:00
Hatsune Miku	019fe257bb	MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287 ) * Miku.sh: Set default model to llama-2-7b-chat * Miku.sh: Set ctx_size to 4096 * Miku.sh: Add in-prefix/in-suffix opts * Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements	2023-07-21 11:13:18 +03:00
Przemysław Pawełczyk	9cf022a188	make : fix embdinput library and server examples building on MSYS2 (#2235 ) * make : fix embdinput library and server examples building on MSYS2 * cmake : fix server example building on MSYS2	2023-07-21 10:42:21 +03:00
wzy	b1f4290953	cmake : install targets (#2256 ) fix #2252	2023-07-19 10:01:11 +03:00
Georgi Gerganov	d01bccde9f	ci : integrate with ggml-org/ci (#2250 ) * ci : run ctest ggml-ci * ci : add open llama 3B-v2 tests ggml-ci * ci : disable wget progress output ggml-ci * ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations ggml-ci * tests : try to fix tail free sampling test ggml-ci * ci : add K-quants ggml-ci * ci : add short perplexity tests ggml-ci * ci : add README.md * ppl : add --chunks argument to limit max number of chunks ggml-ci * ci : update README	2023-07-18 14:24:43 +03:00
Georgi Gerganov	6cbf9dfb32	llama : shorten quantization descriptions	2023-07-18 11:50:49 +03:00
Xiao-Yong Jin	6e7cca4047	llama : add custom RoPE (#2054 ) * Implement customizable RoPE The original RoPE has pre-defined parameters theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2] Our customizable RoPE, ggml_rope_custom_inplace, uses theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2] with the default matches the original scale = 1.0 base = 10000 The new command line arguments --rope-freq-base --rope-freq-scale set the two new RoPE parameter. Recent researches show changing these two parameters extends the context limit with minimal loss. 1. Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k 2. Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595 3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/ For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5 * ggml-metal: fix custom rope * common: fix argument names in help * llama: increase MEM_REQ_EVAL for MODEL_3B It avoids crashing for quantized weights on CPU. Better ways to calculate the required buffer size would be better. * llama: make MEM_REQ_EVAL depend on n_ctx * server: use proper Content-Type in curl examples Without the header Content-Type: application/json, curl will POST with Content-Type: application/x-www-form-urlencoded Though our simple server doesn't care, the httplib.h used has a limit with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192 With Content-Type: application/json, we can send large json data. * style : minor fixes, mostly indentations * ggml : fix asserts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-15 13:34:16 +03:00
Shangning Xu	c48c525f87	examples : fixed path typos in embd-input (#2214 )	2023-07-14 21:40:05 +03:00
Howard Su	32c5411631	Revert "Support using mmap when applying LoRA (#2095 )" (#2206 ) Has perf regression when mlock is used. This reverts commit `2347463201`.	2023-07-13 21:58:25 +08:00
Spencer Sutton	5bf2a27718	ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178 ) * Add ggml changes * Update train-text-from-scratch for change * mpi : adapt to new ggml_tensor->src --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-11 19:31:10 +03:00
Bach Le	c9c74b4e3f	llama : add classifier-free guidance (#2135 ) * Initial implementation * Remove debug print * Restore signature of llama_init_from_gpt_params * Free guidance context * Make freeing of guidance_ctx conditional * Make Classifier-Free Guidance a sampling function * Correct typo. CFG already means context-free grammar. * Record sampling time in llama_sample_classifier_free_guidance * Shift all values by the max value before applying logsoftmax * Fix styling based on review	2023-07-11 19:18:43 +03:00
Howard Su	2347463201	Support using mmap when applying LoRA (#2095 ) * Support using mmap when applying LoRA * Fix Linux * Update comment to reflect the support lora with mmap	2023-07-11 22:37:01 +08:00
Evan Miller	5656d10599	mpi : add support for distributed inference via MPI (#2099 ) * MPI support, first cut * fix warnings, update README * fixes * wrap includes * PR comments * Update CMakeLists.txt * Add GH workflow, fix test * Add info to README * mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099) * mpi : add names for layer inputs + prep ggml_mpi_graph_compute() * mpi : move all MPI logic into ggml-mpi Not tested yet * mpi : various fixes - communication now works but results are wrong * mpi : fix output tensor after MPI compute (still not working) * mpi : fix inference * mpi : minor * Add OpenMPI to GH action * [mpi] continue-on-error: true * mpi : fix after master merge * [mpi] Link MPI C++ libraries to fix OpenMPI * tests : fix new llama_backend API * [mpi] use MPI_INT32_T * mpi : factor out recv / send in functions and reuse * mpi : extend API to allow usage with outer backends (e.g. Metal) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-10 18:49:56 +03:00
Nigel Bosch	db4047ad5c	main : escape prompt prefix/suffix (#2151 )	2023-07-09 11:56:18 +03:00
Qingyou Meng	1d656d6360	ggml : change ggml_graph_compute() API to not require context (#1999 ) * ggml_graph_compute: deprecate using ggml_context, try resolve issue #287 * rewrite: no longer consider backward compitability; plan and make_plan * minor: rename ctx as plan; const * remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward * add static ggml_graph_compute_sugar() * minor: update comments * reusable buffers * ggml : more consistent naming + metal fixes * ggml : fix docs * tests : disable grad / opt + minor naming changes * ggml : add ggml_graph_compute_with_ctx() - backwards compatible API - deduplicates a lot of copy-paste * ci : enable test-grad0 * examples : factor out plan allocation into a helper function * llama : factor out plan stuff into a helper function * ci : fix env * llama : fix duplicate symbols + refactor example benchmark * ggml : remove obsolete assert + refactor n_tasks section * ggml : fix indentation in switch * llama : avoid unnecessary bool * ggml : remove comments from source file and match order in header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-07 19:24:01 +03:00
Judd	36680f6e40	convert : update for baichuan (#2081 ) 1. guess n_layers; 2. relax warnings on context size; 3. add a note that its derivations are also supported. Co-authored-by: Judd <foldl@boxvest.com>	2023-07-06 19:23:49 +03:00
tslmy	a17a2683d8	alpaca.sh : update model file name (#2074 ) The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `ggmlv3.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.	2023-07-06 19:17:50 +03:00
Tobias Lütke	31cfbb1013	Expose generation timings from server & update completions.js (#2116 ) * use javascript generators as much cleaner API Also add ways to access completion as promise and EventSource * export llama_timings as struct and expose them in server * update readme, update baked includes * llama : uniform variable names + struct init --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson	983b555e9d	Update Server Instructions (#2113 ) * Update server instructions for web front end * Update server README * Remove duplicate OAI instructions * Fix duplicate text --------- Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 21:03:19 +03:00
Stephan Walter	1b107b8550	ggml : generalize `quantize_fns` for simpler FP16 handling (#1237 ) * Generalize quantize_fns for simpler FP16 handling * Remove call to ggml_cuda_mul_mat_get_wsize * ci : disable FMA for mac os actions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson	8567c76b53	Update server instructions for web front end (#2103 ) Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 18:13:35 +03:00
Nigel Bosch	7f0e9a775e	embd-input: Fix input embedding example unsigned int seed (#2105 )	2023-07-05 07:33:33 +08:00
jwj7140	f257fd2550	Add an API example using server.cpp similar to OAI. (#2009 ) * add api_like_OAI.py * add evaluated token count to server * add /v1/ endpoints binding	2023-07-04 21:06:12 +03:00
Tobias Lütke	7ee76e45af	Simple webchat for server (#1998 ) * expose simple web interface on root domain * embed index and add --path for choosing static dir * allow server to multithread because web browsers send a lot of garbage requests we want the server to multithread when serving 404s for favicon's etc. To avoid blowing up llama we just take a mutex when it's invoked. * let's try this with the xxd tool instead and see if msvc is happier with that * enable server in Makefiles * add /completion.js file to make it easy to use the server from js * slightly nicer css * rework state management into session, expose historyTemplate to settings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-04 16:05:27 +02:00
Henri Vasserman	1cf14ccef1	fix server crashes (#2076 )	2023-07-04 00:05:23 +03:00
WangHaoranRobin	d7d2e6a0f0	server: add option to output probabilities for completion (#1962 ) * server: add option to output probabilities for completion * server: fix issue when handling probability output for incomplete tokens for multibyte character generation * server: fix llama_sample_top_k order * examples/common.h: put all bool variables in gpt_params together	2023-07-03 00:38:44 +03:00
Georgi Gerganov	79f634a19d	embd-input : fix returning ptr to temporary	2023-07-01 18:46:00 +03:00
Georgi Gerganov	04606a1599	train : fix compile warning	2023-07-01 18:45:44 +03:00
Howard Su	b8c8dda75f	Use unsigned for random seed (#2006 ) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-29 06:15:15 -07:00
Johannes Gäßler	7f9753fa12	CUDA GPU acceleration for LoRAs + f16 models (#1970 )	2023-06-28 18:35:54 +02:00
ningshanwutuobang	cfa0750bc9	llama : support input embeddings directly (#1910 ) * add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add READMD for llava.py * add READMD for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error	2023-06-28 18:53:37 +03:00
Howard Su	0be54f75a6	baby-llama : fix build after ggml_rope change (#2016 )	2023-06-27 08:07:13 +03:00
Georgi Gerganov	181e8d9755	llama : fix rope usage after ChatGLM change	2023-06-27 00:37:33 +03:00
David Yang	eaa6ca5a61	ggml : increase max tensor name + clean up compiler warnings in train-text (#1988 ) * Clean up compiler warnings in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues	2023-06-26 22:45:32 +03:00
zrm	b853d45601	ggml : add NUMA support (#1556 ) * detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-26 20:57:59 +03:00
anon998	c2a08f87b8	fix server sampling: top k sampler first (#1977 ) Co-authored-by: anon <anon@example.org>	2023-06-25 10:48:36 +02:00
Didzis Gosko	527b6fba1d	llama : make model stateless and context stateful (llama_state) (#1797 ) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-24 11:47:58 +03:00
Henri Vasserman	20568fe60f	[Fix] Reenable server embedding endpoint (#1937 ) * Add back embedding feature * Update README	2023-06-20 01:12:39 +03:00
Kawrakow	90cc59d6ab	examples : fix examples/metal (#1920 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-18 10:52:10 +03:00
Georgi Gerganov	4f9c43e3bd	minor : warning fixes	2023-06-17 20:24:11 +03:00
Johannes Gäßler	2c9380dd2f	Only one CUDA stream per device for async compute (#1898 )	2023-06-17 19:15:02 +02:00
Georgi Gerganov	051e1b0e6a	llama : fix kv_cache `n` init (close #1903 )	2023-06-17 19:31:20 +03:00
Randall Fitzgerald	794db3e7b9	Server Example Refactor and Improvements (#1570 ) A major rewrite for the server example. Note that if you have built something on the previous server API, it will probably be incompatible. Check out the examples for how a typical chat app could work. This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing. Summary of the changes: - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos - applies missing top k sampler - removes interactive mode/terminal-like behavior, removes exclude parameter - moves threads and batch size to server command-line parameters - adds LoRA loading and matches command line parameters with main example - fixes stopping on EOS token and with the specified token amount with n_predict - adds server timeouts, host, and port settings - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text - sets defaults for unspecified parameters between requests - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming - adds CORS headers to responses - adds request logging, exception printing and optional verbose logging - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string - adds printing an error when it can't bind to the host/port specified - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses - prints timing and build info on startup - adds logit bias to request parameters - removes embedding mode - updates documentation; adds streaming Node.js and Bash examples - fixes code formatting - sets server threads to 1 since the current global state doesn't work well with simultaneous requests - adds truncation of the input prompt and better context reset - removes token limit from the input prompt - significantly simplified the logic and removed a lot of variables --------- Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Felix Hellmann <privat@cirk2.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>	2023-06-17 14:53:04 +03:00

189 commits
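
Note on commit 6e7cca4047 (custom RoPE): the message gives the per-dimension parameters as theta_i = scale * base^(−2(i−1)/d) for i in [1, ..., d/2], with defaults scale = 1.0 and base = 10000. The sketch below only evaluates that formula in plain Python so the effect of --rope-freq-base and --rope-freq-scale can be inspected. The function name `rope_thetas` and the choice d = 128 are illustrative assumptions; this is not the ggml_rope_custom_inplace implementation.

```python
def rope_thetas(d, base=10000.0, scale=1.0):
    """Return [theta_1, ..., theta_{d/2}] with theta_i = scale * base^(-2(i-1)/d).

    Hypothetical helper for illustration only; it mirrors the formula quoted
    in the commit message, not the code in ggml.
    """
    if d <= 0 or d % 2 != 0:
        raise ValueError("d must be a positive even number")
    return [scale * base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]


if __name__ == "__main__":
    d = 128  # assumed dimension, chosen only for the example
    original = rope_thetas(d)                         # defaults from the commit message
    custom = rope_thetas(d, base=80000.0, scale=0.5)  # values from the "for the bold" example
    for i in (0, 1, d // 2 - 1):
        print(f"i={i + 1:3d}  original={original[i]:.6g}  custom={custom[i]:.6g}")
```

With the defaults the list reproduces the original RoPE parameters. A larger base makes the values decay more slowly across dimensions, and scale multiplies every value by the same factor.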