Commit graph

472 commits

Author SHA1 Message Date
Johannes Gäßler 6b73ef1201
YAML result logging + preset script (#2657) 2023-08-28 17:59:39 +02:00
Cebtenzzre ebcee207b6
quantize : make output filename optional again (#2823)
* quantize : make output filename optional again

* quantize : fix path parsing on Windows

suggested by @slaren
2023-08-28 09:32:25 +03:00
Olivier Chafik 230d46c723
examples : update llama2.c converter to read vocab and write models in GGUF format (#2751)
* llama2.c: direct gguf output (WIP)

* Simplify vector building logic

* llama2.c gguf conversion: fix token types in converter

* llama2.c: support copying vocab from a llama gguf model file

* llama2.c: update default path for vocab model + readme

* llama2.c: use defines for gguf keys

* llama2.c: escape whitespaces w/ U+2581 in vocab converter the llama.cpp way

* llama2.c converter: cleanups + take n_ff from config
2023-08-27 17:13:31 +03:00
Kawrakow 463173a6c0
llama : speedup tokenization (#2831)
* Speedup tokenization

On current master it takes ~3.2 seconds to tokenize
Wikitext. With this change it becomes ~525 ms.

* Fixit: it was missing the piece after the last found occurence

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-27 16:50:33 +03:00
Georgi Gerganov d0cee0d36d
gguf : add 64-bit support (GGUF v2) (#2821)
* gguf : bump version to 2

* gguf : add support for 64-bit (no backwards comp yet)

* gguf : v1 backwards comp

* gguf.py : bump GGUF version

* gguf.py : uint64_t on all lengths, sizes and counts, enums still uint32_t

* gguf.py : string lengths uint32_t

* gguf : update all counts to 64-bit

* gguf.py : string len uint64_t and n_dims uint32_t

* gguf : fix typo

* llama.cpp : print gguf version

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27 14:19:54 +03:00
Georgi Gerganov edd4c14817
llama : more tokenizer fixes (#2810)
* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27 14:19:19 +03:00
Bruce MacDonald c1ac54b77a
server : add /detokenize endpoint (#2802)
* Add a /detokenize endpoint to the example server

* remove trailing white-space
2023-08-27 07:11:45 +08:00
Dr. Tom Murphy VII Ph.D 72f895c923
main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (#1528)
* Fix bug in main.cpp where penalize_nl=false has no effect. It modifies the underlying logits array, but at this point we are already working on the candidates copy.

* Suppress redefinition warning for NOMINMAX on mingw. In my installation, this macro is already defined by /usr/lib/gcc/x86_64-w64-mingw32/11/include/c++/x86_64-w64-mingw32/bits/os_defines.h:45.

* main : fix indentation

* main : pass ctx to llama_token_nl()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-26 21:12:56 +03:00
Kawrakow 771551a793
Fix HellaSwag (#2805)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-26 16:48:53 +03:00
klosax 2ba83c8685
Fix spm whitespaces (#2806)
* llama.cpp : fix spm whitespace escaping + clean up

* main.cpp : spm - add whitespace in front of prompt

* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
2023-08-26 13:45:53 +02:00
lon bae5c5f679
examples : skip unnecessary external lib in server README.md how-to (#2804) 2023-08-26 16:07:43 +08:00
Kawrakow d046dcee08
Faster perplexity computation (#2786)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-25 19:05:02 +03:00
Matt Pulver c82742ac9c
llama : add llama_beam_search() (#2267)
* Add llama_beam_search().

* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().

* Add space around * pointers and & references.

* Add spaces around comparison and assignment operators.

* Prefer west const.

* Use llama_ prefix for structs in global namespace.

* Delete obsolete comment from an earlier revision.

* Change eos to eob in llama_beam and llama_beam_view structs.
2023-08-25 18:18:48 +03:00
slaren 154725c543
llama-bench : add model sizes (#2771)
* llama-bench : add model sizes

* more compact markdown output

* back to GiB

* adjust column sizes
2023-08-25 15:16:19 +02:00
Jhen-Jie Hong 29674ab4e8
server : display token probabilities in the UI (#2489)
* server : add n_probs param in chat UI

* server : keep message data array & show in probabilites component

* server : add simple popover component

* server : fix completion_probabilities undefined if not set n_probs

* server : implement Probabilites

* server : handle bytes

* server : make n_probs max to 10 for easy scroll

* server : adjust for dark/light mode

* server : Fix regenerated prompt

* server : update index.html.hpp

* server : convert prob to percentage + show original value as div title

* server : fix Probabilites not used if included empty str

* server : skip byte pair in display probabilites

* server : remove array check of completion_probabilities in messages

* skip empty array or byte pair (> 1) in Probabilites

* generate index.html.hpp

* fix incorrect prob convert if the str is already a known token

* use final response to show probabilities on stop

* revert unnecessary change

* correct probabilites usage

* remove unused function

* always send partial response for get correct probs of last to_send

* fix typo

* fix content of format_final_response

* refactor probs render & make pColor transparent if not found

* send empty string when got stop_pos in partial

* avoid unnecessary empty data event & send rest of partial tokens on stop

* use <br /> for new line

* skip -1 tok in loop to avoid send '' on end

* trim last new lines on stop

* revert unnecessary change
2023-08-25 18:32:45 +08:00
Henri Vasserman 6bbc598a63
ROCm Port (#1087)
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP

---------

Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25 12:09:42 +03:00
Kerfuffle 7694adda8d
Fix for main example getting stuck when -n -2 and --interactive (#2767)
* Fix for main example getting stuck when -n -2 and --interactive

* Add a comment so future generations may suffer less.
2023-08-24 10:11:13 -06:00
Georgi Gerganov cf658adc83
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps

* llama : check if model architecture is known

* llama : refactor llama_model_load_internal()

* gguf : add KV constant maps

* llm : read arch-specific KVs

* convert : add dummy scores + types

* falcon : load tensor data (CPU only)

* llama : fix loading progress bar

* llama : add arch member to llama_model

* falcon : CPU inference working

* falcon : support non-40B models

* falcon : minor

* llama : minor updates

ggml-ci

* convert-falcon-hf-to-gguf.py : fix special token mapping

* llama.cpp : llama default UNK token = id 0

* llama.cpp : fix bpe tokenizer

* llama.cpp : fix the fix of bpe tokenizer

* ggml : pass eps to ggml_norm

* metal : implement RoPE (mode = 2) + avoid ggml_repeat

* ggml : ggml_repeat always creates new tensor

* falcon : copy-paste self-attention from LLaMA

* metal : print extra compute pipeline info

* falcon : minor changes (still chasing the Metal problem)

* llama.cpp : fix linefeed token

* metal : fix GELU kernel numerical stability by using precise::tanh

* metal : temporary workaround for the concurrency optimization bug

* falcon : add CUDA offloading (#2739)

* llama : better model naming and size reporting

* llama : prep new tokenizer support

* llama : advanced BPE tokenizer based on ggllm.cpp imlpementation

* llama : remove oboslete comment

ggml-ci

* common : remove obsolete BPE API + disable test-tokenizer-1

* llama : revert BPE special-case in llama_byte_to_token()

* cuda : add TODOs for RoPE NeoX implementation

* llama : default special tokens based on vocab type

* perplexity : add log for start of tokenization

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00
Georgi Gerganov a192860cfe
minor : fix trailing whitespace 2023-08-23 22:37:39 +03:00
Olivier Chafik 95385241a9
examples : restore the functionality to import llama2.c models (#2685)
* Fix import of llama2.c models that don't share weights between embedding layers

* llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv

* llama2.c: comment out legacy "load from ggml model" logic

* llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin
2023-08-23 22:33:05 +03:00
klosax 5290c38e6e
main : insert bos if no tokens (#2727)
* main.cpp : insert bos if no tokens

* Update examples/main/main.cpp

* Update examples/main/main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-23 16:46:03 +02:00
Cebtenzzre 7c2227a197
chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
Kawrakow 8207214b6a
Fix values shown in the quantize tool help (#2735)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:57:12 +03:00
Kawrakow 62959e740e
Strided perplexity (#2714)
* Implementing strided computation of perplexity

* Alternative way to output PPL results

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:56:42 +03:00
Xiao-Yong Jin b8ad1b66b2
server : allow json array in prompt or content for direct token input (#2306)
* server: allow json array in prompt or content

We accept an array of strings and numbers representing tokens,
in addition to the current string valued prompt or content.

This allows direct token input, so that any special tokens
can be processed and used at the frontend during the construction
of the json data, before sending to the server. And the server
does not need to know or parse special tokens from textual input.

With this, we can use EOS and BOS used in llama-2-chat models.

* server: use tokenizePrompt(json) and default "" if empty prompt

* server: fix prompt check

* server: tokenize endpoint no longer adds BOS
2023-08-23 15:12:12 +08:00
Evan Jones f5fe98d11b
docs : add grammar docs (#2701)
* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
2023-08-22 21:01:57 -04:00
Johannes Gäßler c63bb1d16a
CUDA: use mul_mat_q kernels by default (#2683) 2023-08-22 22:47:05 +02:00
slaren 519c981f8b
embedding : evaluate prompt in batches (#2713) 2023-08-22 16:03:12 +02:00
Georgi Gerganov ef3f333d37
ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)
* ggml : sync latest (SAM + SD operators, CUDA alibi)

ggml-ci

* ggml : fix tabs
2023-08-22 14:22:08 +03:00
slaren 8e4364f2af
llama-bench : minor fixes (#2695) 2023-08-22 10:56:03 +03:00
Jhen-Jie Hong 226255b44e
server : fallback to default if client param is null (#2688)
* server : fallback to default if client param is null

* server : do not overwrite 404 if status is 500 from exception_handler
2023-08-22 08:32:00 +08:00
Georgi Gerganov 6381d4e110
gguf : new file format with flexible meta data (beta) (#2398)
* gguf : first API pass

* gguf : read header + meta data

* gguf : read tensor info

* gguf : initial model loading - not tested

* gguf : add gguf_get_tensor_name()

* gguf : do not support passing existing ggml_context to gguf_init

* gguf : simplify gguf_get_val

* gguf : gguf.c is now part of ggml.c

* gguf : read / write sample models

* gguf : add comments

* refactor : reduce code duplication and better API (#2415)

* gguf : expose the gguf_type enum through the API for now

* gguf : add array support

* gguf.py : some code style changes

* convert.py : start a new simplified implementation by removing old stuff

* convert.py : remove GGML vocab + other obsolete stuff

* GGUF : write tensor (#2426)

* WIP: Write tensor

* GGUF : Support writing tensors in Python

* refactor : rm unused import and upd todos

* fix : fix errors upd writing example

* rm example.gguf

* gitignore *.gguf

* undo formatting

* gguf : add gguf_find_key (#2438)

* gguf.cpp : find key example

* ggml.h : add gguf_find_key

* ggml.c : add gguf_find_key

* gguf : fix writing tensors

* gguf : do not hardcode tensor names to read

* gguf : write sample tensors to read

* gguf : add tokenization constants

* quick and dirty conversion example

* gguf : fix writing gguf arrays

* gguf : write tensors one by one and code reuse

* gguf : fix writing gguf arrays

* gguf : write tensors one by one

* gguf : write tensors one by one

* gguf : write tokenizer data

* gguf : upd gguf conversion script

* Update convert-llama-h5-to-gguf.py

* gguf : handle already encoded string

* ggml.h : get array str and f32

* ggml.c : get arr str and f32

* gguf.py : support any type

* Update convert-llama-h5-to-gguf.py

* gguf : fix set is not subscriptable

* gguf : update convert-llama-h5-to-gguf.py

* constants.py : add layer norm eps

* gguf.py : add layer norm eps and merges

* ggml.h : increase GGML_MAX_NAME to 64

* ggml.c : add gguf_get_arr_n

* Update convert-llama-h5-to-gguf.py

* add gptneox gguf example

* Makefile : add gptneox gguf example

* Update convert-llama-h5-to-gguf.py

* add gptneox gguf example

* Update convert-llama-h5-to-gguf.py

* Update convert-gptneox-h5-to-gguf.py

* Update convert-gptneox-h5-to-gguf.py

* Update convert-llama-h5-to-gguf.py

* gguf : support custom alignment value

* gguf : fix typo in function call

* gguf : mmap tensor data example

* fix : update convert-llama-h5-to-gguf.py

* Update convert-llama-h5-to-gguf.py

* convert-gptneox-h5-to-gguf.py : Special tokens

* gptneox-main.cpp : special tokens

* Update gptneox-main.cpp

* constants.py : special tokens

* gguf.py : accumulate kv and tensor info data + special tokens

* convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens

* gguf : gguf counterpart of llama-util.h

* gguf-util.h : update note

* convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens

* convert-llama-h5-to-gguf.py : special tokens

* Delete gptneox-common.cpp

* Delete gptneox-common.h

* convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer

* gptneox-main.cpp : gpt2 bpe tokenizer

* gpt2 bpe tokenizer (handles merges and unicode)

* Makefile : remove gptneox-common

* gguf.py : bytesarray for gpt2bpe tokenizer

* cmpnct_gpt2bpe.hpp : comments

* gguf.py : use custom alignment if present

* gguf : minor stuff

* Update gptneox-main.cpp

* map tensor names

* convert-gptneox-h5-to-gguf.py : map tensor names

* convert-llama-h5-to-gguf.py : map tensor names

* gptneox-main.cpp : map tensor names

* gguf : start implementing libllama in GGUF (WIP)

* gguf : start implementing libllama in GGUF (WIP)

* rm binary commited by mistake

* upd .gitignore

* gguf : calculate n_mult

* gguf :  inference with 7B model working (WIP)

* gguf : rm deprecated function

* gguf : start implementing gguf_file_saver (WIP)

* gguf : start implementing gguf_file_saver (WIP)

* gguf : start implementing gguf_file_saver (WIP)

* gguf : add gguf_get_kv_type

* gguf : add gguf_get_kv_type

* gguf : write metadata in gguf_file_saver (WIP)

* gguf : write metadata in gguf_file_saver (WIP)

* gguf : write metadata in gguf_file_saver

* gguf : rm references to old file formats

* gguf : shorter name for member variable

* gguf : rm redundant method

* gguf : get rid of n_mult, read n_ff from file

* Update gguf_tensor_map.py

* Update gptneox-main.cpp

* gguf : rm references to old file magics

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : quantization is working

* gguf : roper closing of file

* gguf.py : no need to convert tensors twice

* convert-gptneox-h5-to-gguf.py : no need to convert tensors twice

* convert-llama-h5-to-gguf.py : no need to convert tensors twice

* convert-gptneox-h5-to-gguf.py : simplify nbytes

* convert-llama-h5-to-gguf.py : simplify nbytes

* gptneox-main.cpp : n_layer --> n_block

* constants.py : n_layer --> n_block

* gguf.py : n_layer --> n_block

* convert-gptneox-h5-to-gguf.py : n_layer --> n_block

* convert-llama-h5-to-gguf.py : n_layer --> n_block

* gptneox-main.cpp : n_layer --> n_block

* Update gguf_tensor_map.py

* convert-gptneox-h5-to-gguf.py : load model in parts to save memory

* convert-llama-h5-to-gguf.py : load model in parts to save memory

* convert : write more metadata for LLaMA

* convert : rm quantization version

* convert-gptneox-h5-to-gguf.py : add file_type key

* gptneox-main.cpp : add file_type key

* fix conflicts

* gguf : add todos and comments

* convert-gptneox-h5-to-gguf.py : tensor name map changes

* Create gguf_namemap.py : tensor name map changes

* Delete gguf_tensor_map.py

* gptneox-main.cpp : tensor name map changes

* convert-llama-h5-to-gguf.py : fixes

* gguf.py : dont add empty strings

* simple : minor style changes

* gguf : use UNIX line ending

* Create convert-llama-7b-pth-to-gguf.py

* llama : sync gguf-llama.cpp with latest llama.cpp (#2608)

* llama : sync gguf-llama.cpp with latest llama.cpp

* minor : indentation + assert

* llama : refactor gguf_buffer and gguf_ctx_buffer

* llama : minor

* gitignore : add gptneox-main

* llama : tokenizer fixes (#2549)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* convert : update convert-new.py with tokenizer fixes (#2614)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* llama : sync gguf-llama with llama (#2613)

* llama : sync gguf-llama with llama

* tests : fix build + warnings (test-tokenizer-1 still fails)

* tests : fix wstring_convert

* convert : fix layer names

* llama : sync gguf-llama.cpp

* convert : update HF converter to new tokenizer voodoo magics

* llama : update tokenizer style

* convert-llama-h5-to-gguf.py : add token types

* constants.py : add token types

* gguf.py : add token types

* convert-llama-7b-pth-to-gguf.py : add token types

* gguf-llama.cpp :  fix n_head_kv

* convert-llama-h5-to-gguf.py : add 70b gqa support

* gguf.py : add tensor data layout

* convert-llama-h5-to-gguf.py : add tensor data layout

* convert-llama-7b-pth-to-gguf.py : add tensor data layout

* gptneox-main.cpp : add tensor data layout

* convert-llama-h5-to-gguf.py : clarify the reverse permute

* llama : refactor model loading code (#2620)

* llama : style formatting + remove helper methods

* llama : fix quantization using gguf tool

* llama : simplify gguf_file_saver

* llama : fix method names

* llama : simplify write_header()

* llama : no need to pass full file loader to the file saver

just gguf_ctx

* llama : gguf_file_saver write I32

* llama : refactor tensor names (#2622)

* gguf: update tensor names searched in quantization

* gguf : define tensor names as constants

* gguf : initial write API (not tested yet)

* gguf : write to file API (not tested)

* gguf : initial write API ready + example

* gguf : fix header write

* gguf : fixes + simplify example + add ggml_nbytes_pad()

* gguf : minor

* llama : replace gguf_file_saver with new gguf write API

* gguf : streaming support when writing files

* gguf : remove oboslete write methods

* gguf : remove obosolete gguf_get_arr_xxx API

* llama : simplify gguf_file_loader

* llama : move hparams and vocab from gguf_file_loader to llama_model_loader

* llama : merge gguf-util.h in llama.cpp

* llama : reorder definitions in .cpp to match .h

* llama : minor simplifications

* llama : refactor llama_model_loader (WIP)

wip : remove ggml_ctx from llama_model_loader

wip : merge gguf_file_loader in llama_model_loader

* llama : fix shape prints

* llama : fix Windows build + fix norm_rms_eps key

* llama : throw error on missing KV paris in model meta data

* llama : improve printing + log meta data

* llama : switch print order of meta data

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>

* gguf : deduplicate (#2629)

* gguf : better type names

* dedup : CPU + Metal is working

* ggml : fix warnings about unused results

* llama.cpp : fix line feed and compiler warning

* llama : fix strncpy warning + note token_to_str does not write null

* llama : restore the original load/save session implementation

Will migrate this to GGUF in the future

* convert-llama-h5-to-gguf.py : support alt ctx param name

* ggml : assert when using ggml_mul with non-F32 src1

* examples : dedup simple

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>

* gguf.py : merge all files in gguf.py

* convert-new.py : pick #2427 for HF 70B support

* examples/gguf : no need to keep q option for quantization any more

* llama.cpp : print actual model size

* llama.cpp : use ggml_elements()

* convert-new.py : output gguf (#2635)

* convert-new.py : output gguf (WIP)

* convert-new.py : add gguf key-value pairs

* llama : add hparams.ctx_train + no longer print ftype

* convert-new.py : minor fixes

* convert-new.py : vocab-only option should work now

* llama : fix tokenizer to use llama_char_to_byte

* tests : add new ggml-vocab-llama.gguf

* convert-new.py : tensor name mapping

* convert-new.py : add map for skipping tensor serialization

* convert-new.py : convert script now works

* gguf.py : pick some of the refactoring from #2644

* convert-new.py : minor fixes

* convert.py : update to support GGUF output

* Revert "ci : disable CI temporary to not waste energy"

This reverts commit 7e82d25f40.

* convert.py : n_head_kv optional and .gguf file extension

* convert.py : better always have n_head_kv and default it to n_head

* llama : sync with recent PRs on master

* editorconfig : ignore models folder

ggml-ci

* ci : update ".bin" to ".gguf" extension

ggml-ci

* llama : fix llama_model_loader memory leak

* gptneox : move as a WIP example

* llama : fix lambda capture

ggml-ci

* ggml : fix bug in gguf_set_kv

ggml-ci

* common.h : .bin --> .gguf

* quantize-stats.cpp : .bin --> .gguf

* convert.py : fix HF tensor permuting / unpacking

ggml-ci

* llama.cpp : typo

* llama : throw error if gguf fails to init from file

ggml-ci

* llama : fix tensor name grepping during quantization

ggml-ci

* gguf.py : write tensors in a single pass (#2644)

* gguf : single pass for writing tensors + refactoring writer

* gguf : single pass for writing tensors + refactoring writer

* gguf : single pass for writing tensors + refactoring writer

* gguf : style fixes in simple conversion script

* gguf : refactor gptneox conversion script

* gguf : rename h5 to hf (for HuggingFace)

* gguf : refactor pth to gguf conversion script

* gguf : rm file_type key and method

* gguf.py : fix vertical alignment

* gguf.py : indentation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* convert-gptneox-hf-to-gguf.py : fixes

* gguf.py : gptneox mapping

* convert-llama-hf-to-gguf.py : fixes

* convert-llama-7b-pth-to-gguf.py : fixes

* ggml.h : reverse GGUF_MAGIC

* gguf.py : reverse GGUF_MAGIC

* test-tokenizer-0.cpp : fix warning

* llama.cpp : print kv general.name

* llama.cpp : get special token kv and linefeed token id

* llama : print number of tensors per type + print arch + style

* tests : update vocab file with new magic

* editorconfig : fix whitespaces

* llama : re-order functions

* llama : remove C++ API + reorganize common source in /common dir

* llama : minor API updates

* llama : avoid hardcoded special tokens

* llama : fix MPI build

ggml-ci

* llama : introduce enum llama_vocab_type + remove hardcoded string constants

* convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested

* falcon-main.cpp : falcon inference example

* convert-falcon-hf-to-gguf.py : remove extra kv

* convert-gptneox-hf-to-gguf.py : remove extra kv

* convert-llama-7b-pth-to-gguf.py : remove extra kv

* convert-llama-hf-to-gguf.py : remove extra kv

* gguf.py : fix for falcon 40b

* falcon-main.cpp : fix for falcon 40b

* convert-falcon-hf-to-gguf.py : update ref

* convert-falcon-hf-to-gguf.py : add tensor data layout

* cmpnct_gpt2bpe.hpp : fixes

* falcon-main.cpp : fixes

* gptneox-main.cpp : fixes

* cmpnct_gpt2bpe.hpp : remove non-general stuff

* Update examples/server/README.md

Co-authored-by: slaren <slarengh@gmail.com>

* cmpnct_gpt2bpe.hpp : cleanup

* convert-llama-hf-to-gguf.py : special tokens

* convert-llama-7b-pth-to-gguf.py : special tokens

* convert-permute-debug.py : permute debug print

* convert-permute-debug-master.py : permute debug for master

* convert-permute-debug.py : change permute type of attn_q

* convert.py : 70b model working (change attn_q permute)

* Delete convert-permute-debug-master.py

* Delete convert-permute-debug.py

* convert-llama-hf-to-gguf.py : fix attn_q permute

* gguf.py : fix rope scale kv

* convert-llama-hf-to-gguf.py : rope scale and added tokens

* convert-llama-7b-pth-to-gguf.py : rope scale and added tokens

* llama.cpp : use rope scale kv

* convert-llama-7b-pth-to-gguf.py : rope scale fix

* convert-llama-hf-to-gguf.py : rope scale fix

* py : fix whitespace

* gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682)

* First pass at converting GGMLv3 LLaMA models to GGUF

* Cleanups, better output during conversion

* Fix vocab space conversion logic

* More vocab conversion fixes

* Add description to converted GGUF files

* Improve help text, expand warning

* Allow specifying name and description for output GGUF

* Allow overriding vocab and hyperparams from original model metadata

* Use correct params override var name

* Fix wrong type size for Q8_K

Better handling of original style metadata

* Set default value for gguf add_tensor raw_shape KW arg

* llama : improve token type support (#2668)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* Improved tokenizer test

But does it work on MacOS?

* Improve token type support

- Added @klosax code to convert.py
- Improved token type support in vocabulary

* Exclude platform dependent tests

* More sentencepiece compatibility by eliminating magic numbers

* Restored accidentally removed comment

* llama : add API for token type

ggml-ci

* tests : use new tokenizer type API (#2692)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* Improved tokenizer test

But does it work on MacOS?

* Improve token type support

- Added @klosax code to convert.py
- Improved token type support in vocabulary

* Exclude platform dependent tests

* More sentencepiece compatibility by eliminating magic numbers

* Restored accidentally removed comment

* Improve commentary

* Use token type API in test-tokenizer-1.cpp

* py : cosmetics

* readme : add notice about new file format

ggml-ci

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: goerch <jhr.walter@t-online.de>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-08-21 23:07:43 +03:00
Kawrakow cb1c0727bd
HellaSwag: split token evaluation into batches if needed (#2681)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-21 11:11:31 +03:00
Kawrakow 5e9ff54a67
More efficient Hellaswag implementation (#2677)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-20 16:44:46 +03:00
Georgi Gerganov 1f0bccb279
server : better default prompt (#2646) 2023-08-19 05:45:36 +08:00
Jhen-Jie Hong f63564adfa
server : update xxd usage for older versions compatibility (#2649)
* server : update xxd usage for older versions compatibility

* remove unused $func
2023-08-19 05:41:32 +08:00
slaren 097e121e2f
llama : add benchmark example (#2626)
* llama : add benchmark example

* add to examples CMakeLists.txt

* fix msvc build

* add missing include

* add Bessel's correction to stdev calculation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve markdown formatting

* add missing include

* print warning is NDEBUG is not defined

* remove n_prompt and n_gen from the matrix, use each value separately instead

* better checks for non-optimized builds

* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call

* fix json formatting

* add sql output

* add basic cpu and gpu info (linx/cuda only)

* markdown: also show values that differ from the default

* markdown: add build id

* cleanup

* improve formatting

* formatting

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-18 12:44:58 +02:00
Georgi Gerganov e9b12c332e
perplexity : more meaningful ETA number - 2 decimal points 2023-08-18 12:48:55 +03:00
staviq 10151bee2e
server : support for saving templates in browser LocalStorage (#2486)
* support for templates in browser LocalStorage

* sync accepted #2409 fix from upstream

* convert autosave invocation to useEffect

* Apply suggestions from code review

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

* Regen index.html.cpp, suggested from code review

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-08-18 07:34:01 +08:00
Kerfuffle 8dae7ce684
Add --cfg-negative-prompt-file option for examples (#2591)
Add --cfg-negative-prompt-file option for examples
2023-08-17 07:29:44 -06:00
Jhen-Jie Hong 3ebb00935f
server : add missing /json-schema-to-grammar.mjs (#2616)
fixes #2611
2023-08-15 06:14:14 +08:00
Cheng Shao d75561df20
server : add --numa support (#2524) 2023-08-14 16:36:42 +03:00
Jhen-Jie Hong 2feb8934eb
server : fix default grammar by use empty string in the UI (#2604) 2023-08-14 16:20:17 +08:00
Jhen-Jie Hong 5517d6e692
server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588)
* server : implement json-schema-to-grammar.mjs by follow python impl

* server : add grammar support in chat.mjs

* server : implement grammer param in the UI

* server : generate .hpp

* server : remove trailing whitespaces

* server : generate .hpp

* server : fix sort of prop pairs

* server : optimize regex & iteration
2023-08-14 15:16:54 +08:00
byte-6174 b19edd54d5
Adding support for llama2.c models (#2559) 2023-08-12 01:17:25 +02:00
Equim 53dc399472
server: fixed wrong variable name in timing json (#2579)
* server: fixed wrong variable name in timing json

* remove redunct entry
2023-08-12 00:35:14 +02:00
DannyDaemonic 9ca4abed89
Handle ENABLE_VIRTUAL_TERMINAL_PROCESSING more gracefully on earlier versions of Windows. 2023-08-10 13:11:36 -07:00
Christian Demsar e59fcb2bc1
Add --n-predict -2 for stopping generation on full context (#2565) 2023-08-10 16:28:27 +02:00
Martin Krasser 1638757767
Fix grammar-based sampling issue in server (#2566) 2023-08-10 13:16:38 +03:00
Martin Krasser f5bfea0580
Allow passing grammar to completion endpoint (#2532)
* Allow passing grammar to completion endpoint
2023-08-08 16:29:19 +03:00
chaihahaha 7ed8d1fe7f
llm.vim : multiline autocompletion, get rid of "^@" (#2543) 2023-08-08 15:07:02 +03:00
Georgi Gerganov e7f94d6fdc
vim : bring back simple llm.vim example 2023-08-08 15:06:18 +03:00
AustinMroz 2d7baaf50f
vim : streaming and more (#2495)
* Update Vim plugin

* Remove getbufoneline usage, Add input bind example.

getbufoneline() appears to be a recently added function and has been
replaced with getbufline for compatibility.

An additional example that explains how to add a keybind that works in
insert mode was added.
2023-08-08 14:44:48 +03:00
klosax f3c3b4b167
Add --rope-scale parameter (#2544)
* common.cpp : Add --rope-scale parameter
* README.md : Add info about using linear rope scaling
2023-08-07 19:07:19 +02:00
DannyDaemonic 86c3219895
console : fix issue related to Windows 11 PowerShell console mode persistence (#2521) 2023-08-06 09:49:34 +03:00
Jonas Wunderlich 332311234a
fix firefox autoscroll (#2519) 2023-08-04 22:16:11 +02:00
Cebtenzzre 182af739c4
server: regenerate completion.js.hpp (#2515) 2023-08-04 21:00:57 +02:00
DannyDaemonic 3498588e0f
Add --simple-io option for subprocesses and break out console.h and cpp (#1558) 2023-08-04 08:20:12 -07:00
Stephen Nichols 5f631c2679
Fixing race condition in server and partial stream handling in frontend. (#2391)
* Fixing race condition in server.cpp and partial stream handling in completion.js

* Reverting assert edits.

* Adding newline to eof
2023-08-04 13:37:24 +02:00
Borislav Stanimirov ff966e7ca6
build : fix several cast and printf warnings (#2499) 2023-08-04 13:07:21 +03:00
Evan Jones 8183159cf3
examples : generate JSON according to schema (#1887)
* examples : add JSON schema grammars

* complete JSON grammar

* ensure primitive types can be used as root of schema

* support integer type and adjust usage text
2023-08-02 22:05:44 -04:00
Eve 81844fbcfd
tests : Fix compilation warnings (Linux/GCC) (#2451)
* fix hellaswag print format, cast away warning in test-double-float

* c++11 cannot use designated initializers

* add static to test-grad0.c internal functions

* use memcpy in test-double-float.c

* port c tests to c++

* use initializer list for ggml_init_params
2023-08-02 11:06:19 +03:00
Bono Lv c574bddb36
fix a typo in examples/server/README.md (#2478) 2023-08-01 14:54:28 +02:00
ebraminio 86aeb27734
server : Support dark mode (#2414)
* server : Support dark mode

So it respects user system light / dark settings.

* Update index.html.hpp by running ./deps.sh
2023-08-01 10:56:23 +02:00
Johannes Gäßler 0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues (#2453) 2023-07-31 15:44:35 +02:00
klosax 8a88e5855c
perplexity : add Hellaswag calculation (#2389)
* common.h : add hellaswag / remove perplexity-lines

* common.cpp : add hellaswag / remove perplexity-lines

* perplexity.cpp : add hellswag scores / remove perplexity-lines

* perplexity.cpp : clean up

* common.h : change default param value

* common.cpp : Change default param

* perplexity.cpp : alter wording

* common.h : alter wording

* common.cpp : alter wording
2023-07-28 21:25:36 +03:00
Georgi Gerganov d73b8d48b4
examples : fix whitespace 2023-07-28 21:05:08 +03:00
nhamanasu 34ae1caf7f
examples : server chat mode with llama2 (#2400)
* add: server chat mode with llama2

* fix: remove the unnecessary last \n
2023-07-28 21:02:10 +03:00
Weird Constructor d91f3f0c55
readme : fix the description of the Tail free sampling (TFS) method (#2431) 2023-07-28 11:44:43 +03:00
Rand Xie 65cdf34bdc
llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433) 2023-07-28 11:42:53 +03:00
Kawrakow eb542d3932
Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 18:35:53 +03:00
Xiao-Yong Jin 0c06204fb3
main : add --in-prefix-bos to prefix BOS to user inputs; keep EOS (#2304)
* add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS

The BOS precedes the string specified by `--in-prefix`.
Model generated EOS is now kept in the context.

It provides a way to strictly following the prompt format used in
Llama-2-chat.

The EOS handling also benefits some existing finetunes that uses
EOS to mark the end of turn.

* examples/common: move input_prefix_bos to other bools
2023-07-25 15:19:11 +03:00
slaren d5512b782b
server: add rms_norm_eps parameter (#2380) 2023-07-25 12:36:17 +03:00
Henri Vasserman c798308e3a
[Server] Escape HTML in webchat (#2368)
* escape HTML in webchat
* add amp
2023-07-25 10:27:34 +03:00
slaren 41c674161f
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci
2023-07-24 17:57:12 +02:00
Aarni Koskela b3f138d058
Chat UI extras (#2366)
* makefile: correct deps for server

* server: tighten settings layout a little

* server: expose all currently configured generation params in UI

* server: expose remaining generation params, for the adventurous

* server: embetter mirostat fields
2023-07-24 17:54:22 +03:00
Evan Jones 84e09a7d8b
llama : add grammar-based sampling (#1773)
* llama, main : constrain sampling to grammar

* allow loading grammar from file

* fix whitespace errors

* handle & print parser errors

* add comments to grammar syntax and allow newlines where unambiguous

* add missing include

* support alternates in root rule

* fix bugs with empty token and EOS

* adjust JSON grammar

* remove swp file

* rewrite ternary expressions

Co-authored-by: Henri Vasserman <henv@hot.ee>

* use struct for grammar elements and add Unicode support

* add unicode escapes

* add inverse char ranges

* only sample full tokens (no peeking or truncation)

* llama : minor style changes

blindly applied in online editor - hopefully I didn't break something

* update help text

* add warning message if EOS is disabled

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 23:58:10 -04:00
IgnacioFDM 4f06592cc6
Add gqa parameter support to the server (#2351)
* Add gqa parameter support to the server
* Change help from stderr to stdout
2023-07-23 23:31:17 +03:00
wzy 57921ca6db
common : n_threads == -1 uses std:🧵:hardware_concurrency() (#2347)
* Fix #2345, fix incorrect n_threads

* Update examples/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 16:33:02 +03:00
Georgi Gerganov e76d630df1
llama : grouped-query attention + LLaMAv2 70B support (#2276)
* CUDA: GQA implementation

* llama : support for GQA and LLaMAv2 70B

ggml-ci

* py : fix hparams parsing (if-else blocks)

ggml-ci

* py : oh boy ..

ggml-ci

* help : fix gqa value for 70B

ggml-ci

---------

Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
maddes8cht 1d0824b247
llama : print help to stdout (#2338) 2023-07-23 14:59:48 +03:00
AustinMroz 355c80f49e
examples : simplify vim plugin (#2327)
Uses builtin json_encode and json_decode functions to simplify escaping
Removes the need for temp files
2023-07-23 14:16:48 +03:00
Georgi Gerganov b47b8a9cfe
llama : optimize memory buffers (#2325) 2023-07-22 21:17:57 +03:00
klosax b5fe67f8c6
Perplexity: Compute scores correlated to HellaSwag (#2312)
* Add parameter --perplexity-lines to perplexity.cpp
2023-07-22 14:21:24 +02:00
whoreson 24baa54ac1
examples : basic VIM plugin
VIM plugin for server exe
2023-07-22 13:34:51 +03:00
Richard Roberson 7d5f18468c
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311)
* Resync my fork with new llama.cpp commits

* examples : rename to use dash instead of underscore

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 22:01:10 +03:00
Ikko Eltociear Ashimine 03e566977b
examples : fix typo in minigpt4.py (#2298)
promt -> prompt
2023-07-21 14:53:07 +03:00
Georgi Gerganov 513f861953
ggml : fix rope args order + assert (#2054) 2023-07-21 14:51:34 +03:00
Guillaume "Vermeille" Sanchez ab0e26bdfb
llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) 2023-07-21 13:58:36 +03:00
Jose Maldonado 73643f5fb1
gitignore : changes for Poetry users + chat examples (#2284)
A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thx TheBloke for models)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:53:27 +03:00
Georgi Gerganov ae178ab46b
llama : make tensor_split ptr instead of array (#2272) 2023-07-21 13:10:51 +03:00
Hatsune Miku 019fe257bb
MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287)
* Miku.sh: Set default model to llama-2-7b-chat

* Miku.sh: Set ctx_size to 4096

* Miku.sh: Add in-prefix/in-suffix opts

* Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements
2023-07-21 11:13:18 +03:00
Przemysław Pawełczyk 9cf022a188
make : fix embdinput library and server examples building on MSYS2 (#2235)
* make : fix embdinput library and server examples building on MSYS2

* cmake : fix server example building on MSYS2
2023-07-21 10:42:21 +03:00
wzy b1f4290953
cmake : install targets (#2256)
fix #2252
2023-07-19 10:01:11 +03:00
Georgi Gerganov d01bccde9f
ci : integrate with ggml-org/ci (#2250)
* ci : run ctest

ggml-ci

* ci : add open llama 3B-v2 tests

ggml-ci

* ci : disable wget progress output

ggml-ci

* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations

ggml-ci

* tests : try to fix tail free sampling test

ggml-ci

* ci : add K-quants

ggml-ci

* ci : add short perplexity tests

ggml-ci

* ci : add README.md

* ppl : add --chunks argument to limit max number of chunks

ggml-ci

* ci : update README
2023-07-18 14:24:43 +03:00
Georgi Gerganov 6cbf9dfb32
llama : shorten quantization descriptions 2023-07-18 11:50:49 +03:00
Xiao-Yong Jin 6e7cca4047
llama : add custom RoPE (#2054)
* Implement customizable RoPE

The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

with the default matches the original

scale = 1.0
base = 10000

The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.

Recent researches show changing these two parameters extends the context limit with minimal loss.

1. Extending Context to 8K
   kaiokendev
   https://kaiokendev.github.io/til#extending-context-to-8k

2. Extending Context Window of Large Language Models via Positional Interpolation
   Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
   https://arxiv.org/abs/2306.15595

3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
   https://www.reddit.com/user/bloc97
   https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5

* ggml-metal: fix custom rope

* common: fix argument names in help

* llama: increase MEM_REQ_EVAL for MODEL_3B

It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.

* llama: make MEM_REQ_EVAL depend on n_ctx

* server: use proper Content-Type in curl examples

Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded

Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192

With Content-Type: application/json, we can send large json data.

* style : minor fixes, mostly indentations

* ggml : fix asserts

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 13:34:16 +03:00
Shangning Xu c48c525f87
examples : fixed path typos in embd-input (#2214) 2023-07-14 21:40:05 +03:00
Howard Su 32c5411631
Revert "Support using mmap when applying LoRA (#2095)" (#2206)
Has perf regression when mlock is used.

This reverts commit 2347463201.
2023-07-13 21:58:25 +08:00
Spencer Sutton 5bf2a27718
ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178)
* Add ggml changes

* Update train-text-from-scratch for change

* mpi : adapt to new ggml_tensor->src

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 19:31:10 +03:00
Bach Le c9c74b4e3f
llama : add classifier-free guidance (#2135)
* Initial implementation

* Remove debug print

* Restore signature of llama_init_from_gpt_params

* Free guidance context

* Make freeing of guidance_ctx conditional

* Make Classifier-Free Guidance a sampling function

* Correct typo. CFG already means context-free grammar.

* Record sampling time in llama_sample_classifier_free_guidance

* Shift all values by the max value before applying logsoftmax

* Fix styling based on review
2023-07-11 19:18:43 +03:00
Howard Su 2347463201
Support using mmap when applying LoRA (#2095)
* Support using mmap when applying LoRA

* Fix Linux

* Update comment to reflect the support lora with mmap
2023-07-11 22:37:01 +08:00
Evan Miller 5656d10599
mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
Nigel Bosch db4047ad5c
main : escape prompt prefix/suffix (#2151) 2023-07-09 11:56:18 +03:00
Qingyou Meng 1d656d6360
ggml : change ggml_graph_compute() API to not require context (#1999)
* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287

* rewrite: no longer consider backward compitability; plan and make_plan

* minor: rename ctx as plan; const

* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward

* add static ggml_graph_compute_sugar()

* minor: update comments

* reusable buffers

* ggml : more consistent naming + metal fixes

* ggml : fix docs

* tests : disable grad / opt + minor naming changes

* ggml : add ggml_graph_compute_with_ctx()

- backwards compatible API
- deduplicates a lot of copy-paste

* ci : enable test-grad0

* examples : factor out plan allocation into a helper function

* llama : factor out plan stuff into a helper function

* ci : fix env

* llama : fix duplicate symbols + refactor example benchmark

* ggml : remove obsolete assert + refactor n_tasks section

* ggml : fix indentation in switch

* llama : avoid unnecessary bool

* ggml : remove comments from source file and match order in header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 19:24:01 +03:00
Judd 36680f6e40
convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06 19:23:49 +03:00
tslmy a17a2683d8
alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
Tobias Lütke 31cfbb1013
Expose generation timings from server & update completions.js (#2116)
* use javascript generators as much cleaner API

Also add ways to access completion as promise and EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson 983b555e9d
Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
Stephan Walter 1b107b8550
ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson 8567c76b53
Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
Nigel Bosch 7f0e9a775e
embd-input: Fix input embedding example unsigned int seed (#2105) 2023-07-05 07:33:33 +08:00
jwj7140 f257fd2550
Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
2023-07-04 21:06:12 +03:00
Tobias Lütke 7ee76e45af
Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

because web browsers send a lot of garbage requests we want the server
to multithread when serving 404s for favicon's etc. To avoid blowing up
llama we just take a mutex when it's invoked.


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04 16:05:27 +02:00
Henri Vasserman 1cf14ccef1
fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
WangHaoranRobin d7d2e6a0f0
server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
2023-07-03 00:38:44 +03:00
Georgi Gerganov 79f634a19d
embd-input : fix returning ptr to temporary 2023-07-01 18:46:00 +03:00
Georgi Gerganov 04606a1599
train : fix compile warning 2023-07-01 18:45:44 +03:00
Howard Su b8c8dda75f
Use unsigned for random seed (#2006)
* Use unsigned for random seed. Keep -1 as the value to use a time based seed.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-29 06:15:15 -07:00
Johannes Gäßler 7f9753fa12
CUDA GPU acceleration for LoRAs + f16 models (#1970) 2023-06-28 18:35:54 +02:00
ningshanwutuobang cfa0750bc9
llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add READMD for llava.py

* add READMD for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
Howard Su 0be54f75a6
baby-llama : fix build after ggml_rope change (#2016) 2023-06-27 08:07:13 +03:00
Georgi Gerganov 181e8d9755
llama : fix rope usage after ChatGLM change 2023-06-27 00:37:33 +03:00
David Yang eaa6ca5a61
ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)
* Clean up compiler warnings in train-text

Some brackets to disambiguate order of operations

* Increase GGML_MAX_NAME

Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
2023-06-26 22:45:32 +03:00
zrm b853d45601
ggml : add NUMA support (#1556)
* detect NUMA systems and pin work threads to nodes (linux)

* disable mmap prefetch/readahead for NUMA systems

* avoid sending finalize op to thread pool if it does nothing

* silence robot

* fix args

* make --numa a param

* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

* lower synchronization overhead

* statically allocate

* move numa state to g_state

* add description for --numa

* ggml : minor style changes

* ggml : minor style + try fix sanitizer build

* llama : allow to initialize backend with NUMA support

* llama : avoid ggml include in llama-util.h

* ggml : style / formatting

* ggml : fix handling of ops with n_threads > n_tasks > 1

* server : utilize numa parameter

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26 20:57:59 +03:00
anon998 c2a08f87b8
fix server sampling: top k sampler first (#1977)
Co-authored-by: anon <anon@example.org>
2023-06-25 10:48:36 +02:00
Didzis Gosko 527b6fba1d
llama : make model stateless and context stateful (llama_state) (#1797)
* llama : make model stateless and context stateful

* llama : minor cleanup

* llama : update internal API declaration

* Apply suggestions from code review

fix style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Missing model memory release

* Fix style

* Add deprecated warning for public API function llama_init_from_file

* Update public API use cases: move away from deprecated llama_init_from_file

* Deprecate public API function llama_apply_lora_from_file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 11:47:58 +03:00
Henri Vasserman 20568fe60f
[Fix] Reenable server embedding endpoint (#1937)
* Add back embedding feature

* Update README
2023-06-20 01:12:39 +03:00
Kawrakow 90cc59d6ab
examples : fix examples/metal (#1920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 10:52:10 +03:00
Georgi Gerganov 4f9c43e3bd
minor : warning fixes 2023-06-17 20:24:11 +03:00
Johannes Gäßler 2c9380dd2f
Only one CUDA stream per device for async compute (#1898) 2023-06-17 19:15:02 +02:00
Georgi Gerganov 051e1b0e6a
llama : fix kv_cache n init (close #1903) 2023-06-17 19:31:20 +03:00
Randall Fitzgerald 794db3e7b9
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.

Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.

This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.

Summary of the changes:

- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict 
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables

---------

Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 14:53:04 +03:00
Jiří Podivín 5ddf7ea1fb
hooks : setting up flake8 and pre-commit hooks (#1681)
Small, non-functional changes were made to non-compliant files.
These include breaking up long lines, whitespace sanitation and
unused import removal.

Maximum line length in python files was set to a generous 125 chars,
in order to minimize number of changes needed in scripts and general
annoyance. The "txt" prompts directory is excluded from the checks
as it may contain oddly formatted files and strings for a good reason.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-06-17 13:32:48 +03:00
David Yang 92f20d9942
train : get raw text instead of page with html (#1905)
We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.
2023-06-17 09:51:54 +03:00
SuperUserNameMan b41b4cad6f
examples : add "simple" (#1840)
* Create `simple.cpp`

* minimalist example `CMakeLists.txt`

* Update Makefile for minimalist example

* remove 273: Trailing whitespace

* removed trailing white spaces simple.cpp

* typo and comments simple.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-16 21:58:09 +03:00
FrankHB 5b9ccaf104
Fixed possible macro redefinition (#1892)
MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.
2023-06-16 21:25:01 +03:00
Borislav Stanimirov 9cbf50c041
build : fix and ignore MSVC warnings (#1889) 2023-06-16 21:23:53 +03:00
yangli2 c36e81da62
examples : add chat-vicuna.sh (#1854)
Co-authored-by: Yang Li <yangliyl@google.com>
2023-06-15 21:05:53 +03:00
Srinivas Billa 9dda13e5e1
readme : server compile flag (#1874)
Explicitly include the server make instructions for C++ noobsl like me ;)
2023-06-15 20:36:38 +03:00
Johannes Gäßler 6b8312e797
Better error when using both LoRA + GPU layers (#1861) 2023-06-15 19:06:46 +02:00
Johannes Gäßler 254a7a7a5f
CUDA full GPU acceleration, KV cache in VRAM (#1827)
* Fixed CUDA RoPE

* ggml_cuda_mul_mat_vec_p021

* ggml_cuda_scale

* ggml_cuda_diag_mask_inf

* ggml_is_permuted

* ggml_cuda_cpy

* flatten rows for ggml_cuda_op

* Added a --low-vram option

* Fixed Windows performance

* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
2023-06-14 19:47:19 +02:00
0xspringtime 9254920265
baby-llama : fix operator!= (#1821)
* Update baby-llama.cpp

Seems to be an error in the implementation of the operator!= function. It attempts to compare the this pointer (a llama_hparams_lora object) with the other pointer (a llama_hparams object) using memcmp. This can lead to incorrect results because the sizes of the objects being compared (sizeof(llama_hparams) and sizeof(llama_hparams_lora)) are different, should now be able to compare two llama_hparams_lora objects for inequality.

* Update baby-llama.cpp

* Update baby-llama.cpp
2023-06-13 22:37:54 +03:00
xaedes e32089b2c2
train : improved training-from-scratch example (#1652)
* add python wrapper

https://gist.github.com/abetlen/2b90e5f153f6efd00931d098de5c73ce

* fix decoding error. adds errors=ignore parameter

* add python bindings for functions to get and set the whole llama state
(rng, logits, embedding and kv_cache)

* update python bindings

* add text generating baby-llama from scratch example

* fix race condition bug in ggml_compute_forward_diag_mask_f32

* implement ggml_soft_max_back for more performant backward pass of soft_max

avoids creating big intermediate matrices of size n_embd x n_embd for llama layers and n_vocab x n_vocab for cross entropy loss

* improve softmax backward pass

go from quadratic runtime to linear runtime by simplifying the formulas

* fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32

memcpy needs to be synchronized across threads to avoid race conditions.
=> do it in INIT phase

* fix bug in ggml_compute_forward_soft_max_back_f32 on DEBUG build

* improve performance of mul_mat backward pass

avoid transpose by using mul_mat with swapped arguments

* avoid printing too much newlines in baby-llama-text

* activate threading in baby-llama-text

* add ggml_out_prod and use it for mul_mat backward pass for improved performance

performance stats report improvement from 37 seconds to 16 seconds runtime during my training tests

* better weight initialization improves training convergence at start

* better weight initialization improves training convergence at start

* improve ggml_out_prod performance

- change iteration order (>15s -> 10s runtime)
- parallelize over one more dimension: over dst matrix rows (10s -> <5s runtime)

* add llama sampler, shuffle samples and constrain sampling to tokens occurring in train data

* fix get_samples call, add model tensor names, increase model size, start training samples after newline

* save train trained model to checkpoint and load model to be trained from checkpoint

* use inplace functions where possible

* initialize rng with srand

* use different arguments for input and output checkpoint

* ggml fixes to support backward pass on inplace operations

* remove duplicate include

* fix cross entropy loss

- add target probabilities for each sample which is then used in cross entropy loss

* print used memory before and after optimization

* sample with non-greedy sampling parameters at the end of training

* add cmake target for baby-llama-text

* add ggml_add1_inplace to header

* enable gradient propagation for inplace add1 and scale operations

those functions backward passes don't need the original src0, so they also work when forward is inplace

* implement AdamW in ggml_opt_adam by adding weight decay parameter (default 0.001f)

also add a schedule parameter (default 1.0f) that can be used to scale alpha and decay according to learning schedule.
setting the decay parameter to zero disables AdamW resulting in normal Adam optimizer.

since the difference between Adam and AdamW is minimal it is not implemented as another optimizer, but integrated into the existing Adam optimizer.

* use inplace operations in cross_entropy_loss

* fix random weight initialization scale

* add missing default parameters for adam optimizer

* add ggml_opt_context, so that we can properly resume training

otherwise the optimizer states, tracking statistics about the error function and its derivates,
will reset to zero each time ggml_opt is called, hindering convergence on resumed training.

now the optimizer context and all its memory is stored in a separate struct.

* fix bug in llama_sample_token_mirostat_v2

when all candidates are filtered out through mu threshold, the following soft_max operation will fail.
so keep at least one.

* add forward function without using cache, for more performant training

during training on whole samples no cache is required.
removing the cache and simplifying the remaining code results in performance and memory usage improvement.

* print suppressed newline tokens as string "\n"

printing too much actual newlines is suppressed to avoid flooding the console.

* store optimizer state in training checkpoint and add learning schedule

persistent optimizer state allows to resume training without resetting the optimizer
learning schedule consists of linear warmup ramp followed by cosine decay with restarts

* remove unused functions

* fix bug in get_samples which corrupted training targets

* save checkpoint only when it was trained

* simplify code

* remove trailing whitespace

* simplify backward pass for SQRT

* replace inefficient repeat backward pass with dedicated repeat_back operation

* add ggml_cross_entropy_loss with backward pass for faster training

cross entropy loss can also be implemented using softmax and log, but as dedicated operation it is faster and especially avoids unnecessary memory overhead.

* add tests for cross_entropy_loss backward pass

finite differences regularly results in estimated gradient of zero, despite the backward pass giving non zero gradient.
_probably_ the finite differences fails due to numerical issues

* use ggml_cross_entropy_loss in text training example

* remove trailing whitespace

* slightly improve how cross entropy loss is compute

btw: directly implemented cross entropy loss seems to have way lower magnitudes than when implemented with softmax and log.
probably the input to log gets closer to zero due to float numerics.
maybe the multiplication by (1.0-eps)/sum is more accurate..

* add llama_get_vocab to get the vocabulary as output parameters

* set default model.type for unknown models with few layers

* add export of training checkpoint to llama compatible model file

* get vocabulary for exporting training checkpoint to llama compatible model file

* implement backward pass of flash attention

* bugfixes for backward pass of flash attention

* test flash attention backward pass

need to set loose error bounds to pass.
the finitie differences are close to numeric limits and often return quite different values than the backward pass.
reducing eps further lets the gradients vanish completely.
likewise setting eps to big results in wronger values.
the softmax in the middle of the function is probably the most responsible for the numeric issues using finite differences.

* add option to train with flash attention and move options to the top of the main function

training from scratch also works with flash attention
training convergence and generation results after fix number of iterations are worse than when not using flash attention.
maybe there still lingers a bug in the flash attention backward pass?
but training works, just with slower convergence.

flash attention is still worth to use, because it requires way less memory and is faster with high n_ctx

* add train_params and command line option parser

* remove unnecessary comments

* add train params to specify memory size

* remove python bindings

* rename baby-llama-text to train-text-from-scratch

* replace auto parameters in lambda function

* add #include <climits>

* add explicit cast to fix compile error

"error: non-constant-expression cannot be narrowed from type 'int64_t' (aka 'long long') to 'uint32_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]"

* remove trailing whitespace

* add ggml_opt_resume_g which accepts forward and backward cgraphs

* fix formulas in comments

* bug fix for ggml_compute_forward_get_rows_back_f32

the result should be set to zero, not to whatever data is in opt0

* improve training memory usage with scratch buffers

instead of relying on the automatic backward pass, we manually create the graph for the backward pass.
it turns out that all backward pass operations need only temporary memory which can be reused after each layer.

will compute backward pass for ALL model parameters

* add option to use scratch buffers in training or not

make it configurable because currently training with scratch buffers implies flash attention and optimization over all parameters.

* ci : disable temporary

* store view offset and permute axes in opt[0] instead of storing it in padding

use memcpy to store offset, because offset is of type size_t.
when storing it as int32_t offset would have to be smaller than 2^31 which is not necessarily true.

* minor : fix compile warnings + minor style changes

* fix bug in threaded indices calculation of ggml_compute_forward_flash_attn_back_f32

* store view offset like in master branch

* bug fix in forward_batch_wo_cache_flash_attn_train

* scratch buffer bug fixes in forward_batch_wo_cache_flash_attn_train

data of permute and reshape is the same as their input.
if we want to preserve the output of permute/reshape, we also need to preserve their inputs.

replace reshape(src0, src1) with reshape_nd calls so that we don't need src1.

replace (temporary) t03 with ggml_repeat(ctx0, layer.attention_norm, t02).
in the future we could also use the new broadcasting ggml_mul to avoid these repeat calls.
for this we need backward pass of broadcasting ggml_mul.

* remove unnecessary scratch buffer 0

buf 0 is persistent memory, so we can just disable scratch for this by using buf -1

* avoid creating unnecessary grad tensors

previously we need to create grads for model parameters, so that expand(..) correctly populates cgraph->leafs & cgraph->grads
this wasted memory, because unnecessary grad for each op were automatically created:
the automatically generated grad was unnecessary because we later manually set the grad (e.g. t35->grad = expand(gb, ...) ).
this discarded the automatically generated grad resulting in wasted memory.

improved this by changing expand(..) to not use ggml_build_forward_expand.
expand set cgraph->nodes but not the leafs.
cgraph->leafs & cgraph->grads are set in another pass after the last expand call.

* print used training seed

* zero initialize gfbuf and gbbuf

* ci : re-enable workflows + add README for training

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-13 22:04:40 +03:00
Georgi Gerganov 2347e45e7b
llama : do a warm-up eval at start for better timings (#1824) 2023-06-13 20:20:07 +03:00
Kerfuffle 74d4cfa343
Allow "quantizing" to f16 and f32 (#1787)
* Allow "quantizing" to f16 and f32

Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS

Add brief help to the list of quantization types in the quantize tool

Ignore case for quantization type arguments in the quantize tool
2023-06-13 04:23:23 -06:00
Kerfuffle fa84c4b3e8
Fix issue where interactive mode crashes when input exceeds ctx size (#1789)
* Fix issue where interactive mode in the main example crashes when input exceeds ctx size

* Ensure the context size is at least 8 tokens in the main example.

Closes #1768
2023-06-11 08:19:17 -06:00
Kerfuffle 4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
Willy Tarreau 35a84916fb
main: add the possibility to open the prompt cache read-only (#1640)
The prompt cache constitutes a nice speed up when using the same prompt
prefix across multiple evaluations, but when using it, it will also be
updated, which is not always desirable. One use case is to have a large
prompt containing some context and usage rules, and a second part
containing variable data of the problem being studied. In this case it's
desirable to be able to save the first part once, and to always reuse it
as-is without updating it with the second part.

The new argument --prompt-cache-ro enables this read-only mode on the
prompt cache. The prompt's contents that match the cache are loaded
from the cache but the rest is not modified. This allowed to reduce a
total analysis time from 112s to 49.7s here, without having to backup
and restore a copy of the prompt, which takes significant time at 500
MB.

Signed-off-by: Willy Tarreau <w@1wt.eu>
2023-06-06 22:10:17 -04:00
Johannes Gäßler 17366df842
Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)
* CUDA multi GPU + scratch

ggml_cuda_compute_forward

Tensor parallelism

ggml_cuda_add

ggml_cuda_rms_norm

ggml_cuda_silu

CUDA scratch buffer

--main-gpu CLI option
2023-06-06 21:33:23 +02:00
Kawrakow 99009e72f8
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml

I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.

* Adding Q3_K and Q8_K (de)-quantization

* Q3_K now working on CUDA and AVX2/scalar

CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).

* Some improvement for Q3_K on CUDA

It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.

* Some more CUDA optimizations for Q3_K

Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.

* Adding Q4_K - scalar, AVX2, CUDA

Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity is about the same).

* Adding Q6_K - scalar, AVX2, CUDA

Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).

* Adding Q5_K - scalar, AVX2, CUDA

Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, single token prediction is about the same as Q4_0
for both, single token and batch prediction.

* Per convention, all QX_K quantizations use Q5_K for output.weight

* Adding quantization mixes

* Quantization mixes: didn't quite get what I wanted in the last commit

* Q4_K dot product for ARM_NEON

* Q6_K dot product for ARM_NEON

* Q5_K dot product for ARM_NEON

* Adding Q3_K dot for ARM_NEON

It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.

* A very slightly faster ARM_NEON Q3_K dot

* Adding Q2_K - just CUDA for now

Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.

* Adding scalar and AVX2 Q2_K dot

* Adding ARM_NEON Q2_K dot

About the same performance as Q4_K.

* A slightly faster ARM_NEON Q2_K dot

Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.

* Fixed bug in Q2_K CUDA dot product kernel

Stranegly enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonse back.

In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X
box iusing CUDA and model fully loaded on the GPU are
  ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for The 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).

* Don't print zeros/NaNs when no count histogram has been collected

* A 10% faster CUDA vector dot kernel for Q3_K

Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory acccess pattern is more important for
performance than the amount of computation the kernel
does.

* A slightly daster Q4_K AVX2 dot product

For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.

* A slightly faster ARM_NEON A4_K dot product

* Minor

* Fix quantization error test

We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.

* Fix docker build

I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.

* Added forgotten ggml.o dependence on k_quants.h to the Makefile

* Had unintentionally committed the Makefile with -Ofast enabled

* ggml : rename k_quants -> ggml-quants-k, use lowercase in code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 22:56:18 +03:00
Georgi Gerganov ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + suport scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
Evan Jones 136476e898
Fix prompt cache saving and chat-persistent rollover (#1678)
* Fix prompt cache saving and chat-persistent rollover (fixes #1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-06-03 07:28:45 -04:00
DannyDaemonic 248367605e
Work around for recalculating logits in cached prompts (Fixes #1585) (#1609)
* Work around for recalculating logits in cached prompts
2023-05-29 05:13:40 -07:00
Kerfuffle 1b78ed2081
Only show -ngl option when relevant + other doc/arg handling updates (#1625)
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
2023-05-28 11:48:57 -06:00
Vladimir Zorin 337aea1139
examples : add --alias option to gpt_params to set use friendly model name (#1614) 2023-05-28 20:14:24 +03:00
Kerfuffle 0df7d63e5b
Include server in releases + other build system cleanups (#1610)
Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets build. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases.

Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default)

Fix issue where `vdot` binary wasn't removed when running `make clean`.

Fix compile warnings in `server` example.

Add `.hpp` files to trigger workflow (the server example has one).
2023-05-27 11:04:14 -06:00
Kerfuffle 66874d4fbc
Some improvements to loading the session with --prompt-cache (#1550)
Improvements to loading the session with `--prompt-cache` in the `main` example.

1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.
2023-05-25 20:18:01 -06:00
Senemu 1359b6aba5
chat-persistent.sh : use bracket expressions in grep (#1564) 2023-05-24 09:16:22 +03:00
Steward Garcia 7e4ea5beff
examples : add server example with REST API (#1443)
* Added httplib support

* Added readme for server example

* fixed some bugs

* Fix the build error on Macbook

* changed json11 to nlohmann-json

* removed some whitespaces

* remove trailing whitespace

* added support custom prompts and more functions

* some corrections and added as cmake option
2023-05-21 20:51:18 +03:00
Georgi Gerganov ec2e10c444
llama : add llama_init_backend() API (close #1527) 2023-05-20 11:06:37 +03:00
DannyDaemonic d2c59b8ba4
Fix for mingw (#1462) 2023-05-20 00:40:02 -07:00
Evan Jones 943e6081cc
examples : add persistent chat (#1495)
* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-19 20:39:51 +03:00
Jason McCartney 7694b52b9a
main : make reverse prompt option act as a stop token in non-interactive mode (#1032)
* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1748.

* Update gpt_params_parse and fix a merge error take 2
2023-05-19 20:24:59 +03:00
Georgi Gerganov 4b7e245adf
minor : fix compile warnings 2023-05-19 20:14:51 +03:00
DannyDaemonic ee9654138a
Fixes #1511 lambda issue for w64devkit (mingw) (#1513)
* Fix for w64devkit and mingw
2023-05-18 19:30:40 +02:00
Stephan Walter dc271c52ed
Remove unused n_parts parameter (#1509) 2023-05-17 22:12:01 +00:00
rankaiyx c238b5873a
benchmark-matmul: Print the average of the test results (#1490) 2023-05-17 16:47:58 +02:00
András Salamon 9560655409
define default model path once, sync path with readme (#1366) 2023-05-16 17:46:34 +02:00
zrm 63d20469b8
fix get_num_physical_cores() (#1436)
* fix get_num_physical_cores()
had been broken on complex topologies because "cpu cores" in /proc/cpuinfo is per-"physical id"

* Add spaces to maintain consistent formatting

---------

Co-authored-by: slaren <ddevesa@gmail.com>
2023-05-15 04:25:42 +02:00
slaren b5c9295eef
benchmark-matmul: fix clang-tidy issues, report results in GFLOPS (#1458)
* benchmark-matmul: fix command line parsing, replace macros with functions, report results in GFLOPS
2023-05-14 22:46:00 +02:00
Johannes Gäßler 905d87b70a
ggml : GPU-accelerated token generation (#1412)
* CUDA kernel for q4_0 dequant. + mat. vec. mult.

* Added q4_1 via template

* Added missing __syncthreads();

* --gpu_layers -> --gpu-layers

* Shorter dequantize_mul_mat_vec line

* q5_0 dequantize_mul_mat kernel

* More readable dequantize_mul_mat_vec logic

* dequantize_mul_mat_vec kernels for q5_1, q8_0, f16

* llama : offload "output" tensor to GPU too + coding style fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 16:38:36 +03:00
xaedes f954edda93
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama

- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.

this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.

still missing backward passes for llama:

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

* implement 5 of 6 missing backward pass operations used by llama

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.

Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.

still not completely implemented backward passes for llama:

- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer

* norm & rms_norm can not be threaded:

after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.

* remove already resolved TODO

* implement backward pass of ggml_rope and ggml_rope_back

* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back

* add test-grad0.c

* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console

* test both gradients of mul_mat

* disable graph dot export as it floods console

* bug fixes for silu_back

* successfully test silu backward

* bug fix for scale backward pass

use sum instead of mean for gradient of scalar scale parameter

* successfully test scale backward

* improve performance of sum backward pass

use add1(x,y) instead of add(x,repeat(y,x))

* improve performance of sqr backward pass

use scale(x,y) instead of mul(x,repeat(y,x))

* successfully test rope backward

* bug fix for cpy backward pass

* successfully test cpy backward

* bug fix for reshape backward pass

* successfully test reshape backward

* add test-opt.c

this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c

* correctly implement softmax backward pass using new operation ggml_diag

ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]

* successfully test soft_max backward

* align shape annotations

* add shape annotations for llama

* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.

with this we can duplicate tensor of any typ as long as they are contiguous.

* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads

when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy

* bug fix for add_at forward

required for view backward pass

src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.

* successfully test view backward

* minor code format improvement

* fix ggml_forward_add functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.

* fix ggml_forward_add1 functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.

* test-grad0.c : add print_elements to help with debugging

* successfully test permute backward

* some minor test-grad0 fixes

* fix sub, mul and div functions to work correctly with transposed tensors

uses the same logic as in add

* implement ggml_cont backward pass

* successfully test transpose backward and permute for all permutations

also test sub, mul and div up to max n_dims

* test-grad0.c add TODO for view_2d and view_3d

add_at (required for view backward pass) is a bit tricky for n_dims > 1.

* fix comments

* successfully test diag_mask_inf and diag_mask_zero backward

* test-grad0 : fix test for div

nargs and ndims was swapped, corrupting the stack

* fix diag_mask to work with non-inplace input

* move dup call into the actual add_at functions

* fix get rows backward pass

* successfully test get_rows backward

* fix view backward pass

add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.

* successfully test backward pass of view_1d, view_2d and view_3d

* fix backward pass for rms_norm

I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.

* successfully test backward pass of rms_norm

some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:

rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324

it is due to the test logic in check_gradients that they fail.

* add todos for llama backward pass

- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.

* add operation ggml_sum_rows

ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]

* add missing GGML_OP_SUM_ROWS

* fix backward pass for repeat

requires ggml_sum_rows

* successfully test backward pass of repeat

* update quantization types in switch-case of add_at and add1

* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.

had to increase maximum number of optimization parameters to train from scratch.

* fix softmax in baby-llama example

* switching from training with adam to lbfgs produces much better results in the baby-llama example

* train with two examples, creating new tensors each time..

* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt

when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt

* train on multiple examples, generate & print tokens with trained model afterwards

ctx0 for evaluation and optimization is renewed for each sample

* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d

* fix soft_max backward pass for input->ne[1] != 1

* add ggml_log operation necessary for cross entropy loss

* add test for ggml_log gradients

* implement backward pass for ggml_sum_rows, necessary for cross entropy loss

* implement ggml_repeat support for rank > 2 tensors

* add test for ggml_sum_rows gradients

* fix training get_example_targets

predict the next token, not the current token!

* add square_error_loss and cross_entropy_loss functions

* optimize loss over multiple samples

this increases computation graph, need parallel batched forward for more efficiency.

* fix backward pass for add_at and change arguments to have same order as in view

* add ggml_set(ctx, a, b) to set b in view of a and return modified a

necessary to set values into kv_self cache and properly propagate the gradients

* fix kv_self gradients for training

use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients

* replace inplace operations for training with copying operations to allow gradient propagation

* add GGML_ASSERT to catch ggml_rope and back value errors

* add trainable lora-only model with all big matrices C split into A,B with A*B=C

this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.

training this instead of the normal model resulted in much worse results though...

* vastly improve training results

instead of logit targets 0 and 1 use -1 and +1.

* shorten code using a variable

* change name of GGML_OP_ADD_AT to GGML_OP_ACC

* smaller default values for baby llama model parameters

* update static assert of GGML_OP_COUNT

* remove shape annotations in llama_eval_internal

* revert disabling of threading for rms_norm and norm

* rename print functions in baby-llama example

* fix call to ggml_set_name

* add missing include for strcmp, etc

* remove trailing whitespace

* reduce number of test-grad0 iterations

avoid exceeding timeout of automated tests

* remove busy loop that was used as sleep for slower sinus wave generation

* disable slow tests grad0 and opt to avoid exceeding timeouts

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* ggml : fix compiler warnings + cosmetic changes

* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* ggml : swap vDSP_vsub args as per documentation

* add parallel batched forward function for baby-llama training

* cleanup code for batched training

* remove trailing whitespace

* minor : fix compiler warnings + indentation style

* ggml : fix null ptr deref in backward pass

* ggml : remove Q4_2 remnants

* ggml : fix clang-tidy warnings

* baby-llama : couple of clang-tidy warnings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 15:56:40 +03:00
Rinne 6456a4eb9f
embedding : remove unused code (#1426) 2023-05-13 10:24:20 +03:00
Georgi Gerganov fb62f92433
llama : fix --mtest option (close #1414) 2023-05-12 21:44:20 +03:00
Johannes Gäßler 773ee249fb
CLI args use - instead of _, backwards compatible (#1416) 2023-05-12 14:34:55 +00:00
Georgi Gerganov b9fd7eee57
ggml : remove bit shuffling (#1405)
* ggml : remove Q4_0 bit shufling (ARM NEON)

* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)

* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)

* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)

* ggml : remove Q5_0 bit shuffling (ARM NEON)

* ggml : 2x faster scalar implementations

* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)

* ggml : simplify scalar dot

* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit

* ggml : fix Q4_1 quantization

* ggml : update cuBLAS + normalize variable names

* ggml : remove Q4_2 mode

* ggml : minor formatting

* ggml : fix Q5_0 quantization

* scripts : add script for measuring the time per token

* AVX implementations (#1370)

* ggml : uniform 5th bit extraction

* llama : produce error upon loading old model files

* llama : fix model magic/version write

* ggml : speed-up Q5_0 + Q5_1 at 4 threads

* ggml : preserve old Q4 and Q5 formats

* ggml : simplify Q8_1 - no need for low / high sums anymore

* ggml : fix Q8_0 and Q8_1 rounding

* Revert "AVX implementations (#1370)"

This reverts commit 948d124837.

* ggml : fix AVX2 implementation

* sha : update hashes for 7B and 13B

* readme : update timings + remove warning banner

* llama : update v2 PR number to 1405

* ggml : fix WASM comments

* ggml : back to original bit order

* readme : add note that Q4 and Q5 have been changed

* llama : fix return for unknown version

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
Evan Jones cf348a60e0
main : add option to save full output to session (#1338)
* main : add option to save full output to session

* split behavior into --session and --prompt-cache

* restore original implementation with new names

* PR comments

* move the check for incompatible parameters to gpt_params_parse

* Fix whitespace

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

---------

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
2023-05-10 11:37:14 -04:00
DannyDaemonic e6a46b0ed1
Locale fix for Windows (#1379) 2023-05-09 19:53:28 +02:00
DannyDaemonic 41654efea8
Interface improvements and --multiline-input (previously --author-mode) (#1040)
* Interface improvements
* Multiline input
* Track character width
* Works with all characters and control codes + Windows console fixes
2023-05-08 19:45:48 -07:00
Georgi Gerganov f9a6364912
llama : require first token to be BOS (#1303)
* llama : require first token to be BOS

* scripts : add ppl-run-all.sh

* perplexity : add BOS for each chunk

* readme : update perplexity values after BOS fix

* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
Johannes Gäßler 1f48b0abcf
Documented CUDA reproducibility, added warning (#1346) 2023-05-08 02:42:01 +02:00
Jed Fox 3924088512
Remove default arguments from sampling functions (#1343) 2023-05-06 17:01:47 -04:00
slaren 94c5652fc0
quantize: make output filename optional, default to ggml-model-<ftype>.bin (#1301) 2023-05-05 00:58:56 +02:00
44670 2edbdb0f99
main : add --in-suffix option (#1318)
* adding --in-suffix option

* print input suffix before generation
2023-05-04 18:41:12 +03:00
DannyDaemonic db1080876a
Only escape prompts when used with -e (#1311) 2023-05-04 05:08:25 -07:00
DannyDaemonic c65a7fbfa9
Update main's README.md with new features (#1296) 2023-05-04 03:02:59 -07:00
Tomas f647ce040f
fix #1224 reverse prompt and multi line (#1297)
* fix reverse prompt and multi line

* Code Formatting

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-04 03:02:30 -07:00
khimaros 6daa09d879
examples : read chat prompts from a template file (#1196) 2023-05-03 20:58:11 +03:00
CRD716 a8a2efdc81
examples : various prompt and example fixes (#1298)
* fix dan.txt

* miku prompt improvements

* use common characters
2023-05-03 18:26:47 +03:00
DannyDaemonic 2485d7a4d3
Process escape sequences given in prompts (#1173) 2023-05-02 18:46:20 -07:00
DannyDaemonic 13b0c68ed7
Handle signals properly on Windows (#1123) 2023-05-02 18:01:57 -07:00
slaren bf4b22ffe4
fix missing parameters in llama_init_from_gpt_params (#1293) 2023-05-03 01:36:45 +02:00
Ron Evans 67c77799e0
examples : add llama_init_from_gpt_params() common function (#1290)
Signed-off-by: deadprogram <ron@hybridgroup.com>
2023-05-02 23:39:51 +03:00
Georgi Gerganov 0e6cbff1b7
llama : fix compile warnings 2023-05-02 23:09:08 +03:00
Ron Evans 8c9be35ff9
examples : improve vertical alignment of a few variables (#1286)
Signed-off-by: deadprogram <ron@hybridgroup.com>
2023-05-02 20:53:52 +03:00
Robert Brisita 2bb992f034
llama : allow 0 as a seed number. (#1275) 2023-05-02 19:23:44 +03:00
Ron Evans e2cd506999
main : switch input_noecho to input_echo to remove negation (#979)
Signed-off-by: deadprogram <ron@hybridgroup.com>
2023-05-02 19:13:26 +03:00
DannyDaemonic f4cef87edf
Add git-based build information for better issue tracking (#1232)
* Add git-based build information for better issue tracking

* macOS fix

* "build (hash)" and "CMAKE_SOURCE_DIR" changes

* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages

* Fix conditional dependency on missing target

* Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile

* 4 space indenting for cmake, attempt to clean up my mess in Makefile

* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it
2023-05-01 18:23:47 +02:00
Georgi Gerganov 70269cae37
llama : fix session load / save (#1263) 2023-05-01 14:54:59 +03:00
jon-chuang a5d30b1f53
common : better default number of threads (#934)
* commit

* fix

* try-catch

* apply code review

* improve

* improve

* add macos headers

* done

* remove color

* fix windows

* minor

* fix

* Apply suggestions from code review

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

* remove

* minor

* minor

---------

Co-authored-by: jon-chuang <jon-chuang@users.noreply.github.com>
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
2023-04-30 21:41:35 +03:00
Stephan Walter f0d70f147d
Various fixes to mat_mul benchmark (#1253) 2023-04-30 12:32:37 +00:00
Georgi Gerganov 305eb5afd5
build : fix reference to old llama_util.h 2023-04-29 13:53:12 +03:00
Georgi Gerganov 84ca9c2ecf
examples : fix save-load-state + rename llama-util.h 2023-04-29 13:48:11 +03:00
Georgi Gerganov 334637e43e
common : change default parameters to pre-#1126 (#1223) 2023-04-29 09:51:06 +03:00
Ivan Stepanov dd7eff57d8
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.

New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat

Ignore EOS fix: -inf should be used.

* mirostat

* Added --logit-bias and --no-penalize-nl, removed std::span

* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)

Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)

* Save and load example adjust

* Tests

* Windows build fix

* Windows test fix
2023-04-29 08:34:41 +03:00
Stephan Walter 36d19a603b
Remove Q4_3 which is no better than Q5 (#1218) 2023-04-28 23:10:43 +00:00
CRD716 5fba3c016b
examples : add Jeopardy example (#1168)
* Basic Setup

* Prevent Results.txt from coming up

* Prefixes, Line separators, etc

* editorcheck

* introduction to give more consistent results

* Basic graph thing

* Grading, ready for testing!

* Y'all ready to get funky?

* fix column removal stuff

* missed a few
2023-04-28 19:13:33 +03:00
Evan Jones 1481a9cf25
llama : add session file format and saved sessions in main (#1169) 2023-04-28 18:59:37 +03:00
Georgi Gerganov 574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
* ggml : add Q5_0 quantization (cuBLAS only)

* ggml : fix Q5_0 qh -> uint32_t

* ggml : fix q5_0 histogram stats

* ggml : q5_0 scalar dot product

* ggml : q5_0 ARM NEON dot

* ggml : q5_0 more efficient ARM NEON using uint64_t masks

* ggml : rename Q5_0 -> Q5_1

* ggml : adding Q5_0 mode

* quantize : add Q5_0 and Q5_1 to map

* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-26 23:14:13 +03:00
Pavol Rusnak 859fee6dfb
quantize : use map to assign quantization type from string (#1191)
instead of `int` (while `int` option still being supported)

This allows the following usage:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`

instead of:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
2023-04-26 18:43:27 +02:00
Georgi Gerganov 7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)

* tests : fix test-quantize-fns

* ggml : finalize Q8_0 implementation

* ggml : use q4_0_q8_0 and q4_2_q8_0

* ggml : fix Q8_0 dot product bug (ARM)

* ggml : Q8_0 unroll x2

* ggml : fix bug - using wrong block type

* ggml : extend quantize_fns_t with "vec_dot_type"

* ggml : fix Q8_0 to use 255 values out of 256

* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
xaedes 0c5692345d
examples : add save_load_state example (#1150)
* add save_load_state example

* use <cstdio> instead of <iostream> and fprintf / printf instead of cout

* renamed save-load-state example files replacing underscores by dashes
2023-04-24 19:23:31 +03:00
mgroeber9110 9b0a4d4214
examples/main README improvements and some light refactoring (#1131) 2023-04-24 15:45:32 +00:00
slaren 1d78fecdab
Fix LoRA acronym (#1145) 2023-04-23 23:03:44 +02:00
DannyDaemonic edce63baa9
Added README.md for main with examples and explanations (#1139) 2023-04-23 15:37:02 +00:00
Stephan Walter c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122) 2023-04-22 10:54:13 +00:00
wbpxre150 36b4f7e064
llama : print timings on ctrl+c exit (#1021)
* print timings on ctrl+c exit

* remove redundant free memory call.

* add global pointer to ctx.
2023-04-22 11:56:35 +03:00
eiery 10f19c1121
llama : have n_batch default to 512 (#1091)
* set default n_batch to 512 when using BLAS

* spacing

* alternate implementation of setting different n_batch for BLAS

* set n_batch to 512 for all cases
2023-04-22 11:27:05 +03:00
Clint Herron e9a9cb0c54
examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107)
* Moving parameters to separate lines for readability.

* Increasing repeate_penalty to 1.1 to make alpaca more usable by default.

* Adding trailing newline.
2023-04-22 09:54:33 +03:00
Alex Klinkhamer 9411288271
main : evaluate tokens in batches after swapping context (#1014)
* examples : evaluate tokens in batches after swapping context

* Update examples/main/main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-21 21:18:09 +03:00
slaren 3d59769c3b
Show perplexity ETA in hours and minutes (#1096) 2023-04-21 14:57:57 +02:00
Kawrakow 38de86a711
llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, bit it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC starting to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-20 20:42:27 +03:00
Georgi Gerganov e0305ead3a
ggml : add Q4_3 quantization (#1082) 2023-04-20 20:35:53 +03:00
Georgi Gerganov 77a73403ca
ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
2023-04-18 23:54:57 +03:00
slaren 315a95a4d3
Add LoRA support (#820) 2023-04-17 17:28:55 +02:00
Georgi Gerganov eb17a026fd
quantize-stats : fix bug in --type argument 2023-04-17 17:31:06 +03:00
Pavol Rusnak 489537e6cf
examples: add missing <ctime> include for time() (#1011) 2023-04-16 10:13:00 +00:00
Ivan Komarov c12b14b77f
benchmark : fix result validation in benchmark-q4_0-matmult (#987) 2023-04-15 08:51:54 +03:00
Pavol Rusnak c85e03d12e
Revert "main : alternative instruct mode (Vicuna support, etc.) (#863)" (#982)
This reverts commit f4d277ae17.
2023-04-14 22:58:43 +03:00
Pavol Rusnak c56b715269
Expose type name from ggml (#970)
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
2023-04-14 20:05:37 +02:00
Tomáš Pazdiora f4d277ae17
main : alternative instruct mode (Vicuna support, etc.) (#863)
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt

* Add multiline mode, update text input.

* bugfix

* update implementation

* typos

* Change --multiline implementation to be toggled by EOF.

* bugfix

* default multiline mode

* add more configs

* update formating

* update formatting

* apply suggestions
2023-04-14 18:19:17 +03:00
Gary Linscott be87b6ed20
perplexity : add support for batch size to --perplexity (#407)
* Add support to batch size for perplexity

* Revert "Fix memory allocation issues and seg faults"

This reverts commit 4870e455b3.

* update from merge

* Remove perplexity from main

* updates

* Update batch size for efficiency
2023-04-14 00:50:42 +03:00
CRD716 0e07e6a839
common : remove unnecessary includes (#947) 2023-04-13 18:39:25 +03:00
Georgi Gerganov 9190e8eac8
llama : merge llama_internal.h into llama.h
Hide it behind an #ifdef
2023-04-13 18:04:45 +03:00
CRD716 8cda5c981d
fix whitespace (#944) 2023-04-13 16:03:57 +02:00
niansa/tuxifan 107980d970
examples : add -n to alpaca and gpt4all scripts (#706) 2023-04-13 16:03:39 +03:00
SebastianApel 95ea26f6e9
benchmark : add tool for timing q4_0 matrix multiplication (#653)
* Initial version of q4_0 matrix multiplication benchmark

* Bugfix: Added dependency to ggml.o to benchmark

* Reviewer requests: added parameter for threads, switched to ggml_time_us()

* Reviewer input: removed rtsc, use epsilon for check

* Review comment: Removed set_locale

* Feature: Param for numer of iterations, Bugfix for use of parameter threads

* Reviewer suggestion: Moved to examples

* Reviewer feedback: Updated clean: and benchmark: sections

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-13 15:46:23 +03:00
Pavol Rusnak 8b679987cd
Fix whitespace, add .editorconfig, add GitHub workflow (#883) 2023-04-11 19:45:44 +00:00
Stephan Walter 3e6e70d8e8
Add enum llama_ftype, sync ggml_type to model files (#709) 2023-04-11 15:03:51 +00:00
comex 2663d2c678
Windows fixes (#890)
Mostly for msys2 and mingw64 builds, which are different from each other
and different from standard Visual Studio builds.  Isn't Windows fun?

- Define _GNU_SOURCE in more files (it's already used in ggml.c for
  Linux's sake).

- Don't use PrefetchVirtualMemory if not building for Windows 8 or later
  (mingw64 doesn't by default).  But warn the user about this situation
  since it's probably not intended.

- Check for NOMINMAX already being defined, which it is on mingw64.

- Actually use the `increment` variable (bug in my `pizza` PR).

- Suppress unused variable warnings in the fake pthread_create and
  pthread_join implementations for Windows.

- (not Windows-related) Remove mention of `asprintf` from comment;
  `asprintf` is no longer used.

Fixes #871.
2023-04-11 15:19:54 +02:00
comex f963b63afa Rewrite loading code to try to satisfy everyone:
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on #740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  The exceptions are converted to error codes at the
  API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
2023-04-10 01:10:46 +02:00
Tomáš Pazdiora aaf3b23deb
fix for windows utf-8 input (#840)
Use UTF-16 as input on Windows, since UTF-8 does not work and reads multibyte characters as zeros
2023-04-08 17:49:39 +02:00
unbounded 62cfc54f77
Add quantize-stats command for testing quantization (#728)
Command that calculates some statistics over the errors introduced by
quantization, like mean square error, max error and some percentile errors for layer
weights. Should be useful for testing quantization improvements.

Exposes some internal state from ggml and llama for testing
2023-04-08 00:09:18 +02:00
Sergey Alirzaev cc9cee8e9e
Do not crash when it has nothing to say. (#796)
Otherwise observing this in the interactive mode:
/usr/lib/gcc/x86_64-pc-linux-gnu/12/include/g++-v12/bits/stl_vector.h:1230: reference std::vector<int>::back() [_Tp = int, _Alloc = std::allocator<int>]: Assertion '!this->empty()' failed.
2023-04-06 17:59:11 +02:00
at8u ff05d05c96
miku.sh : add executable bit (#780) 2023-04-05 18:59:13 +03:00
at8u 88ed5761b8
examples : add Miku.sh (#724)
* Add Miku.sh to examples

* Add missing line to prompt in Miku.sh

* Add --keep param to Miku.sh

* Remove '[end_of_conversation]' line from Miku.sh

No longer is necessary.
2023-04-05 17:32:42 +03:00
mgroeber9110 53dbba7695
Windows: reactive sigint handler after each Ctrl-C (#736) 2023-04-03 18:00:55 +02:00
Leonardo Neumann 6e7801d08d
examples : add gpt4all script (#658) 2023-04-02 10:56:20 +03:00
Murilo Santana 5b70e7de4c
fix default params for examples/main (#697) 2023-04-02 04:41:12 +02:00