llama.cpp/examples
Kawrakow 4c4cb30736
IQ3_S: a much better alternative to Q3_K (#5676)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iiq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-24 16:23:52 +02:00
..
baby-llama baby-llama : allocate graphs in ggml_context (#5573) 2024-02-19 10:25:38 +02:00
batched ggml, common, examples, tests : fixed type arguments in printf (#5528) 2024-02-18 18:20:12 +02:00
batched-bench ggml, common, examples, tests : fixed type arguments in printf (#5528) 2024-02-18 18:20:12 +02:00
batched.swift ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
beam-search ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
benchmark 2-bit quantizations (#4897) 2024-01-14 09:45:56 +02:00
convert-llama2c-to-ggml ggml, common, examples, tests : fixed type arguments in printf (#5528) 2024-02-18 18:20:12 +02:00
embedding ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
export-lora ci : add an option to fail on compile warning (#3952) 2024-02-17 23:03:14 +02:00
finetune finetune : rename feed-forward tensors (w1/w2/w3) (#4839) 2024-02-13 15:15:42 +02:00
gguf gguf : simplify example dependencies 2023-12-21 23:08:14 +02:00
imatrix ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
infill ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
jeopardy parallel : add option to load external prompt file (#3416) 2023-10-06 16:16:38 +03:00
llama-bench ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
llama.android ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
llama.swiftui ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
llava llava : add --skip-unknown to 1.6 convert.py (#5632) 2024-02-21 15:36:57 +02:00
lookahead ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
lookup ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
main examples : do not assume BOS when shifting context (#5622) 2024-02-21 10:33:54 -05:00
main-cmake-pkg main-cmake-pkg : fix build issue (#4665) 2023-12-29 16:18:20 +02:00
parallel ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
passkey ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
perplexity ci : fix wikitext url + compile warnings (#5569) 2024-02-18 22:39:30 +02:00
quantize IQ3_S: a much better alternative to Q3_K (#5676) 2024-02-24 16:23:52 +02:00
quantize-stats refactor : switch to emplace_back to avoid extra object (#5291) 2024-02-03 13:23:37 +02:00
save-load-state llama : minimize size used for state save/load (#4820) 2024-01-13 18:29:43 +02:00
server server: init functional tests (#5566) 2024-02-24 12:28:55 +01:00
simple ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
speculative ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
sycl [SYCL] update guide of SYCL backend (#5254) 2024-02-02 15:53:27 +08:00
tokenize ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
train-text-from-scratch ggml, common, examples, tests : fixed type arguments in printf (#5528) 2024-02-18 18:20:12 +02:00
alpaca.sh alpaca.sh : update model file name (#2074) 2023-07-06 19:17:50 +03:00
base-translate.sh examples : improve base-translate.sh script (#4783) 2024-01-06 11:40:24 +02:00
chat-13B.bat Create chat-13B.bat (#592) 2023-03-29 20:21:09 +03:00
chat-13B.sh examples : read chat prompts from a template file (#1196) 2023-05-03 20:58:11 +03:00
chat-persistent.sh llama : fix session saving/loading (#3400) 2023-10-03 21:04:01 +03:00
chat-vicuna.sh examples : add chat-vicuna.sh (#1854) 2023-06-15 21:05:53 +03:00
chat.sh main : log file (#2748) 2023-08-30 09:29:32 +03:00
CMakeLists.txt gguf : add python reader example (#5216) 2024-02-13 19:56:38 +02:00
gpt4all.sh examples : add -n to alpaca and gpt4all scripts (#706) 2023-04-13 16:03:39 +03:00
json-schema-to-grammar.py examples : support minItems/maxItems in JSON grammar converter (#5039) 2024-02-19 16:14:07 +02:00
llama.vim llama.vim : added api key support (#5090) 2024-01-23 08:51:27 +02:00
llama2-13b.sh gitignore : changes for Poetry users + chat examples (#2284) 2023-07-21 13:53:27 +03:00
llama2.sh gitignore : changes for Poetry users + chat examples (#2284) 2023-07-21 13:53:27 +03:00
llm.vim llm.vim : stop generation at multiple linebreaks, bind to <F2> (#2879) 2023-08-30 09:50:55 +03:00
make-ggml.py make-ggml.py : compatibility with more models and GGUF (#3290) 2023-09-27 19:25:12 +03:00
Miku.sh MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287) 2023-07-21 11:13:18 +03:00
pydantic-models-to-grammar-examples.py examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
pydantic_models_to_grammar.py examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
reason-act.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
server-llama2-13B.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00