diff --git a/README.md b/README.md index f41250147..09c5b1b92 100644 --- a/README.md +++ b/README.md @@ -597,6 +597,11 @@ Several quantization methods are supported. They differ in the resulting model d | 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 | | 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 | +- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684) +- recent k-quants improvements + - [#2707](https://github.com/ggerganov/llama.cpp/pull/2707) + - [#2807](https://github.com/ggerganov/llama.cpp/pull/2807) + ### Perplexity (measuring model quality) You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better). diff --git a/examples/perplexity/README.md b/examples/perplexity/README.md index eacfb17c6..50e1af011 100644 --- a/examples/perplexity/README.md +++ b/examples/perplexity/README.md @@ -1,3 +1,21 @@ # perplexity TODO + +## Llama 2 70B Scorechart +Quantization | Model size (GiB) | Perplexity | Delta to fp16 +-- | -- | -- | -- +Q4_0 | 36.20 | 3.5550 | 3.61% +Q4_1 | 40.20 | 3.5125 | 2.37% +Q5_0 | 44.20 | 3.4744 | 1.26% +Q2_K | 27.27 | 3.7339 | 8.82% +Q3_K_S | 27.86 | 3.7019 | 7.89% +Q3_K_M | 30.83 | 3.5932 | 4.72% +Q3_K_L | 33.67 | 3.5617 | 3.80% +Q4_K_S | 36.39 | 3.4852 | 1.57% +Q4_K_M | 38.54 | 3.4725 | 1.20% +Q5_K_S | 44.20 | 3.4483 | 0.50% +Q5_K_M | 45.41 | 3.4451 | 0.40% +Q6_K | 52.70 | 3.4367 | 0.16% +fp16 | 128.5 | 3.4313 | - + diff --git a/examples/quantize/README.md b/examples/quantize/README.md index f349e913e..c8b9a27a0 100644 --- a/examples/quantize/README.md +++ b/examples/quantize/README.md @@ -1,3 +1,44 @@ # quantize TODO + +## Llama 2 7B + +Quantization | Bits per Weight (BPW) +-- | -- +Q2_K | 3.35 +Q3_K_S | 3.50 +Q3_K_M | 3.91 +Q3_K_L | 4.27 +Q4_K_S | 4.58 +Q4_K_M | 4.84 +Q5_K_S | 5.52 +Q5_K_M | 5.68 +Q6_K | 6.56 + +## Llama 2 13B +Quantization | Bits per Weight (BPW) +-- | -- +Q2_K | 3.34 +Q3_K_S | 3.48 +Q3_K_M | 3.89 +Q3_K_L | 4.26 +Q4_K_S | 4.56 +Q4_K_M | 4.83 +Q5_K_S | 5.51 +Q5_K_M | 5.67 +Q6_K | 6.56 + +# Llama 2 70B + +Quantization | Bits per Weight (BPW) +-- | -- +Q2_K | 3.40 +Q3_K_S | 3.47 +Q3_K_M | 3.85 +Q3_K_L | 4.19 +Q4_K_S | 4.53 +Q4_K_M | 4.80 +Q5_K_S | 5.50 +Q5_K_M | 5.65 +Q6_K | 6.56