Commit graph

110 commits

Author SHA1 Message Date
Georgi Gerganov 7a74dee6b4
llama : temporarily disable Q6_K output quantization (#1711) 2023-06-06 09:39:38 +03:00
Spencer Sutton 590250f7a9
metal : add checks for buffer size (#1706)
Co-authored-by: Spencer Sutton <Spencer.Sutton@precisely.com>
2023-06-06 06:28:17 +03:00
mgroeber9110 c2df36d60d
llama : consistently catch and throw only exceptions deriving from std::exception (#1599)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 23:24:29 +03:00
kiltyj 9d0693bce3
metal : use shared buffers between CPU and GPU (#1696)
* Use MTLDevice.newBufferWithBytesNoCopy to share buffers between CPU and GPU

* Page-align buffers used by Metal

* Remove trailing whitespace

* Only import unistd.h for Metal builds

* metal : remove unnecessary copies
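
The page-alignment step above can be sketched as follows. No-copy buffer sharing generally needs the buffer length to be a whole number of pages, so sizes are rounded up. The helper name `align_up` and the 16384-byte page size are assumptions for illustration and are not taken from the commit; real code should query the page size from the OS.

```python
def align_up(size: int, page_size: int) -> int:
    """Round size up to the next multiple of page_size."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (size + page_size - 1) // page_size * page_size


PAGE_SIZE = 16384  # assumed page size, for illustration only

for requested in (1, 16384, 16385):
    print(requested, "->", align_up(requested, PAGE_SIZE))
```

A size that is already a multiple of the page size is returned unchanged, so applying the helper twice is harmless.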

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 23:24:04 +03:00
Kawrakow 99009e72f8
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml

I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.

* Adding Q3_K and Q8_K (de)-quantization

* Q3_K now working on CUDA and AVX2/scalar

CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).

* Some improvement for Q3_K on CUDA

It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.

* Some more CUDA optimizations for Q3_K

Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.

* Adding Q4_K - scalar, AVX2, CUDA

Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity) is about the same.

* Adding Q6_K - scalar, AVX2, CUDA

Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).

* Adding Q5_K - scalar, AVX2, CUDA

Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, performance is about the same as Q4_0
for both single token and batch prediction.

* Per convention, all QX_K quantizations use Q5_K for output.weight

* Adding quantization mixes

* Quantization mixes: didn't quite get what I wanted in the last commit

* Q4_K dot product for ARM_NEON

* Q6_K dot product for ARM_NEON

* Q5_K dot product for ARM_NEON

* Adding Q3_K dot for ARM_NEON

It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.

* A very slightly faster ARM_NEON Q3_K dot

* Adding Q2_K - just CUDA for now

Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.

* Adding scalar and AVX2 Q2_K dot

* Adding ARM_NEON Q2_K dot

About the same performance as Q4_K.

* A slightly faster ARM_NEON Q2_K dot

Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.

* Fixed bug in Q2_K CUDA dot product kernel

Strangely enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonsense back.

In any case, Q2_K single token evaluation times on an RTX 4080 in a Ryzen 7950X
box, using CUDA with the model fully loaded on the GPU, are
  ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for the 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).

* Don't print zeros/NaNs when no count histogram has been collected

* A 10% faster CUDA vector dot kernel for Q3_K

Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems the memory access pattern is more important for
performance than the amount of computation the kernel
does.

* A slightly faster Q4_K AVX2 dot product

For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.

* A slightly faster ARM_NEON Q4_K dot product

* Minor

* Fix quantization error test

We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.
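
To give a feel for the magnitudes involved, here is a rough sketch that measures the RMSE of a plain uniform round-to-nearest quantizer on evenly spaced values in [-1, 1]. This is not the k-quant scheme and not the data used by the project's test; the quantizer, the input data and the name `quantize_rmse` are all assumptions chosen for illustration.

```python
import math


def quantize_rmse(values, bits):
    """RMSE after rounding each value to a uniform grid with 2**bits levels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant input is represented exactly
    scale = (hi - lo) / (2 ** bits - 1)
    squared_error = 0.0
    for v in values:
        q = round((v - lo) / scale)      # index of the nearest grid point
        restored = lo + q * scale        # dequantized value
        squared_error += (v - restored) ** 2
    return math.sqrt(squared_error / len(values))


# 256 evenly spaced sample values in [-1, 1] (illustrative data only)
values = [i / 255 * 2 - 1 for i in range(256)]

for bits in (2, 3, 4, 6):
    print(bits, "bits:", round(quantize_rmse(values, bits), 4))
```

Under these assumptions the 2-bit and 3-bit errors come out far above 0.002, which is consistent with the reasoning in the commit message.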

* Fix docker build

I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.

* Added forgotten ggml.o dependence on k_quants.h to the Makefile

* Had unintentionally committed the Makefile with -Ofast enabled

* ggml : rename k_quants -> ggml-quants-k, use lowercase in code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 22:56:18 +03:00
Henri Vasserman 5220a991a5
Increase 3B scratch buffers. (#1698)
The 128 MB was too optimistic.
Too bad it is not dynamically computed.
2023-06-05 13:43:08 +03:00
Georgi Gerganov d1f563a743
llama : fix Metal KV cache sync (close #1695) 2023-06-05 10:19:03 +03:00
Georgi Gerganov ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + support scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
0cc4m dcb2ed4826
OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)
* Use events instead of clFinish, where possible

* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

* Reduce queueing overhead for contiguous tensors by using single mul kernel call

* Adapt to #1612 cl_mem malloc changes

* Reduce code duplication between cuda and opencl branches

* Improve implementation
2023-06-04 08:12:05 +02:00
Henri Vasserman ffb06a345e
OpenLLaMA 3B support (#1588)
This adds support to llama.cpp to load the model.

Currently missing are changes that are required from convert.py to convert the model correctly. It needs some changes to start reading the JSON configuration for HF models instead of deriving the values by guessing.

Co-authored-by: FNsi <125447286+FNsi@users.noreply.github.com>
2023-05-30 21:24:22 +03:00
0cc4m 2e6cd4b025
OpenCL Token Generation Acceleration (#1459)
* Move back to C++ for OpenCL

* Refactor OpenCL code to work more like the CUDA code, add missing functions

* Deduplicate dequant kernels

* Add OpenCL compile options

* Use compile args for preprocessing constants

* Restore default platform + device selection by id behavior

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-05-23 00:33:24 +03:00
Juuso Alasuutari 29cf5596fe
llama : define magic numbers as integer constants (#1518) (#1520)
The underlying representation of multibyte character literals is
implementation-defined. This could, at least in principle, cause
cross-build data export/import issues independent of endianness.

Define magic numbers as integer literals to be on the safe side.
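
As an illustration of the idea, the sketch below derives an integer constant from a four-character tag with an explicit byte order, so the value no longer depends on how a compiler packs a multibyte character literal. The tags `ggjt` and `ggml` and the helper name `magic_from_ascii` are used here as examples only; they are not quoted from the commit.

```python
def magic_from_ascii(tag: bytes) -> int:
    """Pack four ASCII bytes into one integer, first byte most significant."""
    if len(tag) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(tag, "big")


# The printed hex values are what one would write as integer literals.
for tag in (b"ggjt", b"ggml"):
    print(tag.decode("ascii"), "->", hex(magic_from_ascii(tag)))
```

Writing the resulting hex value directly in the source gives the same constant on every compiler.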

Signed-off-by: Juuso Alasuutari <juuso.alasuutari@gmail.com>
2023-05-20 15:58:15 +03:00
Johannes Gäßler affc76edfd
cuda : loading models directly into VRAM, norm calculation on GPU, broadcasting for ggml_mul (#1483)
* Broadcasting for ggml_mul

* CUDA kernel for ggml_mul, norms in VRAM

* GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* define default model path once, sync path with readme (#1366)

* ~7% faster Q5_1 AVX2 code (#1477)

* convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

* Support models in a single pytorch_model.bin

* Remove spurious line with typo

* benchmark-matmul: Print the average of the test results (#1490)

* Remove unused n_parts parameter (#1509)

* Fixes #1511 lambda issue for w64devkit (mingw) (#1513)

* Fix for w64devkit and mingw

* make kv_f16 the default for api users (#1517)

* minor : fix compile warnings

* readme : adds WizardLM to the list of supported models (#1485)

* main : make reverse prompt option act as a stop token in non-interactive mode (#1032)

* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1748.

* Update gpt_params_parse and fix a merge error take 2

* examples : add persistent chat (#1495)

* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : add missing header

* ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics

* ggml : fix scalar implementation of Q4_1 dot

* llama : fix compile warnings in llama_set_state_data()

* llama : fix name shadowing and C4146 (#1526)

* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Code style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix for mingw (#1462)

* llama : add llama_init_backend() API (close #1527)

* feature : add blis and other BLAS implementation support (#1502)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Revert "feature : add blis and other BLAS implementation support (#1502)"

This reverts commit 07e9ace0f9.

* GPU weights not in RAM, direct loading with cuFile

* llama : code style fixes + progress print fix

* ggml : ggml_mul better broadcast support

* cmake : workarounds for cufile when CMake version < 3.25

* gg rebase fixup

* Loop in llama.cpp, fixed progress callback

* Attempt clang-tidy fix

* llama : fix vram size computation

* Add forgotten fclose()

---------

Co-authored-by: András Salamon <ott2@users.noreply.github.com>
Co-authored-by: Ilya Kurdyukov <59548320+ilyakurdyukov@users.noreply.github.com>
Co-authored-by: Tom Jobbins <784313+TheBloke@users.noreply.github.com>
Co-authored-by: rankaiyx <rankaiyx@rankaiyx.com>
Co-authored-by: Stephan Walter <stephan@walter.name>
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: David Kennedy <dakennedyd@gmail.com>
Co-authored-by: Jason McCartney <jmac@theroot.org>
Co-authored-by: Evan Jones <evan.q.jones@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zenix <zenixls2@gmail.com>
2023-05-20 15:19:28 +03:00
Georgi Gerganov ec2e10c444
llama : add llama_init_backend() API (close #1527) 2023-05-20 11:06:37 +03:00
Maxime 503db28849
llama : fix name shadowing and C4146 (#1526)
* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Code style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 10:22:37 +03:00
Georgi Gerganov 8a203f9fa1
llama : fix compile warnings in llama_set_state_data() 2023-05-20 10:14:43 +03:00
Georgi Gerganov 2d5db48371
ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics
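
The effect of switching the per-block fields from F32 to F16 can be worked out with a little arithmetic. The sketch assumes blocks of 32 weights, one scale field for Q4_0 and Q8_0, and a scale plus a minimum for Q4_1; these layouts and the name `bits_per_weight` are assumptions for illustration and should be checked against the actual struct definitions.

```python
def bits_per_weight(weights_per_block, quant_bits, n_fields, field_bytes):
    """Effective bits per weight for one quantization block.

    A block stores the packed quantized values plus n_fields
    floating-point fields (a scale, and optionally a minimum).
    """
    payload_bytes = weights_per_block * quant_bits // 8
    block_bytes = payload_bytes + n_fields * field_bytes
    return block_bytes * 8 / weights_per_block


# Assumed layouts: (weights per block, bits per quant, float fields per block)
LAYOUTS = {"Q4_0": (32, 4, 1), "Q4_1": (32, 4, 2), "Q8_0": (32, 8, 1)}

for name, (n, bits, fields) in LAYOUTS.items():
    f32 = bits_per_weight(n, bits, fields, 4)  # 4-byte F32 fields
    f16 = bits_per_weight(n, bits, fields, 2)  # 2-byte F16 fields
    print(f"{name}: {f32} -> {f16} bits per weight")
```

Because the block layout changes, files written with the old layout cannot be read with the new one, which fits with the file version bump listed above.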
2023-05-19 22:17:18 +03:00
Georgi Gerganov 4b7e245adf
minor : fix compile warnings 2023-05-19 20:14:51 +03:00
Erik Scholz 5ea4339273
make kv_f16 the default for api users (#1517) 2023-05-18 19:31:01 +02:00
Stephan Walter dc271c52ed
Remove unused n_parts parameter (#1509) 2023-05-17 22:12:01 +00:00
Georgi Gerganov 5a5aeb1e91
llama : fix unused warning 2023-05-13 16:55:14 +03:00
Johannes Gäßler 905d87b70a
ggml : GPU-accelerated token generation (#1412)
* CUDA kernel for q4_0 dequant. + mat. vec. mult.

* Added q4_1 via template

* Added missing __syncthreads();

* --gpu_layers -> --gpu-layers

* Shorter dequantize_mul_mat_vec line

* q5_0 dequantize_mul_mat kernel

* More readable dequantize_mul_mat_vec logic

* dequantize_mul_mat_vec kernels for q5_1, q8_0, f16

* llama : offload "output" tensor to GPU too + coding style fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 16:38:36 +03:00
xaedes f954edda93
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama

- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.

this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are returned in a tensor the size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.

still missing backward passes for llama:

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

* implement 5 of 6 missing backward pass operations used by llama

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. Additionally, GGML_OP_REPEAT will return an unexpected value when the input to GGML_OP_SOFT_MAX contains only a single scalar. In this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value whose shape the result should take (src0). So in this case it cannot replace GGML_OP_ADD1.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph, we need forward pass implementations of the required backward operations. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.

Some operations were previously inplace-only. For the backward pass there need to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.

still not completely implemented backward passes for llama:

- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer

* norm & rms_norm can not be threaded:

after investigating rms norm for quite some time I have come to the conclusion that neither norm nor rms_norm can be threaded, because we need the mean over all items, not just over the slices each thread sees.

* remove already resolved TODO

* implement backward pass of ggml_rope and ggml_rope_back

* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back

* add test-grad0.c

* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console

* test both gradients of mul_mat

* disable graph dot export as it floods console

* bug fixes for silu_back

* successfully test silu backward

* bug fix for scale backward pass

use sum instead of mean for gradient of scalar scale parameter

* successfully test scale backward

* improve performance of sum backward pass

use add1(x,y) instead of add(x,repeat(y,x))

* improve performance of sqr backward pass

use scale(x,y) instead of mul(x,repeat(y,x))

* successfully test rope backward

* bug fix for cpy backward pass

* successfully test cpy backward

* bug fix for reshape backward pass

* successfully test reshape backward

* add test-opt.c

this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c

* correctly implement softmax backward pass using new operation ggml_diag

ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
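
A minimal sketch of why a diagonal-matrix operation is useful here, restricted to a single vector (the batch dimensions c and d are ignored). It assumes the standard softmax Jacobian J = diag(y) - y y^T; the commit does not spell out the formula, and the function names below are hypothetical, not ggml API.

```python
def diag(vector):
    """Square matrix with `vector` on the diagonal and zeros elsewhere."""
    n = len(vector)
    return [[vector[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def softmax_backward(y, dy):
    """Gradient w.r.t. the softmax input, given output y and upstream gradient dy.

    Builds the assumed Jacobian J = diag(y) - y * y^T and returns J @ dy.
    """
    n = len(y)
    d = diag(y)
    jacobian = [[d[i][j] - y[i] * y[j] for j in range(n)] for i in range(n)]
    return [sum(jacobian[i][j] * dy[j] for j in range(n)) for i in range(n)]


print(diag([1.0, 2.0, 3.0]))
print(softmax_backward([0.5, 0.5], [1.0, 0.0]))  # [0.25, -0.25]
```

The components of the resulting gradient sum to zero, as expected for softmax, since shifting all inputs by a constant does not change the output.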

* successfully test soft_max backward

* align shape annotations

* add shape annotations for llama

* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.

with this we can duplicate tensors of any type as long as they are contiguous.

* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads

when more threads are used than elements exist, ie1 was less than ie0, resulting in an invalid negative byte count argument in memcpy
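
The failure mode can be reproduced with a small sketch. It assumes each thread gets a fixed-size chunk computed by ceiling division, with the end clamped to the element count; that partitioning scheme and the function names are assumptions for illustration, not a copy of the ggml code.

```python
def thread_range(n_elements, n_threads, thread_index):
    """Half-open element range [start, end) for one thread (assumed scheme)."""
    per_thread = (n_elements + n_threads - 1) // n_threads  # ceiling division
    start = per_thread * thread_index
    end = min(start + per_thread, n_elements)  # clamp to the element count
    return start, end


def safe_count(start, end):
    """Number of elements to copy, never negative."""
    return max(end - start, 0)


n_elements, n_threads = 2, 4
for ith in range(n_threads):
    start, end = thread_range(n_elements, n_threads, ith)
    print(ith, (start, end), "raw:", end - start, "safe:", safe_count(start, end))
```

With 2 elements and 4 threads, the last thread gets start 3 and end 2, so the raw count is -1. Guarding the copy, for example by skipping it when the range is empty, avoids passing a negative size on.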

* bug fix for add_at forward

required for view backward pass

src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.

* successfully test view backward

* minor code format improvement

* fix ggml_forward_add functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add_q_f32, but makes it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to work as in ggml_compute_forward_add_q_f32.

* fix ggml_forward_add1 functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add1_q_f32, but makes it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to work as in ggml_compute_forward_add1_q_f32.

* test-grad0.c : add print_elements to help with debugging

* successfully test permute backward

* some minor test-grad0 fixes

* fix sub, mul and div functions to work correctly with transposed tensors

uses the same logic as in add

* implement ggml_cont backward pass

* successfully test transpose backward and permute for all permutations

also test sub, mul and div up to max n_dims

* test-grad0.c add TODO for view_2d and view_3d

add_at (required for view backward pass) is a bit tricky for n_dims > 1.

* fix comments

* successfully test diag_mask_inf and diag_mask_zero backward

* test-grad0 : fix test for div

nargs and ndims were swapped, corrupting the stack

* fix diag_mask to work with non-inplace input

* move dup call into the actual add_at functions

* fix get rows backward pass

* successfully test get_rows backward

* fix view backward pass

add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.

* successfully test backward pass of view_1d, view_2d and view_3d

* fix backward pass for rms_norm

I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in a comment using manual forward-backward automatic differentiation of rms_norm and simplification.

* successfully test backward pass of rms_norm

some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:

rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324

it is due to the test logic in check_gradients that they fail.

* add todos for llama backward pass

- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.

* add operation ggml_sum_rows

ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
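
For the two-dimensional case the shape rule can be sketched as below, reading the first shape entry as the row length so that each row of `a` elements collapses to a single value. The nested-list representation and the function name are illustrative assumptions, not the ggml implementation.

```python
def sum_rows(matrix):
    """Collapse every row to its sum: b rows of a elements -> b rows of 1 element."""
    return [[sum(row)] for row in matrix]


matrix = [
    [1.0, 2.0, 3.0],  # row 0, a = 3 elements
    [4.0, 5.0, 6.0],  # row 1
]
print(sum_rows(matrix))  # [[6.0], [15.0]]
```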

* add missing GGML_OP_SUM_ROWS

* fix backward pass for repeat

requires ggml_sum_rows

* successfully test backward pass of repeat

* update quantization types in switch-case of add_at and add1

* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.

had to increase maximum number of optimization parameters to train from scratch.

* fix softmax in baby-llama example

* switching from training with adam to lbfgs produces much better results in the baby-llama example

* train with two examples, creating new tensors each time

* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt

when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt

* train on multiple examples, generate & print tokens with trained model afterwards

ctx0 for evaluation and optimization is renewed for each sample

* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d

* fix soft_max backward pass for input->ne[1] != 1

* add ggml_log operation necessary for cross entropy loss

* add test for ggml_log gradients

* implement backward pass for ggml_sum_rows, necessary for cross entropy loss

* implement ggml_repeat support for rank > 2 tensors

* add test for ggml_sum_rows gradients

* fix training get_example_targets

predict the next token, not the current token!
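
The off-by-one behind this fix can be shown with a tiny sketch: for next-token prediction the targets are the inputs shifted by one position. The function name `make_example` is hypothetical and not taken from the baby-llama example.

```python
def make_example(tokens):
    """Split a token sequence into (inputs, targets) for next-token prediction."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]   # the same sequence shifted left by one
    return inputs, targets


inputs, targets = make_example([5, 6, 7, 8])
print(inputs)   # [5, 6, 7]
print(targets)  # [6, 7, 8]
```

Using the unshifted sequence as targets would only teach the model to copy its current input.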

* add square_error_loss and cross_entropy_loss functions

* optimize loss over multiple samples

this increases computation graph, need parallel batched forward for more efficiency.

* fix backward pass for add_at and change arguments to have same order as in view

* add ggml_set(ctx, a, b) to set b in view of a and return modified a

necessary to set values into kv_self cache and properly propagate the gradients

* fix kv_self gradients for training

use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients

* replace inplace operations for training with copying operations to allow gradient propagation

* add GGML_ASSERT to catch ggml_rope and back value errors

* add trainable lora-only model with all big matrices C split into A,B with A*B=C

this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.

training this instead of the normal model resulted in much worse results though...
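
A quick parameter count shows what the A*B=C split buys and hints at why results can suffer. The matrix size and rank below are hypothetical numbers chosen for illustration, not values from the example.

```python
def full_params(rows, cols):
    """Parameter count of a dense rows x cols matrix C."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameter count of A (rows x rank) plus B (rank x cols), with A*B = C."""
    return rank * (rows + cols)


rows, cols, rank = 4096, 4096, 16  # hypothetical sizes
print(full_params(rows, cols))            # 16777216
print(low_rank_params(rows, cols, rank))  # 131072
```

The factorised form can only represent matrices of rank at most `rank`, so it has much less capacity than the dense matrix it replaces.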

* vastly improve training results

instead of logit targets 0 and 1 use -1 and +1.

* shorten code using a variable

* change name of GGML_OP_ADD_AT to GGML_OP_ACC

* smaller default values for baby llama model parameters

* update static assert of GGML_OP_COUNT

* remove shape annotations in llama_eval_internal

* revert disabling of threading for rms_norm and norm

* rename print functions in baby-llama example

* fix call to ggml_set_name

* add missing include for strcmp, etc

* remove trailing whitespace

* reduce number of test-grad0 iterations

avoid exceeding timeout of automated tests

* remove busy loop that was used as sleep for slower sine wave generation

* disable slow tests grad0 and opt to avoid exceeding timeouts

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* ggml : fix compiler warnings + cosmetic changes

* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* ggml : swap vDSP_vsub args as per documentation

* add parallel batched forward function for baby-llama training

* cleanup code for batched training

* remove trailing whitespace

* minor : fix compiler warnings + indentation style

* ggml : fix null ptr deref in backward pass

* ggml : remove Q4_2 remnants

* ggml : fix clang-tidy warnings

* baby-llama : couple of clang-tidy warnings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 15:56:40 +03:00
Georgi Gerganov 0cd22e190a
llama : fix various warnings 2023-05-13 11:23:15 +03:00
Georgi Gerganov 738ace394a
llama : free ggml context in set / copy state data (close #1425) 2023-05-13 09:08:52 +03:00
Georgi Gerganov b9fd7eee57
ggml : remove bit shuffling (#1405)
* ggml : remove Q4_0 bit shuffling (ARM NEON)

* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)

* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)

* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)

* ggml : remove Q5_0 bit shuffling (ARM NEON)

* ggml : 2x faster scalar implementations

* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)

* ggml : simplify scalar dot

* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit

* ggml : fix Q4_1 quantization

* ggml : update cuBLAS + normalize variable names

* ggml : remove Q4_2 mode

* ggml : minor formatting

* ggml : fix Q5_0 quantization

* scripts : add script for measuring the time per token

* AVX implementations (#1370)

* ggml : uniform 5th bit extraction

* llama : produce error upon loading old model files

* llama : fix model magic/version write

* ggml : speed-up Q5_0 + Q5_1 at 4 threads

* ggml : preserve old Q4 and Q5 formats

* ggml : simplify Q8_1 - no need for low / high sums anymore

* ggml : fix Q8_0 and Q8_1 rounding

* Revert "AVX implementations (#1370)"

This reverts commit 948d124837.

* ggml : fix AVX2 implementation

* sha : update hashes for 7B and 13B

* readme : update timings + remove warning banner

* llama : update v2 PR number to 1405

* ggml : fix WASM comments

* ggml : back to original bit order

* readme : add note that Q4 and Q5 have been changed

* llama : fix return for unknown version

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
Pavol Rusnak 003ba2fb43
llama : fix hparams shadow (#1367)
fixes #1363
2023-05-08 17:48:21 +03:00
Georgi Gerganov f9a6364912
llama : require first token to be BOS (#1303)
* llama : require first token to be BOS

* scripts : add ppl-run-all.sh

* perplexity : add BOS for each chunk

* readme : update perplexity values after BOS fix

* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
Jed Fox 3924088512
Remove default arguments from sampling functions (#1343) 2023-05-06 17:01:47 -04:00
Evan Jones e216aa0463
llama : only copy used KV cache in get / set state (#1272)
* llama : only copy used KV cache in get / set state

* switch to ggml for copying k, v

* avoid designated initializers
2023-05-02 22:26:13 -04:00
Georgi Gerganov 0e6cbff1b7
llama : fix compile warnings 2023-05-02 23:09:08 +03:00
Robert Brisita 2bb992f034
llama : allow 0 as a seed number. (#1275) 2023-05-02 19:23:44 +03:00
slaren 2d099e5193
ggml: add names to tensors (#1268)
* ggml: add names to tensors

* minor improvements to dot file formatting
2023-05-02 16:03:00 +02:00
Georgi Gerganov 70269cae37
llama : fix session load / save (#1263) 2023-05-01 14:54:59 +03:00
slaren b925f1f1b0
cuBLAS: fall back to pageable memory if pinned alloc fails (#1233)
* cuBLAS: fall back to pageable memory if pinned alloc fails

* cuBLAS: do not use pinned memory if env variable GGML_CUDA_NO_PINNED is set
2023-05-01 13:32:22 +02:00
Alex Klinkhamer 90b19bd6ee
llama : let context be const when accessing const data (#1261) 2023-05-01 10:24:20 +03:00
Georgi Gerganov 214b6a3570
ggml : adjust mul_mat_f16 work memory (#1226)
* llama : minor - remove explicit int64_t cast

* ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS

* ggml : add asserts to guard for incorrect wsize
2023-04-29 18:43:28 +03:00
Georgi Gerganov 84ca9c2ecf
examples : fix save-load-state + rename llama-util.h 2023-04-29 13:48:11 +03:00
Ivan Stepanov dd7eff57d8
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.

New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat

Ignore EOS fix: -inf should be used.

* mirostat

* Added --logit-bias and --no-penalize-nl, removed std::span

* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)

* Save and load example adjust

* Tests

* Windows build fix

* Windows test fix
2023-04-29 08:34:41 +03:00
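The note "Ignore EOS fix: -inf should be used" in the commit above can be illustrated with a minimal sketch. It is not the project's code: `softmax`, `ban_token`, and `EOS_ID` are hypothetical names chosen for illustration. The point it shows is that only a logit of negative infinity gives a token exactly zero probability after softmax, while any finite value still leaves it sampleable.

```python
import math

def softmax(logits):
    """Numerically stable softmax that tolerates -inf entries."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def ban_token(logits, token_id):
    """Return a copy of logits in which token_id can never be sampled."""
    out = list(logits)
    out[token_id] = -math.inf
    return out

EOS_ID = 2  # hypothetical token id, for illustration only
logits = [1.0, 0.5, 3.0, -1.0]

# With -inf the banned token gets exactly zero probability.
probs = softmax(ban_token(logits, EOS_ID))

# With a finite logit (here 0.0) the token keeps some probability mass.
zeroed = list(logits)
zeroed[EOS_ID] = 0.0
probs_zeroed = softmax(zeroed)
```

The remaining probabilities still sum to one, so the downstream samplers need no special case for the banned token.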
slaren 7fc50c051a
cuBLAS: use host pinned memory and dequantize while copying (#1207)
* cuBLAS: dequantize simultaneously while copying memory

* cuBLAS: use host pinned memory

* cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory

* cuBLAS: also pin kv cache

* fix rebase
2023-04-29 02:04:18 +02:00
Stephan Walter 36d19a603b
Remove Q4_3 which is no better than Q5 (#1218) 2023-04-28 23:10:43 +00:00
Evan Jones 1481a9cf25
llama : add session file format and saved sessions in main (#1169) 2023-04-28 18:59:37 +03:00
0cc4m 7296c961d9
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using CLBlast instead of OpenBLAS for context processing

* Improve CLBlast implementation, avoid recreating buffers, remove redundant transfers

* Finish merge of CLBlast support

* Move CLBlast implementation to separate file

Add buffer reuse code (adapted from slaren's cuda implementation)

* Add q4_2 and q4_3 CLBlast support, improve code

* Double CLBlast speed by disabling OpenBLAS thread workaround

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>

* Fix device selection env variable names

* Fix cast in opencl kernels

* Add CLBlast to CMakeLists.txt

* Replace buffer pool with static buffers a, b, qb, c

Fix compile warnings

* Fix typos, use GGML_TYPE defines, improve code

* Improve btype dequant kernel selection code, add error if type is unsupported

* Improve code quality

* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read

Co-authored-by: Henri Vasserman <henv@hot.ee>

* Use c compiler for opencl files

* Simplify code, fix include

* First check error, then release event

* Make globals static, fix indentation

* Rename dequant kernels file to conform with other file names

* Fix import cl file name

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 17:57:16 +03:00
Georgi Gerganov 574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
* ggml : add Q5_0 quantization (cuBLAS only)

* ggml : fix Q5_0 qh -> uint32_t

* ggml : fix q5_0 histogram stats

* ggml : q5_0 scalar dot product

* ggml : q5_0 ARM NEON dot

* ggml : q5_0 more efficient ARM NEON using uint64_t masks

* ggml : rename Q5_0 -> Q5_1

* ggml : adding Q5_0 mode

* quantize : add Q5_0 and Q5_1 to map

* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-26 23:14:13 +03:00
Ásgeir Bjarni Ingvarsson 87a6f846d3
Allow setting the rng seed after initialization. (#1184)
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
2023-04-26 22:08:43 +02:00
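The behavior described in #1184 above (restoring a saved state also restores the RNG, so a caller who wants different output must reseed afterwards) can be sketched with Python's standard `random` module standing in for the context's RNG. This is an analogy under that assumption, not the llama API; `getstate` and `setstate` only play the roles of the copy and set state calls.

```python
import random

rng = random.Random(1234)    # stand-in for the RNG held by a context
snapshot = rng.getstate()    # plays the role of copying the state out
first = [rng.random() for _ in range(3)]

# Restoring the snapshot rewinds the RNG, so the same draws repeat.
rng.setstate(snapshot)
replay = [rng.random() for _ in range(3)]

# Restoring and then reseeding continues from the saved state
# but with a different random stream.
rng.setstate(snapshot)
rng.seed(5678)
fresh = [rng.random() for _ in range(3)]
```

Here `replay` matches `first`, while `fresh` diverges because the seed was changed after the restore.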
Georgi Gerganov 7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)

* tests : fix test-quantize-fns

* ggml : finalize Q8_0 implementation

* ggml : use q4_0_q8_0 and q4_2_q8_0

* ggml : fix Q8_0 dot product bug (ARM)

* ggml : Q8_0 unroll x2

* ggml : fix bug - using wrong block type

* ggml : extend quantize_fns_t with "vec_dot_type"

* ggml : fix Q8_0 to use 255 values out of 256

* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
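One reading of "fix Q8_0 to use 255 values out of 256" in the commit above is symmetric 8-bit quantization: each block is scaled by its largest magnitude divided by 127, so the quantized values span [-127, 127] and the code -128 is never used. The sketch below assumes that reading. The function names are hypothetical, and the block length is left to the caller rather than taken from ggml.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization of one block: returns (scale, ints)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 0.0
    inv = 1.0 / scale if scale else 0.0
    # Every quantized value lands in [-127, 127]: 255 distinct codes.
    q = [int(round(v * inv)) for v in values]
    return scale, q

def dequantize_block(scale, q):
    """Reconstruct approximate floats from the scale and integer codes."""
    return [scale * x for x in q]

block = [-2.0, -0.5, 0.0, 0.75, 2.0]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
```

Because the range is symmetric, a value and its negation quantize to codes of equal magnitude, and the reconstruction error per element is at most half the scale.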
Georgi Gerganov 957c8ae21d
llama : increase scratch buffer size for 65B (ref #1152)
Temporary solution
2023-04-24 18:47:30 +03:00
Georgi Gerganov c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API (#1143) 2023-04-24 07:40:02 +03:00
Georgi Gerganov e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make" 2023-04-23 18:15:39 +03:00
Stephan Walter c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122) 2023-04-22 10:54:13 +00:00