Commit graph

1026 commits

Author SHA1 Message Date
bobqianic 9d3d1d23f2
Add files via upload 2024-02-14 01:29:33 +00:00
bobqianic b7bc969d65
Merge pull request #9 from bobqianic/fix
Fix
2024-02-14 00:31:57 +00:00
bobqianic 53d58fb149
rewrite bpe_gpt2_preprocess 2024-02-14 00:31:05 +00:00
bobqianic 6eb97e9114
Add files via upload 2024-02-14 00:24:54 +00:00
bobqianic dac6533892
Add files via upload 2024-02-13 03:02:25 +00:00
bobqianic b668927591
Add files via upload 2024-02-13 03:01:57 +00:00
bobqianic 0f6ad6c2f5
fix bugs 2024-02-13 02:51:28 +00:00
bobqianic 99e5322a79
Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-11 15:19:01 +00:00
bobqianic 047ae5b51a
reduce error rate 2024-02-10 23:02:01 +00:00
bobqianic 56a7a22080
Reduce error rate 2024-02-10 21:48:30 +00:00
bobqianic 14fef7cc23
Add files via upload 2024-02-10 18:00:08 +00:00
bobqianic 221d8d969b
Update Makefile 2024-02-10 17:58:12 +00:00
bobqianic f8c8d493af
Update CMakeLists.txt 2024-02-10 17:56:28 +00:00
bobqianic 0806bc330e
Add files via upload 2024-02-10 17:55:11 +00:00
bobqianic a29a3c8c29
bpe_tokenizer implementation 2024-02-10 17:53:30 +00:00
Georgi Gerganov 02b4c52c12
talk-llama : sync llama.cpp 2024-02-10 10:10:59 +02:00
Georgi Gerganov 518199c09e
sync : ggml 2024-02-10 09:56:47 +02:00
Georgi Gerganov 8b17a2f776
src : relocate new backend sources 2024-02-10 09:55:47 +02:00
Michael Podvitskiy b6d2827914
ggml : fix error C2078: too many initializers for MSVC ARM64 (llama/5404) 2024-02-10 09:55:47 +02:00
Johannes Gäßler 9711bae0b3
CUDA: more warps for mmvq on NVIDIA (llama/5394) 2024-02-10 09:55:47 +02:00
Johannes Gäßler eec38f63bd
CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386) 2024-02-10 09:55:47 +02:00
0cc4m ef5e6b746f
Basic Vulkan Multi-GPU implementation (llama/5321)
* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-10 09:55:47 +02:00
Johannes Gäßler 77bf6b5f56
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370) 2024-02-10 09:55:47 +02:00
Kawrakow b562fff9d0
Slight quantization improvement for Q4_K and Q5_K (llama/5361)
* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-10 09:55:47 +02:00
Johannes Gäßler b5dec374f4
CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351) 2024-02-10 09:55:47 +02:00
Kawrakow fa0dc6167c
ggml : make use of ggml-quants.h possible in C++ code (llama/5338)
* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-10 09:55:47 +02:00
Dr. Tom Murphy VII Ph.D 55bcd62a4b
ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325)
* Avoid duplicating function calls when using MIN/MAX macros.

Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating at the expression we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-10 09:55:46 +02:00
Kawrakow 0ed762d691
iq2_xxs: tune quantization (llama/5320)
We get slightly better PPL, and we cut quantization time in
nearly half.

The trick is to 1st quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-10 09:55:46 +02:00
slaren 1b5bb7792e
cuda : fix LLAMA_CUDA_F16 (llama/5262) 2024-02-10 09:55:46 +02:00
Georgi Gerganov 9b735cea77
metal : add im2col F32 dst support (llama/5132) 2024-02-10 09:55:46 +02:00
JidongZhang-THU 12c462d656
llava : add MobileVLM support (llama/5132)
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Supoort avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-10 09:55:46 +02:00
slaren fc7b0e2c28
ggml : limit n_threads to the max n_tasks (llama/5238) 2024-02-10 09:55:46 +02:00
Jared Van Bortel f850a067ed
kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226) 2024-02-10 09:55:46 +02:00
Michael Podvitskiy f75e1197f1
ggml : add abort_callback for cpu backend (ggml/725)
* a way to use abort_callback with the cpu backend

* whisper update
2024-02-10 09:55:46 +02:00
Georgi Gerganov aa8a75e287
extra : update sync scripts 2024-02-10 09:55:19 +02:00
Valentin Gosu 80e8a2ea39
server : allow CORS request with authorization headers (#1850)
Whisper plugin in Obsidian requires an API key which is
then sent as an authorization header.
However, the presence of an authorization header requires
a CORS Preflight, so both the OPTIONS method and
the Access-Control-Allow-Headers: authorization must be
handled.
2024-02-09 17:42:41 +02:00
Neuman Vong 19f8048139
whisper.android : how to build with CLBlast (#1809)
* FetchContent

* OpenCL

* Documentation and make optional

* Specify GGML build options in build.gradle

* Use gradle properties

* @ggerganov

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* @gpokat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-09 17:39:05 +02:00
Didzis Gosko 0f80e5a80a
whisper : expose CUDA device setting in public API (#1840)
* Makefile : allow to override CUDA_ARCH_FLAG

* whisper : allow to select GPU (CUDA) device from public API
2024-02-09 17:27:47 +02:00
Didzis Gosko b6559333ff
make : add macOS deployment target option (#1839) 2024-02-09 17:26:29 +02:00
Georgi Gerganov 434b8f3b96
talk-llama : stream response (#1121) 2024-02-06 19:56:12 +02:00
Georgi Gerganov 7a74e929c8
sync : ggml (#0) 2024-01-30 21:30:26 +02:00
Kawrakow 361ecebe90
ggml : fix IQ3_XXS on Metal (llama/5219)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:28:00 +02:00
Georgi Gerganov 807cbc672e
sync : ggml (llama/0) 2024-01-30 21:27:59 +02:00
Kawrakow 98ae5276b7
Faster AVX2 dot product for IQ2_XS (llama/5187)
* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 imrovement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Peter Reid <peter@peterreid.net>
2024-01-30 21:27:59 +02:00
Kawrakow 6adb969b09
SOTA 3-bit quants (llama/5196)
* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 21:27:59 +02:00
Paul Tsochantaris 8a7d6ff51a
ggml alloc: Fix for null dereference on alloc failure (llama/5200)
* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix
2024-01-30 21:27:59 +02:00
Jared Van Bortel 25f650a8e8
Nomic Vulkan backend (llama/4456)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: niansa <anton-sa@web.de>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: ToKiNoBug <tokinobug@163.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 21:27:59 +02:00
slaren 44e517f074
ggml : add max buffer sizes to opencl and metal backends (llama/5181) 2024-01-30 21:27:59 +02:00
Paul Tsochantaris cb9de61659
metal : free metal objects (llama/5161)
* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix
2024-01-30 21:27:59 +02:00
Georgi Gerganov a2ef80d66f
gguf : fix comparison (ggml/715)
ggml-ci
2024-01-30 21:27:59 +02:00