Commit graph

181 commits

Author SHA1 Message Date
Kawrakow f904b31a7d
Add ability to use importance matrix for all k-quants (llama/4930)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-17 21:21:09 +02:00
Kawrakow dabc964d83
2-bit quantizations (llama/4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-14 10:54:09 +02:00
Johannes Gäßler 435847891c
ggml: cache sin/cos for RoPE (llama/4908) 2024-01-14 00:11:45 +02:00
texmex76 9aa9f3b84e
gguf : fix potential infinite for-loop (llama/4600)
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
2024-01-14 00:11:44 +02:00
slaren 70840aed5f
llama : ggml-backend integration (llama/4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (llama/4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (llama/4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 21:55:42 +02:00
Kawrakow 3fa98f4395
Importance Matrix calculation (llama/4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 21:55:41 +02:00
Kawrakow 97b12212dd
ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:01 +02:00
Timothy Cronin 73072a7c73
ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693) 2024-01-11 21:50:00 +02:00
Halalaluyafail3 338442d773
Fix execlp call (ggml/689)
NULL can be an integer constant expression with the value zero, in this case the behavior would be undefined because of an incorrect type being passed to the variable arguments.
2024-01-11 21:50:00 +02:00
Kawrakow 10651bddf6
SOTA 2-bit quants (llama/4773)
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slighty faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slighty faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:50:00 +02:00
Georgi Gerganov c46a74a19d
ggml : do not sched_yield when calling BLAS (llama/4761)
* ggml : do not sched_yield when calling BLAS

ggml-ci

* ggml : fix do_yield logic

ggml-ci

* ggml : simplify do_yield logic

ggml-ci
2024-01-11 21:50:00 +02:00
automaticcat dbe29d4e33 ggml : add ggml_cpu_has_avx_vnni() (llama/4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation upgate

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
Guillaume Wenzek cf6f1e4181 ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)
* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:43:51 +02:00
Georgi Gerganov e77b27c331
sync : ggml (VMM, sync-ggml-am, dotprod ARM fixes, CUDA fixes) (#1691)
* scripts : add sync-ggml-am.sh

* sync : ggml (VMM, ARM dot prod fix, etc.)

* build : fix CUDA build

* ggml : fix some mul mat cases + add tests for src1 F16

dbd02958fa
2023-12-29 11:30:47 +02:00
Georgi Gerganov 3a5302108d
sync : ggml (ggml_scale, ggml_row_size, etc.) (#1677)
* sync : ggml

* sync : llama.cpp

* talk-llama : fix obsolete param

* ggml-alloc : fix ggml_tallocr_is_own

* talk.wasm : update to new ggml

* ggml : fix type punning in ggml_scale

* ggml : cuda jetson + arm quants warnings
2023-12-22 17:53:39 +02:00
Georgi Gerganov 8171e621fc
sync : ggml (Metal fixes, new ops, tests) (#1633)
* sync : ggml (Metal fixes, new ops, tests)

* cuda : fix bin bcast when src1 and dst have different types
2023-12-13 21:55:03 +02:00
Georgi Gerganov afce6fa113
sync : ggml (new ops, new backend, etc) (#1602)
* sync : ggml (new ops, new backend, etc)

* whisper : remove obsolete broadcasting code

* ggml : remove backend self-registers + fix ggml_concat + n_task logic

* metal : fix assert

* metal : print resource path

* whisper : fix bug if metal init fails
2023-12-07 22:27:19 +02:00
Georgi Gerganov e369243ebd
ggml : re-enable blas for src0 != F32 (#1583) 2023-12-01 23:57:52 +02:00
Georgi Gerganov d4353e48f7
sync : ggml (ggml-alloc + linker + gguf fixes) (#1501) 2023-11-17 10:00:07 +02:00
Georgi Gerganov b0502836b8
whisper : add full CUDA and Metal offloading (#1472)
* whisper : migrate to ggml-backend

* whisper : fix logit reading

* whisper : fix tensor allocation during load

* whisper : fix beam-search with CUDA

* whisper : free backends + fix compile warning

* whisper : print when CUDA is enabled

* whisper : fix CoreML

* make : clean-up

* talk : fix compile warning

* whisper : support ggml_conv with CUDA and Metal (#1473)

* ggml : add CUDA support for ggml_conv

* whisper : remove ggml_repeat for conv bias + single backend

* cuda : fix im2col kernel

* metal : add im2col support + mul mat-vec f16 x f16

* bench-all : add q4 models

* whisper : clean-up

* quantize-all : fix

* ggml : im2col opts

* whisper : avoid whisper_model_data wrapper

* whisper : add note that ggml_mul_mat_pad does not work with CUDA

* whisper : factor out graph compute in common function

* whisper : fixes

* whisper : fix UB with measure buffers

* whisper : try to fix the parallel whisper_state functionality (#1479)

* whisper : try to fix the parallel whisper_state functionality

* whisper : fix multi-state Metal

* whisper : free backend instances in whisper_state
2023-11-12 15:31:08 +02:00
Georgi Gerganov 0cbef75422
ggml : fix MIN / MAX macro re-definition 2023-11-07 16:08:46 +02:00
Georgi Gerganov f96e1c5b78
sync : ggml (backend v2, k-quants, CUDA opts, Metal opts, etc.) (#1422)
* sync : ggml (backend v2, k-quants, CUDA opts, Metal opts, etc.)

* metal : allow env metal variable to override resource path (#1415)

* Allow env variable to override resource path

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* sync : restore common / main from `master`

* sync : restore whisper from `master`

* talk-llama : update to latest llama.cpp

* ruby : fix build

* ggml : fix 32-bit ARM build

* ggml : fix MIN / MAX macro collisions + update ios bindings

* ggml : fix ifdefs and MIN / MAX again

* exampels : fix Obj-C and Swift examples

* ggml : fix 32-bit ARM compatibility

* ggml : one more attempt to fix 32-bit ARM compat

* whisper : fix support for larger graphs

---------

Co-authored-by: Chris Raethke <codesoda@users.noreply.github.com>
2023-11-03 21:35:05 +02:00
Georgi Gerganov 80c1512fd5
sync : ggml (const correctness) 2023-09-15 14:49:56 +03:00
Georgi Gerganov b8432f28f4
metal : add F32 support + update bench output 2023-09-15 13:56:08 +03:00
Georgi Gerganov 93935980f8
whisper : Metal and ggml-alloc support (#1270)
* metal : init

* whisper : factor out graph builds

* whisper : allocate encoder and decoder using ggml-alloc

* whisper : ggml-alloc is now supported

* whisper : CoreML support ggml-alloc

* build : fix ggml-alloc

* ios : update submodule

* extra : update sync-ggml.sh script to also sync ggml-alloc

* ci : see if this is causing the crash

* whisper : refactor ggml-alloc init

* whisper.android : try to fix build

* whisper : initial Metal version

* ci : try to debug vmem issue

* metal : decoder works on GPU!

* metal : add multi-decoder support

* ggml : fix ggml_nbytes (probably temp solution)

* metal : run "cross" step on the GPU

* whisper : remove ggml_repeat in the encoder

* whisper : offload the Encoder to Metal

* ggml : use simpler ggml_bytes() implementation

* ggml-alloc : try to make CI happy by reducing vram to 128GB

* whisper : add whisper_allocr to wrap ggml_allocr

* whisper : factor out alloc init in a function

* cmake : update to support Metal build

* whisper : add <functional> header

* objc : fix build (no Metal yet)

* ios : add Metal support

* swiftui : fix build

* metal : speed-up KQ multiplication

* metal : sync latest llama.cpp kernels

* readme : add Metal info

* ios : update submodule

* coreml : add code to toggle Core ML config (CPU, ANE, GPU)

* bench : fix timings by running a pre-heat

* bench : start benching the decoder

* whisper : add ggml_mul_mat_pad

* bench : fix uninitialized vars

* whisper : add comment for disabling mul-mat padding

* whisper : add description of ggml_mul_mat_pad

* whisper : clean-up ggml_mul_mat_pad

* metal : remove the "concurrent" flag

* bench : variable n_past

* ios : update SPM package
2023-09-15 12:18:18 +03:00
Georgi Gerganov 3fec2119e6
whisper : fix bench regression + fix performance when using CPU BLAS (#1275)
* whisper : fix bench regression

* ggml : use sched_yield when using BLAS + add comment
2023-09-12 13:54:04 +03:00
Georgi Gerganov b39809668a
sync : ggml (HBM + Metal + style) (#1264) 2023-09-08 17:58:31 +03:00
Przemysław Pawełczyk b55b505690
build : do not use _GNU_SOURCE gratuitously (#1129)
* Do not use _GNU_SOURCE gratuitously.

What is needed to build whisper.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions,
plus some stuff from BSD that is not specified in POSIX.1.

Well, that was true until NUMA support was added recently in ggml,
so enable GNU libc extensions for Linux builds to cover that.

There is no need to penalize musl libc which simply follows standards.

Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in 3rd party app, as they can build it with
minimal FTM (_XOPEN_SOURCE=600) or other FTM depending on their needs.

It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.

* examples : include SDL headers before other headers

Avoid macOS build error when _DARWIN_C_SOURCE is not defined, brought by
SDL2 relying on Darwin extension memset_pattern4/8/16 (from string.h).

* make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK

* make : use BSD-specific FTMs to enable alloca on BSDs

* make : fix OpenBSD build by exposing newer POSIX definitions

* cmake : follow recent FTM improvements from Makefile
2023-09-07 12:36:14 +03:00
Przemysław Pawełczyk ace6c12ec6
ggml : posixify pagesize (#1251)
* ggml : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD

sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml.c

* metal : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD

sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml-metal.m
2023-09-06 18:19:36 +03:00
Georgi Gerganov c3f319d7c2
ggml : sync latest llama.cpp (view_src + alloc improvements) (#1247)
* ggml : sync latest llama.cpp (view_src + alloc improvements)

* ggml : fix build
2023-09-05 20:57:27 +03:00
Georgi Gerganov 59a3d0cb57
ggml : sync (ggml-alloc, GPU, eps, etc.) (#1220)
* ggml : sync (ggml-alloc, GPU, eps, etc.)

* ggml : fix build

* wasm : fix build
2023-09-05 13:54:40 +03:00
ChangSeok Oh 8e30bf3c02
ggml : fix compilation errors incurred by -Werror (#1227)
The -Werror warning option turns all warnings into errors. This PR makes
the compiler happy to build ggml.c and whisper.cpp with the stricter option.
2023-08-30 22:09:15 +03:00
Przemysław Pawełczyk 25466aa1c3
ggml : fix compiling when SSE3 is available but not SSSE3 (#1210)
It got broken in commit 3998465721.
2023-08-27 21:37:31 +03:00
Przemysław Pawełczyk 601c2d2181
ggml : detect SSSE3 (#1211)
* ggml : add ggml_cpu_has_ssse3

* whisper : show SSSE3 in system info

* make : detect SSSE3 via cpuinfo
2023-08-27 21:36:41 +03:00
alonfaraj 3998465721
ci : more platforms coverage (#1101)
* add multi platform

* add image name

* fix

* fix /bin/sh path

* add missing \

* add all platforms for check

* remove platforms

* remove s390x

* - add arm v6
- format run cmd

* remove arm v6

* - bump checkout to v3
- use setup emsdk action
- add arch to all ubuntu jobs

* mymindstorm/setup-emsdk to v12

* add missing QEMU step

* add fail-fast: false for debug

* add freebsd

* remark all jobs except freebsd for test

* add sudo

* enable all tests again

* format

* check __AVX__ support before include immintrin.h

* try auto detect flag by cmake

* fix check for immintrin.h

* fix include check for immintrin.h

* Remove all platforms for sanitizer build except amd64

We have no clue why they failed.

---------

Co-authored-by: Alon Faraj <alon.faraj@mapcore.com>
2023-07-16 23:00:34 +03:00
Georgi Gerganov 8ba42095c5
Revert "ggml : do not use _GNU_SOURCE gratuitously (#1027)"
This reverts commit 3f7a03ebe3.
2023-07-02 21:53:52 +03:00
Georgi Gerganov d6509bf78d
ggml : sync latest repo (mostly refactoring changes) 2023-07-02 21:46:09 +03:00
Przemysław Pawełczyk 3f7a03ebe3
ggml : do not use _GNU_SOURCE gratuitously (#1027)
* Do not use _GNU_SOURCE gratuitously.

What is needed to build whisper.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions.

There is no need to penalize musl libc which simply follows standards.

Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in 3rd party app, as they can build it with
minimal FTM (_XOPEN_SOURCE=600) or other FTM depending on their needs.

It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.

* examples : include SDL headers before other headers

This is an attempt at fixing macOS build error coming from SDL2 relying
on Darwin extension memset_pattern4/8/16 coming from Apple's string.h.
2023-06-25 16:34:30 +03:00
Georgi Gerganov 5feb0dffba
ggml : sync latest ggml lib 2023-06-25 14:30:44 +03:00
Georgi Gerganov 429b9785c0
ggml : update WASM SIMD 2023-05-20 20:00:06 +03:00
Georgi Gerganov e410cfc3ce
ggml : sync latest ggml repo
- new Q4 and Q8 quantization
- updated CUDA
2023-05-20 18:56:30 +03:00
Georgi Gerganov aaf0d41c7c
ggml : add AVX dot products 2023-05-14 18:56:46 +03:00
Georgi Gerganov e693074aa6
ggml : sync latest ggml
- New Q4 and Q5 formats
- Various improvements
2023-05-14 18:04:23 +03:00
Georgi Gerganov 5974c8facd ggml : fix 32-bit ARM build + quantization 2023-05-02 21:52:26 +03:00
Georgi Gerganov 0bcb64b184
ggml : sync ggml (clBLAST + tensor names) 2023-05-02 21:24:18 +03:00
Georgi Gerganov feac80dd3f
ggml : fix UB (int << 31) 2023-04-30 22:27:30 +03:00
Georgi Gerganov 794b162a46
whisper : add integer quantization support (#540)
* whisper : add integer quantization support

* examples : add common-ggml + prepare to add "quantize" tool

* whisper : quantization tool ready

* whisper : fix F32 support

* whisper : try to fix shared lib linkage

* wasm : update quantized models to Q5

* bench.wasm : remove "medium" button

* bench.wasm : fix custom model button

* ggml : add Q5_0 and Q5_1 WASM SIMD

* wasm : add quantized models to all WASM examples

* wasm : bump DB version number to 2

* talk-llama : update example to latest llama.cpp

* node : increase test timeout to 10s

* readme : add information for model quantization

* wasm : add links to other examples
2023-04-30 18:51:57 +03:00
Georgi Gerganov 0ccd6746c9
ggml : fix WASM build 2023-04-29 21:37:23 +03:00
Georgi Gerganov d9b550c0a1
ggml : fix 32-bit ARM NEON (#836)
* ggml : add support for 32-bit ARM

* ggml : fix

* ggml : fix
2023-04-29 21:33:33 +03:00
Georgi Gerganov e9b091c92a
ggml : use vzip instead of vuzp for consistency 2023-04-29 21:14:09 +03:00
Georgi Gerganov 1f30b99208
ggml : fix WASM build 2023-04-29 20:21:25 +03:00
Georgi Gerganov 05c3ea3bc8
ggml : sync with ggml repo (warning fixes + asserts) 2023-04-29 19:33:28 +03:00
Georgi Gerganov acec73ab6e
ggml : sync latest ggml + llama.cpp updates (quantization) 2023-04-29 12:32:28 +03:00
Jhen-Jie Hong ea1f8a50d4
ggml, ci : fix build on whisper.android (ARM_NEON) + add CI (#764)
* ggml : fix undefined symbol by remove inline handle

* ggml : make own ggml_aligned_malloc function

* ci: add ios/android build
2023-04-15 14:21:58 +03:00
Georgi Gerganov 677ad754a0
ggml : sync latest ggml 2023-04-14 19:20:39 +03:00
novag 463e46338c
ggml : fix q4_1 dot product types (#759)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-14 13:34:20 +03:00
Georgi Gerganov 2f889132c6
ggml : sync latest changes from ggml and llama.cpp 2023-04-13 18:53:44 +03:00
Georgi Gerganov ebef1e8620
ggml : fix WASM build 2023-04-10 23:18:29 +03:00
Georgi Gerganov 69b8503935
ggml : backport llama.cpp updates (close #709)
- About x2 overall performance improvement on Apple Silicon
- Results should now be the same for different number of threads (not
  tested)
2023-04-10 22:28:54 +03:00
Georgi Gerganov 4a0deb8b1e
talk-llama : add new example + sync ggml from llama.cpp (#664)
* talk-llama : talk with LLaMA AI

* talk.llama : disable EOS token

* talk-llama : add README instructions

* ggml : fix build in debug
2023-03-27 21:00:32 +03:00
Georgi Gerganov f3ee4a9673
whisper : reduce memory usage during inference (#431)
* ggml : add "scratch" buffer support

* ggml : support for scratch ring-buffer

* ggml : bug fix in ggml_repeat()

* ggml : error on scratch buffer overflow

* whisper : use scratch buffers during inference (base model only)

* whisper : update memory usage for all models

* whisper : fix encoder memory usage

* whisper : use whisper_context functions instead of macros

* whisper : fix FF + remove it from README

* ggml : reuse ggml_new_i32

* ggml : refactor the scratch buffer storage

* whisper : reorder scratch buffers in the decoder

* main : add option to disable temp fallback

* Update README.md
2023-02-04 09:45:52 +02:00
fitzsim ae16c21e9c
whisper : PPC64 big-endian support (#398)
* ggml : set cache line size to 128 on POWER9

* whisper : add PPC64 big endian support
2023-01-23 20:48:10 +02:00
Georgi Gerganov 1290fc6457
bench : add memcpy and ggml_mul_mat benchmarks 2023-01-18 20:31:46 +02:00
Georgi Gerganov 4ef3398e8f
ggml : remove obsolete zeroing + comment fixes (#390) 2023-01-08 20:21:03 +02:00
Abitofevrything 8d7b29cedd
ggml : correct behaviour of ggml_vec_sum_f32 (#390) 2023-01-08 20:06:09 +02:00
Georgi Gerganov 52a3e0c92a
ggml : improve vec_dot_f16 unrolling in flash_attn_f16 2023-01-08 11:41:18 +02:00
Georgi Gerganov f30b5d322c
ggml : fix bug in new soft max computation 2023-01-07 21:00:07 +02:00
Georgi Gerganov d347a59a5f
ggml : when using BLAS start only 1 CPU thread 2023-01-07 19:48:56 +02:00
Georgi Gerganov 6394c906af
ggml : fix running tasks with variable number of threads 2023-01-07 19:20:18 +02:00
Georgi Gerganov 74ffa14e1d
ggml : unroll ggml_vec_dot_f16 in ggml_compute_forward_flash_attn_f16 2023-01-07 19:19:40 +02:00
Georgi Gerganov 65fdcbbbbb
whisper : revert accidental MB change 2023-01-07 16:18:21 +02:00
Georgi Gerganov d61d55cd4b
ggml : speed-up soft max via Accelerate + unroll 2023-01-07 16:16:42 +02:00
Georgi Gerganov d51fc3ee0a
ggml : use vDSP_sve and vDSP_maxv from Accelerate 2023-01-07 16:10:16 +02:00
Georgi Gerganov f82a7dd019
ggml : make gcc happy (minor) 2023-01-07 09:34:39 +02:00
Abitofevrything a62170c656
ggml : add SSE3 and fp16 conversion lookup table (#368)
* Improves WASM performance:
  On MacBook M1 Pro, I observe 25% faster using Firefox and 35% faster using Chrome

* Add support for SSE3 SIMD

* Add SSE3 to system information

* Add Imath support for fp16-fp32 conversions

* Add Imath to system information

* Wrap Imath calls to avoid static function warnings

* Drop Imath; Add lookup table for f16 -> f32 conversions

* Remove TODO comments

* Update SSE3 to new macro arguments

* Correct updated macro definitions

* Prefer static inline where possible

* ggml : static inlines + add public f16 <-> f32 conversions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-01-06 18:45:59 +02:00
Thomas Fitzsimmons 1944e7c33e whisper : document POWER VSX support 2023-01-05 23:53:00 +02:00
Thomas Fitzsimmons 49a8dd6732 ggml : reorganize POWER9 ppc64le SIMD code 2023-01-05 23:53:00 +02:00
Thomas Fitzsimmons 8c7f642286 ggml : change f16 load and store macro arguments 2023-01-05 23:53:00 +02:00
Georgi Gerganov 0a0cfa7985
ggml : add void to argument-less functions 2023-01-05 21:40:38 +02:00
Georgi Gerganov d51c5eb906
ggml : define MIN / MAX only if not defined (minor) 2023-01-05 21:16:52 +02:00
Thomas Fitzsimmons 424c410c42 ggml : improve f16 acceleration for POWER9 ppc64le 2022-12-31 10:02:19 +02:00
Georgi Gerganov 4e0b2069e7
ggml : barrier refactor + static functions 2022-12-28 19:00:53 +02:00
Georgi Gerganov ac521a566e
ggml : simplify the SIMD code (#324)
* ggml : simplify the SIMD code

* ggml : generic reduce for all register sizes + comments
2022-12-24 10:22:28 +02:00
Georgi Gerganov 7282e2109e
ggml : use vaddvq_f32 for slightly more efficient reduce 2022-12-23 13:48:19 +02:00
Thomas Fitzsimmons 466ceebb78 ggml : add f16 acceleration for POWER9 ppc64le 2022-12-23 13:23:58 +02:00
Andy Maloney 493d94130d
ggml : make consts static (#317)
These shouldn't be able to be referenced outside the compilation unit.
2022-12-23 11:05:27 +02:00
Andy Maloney fa463313ad
minor : small code cleanups (#302)
* Small code cleanups

- fix indentation
- remove extra semicolons
- remove extra break after returns in case statements
- remove unnecessary call to .data() on string
- use empty() instead of checking size()
- no need to check for nullptr before free
- remove unnecessary initialization of string to ""

* minor : switch case always break

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2022-12-22 17:06:19 +02:00
Kevin Brothaler e1432dd91a Check for both __ARM_NEON and __ARM_FEATURE_FMA so that the project can be compiled for armv7a.
Android armeabi-v7a's NEON support doesn't support FMA unless configured with `-mfpu=neon-fp-armv8`, which would need runtime checks.
* Also removed ABI filter from Android project.
2022-12-22 16:47:54 +02:00
katsu560 419b8a6402 Add AVX,AVX2 support for ggml_vec_scale_f32 2022-12-17 19:40:10 +02:00
Georgi Gerganov a7047b2a28
ggml : implement ggml_compute_forward_dup_f16() special cases 2022-12-16 21:50:41 +02:00
Georgi Gerganov 0f11759406
ggml : make more compatible with c99 (#262) 2022-12-16 18:00:12 +02:00
Georgi Gerganov f66ac6dc4f
ggml : fix indentation 2022-12-13 23:09:21 +02:00
Georgi Gerganov 9955fa4ed7
ggml : make compatible with c99 (#262) 2022-12-13 23:07:49 +02:00
Roland Rabien e70d47baab
Remove C++20 requirement (#257)
* Remove C++20 requirement

* Roll back C features not supported in VS2017
2022-12-11 20:03:07 +02:00
Georgi Gerganov 3b1aacbe6d talk : talk with AI in the terminal 2022-12-10 16:51:58 +02:00
Georgi Gerganov 50a061b313
ggml : add alternative cblas_sgemm call 2022-12-08 23:48:04 +02:00
Al Hoang 04a16bbf11 fix compilation on haiku 2022-12-08 09:20:57 +02:00
Georgi Gerganov b6597539f9
ggml : fix typo in previous commit 2022-12-06 22:12:57 +02:00
Georgi Gerganov 9a4b7a916e
ggml : use macros to inline FP16 <-> FP32 conversions 2022-12-06 22:09:26 +02:00
Georgi Gerganov f8ec718b76
ggml : add F16C CPU flag check 2022-12-06 21:56:56 +02:00