Commit graph

171 commits

Author SHA1 Message Date
Georgi Gerganov e36ecdccc8
build : on Mac OS enable Metal by default (#2901)
* build : on Mac OS enable Metal by default

* make : try to fix build on Linux

* make : move targets back to the top

* make : fix target clean

* llama : enable GPU inference by default with Metal

* llama : fix vocab_only logic when GPU is enabled

* common : better `n_gpu_layers` assignment

* readme : update Metal instructions

* make : fix merge conflict remnants

* gitignore : metal
2023-09-04 22:26:24 +03:00
Leng Yue 5b8530d88c
make : add speculative example (#3003) 2023-09-04 13:39:57 +03:00
Alon afc43d5f82
cov : add Code Coverage and codecov.io integration (#2928)
* update .gitignore

* makefile: add coverage support (lcov, gcovr)

* add code-coverage workflow

* update code coverage workflow

* wun on ubuntu 20.04

* use gcc-8

* check why the job hang

* add env vars

* add LLAMA_CODE_COVERAGE=1 again

* - add CODECOV_TOKEN
- add missing make lcov-report

* install lcov

* update make file -pb flag

* remove unused  GGML_NITER from workflows

* wrap coverage output files in COV_TARGETS
2023-09-03 11:48:49 +03:00
Cebtenzzre bc054af97a
make : support overriding CFLAGS/CXXFLAGS/CPPFLAGS/LDFLAGS (#2886)
* make : remove unused -DGGML_BIG_ENDIAN

* make : put preprocessor stuff in CPPFLAGS

* make : pass Raspberry Pi arch flags to g++ as well

* make : support overriding CFLAGS/CXXFLAGS/CPPFLAGS/LDFLAGS

* make : fix inverted conditional
2023-09-03 08:26:59 +03:00
Cebtenzzre 6c9c23429b
make : use unaligned vector moves on MinGW (#2945)
Fixes #2922
2023-09-01 16:53:14 +03:00
Cebtenzzre ef15649972
build : fix most gcc and clang warnings (#2861)
* fix most gcc and clang warnings

* baby-llama : remove commented opt_params_adam

* fix some MinGW warnings

* fix more MinGW warnings
2023-09-01 16:34:50 +03:00
Tameem 5aec2cfaac
ggml : add RISC-V vector intrinsics support (#2929)
* added support for RISCV CFLAGS & native compile + cross compile options

* Add RISC-V Vector Intrinsics Support

Added RVV intrinsics for following
   ggml_vec_dot_q4_0_q8_0
   ggml_vec_dot_q4_1_q8_1
   ggml_vec_dot_q5_0_q8_0
   ggml_vec_dot_q5_1_q8_1
   ggml_vec_dot_q8_0_q8_0

Co-authored-by: Sharafat <sharafat.hussain@10xengineers.ai>
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

---------

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
Co-authored-by: moiz.hussain <moiz.hussain@10xengineers.ai>
Co-authored-by: Sharafat <sharafat.hussain@10xengineers.ai>
2023-09-01 16:27:40 +03:00
staviq 4dcd47d71d
logs : fix mingw-like builds (fixes #2898) (#2911)
* fix mingw-like builds

* formatting

* make LOG_COMPAT easier to override and extend

* simplify win detection

* fix for #2940
2023-09-01 12:07:06 +03:00
Georgi Gerganov c90d135eb4
examples : fix underscore in beam-search + .gitignore (close #2900) 2023-08-30 12:53:24 +03:00
alonfaraj 9509294420
make : add test and update CI (#2897)
* build ci: run make test

* makefile:
- add all
- add test

* enable tests/test-tokenizer-0-llama

* fix path to model

* remove gcc-8 from macos build test

* Update Makefile

* Update Makefile
2023-08-30 12:42:51 +03:00
staviq 8341a25957
main : log file (#2748)
* initial, base LOG macro

* add *.log to .gitignore

* added basic log file handler

* reverted log auto endline to better mimic printf

* remove atomics and add dynamic log target

* log_enable/disable, LOG_TEE, basic usage doc

* update .gitignore

* mv include to common, params, help msg

* log tostring helpers, token vectors pretty prints

* main: replaced fprintf/LOG_TEE, some trace logging

* LOG_DISABLE_LOGS compile flag, wrapped f in macros

* fix LOG_TEELN and configchecker

* stub LOG_DUMP_CMDLINE for WIN32 for now

* fix msvc

* cleanup main.cpp:273

* fix stray whitespace after master sync

* log : fix compile warnings

- do not use C++20 stuff
- use PRIu64 to print uint64_t
- avoid string copies by using const ref
- fix ", ##__VA_ARGS__" warnings
- compare strings with == and !=

* log : do not append to existing log + disable file line func by default

* log : try to fix Windows build

* main : wip logs

* main : add trace log

* review: macro f lowercase, str append to sstream

* review: simplify ifs and str comparisons

* fix MSVC, formatting, FMT/VAL placeholders

* review: if/else cleanup

* review: if/else cleanup (2)

* replace _ prefix with _impl suffix

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-30 09:29:32 +03:00
Cebtenzzre 849408957c
tests : add a C compliance test (#2848)
* tests : add a C compliance test

* make : build C compliance test by default

* make : fix clean and make sure C test fails on clang

* make : move -Werror=implicit-int to CFLAGS
2023-08-30 09:20:26 +03:00
Cebtenzzre d4b5e16c32
make : fix clang tests build, add missing examples (#2859)
* make : do not pass headers to the compiler

This fixes building tests with clang.

* make : add missing examples

* make : fix build-info.h dependencies
2023-08-29 11:42:41 +03:00
Georgi Gerganov 3a007648f2
metal : add option to disable debug logs (close #2764) 2023-08-29 11:33:46 +03:00
alonfaraj 75fafcbccc
make : fix tests build (#2855)
* makefile:
- fix test name
- add missing tests build

* editorconfig : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-28 18:38:35 +03:00
Henri Vasserman 6bbc598a63
ROCm Port (#1087)
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP

---------

Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25 12:09:42 +03:00
Georgi Gerganov 6381d4e110
gguf : new file format with flexible meta data (beta) (#2398)
* gguf : first API pass

* gguf : read header + meta data

* gguf : read tensor info

* gguf : initial model loading - not tested

* gguf : add gguf_get_tensor_name()

* gguf : do not support passing existing ggml_context to gguf_init

* gguf : simplify gguf_get_val

* gguf : gguf.c is now part of ggml.c

* gguf : read / write sample models

* gguf : add comments

* refactor : reduce code duplication and better API (#2415)

* gguf : expose the gguf_type enum through the API for now

* gguf : add array support

* gguf.py : some code style changes

* convert.py : start a new simplified implementation by removing old stuff

* convert.py : remove GGML vocab + other obsolete stuff

* GGUF : write tensor (#2426)

* WIP: Write tensor

* GGUF : Support writing tensors in Python

* refactor : rm unused import and upd todos

* fix : fix errors upd writing example

* rm example.gguf

* gitignore *.gguf

* undo formatting

* gguf : add gguf_find_key (#2438)

* gguf.cpp : find key example

* ggml.h : add gguf_find_key

* ggml.c : add gguf_find_key

* gguf : fix writing tensors

* gguf : do not hardcode tensor names to read

* gguf : write sample tensors to read

* gguf : add tokenization constants

* quick and dirty conversion example

* gguf : fix writing gguf arrays

* gguf : write tensors one by one and code reuse

* gguf : fix writing gguf arrays

* gguf : write tensors one by one

* gguf : write tensors one by one

* gguf : write tokenizer data

* gguf : upd gguf conversion script

* Update convert-llama-h5-to-gguf.py

* gguf : handle already encoded string

* ggml.h : get array str and f32

* ggml.c : get arr str and f32

* gguf.py : support any type

* Update convert-llama-h5-to-gguf.py

* gguf : fix set is not subscriptable

* gguf : update convert-llama-h5-to-gguf.py

* constants.py : add layer norm eps

* gguf.py : add layer norm eps and merges

* ggml.h : increase GGML_MAX_NAME to 64

* ggml.c : add gguf_get_arr_n

* Update convert-llama-h5-to-gguf.py

* add gptneox gguf example

* Makefile : add gptneox gguf example

* Update convert-llama-h5-to-gguf.py

* add gptneox gguf example

* Update convert-llama-h5-to-gguf.py

* Update convert-gptneox-h5-to-gguf.py

* Update convert-gptneox-h5-to-gguf.py

* Update convert-llama-h5-to-gguf.py

* gguf : support custom alignment value

* gguf : fix typo in function call

* gguf : mmap tensor data example

* fix : update convert-llama-h5-to-gguf.py

* Update convert-llama-h5-to-gguf.py

* convert-gptneox-h5-to-gguf.py : Special tokens

* gptneox-main.cpp : special tokens

* Update gptneox-main.cpp

* constants.py : special tokens

* gguf.py : accumulate kv and tensor info data + special tokens

* convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens

* gguf : gguf counterpart of llama-util.h

* gguf-util.h : update note

* convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens

* convert-llama-h5-to-gguf.py : special tokens

* Delete gptneox-common.cpp

* Delete gptneox-common.h

* convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer

* gptneox-main.cpp : gpt2 bpe tokenizer

* gpt2 bpe tokenizer (handles merges and unicode)

* Makefile : remove gptneox-common

* gguf.py : bytesarray for gpt2bpe tokenizer

* cmpnct_gpt2bpe.hpp : comments

* gguf.py : use custom alignment if present

* gguf : minor stuff

* Update gptneox-main.cpp

* map tensor names

* convert-gptneox-h5-to-gguf.py : map tensor names

* convert-llama-h5-to-gguf.py : map tensor names

* gptneox-main.cpp : map tensor names

* gguf : start implementing libllama in GGUF (WIP)

* gguf : start implementing libllama in GGUF (WIP)

* rm binary commited by mistake

* upd .gitignore

* gguf : calculate n_mult

* gguf :  inference with 7B model working (WIP)

* gguf : rm deprecated function

* gguf : start implementing gguf_file_saver (WIP)

* gguf : start implementing gguf_file_saver (WIP)

* gguf : start implementing gguf_file_saver (WIP)

* gguf : add gguf_get_kv_type

* gguf : add gguf_get_kv_type

* gguf : write metadata in gguf_file_saver (WIP)

* gguf : write metadata in gguf_file_saver (WIP)

* gguf : write metadata in gguf_file_saver

* gguf : rm references to old file formats

* gguf : shorter name for member variable

* gguf : rm redundant method

* gguf : get rid of n_mult, read n_ff from file

* Update gguf_tensor_map.py

* Update gptneox-main.cpp

* gguf : rm references to old file magics

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : quantization is working

* gguf : roper closing of file

* gguf.py : no need to convert tensors twice

* convert-gptneox-h5-to-gguf.py : no need to convert tensors twice

* convert-llama-h5-to-gguf.py : no need to convert tensors twice

* convert-gptneox-h5-to-gguf.py : simplify nbytes

* convert-llama-h5-to-gguf.py : simplify nbytes

* gptneox-main.cpp : n_layer --> n_block

* constants.py : n_layer --> n_block

* gguf.py : n_layer --> n_block

* convert-gptneox-h5-to-gguf.py : n_layer --> n_block

* convert-llama-h5-to-gguf.py : n_layer --> n_block

* gptneox-main.cpp : n_layer --> n_block

* Update gguf_tensor_map.py

* convert-gptneox-h5-to-gguf.py : load model in parts to save memory

* convert-llama-h5-to-gguf.py : load model in parts to save memory

* convert : write more metadata for LLaMA

* convert : rm quantization version

* convert-gptneox-h5-to-gguf.py : add file_type key

* gptneox-main.cpp : add file_type key

* fix conflicts

* gguf : add todos and comments

* convert-gptneox-h5-to-gguf.py : tensor name map changes

* Create gguf_namemap.py : tensor name map changes

* Delete gguf_tensor_map.py

* gptneox-main.cpp : tensor name map changes

* convert-llama-h5-to-gguf.py : fixes

* gguf.py : dont add empty strings

* simple : minor style changes

* gguf : use UNIX line ending

* Create convert-llama-7b-pth-to-gguf.py

* llama : sync gguf-llama.cpp with latest llama.cpp (#2608)

* llama : sync gguf-llama.cpp with latest llama.cpp

* minor : indentation + assert

* llama : refactor gguf_buffer and gguf_ctx_buffer

* llama : minor

* gitignore : add gptneox-main

* llama : tokenizer fixes (#2549)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* convert : update convert-new.py with tokenizer fixes (#2614)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* llama : sync gguf-llama with llama (#2613)

* llama : sync gguf-llama with llama

* tests : fix build + warnings (test-tokenizer-1 still fails)

* tests : fix wstring_convert

* convert : fix layer names

* llama : sync gguf-llama.cpp

* convert : update HF converter to new tokenizer voodoo magics

* llama : update tokenizer style

* convert-llama-h5-to-gguf.py : add token types

* constants.py : add token types

* gguf.py : add token types

* convert-llama-7b-pth-to-gguf.py : add token types

* gguf-llama.cpp :  fix n_head_kv

* convert-llama-h5-to-gguf.py : add 70b gqa support

* gguf.py : add tensor data layout

* convert-llama-h5-to-gguf.py : add tensor data layout

* convert-llama-7b-pth-to-gguf.py : add tensor data layout

* gptneox-main.cpp : add tensor data layout

* convert-llama-h5-to-gguf.py : clarify the reverse permute

* llama : refactor model loading code (#2620)

* llama : style formatting + remove helper methods

* llama : fix quantization using gguf tool

* llama : simplify gguf_file_saver

* llama : fix method names

* llama : simplify write_header()

* llama : no need to pass full file loader to the file saver

just gguf_ctx

* llama : gguf_file_saver write I32

* llama : refactor tensor names (#2622)

* gguf: update tensor names searched in quantization

* gguf : define tensor names as constants

* gguf : initial write API (not tested yet)

* gguf : write to file API (not tested)

* gguf : initial write API ready + example

* gguf : fix header write

* gguf : fixes + simplify example + add ggml_nbytes_pad()

* gguf : minor

* llama : replace gguf_file_saver with new gguf write API

* gguf : streaming support when writing files

* gguf : remove oboslete write methods

* gguf : remove obosolete gguf_get_arr_xxx API

* llama : simplify gguf_file_loader

* llama : move hparams and vocab from gguf_file_loader to llama_model_loader

* llama : merge gguf-util.h in llama.cpp

* llama : reorder definitions in .cpp to match .h

* llama : minor simplifications

* llama : refactor llama_model_loader (WIP)

wip : remove ggml_ctx from llama_model_loader

wip : merge gguf_file_loader in llama_model_loader

* llama : fix shape prints

* llama : fix Windows build + fix norm_rms_eps key

* llama : throw error on missing KV paris in model meta data

* llama : improve printing + log meta data

* llama : switch print order of meta data

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>

* gguf : deduplicate (#2629)

* gguf : better type names

* dedup : CPU + Metal is working

* ggml : fix warnings about unused results

* llama.cpp : fix line feed and compiler warning

* llama : fix strncpy warning + note token_to_str does not write null

* llama : restore the original load/save session implementation

Will migrate this to GGUF in the future

* convert-llama-h5-to-gguf.py : support alt ctx param name

* ggml : assert when using ggml_mul with non-F32 src1

* examples : dedup simple

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>

* gguf.py : merge all files in gguf.py

* convert-new.py : pick #2427 for HF 70B support

* examples/gguf : no need to keep q option for quantization any more

* llama.cpp : print actual model size

* llama.cpp : use ggml_elements()

* convert-new.py : output gguf (#2635)

* convert-new.py : output gguf (WIP)

* convert-new.py : add gguf key-value pairs

* llama : add hparams.ctx_train + no longer print ftype

* convert-new.py : minor fixes

* convert-new.py : vocab-only option should work now

* llama : fix tokenizer to use llama_char_to_byte

* tests : add new ggml-vocab-llama.gguf

* convert-new.py : tensor name mapping

* convert-new.py : add map for skipping tensor serialization

* convert-new.py : convert script now works

* gguf.py : pick some of the refactoring from #2644

* convert-new.py : minor fixes

* convert.py : update to support GGUF output

* Revert "ci : disable CI temporary to not waste energy"

This reverts commit 7e82d25f40.

* convert.py : n_head_kv optional and .gguf file extension

* convert.py : better always have n_head_kv and default it to n_head

* llama : sync with recent PRs on master

* editorconfig : ignore models folder

ggml-ci

* ci : update ".bin" to ".gguf" extension

ggml-ci

* llama : fix llama_model_loader memory leak

* gptneox : move as a WIP example

* llama : fix lambda capture

ggml-ci

* ggml : fix bug in gguf_set_kv

ggml-ci

* common.h : .bin --> .gguf

* quantize-stats.cpp : .bin --> .gguf

* convert.py : fix HF tensor permuting / unpacking

ggml-ci

* llama.cpp : typo

* llama : throw error if gguf fails to init from file

ggml-ci

* llama : fix tensor name grepping during quantization

ggml-ci

* gguf.py : write tensors in a single pass (#2644)

* gguf : single pass for writing tensors + refactoring writer

* gguf : single pass for writing tensors + refactoring writer

* gguf : single pass for writing tensors + refactoring writer

* gguf : style fixes in simple conversion script

* gguf : refactor gptneox conversion script

* gguf : rename h5 to hf (for HuggingFace)

* gguf : refactor pth to gguf conversion script

* gguf : rm file_type key and method

* gguf.py : fix vertical alignment

* gguf.py : indentation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* convert-gptneox-hf-to-gguf.py : fixes

* gguf.py : gptneox mapping

* convert-llama-hf-to-gguf.py : fixes

* convert-llama-7b-pth-to-gguf.py : fixes

* ggml.h : reverse GGUF_MAGIC

* gguf.py : reverse GGUF_MAGIC

* test-tokenizer-0.cpp : fix warning

* llama.cpp : print kv general.name

* llama.cpp : get special token kv and linefeed token id

* llama : print number of tensors per type + print arch + style

* tests : update vocab file with new magic

* editorconfig : fix whitespaces

* llama : re-order functions

* llama : remove C++ API + reorganize common source in /common dir

* llama : minor API updates

* llama : avoid hardcoded special tokens

* llama : fix MPI build

ggml-ci

* llama : introduce enum llama_vocab_type + remove hardcoded string constants

* convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested

* falcon-main.cpp : falcon inference example

* convert-falcon-hf-to-gguf.py : remove extra kv

* convert-gptneox-hf-to-gguf.py : remove extra kv

* convert-llama-7b-pth-to-gguf.py : remove extra kv

* convert-llama-hf-to-gguf.py : remove extra kv

* gguf.py : fix for falcon 40b

* falcon-main.cpp : fix for falcon 40b

* convert-falcon-hf-to-gguf.py : update ref

* convert-falcon-hf-to-gguf.py : add tensor data layout

* cmpnct_gpt2bpe.hpp : fixes

* falcon-main.cpp : fixes

* gptneox-main.cpp : fixes

* cmpnct_gpt2bpe.hpp : remove non-general stuff

* Update examples/server/README.md

Co-authored-by: slaren <slarengh@gmail.com>

* cmpnct_gpt2bpe.hpp : cleanup

* convert-llama-hf-to-gguf.py : special tokens

* convert-llama-7b-pth-to-gguf.py : special tokens

* convert-permute-debug.py : permute debug print

* convert-permute-debug-master.py : permute debug for master

* convert-permute-debug.py : change permute type of attn_q

* convert.py : 70b model working (change attn_q permute)

* Delete convert-permute-debug-master.py

* Delete convert-permute-debug.py

* convert-llama-hf-to-gguf.py : fix attn_q permute

* gguf.py : fix rope scale kv

* convert-llama-hf-to-gguf.py : rope scale and added tokens

* convert-llama-7b-pth-to-gguf.py : rope scale and added tokens

* llama.cpp : use rope scale kv

* convert-llama-7b-pth-to-gguf.py : rope scale fix

* convert-llama-hf-to-gguf.py : rope scale fix

* py : fix whitespace

* gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682)

* First pass at converting GGMLv3 LLaMA models to GGUF

* Cleanups, better output during conversion

* Fix vocab space conversion logic

* More vocab conversion fixes

* Add description to converted GGUF files

* Improve help text, expand warning

* Allow specifying name and description for output GGUF

* Allow overriding vocab and hyperparams from original model metadata

* Use correct params override var name

* Fix wrong type size for Q8_K

Better handling of original style metadata

* Set default value for gguf add_tensor raw_shape KW arg

* llama : improve token type support (#2668)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* Improved tokenizer test

But does it work on MacOS?

* Improve token type support

- Added @klosax code to convert.py
- Improved token type support in vocabulary

* Exclude platform dependent tests

* More sentencepiece compatibility by eliminating magic numbers

* Restored accidentally removed comment

* llama : add API for token type

ggml-ci

* tests : use new tokenizer type API (#2692)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* Improved tokenizer test

But does it work on MacOS?

* Improve token type support

- Added @klosax code to convert.py
- Improved token type support in vocabulary

* Exclude platform dependent tests

* More sentencepiece compatibility by eliminating magic numbers

* Restored accidentally removed comment

* Improve commentary

* Use token type API in test-tokenizer-1.cpp

* py : cosmetics

* readme : add notice about new file format

ggml-ci

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: goerch <jhr.walter@t-online.de>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-08-21 23:07:43 +03:00
slaren 097e121e2f
llama : add benchmark example (#2626)
* llama : add benchmark example

* add to examples CMakeLists.txt

* fix msvc build

* add missing include

* add Bessel's correction to stdev calculation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve markdown formatting

* add missing include

* print warning is NDEBUG is not defined

* remove n_prompt and n_gen from the matrix, use each value separately instead

* better checks for non-optimized builds

* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call

* fix json formatting

* add sql output

* add basic cpu and gpu info (linx/cuda only)

* markdown: also show values that differ from the default

* markdown: add build id

* cleanup

* improve formatting

* formatting

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-18 12:44:58 +02:00
drbh 7cf54e1f74
tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0 terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
2023-08-17 10:41:01 +03:00
Shouzheng Liu bf83bff674
metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in previous commit.
2023-08-16 23:07:04 +03:00
drbh ee77efea2a
test : add simple grammar parsing tests (#2594)
* adds simple grammar parsing tests

* adds cassert header
2023-08-13 17:00:48 +03:00
byte-6174 b19edd54d5
Adding support for llama2.c models (#2559) 2023-08-12 01:17:25 +02:00
Johannes Gäßler 25d43e0eb5
CUDA: tuned mul_mat_q kernels (#2546) 2023-08-09 09:42:34 +02:00
Martin Krasser f5bfea0580
Allow passing grammar to completion endpoint (#2532)
* Allow passing grammar to completion endpoint
2023-08-08 16:29:19 +03:00
GiviMAD 34a14b28ff
[Makefile] Move ARM CFLAGS before compilation (#2536) 2023-08-07 09:21:46 +03:00
DannyDaemonic 3498588e0f
Add --simple-io option for subprocesses and break out console.h and cpp (#1558) 2023-08-04 08:20:12 -07:00
Eve 81844fbcfd
tests : Fix compilation warnings (Linux/GCC) (#2451)
* fix hellaswag print format, cast away warning in test-double-float

* c++11 cannot use designated initializers

* add static to test-grad0.c internal functions

* use memcpy in test-double-float.c

* port c tests to c++

* use initializer list for ggml_init_params
2023-08-02 11:06:19 +03:00
Johannes Gäßler 49e7cb5bb1
CUDA: fixed LLAMA_FAST compilation option (#2473) 2023-07-31 21:02:19 +02:00
Johannes Gäßler 0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues (#2453) 2023-07-31 15:44:35 +02:00
slaren a113689571
ggml : add graph tensor allocator (#2411)
* ggml : add graph tensor allocator

* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset

* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
2023-07-30 15:58:01 +02:00
Johannes Gäßler 11f3ca06b8
CUDA: Quantized matrix matrix multiplication (#2160)
* mmq implementation for non k-quants

* q6_K

* q2_K

* q3_k

* q4_K

* vdr

* q5_K

* faster q8_1 loading

* loop unrolling

* add __restrict__

* q2_K sc_high

* GGML_CUDA_MMQ_Y

* Updated Makefile

* Update Makefile

* DMMV_F16 -> F16

* Updated README, CMakeLists

* Fix CMakeLists.txt

* Fix CMakeLists.txt

* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
Cebtenzzre 6df1f5940f
make : build with -Wmissing-prototypes (#2394) 2023-07-26 21:00:04 +03:00
Aarni Koskela b3f138d058
Chat UI extras (#2366)
* makefile: correct deps for server

* server: tighten settings layout a little

* server: expose all currently configured generation params in UI

* server: expose remaining generation params, for the adventurous

* server: embetter mirostat fields
2023-07-24 17:54:22 +03:00
Evan Jones 84e09a7d8b
llama : add grammar-based sampling (#1773)
* llama, main : constrain sampling to grammar

* allow loading grammar from file

* fix whitespace errors

* handle & print parser errors

* add comments to grammar syntax and allow newlines where unambiguous

* add missing include

* support alternates in root rule

* fix bugs with empty token and EOS

* adjust JSON grammar

* remove swp file

* rewrite ternary expressions

Co-authored-by: Henri Vasserman <henv@hot.ee>

* use struct for grammar elements and add Unicode support

* add unicode escapes

* add inverse char ranges

* only sample full tokens (no peeking or truncation)

* llama : minor style changes

blindly applied in online editor - hopefully I didn't break something

* update help text

* add warning message if EOS is disabled

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 23:58:10 -04:00
Jose Maldonado 91171b8072
make : fix CLBLAST compile support in FreeBSD (#2331)
* Fix Makefile for CLBLAST compile support and instructions for compile llama.cpp FreeBSD

* More general use-case for CLBLAST support (Linux and FreeBSD)
2023-07-23 14:52:08 +03:00
Jose Maldonado 73643f5fb1
gitignore : changes for Poetry users + chat examples (#2284)
A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thx TheBloke for models)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:53:27 +03:00
Georgi Gerganov a814d04f81
make : fix indentation 2023-07-21 13:50:55 +03:00
Sky Yan 42c7c2e2e9
make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN (#2275)
Under certain environment, nvcc and gcc is installed under customized path but not standard path

Co-authored-by: Yan Lin <yanlin@baidu.com>
2023-07-21 13:38:57 +03:00
Jiří Podivín 54e3bc76fe
make : add new target for test binaries (#2244)
Programs in the tests directory are now build with target tests
and placed in the same location.

* clean target was expanded to remove new binaries

* test target binaries are listed in a variable

* Locations of binaries were added to the .gitignore

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:09:16 +03:00
Przemysław Pawełczyk 9cf022a188
make : fix embdinput library and server examples building on MSYS2 (#2235)
* make : fix embdinput library and server examples building on MSYS2

* cmake : fix server example building on MSYS2
2023-07-21 10:42:21 +03:00
wzy 7dabc66f3c
make : use pkg-config for OpenBLAS (#2222) 2023-07-14 22:05:08 +03:00
James Reynolds 229aab351c
make : fix combination of LLAMA_METAL and LLAMA_MPI (#2208)
Fixes https://github.com/ggerganov/llama.cpp/issues/2166 by moving commands after the CFLAGS are changed.
2023-07-14 20:34:40 +03:00
Evan Miller 5656d10599
mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10 18:49:56 +03:00
dylan 84525e7962
docker : add support for CUDA in docker (#1461)
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 21:25:25 +03:00
Johannes Gäßler 924dd22fd3
Quantized dot products for CUDA mul mat vec (#2067) 2023-07-05 14:19:42 +02:00
Henri Vasserman acc111caf9
Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
2023-07-04 15:38:04 +03:00
ZhouYuChen 23c7c6fc91
Update Makefile: clean simple (#2097) 2023-07-04 14:15:16 +02:00
ningshanwutuobang cfa0750bc9
llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add READMD for llava.py

* add READMD for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
Kawrakow 6769e944c7
k-quants : support for super-block size of 64 (#2001)
* k_quants: WIP super-blocks with 64 weights

* k_quants: WIP super-blocks with 64 weights

Q6_K scalar and AVX2 works

* k_quants: WIP super-blocks with 64 weights

Q4_K scalar and AVX2 works

* k_quants: WIP super-blocks with 64 weights

Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)

* k_quants: WIP super-blocks with 64 weights

Q3_K scalar and AVX2 works.

* k_quants: WIP super-blocks with 64 weights

Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar

* k_quants: WIP super-blocks with 64 weights

Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,

* k_quants: WIP super-blocks with 64 weights

Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.

* k_quants: WIP super-blocks with 64 weights

Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.

* k_quants: WIP super-blocks with 64 weights

Q3_K working on CUDA.

* k_quants: WIP super-blocks with 64 weights

Q5_K working on CUDA, and with this CUDA is done.

* k_quants: WIP super-blocks with 64 weights

Q6_K working on ARM_NEON

* k_quants: WIP super-blocks with 64 weights

Q4_K working on ARM_NEON, but quite a bit slower than 256 weights

* k_quants: WIP super-blocks with 64 weights

Q2_K working on ARM_NEON, but quite a bit slower than 256 weights

* k_quants: WIP super-blocks with 64 weights

Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.

* k_quants: WIP super-blocks with 64 weights

Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.

With that, we have full support for ARM_NEON, although
performance is not quite there.

* k_quants: WIP super-blocks with 64 weights

Slightly more efficient Q3_K and Q5_K

* k_quants: WIP super-blocks with 64 weights

Another small improvement for Q3_K and Q5_K on ARM_NEON

* k_quants: WIP super-blocks with 64 weights

Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.

* k_quants: WIP super-blocks with 64 weights

* We are able to pass preprocessor macros to the Metal
  compiler
* Q6_K works and is actually slightly more efficient than
  the QK_K = 256 version (25.2 ms vs 25.8 ms)

* k_quants: WIP super-blocks with 64 weights

Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).

* k_quants: WIP super-blocks with 64 weights

Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).

* k_quants: WIP super-blocks with 64 weights

Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).

* k_quants: WIP super-blocks with 64 weights

Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).

* k_quants: call them _K, not _k, also on Metal

* k_quants: correctly define QK_K in llama.cpp

* Fixed bug in q4_K quantization added with the 64-block addition

* Simplify via lambda

* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64

Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.

* k_quants: switch Q4_K to 4-bit scales when QK_K = 64

 Here the loss in accuracy is greater than for Q3_K,
 but the Q4_K points still move further to the left on
 the perplexity vs size curve.

* k_quants: forgot to add the Metal changes in last commit

* k_quants: change Q5_K to be type 0 when QK_K = 64

Still needs AVX2 implementation

* k_quants: AVX2 implementation for new 64-weight Q5_K

* k_quants: 10% faster ARM_NEON Q5_K dot product

* k_quants: fixed issue caused by merging with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26 19:43:07 +03:00
Johannes Gäßler 16b9cd1939
Convert vector to f16 for dequantize mul mat vec (#1913)
* Convert vector to f16 for dmmv

* compile option

* Added compilation option description to README

* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-19 10:23:56 +02:00
Georgi Gerganov ce2c7d72e2
metal : handle buffers larger than device's maxBufferLength (#1826)
* metal : handle buffers larger than device's maxBufferLength

* metal : print more verbose device info + handle errors

* metal : fix prints for overlapping views

* metal : minimize view overlap to try to utilize device memory better
2023-06-18 09:09:47 +03:00
Georgi Gerganov b2416493ab
make : do not print help for simple example 2023-06-17 20:55:03 +03:00
DaniAndTheWeb 86c7571864
make : update for latest Arch (#1701)
With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed.
2023-06-17 19:17:22 +03:00
Randall Fitzgerald 794db3e7b9
Server Example Refactor and Improvements (#1570)
A major rewrite for the server example.

Note that if you have built something on the previous server API, it will probably be incompatible.
Check out the examples for how a typical chat app could work.

This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing.

Summary of the changes:

- adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos
- applies missing top k sampler
- removes interactive mode/terminal-like behavior, removes exclude parameter
- moves threads and batch size to server command-line parameters
- adds LoRA loading and matches command line parameters with main example
- fixes stopping on EOS token and with the specified token amount with n_predict 
- adds server timeouts, host, and port settings
- adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text
- sets defaults for unspecified parameters between requests
- removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming
- adds CORS headers to responses
- adds request logging, exception printing and optional verbose logging
- adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string
- adds printing an error when it can't bind to the host/port specified
- fixes multi-byte character handling and replaces invalid UTF-8 characters on responses
- prints timing and build info on startup
- adds logit bias to request parameters
- removes embedding mode
- updates documentation; adds streaming Node.js and Bash examples
- fixes code formatting
- sets server threads to 1 since the current global state doesn't work well with simultaneous requests
- adds truncation of the input prompt and better context reset
- removes token limit from the input prompt
- significantly simplified the logic and removed a lot of variables

---------

Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Felix Hellmann <privat@cirk2.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>
2023-06-17 14:53:04 +03:00
SuperUserNameMan b41b4cad6f
examples : add "simple" (#1840)
* Create `simple.cpp`

* minimalist example `CMakeLists.txt`

* Update Makefile for minimalist example

* remove 273: Trailing whitespace

* removed trailing white spaces simple.cpp

* typo and comments simple.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-16 21:58:09 +03:00
Kawrakow 3d01122610
CUDA : faster k-quant dot kernels (#1862)
* cuda : faster k-quant dot kernels

* Imrove Q2_K dot kernel on older GPUs

We now have a K_QUANTS_PER_ITERATION macro, which should be
set to 1 on older and to 2 on newer GPUs.
With this, we preserve the performance of the original
PR on RTX-4080, and are faster compared to master on
GTX-1660.

* Imrove Q6_K dot kernel on older GPUs

Using the same K_QUANTS_PER_ITERATION macro as last commit,
we preserve performance on RTX-4080 and speed up
Q6_K on a GTX-1660.

* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

Allowed values are 1 or 2. 2 gives the best performance on
modern GPUs and is set as default. On older GPUs 1 may work
better.

* PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-16 20:08:44 +03:00
daboe01 cf267d1c71
make : add train-text-from-scratch (#1850)
* make finetuning example accessible

* fixed: targed was in wrong line

* fixed: name of executable was wrong

* fixed: naming of binary

* fixed: model path was wrong

* fixed clean target

* Update examples/train-text-from-scratch/README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-15 20:42:48 +03:00
sandyiscool 37e257c48e
make : clean *.so files (#1857) 2023-06-15 20:36:06 +03:00
Kerfuffle 74d4cfa343
Allow "quantizing" to f16 and f32 (#1787)
* Allow "quantizing" to f16 and f32

Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS

Add brief help to the list of quantization types in the quantize tool

Ignore case for quantization type arguments in the quantize tool
2023-06-13 04:23:23 -06:00
rankaiyx 555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Georgi Gerganov 5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default
2023-06-07 10:59:52 +03:00
Georgi Gerganov 2d43387daf
ggml : fix builds, add ggml-quants-k.o (close #1712, close #1710) 2023-06-06 10:18:03 +03:00
Kawrakow 99009e72f8
ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml

I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.

* Adding Q3_K and Q8_K (de)-quantization

* Q3_K now working on CUDA and AVX2/scalar

CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).

* Some improvement for Q3_K on CUDA

It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.

* Some more CUDA optimizations for Q3_K

Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.

* Adding Q4_K - scalar, AVX2, CUDA

Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity is about the same).

* Adding Q6_K - scalar, AVX2, CUDA

Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).

* Adding Q5_K - scalar, AVX2, CUDA

Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, single token prediction is about the same as Q4_0
for both, single token and batch prediction.

* Per convention, all QX_K quantizations use Q5_K for output.weight

* Adding quantization mixes

* Quantization mixes: didn't quite get what I wanted in the last commit

* Q4_K dot product for ARM_NEON

* Q6_K dot product for ARM_NEON

* Q5_K dot product for ARM_NEON

* Adding Q3_K dot for ARM_NEON

It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.

* A very slightly faster ARM_NEON Q3_K dot

* Adding Q2_K - just CUDA for now

Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.

* Adding scalar and AVX2 Q2_K dot

* Adding ARM_NEON Q2_K dot

About the same performance as Q4_K.

* A slightly faster ARM_NEON Q2_K dot

Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.

* Fixed bug in Q2_K CUDA dot product kernel

Stranegly enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonse back.

In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X
box iusing CUDA and model fully loaded on the GPU are
  ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for The 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).

* Don't print zeros/NaNs when no count histogram has been collected

* A 10% faster CUDA vector dot kernel for Q3_K

Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory acccess pattern is more important for
performance than the amount of computation the kernel
does.

* A slightly daster Q4_K AVX2 dot product

For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.

* A slightly faster ARM_NEON A4_K dot product

* Minor

* Fix quantization error test

We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.

* Fix docker build

I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.

* Added forgotten ggml.o dependence on k_quants.h to the Makefile

* Had unintentionally committed the Makefile with -Ofast enabled

* ggml : rename k_quants -> ggml-quants-k, use lowercase in code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 22:56:18 +03:00
Georgi Gerganov ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + suport scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
Johannes Gäßler 3b126f654f
LLAMA_DEBUG adds debug symbols (#1617) 2023-05-28 21:01:02 +02:00
Kerfuffle 0df7d63e5b
Include server in releases + other build system cleanups (#1610)
Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets build. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases.

Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default)

Fix issue where `vdot` binary wasn't removed when running `make clean`.

Fix compile warnings in `server` example.

Add `.hpp` files to trigger workflow (the server example has one).
2023-05-27 11:04:14 -06:00
Johannes Gäßler 1fcdcc28b1
cuda : performance optimizations (#1530)
* xor hack

* block y dim

* loop unrolling

* Fixed cmake LLAMA_CUDA_BY option

* Removed hipblas compatibility code

* Define GGML_CUDA_DMMV_BLOCK_Y if not defined

* Fewer iters, more ops per iter

* Renamed DMMV X/Y compilation options
2023-05-26 00:07:29 +03:00
0cc4m 2e6cd4b025
OpenCL Token Generation Acceleration (#1459)
* Move back to C++ for OpenCL

* Refactor OpenCL code to work more like the CUDA code, add missing functions

* Deduplicate dequant kernels

* Add OpenCL compile options

* Use compile args for preprocessing constants

* Restore default platform + device selection by id behavior

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-05-23 00:33:24 +03:00
Stefan Sydow 7780e4f479
make : .PHONY clean (#1553) 2023-05-21 17:03:44 +03:00
Zenix b8ee340abe
feature : support blis and other blas implementation (#1536)
* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 17:58:31 +03:00
Georgi Gerganov ea600071cb
Revert "feature : add blis and other BLAS implementation support (#1502)"
This reverts commit 07e9ace0f9.
2023-05-20 12:03:48 +03:00
Zenix 07e9ace0f9
feature : add blis and other BLAS implementation support (#1502)
* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-20 12:02:48 +03:00
sandyiscool 2a5ee023ad
Add alternate include path for openblas (#1476)
In some linux distributions (fedora, for example), the include path for openblas is located at '/usr/local/include'
2023-05-16 10:30:15 +02:00
Georgi Gerganov bda4d7c215 make : fix PERF build with cuBLAS 2023-05-13 17:25:09 +03:00
DaniAndTheWeb 173d0e6419
makefile: automatic Arch Linux detection (#1332)
This commit is a port of a detection method used in koboldcpp's Makefile in order to automatically set the -lcblas option on Arch Linux
2023-05-05 23:57:14 +02:00
Ionoclast Laboratories 2d13786e91
Fix for OpenCL / clbast builds on macOS. (#1329) 2023-05-05 14:18:21 +02:00
DannyDaemonic 55bc5f0900
Call sh on build-info.sh (#1294) 2023-05-02 17:52:35 -07:00
DannyDaemonic f4cef87edf
Add git-based build information for better issue tracking (#1232)
* Add git-based build information for better issue tracking

* macOS fix

* "build (hash)" and "CMAKE_SOURCE_DIR" changes

* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages

* Fix conditional dependency on missing target

* Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile

* 4 space indenting for cmake, attempt to clean up my mess in Makefile

* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it
2023-05-01 18:23:47 +02:00
Pavol Rusnak 6f79699286
build: add armv{6,7,8} support to cmake (#1251)
- flags copied from Makefile
- updated comments in both CMakeLists.txt and Makefile to match reality
2023-04-30 20:48:38 +02:00
Stephan Walter f0d70f147d
Various fixes to mat_mul benchmark (#1253) 2023-04-30 12:32:37 +00:00
Georgi Gerganov 214b6a3570
ggml : adjust mul_mat_f16 work memory (#1226)
* llama : minor - remove explicity int64_t cast

* ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS

* ggml : add asserts to guard for incorrect wsize
2023-04-29 18:43:28 +03:00
Georgi Gerganov 305eb5afd5
build : fix reference to old llama_util.h 2023-04-29 13:53:12 +03:00
slaren 7fc50c051a
cuBLAS: use host pinned memory and dequantize while copying (#1207)
* cuBLAS: dequantize simultaneously while copying memory

* cuBLAS: use host pinned memory

* cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory

* cuBLAS: also pin kv cache

* fix rebase
2023-04-29 02:04:18 +02:00
0cc4m 7296c961d9
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing

* Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers

* Finish merge of ClBlast support

* Move CLBlast implementation to separate file

Add buffer reuse code (adapted from slaren's cuda implementation)

* Add q4_2 and q4_3 CLBlast support, improve code

* Double CLBlast speed by disabling OpenBLAS thread workaround

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>

* Fix device selection env variable names

* Fix cast in opencl kernels

* Add CLBlast to CMakeLists.txt

* Replace buffer pool with static buffers a, b, qb, c

Fix compile warnings

* Fix typos, use GGML_TYPE defines, improve code

* Improve btype dequant kernel selection code, add error if type is unsupported

* Improve code quality

* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read

Co-authored-by: Henri Vasserman <henv@hot.ee>

* Use c compiler for opencl files

* Simplify code, fix include

* First check error, then release event

* Make globals static, fix indentation

* Rename dequant kernels file to conform with other file names

* Fix import cl file name

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28 17:57:16 +03:00
Johannes Gäßler 92a6e13a31
Add Manjaro CUDA include and lib dirs to Makefile (#1212) 2023-04-28 15:40:32 +02:00
slaren e4cf982e0d
Fix cuda compilation (#1128)
* Fix: Issue with CUBLAS compilation error due to missing -fPIC flag

---------

Co-authored-by: B1gM8c <89020353+B1gM8c@users.noreply.github.com>
2023-04-24 17:29:58 +02:00
Georgi Gerganov e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make" 2023-04-23 18:15:39 +03:00
Georgi Gerganov 872c365a91 ggml : fix AVX build + update to new Q8_0 format 2023-04-22 11:08:12 +03:00
slaren 50cb666b8a
Improve cuBLAS performance by using a memory pool (#1094)
* Improve cuBLAS performance by using a memory pool

* Move cuda specific definitions to ggml-cuda.h/cu

* Add CXX flags to nvcc

* Change memory pool synchronization mechanism to a spin lock
General code cleanup
2023-04-21 21:59:17 +02:00
slaren 2005469ea1
Add Q4_3 support to cuBLAS (#1086) 2023-04-20 20:49:53 +02:00
源文雨 5addcb120c
fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080) 2023-04-20 15:28:43 +02:00
slaren 02d6988121
Improve cuBLAS performance by dequantizing on the GPU (#1065) 2023-04-20 03:14:14 +02:00
Stephan Walter f3d4edf504
ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
2023-04-19 19:06:37 +03:00
slaren 8944a13296
Add NVIDIA cuBLAS support (#1044) 2023-04-19 11:22:45 +02:00
Kawrakow 5ecff35151
Adding a simple program to measure speed of dot products (#1041)
On my Mac, the direct Q4_1 product is marginally slower
(~69 vs ~55 us for Q4_0). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-04-18 19:00:14 +00:00
Georgi Gerganov e95b6554b4
ggml : add Q8_0 quantization for intermediate results (#951)
* ggml : add Q8_0 quantization for intermediate results

* quantize-stats : fix test + add it to Makefile default

* Q8: use int8_t, AVX/AVX2 optimizations

* ggml : fix quantize_row_q8_0() ARM_NEON rounding

* minor : updates after rebase to latest master

* quantize-stats : delete obsolete strings

* ggml : fix q4_1 dot func

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-15 17:53:22 +03:00
Stephan Walter 93265e988a
make : fix dependencies, use auto variables (#983) 2023-04-14 22:39:48 +03:00
Georgi Gerganov 9190e8eac8
llama : merge llama_internal.h into llama.h
Hide it behind an #ifdef
2023-04-13 18:04:45 +03:00
CRD716 8cda5c981d
fix whitespace (#944) 2023-04-13 16:03:57 +02:00
SebastianApel 95ea26f6e9
benchmark : add tool for timing q4_0 matrix multiplication (#653)
* Initial version of q4_0 matrix multiplication benchmark

* Bugfix: Added dependency to ggml.o to benchmark

* Reviewer requests: added parameter for threads, switched to ggml_time_us()

* Reviewer input: removed rtsc, use epsilon for check

* Review comment: Removed set_locale

* Feature: Param for numer of iterations, Bugfix for use of parameter threads

* Reviewer suggestion: Moved to examples

* Reviewer feedback: Updated clean: and benchmark: sections

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-13 15:46:23 +03:00