llama.cpp/tests
Georgi Gerganov edd4c14817
llama : more tokenizer fixes (#2810)
* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27 14:19:19 +03:00
CMakeLists.txt               llama : more tokenizer fixes (#2810)                              2023-08-27 14:19:19 +03:00
test-double-float.cpp        tests : Fix compilation warnings (Linux/GCC) (#2451)              2023-08-02 11:06:19 +03:00
test-grad0.cpp               tests : Fix compilation warnings (Linux/GCC) (#2451)              2023-08-02 11:06:19 +03:00
test-grammar-parser.cpp      gguf : new file format with flexible meta data (beta) (#2398)     2023-08-21 23:07:43 +03:00
test-llama-grammar.cpp       gguf : new file format with flexible meta data (beta) (#2398)     2023-08-21 23:07:43 +03:00
test-opt.cpp                 tests : Fix compilation warnings (Linux/GCC) (#2451)              2023-08-02 11:06:19 +03:00
test-quantize-fns.cpp        ggml : generalize quantize_fns for simpler FP16 handling (#1237)  2023-07-05 19:13:06 +03:00
test-quantize-perf.cpp       ggml : generalize quantize_fns for simpler FP16 handling (#1237)  2023-07-05 19:13:06 +03:00
test-sampling.cpp            ci : integrate with ggml-org/ci (#2250)                           2023-07-18 14:24:43 +03:00
test-tokenizer-0-falcon.cpp  llama : more tokenizer fixes (#2810)                              2023-08-27 14:19:19 +03:00
test-tokenizer-0-falcon.py   llama : more tokenizer fixes (#2810)                              2023-08-27 14:19:19 +03:00
test-tokenizer-0-llama.cpp   llama : more tokenizer fixes (#2810)                              2023-08-27 14:19:19 +03:00
test-tokenizer-0-llama.py    llama : more tokenizer fixes (#2810)                              2023-08-27 14:19:19 +03:00
test-tokenizer-1.cpp         llama : more tokenizer fixes (#2810)                              2023-08-27 14:19:19 +03:00