llama.cpp/tests
goerch ff5a3f0c09
Work on the BPE tokenizer (#3252)
* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full Unicode range

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code paths

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-03 09:16:26 +02:00
File | Last commit | Date
CMakeLists.txt | Work on the BPE tokenizer (#3252) | 2023-10-03 09:16:26 +02:00
test-c.c | tests : add a C compliance test (#2848) | 2023-08-30 09:20:26 +03:00
test-double-float.cpp | tests : Fix compilation warnings (Linux/GCC) (#2451) | 2023-08-02 11:06:19 +03:00
test-grad0.cpp | build : enable more non-default compiler warnings (#3200) | 2023-09-28 17:41:44 -04:00
test-grammar-parser.cpp | gguf : new file format with flexible meta data (beta) (#2398) | 2023-08-21 23:07:43 +03:00
test-llama-grammar.cpp | gguf : new file format with flexible meta data (beta) (#2398) | 2023-08-21 23:07:43 +03:00
test-opt.cpp | build : enable more non-default compiler warnings (#3200) | 2023-09-28 17:41:44 -04:00
test-quantize-fns.cpp | check C++ code with -Wmissing-declarations (#3184) | 2023-09-15 15:38:27 -04:00
test-quantize-perf.cpp | check C++ code with -Wmissing-declarations (#3184) | 2023-09-15 15:38:27 -04:00
test-rope.cpp | llama : custom attention mask + parallel decoding + no context swaps (#3228) | 2023-09-28 19:04:36 +03:00
test-sampling.cpp | check C++ code with -Wmissing-declarations (#3184) | 2023-09-15 15:38:27 -04:00
test-tokenizer-0-falcon.cpp | Work on the BPE tokenizer (#3252) | 2023-10-03 09:16:26 +02:00
test-tokenizer-0-falcon.py | llama : more tokenizer fixes (#2810) | 2023-08-27 14:19:19 +03:00
test-tokenizer-0-llama.cpp | llama.cpp : split llama_context_params into model and context params (#3301) | 2023-09-28 22:42:38 +03:00
test-tokenizer-0-llama.py | llama : more tokenizer fixes (#2810) | 2023-08-27 14:19:19 +03:00
test-tokenizer-1-bpe.cpp | Work on the BPE tokenizer (#3252) | 2023-10-03 09:16:26 +02:00
test-tokenizer-1-llama.cpp | Work on the BPE tokenizer (#3252) | 2023-10-03 09:16:26 +02:00