Commit graph

4 commits

Author SHA1 Message Date
Galunid f23c0359a3
ci : add flake8 to github actions (python linting) (#4129)
Disabled rules:

* E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned

* E211 Whitespace before '(' (E211) - disabled because we often use 'C' Style where values are aligned

* E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned

* E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard

* E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned

* E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned

* E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard

* E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard

* E266 Too many leading '#' for block comment - sometimes used as "section" separator

* E501 Line too long - disabled because it's broken so often it seems like a standard

* E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use# noqa instead)

* E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use# noqa instead)
2023-11-20 11:35:47 +01:00
Jiří Podivín f7d5e97542
py : remove superfluous import statements (#4076)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
2023-11-17 17:20:53 +02:00
goerch 233fc1c69f
Minor improvements in GPT2 tokenizer (#3567)
* Fixing minor bugs in bpe_gpt2_preprocess

* Don't add bos token in test
2023-10-10 18:59:52 +02:00
Georgi Gerganov edd4c14817
llama : more tokenizer fixes (#2810)
* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27 14:19:19 +03:00