llama.cpp/examples/server
compilade c2101a2e90
llama : support Mamba Selective State Space Models (#5328)
* mamba : begin working on support for Mamba SSM

* mamba : begin figuring out how to (ab)use the kv cache for Mamba

* mamba : recurrent inference almost works, but incoherent

* mamba : recurrent inference WORKS!!!

* convert : optionally use d_conv and d_state from config.json for Mamba

* mamba : refactor recurrent conv, resulting in 20% perf increase

It's still slower than I'd like, but I did not really optimize `ggml_exp` yet.

I also refactored `ggml_exp` to work with tensors with more than 2 dimensions.

* ggml : parallelize ggml_exp

This results in 8% faster token generation for Mamba-130M.

* mamba : simplify the conv step with a self-overlapping view

Turns out the conv_state can be made smaller by one column.
Note that this breaks existing GGUFs of Mamba,
because the key_value_length field is tied to the conv_state size.

Convolution with a self-overlapping view is cool!
And it's much simpler than what I initially thought would be necessary
to make the convolution step work with more than 1 token at a time.

Next step is to make the SSM step work on batches of tokens too,
and thus I need to figure out a way to make a parallel selective scan
which will keep the ssm_state small and won't make it bigger
by a factor of (n_layer * batch_size).

* llama : fix Mamba KV self size wrongly displaying as f16 instead of f32

Relatedly, I also tried to see if other types than f32 worked for the states,
but they don't, because of the operators used.
It's probably better anyway to keep lots of precision there,
since the states are small anyway.

* mamba : fix self-overlapping view depth stride

* mamba : handle batches of more than 1 token

This means running Mamba no longer crashes when using the default settings!
And probably also slightly faster prompt processing.
Both batched and non-batched processing yield the same output.

Previously, the state was not cleared when starting a sequence.
Next step is to make the KV cache API work as expected for Mamba models.

* ggml: add ggml_ssm_scan to help with parallel selective scan

If the selective scan was implemented without a custom operator,
there would be waaay too many nodes in the graph. For example,
for Mamba-130M, with a batch size of 512 (the default),
a naive selective scan could add at least 24*512=12288 nodes,
which is more than LLAMA_MAX_NODES (8192),
and that's only for the smallest Mamba model.
So it's much cleaner with a custom operator.
Not sure about the name, though.

* ggml : in ggml_ssm_scan, merge multiple rows in the same vec operation

This will help with performance on CPU if ggml_vec_mul_f32
and ggml_vec_add_f32 are ever optimized with SIMD.

* mamba : very basic quantization support

Mostly works, but there is currently no difference
between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).
Most of the SSM-specific weights can be kept in f32 without affecting
the size that much, since they are relatively small.
(the linear projection weights are responsible for most of Mamba's size)

Too much quantization seems to make the state degrade quite fast, and
the model begins to output gibberish.
It seems to affect bigger models to a lesser extent than small models,
but I'm not sure by how much.

Experimentation will be needed to figure out which weights are more important
for the _M (and _L?) variants of k-quants for Mamba.

* convert : fix wrong name for layer norm weight of offical Mamba models

I was using Q-bert/Mamba-* models before, which have a slighlty different
naming scheme for the weights.
(they start with "model.layers" instead of "backbone.layers")

* mamba : fuse more steps of the SSM scan in the ggml_ssm_scan operator

This increases performance on CPU by around 30% for prompt processing,
and by around 20% for text generation.

However, it also makes the ggml_exp and ggml_soft_plus operators unused.
Whether or not they should be kept will be decided later.

* convert : for Mamba, also consider the "MambaLMHeadModel" arch name

It's the name of the class of the official implementation,
though they don't use it (yet) in the "architectures" field of config.json

* mamba : fix vocab size problems with official models

The perplexity was waaaay to high for models with a non-round vocab size.
Not sure why, but it needed to be fixed in the metadata.

Note that this breaks existing GGUF-converted Mamba models,
but **only if** the vocab size was not already rounded.

* ggml : remove ggml_exp and ggml_soft_plus

They did not exist anyway outside of this branch,
and since ggml_ssm_scan fused operations together, they are unused.
It's always possible to bring them back if needed.

* mamba : remove some useless comments

No code change.

* convert : fix flake8 linter errors

* mamba : apply suggestions from code review

* mamba : remove unecessary branch for row-wise ssm_state and C multiplication

It was previously done to avoid permuting when only one token is processed
at a time (like when generating text), but permuting is cheap,
and dynamically changing the compute graph is not future-proof.

* ggml : in ggml_ssm_scan, use more appropriate asserts

* ggml : rename the destination pointer in ggml_compute_forward_ssm_scan_f32

* mamba : multiple sequences, but one at a time

This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).

The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this which won't also
make the main example use 2 cells even if only 1 is really used.
(for this specific case, --parallel 0 helps)

Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.

* mamba : support llama_kv_cache_seq_cp

This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.

Each KV cell is dedicated to the sequence ID corresponding to its own index.

* mamba : use a state mask

It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.

inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).

* llama : replace the usage of n_ctx with kv_self.size in many places

* mamba : use n_tokens directly instead of n_tok

* mamba : in comments, properly refer to KV cells instead of slots

* mamba : reduce memory usage of ggml_ssm_scan

From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.

The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.

* mamba : simultaneous sequence processing

A batch can now contain tokens from multiple sequences.

This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.

However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.

* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba

This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).

Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.

Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.

* llama : add inp_s_seq as a new input tensor

The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.

The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.

Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).

* mamba : support llama_kv_cache_seq_cp copy chains

* mamba : support shifting and dividing the kv cache pos

* mamba : make the server and parallel examples work with whole sequences

A seq_id is dedicated to the system prompt in both cases.

* llama : make llama_kv_cache_seq_rm return whether it succeeded or not

* mamba : dedicate an input tensor for state copy indices

This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.

* mamba : adapt perplexity, batched, and batched-bench examples

* perplexity : limit the max number of sequences

This adapts to what the loaded model can provide.

* llama : add llama_n_max_seq to get the upper limit for seq_ids

Used by the perplexity example.

* batched : pass n_parallel to the model's context params

This should have been there already, but it wasn't.

* batched-bench : reserve sequences to support Mamba

* batched-bench : fix tokens being put in wrong sequences

Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.

* mamba : stop abusing attention metadata

This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.

This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
 will not require breaking existing converted Mamba models again)

* gguf-py : add new KV metadata key-value pairs for Mamba

* llama : add new metadata key-value pairs for Mamba

* llama : guard against divisions by zero when n_head is 0

* mamba : rename "unlimited" KV cache property to "recurrent"

* mamba : more correctly update the "used" field of the KV cache

* ggml : in ggml_ssm_scan, use a threshold for soft_plus

This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.

* convert : for Mamba, fallback to internal NeoX tokenizer

The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.

* mamba : support state saving and restoring

* ggml : implicitly pass src tensors through dst for Mamba-related ops

* mamba : clarify some comments

* server : fix cache_tokens not getting correctly resized

Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.

For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.

* convert-hf : support new metadata keys for Mamba

For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406

* mamba : rename metadata to be more similar to transformers library

This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".

* mamba : support mamba-*-hf models

These models share their token_embd.weight with their output.weight

* mamba : add missing spaces

This is purely a formatting change.

* convert-hf : omit output.weight when identical with token_embd.weight

Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.

* readme : add Mamba to supported models, and add recent API changes

* mamba : move state_seq and state_mask views outside layer loop

A few tensors were also missing `struct` in front of `ggml_tensor`.
2024-03-08 17:31:00 -05:00
..
public common, server : surface min_keep as its own parameter (#5567) 2024-02-18 21:11:16 +02:00
tests server: metrics: add llamacpp:prompt_seconds_total and llamacpp:tokens_predicted_seconds_total, reset bucket only on /metrics. Fix values cast to int. Add Process-Start-Time-Unix header. (#5937) 2024-03-08 12:25:04 +01:00
chat-llama2.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
chat.mjs server : parallel decoding and multimodal (#3677) 2023-10-22 22:53:08 +03:00
chat.sh server : fix context shift (#5195) 2024-01-30 20:17:30 +02:00
CMakeLists.txt server : refactor (#5882) 2024-03-07 11:41:53 +02:00
completion.js.hpp server : remove model.json endpoint (#5371) 2024-02-06 20:08:38 +02:00
deps.sh Revert "server : change deps.sh xxd files to string literals (#5221)" 2024-01-30 21:19:26 +02:00
httplib.h examples : add server example with REST API (#1443) 2023-05-21 20:51:18 +03:00
index.html.hpp Revert "server : change deps.sh xxd files to string literals (#5221)" 2024-01-30 21:19:26 +02:00
index.js.hpp Revert "server : change deps.sh xxd files to string literals (#5221)" 2024-01-30 21:19:26 +02:00
json-schema-to-grammar.mjs.hpp Revert "server : change deps.sh xxd files to string literals (#5221)" 2024-01-30 21:19:26 +02:00
json.hpp english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
README.md server : refactor (#5882) 2024-03-07 11:41:53 +02:00
server.cpp llama : support Mamba Selective State Space Models (#5328) 2024-03-08 17:31:00 -05:00
utils.hpp server : refactor (#5882) 2024-03-07 11:41:53 +02:00

LLaMA.cpp HTTP Server

Fast, lightweight, pure C/C++ HTTP server based on httplib, nlohmann::json and llama.cpp.

Set of LLM REST APIs and a simple web front end to interact with llama.cpp.

Features:

  • LLM inference of F16 and quantum models on GPU and CPU
  • OpenAI API compatible chat completions and embeddings routes
  • Parallel decoding with multi-user support
  • Continuous batching
  • Multimodal (wip)
  • Monitoring endpoints

The project is under active development, and we are looking for feedback and contributors.

Command line options:

  • --threads N, -t N: Set the number of threads to use during generation.

  • -tb N, --threads-batch N: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.

  • --threads-http N: number of threads in the http server pool to process requests (default: max(std::thread::hardware_concurrency() - 1, --parallel N + 2))

  • -m FNAME, --model FNAME: Specify the path to the LLaMA model file (e.g., models/7B/ggml-model.gguf).

  • -a ALIAS, --alias ALIAS: Set an alias for the model. The alias will be returned in API responses.

  • -c N, --ctx-size N: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models, for example, baichuan models were build with a context of 4096.

  • -ngl N, --n-gpu-layers N: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.

  • -mg i, --main-gpu i: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.

  • -ts SPLIT, --tensor-split SPLIT: When using multiple GPUs this option controls how large tensors should be split across all GPUs. SPLIT is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.

  • -b N, --batch-size N: Set the batch size for prompt processing. Default: 512.

  • --memory-f32: Use 32-bit floats instead of 16-bit floats for memory key+value. Not recommended.

  • --mlock: Lock the model in memory, preventing it from being swapped out when memory-mapped.

  • --no-mmap: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed.

  • --numa STRATEGY: Attempt one of the below optimization strategies that help on some NUMA systems

  • --numa distribute: Spread execution evenly over all nodes

  • --numa isolate: Only spawn threads on CPUs on the node that execution started on

  • --numa numactl: Use the CPU map provided by numactl if run without this previously, it is recommended to drop the system page cache before using this see https://github.com/ggerganov/llama.cpp/issues/1437

  • --numa: Attempt optimizations that help on some NUMA systems.

  • --lora FNAME: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.

  • --lora-base FNAME: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the --lora flag, and specifies the base model for the adaptation.

  • -to N, --timeout N: Server read/write timeout in seconds. Default 600.

  • --host: Set the hostname or ip address to listen. Default 127.0.0.1.

  • --port: Set the port to listen. Default: 8080.

  • --path: path from which to serve static files (default examples/server/public)

  • --api-key: Set an api key for request authorization. By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token. May be used multiple times to enable multiple valid keys.

  • --api-key-file: path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access. May be used in conjunction with --api-key's.

  • --embedding: Enable embedding extraction, Default: disabled.

  • -np N, --parallel N: Set the number of slots for process requests (default: 1)

  • -cb, --cont-batching: enable continuous batching (a.k.a dynamic batching) (default: disabled)

  • -spf FNAME, --system-prompt-file FNAME Set a file to load "a system prompt (initial prompt of all slots), this is useful for chat applications. See more

  • --mmproj MMPROJ_FILE: Path to a multimodal projector file for LLaVA.

  • --grp-attn-n: Set the group attention factor to extend context size through self-extend(default: 1=disabled), used together with group attention width --grp-attn-w

  • --grp-attn-w: Set the group attention width to extend context size through self-extend(default: 512), used together with group attention factor --grp-attn-n

  • -n N, --n-predict N: Set the maximum tokens to predict (default: -1)

  • --slots-endpoint-disable: To disable slots state monitoring endpoint. Slots state may contain user data, prompts included.

  • --metrics: enable prometheus /metrics compatible endpoint (default: disabled)

  • --chat-template JINJA_TEMPLATE: Set custom jinja chat template. This parameter accepts a string, not a file name (default: template taken from model's metadata). We only support some pre-defined templates

  • --log-disable: Output logs to stdout only, default: enabled.

  • --log-format FORMAT: Define the log output to FORMAT: json or text (default: json)

Build

server is build alongside everything else from the root of the project

  • Using make:

    make
    
  • Using CMake:

    cmake --build . --config Release
    

Quick Start

To get started right away, run the following command, making sure to use the correct path for the model you have:

Unix-based systems (Linux, macOS, etc.)

./server -m models/7B/ggml-model.gguf -c 2048

Windows

server.exe -m models\7B\ggml-model.gguf -c 2048

The above command will start a server that by default listens on 127.0.0.1:8080. You can consume the endpoints with Postman or NodeJS with axios library. You can visit the web front end at the same url.

Docker

docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080

# or, with CUDA:
docker run -p 8080:8080 -v /path/to/models:/models --gpus all ggerganov/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99

Testing with CURL

Using curl. On Windows curl.exe should be available in the base OS.

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

Advanced testing

We implemented a server test framework using human-readable scenario.

Before submitting an issue, please try to reproduce it with this format.

Node JS Test

You need to have Node.js installed.

mkdir llama-client
cd llama-client

Create a index.js file and put inside this:

const prompt = `Building a website can be done in 10 simple steps:`;

async function Test() {
    let response = await fetch("http://127.0.0.1:8080/completion", {
        method: 'POST',
        body: JSON.stringify({
            prompt,
            n_predict: 512,
        })
    })
    console.log((await response.json()).content)
}

Test()

And run it:

node index.js

API Endpoints

  • GET /health: Returns the current state of the server:

    • 503 -> {"status": "loading model"} if the model is still being loaded.
    • 500 -> {"status": "error"} if the model failed to load.
    • 200 -> {"status": "ok", "slots_idle": 1, "slots_processing": 2 } if the model is successfully loaded and the server is ready for further requests mentioned below.
    • 200 -> {"status": "no slot available", "slots_idle": 0, "slots_processing": 32} if no slot are currently available.
    • 503 -> {"status": "no slot available", "slots_idle": 0, "slots_processing": 32} if the query parameter fail_on_no_slot is provided and no slot are currently available.

    If the query parameter include_slots is passed, slots field will contain internal slots data except if --slots-endpoint-disable is set.

  • POST /completion: Given a prompt, it returns the predicted completion.

    Options:

    prompt: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, the prompt is compared to the previous completion and only the "unseen" suffix is evaluated. If the prompt is a string or an array with the first element given as a string, a bos token is inserted in the front like main does.

    temperature: Adjust the randomness of the generated text (default: 0.8).

    dynatemp_range: Dynamic temperature range. The final temperature will be in the range of [temperature - dynatemp_range; temperature + dynatemp_range] (default: 0.0, 0.0 = disabled).

    dynatemp_exponent: Dynamic temperature exponent (default: 1.0).

    top_k: Limit the next token selection to the K most probable tokens (default: 40).

    top_p: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

    min_p: The minimum probability for a token to be considered, relative to the probability of the most likely token (default: 0.05).

    n_predict: Set the maximum number of tokens to predict when generating text. Note: May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).

    n_keep: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the prompt.

    stream: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to true.

    stop: Specify a JSON array of stopping strings. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration (default: []).

    tfs_z: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).

    typical_p: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).

    repeat_penalty: Control the repetition of token sequences in the generated text (default: 1.1).

    repeat_last_n: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).

    penalize_nl: Penalize newline tokens when applying the repeat penalty (default: true).

    presence_penalty: Repeat alpha presence penalty (default: 0.0, 0.0 = disabled).

    frequency_penalty: Repeat alpha frequency penalty (default: 0.0, 0.0 = disabled);

    penalty_prompt: This will replace the prompt for the purpose of the penalty evaluation. Can be either null, a string or an array of numbers representing tokens (default: null = use the original prompt).

    mirostat: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).

    mirostat_tau: Set the Mirostat target entropy, parameter tau (default: 5.0).

    mirostat_eta: Set the Mirostat learning rate, parameter eta (default: 0.1).

    grammar: Set grammar for grammar-based sampling (default: no grammar)

    seed: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

    ignore_eos: Ignore end of stream token and continue generating (default: false).

    logit_bias: Modify the likelihood of a token appearing in the generated text completion. For example, use "logit_bias": [[15043,1.0]] to increase the likelihood of the token 'Hello', or "logit_bias": [[15043,-1.0]] to decrease its likelihood. Setting the value to false, "logit_bias": [[15043,false]] ensures that the token Hello is never produced. The tokens can also be represented as strings, e.g. [["Hello, World!",-0.5]] will reduce the likelihood of all the individual tokens that represent the string Hello, World!, just like the presence_penalty does. (default: []).

    n_probs: If greater than 0, the response also contains the probabilities of top N tokens for each generated token (default: 0)

    min_keep: If greater than 0, force samplers to return N possible tokens at minimum (default: 0)

    image_data: An array of objects to hold base64-encoded image data and its ids to be reference in prompt. You can determine the place of the image in the prompt as in the following: USER:[img-12]Describe the image in detail.\nASSISTANT:. In this case, [img-12] will be replaced by the embeddings of the image with id 12 in the following image_data array: {..., "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}. Use image_data only with multimodal models, e.g., LLaVA.

    slot_id: Assign the completion task to an specific slot. If is -1 the task will be assigned to a Idle slot (default: -1)

    cache_prompt: Re-use previously cached prompt from the last request if possible. This may prevent re-caching the prompt from scratch. (default: false)

    system_prompt: Change the system prompt (initial prompt of all slots), this is useful for chat applications. See more

    samplers: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. (default: ["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"] - these are all the available values)

Result JSON

  • Note: When using streaming mode (stream) only content and stop will be returned until end of completion.

  • completion_probabilities: An array of token probabilities for each completion. The array's length is n_predict. Each item in the array has the following structure:

{
  "content": "<the token selected by the model>",
  "probs": [
    {
      "prob": float,
      "tok_str": "<most likely token>"
    },
    {
      "prob": float,
      "tok_str": "<second most likely tonen>"
    },
    ...
  ]
},

Notice that each probs is an array of length n_probs.

  • content: Completion result as a string (excluding stopping_word if any). In case of streaming mode, will contain the next token as a string.

  • stop: Boolean for use with stream to check whether the generation has stopped (Note: This is not related to stopping words array stop from input options)

  • generation_settings: The provided options above excluding prompt but including n_ctx, model. These options may differ from the original ones in some way (e.g. bad values filtered out, strings converted to tokens, etc.).

  • model: The path to the model loaded with -m

  • prompt: The provided prompt

  • stopped_eos: Indicating whether the completion has stopped because it encountered the EOS token

  • stopped_limit: Indicating whether the completion stopped because n_predict tokens were generated before stop words or EOS was encountered

  • stopped_word: Indicating whether the completion stopped due to encountering a stopping word from stop JSON array provided

  • stopping_word: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)

  • timings: Hash of timing information about the completion such as the number of tokens predicted_per_second

  • tokens_cached: Number of tokens from the prompt which could be re-used from previous completion (n_past)

  • tokens_evaluated: Number of tokens evaluated in total from the prompt

  • truncated: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (tokens_evaluated) plus tokens generated (tokens predicted) exceeded the context size (n_ctx)

  • POST /tokenize: Tokenize a given text.

    Options:

    content: Set the text to tokenize.

    Note that the special BOS token is not added in front of the text and also a space character is not inserted automatically as it is for /completion.

  • POST /detokenize: Convert tokens to text.

    Options:

    tokens: Set the tokens to detokenize.

  • POST /embedding: Generate embedding of a given text just as the embedding example does.

    Options:

    content: Set the text to process.

    image_data: An array of objects to hold base64-encoded image data and its ids to be reference in content. You can determine the place of the image in the content as in the following: Image: [img-21].\nCaption: This is a picture of a house. In this case, [img-21] will be replaced by the embeddings of the image with id 21 in the following image_data array: {..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}. Use image_data only with multimodal models, e.g., LLaVA.

  • POST /infill: For code infilling. Takes a prefix and a suffix and returns the predicted completion as stream.

    Options:

    input_prefix: Set the prefix of the code to infill.

    input_suffix: Set the suffix of the code to infill.

    It also accepts all the options of /completion except stream and prompt.

  • GET /props: Return current server settings.

Result JSON

{
  "assistant_name": "",
  "user_name": "",
  "default_generation_settings": { ... },
  "total_slots": 1
}
  • assistant_name - the required assistant name to generate the prompt in case you have specified a system prompt for all slots.

  • user_name - the required anti-prompt to generate the prompt in case you have specified a system prompt for all slots.

  • default_generation_settings - the default generation settings for the /completion endpoint, has the same fields as the generation_settings response object from the /completion endpoint.

  • total_slots - the total number of slots for process requests (defined by --parallel option)

  • POST /v1/chat/completions: OpenAI-compatible Chat Completions API. Given a ChatML-formatted json description in messages, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only ChatML-tuned models, such as Dolphin, OpenOrca, OpenHermes, OpenChat-3.5, etc can be used with this endpoint.

    Options:

    See OpenAI Chat Completions API documentation. While some OpenAI-specific features such as function calling aren't supported, llama.cpp /completion-specific features such are mirostat are supported.

    Examples:

    You can use either Python openai library with appropriate checkpoints:

    import openai
    
    client = openai.OpenAI(
        base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
        api_key = "sk-no-key-required"
    )
    
    completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
    )
    
    print(completion.choices[0].message)
    

    ... or raw HTTP requests:

    curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
    {
        "role": "system",
        "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
    },
    {
        "role": "user",
        "content": "Write a limerick about python exceptions"
    }
    ]
    }'
    
  • POST /v1/embeddings: OpenAI-compatible embeddings API.

    Options:

    See OpenAI Embeddings API documentation.

    Examples:

    • input as string

      curl http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer no-key" \
      -d '{
              "input": "hello",
              "model":"GPT-4",
              "encoding_format": "float"
      }'
      
    • input as string array

      curl http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer no-key" \
      -d '{
              "input": ["hello", "world"],
              "model":"GPT-4",
              "encoding_format": "float"
      }'
      
  • GET /slots: Returns the current slots processing state. Can be disabled with --slots-endpoint-disable.

Result JSON

[
    {
        "dynatemp_exponent": 1.0,
        "dynatemp_range": 0.0,
        "frequency_penalty": 0.0,
        "grammar": "",
        "id": 0,
        "ignore_eos": false,
        "logit_bias": [],
        "min_p": 0.05000000074505806,
        "mirostat": 0,
        "mirostat_eta": 0.10000000149011612,
        "mirostat_tau": 5.0,
        "model": "llama-2-7b-32k-instruct.Q2_K.gguf",
        "n_ctx": 2048,
        "n_keep": 0,
        "n_predict": 100000,
        "n_probs": 0,
        "next_token": {
            "has_next_token": true,
            "n_remain": -1,
            "n_decoded": 0,
            "stopped_eos": false,
            "stopped_limit": false,
            "stopped_word": false,
            "stopping_word": ""
        },
        "penalize_nl": true,
        "penalty_prompt_tokens": [],
        "presence_penalty": 0.0,
        "prompt": "Say hello to llama.cpp",
        "repeat_last_n": 64,
        "repeat_penalty": 1.100000023841858,
        "samplers": [
            "top_k",
            "tfs_z",
            "typical_p",
            "top_p",
            "min_p",
            "temperature"
        ],
        "seed": 42,
        "state": 1,
        "stop": [
            "\n"
        ],
        "stream": false,
        "task_id": 0,
        "temperature": 0.0,
        "tfs_z": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "typical_p": 1.0,
        "use_penalty_prompt_tokens": false
    }
]
  • GET /metrics: Prometheus compatible metrics exporter endpoint if --metrics is enabled:

Available metrics:

  • llamacpp:prompt_tokens_total: Number of prompt tokens processed.
  • llamacpp:tokens_predicted_total: Number of generation tokens processed.
  • llamacpp:prompt_tokens_seconds: Average prompt throughput in tokens/s.
  • llamacpp:predicted_tokens_seconds: Average generation throughput in tokens/s.
  • llamacpp:kv_cache_usage_ratio: KV-cache usage. 1 means 100 percent usage.
  • llamacpp:kv_cache_tokens: KV-cache tokens.
  • llamacpp:requests_processing: Number of request processing.
  • llamacpp:requests_deferred: Number of request deferred.

More examples

Change system prompt on runtime

To use the server example to serve multiple chat-type clients while keeping the same system prompt, you can utilize the option system_prompt to achieve that. This only needs to be done once to establish it.

prompt: Specify a context that you want all connecting clients to respect.

anti_prompt: Specify the word you want to use to instruct the model to stop. This must be sent to each client through the /props endpoint.

assistant_name: The bot's name is necessary for each customer to generate the prompt. This must be sent to each client through the /props endpoint.

{
    "system_prompt": {
        "prompt": "Transcript of a never ending dialog, where the User interacts with an Assistant.\nThe Assistant is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.\nUser: Recommend a nice restaurant in the area.\nAssistant: I recommend the restaurant \"The Golden Duck\". It is a 5 star restaurant with a great view of the city. The food is delicious and the service is excellent. The prices are reasonable and the portions are generous. The restaurant is located at 123 Main Street, New York, NY 10001. The phone number is (212) 555-1234. The hours are Monday through Friday from 11:00 am to 10:00 pm. The restaurant is closed on Saturdays and Sundays.\nUser: Who is Richard Feynman?\nAssistant: Richard Feynman was an American physicist who is best known for his work in quantum mechanics and particle physics. He was awarded the Nobel Prize in Physics in 1965 for his contributions to the development of quantum electrodynamics. He was a popular lecturer and author, and he wrote several books, including \"Surely You're Joking, Mr. Feynman!\" and \"What Do You Care What Other People Think?\".\nUser:",
        "anti_prompt": "User:",
        "assistant_name": "Assistant:"
    }
}

NOTE: You can do this automatically when starting the server by simply creating a .json file with these options and using the CLI option -spf FNAME or --system-prompt-file FNAME.

Interactive mode

Check the sample in chat.mjs. Run with NodeJS version 16 or later:

node chat.mjs

Another sample in chat.sh. Requires bash, curl and jq. Run with bash:

bash chat.sh

API like OAI

The HTTP server supports OAI-like API

Extending or building alternative Web Front End

The default location for the static files is examples/server/public. You can extend the front end by running the server binary with --path set to ./your-directory and importing /completion.js to get access to the llamaComplete() method.

Read the documentation in /completion.js to see convenient ways to access llama.

A simple example is below:

<html>
  <body>
    <pre>
      <script type="module">
        import { llama } from '/completion.js'

        const prompt = `### Instruction:
Write dad jokes, each one paragraph.
You can use html formatting if needed.

### Response:`

        for await (const chunk of llama(prompt)) {
          document.write(chunk.data.content)
        }
      </script>
    </pre>
  </body>
</html>