llama.cpp/examples/embedding/embedding.cpp

#include "build-info.h"
#include "common.h"
#include "llama.h"

#include <algorithm>
#include <cstdio>
#include <ctime>
#include <random>
#include <thread>
#include <tuple>

#if defined(_MSC_VER)
#pragma warning(disable: 4244 4267) // possible loss of data
#endif

int main(int argc, char ** argv) {
    gpt_params params;

    if (!gpt_params_parse(argc, argv, params)) {
        return 1;
    }

    // this example always runs in embedding mode, regardless of the command-line flags
    params.embedding = true;

    print_build_info();

    // no explicit seed was given: fall back to a time-based seed
    if (params.seed == LLAMA_DEFAULT_SEED) {
        params.seed = time(NULL);
    }

    fprintf(stderr, "%s: seed  = %u\n", __func__, params.seed);

    std::mt19937 rng(params.seed);
    if (params.random_prompt) {
        params.prompt = gpt_random_prompt(rng);
    }

    llama_backend_init(params.numa);

    llama_model * model;
    llama_context * ctx;

    // load the model
    std::tie(model, ctx) = llama_init_from_gpt_params(params);
    if (model == NULL) {
        fprintf(stderr, "%s: error: unable to load model\n", __func__);
        return 1;
    }

    const int n_ctx_train = llama_n_ctx_train(ctx);
    if (params.n_ctx > n_ctx_train) {
        fprintf(stderr, "%s: warning: model was trained on only %d context tokens (%d specified)\n",
                __func__, n_ctx_train, params.n_ctx);
    }

    // print system information
    {
        fprintf(stderr, "\n");
        fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
                params.n_threads, std::thread::hardware_concurrency(), llama_print_system_info());
    }

    // number of prompt tokens evaluated so far; also the position of the next batch's first token
    int n_past = 0;

    // tokenize the prompt (the last argument asks for a leading BOS token)
    auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);

    if (params.verbose_prompt) {
        fprintf(stderr, "\n");
        fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
        fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
        for (int i = 0; i < (int) embd_inp.size(); i++) {
            fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], llama_token_to_piece(ctx, embd_inp[i]).c_str());
        }
        fprintf(stderr, "\n");
    }

    if (embd_inp.size() > (size_t)params.n_ctx) {
        fprintf(stderr, "%s: error: prompt is longer than the context window (%zu tokens, n_ctx = %d)\n",
                __func__, embd_inp.size(), params.n_ctx);
        return 1;
    }

    // evaluate the prompt in batches of at most n_batch tokens
    while (!embd_inp.empty()) {
        int n_tokens = std::min(params.n_batch, (int) embd_inp.size());
        if (llama_decode(ctx, llama_batch_get_one(embd_inp.data(), n_tokens, n_past, 0), params.n_threads)) {
            fprintf(stderr, "%s : failed to eval\n", __func__);
            return 1;
        }
        n_past += n_tokens;
        embd_inp.erase(embd_inp.begin(), embd_inp.begin() + n_tokens);
    }

    // print the embedding as n_embd space-separated floats on a single line
    const int n_embd = llama_n_embd(ctx);
    const auto embeddings = llama_get_embeddings(ctx);

    for (int i = 0; i < n_embd; i++) {
        printf("%f ", embeddings[i]);
    }
    printf("\n");

    llama_print_timings(ctx);
    llama_free(ctx);
    llama_free_model(model);

    llama_backend_free();

    return 0;
}
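
The batched evaluation loop in `main` can be illustrated without linking against llama.cpp. The following is a minimal sketch, not the library's API: `decode_fn` and `eval_in_batches` are hypothetical names, and the callback stands in for the `llama_decode(...)` call. It walks the token vector with an offset instead of erasing from the front, which avoids shifting the remaining tokens after every batch.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for the llama_decode() call: receives a pointer to the
// chunk, the chunk length, and the position of its first token in the prompt.
// Returns true on success.
using decode_fn = std::function<bool(const int * tokens, int n_tokens, int n_past)>;

// Feed `tokens` to `decode` in chunks of at most `n_batch` tokens.
// Returns the number of tokens consumed, or -1 as soon as a chunk fails.
static int eval_in_batches(const std::vector<int> & tokens, int n_batch, const decode_fn & decode) {
    assert(n_batch > 0); // a non-positive batch size would never make progress

    int n_past = 0;
    while (n_past < (int) tokens.size()) {
        // the last chunk may be shorter than n_batch
        const int n_tokens = std::min(n_batch, (int) tokens.size() - n_past);
        if (!decode(tokens.data() + n_past, n_tokens, n_past)) {
            return -1;
        }
        n_past += n_tokens;
    }
    return n_past;
}
```

With seven tokens and `n_batch = 3`, the callback is invoked three times, with chunk sizes 3, 3 and 1 and positions 0, 3 and 6.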
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 19:26:40 +01:00
	every code line not attributed to one of the later commits listed below

examples: add missing <ctime> include for time() (#1011) 2023-04-16 12:13:00 +02:00
	`#include <ctime>`

Add git-based build information for better issue tracking (#1232) 2023-05-01 18:23:47 +02:00
	the blank line after `print_build_info();`

examples : add llama_init_from_gpt_params() common function (#1290) Signed-off-by: deadprogram 2023-05-02 22:39:51 +02:00
	`fprintf(stderr, "%s: error: unable to load model\n", __func__);`
	`return 1;` (model load failure)

llama : add llama_init_backend() API (close #1527) 2023-05-20 10:06:11 +02:00
	the blank line after `llama_backend_init(params.numa);`

build : fix and ignore MSVC warnings (#1889) 2023-06-16 20:23:53 +02:00
	`#if defined(_MSC_VER)`
	`#pragma warning(disable: 4244 4267) // possible loss of data`
	`#endif`

llama : make model stateless and context stateful (llama_state) (#1797) Co-authored-by: Georgi Gerganov 2023-06-24 10:47:58 +02:00
	`llama_model * model;`
	`std::tie(model, ctx) = llama_init_from_gpt_params(params);`
	`if (model == NULL) {`
	`llama_free_model(model);`

Use unsigned for random seed (#2006) Co-authored-by: Georgi Gerganov 2023-06-29 15:15:15 +02:00
	`if (params.seed == LLAMA_DEFAULT_SEED) {`
	`fprintf(stderr, "%s: seed = %u\n", __func__, params.seed);`

mpi : add support for distributed inference via MPI (#2099) Co-authored-by: Georgi Gerganov 2023-07-10 17:49:56 +02:00
	`llama_backend_init(params.numa);`
	`llama_backend_free();`

embedding : evaluate prompt in batches (#2713) 2023-08-22 16:03:12 +02:00
	the `if (embd_inp.size() > (size_t)params.n_ctx) { ... }` context-window check
	`while (!embd_inp.empty()) {`
	`int n_tokens = std::min(params.n_batch, (int) embd_inp.size());`
	`fprintf(stderr, "%s : failed to eval\n", __func__);`
	`return 1;` (decode failure)
	`n_past += n_tokens;`
	`embd_inp.erase(embd_inp.begin(), embd_inp.begin() + n_tokens);`
	the closing `}` of the `while` loop
	`const int n_embd = llama_n_embd(ctx);`
	`const auto embeddings = llama_get_embeddings(ctx);`
	`for (int i = 0; i < n_embd; i++) {`
	`printf("%f ", embeddings[i]);`
	`printf("\n");`

llama : more tokenizer fixes (#2810) Co-authored-by: klosax 2023-08-27 13:19:19 +02:00
	`fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], llama_token_to_piece(ctx, embd_inp[i]).c_str());`

fix some warnings from gcc and clang-tidy (#3038) Co-authored-by: xaedes 2023-09-07 19:22:29 +02:00
	`if (!gpt_params_parse(argc, argv, params)) {`

examples : make n_ctx warning work again (#3066) This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by default (#2901)"). 2023-09-08 17:43:35 +02:00
	`const int n_ctx_train = llama_n_ctx_train(ctx);`
	the `if (params.n_ctx > n_ctx_train) { ... }` warning block

examples : add compiler version and target to build info (#2998) 2023-09-15 22:59:49 +02:00
	`print_build_info();`

make : restore build-info.h dependency for several targets (#3205) 2023-09-18 16:03:53 +02:00
	`#include "build-info.h"`

llama : custom attention mask + parallel decoding + no context swaps (#3228) Co-authored-by: Georgi Gerganov, slaren 2023-09-28 18:04:36 +02:00
	`if (llama_decode(ctx, llama_batch_get_one(embd_inp.data(), n_tokens, n_past, 0), params.n_threads)) {`
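
The program writes the embedding to stdout as one line of space-separated values formatted with `%f`, which keeps six digits after the decimal point. A consumer of that output might read it back as sketched below. The helper names `parse_embedding` and `cosine_similarity` are hypothetical and not part of llama.cpp. Cosine similarity is shown only as one common way to compare two embedding vectors, not as something this example computes itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse one line of whitespace-separated floats, as produced by a "%f " print loop.
static std::vector<float> parse_embedding(const std::string & line) {
    std::vector<float> out;
    std::istringstream iss(line);
    float v;
    while (iss >> v) {
        out.push_back(v);
    }
    return out;
}

// Cosine similarity of two vectors of equal length.
// Returns 0 when either vector has zero norm (including the empty case).
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());

    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); i++) {
        dot    += (double) a[i] * b[i];
        norm_a += (double) a[i] * a[i];
        norm_b += (double) b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0f;
    }
    return (float) (dot / (std::sqrt(norm_a) * std::sqrt(norm_b)));
}
```

Because `%f` rounds each component, values read back this way are approximations of the original floats.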