Consistency of input embedding vectors when extracted with different methods #9015

lloydwatts60 · 2024-08-13T05:45:13Z

lloydwatts60
Aug 13, 2024

Thank you for making llama.cpp available, it is amazing software.

I have been trying to extract internal representations (i.e. normalized embedding vector in the second layer), to use for training a new neural network. But when I ran it in inference, I got inconsistent results. I eventually traced the problem to getting different normalized embedding vectors in the second layer for the same input tokens, depending on whether I ran the inference in batch mode or streaming mode. I am using llama.cpp server, and the model is llama-2-7b-chat, Q6_K.

I figured that I must be extracting the embedding vectors in the second layer incorrectly, so I tried extracting the embedding vectors in the first layer (not normalized), and I found the same inconsistency. Since the first layer embedding vectors come from calling ggml_get_rows() on the input tokenIDs, I tried extracting the embedding vectors that way (thinking they should be ground truth), and they did not agree with any of the other results!

Here is the code I used in llama.cpp to extract the embedding vectors:

        // the output is always the last tensor in the graph
        struct ggml_tensor * res  = gf->nodes[gf->n_nodes - 1];
        struct ggml_tensor * embd = gf->nodes[gf->n_nodes - 2];          // final embedding vector
        struct ggml_tensor * embeddings = gf->nodes[0];                       // initial embedding vector
        if (true) {
            for (int i=0; i<3; i++) {
                printf("i = %d\n",i);
                // ggml_get_f32_nd method
                // retrieve the embedding vector
                printf("about to print embeddings parameters using ggml_get_f32_nd method\n");
                printf("EmbeddingVector[0:5]         = [");
                for (int jj=0; jj<5; jj++) {
                    // printf("embeddings[%5d][%5d] = %14.11f\n", i, jj, ggml_get_f32_nd(embeddings, jj, i, 0, 0));
                    float floatval = ggml_get_f32_nd(embeddings, jj, i, 0, 0);
                    printf("%12.9f",floatval);
                    if (jj<4) {
                        printf(", ");
                    } else {
                        printf("]\n");
                    }
                }
                printf("EmbeddingVector[4091:4096]   = [");
                for (int jj=4091; jj<4096; jj++) {
                    // printf("embeddings[%5d][%5d] = %14.11f\n", i, jj, ggml_get_f32_nd(embeddings, jj, i, 0, 0));
                    float floatval = ggml_get_f32_nd(embeddings, jj, i, 0, 0);
                    printf("%12.9f",floatval);
                    if (jj<4095) {
                        printf(", ");
                    } else {
                        printf("]\n");
                    }
                }
            }
        }

I also tried several other methods, as below, but they all agreed with the above method:

                        switch (cparams.pooling_type) {
                            case LLAMA_POOLING_TYPE_NONE:
                                {
                                    printf("cparams.pooling_type == LLAMA_POOLING_TYPE_NONE\n");
                                    // extract token embeddings
                                    GGML_ASSERT(lctx.embd != nullptr);
                                    printf("embeddings_out = lctx.embd + n_outputs_prev*n_embd\n");  // original way
                                    // printf("embeddings_out = ((float *)embeddings) + n_outputs_prev*n_embd\n");     // changed by Lloyd
                                    printf("n_outputs_prev = %d\n",n_outputs_prev);
                                    printf("n_embd         = %ld\n",n_embd);
                                    float * embeddings_out = lctx.embd + n_outputs_prev*n_embd;  // original way
                                    // float * embeddings_out = ((float * )embeddings) + n_outputs_prev*n_embd;    // changed by Lloyd
                                    // const int32_t n_outputs_new = lctx.n_outputs;  // only suitable for inference tokens
                                    const int32_t n_outputs_new = embeddings->ne[1];  // suitable for prompt tokens, added by Lloyd
                                    printf("n_outputs_new  = %d\n",n_outputs_new);
                                    printf("n_outputs      = %d\n",n_outputs);
                                    printf("lctx.embd_size = %ld\n",lctx.embd_size);

                                    if (n_outputs_new) {
                                        printf("n_outputs_new  != 0\n");
                                        GGML_ASSERT( n_outputs_prev + n_outputs_new <= n_outputs);
                                        GGML_ASSERT((n_outputs_prev + n_outputs_new)*n_embd <= (int64_t) lctx.embd_size);
                                        printf("call ggml_backend_tensor_get_async\n");
                                        ggml_backend_tensor_get_async(backend_embeddings, embeddings, embeddings_out, 0, n_outputs_new*n_embd*sizeof(float));
                                        printf("done call ggml_backend_tensor_get_async\n");
                                        printf("embeddings should be available now, in lctx.embd (float * lctx.embd) \n");

                                    }
                                } break;

I used this method to directly read the rows of the embedding vector lookup table tok_embd:

                // LOG_TEE("\nNow we can find the expected embedding vectors for the first 3 tokens\n");  // Added by Lloyd

                // LOG_TEE("\nNow let's try a row-selection using ggml_get_rows() on the real 4096x32000 tok_embd matrix\n");  // Added by Lloyd
                struct ggml_init_params params = {
                    /* .mem_size   = */ 16*1024*1024,
                    /* .mem_buffer = */ NULL,
                    /* .no_alloc   = */ false
                };

                // LOG_TEE("allocate memory\n");  // Added by Lloyd
                // memory allocation happens here
                struct ggml_context * ctx = ggml_init(params);

                // LOG_TEE("create mx tensor\n");  // Added by Lloyd

                const int nz = 1;

                struct ggml_tensor * mx = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, nz);
                int selectval;
                selectval = firstToken;  // values of interest:  1, 450, and 20886
                // printf("selectval = %d\n", selectval);
                for (int y = 0; y < nz; y++) {
                    *(int *) ((char *) mx->data + y*mx->nb[0]) = selectval;
                    // printf("mx[%d] = %d\n", y, selectval);
                }

                // LOG_TEE("create ma tensor\n");  // Added by Lloyd

                struct ggml_tensor * ma = model.tok_embd;
                ggml_set_param(ctx, mx); // mx is an input variable

                struct ggml_tensor * f = ggml_get_rows(ctx, ma, mx);
                // LOG_TEE("define graph\n");  // Added by Lloyd
                struct ggml_cgraph * gf = ggml_new_graph(ctx);
                ggml_build_forward_expand(gf, f);

                selectval = firstToken;  // values of interest:  1, 450, and 26428
                printf("selectval = %d\n", selectval);
                for (int y = 0; y < nz; y++) {
                    *(int *) ((char *) mx->data + y*mx->nb[0]) = selectval;
                    // printf("mx[%d] = %d\n", y, selectval);
                }
                // LOG_TEE("call ggml_graph_compute_with_ctx()\n");  // Added by Lloyd
                int n_threads = 12;
                ggml_graph_compute_with_ctx(ctx, gf, n_threads);
                // printf("selectval = %d\n", selectval);
                for (int i=0; i<5; i++) {
                    printf("f[%5d] = %14.11f\n", i, ggml_get_f32_1d(f, i));
                    embeddingVectorSummary0[i] = ggml_get_f32_1d(f, i);
                }
                for (int i=4091; i<4096; i++) {
                    printf("f[%5d] = %14.11f\n", i, ggml_get_f32_1d(f, i));
                    embeddingVectorSummary0[i-4091+5] = ggml_get_f32_1d(f, i);
                }

Prompt:
The Ministry of Finance has fabricated and published the price index for the sale of the Republic of China on Taiwan and the price doubling table for asset revaluation over the years.
The first three tokens are:

    1   <s>
  450   The
20886   Ministry

And here are different results for the input embedding vectors:

about to print expected embeddings vectors directly from ggml_get_rows  (from llama.cpp)
expectEmbedding[0:5]           = [ 0.001822472, -0.003905296,  0.001041412,  0.002863884, -0.003905296]
expectEmbedding[4091:4096]     = [-0.000604987, -0.000907481, -0.009074807,  0.002722442, -0.003629923]

expectEmbedding[0:5]           = [ 0.002085149,  0.006255448, -0.002085149,  0.000695050,  0.011815846]
expectEmbedding[4091:4096]     = [-0.008320808,  0.002912283,  0.002912283,  0.009568930, -0.004992485]

expectEmbedding[0:5]           = [ 0.041777492, -0.026953220, -0.002695322,  0.004042983, -0.002695322]
expectEmbedding[4091:4096]     = [ 0.036323547, -0.010215998,  0.021567106,  0.006810665,  0.013621330]

about to print embeddings parameters using ggml_get_f32_nd method method (from llama.cpp)
embeddingVector[0:5]           = [ 0.054088578, -0.273807883,  0.283411920, -0.147964776, -0.108340040]
embeddingVector[4091:4096]     = [-0.121443428, -0.053086698,  0.390391409,  0.095777482, -0.148853838]

embeddingVector[0:5]           = [ 1.563822508,  1.583688498,  1.193519115, -0.789622843,  1.337926149]
embeddingVector[4091:4096]     = [ 0.132274002, -0.491592288,  1.056189179,  1.711194634,  0.385757208]

embeddingVector[0:5]           = [ 1.515929937,  0.041398883, -1.669434071,  0.641003013, -0.823098660]
embeddingVector[4091:4096]     = [ 0.362618774, -1.438845634,  0.192778036,  0.233350039,  1.312347054]

I also tried extracting the embedding vectors from the Pytorch unquantized implementation of llama-2-7b-chat, and got these further different results:

reading file=./embeddingVectors/embeddingVector_101.pkl  (from Pytorch)
len(embeddingVectors) = 38
embeddingVectors[0][0:5]       = [ 0.599093794,  0.035608001,  0.046576488,  0.155605584, -0.043579109]
embeddingVectors[0][4091:4096] = [-0.180217817, -0.310019761,  0.032882556,  0.087080679,  0.188917875]

embeddingVectors[1][0:5]       = [ 3.824545383, -2.141649723, -1.796370387,  0.944859862, -1.070262312]
embeddingVectors[1][4091:4096] = [ 0.327670574,  1.397792339,  1.419619917, -2.256069660, -1.454381585]

embeddingVectors[2][0:5]       = [ 0.736601352, -1.323347330, -0.565184593,  0.276699811,  0.041776295]
embeddingVectors[2][4091:4096] = [ 2.827355861, -0.910324513,  0.306743860, -1.321659088,  0.815176725]

So my question is: Is there a reason why these input embedding vectors should be different, for the same input tokenIDs?
Is one of them right? Or are all of my attempts wrong? What am I doing wrong?

Thank you in advance for any insights on this problem.
Lloyd

slaren · 2024-08-13T13:59:11Z

slaren
Aug 13, 2024
Collaborator

Tensors in the computation graph may be overwritten by later operations. To avoid this, you can use ggml_set_output to mark a tensor as an output and prevent it from being overwritten.

1 reply

ggerganov Aug 13, 2024
Maintainer

Another option is to use the eval callback: see https://github.com/ggerganov/llama.cpp/tree/master/examples/eval-callback

lloydwatts60 · 2024-08-13T15:45:35Z

lloydwatts60
Aug 13, 2024
Author

@slaren @ggerganov
Thank you both very much for your replies.

I’ve tried adding ggml_set_output(embeddings), but it doesn’t change the output of the embedding vector.

printf("set embeddings = gf->nodes[0]\n");
struct ggml_tensor * embeddings = gf->nodes[0];  // added by Lloyd

printf("ggml_set_output(embeddings)  (to prevent overwriting by later operations)\n");
ggml_set_output(embeddings);                     // as suggested by slaren, should prevent overwriting of this   
                                                 // tensor

gf->n_nodes = 1030
set embeddings = gf->nodes[0]
ggml_set_output(embeddings)  (to prevent overwriting by later operations)
embeddings->name  = inp_embd


about to print expected embeddings vectors directly from ggml_get_rows
expectEmbedding[0:5]         = [ 0.001822472, -0.003905296,  0.001041412,  0.002863884, -0.003905296]
expectEmbedding[4091:4096]   = [-0.000604987, -0.000907481, -0.009074807,  0.002722442, -0.003629923]

about to print embeddings parameters using approved method, copied from backend to lctx.embd
EmbeddingVector[0:5]         = [ 0.054088578, -0.273807883,  0.283411920, -0.147964776, -0.108340040]
EmbeddingVector[4091:4096]   = [-0.121443428, -0.053086698,  0.390391409,  0.095777482, -0.148853838]

about to print embeddings parameters using ggml_get_f32_nd method
EmbeddingVector[0:5]         = [ 0.054088578, -0.273807883,  0.283411920, -0.147964776, -0.108340040]
EmbeddingVector[4091:4096]   = [-0.121443428, -0.053086698,  0.390391409,  0.095777482, -0.148853838]

about to print embeddings parameters using direct read method
EmbeddingVector[0:5]         = [ 0.054088578, -0.273807883,  0.283411920, -0.147964776, -0.108340040]
EmbeddingVector[4091:4096]   = [-0.121443428, -0.053086698,  0.390391409,  0.095777482, -0.148853838]

about to print embeddings parameters using memcpy to vector method
embeddingVector[0:5]         = [ 0.054088578, -0.273807883,  0.283411920, -0.147964776, -0.108340040]
embeddingVector[4091:4096]   = [-0.121443428, -0.053086698,  0.390391409,  0.095777482, -0.148853838]

    tokenNum =      0   token id =     1                attIndex =   681   attProb = 0.002

about to print expected embeddings vectors directly from ggml_get_rows
expectEmbedding[0:5]         = [ 0.002085149,  0.006255448, -0.002085149,  0.000695050,  0.011815846]
expectEmbedding[4091:4096]   = [-0.008320808,  0.002912283,  0.002912283,  0.009568930, -0.004992485]

about to print embeddings parameters using approved method, copied from backend to lctx.embd
EmbeddingVector[0:5]         = [ 1.563822508,  1.583688498,  1.193519115, -0.789622843,  1.337926149]
EmbeddingVector[4091:4096]   = [ 0.132274002, -0.491592288,  1.056189179,  1.711194634,  0.385757208]

about to print embeddings parameters using ggml_get_f32_nd method
EmbeddingVector[0:5]         = [ 1.563822508,  1.583688498,  1.193519115, -0.789622843,  1.337926149]
EmbeddingVector[4091:4096]   = [ 0.132274002, -0.491592288,  1.056189179,  1.711194634,  0.385757208]

about to print embeddings parameters using direct read method
EmbeddingVector[0:5]         = [ 1.563822508,  1.583688498,  1.193519115, -0.789622843,  1.337926149]
EmbeddingVector[4091:4096]   = [ 0.132274002, -0.491592288,  1.056189179,  1.711194634,  0.385757208]

about to print embeddings parameters using memcpy to vector method
embeddingVector[0:5]         = [ 1.563822508,  1.583688498,  1.193519115, -0.789622843,  1.337926149]
embeddingVector[4091:4096]   = [ 0.132274002, -0.491592288,  1.056189179,  1.711194634,  0.385757208]

    tokenNum =      1   token id =   450                attIndex =    66   attProb = 0.004

about to print expected embeddings vectors directly from ggml_get_rows
expectEmbedding[0:5]         = [ 0.041777492, -0.026953220, -0.002695322,  0.004042983, -0.002695322]
expectEmbedding[4091:4096]   = [ 0.036323547, -0.010215998,  0.021567106,  0.006810665,  0.013621330]

about to print embeddings parameters using approved method, copied from backend to lctx.embd
EmbeddingVector[0:5]         = [ 1.515929937,  0.041398883, -1.669434071,  0.641003013, -0.823098660]
EmbeddingVector[4091:4096]   = [ 0.362618774, -1.438845634,  0.192778036,  0.233350039,  1.312347054]

about to print embeddings parameters using ggml_get_f32_nd method
EmbeddingVector[0:5]         = [ 1.515929937,  0.041398883, -1.669434071,  0.641003013, -0.823098660]
EmbeddingVector[4091:4096]   = [ 0.362618774, -1.438845634,  0.192778036,  0.233350039,  1.312347054]

about to print embeddings parameters using direct read method
EmbeddingVector[0:5]         = [ 1.515929937,  0.041398883, -1.669434071,  0.641003013, -0.823098660]
EmbeddingVector[4091:4096]   = [ 0.362618774, -1.438845634,  0.192778036,  0.233350039,  1.312347054]

about to print embeddings parameters using memcpy to vector method
embeddingVector[0:5]         = [ 1.515929937,  0.041398883, -1.669434071,  0.641003013, -0.823098660]
embeddingVector[4091:4096]   = [ 0.362618774, -1.438845634,  0.192778036,  0.233350039,  1.312347054]

    tokenNum =      2   token id = 20886                attIndex =   101   attProb = 0.999

I will try @ggerganov 's suggestion of eval-callback tonight. I think that should give me the ground truth I am looking for, and then I will be able to inspect the code to see how it's done, and incorporate the correct method into my code.
Thank you very much!
Lloyd

1 reply

lloydwatts60 Aug 14, 2024
Author

@slaren @ggerganov

I'm delighted to report that the llama-eval-callback method worked like a charm, giving input embedding vectors which agree with my expected embedding vectors directly from ggml_get_rows.

eval-callback --model ../models/llama-2-7b-chat.Q6_K.gguf --prompt "The Ministry"

gives this output:

Calling llama_graph_compute
In llama_graph_compute
ggml_debug:                 inp_embd = (f32)   GET_ROWS(token_embd.weight{4096, 32000, 1, 1}, inp_tokens{3, 1, 1, 1}}) = {4096, 3, 1, 1}
                                     [
                                      [
                                       [      0.0018,      -0.0039,       0.0010, ...,      -0.0091,       0.0027,      -0.0036],
                                       [      0.0021,       0.0063,      -0.0021, ...,       0.0029,       0.0096,      -0.0050],
                                       [      0.0418,      -0.0270,      -0.0027, ...,       0.0216,       0.0068,       0.0136],
                                      ],
                                     ]
                                     sum = 0.056849

which agrees with my expected embedding vectors:

about to print expected embeddings vectors directly from ggml_get_rows  (from llama.cpp)
expectEmbedding[0:5]           = [ 0.001822472, -0.003905296,  0.001041412,  0.002863884, -0.003905296]
expectEmbedding[4091:4096]     = [-0.000604987, -0.000907481, -0.009074807,  0.002722442, -0.003629923]

expectEmbedding[0:5]           = [ 0.002085149,  0.006255448, -0.002085149,  0.000695050,  0.011815846]
expectEmbedding[4091:4096]     = [-0.008320808,  0.002912283,  0.002912283,  0.009568930, -0.004992485]

expectEmbedding[0:5]           = [ 0.041777492, -0.026953220, -0.002695322,  0.004042983, -0.002695322]
expectEmbedding[4091:4096]     = [ 0.036323547, -0.010215998,  0.021567106,  0.006810665,  0.013621330]

And now I have the code in eval-callback.cpp that shows how to setup the callback function and extract the data from it.

Thank you very much for the help. I would not have been able to figure this out on my own.

Best,
Lloyd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistency of input embedding vectors when extracted with different methods #9015

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Consistency of input embedding vectors when extracted with different methods #9015

lloydwatts60 Aug 13, 2024

Replies: 2 comments · 2 replies

slaren Aug 13, 2024 Collaborator

ggerganov Aug 13, 2024 Maintainer

lloydwatts60 Aug 13, 2024 Author

lloydwatts60 Aug 14, 2024 Author

lloydwatts60
Aug 13, 2024

Replies: 2 comments 2 replies

slaren
Aug 13, 2024
Collaborator

ggerganov Aug 13, 2024
Maintainer

lloydwatts60
Aug 13, 2024
Author

lloydwatts60 Aug 14, 2024
Author