
Batch size affects model's output #249

Closed
oKatanaaa opened this issue Mar 18, 2023 · 10 comments
Labels
bug (Something isn't working), generation quality (Quality of model output)

Comments

@oKatanaaa

I was tinkering with the code and made the following change at line 977 of main.cpp (the original condition seemed wrong to me):
from

if (embd.size() > params.n_batch) {
    break;
}

to

if (embd.size() >= params.n_batch) {
    break;
}

The model's (13B) outputs suddenly changed. I reverted the change and played with the batch size parameter instead; it really does affect the output.

I'm not sure whether this is expected behaviour; as far as I understand, it shouldn't be. Is it a bug? Or do different batch sizes simply produce slightly different evaluation results (rounding error)?
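For intuition on the rounding-error hypothesis, here is a toy C++ sketch (not llama.cpp code; purely illustrative) showing that summing the same float values in chunks of different sizes can yield slightly different totals, because floating-point addition is not associative:

#include <cstdio>
#include <vector>

// Sum the same values, but accumulate them in chunks of a given size.
float sum_in_chunks(const std::vector<float> & v, size_t chunk) {
    float total = 0.0f;
    for (size_t i = 0; i < v.size(); i += chunk) {
        float partial = 0.0f;
        for (size_t j = i; j < i + chunk && j < v.size(); ++j) {
            partial += v[j];
        }
        total += partial; // fold the chunk into the running total
    }
    return total;
}

int main() {
    std::vector<float> v;
    for (int i = 0; i < 10000; ++i) {
        v.push_back(0.1f + 1e-6f * i);
    }
    // The results typically differ in the last digits.
    printf("chunk=1:   %.9f\n", sum_in_chunks(v, 1));
    printf("chunk=8:   %.9f\n", sum_in_chunks(v, 8));
    printf("chunk=512: %.9f\n", sum_in_chunks(v, 512));
    return 0;
}

If the evaluation order inside the model changes with the batch size, differences of this magnitude can flip an argmax during sampling, after which the generated text diverges completely, even without any real bug.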

@alankila

There is most definitely something wrong with the way the prompt is fed into the program. Changes in the batch size can affect the output. Examples:

$ ./main -s 1000 -m models/7B/ggml-model-q4_0.bin --top_k 1 -b 1 -p "This is a sample prompt that expects continuation:"
"I'm going to have you write about your favorite memory. You can use any format, but I want it in my office by 10am tomorrow." (or whatever time) "You may start now..." [end of text]
$ ./main -s 1000 -m models/7B/ggml-model-q4_0.bin --top_k 1 -b 100 -p "This is a sample prompt that expects continuation:"
"The first time I saw you, it was love at first sight. You were so beautiful and charming."
This one has an expectation of completion too! It's just the beginning... [end of text]

I was experimenting with the word selection by forcing the most likely token every time (--top_k 1) to eliminate next-word sampling.

My guess is that batch size 1 gives the more "correct" behaviour of the model.

@maziyarpanahi

@alankila I've noticed similar behavior: batch_size=1 seemed to be more correct. But I was just eyeballing the results; the only definitive way would be to test it as in #270.

@alankila

Hm, yes, I agree. However, I have an interactive assistant with a prompt working, and if I use a batch size of 100 or anything else that eats the entire prompt, and all my conversation turns, at once, the model is continuously rather confused and keeps making mistakes in reading what I write. I think it is pretty obvious that higher batch sizes do not work correctly at present.

I also think the defaults are not too good, which is somewhat of an issue for this less scientific, unquantified approach. To be honest, I can't get the usual --top_p 0.9 to stay on topic, and repeat_last_n must be lowered considerably for chat mode, or the model generates the end-of-chat token instead of replying much of the time. My guess is that the likelihood of generating the "Bob:" text is penalized too much, so it tends to choose to end the discussion instead.

I am currently using the following parameters with the Bob-like assistant string:

./main -m ./models/7B/ggml-model-q4_0.bin -b 1 --ctx_size 2048 --temp 1.0 --top_k 100 --top_p 0.7 --repeat_last_n 20 --repeat_penalty 1.2 -n 2048 --color -i -r "User:" -p "Transcript of dialog between blah blah blah"

At least this way, with a lower top_p value, the model is normally quite coherent and I can have long chats with it. After enjoying the fairly coherent chat afforded by these settings, it is very obvious that increasing the batch size makes the model barely understand what I am saying to it.

@gjmulder added the bug (Something isn't working) and generation quality (Quality of model output) labels Mar 20, 2023
@setzer22

I'm not sure I understand the code well enough to draw this conclusion, but I think the whole point of batching is sacrificing quality for speed. That mainly applies to optimizations that cross the CPU<->GPU boundary fewer times, and for CPU inference I'm not even sure batching is significantly faster; it hasn't felt that way in my somewhat unscientific tests.

OTOH, to compute the attention mechanism for a token, the data for the previous tokens needs to be there, because every token in the input must attend to every previous token. So if you batch tokens in groups of N, I can imagine a downgrade in quality, unless someone has made sure the matrix multiplications are arranged so that the data for tokens 0..N-1 is available when computing token N (see the mask sketch below).
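For what it's worth, causal transformers normally handle exactly this with an attention mask rather than by serializing the batch. Below is a minimal, hypothetical C++ sketch (not the ggml implementation; the function and its layout are made up for illustration) of how scores for future positions would be masked out before the softmax:

#include <limits>
#include <vector>

// scores[i][j] is the raw attention score of the i-th new token in the batch
// against position j, where j runs over the n_past cached tokens followed by
// the new batch. Token (n_past + i) may only attend to positions j <= n_past + i.
std::vector<std::vector<float>> apply_causal_mask(
        std::vector<std::vector<float>> scores, int n_past) {
    for (int i = 0; i < (int) scores.size(); ++i) {
        for (int j = 0; j < (int) scores[i].size(); ++j) {
            if (j > n_past + i) {
                // future position: set to -inf so the softmax gives it weight 0
                scores[i][j] = -std::numeric_limits<float>::infinity();
            }
        }
    }
    return scores;
}

With such a mask, the data for tokens 0..N-1 is visible when computing token N, regardless of whether those earlier tokens came from the cache or from the same batch.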

@jarcen

jarcen commented Mar 21, 2023

That's incorrect, and batching shouldn't sacrifice anything. It should also be faster on CPU: all the PyTorch transformers I have had to run on CPU were significantly faster at reading prompts than at generating text. The transformer architecture allows computing the activations of a single layer for a whole batch in one go. Under the hood there are actually three steps:

  1. Query, Key and Value generation. Each token vector is multiplied by the Query, Key and Value matrices. There are no data dependencies between tokens, and the matrices are fixed. After that, the Keys and Values are put onto the memory tape (the hidden state).
  2. Masked self-attention is performed. Each token's destiny depends only on QKV matrices of itself and tokens that are placed before it. Since the QKV matrices are already computed, there are no data dependencies at this step.
  3. The results go through the feed-forward network. Nothing interesting here: the FFN is a pure function, its parameters are fixed, it has no hidden state, and there are no data dependencies.

This is done for each layer, one by one: a batch goes in, a batch goes out. (A rough sketch follows at the end of this comment.)

A huge part of why transformers overtook RNNs is this property that allows training on whole data chunks in one pass.
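To make those three steps concrete, here is a rough single-head, single-layer C++ sketch (illustrative only, not the llama.cpp implementation; the weight matrices and dimensions are placeholders). Steps 1 and 3 are purely per-token, and step 2 only reads K/V entries that are already in the cache, including the ones this very batch appended in step 1:

#include <cmath>
#include <vector>

using Vec = std::vector<float>;

// Placeholder dense projection: y = W * x, with W stored as rows of floats.
static Vec matvec(const std::vector<Vec> & W, const Vec & x) {
    Vec y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); ++i) {
        for (size_t j = 0; j < x.size(); ++j) {
            y[i] += W[i][j] * x[j];
        }
    }
    return y;
}

struct ToyLayer {
    std::vector<Vec> Wq, Wk, Wv, Wff;  // fixed weights
    std::vector<Vec> k_cache, v_cache; // the "memory tape" (hidden state)

    // Process a batch of new token vectors in place.
    void forward(std::vector<Vec> & batch) {
        const size_t n_past = k_cache.size();

        // Step 1: per-token Q/K/V projections; no dependency between tokens.
        // New K and V rows are appended to the cache.
        std::vector<Vec> q(batch.size());
        for (size_t t = 0; t < batch.size(); ++t) {
            q[t] = matvec(Wq, batch[t]);
            k_cache.push_back(matvec(Wk, batch[t]));
            v_cache.push_back(matvec(Wv, batch[t]));
        }

        // Step 2: masked self-attention. The token at position n_past + t only
        // reads cache entries 0 .. n_past + t, all of which exist after step 1.
        for (size_t t = 0; t < batch.size(); ++t) {
            const size_t last = n_past + t;
            std::vector<float> w(last + 1);
            float norm = 0.0f;
            for (size_t j = 0; j <= last; ++j) {
                float s = 0.0f;
                for (size_t d = 0; d < q[t].size(); ++d) {
                    s += q[t][d] * k_cache[j][d];
                }
                w[j] = std::exp(s / std::sqrt((float) q[t].size()));
                norm += w[j];
            }
            Vec out(batch[t].size(), 0.0f);
            for (size_t j = 0; j <= last; ++j) {
                for (size_t d = 0; d < out.size(); ++d) {
                    out[d] += (w[j] / norm) * v_cache[j][d];
                }
            }
            batch[t] = out;
        }

        // Step 3: feed-forward, again purely per-token.
        for (size_t t = 0; t < batch.size(); ++t) {
            batch[t] = matvec(Wff, batch[t]);
        }
    }
};

Whether the prompt is fed one token at a time or as one big batch, in exact arithmetic the cache contents and outputs come out identical; only the evaluation order, and hence the floating-point rounding, differs.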

@setzer22

setzer22 commented Mar 21, 2023

A huge part of why transformers overtook RNNs is this property that allows training on whole data chunks in one pass.

@jarcen Ok, yeah, this makes sense in general terms, and you seem to know more about it than I do. Sorry for adding noise to the discussion.

One question though, because I'd like to make sure I understood your point correctly:

Each token's destiny depends only on QKV matrices of itself and tokens that are placed before it.

When batching inputs, the "tokens that are placed before it" are part of the batch and are being computed at the same time, aren't they? Or do you mean the data dependency only goes to previous tokens in previous layers?

And leaving that aside, I (and others) have clearly observed output quality differences when varying the batch size. So if this is not an issue theoretically speaking, then could it be a bug in the implementation?

@jarcen

jarcen commented Mar 21, 2023

They are not being computed at the same time. Computation within one layer is separated into the three steps I listed above. Step 2 operates on Query/Key/Value matrices that were already created in step 1. The Key and Value matrices are no longer part of the batch but part of the hidden state. Each Query vector at position N looks for Key vectors at positions N, N-1, N-2, N-3, etc. That includes the Key vectors that already existed and the ones just added from the batch in step 1. If there are four Q vectors, then four threads can process them in parallel; there is no data dependency between these threads.

(Note that I seem to use "vector" and "matrix" interchangeably, but matrices are essentially how batching is implemented: each row is a vector. So individual per-token operations are explained in terms of vectors.)

Example code expressing the idea of self-attention with string operations:

// Find every position <= `position` at which `symbolToFind` occurs in `str`.
List<int> FindAllOccurencesBefore(string str, char symbolToFind, int position) {
    List<int> found = new List<int>();
    while (position >= 0) {            // walk from `position` back to the start
        if (str[position] == symbolToFind)
            found.Add(position);
        position--;
    }
    return found;
}

This code can be run in parallel in multiple threads. One thread might start at position 5, another at 6, 7, 8 and so on; they do not conflict in any way. That is essentially what happens at step 2, except that the characters are Key vectors and symbolToFind is a Query vector. The string has already been updated in step 1, with the new elements from the batch appended to the tail.

Now for the quality: yes, it must be a bug somewhere. I have read llama_eval multiple times and can't find any error. I think it hides somewhere in the ggml operators, but reading that vectorized code is practically impossible.

@ggerganov
Owner

Can you guys give it a test with the latest master? I believe the results should now be the same for different batch sizes.

@Alumniminium

Can you guys give it a test with the latest master? I believe the results should now be the same for different batch sizes.

Latest master still generates different outputs (using the same seed and prompt, but a different batch size).

@ebudmada

ebudmada commented Nov 5, 2023

I also have this problem. Is this a limitation of llama.cpp? Why is this thread closed?
