System Info

transformers version: 4.43.3

Who can help?

@gante

Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)

Reproduction
Hi!

If you set trigger_error to True, you will see that the decoder-attention (and cross-attention) shapes differ depending on whether the translation is produced by model.generate() or by model(). I don't know if this is a bug or just expected to be different. I have checked that the attention values are the same once all the information is structured the same way (there are precision differences, though, which I think arise because model.generate() runs its forward passes differently than a single model() call).

Expected behavior

I would expect the decoder-attention and cross-attention shapes to have the same format regardless of whether I use model.generate() or model(). Specifically, I would expect the format returned by model(), where for the decoder we obtain, for each layer, a tensor of shape (batch_size, attention_heads, generated_tokens - 1, generated_tokens - 1).
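For concreteness, the two layouts can be sketched with plain NumPy placeholders (not real transformers calls). The sizes below (2 layers, 4 heads, 5 generated tokens) are made up, and the per-step key length assumes generate() decodes with the KV cache enabled:

```python
# Schematic sketch of the two attention-output structures, using NumPy
# placeholders instead of real transformers calls. All sizes are made up.
import numpy as np

num_layers, num_heads, batch, gen_tokens = 2, 4, 1, 5

# model(): a tuple with one entry per layer, each of shape
# (batch_size, attention_heads, generated_tokens - 1, generated_tokens - 1).
forward_attns = tuple(
    np.zeros((batch, num_heads, gen_tokens - 1, gen_tokens - 1))
    for _ in range(num_layers)
)

# model.generate(): an outer tuple with one entry per decoding step; each
# entry is itself a per-layer tuple. At step t the query length is 1 and
# the key length is t + 1 (only the newest token attends over the cache).
generate_attns = tuple(
    tuple(np.zeros((batch, num_heads, 1, step + 1)) for _ in range(num_layers))
    for step in range(gen_tokens - 1)
)

print(forward_attns[0].shape)       # (1, 4, 4, 4): one tensor per layer
print(generate_attns[0][0].shape)   # (1, 4, 1, 1): step 0, layer 0
print(generate_attns[3][0].shape)   # (1, 4, 1, 4): step 3, layer 0
```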
Namely, generate's attention output is a tuple where each item is the attention output of one forward pass. In your example, if you replace e.g. translated_tokens.decoder_attentions with translated_tokens.decoder_attentions[0], you'll obtain the results you were expecting :)
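If you instead want generate()'s per-step rows assembled back into one square per-layer matrix (the model() layout), a hypothetical helper along these lines could do it. stitch_steps and its zero-padding convention are my own for illustration, not part of transformers, and the input shapes assume cached decoding as above:

```python
import numpy as np

def stitch_steps(per_step, layer):
    """Assemble generate()-style attentions for one layer into a single
    (batch, heads, steps, steps) matrix, zero-padding each step's row up
    to the final key length (the zeros land on future, i.e. masked, keys).

    per_step: tuple over decoding steps of per-layer tuples of arrays
    shaped (batch, heads, 1, step + 1).
    """
    steps = len(per_step)
    rows = []
    for t in range(steps):
        row = per_step[t][layer]        # (batch, heads, 1, t + 1)
        pad = steps - row.shape[-1]     # zeros for not-yet-generated keys
        rows.append(np.pad(row, ((0, 0), (0, 0), (0, 0), (0, pad))))
    return np.concatenate(rows, axis=2)

# Toy data: 4 steps, 2 layers, batch 1, 4 heads; step t is filled with
# the value t + 1 so the stitched rows are easy to recognize.
per_step = tuple(
    tuple(np.full((1, 4, 1, t + 1), float(t + 1)) for _ in range(2))
    for t in range(4)
)
full = stitch_steps(per_step, layer=0)
print(full.shape)   # (1, 4, 4, 4)
print(full[0, 0])   # rows [1 0 0 0], [2 2 0 0], [3 3 3 0], [4 4 4 4]
```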