
Query preprocessing #12

Closed
hannah348 opened this issue Jul 25, 2024 · 4 comments
Comments

@hannah348

I was using the `process_queries` method and realized that the processor uses left padding. As a result, `batch_query["input_ids"] = batch_query["input_ids"][..., processor.image_seq_length :]` cuts off the padding tokens, and whenever there are padding tokens in a sequence, the end of the image sequence ends up at the beginning of the query. Is that intentional? If so, what is the reasoning behind it?
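
To make the misalignment concrete, here is a minimal sketch with made-up token ids and a toy `image_seq_length` (not the real vocabulary or sequence lengths):

```python
import torch

# Toy ids: 0 = <pad>, 9 = <image> placeholder, 11/12 = query text tokens.
image_seq_length = 4

# With padding_side="left", a shorter query is padded *before* the image block:
#   [pad, pad, img, img, img, img, q1, q2]
input_ids = torch.tensor([[0, 0, 9, 9, 9, 9, 11, 12]])

# Dropping the first `image_seq_length` tokens removes the pads plus only part
# of the image block, so image tokens leak into the "query":
print(input_ids[..., image_seq_length:])  # tensor([[ 9,  9, 11, 12]])
```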

@ManuelFay
Collaborator

Hello! Thanks for the catch! For PaliGemma, the tokenizer should be set with padding side = "right". I am pushing an update to force that behaviour and will push updated checkpoints!
In practice, as is, with a mock image it just introduces extra noise but should still work!
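
For reference, the intended setting is just right padding on the tokenizer, e.g. (base checkpoint name given only as an example):

```python
from transformers import AutoProcessor

# Example base checkpoint; the ColPali processor reuses the same tokenizer.
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-448")

# Pads now go after the query text, so the image block stays aligned at the front.
processor.tokenizer.padding_side = "right"
```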

@hannah348
Author

Does the model require that extra noise? If this was done during training as well, the padding tokens might degrade performance, since the model has fewer tokens available to represent the text and the attention mask does not exclude the image tokens in `colpali_engine/models/paligemma_colbert_architecture.py` line 83.

@ManuelFay
Collaborator

ManuelFay commented Jul 29, 2024

In practice, the tokens corresponding to the `input_ids` that should be replaced by the image soft tokens currently get included in the query. Since they are associated with nothing (and never trained, because they are otherwise replaced), they act as a learned padding token (but with attention). This should not hurt performance particularly and might even act as extra "buffer tokens".

I am retraining checkpoints with the fix (since it happens during training as well), which I will release once I push the update, along with the benchmark results!

@ManuelFay
Collaborator

So everything should be fixed!

It would be awesome if you could confirm using this new model:
https://huggingface.co/vidore/colpali-v1.1

and the code in branch: https://github.com/illuin-tech/colpali/tree/hard-negs

The base model version is fixed and the padding side is set to right, so the issue should be resolved @hannah348
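
A quick sanity check, assuming the new checkpoint ships the updated tokenizer config:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("vidore/colpali-v1.1")
assert processor.tokenizer.padding_side == "right"
```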

ManuelFay added a commit that referenced this issue Aug 29, 2024
## [0.2.0] - 2024-08-29
 
Large refactoring to address several issues and add features. This release is not backward compatible with previous versions.
Models trained under this version will exhibit degraded performance if used with the previous version of the code, and vice versa.

[Branch](#23)
 

### Added
- Added multiple options for training with hard negatives. This leads to better model performance!
- Added options for restarting training from a checkpoint.

### Changed

- Optionally load ColPali models from pre-initialized backbones of the same shape to remove any stochastic initialization when loading adapters. This fixes [11](#11) and [17](#17).
 
### Fixed
- Set padding side to right in the tokenizer to fix a misalignment issue between different query lengths in the same batch. Fixes [12](#12).
- Add 10 extra pad tokens by default to the query to act as reasoning buffers (see the sketch below). This enables the above fix without degrading performance and cleans up the old technique of using `<unused>` tokens.
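
Roughly, the query-side preprocessing now looks like the following sketch (illustrative only; the exact prompt prefix and suffix live in `process_queries`):

```python
def build_query_text(query: str, buffer_tokens: int = 10) -> str:
    # Append extra <pad> tokens after the query text; with right padding and a
    # full attention mask they act as learned "buffer" positions for the query.
    return f"Question: {query}" + "<pad>" * buffer_tokens
```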