
encoding in batches with token length leads to different results? #2570

Open · achibb opened this issue Apr 3, 2024 · 4 comments · May be fixed by #2571

Comments

@achibb commented Apr 3, 2024

Hi! I was just wondering:

https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L350

You sort by token length, which is really great for performance. However, I was wondering whether sentences with different token lengths can still end up in the same batch and get padded. For some models it has been observed that the pad token changes the mean slightly. Is it the case that sentence-transformers also batches sentences of different token lengths together?

Thank you for the great repo!
Aaron

@achibb (Author) commented Apr 3, 2024

Oh, or do you sort by text length? It appears so.
For me, sorting by token length gives clean and quick embeddings.
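
For context, a minimal sketch of length-sorted batching by character length, which is what the linked encode() code appears to do (names here are illustrative, not the library's internals):

```python
import numpy as np

def batch_by_text_length(sentences, batch_size=32):
    """Sort sentences by character length (longest first) and yield batches.

    Grouping similar-length texts into the same batch reduces how much
    padding each batch needs, which is the performance win discussed here.
    """
    # argsort on negative lengths puts the longest sentences first
    length_sorted_idx = np.argsort([-len(s) for s in sentences])
    sentences_sorted = [sentences[i] for i in length_sorted_idx]

    for start in range(0, len(sentences_sorted), batch_size):
        yield sentences_sorted[start:start + batch_size]
```

In encode() the embeddings are then restored to the original input order afterwards, so callers never see the sorting.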

@achibb (Author) commented Apr 3, 2024

Here is an approach that first calculates token length instead of text length and sorts by that:

#2571

Feel free to modify it / tell me your opinion.
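
Roughly, the idea is to sort by token count rather than character count. A sketch of that, assuming a Hugging Face tokenizer (helper names are illustrative, not the PR's actual code):

```python
def batch_by_token_length(sentences, tokenizer, batch_size=32):
    """Sort sentences by token count (longest first) and yield batches.

    The extra tokenizer pass here is the "double tokenization" cost
    discussed in the next comment: the model tokenizes the texts again
    when encoding each batch.
    """
    token_counts = [len(tokenizer.encode(text, add_special_tokens=True)) for text in sentences]
    sorted_idx = sorted(range(len(sentences)), key=lambda i: -token_counts[i])
    sentences_sorted = [sentences[i] for i in sorted_idx]

    for start in range(0, len(sentences_sorted), batch_size):
        yield sentences_sorted[start:start + batch_size]
```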

@tomaarsen (Collaborator)

Hello!

Thanks for reporting and for the PR! Sorting by token size seems a lot more consistent, but I'm a bit wary of the double tokenization. I understand that sorting samples by length allows you to get less padding in a batch, which should process more quickly, but perhaps efficient string sorting + slightly inefficient batches is faster than inefficient tokenization + efficient batches?
Perhaps some performance tests are needed to get a good understanding of what's best.

As for your original question, my understanding is that the padding tokens are all fully ignored. They should not influence the eventual embedding results. If they do, please share a script for reproducing that, I would be quite interested.
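
For illustration, mask-aware mean pooling looks roughly like this (a sketch, not the exact Pooling module code); padded positions have attention_mask == 0 and so contribute nothing to the mean:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions via the attention mask."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)      # padded positions are zeroed out
    counts = mask.sum(dim=1).clamp(min=1e-9)           # number of real tokens per example
    return summed / counts
```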

  • Tom Aarsen

@achibb (Author) commented Apr 3, 2024

Hi Tom - sure thing, I will do some tests and share them. (Generally speaking, I experienced about a 2x increase, but I will run a proper test.)
Will follow up.
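
For reference, a minimal sketch of the kind of check that would show whether padding changes the result: encode a short sentence alone, then again next to a much longer one, and compare (the model name is just a placeholder):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model for illustration

short_text = "A short sentence."
long_text = "A much longer sentence that forces heavy padding for the short one. " * 20

emb_alone = model.encode([short_text])[0]
emb_batched = model.encode([short_text, long_text], batch_size=2)[0]

# If padding tokens are truly ignored, this difference should be ~0.
print("max abs diff:", np.max(np.abs(emb_alone - emb_batched)))
```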
