
Token indices sequence length is longer than the specified maximum sequence length for this model. #6939

Closed
faisal-hearth opened this issue Feb 5, 2021 · 5 comments
Labels
feat / transformer Feature: Transformer resolved The issue was addressed / answered

Comments

@faisal-hearth

How to reproduce the behaviour

This problem arose when I switched to spaCy 3.0 and started using 'en_core_web_trf' instead of 'en_core_web_lg'.

import spacy

# Here `text` is 1031 tokens long
nlp = spacy.load('en_core_web_trf')
doc = nlp(text)

Executing this results in the error below:
Token indices sequence length is longer than the specified maximum sequence length for this model (1313 > 512). Running this sequence through the model will result in indexing errors

The problem seems to be in the transformer pipeline component.

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.9.1
  • spaCy Version Used: 3.0.1
  • Environment Information: en_core_web_trf 3.0.0
@svlandeg svlandeg added the feat / transformer Feature: Transformer label Feb 5, 2021
@adrianeboyd
Contributor

Hi, your script doesn't stop and you still end up with a processed doc, though, right? I think this is just a warning (as opposed to an error that stops further processing) from the transformers tokenizer. Internally, spacy-transformers runs the tokenizer and then truncates sequences that are too long so that the actual processing with the transformer model proceeds without errors.

Obviously the final annotation can be affected by the fact that some sequences are truncated, but this allows arbitrary texts to be processed without errors and then aligned back with the spacy tokens.
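
For example (a minimal sketch, assuming en_core_web_trf is installed; the repeated sentence is just a stand-in for real long input):

import spacy

# A text long enough to exceed the 512-token window may emit the
# warning, but processing still completes and returns a normal Doc.
nlp = spacy.load("en_core_web_trf")
doc = nlp("This is a sentence. " * 400)
print(len(doc))     # the Doc exists, aligned with the spaCy tokens
print(doc[0].tag_)  # annotations are set as usual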

There can also be some issues with some model types that don't provide their model_max_length correctly, but that shouldn't be an issue for roberta-base in en_core_web_trf.
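
You can check the limit directly with the transformers tokenizer (a quick sketch, assuming transformers is installed):

from transformers import AutoTokenizer

# roberta-base, the model inside en_core_web_trf, reports its
# maximum sequence length correctly.
tok = AutoTokenizer.from_pretrained("roberta-base")
print(tok.model_max_length)  # 512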

@svlandeg svlandeg added the more-info-needed This issue needs more information label Feb 5, 2021
@faisal-hearth
Author

Yes, the script is still running.
Thanks for clarifying.

@no-response no-response bot removed the more-info-needed This issue needs more information label Feb 5, 2021
@svlandeg svlandeg added the resolved The issue was addressed / answered label Feb 6, 2021
@github-actions
Contributor

This issue has been automatically closed because it was answered and there was no follow-up discussion.

@baivabdash

Hi, I have a follow-up question. I am seeing this 512-token warning while using a distilbert model, but my script runs fairly well and the model reaches a decent score. So my question is: is there a chance that just resolving this warning gives me a boost in score? It would be very helpful if you could explain in detail.
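
A rough sketch of what I mean by resolving it, i.e. pre-splitting texts so nothing gets truncated (splitting on paragraphs is just an illustrative choice, and the loaded pipeline is a stand-in for my distilbert one):

import spacy

nlp = spacy.load("en_core_web_trf")  # stand-in for my distilbert pipeline

def docs_from_parts(text):
    # Feed shorter pieces so each one is more likely to fit in the
    # 512-token window and nothing gets truncated.
    parts = [p for p in text.split("\n\n") if p.strip()]
    return list(nlp.pipe(parts))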

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2021