
Token indices sequence length is longer than the specified maximum sequence length for this model. #6939

Closed
faisal-hearth opened this issue Feb 5, 2021 · 5 comments
Labels
feat / transformer Feature: Transformer resolved The issue was addressed / answered

Comments

@faisal-hearth

How to reproduce the behaviour

This problem arose when I switched to spaCy 3.0 and started using 'en_core_web_trf' instead of 'en_core_web_lg'.

import spacy

# Here `text` is 1031 tokens long
nlp = spacy.load('en_core_web_trf')
doc = nlp(text)

Executing this results in the error below:
Token indices sequence length is longer than the specified maximum sequence length for this model (1313 > 512). Running this sequence through the model will result in indexing errors

The problem seems to be in the transformer pipeline component.

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.9.1
  • spaCy Version Used: 3.0.1
  • Environment Information: en_core_web_trf 3.0.0
@svlandeg svlandeg added the feat / transformer Feature: Transformer label Feb 5, 2021
@adrianeboyd
Contributor

Hi, your script doesn't stop and you still end up with a processed doc, though, right? I think this is just a warning (as opposed to an error that stops further processing) from the transformers tokenizer. Internally, spacy-transformers runs the tokenizer and then truncates sequences that are too long so that the actual processing with the transformer model proceeds without errors.

Obviously the final annotation can be affected by the fact that some sequences are truncated, but this allows arbitrary texts to be processed without errors and then aligned back with the spacy tokens.
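
For example (a minimal sketch, assuming en_core_web_trf is installed; the repeated sentence is just a stand-in for real long input):

import spacy

# A text long enough to exceed the 512-token window may emit the
# warning, but processing still completes and returns a normal Doc.
nlp = spacy.load("en_core_web_trf")
doc = nlp("This is a sentence. " * 400)
print(len(doc))     # the Doc exists, aligned with the spaCy tokens
print(doc[0].tag_)  # annotations are set as usual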

There can also be some issues with some model types that don't provide their model_max_length correctly, but that shouldn't be an issue for roberta-base in en_core_web_trf.
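
You can check the limit directly with the transformers tokenizer (a quick sketch, assuming transformers is installed):

from transformers import AutoTokenizer

# roberta-base, the model inside en_core_web_trf, reports its
# maximum sequence length correctly.
tok = AutoTokenizer.from_pretrained("roberta-base")
print(tok.model_max_length)  # 512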

@svlandeg svlandeg added the more-info-needed This issue needs more information label Feb 5, 2021
@faisal-hearth
Author

Yes, the script is still running.
Thanks for clarifying.

@no-response no-response bot removed the more-info-needed This issue needs more information label Feb 5, 2021
@svlandeg svlandeg added the resolved The issue was addressed / answered label Feb 6, 2021
@github-actions
Contributor

This issue has been automatically closed because it was answered and there was no follow-up discussion.

@baivabdash

Hi, I have a follow-up question. I am seeing this 512-token warning while using a distilbert model, but my script runs fairly well and the model reaches a decent score. So my question is: is there a chance that just resolving this warning gives me a boost in score? It would be very helpful if you could explain in detail.
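
A rough sketch of what I mean by resolving it, i.e. pre-splitting texts so nothing gets truncated (splitting on paragraphs is just an illustrative choice, and the loaded pipeline is a stand-in for my distilbert one):

import spacy

nlp = spacy.load("en_core_web_trf")  # stand-in for my distilbert pipeline

def docs_from_parts(text):
    # Feed shorter pieces so each one is more likely to fit in the
    # 512-token window and nothing gets truncated.
    parts = [p for p in text.split("\n\n") if p.strip()]
    return list(nlp.pipe(parts))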

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2021