NLP silently hangs indefinitely on strings with many periods #957

Closed
catalytic1618 opened this issue Apr 5, 2017 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

catalytic1618 commented Apr 5, 2017

The following code block induces a hung state in the NLP module. This is text from a real-world corpus; similar errors happen on long Java namespaces (e.g. org.apache.x.y.z).

import spacy

nlp = spacy.load("en_core_web_sm")
nlp("0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48")

Our inspection suggests the tokenizer is thrashing, perhaps owing to a regex of exploding complexity. We've routed around the damage with a context manager that uses signal.SIGALRM to time out if spaCy takes too long, but this issue caused much confusion, since seemingly simple jobs were running for extended periods of time.
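
For readers who want a similar guard, here is a minimal sketch of a SIGALRM-based timeout context manager. It is not the reporter's actual code: the names `ParseTimeout`, `time_limit` and `run_with_limit` are hypothetical, and the approach assumes a Unix system and code running in the main thread. The handler only fires once the interpreter gets a chance to process signals, so how promptly it interrupts a given C-level call may vary.

```python
import signal
from contextlib import contextmanager


class ParseTimeout(Exception):
    """Raised when the guarded block runs longer than allowed (hypothetical name)."""


@contextmanager
def time_limit(seconds):
    """Raise ParseTimeout if the body of the `with` block exceeds `seconds`."""

    def _on_alarm(signum, frame):
        raise ParseTimeout(f"timed out after {seconds}s")

    # Install our handler and remember the previous one so it can be restored.
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


def run_with_limit(func, arg, seconds):
    """Return func(arg), or None if it does not finish within `seconds`."""
    try:
        with time_limit(seconds):
            return func(arg)
    except ParseTimeout:
        return None
```

With a loaded pipeline, a call such as `run_with_limit(nlp, text, 5)` would return the parsed document or `None` on timeout, which makes slow inputs visible instead of leaving the job silently hung.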

honnibal (Member) commented Apr 5, 2017

Thanks!

I think this is the fancy new URL matching. Likely temporary mitigation if I'm right: nlp.tokenizer.rule_match = None

honnibal added the bug label Apr 5, 2017

honnibal (Member) commented Apr 7, 2017

Your analysis was correct: the problem was indeed in the URL matching expression introduced in pull request #879 to address issue #840.

The fix turns out to be very simple, and a bit interesting: all I had to do was switch to @mrabarnett's regex module. So I guess this is a nice example of pathological backtracking behaviour "in the wild".
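
To illustrate what pathological backtracking looks like, here is a small sketch using Python's built-in `re` module. The pattern `(a+)+$` is a toy with nested quantifiers, chosen as an assumption for demonstration; it is not the URL expression from the tokenizer. The helper `time_failed_match` is a hypothetical name. On a failing input, the time taken tends to grow roughly exponentially with the length of the run of `a` characters.

```python
import re
import time

# Toy pattern with nested quantifiers; NOT the URL expression from the tokenizer.
NESTED = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match against 'a' * n + 'b' and return (matched, elapsed_seconds)."""
    text = "a" * n + "b"  # the trailing 'b' forces the match to fail
    start = time.perf_counter()
    matched = NESTED.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in (14, 16, 18, 20):
        matched, seconds = time_failed_match(n)
        print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

The match never succeeds, yet the engine tries every way of splitting the run of `a` characters between the inner and outer groups before giving up, which is the kind of behaviour a long dotted string can trigger in a less obviously nested expression.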

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018