Tokenization regression #3449
Comments
Thanks for the report - I'll have a look!
Ok, I looked into it. Just a quick question: which version of spaCy are you using now, and was this ever working in a previous version? It does look like this is the intended default behaviour, though. https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py#L30-L31 defines a suffix rule for when one lower-case character comes before a dot, and https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py#L33 for when two upper-case characters are followed by a dot. It looks virtually impossible to change these default rules to accommodate your use case of a single upper-case letter, because it would break other cases such as "Tonight I saw A. Douglas." or "I live in the U.S. since 2 years." And it feels like you would not often have grammatically correct English sentences with a bare "I" directly before a full stop. It's virtually impossible to cover all use cases entirely correctly out of the box, but you can always write your own custom tokenizer to work around specific issues in your dataset, of course!
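For reference, this is roughly how those suffix rules can be extended in user code if someone does want a dataset-specific rule. It is a minimal sketch: the extra regex is hypothetical, it assumes the small English model is installed, and (as noted above) such a rule would also split legitimate initials like "A. Douglas".

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Default behaviour: no suffix rule matches a single upper-case letter
# followed by a dot, so "I." survives as one token.
print([t.text for t in nlp("You and I.")])

# Hypothetical extra suffix rule: strip the dot off a single upper-case letter.
# As discussed above, this would also (wrongly) split initials like "A. Douglas".
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=\b[A-Z])\."]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
print([t.text for t in nlp("You and I.")])
```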
Thanks for taking a look into this. This behavior was with spaCy 2.0.18. I had filed a previous bug (#1266) for 1.9, and @honnibal replied a few months later that it had been fixed, leading me to believe that it is worth fixing. I also agree that this case is fairly unlikely, given that it usually arises with ungrammatical sentences. However, a lot of text is ungrammatical. My 12-year-old twins, for instance, continually use I/me incorrectly.

I hear you with regard to the rules. However, with 2.1, where the tokenization is no longer rule-based, it seems the cue that "I." is a noun would hint that it should be "I" PRON and "." PUNCT. In your example, "Tonight I saw A. Douglas.", spaCy correctly finds "A." PROPN, "Douglas" PROPN, and "." PUNCT, leading me to believe that would work. Maybe I'm barking up the wrong tree, but it seems worth considering. To that end, perhaps this is actually fixed in 2.1?!

With 2.1, writing my own tokenizer to fix this is finally possible. Before, it was impossible to split a token. Cheers to 2.1!
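In case it helps anyone else, here is roughly what splitting such a token with the v2.1 retokenizer looks like. This is a minimal sketch, assuming the small English model and a made-up input sentence; the head assignments are illustrative only.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("That is something you and I. have discussed.")  # made-up example input

# Collect the offending tokens first, then split them with the v2.1 retokenizer.
# All changes are applied when the context manager exits.
targets = [t for t in doc if t.text == "I."]
with doc.retokenize() as retokenizer:
    for token in targets:
        # "I" keeps the original token's head; "." attaches to the new "I".
        # (If "I." happened to be the sentence root, its head would instead
        # be given as the subtoken reference (token, 0).)
        retokenizer.split(token, ["I", "."], heads=[token.head, (token, 0)])

print([t.text for t in doc])
```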
Cool, I think a custom splitting rule definitely sounds like a good solution here. I think @honnibal's answer on that other thread was probably mostly referring to the improved parse in v2.0, which would in turn lead to better sentence segmentation.
Tokenization in 2.1 is still rule-based – we just improved the way the rules are processed under the hood.
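A lightweight way to implement such a custom splitting rule, without swapping out the whole tokenizer, is a tokenizer exception. This is a sketch, assuming the small English model is installed; as discussed above, it would also split genuine initials written as "I.".

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Tokenizer exception: whenever the exact chunk "I." is seen, emit two tokens.
# Caveat (see the discussion above): this also splits initials such as "I. M. Pei".
nlp.tokenizer.add_special_case("I.", [{ORTH: "I"}, {ORTH: "."}])

print([t.text for t in nlp("You and I. We went home.")])
```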
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
Each sentence in #1266 (comment) is tokenized with I. as one token.
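The linked sentences are not reproduced here, but a self-contained check along these lines shows the reported behaviour. The sentence is made up, and the snippet assumes spaCy 2.0.18 with the small English model installed.

```python
import spacy

print(spacy.__version__)  # 2.0.18 at the time of this report
nlp = spacy.load("en_core_web_sm")

# Made-up sentence in the spirit of the ones linked in #1266:
doc = nlp("That is something you and I. have talked about.")
print([t.text for t in doc])
# Reported behaviour: "I." comes out as a single token instead of "I" + ".".
```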
Your Environment