Times such as "7pm" tokenized wrong #736
Comments
It appears that the number in the time is somehow being mapped to the ith element from
For example, "8am" becomes
I think the issue is in
When I add this special case to override the existing rule it works:
Thanks, your analysis is definitely correct. Fixing.
Issue fixed and regression test passes! The fix should be included in the next release (coming later today).
Previous versions of spaCy (< 1.6.0) have a bug that can cause issues in parsing numbers (see explosion/spaCy#736). Please update spaCy to the latest version.
There appears to be a bug in how times are tokenized for English.
This produces:
Instead of `IS_TITLE PROPN is_title`, I was expecting `7 NUM 7`, which is what you get if you use `7 pm` instead (with a space in between). I see that `TOKENIZER_EXCEPTIONS` includes a number of exceptions to handle this type of case, so I'm confused why it doesn't work. Also, it seems that the "7" should be preserved instead of being replaced with `IS_TITLE`.
Your Environment
`spacy/data/en-1.1.0` under `site-packages`.
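To make the expected behaviour concrete, here is a minimal sketch in plain Python of what the report asks for: a time written without a space, such as "7pm", should split into the number and the suffix, with the digit preserved, just as "7 pm" does. This is only an illustration and not spaCy's tokenizer or its exception table; the names `TIME_RE`, `split_time`, and `tokenize` are hypothetical.

```python
import re

# Hypothetical rule: one or two digits immediately followed by "am" or "pm".
TIME_RE = re.compile(r"^(\d{1,2})(am|pm)$", re.IGNORECASE)


def split_time(token):
    """Split a fused time such as "7pm" into ["7", "pm"].

    Any token that does not look like a fused time is returned unchanged.
    """
    match = TIME_RE.match(token)
    if match is None:
        return [token]
    # Keep the original digit text rather than mapping it to anything else.
    return [match.group(1), match.group(2)]


def tokenize(text):
    """Whitespace-split the text, then apply the time rule to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_time(chunk))
    return tokens


print(tokenize("Dinner at 7pm"))   # ['Dinner', 'at', '7', 'pm']
print(tokenize("Dinner at 7 pm"))  # ['Dinner', 'at', '7', 'pm']
```

Both spellings yield the same four tokens, which is the behaviour the report expected from the English tokenizer.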