
Tokenization regression #3449

Closed
christian-storm opened this issue Mar 20, 2019 · 5 comments
Labels
feat / tokenizer (Feature: Tokenizer) · perf / accuracy (Performance: accuracy)

Comments

@christian-storm

How to reproduce the behaviour

Each sentence in #1266 (comment) is tokenized with "I." as one token.
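The exact sentences are in #1266; as a minimal reproduction sketch (the example sentence below is made up, not one of the originals, and spacy.blank is used since only the tokenizer is involved):

```python
import spacy

# Blank English pipeline: the suffix rules in question are part of the
# language defaults, so no statistical model is needed to reproduce this.
nlp = spacy.blank("en")

# Hypothetical sentence of the same shape as the ones in #1266.
doc = nlp("The pizza was eaten by my brother and I.")
print([t.text for t in doc])
# With 2.0.18 the final token comes out as 'I.' rather than 'I' and '.'.
```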

Your Environment

  • spaCy version: 2.0.18
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.7.2
@svlandeg
Member

Thanks for the report - I'll have a look!

@ines added the feat / tokenizer (Feature: Tokenizer) and perf / accuracy (Performance: accuracy) labels on Mar 21, 2019
@svlandeg
Member

OK, I looked into it. Just a quick question: which version of spaCy are you using now, and did this ever work in a previous version?

It does look like this is the intended default behaviour, though. https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py#L30-L31 defines a suffix rule for when one lower-case character comes before a dot, and https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py#L33 for when two upper-case characters are followed by a dot.

It looks virtually impossible to change these default rules to accommodate your use case for one upper-case letter, because it would break other cases such as "Tonight I saw A. Douglas." or "I live in the U.S. since 2 years." And it feels like you would not often have grammatically correct English sentences with "I" at the end?

It's virtually impossible to cover all use cases entirely correctly out of the box, but you can always write your own custom tokenizer to work around specific issues in your dataset, of course!
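For reference, a minimal sketch of that kind of workaround, assuming the trade-off is acceptable: one extra suffix rule that splits a "." off a standalone "I". The regex below is my own illustration, not one of spaCy's defaults, and it would also split initials such as the "I." in "I. Asimov".

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Extra rule: split a "." that directly follows a standalone capital "I".
# The \b keeps the rule from firing inside longer tokens, which stay on the
# default suffix rules ("U.S.", "WWI.", etc.).
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=\bI)\."]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

print([t.text for t in nlp("My brother and I.")])
# Expected with the extra rule: ['My', 'brother', 'and', 'I', '.']
```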

@christian-storm
Author

Thanks for taking a look at this. This behavior was with spaCy 2.0.18. I had filed a previous bug (#1266) for 1.9, and @honnibal replied a few months later that it had been fixed, leading me to believe that it is worth fixing. I also agree that this case is fairly unlikely, given that it usually arises in ungrammatical sentences. However, a lot of text is ungrammatical; my 12-year-old twins, for instance, continually use I/me incorrectly.

I hear you with regard to the rules. However, with 2.1, where the tokenization is no longer rule-based, it seems the cue that "I." is a noun would hint that it should be "I" PRON and "." PUNCT. In your example, "Tonight I saw A. Douglas.", spaCy correctly finds "A." PROPN, "Douglas" PROPN, "." PUNCT, leading me to believe that would work. Maybe I'm barking up the wrong tree, but it seems worth considering. To that end, perhaps this is actually fixed in 2.1?!

With 2.1, writing my own tokenizer to fix this is finally possible; before, it was impossible to split a token. Cheers to 2.1!
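For completeness, a sketch of what that kind of post-processing split could look like with the v2.1 retokenizer; the component name, the example model, and the head choices are illustrative assumptions rather than an official recipe.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes an English model with a parser

def split_trailing_i(doc):
    """Split any 'I.' token into 'I' and '.' (possible since spaCy v2.1)."""
    with doc.retokenize() as retokenizer:
        for token in doc:
            if token.text == "I.":
                # "I" keeps the original token's head (or becomes its own head
                # if the token was the root); "." attaches to the new "I".
                head = token.head if token.head.i != token.i else (token, 0)
                retokenizer.split(token, ["I", "."], heads=[head, (token, 0)])
    return doc

# v2-style add_pipe with a plain function; run after the parser.
nlp.add_pipe(split_trailing_i, name="split_trailing_i", last=True)
print([t.text for t in nlp("My brother and I.")])
```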

@ines
Member

ines commented Mar 30, 2019

Cool, I think a custom splitting rule definitely sounds like a good solution here. I think @honnibal's answer on that other thread was probably mostly referring to the improved parsing in v2.0, which would in turn lead to better sentence segmentation.

"However, with 2.1, where the tokenization is no longer rule-based"

Tokenization in 2.1 is still rule-based – we just improved the way the rules are processed under the hood.
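A quick way to confirm that from the Python side: the default English suffix patterns are still inspectable, including the dot rules linked above (a sketch only; the exact patterns vary by version).

```python
import spacy

nlp = spacy.blank("en")
# Print the default suffix patterns that mention a literal dot; these are the
# rules that decide where a trailing "." gets split off.
for pattern in nlp.Defaults.suffixes:
    if r"\." in pattern:
        print(pattern)
```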

@ines closed this as completed on Mar 30, 2019
svlandeg added a commit to svlandeg/spaCy that referenced this issue Apr 2, 2019
@lock
Copy link

lock bot commented Apr 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Apr 29, 2019