
Tokenization regression #3449

Closed
christian-storm opened this issue Mar 20, 2019 · 5 comments
Labels
feat / tokenizer (Feature: Tokenizer) · perf / accuracy (Performance: accuracy)

Comments

@christian-storm

How to reproduce the behaviour

Each sentence in #1266 (comment) is tokenized with "I." as one token.
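The exact sentences are in #1266; as a minimal reproduction sketch (the example sentence below is made up, not one of the originals, and spacy.blank is used since only the tokenizer is involved):

```python
import spacy

# Blank English pipeline: the suffix rules in question are part of the
# language defaults, so no statistical model is needed to reproduce this.
nlp = spacy.blank("en")

# Hypothetical sentence of the same shape as the ones in #1266.
doc = nlp("The pizza was eaten by my brother and I.")
print([t.text for t in doc])
# With 2.0.18 the final token comes out as 'I.' rather than 'I' and '.'.
```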

Your Environment

  • spaCy version: 2.0.18
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.7.2
@svlandeg
Member

Thanks for the report - I'll have a look!

@ines added the feat / tokenizer (Feature: Tokenizer) and perf / accuracy (Performance: accuracy) labels on Mar 21, 2019
@svlandeg
Member

OK, I looked into it. Just a quick question: which version of spaCy are you using now, and did this ever work in a previous version?

It does look like this is the intended default behaviour, though. https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py#L30-L31 defines a suffix rule for when one lower-case character comes before a dot, and https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py#L33 for when two upper-case characters are followed by a dot.

It looks virtually impossible to change these default rules to accommodate your use case for one upper-case letter, because it would break other cases such as "Tonight I saw A. Douglas." or "I live in the U.S. since 2 years." And it feels like you would not often have grammatically correct English sentences with "I" at the end?

It's virtually impossible to cover all use cases entirely correctly out of the box, but you can always write your own custom tokenizer to work around specific issues in your dataset, of course!
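For reference, a minimal sketch of that kind of workaround, assuming the trade-off is acceptable: one extra suffix rule that splits a "." off a standalone "I". The regex below is my own illustration, not one of spaCy's defaults, and it would also split initials such as the "I." in "I. Asimov".

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Extra rule: split a "." that directly follows a standalone capital "I".
# The \b keeps the rule from firing inside longer tokens, which stay on the
# default suffix rules ("U.S.", "WWI.", etc.).
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=\bI)\."]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

print([t.text for t in nlp("My brother and I.")])
# Expected with the extra rule: ['My', 'brother', 'and', 'I', '.']
```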

@christian-storm
Author

Thanks for taking a look at this. This behavior was with spaCy 2.0.18. I had filed a previous bug (#1266) for 1.9, and @honnibal replied a few months later that it had been fixed, leading me to believe that it is worth fixing. I also agree that this case is fairly unlikely, given that it usually arises in ungrammatical sentences. However, a lot of text is ungrammatical; my 12-year-old twins, for instance, continually use I/me incorrectly.

I hear you with regard to the rules. However, with 2.1, where the tokenization is no longer rule-based, it seems the cue that "I." is a noun would hint that it should be "I" PRON and "." PUNCT. In your example, "Tonight I saw A. Douglas.", spaCy correctly finds "A." PROPN, "Douglas" PROPN, "." PUNCT, leading me to believe that would work. Maybe I'm barking up the wrong tree, but it seems worth considering. To that end, perhaps this is actually fixed in 2.1?!

With 2.1, writing my own tokenizer to fix this is finally possible; before, it was impossible to split a token. Cheers to 2.1!
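For completeness, a sketch of what that kind of post-processing split could look like with the v2.1 retokenizer; the component name, the example model, and the head choices are illustrative assumptions rather than an official recipe.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes an English model with a parser

def split_trailing_i(doc):
    """Split any 'I.' token into 'I' and '.' (possible since spaCy v2.1)."""
    with doc.retokenize() as retokenizer:
        for token in doc:
            if token.text == "I.":
                # "I" keeps the original token's head (or becomes its own head
                # if the token was the root); "." attaches to the new "I".
                head = token.head if token.head.i != token.i else (token, 0)
                retokenizer.split(token, ["I", "."], heads=[head, (token, 0)])
    return doc

# v2-style add_pipe with a plain function; run after the parser.
nlp.add_pipe(split_trailing_i, name="split_trailing_i", last=True)
print([t.text for t in nlp("My brother and I.")])
```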

@ines
Member

ines commented Mar 30, 2019

Cool, I think a custom splitting rule definitely sounds like a good solution here. I think @honnibal's answer on that other thread was probably mostly referring to the improved parsing in v2.0, which would in turn lead to better sentence segmentation.

"However, with 2.1, where the tokenization is no longer rule-based"

Tokenization in 2.1 is still rule-based – we just improved the way the rules are processed under the hood.
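A quick way to confirm that from the Python side: the default English suffix patterns are still inspectable, including the dot rules linked above (a sketch only; the exact patterns vary by version).

```python
import spacy

nlp = spacy.blank("en")
# Print the default suffix patterns that mention a literal dot; these are the
# rules that decide where a trailing "." gets split off.
for pattern in nlp.Defaults.suffixes:
    if r"\." in pattern:
        print(pattern)
```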

@ines closed this as completed on Mar 30, 2019
svlandeg added a commit to svlandeg/spaCy that referenced this issue Apr 2, 2019
@lock
Copy link

lock bot commented Apr 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Apr 29, 2019