
Add contractions for won't #952

Closed
wants to merge 1 commit into from

Conversation

@kinow (Contributor) commented Apr 3, 2017

Hi, I tried implementing the fix for this issue, but alas it did not work in my local environment.

The issue is that Won't and won't return the lemmas wo and nt, whereas I would expect to get will not.

I saw a similar issue for Let's -> Let us, but that also didn't work for me. Anyway, feel free to close this and implement it in a better way if necessary. Is there any documentation on how to test tokenizer exceptions before submitting pull requests?
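
A self-contained check along those lines might look something like this (just a sketch against the spaCy 1.x spacy.en API, not part of this PR):

from spacy.en import English

def test_wont_contraction():
    # "Won't" should split into "Wo" + "n't" with lemmas "will" and "not"
    nlp = English()
    tokens = nlp("Won't you need it?")
    assert [t.text for t in tokens[:2]] == ["Wo", "n't"]
    assert [t.lemma_ for t in tokens[:2]] == ["will", "not"]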

Cheers
Bruno

Types of changes

  • [x] Bug fix (non-breaking change fixing an issue)
  • [ ] New feature (non-breaking change adding functionality to spaCy)
  • [ ] Breaking change (fix or feature causing change to spaCy's existing functionality)
  • [ ] Documentation (addition to documentation of spaCy)

Checklist:

  • [ ] My change requires a change to spaCy's documentation.
  • [ ] I have updated the documentation accordingly.
  • [ ] I have added tests to cover my changes.
  • [x] All new and existing tests passed.

@ines (Member) commented Apr 3, 2017

Thanks for giving this a go!

I just had a look at this problem, and it turned out to be caused by a missing POS tag in the tokenizer exceptions. The exceptions for the verbs are already handled here, and they did have the correct lemma. But because they were missing a TAG, the lemma was later overwritten by the lemmatizer.

I added the missing tags and it should work properly now!
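
For illustration, such an exception entry with the TAG included has roughly the following shape (a sketch only; the exact layout and tag values in the English language data may differ):

# Sketch of tokenizer exceptions for "won't" that set TAG as well as LEMMA,
# so the lemmatizer no longer overwrites the special-cased lemma.
# Attribute names are the ones exposed by spacy.symbols.
from spacy.symbols import ORTH, LEMMA, TAG

TOKENIZER_EXCEPTIONS = {
    "won't": [
        {ORTH: "wo", LEMMA: "will", TAG: "MD"},
        {ORTH: "n't", LEMMA: "not", TAG: "RB"},
    ],
    "Won't": [
        {ORTH: "Wo", LEMMA: "will", TAG: "MD"},
        {ORTH: "n't", LEMMA: "not", TAG: "RB"},
    ],
}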

@ines closed this Apr 3, 2017

@kinow (Contributor, Author) commented Apr 3, 2017

Thanks @ines.

I couldn't understand why it was not working and started looking at the Tokenizer class. Thanks for pointing out where I should have been looking :-)

Will give it a try today.
Bruno

@kinow (Contributor, Author) commented Apr 4, 2017

Arrived home, checked out the latest version

git log -n 1
commit 808cd6cf7f184e20d9b8e42364f7e10f045028dc
Author: ines <ines@ines.io>
Date:   Mon Apr 3 18:12:52 2017 +0200

    Add missing tags to verbs (resolves #948)

Then I ran python setup.py build && python setup.py install and executed a test script.

from spacy.en import English

nlp = English()

text = "Don't you use NLP? Won't you need it? Let's use it !!!"

tokens = nlp(text)
for token in tokens:
    # print text, lemma and coarse-grained POS for each token
    print("%s - %s - %s" % (token.text, token.lemma_, token.pos_))

Got:

Do - do - 
n't - not - ADV
you -  - 
use -  - 
NLP -  - 
? -  - 
Wo - will - VERB
n't - not - ADV
you -  - 
need -  - 
it -  - 
? -  - 
Let - let - 
's - -PRON- - 
use -  - 
it -  - 
! -  - 
! -  - 
! -  -

Then I downloaded the models with python3 -m spacy.en.download --force all and re-executed the code:

Do - do - VERB
n't - not - ADV
you - -PRON- - PRON <---- ??? why is the lemma here not you?
use - use - VERB
NLP - nlp - PROPN
? - ? - PUNCT
Wo - will - VERB
n't - not - ADV
you - -PRON- - PRON <--- ditto
need - need - VERB
it - -PRON- - PRON <--- ditto
? - ? - PUNCT
Let - let - VERB
's - 's - PRON <----- why not us here?
use - use - VERB
it - -PRON- - PRON
! - ! - PUNCT
! - ! - PUNCT
! - ! - PUNCT

Am I missing anything? Should I have run something else, or is there something wrong with my code?

@f11r (Contributor) commented Apr 4, 2017

Regarding lemmatisation to -PRON- see: #906 and #898 (comment).

@kinow (Contributor, Author) commented Apr 7, 2017

Thanks @f11r

I am working on an application that needs token values to be matched against a dictionary. When I have contractions like "Let's", I need the values let and us to match against a list of words.

For "Let's", the lemmas will have what I am looking for (i.e. [let, us]). But for "it", if I use the lemma, then I will get -PRON-, if I understand it correctly.

In this case I would either have to think of another strategy, or maybe always use the lemma and, if I find a word surrounded by dashes, fall back to using the text.
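
A minimal sketch of that fallback, assuming a spaCy Doc and the -PRON- placeholder these models use for pronoun lemmas:

def lemma_or_text(token):
    # Prefer the lemma, but fall back to the lowercased text when the
    # lemma is the "-PRON-" placeholder.
    if token.lemma_ == "-PRON-":
        return token.lower_
    return token.lemma_

# e.g. [lemma_or_text(t) for t in nlp("Won't you need it?")]
# -> ["will", "not", "you", "need", "it", "?"]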

Labels
lang / en English language data and models

3 participants