Adding Lemmatizer Exceptions #595

Closed
jirimotejlek opened this issue Oct 31, 2016 · 4 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@jirimotejlek

I tried to edit the exceptions in spacy/data/en-1.1.0/wordnet/verb.exc, but it had no effect: spaCy still returns the old lemma. Is there some other way to fix or add lemmatizer exceptions?

Your Environment

  • Operating System: Ubuntu
  • Python Version Used: 2.7
  • spaCy Version Used: 0.101.0
  • Environment Information:
@honnibal honnibal added the usage General spaCy usage label Oct 31, 2016
@honnibal
Member

honnibal commented Nov 2, 2016

Have you checked that the POS tag is being predicted correctly? The exceptions are POS-keyed, so they won't fire if the POS is incorrect.
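
To illustrate why a wrong POS tag keeps an exception from firing, here is a minimal sketch of a POS-keyed exception lookup. The table contents and the `lookup_exception` helper are hypothetical and only mirror the idea; this is not spaCy's actual implementation or data.

```python
# Hypothetical exception table keyed first by coarse POS, then by word form.
# It only illustrates POS-keyed lookup; it is not spaCy's real data structure.
EXCEPTIONS = {
    "verb": {"fed": "feed", "went": "go"},
    "noun": {"geese": "goose"},
}


def lookup_exception(word, pos):
    """Return the exception lemma for (word, pos), or None if none applies."""
    return EXCEPTIONS.get(pos, {}).get(word.lower())


print(lookup_exception("fed", "verb"))  # feed (found under the verb key)
print(lookup_exception("fed", "noun"))  # None (same word, wrong POS)
```

In this sketch, an exception stored under the verb key is never consulted when the token has been tagged as a noun, so editing the verb exceptions would appear to have no effect.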

@jirimotejlek
Author

Yes, I'm looking for verbs and then for their lemmas. I had a problem with a few words (for example, "Don't feed the dog" returns the lemma "fee" for "feed").
As the past tense of "fee" is very unlikely in my case, I went ahead and removed that exception. But spaCy is still somehow returning "fee".

Is there some cache that needs to be rebuilt after editing the WordNet files?

Here is some example code:

import textacy
from spacy.parts_of_speech import VERB

def get_main_verbs_of_sent(sent):
    """Return the main (non-auxiliary) verbs in a sentence."""
    return [tok for tok in sent
            if tok.pos == VERB and tok.dep_ not in {'aux', 'auxpass'}]

doc_itext = textacy.Doc("Don't feed the dog.", lang=u"en")
for sent in doc_itext.sents:
    itext_verbs = get_main_verbs_of_sent(sent)
    for verb in itext_verbs:
        # Single-argument print() works on both Python 2.7 and Python 3.
        print(verb.text)
        print(verb.pos_)
        print(verb.lemma_)

feed
verb
fee

@honnibal honnibal added bug Bugs and behaviour differing from documentation and removed usage General spaCy usage labels Nov 3, 2016
@honnibal
Member

honnibal commented Nov 3, 2016

Thanks for the report. It turned out to be a deeper problem.

"feed" is being assigned the tag VB, which means it shouldn't be lemmatized at all. The 'VB' tag is supposed to be associated with the morphological feature {verbform: inf}, in the tag map. This wasn't being set on the tokens, due to a bug in the morphological analyser (which is a bit of a mess).

Should be fixed now.
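
As a rough illustration of the failure described above, the sketch below uses a toy tag map in which `VB` carries `{verbform: inf}`, plus a lemmatizer that skips tokens marked as infinitive. Everything here (`TAG_MAP`, `VERB_RULES`, `KNOWN_VERBS`, `lemmatize`, and the `apply_morphology` flag that simulates the bug) is an assumption made for illustration, not spaCy's actual code.

```python
# Hypothetical, simplified tag map: fine-grained tag -> morphological features.
TAG_MAP = {
    "VB": {"pos": "verb", "verbform": "inf"},  # base form, already a lemma
    "VBD": {"pos": "verb", "tense": "past"},
}

# Toy suffix rules and vocabulary, for illustration only.
VERB_RULES = [("ed", "e"), ("ed", "")]
KNOWN_VERBS = {"fee", "feed", "walk"}


def lemmatize(word, tag, apply_morphology=True):
    """Lemmatize a verb; apply_morphology=False simulates unset features."""
    features = TAG_MAP[tag] if apply_morphology else {}
    if features.get("verbform") == "inf":
        # An infinitive is already the base form, so leave it untouched.
        return word
    for suffix, replacement in VERB_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in KNOWN_VERBS:
                return candidate
    return word


print(lemmatize("feed", "VB"))                          # feed
print(lemmatize("feed", "VB", apply_morphology=False))  # fee
print(lemmatize("walked", "VBD"))                       # walk
```

With the features missing, the toy suffix rule that rewrites "ed" to "e" turns "feed" into "fee", which matches the symptom reported in this thread. Once the infinitive feature is set, the token is returned unchanged.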

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018