Dependencies not deprojectivized in spaCy 1.7 #898

Closed
adam-ra opened this issue Mar 21, 2017 · 11 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@adam-ra

adam-ra commented Mar 21, 2017

I've noticed a suspiciously large number of evident parser errors after migrating from spaCy 1.6.0 with the generic ‘en’ model to 1.7.2 with ‘en_depent_web_md’.

Environment: Python 3.4.3 / 3.5.1 on 64-bit Linux (Ubuntu).

Some examples below (please note the abundance of proper noun tags).

1.6.0: pain in lower back

 0) ROOT    pain                    NOUN  NN    ROOT
 1) prep   ---- in                  ADP   IN    prep
 3) pobj   -------- back            NOUN  NN    pobj
 2) amod   ------------ low         ADJ   JJR   amod

1.7.2: pain in lower back

 3) ROOT    back                    PROPN NNP   ROOT
 0) nsubj  ---- pain                NOUN  NN    nsubj
 1) nmod   ---- in                  X     XX    nmod
 2) compound ---- lower             PROPN NNP   compound

1.6.0: I feel pain in lower back

 1) ROOT    feel                    VERB  VBP   ROOT
 0) nsubj  ---- i                   PRON  PRP   nsubj
 2) dobj   ---- pain                NOUN  NN    dobj
 3) prep   -------- in              ADP   IN    prep
 5) pobj   ------------ back        NOUN  NN    pobj
 4) amod   ---------------- low     ADJ   JJR   amod

1.7.2: I feel pain in lower back

 1) ROOT    feel                    VERB  VBP   ROOT
 0) nsubj  ---- i                   NOUN  NN    nsubj
 2) dobj   ---- pain                NOUN  NN    dobj
 3) prep   -------- in              ADP   IN    prep
 5) pobj   ------------ back        PROPN NNP   pobj
 4) compound ---------------- lower PROPN NNP   compound

1.6.0: sores on my dick

 0) ROOT    sore                    NOUN  NNS   ROOT
 1) prep   ---- on                  ADP   IN    prep
 3) pobj   -------- dick            NOUN  NN    pobj
 2) poss   ------------ my          ADJ   PRP$  poss

1.7.2: sores on my dick

 0) ROOT    sore                    NOUN  NN    ROOT
 1) prep   ---- on                  ADP   IN    prep
 3) pobj   -------- dick            PROPN NNP   pobj
 2) compound ------------ my        PROPN NNP   compound
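
For reference, listings in roughly this format can be produced by walking the tree depth-first from the root and indenting each token by its depth. The following is only a sketch of that idea: `Tok` and `render_tree` are hypothetical names rather than spaCy classes, and the column widths are approximate. It assumes the root token is marked by pointing at itself as its own head.

```python
from typing import Dict, List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""
    i: int       # position in the sentence
    lemma: str   # lemma shown in the listing
    head: int    # index of the syntactic head; the root points to itself
    dep: str     # dependency label
    pos: str     # coarse part-of-speech tag
    tag: str     # fine-grained tag


def render_tree(tokens: List[Tok]) -> List[str]:
    """Return one line per token, depth-first from the root, indented by depth."""
    children: Dict[int, List[Tok]] = {t.i: [] for t in tokens}
    root = None
    for t in tokens:
        if t.head == t.i:
            root = t
        else:
            children[t.head].append(t)

    lines: List[str] = []

    def visit(tok: Tok, depth: int) -> None:
        branch = "----" * depth  # four dashes per level of depth
        lines.append(
            f"{tok.i:2d}) {tok.dep:<6} {branch} {tok.lemma:<12} "
            f"{tok.pos:<5} {tok.tag:<5} {tok.dep}"
        )
        for child in children[tok.i]:
            visit(child, depth + 1)

    if root is not None:
        visit(root, 0)
    return lines


if __name__ == "__main__":
    # The 1.6.0 analysis of "pain in lower back" from the listing above.
    example = [
        Tok(0, "pain", 0, "ROOT", "NOUN", "NN"),
        Tok(1, "in", 0, "prep", "ADP", "IN"),
        Tok(2, "low", 3, "amod", "ADJ", "JJR"),
        Tok(3, "back", 1, "pobj", "NOUN", "NN"),
    ]
    print("\n".join(render_tree(example)))
```

Because tokens are emitted in tree order rather than sentence order, the index column is not monotonic, which matches the listings above (0, 1, 3, 2).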
@honnibal
Member

Literally just pushed a fix to this. Could you try redownloading? It should give you en_depent_web_md-1.2.1, which should have this fixed.

@honnibal honnibal added the models label (Issues related to the statistical models) Mar 21, 2017
@adam-ra
Author

adam-ra commented Mar 21, 2017

Wow, what a coincidence :)
It keeps downloading 1.2.0… not sure if it's the new or old one, will let you know

@adam-ra
Author

adam-ra commented Mar 21, 2017

1.2.0 behaves as the old one (perhaps it is the old one?) while:

python -m spacy download en_depent_web_md-1.2.1

    Compatibility error

    No compatible model found for en_depent_web_md-1.2.1 (spaCy v1.7.2).

@honnibal
Member

honnibal commented Mar 21, 2017

1.2.0 is the old one -- 1.2.1 is the fix. Sorry, missed an entry in our compatibility table (for future reference: https://github.com/explosion/spacy-models/blob/master/compatibility.json )

Try now?

@adam-ra
Author

adam-ra commented Mar 21, 2017

It took me some time to realise that I needed to request en_depent_web_md without the version suffix.
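
A plausible reading of the earlier "Compatibility error" is that the downloader looks the model up by its bare name in the compatibility table, so a name that already carries a version suffix never matches. The sketch below illustrates that kind of lookup. The table layout, the listed versions and the `resolve_model` helper are assumptions made for illustration, not the actual contents of compatibility.json or spaCy's code.

```python
from typing import Dict, List, Optional

# Hypothetical excerpt: spaCy version -> model name -> compatible model versions,
# assumed to be listed newest first. The real file may be laid out differently.
COMPATIBILITY: Dict[str, Dict[str, List[str]]] = {
    "1.7.2": {"en_depent_web_md": ["1.2.1", "1.2.0"]},
}


def resolve_model(
    name: str,
    spacy_version: str,
    table: Dict[str, Dict[str, List[str]]] = COMPATIBILITY,
) -> Optional[str]:
    """Return '<name>-<version>' for the first listed version, or None if unknown."""
    versions = table.get(spacy_version, {}).get(name)
    if not versions:
        return None  # corresponds to a "no compatible model found" error
    return f"{name}-{versions[0]}"


if __name__ == "__main__":
    print(resolve_model("en_depent_web_md", "1.7.2"))
    print(resolve_model("en_depent_web_md-1.2.1", "1.7.2"))
```

Under these assumptions the bare name resolves to a concrete package, while the suffixed name is simply not a key in the table.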

The main issue — superfluous proper noun tags — seems gone now! Thanks for the quick reaction.

There is, however, another bug: a broken pronoun lemma.
I feel pain in lower back:

 1) ROOT    feel                    VERB  VBP   ROOT
 0) nsubj  ---- -PRON-              PRON  PRP   nsubj
 2) dobj   ---- pain                NOUN  NN    dobj
 3) prep   -------- in              ADP   IN    prep
 4) advmod ---- low                 ADJ   JJR   advmod
 5) prt    -------- back            ADV   RB    prt

Also, here are some unexpected dependency links, but perhaps these are within the expected margin of error.

pain in lower back: “back” as an adverbial particle, lower as phrase head :(

 0) ROOT    pain                    NOUN  NN    ROOT
 1) prep   ---- in                  ADP   IN    prep
 2) pobj   -------- low             ADJ   JJR   pobj
 3) prt    ------------ back        ADV   RB    prt

BTW, the downloader has a somewhat unfriendly symlinking policy: it fails if the link target already exists.
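
One possible workaround on POSIX systems is to drop a stale symlink before creating the new one. This is a minimal sketch, not the downloader's code. `force_symlink` is a hypothetical helper, and it deliberately refuses to touch a real file or directory at the link path.

```python
import os
import tempfile


def force_symlink(target: str, link_path: str) -> None:
    """Create link_path -> target, replacing a stale symlink if one exists."""
    if os.path.islink(link_path):
        # Remove only an existing symlink. A real file or directory is left
        # alone, so os.symlink below still raises FileExistsError for it.
        os.remove(link_path)
    os.symlink(target, link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old_dir = os.path.join(tmp, "model-old")
        new_dir = os.path.join(tmp, "model-new")
        os.mkdir(old_dir)
        os.mkdir(new_dir)
        link = os.path.join(tmp, "en")
        force_symlink(old_dir, link)
        force_symlink(new_dir, link)  # does not fail on the existing link
        print(os.readlink(link))
```

The remove-then-create sequence is not atomic. If that matters, creating the link under a temporary name and renaming it over the old one is a common alternative.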

@honnibal
Member

honnibal commented Mar 21, 2017

The -PRON- lemma is expected behaviour --- it was actually a bug that caused pronouns to not be lemmatised in the previous version.
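
For callers who need distinct pronoun forms, one possible workaround is to fall back to the lowercased surface form whenever the placeholder lemma appears. This is only a sketch over plain strings. `lemma_or_surface` is a hypothetical helper, not part of spaCy.

```python
PRON_LEMMA = "-PRON-"  # placeholder lemma reported in this thread


def lemma_or_surface(text: str, lemma: str) -> str:
    """Return the lemma, or the lowercased surface form for the pronoun placeholder."""
    return text.lower() if lemma == PRON_LEMMA else lemma


if __name__ == "__main__":
    pairs = [("I", "-PRON-"), ("feel", "feel"), ("sores", "sore")]
    print([lemma_or_surface(text, lemma) for text, lemma in pairs])
```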

@adam-ra
Author

adam-ra commented Mar 21, 2017

This seems quite a controversial decision to me… I understand at least some of the reasons (for instance, there's no obvious base form for 3rd person personal pronouns), but other than that one would expect a lemma to be a form belonging to the language's vocabulary (unlike stems). Also, this merges personal, possessive and other pronouns into the same lemma, which is not always a good thing.

Out of curiosity: does this decision stem from OntoNotes or is it your idea? I've checked the CLEAR guidelines and it's not part of them. Also, it would be nice if this were added to the annotation docs.

EDIT: sorry, either I missed the part in the docs or you just added it :)

Is there any other special lemma like this?

@adam-ra
Author

adam-ra commented Mar 21, 2017

Sorry for flooding with syntactic details, but chances are the following behaviour was not intended.

Pain in back, headache
 0) ROOT    pain                    PROPN NNP   ROOT
 1) prep   ---- in                  ADP   IN    prep
 2) pobj   -------- back            NOUN  NN    pobj
 3) punct  ---- ,                   PUNCT ,     punct
 4) relcl||pobj ---- headache       NOUN  NN    relcl||pobj

What got me thinking is both the ‘relcl’ label itself (I'd expect ‘rcmod’ if anything) and the x||y syntax.

Also, the new model seems to like appositions a lot more than the old one (some “NP, NP” constructs are labelled as appositions rather than coordinations, but I guess this distinction is semantic/pragmatic, so it is hard to expect a supervised parser to perform well on this task).

@honnibal
Member

The -PRON- lemma was my idea, and yes I agree that it's controversial. I see your argument about it not being within the language, but it seemed to me to be the best solution for pronouns. I should check again what the Universal Dependencies project does.

Thanks for the note about the dependency labels. I think the de-projectiviser isn't running after the parser. That explains the x||y labels.
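
To illustrate what a de-projectivising step might do with such labels, here is a sketch of the head-label pseudo-projective scheme in the style of Nivre and Nilsson. It assumes that `x||y` means "my own relation is x, and my real head is a nearby token labelled y", found by a breadth-first search below the current head. This is an assumption about the encoding, not spaCy's actual implementation, and the function names are hypothetical.

```python
from typing import List, Tuple

DELIMITER = "||"  # separator seen in the decorated label reported above


def find_new_head(tok: int, head_label: str, heads: List[int], labels: List[str]) -> int:
    """Breadth-first search below tok's current head for a token labelled head_label."""
    queue = [heads[tok]]
    while queue:
        next_queue: List[int] = []
        for node in queue:
            for child, head in enumerate(heads):
                # Skip non-children, the root's self-loop and the token itself.
                if head != node or child == node or child == tok:
                    continue
                if labels[child] == head_label:
                    return child
                next_queue.append(child)
        queue = next_queue
    return heads[tok]  # nothing found: keep the current head


def deprojectivize(heads: List[int], labels: List[str]) -> Tuple[List[int], List[str]]:
    """Undo decorated labels: restore the plain label and re-attach the token."""
    new_heads = list(heads)
    new_labels = list(labels)
    for i, label in enumerate(labels):
        if DELIMITER not in label:
            continue
        own_label, head_label = label.split(DELIMITER, 1)
        new_labels[i] = own_label
        new_heads[i] = find_new_head(i, head_label, heads, labels)
    return new_heads, new_labels


if __name__ == "__main__":
    # "Pain in back, headache" as reported above; the root points to itself.
    heads = [0, 0, 1, 0, 0]
    labels = ["ROOT", "prep", "pobj", "punct", "relcl||pobj"]
    print(deprojectivize(heads, labels))
```

Under this reading, "headache" would end up labelled relcl and attached to "back", the nearest token labelled pobj. Whether that attachment is the analysis the model intended is a separate question.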

I'm not sure what the situation is with the appositions.

@honnibal honnibal added the bug (Bugs and behaviour differing from documentation) and performance labels and removed the models (Issues related to the statistical models) label Mar 21, 2017
@honnibal honnibal changed the title from "Poor tagger+parser performance in the new model" to "Dependencies not deprojectivized in spaCy 1.7" Mar 21, 2017
@adam-ra
Author

adam-ra commented Mar 22, 2017

Thanks for the explanations!

I understand the reasoning, and for practical purposes this artificial lemma is actually quite convenient (including in my use case). Anyway, I have a comment loosely related to personal pronoun lemmas. My first language is Polish, a morphologically rich language. Polish adjectives inflect for number, gender and case. In noun phrases, the adjective's gender depends on the gender of the noun that is the syntactic head of the phrase (strictly speaking, adjectives and nouns agree in number, gender and grammatical case). So, aside from picking the nominative case, there is no principled way to decide which adjective form should be used as the lemma. Tradition has it as masculine, singular, nominative, both in old dictionaries and in modern tagsets. A similar situation holds for other Slavic languages (e.g. Slovene, Czech, Croatian).

BTW, in the tagset of the National Corpus of Polish, 3rd person pronouns are also lemmatised to the masculine form (http://nkjp.pl/poliqarp/help/ense2.html#x3-20002, see ppron3). In the MULTEXT-East tagset, the decisions differ (http://nl.ijs.si/ME/V4/msd/html/index.html). There is also an interesting discussion in the Universal Dependencies project you mentioned: UniversalDependencies/docs#276

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018

2 participants