doc.noun_chunks Sentence Length Bug #693

Closed
shiredude95 opened this issue Dec 20, 2016 · 2 comments

Comments

shiredude95 commented Dec 20, 2016

  • Operating System: Ubuntu 15.04
  • Python Version Used: 3.5.1
  • spaCy Version Used: 1.4.0
  • Environment Information: 64-bit system, 4 GB RAM

doc.noun_chunks doesn't return every noun chunk in the sentence: when two proper-noun phrases are joined by "and", only the first one is returned.

Test1:

from spacy.en import English
nlp = English()
doc = nlp("the TopTown International Airport Board and Goodwill Space Exploration Partnership.")
for chunk in doc.noun_chunks:
    print(chunk)

Produces the output:

the TopTown International Airport Board

But Test2:

from spacy.en import English
nlp = English()
doc = nlp("the Goodwill Space Exploration Partnership and TopTown International Airport Board.")
for chunk in doc.noun_chunks:
    print(chunk)

Produces the output:

the Goodwill Space Exploration Partnership

Both phrases are identified correctly, but only when they come first in the sentence. Whichever phrase follows "and" is ignored.

Member

honnibal commented Dec 23, 2016

It looks like the noun chunk detection rules could be improved here. The issue comes from the combination of coordination and proper nouns:

    def english_noun_chunks(obj):
        '''Detect base noun phrases from a dependency parse.
        Works on both Doc and Span.'''
        labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
                  'attr', 'ROOT', 'root']
        doc = obj.doc # Ensure works on both Doc and Span.
        np_deps = [doc.vocab.strings[label] for label in labels]
        conj = doc.vocab.strings['conj']
        np_label = doc.vocab.strings['NP']
        for i, word in enumerate(obj):
            if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
                yield word.left_edge.i, word.i+1, np_label
            elif word.pos == NOUN and word.dep == conj:
                head = word.head
                while head.dep == conj and head.head.i < head.i:
                    head = head.head
                # If the head is an NP, and we're coordinated to it, we're an NP
                if head.dep in np_deps:
                    yield word.left_edge.i, word.i+1, np_label

I think the correction should be:

elif word.pos in (NOUN, PROPN) and word.dep == conj:
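
To make the effect of that one-line change concrete, here is a minimal, self-contained sketch that replays the rule on a hand-built parse. It does not use spaCy: the `Tok` class, the `chunk_texts` helper, the shortened sentence, and the tags and dependency labels assigned to it are all assumptions made for illustration, not output from the real parser.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Dependency labels that mark a noun-phrase head (same list as the rule above).
NP_DEPS = {"nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "attr", "ROOT", "root"}


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token (not spaCy's Token class)."""
    text: str
    pos: str
    dep: str
    head: int       # index of the head token (own index for the root)
    left_edge: int  # index of the leftmost token the chunk should start at


def chunk_texts(tokens: List[Tok], conj_pos: Tuple[str, ...]) -> List[str]:
    """Apply the noun-chunk rule; conj_pos is the set of POS tags
    accepted in the coordination ('conj') branch."""
    spans = []
    for i, word in enumerate(tokens):
        if word.pos in ("NOUN", "PROPN", "PRON") and word.dep in NP_DEPS:
            spans.append((word.left_edge, i + 1))
        elif word.pos in conj_pos and word.dep == "conj":
            # Climb leftwards through chained conjuncts to the first one.
            h = word.head
            while tokens[h].dep == "conj" and tokens[h].head < h:
                h = tokens[h].head
            # If the first conjunct heads an NP, this conjunct is an NP too.
            if tokens[h].dep in NP_DEPS:
                spans.append((word.left_edge, i + 1))
    return [" ".join(t.text for t in tokens[s:e]) for s, e in spans]


# Hand-assigned parse of a shortened version of the sentence (an assumption).
TOKENS = [
    Tok("the", "DET", "det", 2, 0),
    Tok("Airport", "PROPN", "compound", 2, 1),
    Tok("Board", "PROPN", "ROOT", 2, 0),
    Tok("and", "CONJ", "cc", 2, 3),
    Tok("Space", "PROPN", "compound", 5, 4),
    Tok("Partnership", "PROPN", "conj", 2, 4),
    Tok(".", "PUNCT", "punct", 2, 6),
]

before = chunk_texts(TOKENS, conj_pos=("NOUN",))         # current rule
after = chunk_texts(TOKENS, conj_pos=("NOUN", "PROPN"))  # proposed rule
print(before)
print(after)
```

Under this assumed parse, the NOUN-only branch skips "Partnership" because it is tagged PROPN, so only the first phrase is produced. Accepting PROPN in the branch yields both phrases.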

ines added this to the Debug parser transition system milestone Feb 18, 2017
ines added a commit that referenced this issue Mar 18, 2017

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018