
._.labels doesn't work for spans with length of one #91

Open
LawlAoux opened this issue Apr 26, 2022 · 5 comments

@LawlAoux

For some reason, when the span has a length of one, ._.labels returns an empty tuple. I would expect it to return the part of speech of the individual word (which can be obtained by taking the span's single token and reading its tag_).

Reproduction:

import spacy, benepar
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("Tuesday morning")
sent = tuple(doc.sents)[0]
first_child = tuple(sent._.children)[0]
pos = first_child._.labels

With this code, pos will be an empty tuple, but I would expect it to equal first_child[0].tag_, which is "NNP".

@burak0006

I encountered the same problem. I couldn't even iterate through ._.parse_string, since it is a nested parenthesized structure.

@anmolagarwal999

@burak0006 @LawlAoux
You can work around this by using the simpler parse string at the leaf (span here is the span object in question):

all_tokens = span._.parse_string.split("(")
label = all_tokens[1].split(" ")[0]


Here, the parse strings at the leaves are:

  • (NN Stock)
  • (NNS prices)
  • (VBD soared)
  • ........
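As a sanity check, the split("(") trick can be exercised on a literal leaf string; the string below is a stand-in for what span._.parse_string returns at a leaf:

```python
# A leaf span's parse string looks like "(NN Stock)".
# Splitting on "(" yields ["", "NN Stock)"], so the label is the
# first space-separated word of the second element.
parse_string = "(NN Stock)"  # stand-in for span._.parse_string
all_tokens = parse_string.split("(")
label = all_tokens[1].split(" ")[0]
print(label)  # → NN
```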

@badvision

I also had the same problem. Is there some kind of conversion to CNF along the way that causes the API to go bonkers? The only working solution I could come up with is the one @anmolagarwal999 suggested, but it is unfortunate to have to re-parse a string built from a sentence that has already been parsed. :/ A better API is warranted, in my opinion.

If you pass the ._.parse_string value into this function, it will give you a proper tree structure.

# Adapted from https://stackoverflow.com/questions/54959875/recursive-parentheses-parser-for-expressions-of-strings
import re

def parse_tree(sentence):
    stack = []  # or a `collections.deque()` object, which is a little faster
    top = items = []
    for token in filter(None, re.compile(r'(?:([()])|\s+)').split(sentence)):
        if token == '(':
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top
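For illustration, here is what the function produces on a made-up bracketed sentence (reproduced self-contained so it can be run directly):

```python
import re

def parse_tree(sentence):
    # Stack-based parser: "(" opens a new nested list, ")" closes it.
    stack = []
    top = items = []
    for token in filter(None, re.compile(r'(?:([()])|\s+)').split(sentence)):
        if token == '(':
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top

tree = parse_tree("(S (NP (NN Stock) (NNS prices)) (VP (VBD soared)))")
print(tree)
# → [['S', ['NP', ['NN', 'Stock'], ['NNS', 'prices']], ['VP', ['VBD', 'soared']]]]
```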

The result is a nested list, so it isn't convenient to pull values out of directly. Here is an XPath-like function you can use to query the structure.

def find_pos(tree, pos):
    result = []
    if not isinstance(tree[0], str):
        result = [find_pos(subtree, pos) for subtree in tree]
    else:
        pos_parts = pos.split("/")
        if re.match(pos_parts[0], tree[0], flags=re.IGNORECASE):
            if len(pos_parts) == 1:
                return tree[1]
            else:
                result = [find_pos(subtree, "/".join(pos_parts[1:])) for subtree in tree[1:]]
    if len(result) == 0:
        return None
    result = [f for f in result if f is not None]
    if len(result) == 0:
        return None
    elif len(result) == 1:
        return result[0]
    else:
        return result

You provide the (re-)parsed tree and the desired part of speech (as a string, case-insensitive), but you have to specify the path from the root. For example, if your sentence is an S > VP kind of sentence, getting the verb(s) looks like find_pos(command, 'VP/VB'), and if there is a noun associated with that, find_pos(command, 'VP/NP/NN.*') should do. If you want prepositional nouns ("go to the store"), you can use find_pos(command, 'VP/PP/NP/NN.*'). Slashes separate the tree levels you want to step through, and the expressions between the slashes can be full regular expressions, which allows some cleverness if you're careful with it.

Since I use regular expressions you have to import re to use this code. Enjoy!
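For what it's worth, here is how the query function behaves on a small hand-built tree in the shape parse_tree produces; the sentence is made up, and in my testing I had to include the root label (here S) as the first path component:

```python
import re

def find_pos(tree, pos):
    # Walk the nested-list tree, matching each "/"-separated path
    # component as a case-insensitive regex against node labels.
    result = []
    if not isinstance(tree[0], str):
        result = [find_pos(subtree, pos) for subtree in tree]
    else:
        pos_parts = pos.split("/")
        if re.match(pos_parts[0], tree[0], flags=re.IGNORECASE):
            if len(pos_parts) == 1:
                return tree[1]
            else:
                result = [find_pos(subtree, "/".join(pos_parts[1:])) for subtree in tree[1:]]
    if len(result) == 0:
        return None
    result = [f for f in result if f is not None]
    if len(result) == 0:
        return None
    elif len(result) == 1:
        return result[0]
    else:
        return result

# Hand-built tree, as parse_tree would return for
# "(S (NP (NN Stock) (NNS prices)) (VP (VBD soared)))"
tree = [["S", ["NP", ["NN", "Stock"], ["NNS", "prices"]], ["VP", ["VBD", "soared"]]]]

print(find_pos(tree, "S/VP/VBD"))   # → soared
print(find_pos(tree, "S/NP/NN.*"))  # → ['Stock', 'prices']
```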

@Naman-ntc

Naman-ntc commented Jan 22, 2023

Given any span, you can use the following function to get its labels:

from typing import Tuple
from spacy.tokens import Span

def get_span_labels(span: Span) -> Tuple[str, ...]:
    labels = span._.labels
    if len(labels) == 0:
        # Fall back to the single token's tag for a length-one span.
        doc = span.doc
        start, end = span.start, span.end
        assert start + 1 == end
        labels = (doc[start].tag_,)
        # constituent_data = doc._._constituent_data
        # labels_index = (
        #     (constituent_data.starts == start) * (constituent_data.ends == end)
        # ).argmax()
        # labels = constituent_data.label_vocab[labels_index]
    return labels

@th-yoo

th-yoo commented Jul 4, 2024

Below is a portion of the parse_string() function.

        label = label_vocab[label_idx]
        if (i + 1) >= j:
            token = doc[i]
            s = (
                "("
                + u"{} {}".format(token.tag_, token.text)
                .replace("(", "-LRB-")
                .replace(")", "-RRB-")
                .replace("{", "-LCB-")
                .replace("}", "-RCB-")
                .replace("[", "-LSB-")
                .replace("]", "-RSB-")
                + ")"
            )

._.labels is an empty tuple here, but ._.parse_string still shows token.tag_ as the tag.

  • Workaround 1
    Instead of ._.labels, use the function below.
def get_labels(span):
    return span._.labels or (span[0].tag_,)
  • Workaround 2
    Override the installed extensions.
# Span.remove_extension returns the removed extension as a
# (default, method, getter, setter) tuple; index 2 is the getter.
org_span_labels = spacy.tokens.Span.remove_extension('labels')

def get_labels(span):
    return org_span_labels[2](span) or (span[0].tag_,)

spacy.tokens.Span.set_extension('labels', getter=get_labels)

spacy.tokens.Token.remove_extension('labels')
spacy.tokens.Token.set_extension(
    'labels',
    getter=lambda token: get_labels(token.doc[token.i: token.i+1])
)
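Both workarounds rely on the fact that an empty tuple is falsy in Python, so `or` falls through to the single token's tag. A minimal stand-alone illustration, with stand-in values for span._.labels and span[0].tag_:

```python
labels = ()          # what ._.labels currently returns for a one-token span
fallback = ("NNP",)  # (span[0].tag_,)

# Empty tuple is falsy, so `or` picks the fallback.
print(labels or fallback)  # → ('NNP',)

# A span with real labels is unaffected by the fallback.
print(("NP",) or fallback)  # → ('NP',)
```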
