._.labels doesn't work for spans with length of one #91
Comments
I encountered the same problem. I couldn't even iterate through |
@burak0006 @LawlAoux
all_tokens = self.span_obj._.parse_string.split("(")
label = all_tokens[1].split(" ")[0]
Here, the parsed strings at the leaves are:
|
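To see why the string-splitting workaround above yields a label even for a single-token span, here is a minimal standalone sketch. The helper name label_from_parse_string and the sample strings are assumptions made for illustration; in practice the input would be whatever span._.parse_string returns.

```python
def label_from_parse_string(parse_string: str) -> str:
    """Return the outermost label of a bracketed parse string.

    Hypothetical helper: it applies the same two split() calls as the
    workaround above to a plain string instead of a span.
    """
    # Splitting on "(" leaves an empty first element, so the label is the
    # first word of element 1.
    all_tokens = parse_string.split("(")
    return all_tokens[1].split(" ")[0]


# A leaf is rendered as "(TAG word)", so the POS tag comes out.
leaf_label = label_from_parse_string("(NNP Apple)")
# A multi-token constituent starts with its phrase label.
phrase_label = label_from_parse_string("(NP (DT the) (NN cat))")
print(leaf_label, phrase_label)
```

Note that this only recovers the outermost label; if a span carries several stacked labels, the remaining ones would still need to be read from the rest of the string.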
I also had the same problem. Is there some kind of conversion to CNF along the way that causes the API to go bonkers? The only working solution I could come up with is the one @anmolagarwal999 suggested, but it is unfortunate to have to parse a string constructed from a sentence that is already parsed. :/ A better API is warranted in my opinion. If you pass the parsed_sentence string into this function, it will give you an appropriate tree structure.
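The function mentioned here is not included in the thread as captured. Purely as an illustration of the idea, a re-parser for the bracketed string might look like the sketch below. The name parse_bracketed and the (label, children) tuple layout are assumptions, not the commenter's code and not part of the library.

```python
from typing import Tuple

# Assumed tree layout: (label, children), where each child is either
# another (label, children) tuple or a plain word string.
Tree = Tuple[str, list]


def parse_bracketed(parse_string: str) -> Tree:
    """Parse a string like "(S (NP (NNP Apple)) ...)" into nested tuples."""
    # Literal brackets in the text are escaped (-LRB-, -RRB-, ...) in the
    # parse string, so every "(" and ")" here is structural.
    tokens = parse_string.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root constituent")
    return tree


tree = parse_bracketed("(S (NP (NNP Apple)) (VP (VBZ rises)))")
print(tree)
```

This sketch does no error recovery: an unbalanced string simply raises an exception.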
This is a tree so it's not convenient to get stuff out of it. Here is an XPath-like function which you can use to query the structure.
You provide the (re-)parsed tree and the desired part of speech (as a string, case insensitive), but you have to specify the path from the root. For example, if your sentence is an S > VP kind of sentence, then getting the verb(s) should look like this: Since I use regular expressions you have to |
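The query function itself is also missing from the thread as captured. A rough sketch of a path-based lookup in the same spirit could look like this. The name find_by_path, the " > " separator, and the (label, children) tuple layout are all assumptions for illustration rather than the commenter's implementation.

```python
import re


def find_by_path(tree, path: str) -> list:
    """Return the nodes reached by following `path` from the root.

    `tree` is assumed to be a (label, children) tuple; `path` is a chain
    of labels such as "S > VP > VBZ". Each step is treated as a
    case-insensitive regular expression matched against the whole label.
    """
    patterns = [re.compile(step.strip(), re.IGNORECASE) for step in path.split(">")]
    root_label = tree[0]
    if not patterns[0].fullmatch(root_label):
        return []
    current = [tree]
    for pattern in patterns[1:]:
        # Descend one level, keeping only constituent children whose
        # label matches the current step (plain word strings are skipped).
        current = [
            child
            for _, children in current
            for child in children
            if isinstance(child, tuple) and pattern.fullmatch(child[0])
        ]
    return current


sample = ("S", [("NP", [("NNP", ["Apple"])]), ("VP", [("VBZ", ["rises"])])])
verbs = find_by_path(sample, "s > vp > vb.*")
print(verbs)
```

Because each step is a regular expression, a step like "vb.*" covers every verb tag, while labels containing regex metacharacters (for example "PRP$") would need escaping.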
Given any span, you can use this function to get its labels:
from typing import Tuple
from spacy.tokens import Span

def get_span_labels(span: Span) -> Tuple[str, ...]:
    labels = span._.labels
    if len(labels) == 0:
        # Empty labels only occur for single-token spans here,
        # so fall back to the token's POS tag.
        doc = span.doc
        start, end = span.start, span.end
        assert start + 1 == end
        labels = (doc[start].tag_,)
        # constituent_data = doc._._constituent_data
        # labels_index = (
        #     (constituent_data.starts == start) * (constituent_data.ends == end)
        # ).argmax()
        # labels = constituent_data.label_vocab[labels_index]
    return labels
|
Below is a portion of the parse_string() function.
label = label_vocab[label_idx]
if (i + 1) >= j:
    token = doc[i]
    s = (
        "("
        + u"{} {}".format(token.tag_, token.text)
        .replace("(", "-LRB-")
        .replace(")", "-RRB-")
        .replace("{", "-LCB-")
        .replace("}", "-RCB-")
        .replace("[", "-LSB-")
        .replace("]", "-RSB-")
        + ")"
    )
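To make the excerpt concrete: for a single-token span, the leaf is written as the token's tag_ followed by its text, with literal brackets escaped. The standalone sketch below mirrors that formatting on plain strings. The helper name format_leaf is hypothetical and exists only for this illustration.

```python
def format_leaf(tag: str, text: str) -> str:
    """Render one leaf the way the excerpt above does, using plain strings."""
    # The replace() calls apply to the "tag text" string only; the outer
    # structural parentheses are added afterwards.
    body = (
        "{} {}".format(tag, text)
        .replace("(", "-LRB-")
        .replace(")", "-RRB-")
        .replace("{", "-LCB-")
        .replace("}", "-RCB-")
        .replace("[", "-LSB-")
        .replace("]", "-RSB-")
    )
    return "(" + body + ")"


print(format_leaf("NNP", "Apple"))
print(format_leaf("-LRB-", "("))
```

Since the leaf is built from token.tag_, the same value is a natural fallback when ._.labels is empty for a single-token span.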
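import spacy

# Simplest workaround: fall back to the token's tag for single-token spans.
def get_labels(span):
    return span._.labels or (span[0].tag_,)

# Alternatively, patch the extensions so that ._.labels itself falls back.
# remove_extension returns (default, method, getter, setter); [2] is the getter.
org_span_labels = spacy.tokens.Span.remove_extension('labels')
def get_labels(span):
    return org_span_labels[2](span) or (span[0].tag_,)
spacy.tokens.Span.set_extension('labels', getter=get_labels)
spacy.tokens.Token.remove_extension('labels')
spacy.tokens.Token.set_extension(
    'labels',
    getter=lambda token: get_labels(token.doc[token.i: token.i+1])
)
|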
For some reason, when the span has a length of one, ._.labels returns an empty tuple. I would expect it to return the part of speech of the individual word (which can be obtained by taking the token of the word in the span and then reading its tag_).
Reproduction:
From this code, pos will be an empty tuple, but I would expect it to be equal to first_child[0].tag_, which is "NNP".