Merging spans - discussion #156
I keep coming up against this too. Provisionally, I've been doing my merging by collecting the arguments to merge() from the spans in one step, and then merging them all in the next step. I've had a few ideas.
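The two-step pattern described above can be sketched without spaCy at all. Here a document is just a list of token strings and the "merge arguments" are plain `(start, end)` index pairs; the function names and data shapes are illustrative, not the real API. The key point is that all span information is read *before* any merge mutates the token array, and merges are then applied back-to-front so earlier indices stay valid.

```python
# Sketch of the two-phase pattern: collect merge arguments from all spans
# first, then apply the merges afterwards. A "doc" here is a plain list of
# token strings; merge args are (start, end) index pairs. Illustrative only,
# not the real spaCy API.

def collect_merge_args(spans):
    """Phase 1: read everything needed while indices are still valid."""
    return [(start, end) for start, end in spans]

def apply_merges(tokens, merge_args):
    """Phase 2: merge from the end of the document backwards, so earlier
    (start, end) pairs are not invalidated by later merges."""
    for start, end in sorted(merge_args, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]
    return tokens

tokens = ["The", "cat", "sat", "on", "the", "mat"]
spans = [(0, 2), (3, 6)]          # "The cat", "on the mat"
print(apply_merges(tokens, collect_merge_args(spans)))
# -> ['The cat', 'sat', 'on the mat']
```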
1.a 1.b After 1.c
I can see the argument for 1.a, and it's at least better than what we have now, but I think it leaves too much convenience on the table. 1.b is possibly a bit inconsistent with the way the rest of the library works; we don't really have any other "call this after doing that" pattern. So I probably favour 1.c, along with some name changes to make things a little clearer. It's possibly a little confusing that the start and end token indices can change out from under you. On the other hand, if you're calling …
Based on an approach resembling 1.c that you suggested, I think I've come up with something that works. However, it will break current code written against merges, since spans extracted from a document will most of the time change after you merge a span of that document.

I have replaced the begin and end attributes with properties, and additionally store start and end character offsets for each span in the constructor. The main assumption for this approach to work is that character offsets won't change, so we can use them as a point of reference to recalculate the correct start and end token indices from the Doc. I didn't need to store references to created spans in the Doc class. Each time span.start or span.end is accessed, I check whether doc[initial_token_index].idx matches the corresponding character offset; if not, I iterate through the tokens in the doc until I find the token index whose character offset matches token.idx.

Does this approach seem ok? I am having a bit of trouble with the head attribute of the merged span not pointing to the right token, but I am looking into the merge code now.
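The lazy-recomputation idea above can be sketched as follows. `MiniToken`, `MiniDoc`, and `MiniSpan` are hypothetical stand-ins, not spaCy classes; the only assumption carried over from the comment is that character offsets stay fixed across merges and can serve as the point of reference.

```python
# Sketch: a span remembers its character offsets and recomputes its token
# index lazily whenever the cached index no longer lines up with the doc.

class MiniToken:
    def __init__(self, text, idx):
        self.text = text   # token string
        self.idx = idx     # character offset into the document

class MiniDoc:
    def __init__(self, words):
        self.tokens = []
        offset = 0
        for w in words:
            self.tokens.append(MiniToken(w, offset))
            offset += len(w) + 1   # +1 for a single separating space

class MiniSpan:
    def __init__(self, doc, start, end):
        self.doc = doc
        self.start_char = doc.tokens[start].idx   # fixed reference point
        self._start = start                       # cached, may go stale

    @property
    def start(self):
        # If merges changed the token layout, re-find the token whose
        # character offset matches the stored one.
        toks = self.doc.tokens
        if self._start >= len(toks) or toks[self._start].idx != self.start_char:
            self._start = next(i for i, t in enumerate(toks)
                               if t.idx == self.start_char)
        return self._start

doc = MiniDoc(["The", "cat", "sat", "on", "the", "mat"])
span = MiniSpan(doc, start=3, end=6)              # "on the mat"
# Simulate merging "The cat" into one token: the token count shrinks,
# but character offsets of the remaining tokens are unchanged.
doc.tokens[0:2] = [MiniToken("The cat", 0)]
print(span.start)   # recomputed: 2 instead of the stale 3
```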
This sounds like what I had in mind, yes. I'm not quite sure what you mean about the breakage of current code; I'm not sure why that's necessary. There's currently no way to change the …

This is a difficult bit of the code, so you can expect me to be pretty fussy about it — your patches probably won't be accepted cleanly :). I'll get to this soon enough myself, if you prefer to wait. But if you want to do this, go ahead.
Actually, I might be mistaken about breaking current code; I will recheck. I absolutely understand — the reason I implemented it was that I am fussy myself ^_^. I am writing a relation extractor at the sentence level, so I would prefer to have the option to decide whether to merge entities and noun phrases inside the extractor rather than beforehand (there are obviously other ways to get around this, but a relation extractor at the sentence level seemed like a reasonable choice). At the moment merges work, but the head attribute of the merged token isn't correct.
Yeah, a pull request is fine. Have you found the current merge implementation? A minor gotcha is that in the C data, the head attribute is stored as an offset relative to the token's own position, not as an absolute index. (The way I'm storing the tree is a little bit weird: I want to be able to look up various child features quickly, and I don't want to store an explicit n**2 table of children.)
Yes, I am trying to understand what it does after it gets the actual span. Aha, so that's what that `token.head += i` is about! OK, I think I understand now. I will add some tests for those cases in `tests/spans/test_merge.py`, make a pull request when done, and you can change the code as you see fit ^_^
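The relative-offset scheme discussed in the last two comments can be illustrated with plain Python lists. Heads are stored as offsets relative to each token's own position, so `i + head` recovers the absolute index of token `i`'s head; this is why code slicing out a span does `token.head += i`. The word list and tree below are made up for illustration, not spaCy's real C layout.

```python
# Sketch of the relative-head-offset gotcha: each token's head is stored
# as an offset relative to the token's own position, not an absolute index.

words = ["The", "cat", "sat", "down"]
# The -> cat (+1), cat -> sat (+1), sat is the root (0), down -> sat (-1)
rel_heads = [1, 1, 0, -1]

def absolute_head(i):
    """Convert token i's relative head offset to an absolute index."""
    return i + rel_heads[i]

for i, w in enumerate(words):
    print(w, "->", words[absolute_head(i)])
# The -> cat, cat -> sat, sat -> sat, down -> sat
```

Because the offsets are relative, merging tokens out of the array silently changes the distance between a token and its head, so any merge code has to recompute the offsets for tokens whose heads lie beyond the merged region — which is consistent with the head-pointing-at-the-wrong-token bug described above.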
This is merged in now, along with a lot of other changes (I made major changes to the …). I ended up blending 1.b and 1.c.

Thanks for your help on this, and for agitating for the change. Example of this at work:

```python
>>> for np in doc.noun_chunks:
...     np.merge(np.root.tag_, np.text, np.root.ent_type_)
...
>>> for tok in doc:
...     print(tok.text)
...
The cat
sat
on
the mat
in
a fuzzy hat
.
```

Much nicer.
Great! Will get the changes and check it out 👍 Thnx |
Now live in v0.99 |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Hi!
I've been working on relation extraction using spaCy for the past week or so and the API has been very convenient and the results are great 😄 - lovely lib!
However I have stumbled upon the problem of span merging as pointed out in comments in the code - namely the problem that spans that have been extracted earlier are invalidated. How are you thinking of approaching this problem?
I have solved it partially for my purposes, but it's a bit of an ugly hack:
I extended Span to have the same ents property that is defined on Doc, and added a collapse function which achieves basically the same thing as merging the span using reasonable labels. I know adding an ents property to a Span doesn't always make sense, but in the case where the span is a sentence or a noun phrase it does. A sentence seems to be quite a specific type of span: many properties of Doc make sense for a sentence span. But I digress.
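The workaround described above can be sketched roughly as follows. Everything here is hypothetical (`span_ents`, `collapse`, the entity tuples and labels are all made up for illustration); the real hack extended spaCy's classes, which this sketch does not attempt.

```python
# Rough sketch of the workaround: an `ents`-style view filtered from the
# document's entities, plus a `collapse` helper that merges those entities.
# All names and data are hypothetical, not spaCy's API.

def span_ents(doc_ents, span_start, span_end):
    """Entities (start, end, label) that fall inside [span_start, span_end)."""
    return [(s, e, label) for s, e, label in doc_ents
            if s >= span_start and e <= span_end]

def collapse(tokens, ents):
    """Merge each entity's tokens into one, working backwards so that
    earlier (start, end) pairs stay valid."""
    for s, e, label in sorted(ents, reverse=True):
        tokens[s:e] = [" ".join(tokens[s:e])]
    return tokens

tokens = ["Alice", "met", "Bob", "Smith", "in", "New", "York"]
doc_ents = [(0, 1, "PERSON"), (2, 4, "PERSON"), (5, 7, "GPE")]
sent_ents = span_ents(doc_ents, 0, 7)   # entities inside this "sentence"
print(collapse(tokens, sent_ents))
# -> ['Alice', 'met', 'Bob Smith', 'in', 'New York']
```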
I thought of changing the merge function so that it didn't shrink the doc: simply replace the tokens that have been merged with a special placeholder in the array, and then have `__iter__` walk the array as normal, only yielding objects that are not that placeholder. However, this isn't compatible with getting items by index. Any ideas?