Merging spans - discussion #156

Closed
andreasgrv opened this issue Oct 30, 2015 · 10 comments
@andreasgrv
Contributor

Hi!

I've been working on relation extraction using spaCy for the past week or so and the API has been very convenient and the results are great 😄 - lovely lib!

However, I have stumbled upon the problem of span merging, as pointed out in comments in the code: spans that were extracted earlier are invalidated by a merge. How are you thinking of approaching this problem?

I have solved it partially for my purposes, but it's a bit of an ugly hack:

    # iterating in reverse order doesn't invalidate the next spans we want
    # to process, since we are shrinking the doc starting from the end
    for sentence in reversed(list(doc.sents)):
        # same for entities
        for ent in reversed(sentence.ents):
            # collapse() basically merges the span, using sensible
            # defaults for the labels
            ent.collapse()
    # get the sentences again - now with correct spans
    for sentence in doc.sents:
        # do stuff you wanted to do with the collapsed entities here
        ...

I extended Span to have the same ents property that is defined on Doc, and added a collapse function which basically achieves the same thing as merging the span with reasonable labels. I know adding an ents property to Span doesn't always make sense, but it does when the span is a sentence or a noun phrase. A sentence seems to be quite a specific type of span: many properties of Doc make sense for a sentence span. But I digress.

    def collapse(self):
        start_idx = self[0].idx
        end_idx = self[-1].idx + len(self[-1])
        lemma = u' '.join(word.lemma_ for word in self)
        # max() over the strings is a heuristic: it picks a non-empty
        # entity type if any token in the span has one
        ent_type = max(word.ent_type_ for word in self)
        merged = self.doc.merge(start_idx, end_idx, self.root.tag_, lemma, ent_type)
        return merged
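
For reference, the ents property I added to Span is roughly this (a simplified sketch of my extension, not part of spaCy's API; it just filters the doc's entities down to those contained in the span):

    @property
    def ents(self):
        # entities from the parent Doc whose token range falls
        # entirely inside this span
        return [ent for ent in self.doc.ents
                if ent.start >= self.start and ent.end <= self.end]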

I thought of changing the merge function so that it doesn't shrink the doc, simply replacing the merged tokens with a special placeholder in the array; then, in __iter__, iterate through the array as normal and only yield the object if it is not that placeholder. However, this isn't compatible with getting items by index. Any ideas?

@honnibal
Member

I keep coming up against this too.

Provisionally, I've been doing my merging by collecting the arguments to merge() from the spans in one step, and then merging them all in the next step. Doc.merge() takes character indices, so you can merge multiple spans in that step.
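For example, to merge all the entities (a rough sketch, reusing the argument choices from your collapse() snippet):

    # step 1: collect the merge arguments while all the spans are still valid
    merges = []
    for ent in doc.ents:
        start_char = ent[0].idx
        end_char = ent[-1].idx + len(ent[-1])
        lemma = u' '.join(word.lemma_ for word in ent)
        ent_type = max(word.ent_type_ for word in ent)
        merges.append((start_char, end_char, ent.root.tag_, lemma, ent_type))
    # step 2: apply the merges; the character offsets stay valid throughout,
    # because merging changes token boundaries, not the underlying text
    for start_char, end_char, tag, lemma, ent_type in merges:
        doc.merge(start_char, end_char, tag, lemma, ent_type)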

I've had a few ideas.

1. The Span object has start_idx and end_idx attributes, which refer to character offsets. This allows the Span to check that span.doc[span.start].idx == span.start_idx, i.e. that the span starts on the same character it originally did.

1.a doc.merge() still invalidates Span objects, but now the failure is explicit.

1.b After doc.merge(), you have to call span.recalculate_indices() or something, which checks that the Span starts where it should, and if it doesn't, resets the span.start and span.__len__ attributes.

1.c span.start and span.__len__ are computed properties, which are magically always correct.

2. Doc objects hold a reference to each Span object that slices into it. After Doc.merge(), the Span objects are updated. Care is taken to ensure the bidirectional reference doesn't cause a memory leak.

I can see the argument for 1.a, and it's at least better than what we have now. I think it probably leaves too much convenience on the table, though. 1.b is possibly a bit inconsistent with the way the rest of the library works. I don't really have any other "call this after doing this" sort of thing.

So, probably I favour 1.c, along with some name changes to make things a little bit clearer.

It's possibly a little confusing that the start and end token indices can change out from under you. On the other hand, if you're calling .merge(), what do you expect? I think the right naming and docs can make this clear enough.

@andreasgrv
Contributor Author

Based on an approach resembling the 1.c that you suggested, I think I have come up with something that works. However, it will break current code written for merges, since spans extracted from a document will, most of the time, change after you merge a span of that document. I have replaced the start and end attributes with properties, and additionally store start_idx and end_idx character offsets for each span in the constructor. The main assumption for this approach to work is that character offsets won't change, so we can use them as a point of reference to recalculate the correct start and end token indices using the Doc. I didn't need to store references to created spans in the Doc class.

Each time span.start or span.end is accessed, I check whether doc[initial_token_index].idx matches the corresponding character offset. If not, I iterate through the tokens in the doc until I find the token index whose token.idx matches the stored character offset.
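
In outline, the start property now looks something like this (a simplified sketch of my change, with abbreviated names):

    @property
    def start(self):
        # fast path: the stored token index still lines up with the
        # character offset recorded in the constructor
        if self.doc[self._start].idx != self.start_idx:
            # a merge shifted the tokens: rescan for the token whose
            # character offset matches the one we stored
            for i, token in enumerate(self.doc):
                if token.idx == self.start_idx:
                    self._start = i
                    break
        return self._start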

Does this approach seem ok? I am having a bit of trouble with the head attribute of the merged span not pointing to the right token, but I am looking into the merge code now.

@honnibal
Member

honnibal commented Nov 4, 2015

This sounds like what I had in mind, yes. I'm not quite sure what you mean about breaking current code, or why that would be necessary.

There's currently no way to change the .orth attribute of the tokens, which means there's no way to change the underlying characters. I think this is good. If the user needs to do normalization of the input, they can set the lexeme.norm property, or normalize the text before processing. So, you should assume the characters are immutable, and we can base the spans off them.

This is a difficult bit of the code, so you can expect me to be pretty fussy about it — your patches probably won't be accepted cleanly :). I'll get to this soon enough myself, if you prefer to wait. But if you want to do this, go ahead.

@andreasgrv
Contributor Author

Actually I might be mistaken about breaking current code, will recheck.

I absolutely understand; I implemented it because I am fussy myself ^_^. I am writing a relation extractor at the sentence level, so I would prefer to have the option to decide whether to merge entities and noun phrases inside the extractor, rather than beforehand (there are obviously other ways to get around this, but a relation extractor at the sentence level seemed like a reasonable choice).

At the moment merges work, but the head attribute of the merged token isn't correct.
Should I make a pull request when and if it works, so that you don't have to work on it from scratch? I don't know what the easiest way to review my changes would be.

@honnibal
Member

honnibal commented Nov 4, 2015

Yeah, pull request is fine.

Have you found the current Doc.merge implementation? See also set_children_from_heads in doc.pyx

A minor gotcha is that in the C data, the .head attribute stores an offset integer, while l_edge and r_edge store absolute positions. You'll need to make sure all of l_kids, r_kids, l_edge and r_edge are set correctly, or the lefts, rights, children and subtree iterators will be broken.
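
In other words (a pure-Python illustration of the convention; the actual data lives in C structs, so the name c_tokens here is made up):

    i = 4                            # a token's own absolute position
    head_offset = c_tokens[i].head   # relative: -2 means two tokens to the left
    head_index = i + head_offset     # absolute index of the head token
    # l_edge and r_edge, by contrast, already store absolute positions
    left_edge = c_tokens[head_index].l_edge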

(The way I'm storing the tree is a little bit weird. I want to be able to look up various child features quickly, and I don't want to store an explicit n**2 table of children.)

@andreasgrv
Contributor Author

Yes, I am trying to understand what it does after it gets the actual span.

Aha, so that is what that

token.head += i

is about! Ok, I think I understand now.

Ok, will add some tests for those cases in tests/spans/test_merge.py. Will make a pull request when done, and you can change the code as you see fit ^_^

@honnibal
Member

honnibal commented Nov 7, 2015

This is merged in now, with a lot of other changes (I made major changes to the thinc library).

I ended up blending 1.b and 1.c: .start and .end are not computed properties, but we call a function to verify them and reset if necessary when we go to fetch a token.

Thanks for your help on this, and for agitating for the change.

Example of this at work:

>>> for np in doc.noun_chunks:
...   np.merge(np.root.tag_, np.text, np.root.ent_type_)
... 
>>> for tok in doc:
...   print(tok.text)
... 
The cat
sat
on
the mat
in
a fuzzy hat
.

Much nicer.

@andreasgrv
Contributor Author

Great! Will get the changes and check it out 👍 Thnx

@honnibal
Member

honnibal commented Nov 8, 2015

Now live in v0.99

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018