Merging spans - discussion #156

Closed
andreasgrv opened this issue Oct 30, 2015 · 10 comments
@andreasgrv
Contributor

Hi!

I've been working on relation extraction using spaCy for the past week or so and the API has been very convenient and the results are great 😄 - lovely lib!

However, I have stumbled upon the problem of span merging, as pointed out in comments in the code: spans that were extracted earlier are invalidated by a merge. How are you thinking of approaching this problem?

I have solved it partially for my purposes, but it's a bit of an ugly hack:

    # iterating in reverse order doesn't invalidate the next spans we want
    # to process, since we are shrinking the doc starting from the end
    for sentence in reversed(list(doc.sents)):
        # same for entities
        for ent in reversed(sentence.ents):
            # collapse() basically merges the span, using sensible
            # defaults for the labels
            ent.collapse()
    # get the sentences again - now with correct spans
    for sentence in doc.sents:
        # do stuff you wanted to do with the collapsed entities here
        ...

I extended Span to have the same ents property that is defined on Doc, and added a collapse function which basically achieves the same thing as merging the span with reasonable labels. I know adding an ents property to Span doesn't always make sense, but it does when the span is a sentence or a noun phrase. A sentence seems to be quite a specific type of span: many properties of Doc make sense for a sentence span. But I digress.

    def collapse(self):
        start_idx = self[0].idx
        end_idx = self[-1].idx + len(self[-1])
        lemma = u' '.join(word.lemma_ for word in self)
        # max() over the strings is a heuristic: it picks a non-empty
        # entity type if any token in the span has one
        ent_type = max(word.ent_type_ for word in self)
        merged = self.doc.merge(start_idx, end_idx, self.root.tag_, lemma, ent_type)
        return merged
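
For reference, the ents property I added to Span is roughly this (a simplified sketch of my extension, not part of spaCy's API; it just filters the doc's entities down to those contained in the span):

    @property
    def ents(self):
        # entities from the parent Doc whose token range falls
        # entirely inside this span
        return [ent for ent in self.doc.ents
                if ent.start >= self.start and ent.end <= self.end]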

I thought of changing the merge function so that it doesn't shrink the doc, simply replacing the merged tokens with a special placeholder in the array; then, in __iter__, iterate through the array as normal and only yield the object if it is not that placeholder. However, this isn't compatible with getting items by index. Any ideas?

@honnibal
Member

I keep coming up against this too.

Provisionally, I've been doing my merging by collecting the arguments to merge() from the spans in one step, and then merging them all in the next step. Doc.merge() takes character indices, so you can merge multiple spans in that step.
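For example, to merge all the entities (a rough sketch, reusing the argument choices from your collapse() snippet):

    # step 1: collect the merge arguments while all the spans are still valid
    merges = []
    for ent in doc.ents:
        start_char = ent[0].idx
        end_char = ent[-1].idx + len(ent[-1])
        lemma = u' '.join(word.lemma_ for word in ent)
        ent_type = max(word.ent_type_ for word in ent)
        merges.append((start_char, end_char, ent.root.tag_, lemma, ent_type))
    # step 2: apply the merges; the character offsets stay valid throughout,
    # because merging changes token boundaries, not the underlying text
    for start_char, end_char, tag, lemma, ent_type in merges:
        doc.merge(start_char, end_char, tag, lemma, ent_type)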

I've had a few ideas.

1. The Span object has start_idx and end_idx attributes, which refer to character offsets. This allows the Span to check that span.doc[span.start].idx == span.start_idx, i.e. that the span starts on the same character it originally did.

1.a doc.merge() still invalidates Span objects, but now the failure is explicit.

1.b After doc.merge(), you have to call span.recalculate_indices() or something, which checks that the Span starts where it should, and if it doesn't, resets the span.start and span.__len__ attributes.

1.c span.start and span.__len__ are computed properties, which are magically always correct.

2. Doc objects hold a reference to each Span object that slices into it. After Doc.merge(), the Span objects are updated. Care is taken to ensure the bidirectional reference doesn't cause a memory leak.

I can see the argument for 1.a, and it's at least better than what we have now. I think it probably leaves too much convenience on the table, though. 1.b is possibly a bit inconsistent with the way the rest of the library works. I don't really have any other "call this after doing this" sort of thing.

So, probably I favour 1.c, along with some name changes to make things a little bit clearer.

It's possibly a little confusing that the start and end token indices can change out from under you. On the other hand, if you're calling .merge(), what do you expect? I think the right naming and docs can make this clear enough.

@andreasgrv
Contributor Author

Based on an approach resembling the 1.c that you suggested, I think I have come up with something that works. However, it will break current code written for merges, since spans extracted from a document will, most of the time, change after you merge a span of that document. I have replaced the start and end attributes with properties, and additionally store start_idx and end_idx character offsets for each span in the constructor. The main assumption for this approach to work is that character offsets won't change, so we can use them as a point of reference to recalculate the correct start and end token indices using the Doc. I didn't need to store references to created spans in the Doc class.

Each time span.start or span.end is accessed, I check whether doc[initial_token_index].idx matches the corresponding character offset. If not, I iterate through the tokens in the doc until I find the token index whose token.idx matches the stored character offset.
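
In outline, the start property now looks something like this (a simplified sketch of my change, with abbreviated names):

    @property
    def start(self):
        # fast path: the stored token index still lines up with the
        # character offset recorded in the constructor
        if self.doc[self._start].idx != self.start_idx:
            # a merge shifted the tokens: rescan for the token whose
            # character offset matches the one we stored
            for i, token in enumerate(self.doc):
                if token.idx == self.start_idx:
                    self._start = i
                    break
        return self._start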

Does this approach seem ok? I am having a bit of trouble with the head attribute of the merged span not pointing to the right token, but I am looking into the merge code now.

@honnibal
Member

honnibal commented Nov 4, 2015

This sounds like what I had in mind, yes. I'm not quite sure what you mean about breaking current code, or why that would be necessary.

There's currently no way to change the .orth attribute of the tokens, which means there's no way to change the underlying characters. I think this is good. If the user needs to do normalization of the input, they can set the lexeme.norm property, or normalize the text before processing. So, you should assume the characters are immutable, and we can base the spans off them.

This is a difficult bit of the code, so you can expect me to be pretty fussy about it — your patches probably won't be accepted cleanly :). I'll get to this soon enough myself, if you prefer to wait. But if you want to do this, go ahead.

@andreasgrv
Contributor Author

Actually I might be mistaken about breaking current code, will recheck.

I absolutely understand; I implemented it because I am fussy myself ^_^. I am writing a relation extractor at the sentence level, so I would prefer to have the option to decide whether to merge entities and noun phrases inside the extractor, rather than beforehand (there are obviously other ways to get around this, but a relation extractor at the sentence level seemed like a reasonable choice).

At the moment merges work, but the head attribute of the merged token isn't correct.
Should I make a pull request when and if it works, so that you don't have to work on it from scratch? I don't know what the easiest way to review my changes would be.

@honnibal
Member

honnibal commented Nov 4, 2015

Yeah, pull request is fine.

Have you found the current Doc.merge implementation? See also set_children_from_heads in doc.pyx

A minor gotcha is that in the C data, the .head attribute stores an offset integer, while l_edge and r_edge store absolute positions. You'll need to make sure all of l_kids, r_kids, l_edge and r_edge are set correctly, or the lefts, rights, children and subtree iterators will be broken.
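
In other words (a pure-Python illustration of the convention; the actual data lives in C structs, so the name c_tokens here is made up):

    i = 4                            # a token's own absolute position
    head_offset = c_tokens[i].head   # relative: -2 means two tokens to the left
    head_index = i + head_offset     # absolute index of the head token
    # l_edge and r_edge, by contrast, already store absolute positions
    left_edge = c_tokens[head_index].l_edge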

(The way I'm storing the tree is a little bit weird. I want to be able to look up various child features quickly, and I don't want to store an explicit n**2 table of children.)

@andreasgrv
Contributor Author

Yes, I am trying to understand what it does after it gets the actual span.

Aha, so that is what that

token.head += i

is about! Ok, I think I understand now.

Ok, will add some tests for those cases in tests/spans/test_merge.py. Will make a pull request when done, and you can change the code as you see fit ^_^

@honnibal
Member

honnibal commented Nov 7, 2015

This is merged in now, with a lot of other changes (I made major changes to the thinc library).

I ended up blending 1.b and 1.c: .start and .end are not computed properties, but we call a function to verify them and reset if necessary when we go to fetch a token.

Thanks for your help on this, and for agitating for the change.

Example of this at work:

>>> for np in doc.noun_chunks:
...   np.merge(np.root.tag_, np.text, np.root.ent_type_)
... 
>>> for tok in doc:
...   print(tok.text)
... 
The cat
sat
on
the mat
in
a fuzzy hat
.

Much nicer.

@andreasgrv
Contributor Author

Great! Will get the changes and check it out 👍 Thnx

@honnibal
Member

honnibal commented Nov 8, 2015

Now live in v0.99

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018