💫 Rethink Doc.merge and Span.merge, add Token.split #1487
Does this affect v1.9.0 too?
@chaturv3di No, this will be more like spaCy v3.0, as it will introduce breaking changes to the `Doc`, `Span` and `Token` objects. The above changes also require some experimentation, and we haven't really decided on the perfect solution yet. That's why we've put it up for discussion on the issue tracker 😊
This patch takes a step towards #1487 by introducing the `doc.retokenize()` context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this:

```python
with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})
```

The retokenizer accumulates the merge requests, and applies them together at the end of the block. This allows retokenization to be more efficient and much less error-prone. A `retokenizer.split()` function will then be added, to handle splitting a single token into multiple tokens.

These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the `.merges` and `.splits` lists on the retokenizer.

The `doc.merge()` method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, `doc.merge()` fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start emitting deprecation warnings on direct calls to `doc.merge()`, to migrate people to the `retokenize` context manager.
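The accumulate-then-apply pattern described above can be sketched in plain Python. This is a minimal illustration of the technique only, not spaCy's actual implementation; the `Retokenizer` class and `retokenize` function here are hypothetical stand-ins operating on a list of strings:

```python
from contextlib import contextmanager

class Retokenizer:
    """Accumulates merge requests and applies them all when the block exits."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # list of (start, end) requests, recorded but not yet applied

    def merge(self, start, end):
        # Record the request; nothing is mutated until the with-block closes.
        self.merges.append((start, end))

    def apply(self):
        # Apply merges right-to-left, so earlier offsets remain valid
        # while later slices of the list are being rewritten.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = ["".join(self.tokens[start:end])]

@contextmanager
def retokenize(tokens):
    retok = Retokenizer(tokens)
    yield retok
    retok.apply()

tokens = ["New", " ", "York", " ", "is", " ", "big"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(0, 3)  # request a merge of "New" + " " + "York"
print(tokens)  # ['New York', ' ', 'is', ' ', 'big']
```

Because the requests are only recorded inside the block, the user can compute all the spans against a single, stable tokenization, and the batch is applied once at the end.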
@ines - why was this issue closed? Are there still plans to implement a `Token.split()`? What's the best way to split tokens currently?
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
See also: #1474, #975, #758, #653, #616, #450, #429, #329, #375, #314, #214, #213, #156
Currently `Doc.merge()` and `Span.merge()` promise too much. They let you merge a span while holding references to other spans, and the span indices are supposed to be magically recalculated. Unsurprisingly, this has proven super difficult to get right. It also makes `Doc.merge()` inefficient for repeated calls, because every time we merge something, we have to set the doc into the correct state.

We'd also really like to have a `token.split()` function that divides tokens. This is a big hole at the moment that really lets us down for languages like Chinese, where the tokenization is an important part of the annotation to be changed through the pipeline.

I think we should consider having a `Doc.retokenize()` context manager, which you would need to activate before calling `span.merge()` or `token.split()`. This should allow us to make the retokenization more reliable and efficient. Within the block, we could keep a reference to all new `Span` and `Token` objects. Only `Span` and `Token` objects created during retokenization should be used during retokenization.

These changes are being left out of v2, because the v2 policy is to avoid breaking changes to the `Doc`, `Span` and `Token` objects (this makes it more predictable which parts of the application need to be updated).