💫 Rethink Doc.merge and Span.merge, add Token.split #1487

Closed
honnibal opened this issue Nov 3, 2017 · 4 comments
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects help wanted Contributions welcome!

Comments

@honnibal
Member

honnibal commented Nov 3, 2017

See also: #1474, #975, #758, #653, #616, #450, #429, #329, #375, #314, #214, #213, #156

Currently, Doc.merge() and Span.merge() promise too much. They let you merge a span while holding references to other spans, and the indices of those spans are supposed to be magically recalculated.

Unsurprisingly, this has proven very difficult to get right. It also makes Doc.merge() inefficient for repeated calls, because every time we merge something, we have to put the doc back into a consistent state.

We'd also really like to have a token.split() function that divides tokens. This is a big hole at the moment, and it really lets us down for languages like Chinese, where tokenization is an important part of the annotation and may need to be changed by pipeline components.

I think we should consider having a Doc.retokenize() context manager, which you would need to activate before calling span.merge() or token.split(). This should allow us to make the retokenization more reliable and efficient. Within the block, we could keep a reference to all new Span and Token objects. Only Span and Token objects created during retokenization should be used during retokenization.
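For illustration, the accumulate-then-apply idea behind the proposed context manager can be sketched in plain Python. This is a toy model, not spaCy's implementation: the Retokenizer class, the flat list of token strings, and the ignored attrs argument are all simplifying assumptions.

```python
class Retokenizer:
    """Toy model of the proposed context manager: accumulate merge
    requests inside the block, then apply them all together on exit."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # list of (start, end, attrs) requests

    def merge(self, start, end, attrs=None):
        self.merges.append((start, end, attrs or {}))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            return False  # don't apply merges if the block raised
        # Apply right to left so earlier indices stay valid as the list shrinks.
        for start, end, attrs in sorted(self.merges, key=lambda m: m[0], reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        return False


tokens = ["I", "saw", "New", "York", "in", "New", "Jersey"]
with Retokenizer(tokens) as retokenizer:
    retokenizer.merge(2, 4, attrs={"ent_type": "GPE"})  # attrs ignored in this sketch
    retokenizer.merge(5, 7, attrs={"ent_type": "GPE"})
print(tokens)  # ['I', 'saw', 'New York', 'in', 'New Jersey']
```

Deferring all merges to the end of the block means every index recorded inside the block still refers to the original tokenization, which sidesteps the index-recalculation problem described above.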

These changes are being left out of v2, because the v2 policy is to avoid breaking changes to the Doc, Span and Token objects (this makes it more predictable which parts of the application need to be updated).

@honnibal honnibal added enhancement Feature requests and improvements help wanted Contributions welcome! labels Nov 3, 2017
@ines ines changed the title Rethink Doc.merge and Span.merge, add Token.split 💫 Rethink Doc.merge and Span.merge, add Token.split Nov 3, 2017
@chaturv3di

Does this affect v1.9.0 too?

@ines
Member

ines commented Nov 6, 2017

@chaturv3di No, this will be more like spaCy v3.0, as it will introduce breaking changes to the Doc, Span and Token. As @honnibal mentioned above, we've been trying to avoid introducing breaking changes to spaCy's container objects in v2.0, since the new version already affects so many other parts of the library.

The above changes also require some experimentation and we haven't really decided on the perfect solution yet. That's why we've put it up for discussion on the issue tracker 😊

honnibal added a commit that referenced this issue Mar 31, 2018
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start emitting deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
honnibal added a commit that referenced this issue Mar 31, 2018
honnibal added a commit that referenced this issue Apr 1, 2018
honnibal added a commit that referenced this issue Apr 3, 2018
@ines ines added the feat / doc Feature: Doc, Span and Token objects label Apr 29, 2018
@ines ines closed this as completed May 20, 2018
@JamesMessinger
Contributor

@ines - why was this issue closed? Are there still plans to implement a token.split() method? Currently it still throws a NotImplementedError.

What's the best way to split tokens currently?
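Splitting was later implemented as retokenizer.split() inside the doc.retokenize() block (from spaCy v2.1). The split side of the accumulate-then-apply pattern can be sketched in plain Python; the apply_splits helper and the flat token list below are illustrative assumptions, not spaCy's API.

```python
def apply_splits(tokens, splits):
    """Toy sketch: apply accumulated (index, parts) split requests in one
    pass. Going right to left keeps earlier indices valid as the list grows."""
    for index, parts in sorted(splits, key=lambda s: s[0], reverse=True):
        tokens[index:index + 1] = list(parts)
    return tokens


tokens = ["I", "live", "in", "NewYork"]
apply_splits(tokens, [(3, ["New", "York"])])
print(tokens)  # ['I', 'live', 'in', 'New', 'York']
```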

@lock

lock bot commented Jun 26, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 26, 2018