Native coref component #7264
Conversation
* initial coref_er pipe
* matcher more flexible
* base coref component without actual model
* initial setup of coref_er.score
* rename to include_label
* preliminary score_clusters method
* apply scoring in coref component
* IO fix
* return None loss for now
* rename to CoreferenceResolver
* some preliminary unit tests
* use registry as callable
Status March 1:
While all of this is mostly a dummy framework, it has already helped uncover some bugs and missing functionality, cf. PRs #7197, #7209 and #7225. Going forward, this bare framework should make it easier for different people to work on the functionality in parallel, filling in different parts... TODO
Open questions / current issues
Just saying that I hope that the state of the art ... Anyway, this is a very welcome improvement that I'm looking forward to :)
This includes the coref code that was being tested separately, modified to work in spaCy. It hasn't been tested yet and presumably still needs fixes. In particular, the evaluation code is currently omitted. It's unclear at the moment whether we want to use a complex scorer similar to the official one, or a simpler scorer using more modern evaluation methods.
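For context, here is a minimal sketch of how predicted clusters could be attached to a Doc in spaCy using span groups. This is an illustration rather than the PR's actual code, and the "coref_clusters_*" key naming and the offset format are assumptions.

```python
# Illustrative sketch only, not the PR's code. Assumes each cluster is a list
# of (start_token, end_token) offsets; the "coref_clusters_*" key scheme is made up.
import spacy

nlp = spacy.blank("en")
doc = nlp("Sarah called her sister because she needed help.")

# Hypothetical model output: one cluster covering "Sarah", "her", "she".
predicted_clusters = [[(0, 1), (2, 3), (5, 6)]]

for i, cluster in enumerate(predicted_clusters, start=1):
    doc.spans[f"coref_clusters_{i}"] = [doc[start:end] for start, end in cluster]

for key, group in doc.spans.items():
    print(key, [span.text for span in group])
```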
Ended up not making a difference, but oh well.
When sentence boundaries are not available, just treat the whole doc as one sentence. This is a reasonable general fallback, and it matters in particular for the init call, where upstream components aren't run.
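A rough sketch of that fallback, assuming the component wants sentence spans as input; get_sentence_spans is an illustrative helper, not code from this PR.

```python
from typing import List

from spacy.tokens import Doc, Span


def get_sentence_spans(doc: Doc) -> List[Span]:
    """Illustrative helper: return sentence spans, falling back to the whole doc."""
    if doc.has_annotation("SENT_START"):
        return list(doc.sents)
    # No sentence boundaries set (e.g. during initialization, before upstream
    # components have run): treat the whole doc as a single sentence.
    return [doc[:]]
```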
Training seems to actually run now!
This makes their scope tighter and more contained, and has the nice side effect that fewer things need to be passed around for backprop.
This is closer to the traditional evaluation method. That uses an average of three scores, this is just using the bcubed metric for now (nothing special about bcubed, just picked one). The scoring implementation comes from the coval project. It relies on scipy, which is one issue, and is rather involved, which is another. Besides being comparable with traditional evaluations, this scoring is relatively fast.
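For reference, a self-contained sketch of the B-cubed metric itself (not the coval implementation mentioned above), with clusters given as sets of hashable mention identifiers:

```python
from typing import Iterable, Set, Tuple


def b_cubed(gold: Iterable[Set], pred: Iterable[Set]) -> Tuple[float, float, float]:
    """Compute B-cubed precision, recall and F1 over mention clusters."""
    gold, pred = list(gold), list(pred)

    def score(source, target):
        # Average, over mentions in the source clusters, of the overlap between
        # the mention's source cluster and its target cluster.
        total, n = 0.0, 0
        for s_cluster in source:
            for mention in s_cluster:
                t_cluster = next((t for t in target if mention in t), set())
                total += len(s_cluster & t_cluster) / len(s_cluster)
                n += 1
        return total / n if n else 0.0

    precision = score(pred, gold)
    recall = score(gold, pred)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example: one gold cluster split into two predicted clusters.
gold = [{"m1", "m2", "m3"}]
pred = [{"m1", "m2"}, {"m3"}]
print(b_cubed(gold, pred))  # precision 1.0, recall ~0.56, F1 ~0.71
```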
The intent of this was that it would be a pipeline component that used entities as input, but that's now covered by the get_mentions function as a pipeline arg.
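As an illustration of that idea, a mention-getter that proposes the Doc's entities as candidate mentions might look roughly like this; the registry name is hypothetical and the actual config wiring used by the component isn't shown in this thread.

```python
from typing import List

import spacy
from spacy.tokens import Doc, Span


@spacy.registry.misc("ent_mentions.v1")  # hypothetical registry name
def create_ent_mentions():
    def ent_mentions(doc: Doc) -> List[Span]:
        # Propose the existing named entities as candidate coref mentions.
        return list(doc.ents)

    return ent_mentions
```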
Update Coref Docs
Fix tokenization mismatch handling in coref
This was changed by merge
There's no guarantee about the order in which SpanGroup keys will come out, so access them in sorted order when doing comparisons.
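In other words, comparisons iterate over sorted keys instead of relying on insertion order, along the lines of the sketch below; predicted_doc and gold_doc are placeholders.

```python
from spacy.tokens import Doc


def clusters_as_offsets(doc: Doc):
    # Read span groups in sorted key order so two Docs can be compared
    # regardless of the order in which the groups were added.
    return [
        [(span.start, span.end) for span in doc.spans[key]]
        for key in sorted(doc.spans)
    ]

# assert clusters_as_offsets(predicted_doc) == clusters_as_offsets(gold_doc)
```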
This was only necessary while the tok2vec_size option was still required.
Dimension inference in Coref
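As a generic illustration of the kind of dimension inference meant here (plain Thinc, not the coref model): leaving an input width unset and letting initialize() fill it in from sample data is what makes an explicit width setting like tok2vec_size unnecessary.

```python
import numpy
from thinc.api import Linear

model = Linear(nO=2)          # input width nI left unset
X = numpy.zeros((4, 96), dtype="f")
Y = numpy.zeros((4, 2), dtype="f")
model.initialize(X=X, Y=Y)    # nI is inferred as 96 from the sample batch
print(model.get_dim("nI"))    # -> 96
```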
This was probably used in the prototyping stage, left as a reference, and then forgotten. Nothing uses it any more.
@explosion-bot please test_gpu
URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/100
Closing this PR, as we'll release the functionality in ... The docs PR is here: #11291
Just wanted to send a quick update about ...
We'd love for you to try this out, and any feedback is very welcome over at the discussion forum!
Work-in-progress
Description
Creating a native coref component in spaCy
Types of change
new feature
Checklist