Native coref component #7243

svlandeg · 2021-03-01T21:40:49Z

Work-in-progress

Description

Creating a native coref component in spaCy

Types of change

enhancement

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

svlandeg · 2021-03-01T22:03:35Z

Status March 1:

Wrote a preliminary v3-compatible framework to facilitate experimentation with different coref models
Currently assuming two different pipeline components:
- coref_er / CorefEntityRecognizer is a rule-based mention detection algorithm: uses noun chunks, POS tags and named entities
- coref / CoreferenceResolver assembles the provided mentions into clusters (dummy implementation)
Using doc.spans to store the information:
- doc.spans[coref_mentions] for storing all relevant coref mentions (nouns, pronouns, names, ...)
- doc.spans[coref_clusters_i] for different clusters, indexed with i
Coref.v0 needs to be implemented and changed to Coref.v1
The Scorer.score_clusters method currently uses an overly simple scoring mechanism (binary relations between mentions) and should be refined with an actual coref scoring algorithm
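To make the "binary relations between mentions" idea concrete, here is a minimal sketch of pairwise link scoring. It is not the code in this PR: the function names `cluster_links` and `score_links` are hypothetical, and mentions are represented as plain `(start, end)` token offsets rather than spaCy `Span` objects, so that the example runs without spaCy.

```python
from itertools import combinations


def cluster_links(clusters):
    """Return the set of unordered mention pairs that share a cluster."""
    links = set()
    for cluster in clusters:
        # Sort so that each pair has one canonical orientation.
        for a, b in combinations(sorted(set(cluster)), 2):
            links.add((a, b))
    return links


def score_links(gold_clusters, pred_clusters):
    """Precision / recall / F-score over pairwise coreference links.

    Hypothetical helper, only meant to illustrate the idea of scoring
    binary relations between mentions.
    """
    gold = cluster_links(gold_clusters)
    pred = cluster_links(pred_clusters)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


# Mentions as (start, end) token offsets; each inner list is one cluster.
gold = [[(0, 1), (5, 6), (9, 10)], [(2, 4), (12, 13)]]
pred = [[(0, 1), (5, 6)], [(2, 4), (9, 10), (12, 13)]]
scores = score_links(gold, pred)
print(scores)
```

One weakness of this kind of scoring is visible in the sketch: a singleton cluster contributes no links at all, so it is never rewarded or penalized. Established coreference metrics such as MUC, B-cubed and CEAF handle cluster structure more carefully.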

While all of this is mostly a dummy framework, it has already helped uncover some bugs & required functionality, cf. PRs #7197, #7209 and #7225.

Going forward, having this bare framework should facilitate working on this functionality with different people in parallel, filling in different parts...

TODO

Implement proper coref ML model
Proper mention detection algorithm: rule-based, ML-based, something like the SpanCategorizer, ...
Meaningful evaluation script
Tune & benchmark
Rewrite errors to use spacy.errors

Open questions / current issues

While we talked about keeping doc.spans a relatively simple dictionary of strings mapping to lists of spans, we might consider having a more formal way of defining clusters that belong together. Currently this is done by matching a prefix in the spans key, which is obviously not ideal.
The design with the rule-based coref_er is again awkward, because this component won't run during nlp.update, meaning that the coref model could only train on gold mentions, which is not a good idea in terms of generalizability and robustness of the ML model.
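The prefix-matching approach from the first open question can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: a plain dict stands in for doc.spans, spans are `(start, end)` token offsets, the helper `get_clusters` is hypothetical, and the `coref_clusters_` prefix is taken from the key naming described in the status update.

```python
# Assumed key prefix, following the "coref_clusters_i" naming in the status note.
PREFIX = "coref_clusters_"


def get_clusters(spans, prefix=PREFIX):
    """Collect the span groups whose key is the prefix plus an integer index.

    `spans` is any mapping from string keys to lists of spans; here a plain
    dict is used as a stand-in for doc.spans.
    """
    clusters = {}
    for key, group in spans.items():
        if not key.startswith(prefix):
            continue
        suffix = key[len(prefix):]
        # Skip keys that share the prefix but do not end in a numeric index.
        if suffix.isdigit():
            clusters[int(suffix)] = list(group)
    # Order by numeric index, so that e.g. 10 sorts after 2.
    return [clusters[i] for i in sorted(clusters)]


spans = {
    "coref_mentions": [(0, 1), (2, 4), (5, 6), (9, 10)],
    "coref_clusters_1": [(0, 1), (5, 6)],
    "coref_clusters_2": [(2, 4), (9, 10)],
    "ruler": [(7, 8)],  # unrelated span group written by another component
}
clusters = get_clusters(spans)
print(clusters)
```

The sketch also shows why this convention is fragile: the grouping lives entirely in the key strings, so any other component that happens to write a key with the same prefix would be picked up as a cluster.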

spacy/ml/models/coref.py

svlandeg · 2021-03-03T12:50:04Z

I'll just merge this into the new WIP feature/coref branch on explosion, so we can continue the discussion there.

svlandeg added 13 commits February 23, 2021 18:15

initial coref_er pipe

476549a

matcher more flexible

2b58d3e

base coref component without actual model

72003ee

Merge remote-tracking branch 'upstream/master' into feature/coref

de68f72

initial setup of coref_er.score

0f8530f

rename to include_label

67f1961

preliminary score_clusters method

ca9b5a6

apply scoring in coref component

2719d31

IO fix

cd139e2

return None loss for now

6f6888d

rename to CoreferenceResolver

dc5035a

Merge remote-tracking branch 'upstream/master' into feature/coref

d0f00b2

some preliminary unit tests

5d0ec5a

svlandeg marked this pull request as draft March 1, 2021 21:40

svlandeg added the enhancement (Feature requests and improvements) and feat / coref (Feature: Coreference resolution) labels Mar 1, 2021

svlandeg added the ⚠️ wip Work in progress label Mar 1, 2021

honnibal reviewed Mar 2, 2021

spacy/ml/models/coref.py (outdated, resolved)

use registry as callable

8635eb7

svlandeg marked this pull request as ready for review March 3, 2021 12:49

svlandeg merged commit e0c45c6 into explosion:feature/coref Mar 3, 2021

svlandeg deleted the feature/coref branch March 3, 2021 12:50
