Native coref component #7243

Merged 14 commits on Mar 3, 2021


@svlandeg svlandeg commented Mar 1, 2021

Work-in-progress

Description

Creating a native coref component in spaCy

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg marked this pull request as draft March 1, 2021 21:40
@svlandeg svlandeg added the enhancement (Feature requests and improvements) and feat / coref (Feature: Coreference resolution) labels Mar 1, 2021

svlandeg commented Mar 1, 2021

Status March 1:

  • Wrote preliminary v3-compatible framework to facilitate experimentation with different coref models
  • Currently assuming two different pipeline components:
    • coref_er / CorefEntityRecognizer is a rule-based mention detection algorithm: uses noun chunks, POS tags and named entities
    • coref / CoreferenceResolver assembles the provided mentions into clusters (dummy implementation)
  • Using doc.spans to store the information:
    • doc.spans[coref_mentions] for storing all relevant coref mentions (nouns, pronouns, names, ...)
    • doc.spans[coref_clusters_i] for different clusters, indexed with i
  • Coref.v0 needs to be implemented and changed to Coref.v1
  • The Scorer.score_clusters method currently uses an overly simple scoring mechanism (binary relations between mentions) and should be refined with an actual coref scoring algorithm
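
To make the "binary relations between mentions" idea concrete, here is a minimal sketch of pairwise link scoring. It is not the actual Scorer.score_clusters code: the function names `cluster_links` and `score_links`, the result keys, and the use of plain `(start, end)` tuples in place of spaCy `Span` objects are all assumptions made for illustration.

```python
from itertools import combinations


def cluster_links(clusters):
    """Return the set of unordered mention pairs that share a cluster.

    Hypothetical helper: mentions are (start, end) token offsets.
    """
    links = set()
    for cluster in clusters:
        # Sort so that every pair is stored in one canonical order.
        for a, b in combinations(sorted(set(cluster)), 2):
            links.add((a, b))
    return links


def score_links(gold_clusters, pred_clusters):
    """Precision / recall / F1 over binary coreference links (a sketch)."""
    gold = cluster_links(gold_clusters)
    pred = cluster_links(pred_clusters)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Two gold clusters vs. a prediction that puts mention (9, 10) in the wrong one.
gold = [[(0, 2), (5, 6), (9, 10)], [(3, 4), (12, 13)]]
pred = [[(0, 2), (5, 6)], [(3, 4), (9, 10), (12, 13)]]
scores = score_links(gold, pred)
```

One misplaced mention removes two gold links and adds two spurious ones, which shows how cluster size skews this measure and why a dedicated coref metric is the better long-term choice.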

While all of this is mostly a dummy framework, it has already helped uncover some bugs and missing functionality, cf. PRs #7197, #7209 and #7225.

Going forward, having this bare framework should facilitate working on this functionality with different people in parallel, filling in different parts...

TODO

  • Implement proper coref ML model
  • Proper mention detection algorithm, rule-based, ML-based, something like the SpanCategorizer, ...
  • Meaningful evaluation script
  • Tune & benchmark
  • Rewrite errors to use spacy.errors

Open questions / current issues

  • While we talked about keeping doc.spans a relatively simple dictionary of strings mapping to lists of spans, we might consider having a more formal way of defining which clusters belong together. Currently this is done by matching a prefix in the spans key, which is obviously not ideal

  • The design with the rule-based coref_er is again awkward, because this component won't run during nlp.update, meaning that the coref model could only train on gold mentions, which is not a good idea in terms of generalizability and robustness of the ML model.
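
For the first open question, the prefix matching on span keys can be sketched as follows. This is only an illustration under stated assumptions: a plain `dict` stands in for `doc.spans`, `(start, end)` tuples stand in for `Span` objects, and the `get_clusters` helper and the exact `coref_clusters_` key prefix are hypothetical names modelled on the key scheme described above.

```python
# Assumed key scheme: "coref_clusters_<i>" holds the mentions of cluster i.
CLUSTER_PREFIX = "coref_clusters_"


def get_clusters(spans, prefix=CLUSTER_PREFIX):
    """Collect clusters from a spans-like dict by matching a key prefix.

    Hypothetical helper; `spans` maps string keys to lists of mentions.
    """
    clusters = {}
    for key, mentions in spans.items():
        if not key.startswith(prefix):
            continue
        suffix = key[len(prefix):]
        # Keep only keys whose suffix is a cluster index.
        if suffix.isdigit():
            clusters[int(suffix)] = list(mentions)
    # Return the clusters ordered by their index.
    return [clusters[i] for i in sorted(clusters)]


# A plain dict standing in for doc.spans.
spans = {
    "coref_mentions": [(0, 2), (3, 4), (5, 6), (12, 13)],
    "coref_clusters_1": [(0, 2), (5, 6)],
    "coref_clusters_2": [(3, 4), (12, 13)],
    "coref_clusters_draft": [(0, 2)],  # shares the prefix but is not a cluster index
}
clusters = get_clusters(spans)
```

The `coref_clusters_draft` entry shows the fragility: any unrelated key that happens to share the prefix has to be filtered out by convention, and nothing in the data structure itself records which keys form one clustering.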

@svlandeg svlandeg added the ⚠️ wip Work in progress label Mar 1, 2021
@svlandeg svlandeg marked this pull request as ready for review March 3, 2021 12:49

svlandeg commented Mar 3, 2021

I'll just merge this into the new WIP feature/coref branch on explosion, so we can continue the discussion there.

@svlandeg svlandeg merged commit e0c45c6 into explosion:feature/coref Mar 3, 2021
@svlandeg svlandeg deleted the feature/coref branch March 3, 2021 12:50