Basic framework and APIs for entity linker #3459

Merged
merged 68 commits into from
Mar 29, 2019

svlandeg

This PR is the first to start addressing Issue #3339 to implement Entity Linking functionality in spaCy. While most functionality is still "dummy", any feedback on the code structure & APIs is welcome before continuing with the next bit of work. I'd expect some of this code to still change quite a bit when we actually start implementing non-dummy algorithms.

Description

Basic architecture and APIs for Entity Linking functionality:

  • spans and their underlying tokens get a kb_id and ent_kb_id field respectively

  • added a pipe called el to nlp, referring to an instance of EntityLinker

  • The current implementation of EntityLinker simply selects the candidate entity with the highest prior probability.

  • kb.pxd and kb.pyx hold the knowledge base (based on @honnibal's notes)

    • The KB works like this: first add all entities. Then per alias/mention, add each candidate entity with its prior probability
    • Candidate is an object generated from the candidate_generator and used as input to the actual EntityLinker
    • Entity ID (string) is translated to Entity hash (using vocab) and then to Entry index (using _entry_index). The entry index then points to an _EntryC struct (ID, name, features, ...) in the _entries vector
    • Alias (string) is translated to Alias hash (using vocab) and then to Alias index (using _alias_index). The alias index then points to an _AliasC struct (entry candidates + prior probs) in the _aliases_table vector
  • added some unit tests to check the above behaviour

    • reading and writing of kb_id in span
    • valid and invalid construction of a KB
  • examples/pipeline/dummy_entity_linking.py: Shows how the current functionality works
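
To make the two-step lookup described above concrete, here is a minimal pure-Python sketch of the idea. It is not the Cython code from this PR: `ToyKB`, `Entry`, `Alias`, `string_hash` and `link` are hypothetical names, `zlib.crc32` merely stands in for the vocab's string hashing, and the validation rules (unknown entity, priors summing to more than 1) are assumptions about what an "invalid" KB might mean. The entity IDs are made up.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional


def string_hash(text: str) -> int:
    """Stand-in for the vocab's string-to-hash lookup (assumption: any stable hash will do)."""
    return zlib.crc32(text.encode("utf8"))


@dataclass
class Entry:
    # The real _EntryC struct is described as holding ID, name, features, ...;
    # this sketch keeps only the ID.
    entity_id: str


@dataclass
class Alias:
    entry_indices: List[int]  # positions in the entries vector
    priors: List[float]       # prior probability per candidate


@dataclass
class Candidate:
    entity_id: str
    alias: str
    prior_prob: float


class ToyKB:
    def __init__(self) -> None:
        self._entry_index: Dict[int, int] = {}   # entity hash -> index in _entries
        self._entries: List[Entry] = []
        self._alias_index: Dict[int, int] = {}   # alias hash -> index in _aliases_table
        self._aliases_table: List[Alias] = []

    def add_entity(self, entity_id: str) -> None:
        key = string_hash(entity_id)
        if key in self._entry_index:
            raise ValueError(f"Entity {entity_id!r} already exists")
        self._entry_index[key] = len(self._entries)
        self._entries.append(Entry(entity_id))

    def add_alias(self, alias: str, entities: List[str], probabilities: List[float]) -> None:
        # Assumed validity checks, for illustration only.
        if len(entities) != len(probabilities):
            raise ValueError("Need exactly one prior probability per entity")
        if sum(probabilities) > 1.0 + 1e-9:
            raise ValueError("Prior probabilities for one alias must not exceed 1")
        indices = []
        for entity_id in entities:
            key = string_hash(entity_id)
            if key not in self._entry_index:
                raise ValueError(f"Unknown entity {entity_id!r}: add all entities first")
            indices.append(self._entry_index[key])
        self._alias_index[string_hash(alias)] = len(self._aliases_table)
        self._aliases_table.append(Alias(indices, list(probabilities)))

    def get_candidates(self, alias: str) -> List[Candidate]:
        index = self._alias_index.get(string_hash(alias))
        if index is None:
            return []
        record = self._aliases_table[index]
        return [
            Candidate(self._entries[i].entity_id, alias, prior)
            for i, prior in zip(record.entry_indices, record.priors)
        ]


def link(kb: ToyKB, mention: str) -> Optional[str]:
    """Dummy linker: pick the candidate with the highest prior probability."""
    candidates = kb.get_candidates(mention)
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.prior_prob).entity_id


kb = ToyKB()
kb.add_entity("Q1")  # entities first ...
kb.add_entity("Q2")
kb.add_alias("Douglas", ["Q1", "Q2"], [0.6, 0.3])  # ... then aliases with priors
best = link(kb, "Douglas")
```

With this toy data, `best` is `"Q1"`, and a mention that was never registered as an alias yields no candidates at all.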

Open questions

  • get_candidates sits somewhere in between the internal KB structure and the actual implementation of the Entity Linker, so I'm not sure where to put it. It is currently part of kb.pyx. Different algorithms should be able to override its behaviour, but that may in turn influence the actual storage in the KB (e.g. case invariance: "Douglas", "DOUGLAS" and "douglas" all mapping to the same entries/candidates). Something to think about :-)

  • should we always assume there is a vocab from nlp, or should we have it as optional argument and create a new one if none was given?
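
The case-invariance question can be illustrated with a small sketch. This is only one possible approach, not something this PR implements: `CaseInsensitiveAliases` and its methods are hypothetical names, and the entity IDs are made up. The point it tries to show is that the normalization has to be applied both when an alias is stored and when a mention is looked up, which is why candidate generation and KB storage are hard to separate cleanly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class CaseInsensitiveAliases:
    """Toy alias table that treats "Douglas", "DOUGLAS" and "douglas" as the same key."""

    def __init__(self) -> None:
        # normalized alias -> list of (entity_id, prior probability)
        self._table: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    @staticmethod
    def normalize(alias: str) -> str:
        # The same function must be used at write time and at read time.
        return alias.casefold()

    def add_alias(self, alias: str, entity_id: str, prior: float) -> None:
        self._table[self.normalize(alias)].append((entity_id, prior))

    def get_candidates(self, mention: str) -> List[Tuple[str, float]]:
        return list(self._table.get(self.normalize(mention), []))


aliases = CaseInsensitiveAliases()
aliases.add_alias("Douglas", "Q1", 0.6)
aliases.add_alias("DOUGLAS", "Q2", 0.3)
candidates = aliases.get_candidates("douglas")
```

Both registrations end up under the same key here, so the lookup for "douglas" sees both candidates. A case-sensitive algorithm could not recover the original distinction from this storage, which is the trade-off raised above.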

Issues not yet addressed

  • vectors & features & types in kb (it seems easier to start thinking about this in the context of a non-dummy algorithm)
  • documentation not yet updated (perhaps it makes sense to continue with the next phases first, to get to more stable APIs)

Types of change

New feature

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines added the enhancement, feat / ner and feat / pipeline labels on Mar 22, 2019
Review thread on spacy/kb.pyx (outdated, resolved)
Review thread on setup.py (resolved)
Review thread on examples/pipeline/dummy_entity_linking.py (resolved)
Review thread on examples/pipeline/dummy_entity_linking.py (outdated, resolved)
Review thread on spacy/kb.pyx (outdated, resolved)
Review thread on spacy/kb.pyx (outdated, resolved)
Review thread on examples/pipeline/dummy_entity_linking.py (outdated, resolved)
@ines merged commit 6890006 into explosion:master on Mar 29, 2019
@svlandeg deleted the feature/el-framework branch on April 2, 2019 07:51
@svlandeg restored the feature/el-framework branch on April 2, 2019 07:52
@svlandeg deleted the feature/el-framework branch on April 2, 2019 07:52
@svlandeg mentioned this pull request on Aug 1, 2019 (3 tasks)
4 participants