
💫 Entity Linking in spaCy #3339

Closed
svlandeg opened this issue Feb 27, 2019 · 40 comments
Labels
enhancement (Feature requests and improvements), feat / ner (Feature: Named Entity Recognizer)

Comments

@svlandeg
Member

Feature description

With @honnibal & @ines we have been discussing adding an Entity Linking module to spaCy. This module would run on top of NER results and disambiguate & link tagged mentions to a knowledge base. We are thinking of implementing this in a few different phases:

  1. Implement an efficient encoding of a knowledge base + all APIs / interfaces, to integrate with the current processing pipeline (see the usage sketch after this list). We would take the following components of EL into account:
    • Candidate generation
    • Encoding document context
    • Encoding local context
    • Type prediction
    • Coreference resolution / ensuring global consistency
  2. Implement a model that links English texts to English Wikipedia entries
  3. Implement a cross-lingual model that links non-English texts to English Wikipedia entries
  4. Fine-tune WP linking models to be able to ship them as such
  5. Implement support in Prodigy to perform custom EL annotations for your specific project
  6. Test / implement the models on a different domain & non-wikipedia knowledge base
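To make phase 1 a bit more concrete, here is a rough sketch of how an entity linking component could slot into the existing processing pipeline. The component name `entity_linker` and the `ent.kb_id_` attribute are illustrative here, not a committed API.

```python
# Illustrative sketch only — component and attribute names are hypothetical,
# not a committed API: an entity linker running after NER and writing
# knowledge-base identifiers onto the recognized entities.
import spacy

nlp = spacy.load("en_core_web_lg")            # any model with an NER component
el = nlp.create_pipe("entity_linker")         # hypothetical EL component
nlp.add_pipe(el, last=True)                   # runs on top of the NER results

doc = nlp("Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.")
for ent in doc.ents:
    # ent.kb_id_ would hold the linked KB identifier (e.g. a Wikipedia/Wikidata ID)
    print(ent.text, ent.label_, ent.kb_id_)
```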

Notes

As prior research, we compiled some notes on this project & its requirements: https://drive.google.com/file/d/1UYnPgx3XjhUx48uNNQ3kZZoDzxoinSBF. These notes contain more details on the EL components and implementation phases.

Feedback requested

We will start implementing the APIs soon, but we would love to hear your ideas, suggestions, requests with respect to this new functionality first!

@honnibal added the enhancement label Feb 27, 2019
@honnibal changed the title from "Entity Linking in spaCy" to "💫 Entity Linking in spaCy" Feb 27, 2019
@ines added the feat / ner label Feb 27, 2019
@wejradford

This sounds really exciting!

I'm curious how this relates to a task I've used spaCy for in the past (others may have too). The use-case is that you're a user with a small KB, which is a set of entities (possibly with aliases) that you want to link text to. Currently, you can roll your own system using the existing NER with rule-based patches when required, then matching and ranking candidates. But if you already had a big Wikipedia model available for linking, I can imagine wanting to try to match into the small KB and then back off to Wikipedia (or is there some weird KB-composition operation???), ideally with the same interfaces.

This kind of thing is almost certainly an MVP non-goal, but I'm interested to see if it's something the team is thinking about.

@turbolent

turbolent commented Feb 27, 2019

Is there any plan to integrate this with Wikidata?

@svlandeg
Member Author

@turbolent : as our focus will be on linking to Wikipedia in the first phases, I think an integration with Wikidata will come naturally. There are crosslinks between the two anyway, and I've seen some prior work where the Wikidata knowledge graph was used to tune the prior probabilities for P(entity|alias). So yea, I think it's an important resource to exploit.
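For readers unfamiliar with how such priors are usually derived: P(entity|alias) is typically estimated from Wikipedia's own anchor links, by counting how often a given anchor text points to each target page. A minimal sketch (the `wiki_links` list is a toy stand-in for pairs extracted from a Wikipedia dump, not spaCy code):

```python
from collections import Counter, defaultdict

# Toy stand-in for (anchor_text, target_entity) pairs extracted from a Wikipedia dump.
wiki_links = [
    ("Adams", "Douglas Adams"),
    ("Adams", "Douglas Adams"),
    ("Adams", "John Adams"),
    ("Douglas Adams", "Douglas Adams"),
]

link_counts = defaultdict(Counter)
for anchor, entity in wiki_links:
    link_counts[anchor][entity] += 1

def prior_prob(alias, entity):
    """P(entity | alias) = count(alias -> entity) / count(alias -> any entity)."""
    counts = link_counts[alias]
    total = sum(counts.values())
    return counts[entity] / total if total else 0.0

print(prior_prob("Adams", "Douglas Adams"))  # 2/3 in this toy sample
```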

@svlandeg
Member Author

@wejradford : interesting use-case. So basically you'd want to link to your small KB and have WP linking as a sort of fall-back? But then would you want to keep track of two different knowledge DBs or would you somehow unify/merge them?

@anneschuth

Really cool!

Wondering about point 3: wouldn't it be easier to link non-English text to the non-English Wikipedia? And then use language links from that non-English Wikipedia to jump to the English Wikipedia? It would have the added benefit of being able to (also) use the non-English Wikipedia links.

@honnibal
Member

@anneschuth I'd rather have a canonical knowledge base with potentially language-specific feature vectors. That feels a little bit more generic and less Wikipedia-specific to me. I think it also makes sense to do the KB reconciliation once, as an offline data-dependent step, and not mix it into the runtime.

@wejradford

@svlandeg that's a good question. For the use-cases I'm familiar with, I wouldn't expect spaCy to merge them. For example, I might want to pull out general entities from Wikipedia, but also some specific things like less-prominent entities of the usual type, or perhaps commands/intents if it's a dialogue case.

@svlandeg
Member Author

@wejradford : gotcha. I think we were sort of assuming one KB and one EL component per pipeline, but it's an interesting use-case to think about...

@dodijk

dodijk commented Feb 28, 2019

Cool feature!

+1 for linking non-English text to non-English Wikipedia and then optionally exploiting Wikipedia's cross-language links. The estimation of P(entity|alias) makes a lot more sense to me if the aliases are in the same language as the mentions. Plus, I expect the coverage of entities for a document collection in a specific language to be higher in the corresponding non-English Wikipedia than in the English Wikipedia.

@svlandeg
Member Author

svlandeg commented Feb 28, 2019

The downside is of course that any non-English WP is significantly smaller than the English one (https://meta.wikimedia.org/wiki/List_of_Wikipedias). Of course you can still use the interwiki links to go from the overlapping subset of WP:EN to the set of links available in the other language.

I do agree with the idea that exploiting non-English WP links would be useful, and the prior probabilities & candidate generation for XEL will certainly be an interesting task to tackle...

@svlandeg
Member Author

svlandeg commented Mar 1, 2019

By the way, the XELMS paper by Upadhyay et al. at EMNLP'18 has some interesting results around the topic of cross-lingual entity linking. Basically they train an XEL model using multilingual supervision, exploiting both the richer content in WP:EN as well as language-sensitive information in the target language. The paper has some interesting experiments (e.g. Table 3) comparing to prior work as well as comparing their system using either monolingual or joint supervision.

@p-sodmann
Contributor

So, this will basically make disambiguation easier?
What will the implementation of the knowledge base look like? Will I be able to connect my triplet store somehow?

@svlandeg
Member Author

svlandeg commented Mar 4, 2019

We're aiming for an in-memory implementation using a Cython backend, so you'll probably have to convert your triplet store to the spaCy KB structure using APIs that we'll make available for adding entities, aliases and prior probabilities. Would that work for you?
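For illustration, the kind of API being discussed could look roughly like the `KnowledgeBase` interface that eventually shipped in spaCy 2.2; method names and signatures here are indicative and may differ between versions, and the QIDs, frequencies and probabilities are made-up examples:

```python
import spacy
from spacy.kb import KnowledgeBase  # shipped in spaCy 2.2; names may differ by version

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)

# Entities are registered with a frequency and a pre-computed entity vector.
kb.add_entity(entity="Q42", freq=342, entity_vector=[0.0] * 64)
kb.add_entity(entity="Q123456", freq=12, entity_vector=[0.0] * 64)  # illustrative QID

# Aliases map a surface form to candidate entities with prior probabilities P(entity|alias).
kb.add_alias(alias="Adams", entities=["Q42", "Q123456"], probabilities=[0.7, 0.2])

print(kb.get_candidates("Adams"))  # candidate entities for this mention string
```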

@p-sodmann
Contributor

That would be great. Currently I am "converting" them to the phrase matcher, so it will probably be no issue.

@mhham
Contributor

mhham commented Mar 4, 2019

You should definitely take a look at this to encode context and Wikipedia articles:
https://github.com/wikipedia2vec/wikipedia2vec

And this for some great state of the art: https://github.com/openai/deeptype
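As a rough illustration of how wikipedia2vec embeddings could supply entity and context encodings (the model file name is an example of a pretrained model; method names follow the wikipedia2vec docs):

```python
# Sketch: using wikipedia2vec embeddings as entity / context encodings.
import numpy as np
from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")  # example pretrained model

entity_vec = wiki2vec.get_entity_vector("Douglas Adams")
context_vec = np.mean(
    [wiki2vec.get_word_vector(w) for w in ["novel", "radio", "comedy"]], axis=0
)

# Cosine similarity between the mention's context and the candidate entity.
score = np.dot(entity_vec, context_vec) / (
    np.linalg.norm(entity_vec) * np.linalg.norm(context_vec)
)
print(score)
```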

@svlandeg
Member Author

svlandeg commented Mar 5, 2019

@turbolent @anneschuth @dodijk : So we're thinking about centering the initial work around Wikidata specifically. This would ensure we have the same KB across languages and would support the cross-lingual linking in a (hopefully) more straightforward fashion.

An additional advantage would be that IDs are more stable (WP titles can change). Also, Wikidata seems to have much more coverage (WP:EN has 5.8M pages, Wikidata has 55M entities). For example, @honnibal doesn't seem to have a WP page but does have a Wikidata entry Q47153978 : https://www.wikidata.org/wiki/Q47153978 :-)

@Tpt
Contributor

Tpt commented Mar 19, 2019

Planning to do the annotations using Wikidata is great!

The Wikidata canonical URIs are like http://www.wikidata.org/entity/Q47153978
https://www.wikidata.org/wiki/Q47153978 is the URL of the wiki page describing the entity.

@DeNeutoy
Contributor

Hi @svlandeg - I'm one of the authors of scispaCy. Cool to see you were thinking of using it in your stage 6 design doc. We've actually begun to do a bit of work on this ourselves (nothing fancy, mainly just some string matching/sklearn classifier type of approaches), and we have a knowledge base (a filtered version of the Unified Medical Language System, such that we can distribute it). It has 2.78M concepts (down from 3.3M concepts in the full UMLS release) and it covers 99% of the entities in the MedMentions dataset.

We'd be happy to help out, either by testing out components as you go, or by implementing the entity linking system you land on for the biomedical domain and seeing what goes wrong. The MedMentions data is very nice and easy to work with - it's another option to consider as well as the gene databases that you have listed in your presentation.

@svlandeg
Member Author

Hi @DeNeutoy ! Nice work on scispaCy :-)

We're currently focusing on getting the general architecture & APIs in place, which should allow you to connect any KB and EL model into the usual spaCy nlp processing pipeline. This may take a bit of iteration to make sure it covers all use-cases, and it would be great to get your feedback on the PRs to come (or from anyone else contributing to this thread, of course!). I'm making some good progress on this, so I hope to have something preliminary out soonish.

And it would definitely be great to collaborate on getting the infrastructure to work for the biomedical domain, BioNLP being my first true love ;-)
Let's keep talking!

@sammous
Contributor

sammous commented Mar 21, 2019

I am really looking forward to this feature, which I have needed in the past.
I will probably need it again in the future, with a personal KB, so I am specifically wishing for step 6.
I used wikipedia2vec and was quite impressed by the quality/efficiency of the code.
I think the AI2 guys (cc @DeNeutoy) did a fantastic job on the AI/Med KB, so you should definitely keep talking.
Cheers

@svlandeg
Member Author

PR #3459: first general framework, APIs etc., using a simple dummy algorithm for now. All feedback welcome!

@ibeltagy

Hi @svlandeg,
I work with @DeNeutoy on scispaCy and I wanted to share some thoughts about supporting non-Wikipedia KBs. It seems to me that most of the design for Wikipedia will generalize to non-Wikipedia KBs except the candidate generation part. A resource like CrossWikis should make it easy to get the prior probability P(e|m), but such a resource won't be available for other KBs.

One potential solution is featurizing entity names in the KB and the text span, then using cosine similarity to generate candidates. The features could be sparse (I tried char n-grams for scispaCy entity linking and they work well) and/or dense (embeddings for the text; it would be nice if the embeddings are character-based rather than word-based). It is also possible to have the featurizing function be an input to the NEL model.

With a KB of millions of candidates, you will also need some form of fast approximate nearest neighbors. I tried nmslib, and it is fast and works really well (after some parameter tuning).

Another improvement is having the features weighted before computing cosine similarities. I tried simple tf-idf weighting of the char-n-gram features, but a fancier solution is learning feature weights as in Chen and Van Durme, EACL 2017 (https://github.com/ctongfei/probe).
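A minimal sketch of the char-n-gram candidate generation described above, using scikit-learn's brute-force cosine search to keep the example short; with millions of names you would swap in an approximate index such as nmslib's HNSW. The toy KB and QIDs are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

kb_names = ["Douglas Adams", "John Adams", "Adams, Massachusetts"]  # toy KB
kb_ids = ["Q42", "Q11806", "Q613569"]                               # illustrative QIDs

# Character n-gram tf-idf features over entity names.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
name_matrix = vectorizer.fit_transform(kb_names)

# Brute-force cosine search; replace with an approximate index at scale.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(name_matrix)

def candidates(mention):
    """Return (kb_id, cosine_distance) pairs for the closest entity names."""
    query = vectorizer.transform([mention])
    distances, indices = index.kneighbors(query)
    return [(kb_ids[i], d) for i, d in zip(indices[0], distances[0])]

print(candidates("D. Adams"))
```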

@svlandeg
Member Author

Hi @ibeltagy : thanks for the pointers and ideas! Definitely worth considering; you're right that prior probabilities are not always easy to come by, so we'll have to make sure that there are viable alternatives.

@koenvanderveen

@svlandeg, nice project! Loading a large KB into memory can be quite time consuming. Do you have any ideas regarding speeding up loading?

@svlandeg
Member Author

It's sort of next up on the TODO list to try and load a significant part of Wikidata into memory and see how that goes ;-)

@pwichmann

Until spaCy includes entity linking, what would be the next-best system that is free and could be used out of the box?

Are there usable solutions based on OpenAI's DeepType paper? Or wikipedia2vec?

It seems to me that a joint NER and EL solution would be way better than any system that is strictly sequential and attempts to solve NER before linking entities.

Other solutions on my list:

@Tpt
Contributor

Tpt commented Apr 24, 2019

A new paper about entity linking with Wikidata that might be relevant: OpenTapioca: Lightweight Entity Linking for Wikidata

@svlandeg
Member Author

@Tpt : it's a tempting idea to not have to rely on WP data, but I think it comes with serious limitations, too. As the authors point out, there's no good way to get prior probabilities, and the aliases you obtain are somewhat artificially clean. For instance, using WP links and coreference resolution, you could find a whole bunch of candidate entities for "The president", while this particular mention is probably too vague to be added to Wikidata. But the vagueness of it is realistic in actual texts.

I like the approach of exploiting the Wikidata knowledge graph to improve semantic similarity between the entities though!

@svlandeg
Member Author

svlandeg commented May 1, 2019

As a quick update, and also in reply to @koenvanderveen: we've written a custom reader/writer to store the KB (a Cython data structure) on file and read the entries back in in bulk. As a POC, focusing on the Person NER type, we selected all humans and fictional humans from Wikidata and linked them to their WP:EN articles, yielding a set of 1.6M entities. WP interwiki links are used to generate realistic aliases and their prior probabilities, obtaining about 1.3M aliases. This KB is written to file and read back in within a matter of seconds, and the file size is about 55MB. This does not yet include storing any type of additional features/vectors for the entities, which is what we'll tackle next.
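For reference, the serialization round trip described here looks roughly like the following sketch; method names follow spaCy 2.x (`dump` / `load_bulk`), while later versions use `to_disk` / `from_disk`:

```python
import time
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
# ... populate kb with entities, aliases and prior probabilities ...
kb.dump("kb.bin")  # write the Cython data structure to file

start = time.time()
kb2 = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb2.load_bulk("kb.bin")  # read all entries back in bulk
print(f"Loaded {kb2.get_size_entities()} entities in {time.time() - start:.2f}s")
```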

@pwichmann

Any updates on this? I'm probably most interested in linking ORG entities. It would be amazing if spaCy could cover the problem domain of entity linking as well.

@svlandeg
Member Author

@pwichmann : yep - we're definitely in full swing working on this. We're currently designing and testing the neural net to train the entity linker on the Person types, just to start somewhere, but once we get good signal we'll build from there and expand to other entities including ORG.

The current setup is roughly like this:

Candidate generation:

  • use prior probabilities extracted from Wikipedia links to obtain a set of most likely Wikidata identifiers for a mention in text

Candidate ranking:

  • entity encoder: encodes the Wikidata description + relation tuples of the candidate
  • sentence encoder: encodes the local context of the mention
  • article encoder: encodes the global context of an article/doc
  • type encoder: takes as input the predicted NER type

Each candidate+mention pair is then run through the network and a probability is obtained as to how likely they match. This output is then combined with the prior probabilities of the candidates to obtain a final score for each pair.
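A toy sketch of that final scoring step (not the actual spaCy model; the blending weight and all numbers are made up): the network's context-similarity output is blended with the candidate's prior probability, and the highest-scoring candidate wins.

```python
def final_score(prior_prob, context_sim, alpha=0.5):
    """Blend P(entity | alias) with the encoder similarity; alpha is a tunable weight."""
    return alpha * prior_prob + (1 - alpha) * context_sim

candidates = {
    "Q42": {"prior": 0.80, "context_sim": 0.35},     # illustrative numbers
    "Q11806": {"prior": 0.15, "context_sim": 0.90},
}
best = max(
    candidates,
    key=lambda qid: final_score(candidates[qid]["prior"], candidates[qid]["context_sim"]),
)
print(best)  # the candidate with the highest blended score
```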

@jcnewell

jcnewell commented Jun 6, 2019

> @turbolent : as our focus will be on linking to Wikipedia in the first phases, I think an integration with Wikidata will come naturally. There are crosslinks between the two anyway, and I've seen some prior work where the Wikidata knowledge graph was used to tune the prior probabilities for P(entity|alias). So yea, I think it's an important resource to exploit.

If spaCy (or its users) plan to link to Wikipedia, be aware that Wikipedia URLs can be unstable over time. If you need to store links persistently, I've found it's best to use Wikidata IDs and then resolve these to Wikipedia URLs as late as possible.
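A small sketch of that late resolution step, going from a stored QID to the current English Wikipedia URL via Wikidata's public `Special:EntityData` endpoint (the function name is illustrative):

```python
import requests

def wikipedia_url(qid):
    """Resolve a Wikidata QID to its current English Wikipedia URL (or None)."""
    data = requests.get(
        f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json", timeout=10
    ).json()
    title = data["entities"][qid].get("sitelinks", {}).get("enwiki", {}).get("title")
    return f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}" if title else None

print(wikipedia_url("Q42"))  # -> https://en.wikipedia.org/wiki/Douglas_Adams
```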

@svlandeg
Member Author

svlandeg commented Jun 6, 2019

@jcnewell : yep, that's exactly what we'll be doing - the knowledge base will be centered around Wikidata IDs.

@tsoernes

tsoernes commented Jun 6, 2019 via email

@svlandeg
Member Author

svlandeg commented Jun 6, 2019

@tsoernes : that should be possible, but then you'll have to plug in your own KB and your own training data.

@kermitt2

kermitt2 commented Jun 6, 2019

Hello!

Regarding free open source tools, you can add entity-fishing to the list, which does Wikidata entity recognition and disambiguation for 5 languages, with many options, and is not restricted to NER (it can be restricted to NER of course).

It does that at scale... since 2017... I think I failed a bit on the communication side for this tool :)

Documentation
Demo

I've started to add a DeepType implementation at some point (with a BidLSTM-CRF model for final typing), but it's not progressing a lot due to other ongoing projects.

@kermitt2

kermitt2 commented Jun 6, 2019

One piece of info that you might find useful: the way I managed to efficiently use millions of entities, vocabularies, links, statistics, and word and entity embeddings in different languages was to use LMDB as an embedded database. Basically there is no loading; all the resources are immediately "warm", and I could get up to 600K multithreaded accesses per second (with an SSD). It was in Java, but the Python LMDB binding is very good and robust too.
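A minimal sketch of that setup with the Python `lmdb` binding (the key/value layout is illustrative): the database is memory-mapped on open, so there is no explicit load step and lookups are served from the OS page cache.

```python
import json
import lmdb

env = lmdb.open("kb.lmdb", map_size=2 * 1024 ** 3)  # 2 GB memory map

# Write: store each entity record under its Wikidata QID.
with env.begin(write=True) as txn:
    txn.put(b"Q42", json.dumps({"name": "Douglas Adams", "type": "PERSON"}).encode())

# Read: entries are memory-mapped, so lookups are "warm" immediately.
with env.begin() as txn:
    record = json.loads(txn.get(b"Q42").decode())
print(record["name"])
```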

@louisguitton
Contributor

> Each candidate+mention pair is then run through the network and a probability is obtained as to how likely they match. This output is then combined with the prior probabilities of the candidates to obtain a final score for each pair.

If I follow your design doc and this thread correctly: you obtain the final score for each candidate+mention pair, apply a confidence threshold, and get a link or NIL for each mention; context is encoded by the sentence encoder and the article encoder.
Have you looked at approaches like TAGME? I know it might not be state of the art, but that's the model I currently use. Its voting scheme uses the context in a different way, and you consider all candidate+mention pairs at once to disambiguate.

Interested in your thoughts on how good a baseline it could be.

@honnibal
Member

honnibal commented Oct 3, 2019

I think we can close this now 🎉 . We're still working on better models, but the functionality is all shipped.

@honnibal closed this as completed Oct 3, 2019
@lock

lock bot commented Nov 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this as resolved and limited conversation to collaborators Nov 2, 2019