💫 Improve model loading & recommended way: link vs. import #1284

Closed
ines opened this issue Aug 22, 2017 · 2 comments
Labels: help wanted (Contributions welcome!) · models (Issues related to the statistical models) · 🌙 nightly (Discussion and contributions related to nightly builds)


ines commented Aug 22, 2017

Inspired by #1283.

There are currently two ways of loading spaCy models: either by creating a shortcut link and loading them via spacy.load(), or by using the native import and then calling the model's load() method. Both of them have advantages and disadvantages in different situations.

import spacy
nlp = spacy.load('en')

import en_core_web_sm
nlp = en_core_web_sm.load()

However, especially for new users, model linking and loading often feels confusing, because it introduces too many concepts at once, and keeps the model loading process (and how models are organised) fairly opaque. It's also a common source of problems, as shortcut links create symlinks and require permissions to write to the spacy/data directory.

Going forward, we're thinking about improving the way model links are handled and created, and potentially encouraging the import syntax as the simple, recommended way of loading models (especially for beginners). spacy.load() will obviously still be the more flexible and customisable alternative.

Some background on spacy link vs. import

Originally, the symlinks were introduced so that users only had to download a model once and could then link it across environments (the models are quite large, up to 1GB). It also allowed us to keep backwards compatibility on spacy.load('en'), even as more English models became available.

In spaCy v2.0, most models will be much smaller (~15MB for the default English model). So except for the word vectors (which will still be around 500MB), re-downloading models won't be such a big issue anymore. It'd then also be possible for us to upload the models to PyPI and conda, so users can do something like pip install spacy_en_core_web_sm.

Advantages of import

  • Easier debugging and better, native error messages. If importing a model fails because it's not installed, Python will tell you. Your code will also fail immediately, instead of somewhere down the line when you call spacy.load. This also makes testing and CI workflows more convenient (especially relevant for production users).
  • More transparency. It's clear which model packages are being used, and you always know exactly which model you're loading. Otherwise, you might end up with the wrong model linked as en and not notice until much later when you get unexpected results.
  • Better collaboration and dependency management. You don't have to rely on all of your team members having the same models linked to the same shortcuts. Instead, all you have to do is add the model packages to the requirements.txt and setup.py.
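
As a rough illustration of the testing and CI point above, here is a minimal sketch of a check that reports which model packages are missing from the current environment. The helper name missing_models and the REQUIRED_MODELS list are hypothetical and not part of spaCy; only the package name en_core_web_sm comes from this issue.

```python
from importlib.util import find_spec

# Hypothetical list of model packages a project depends on,
# mirroring what would be pinned in requirements.txt.
REQUIRED_MODELS = ["en_core_web_sm"]


def missing_models(package_names):
    """Return the top-level package names that cannot be found.

    find_spec() returns None for a top-level module that is not
    installed, so nothing is actually imported or loaded here.
    """
    return [name for name in package_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_models(REQUIRED_MODELS)
    if missing:
        # Fail early and name the packages, instead of failing later
        # at the first call that needs the model.
        raise SystemExit("Missing model packages: " + ", ".join(missing))
```

Run as a script at the start of a CI job, this would stop the build before any model-dependent code is reached.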

Advantages of spacy link and spacy.load()

  • Shorter and easier to read.
  • Loading models programmatically from a string. For example, if your models are ['en', 'de'], you can pass the string names into spacy.load() when you need them, without having to import them first or use importlib workarounds.
  • Custom names for models.
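
For comparison, the kind of importlib workaround mentioned above might look roughly like the sketch below. It assumes only that a model package exposes a load() function, as in the en_core_web_sm.load() example earlier. The SHORTCUTS table and the load_model helper are hypothetical, and the 'en' mapping is purely illustrative.

```python
import importlib

# Illustrative shortcut table (an assumption, not spaCy's own registry).
SHORTCUTS = {"en": "en_core_web_sm"}


def load_model(name, shortcuts=SHORTCUTS):
    """Load a model from a string, which may be a shortcut or a package name."""
    # Fall back to treating the string as the package name itself.
    package_name = shortcuts.get(name, name)
    # import_module raises ImportError right away if the package is missing.
    module = importlib.import_module(package_name)
    # Model packages are assumed to expose a load() function.
    return module.load()
```

With the model package installed, load_model('en') and load_model('en_core_web_sm') would both go through the same native import, so a missing package still surfaces as a regular ImportError.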

Solutions

  • Use entry points to make installed models available to spaCy (see here for a quick explanation of how this could work). spaCy would then define an entry point group spacy_models, and each model package could register the data it makes available in that group. This would also allow your custom models to hook into spaCy.
  • Find a solution to make the entry points work with custom names and shortcuts like en and de.
  • Update the documentation and quickstart guides to highlight the import syntax first (?)
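
To make the entry points idea more concrete, here is a minimal sketch of how discovery could work using the standard library's importlib.metadata. This is not how spaCy does it. It assumes a model package registers a callable in the spacy_models group, for instance a setup.py entry such as "en = en_core_web_sm:load" (a hypothetical registration). The function names discover_models and load_registered_model are also made up for illustration.

```python
from importlib.metadata import entry_points

MODEL_GROUP = "spacy_models"  # group name proposed in this issue


def discover_models(group=MODEL_GROUP):
    """Return {entry point name: EntryPoint} for packages registered in `group`."""
    all_eps = entry_points()
    if hasattr(all_eps, "select"):
        # Python 3.10+: entry points expose a select() interface.
        selected = all_eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        selected = all_eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_registered_model(name, group=MODEL_GROUP):
    """Look up `name` among registered models and call its loader."""
    models = discover_models(group)
    if name not in models:
        raise KeyError("No model registered as %r in group %r" % (name, group))
    # EntryPoint.load() imports and returns the registered object,
    # assumed here to be a zero-argument callable like a package's load().
    loader = models[name].load()
    return loader()
```

Under this assumption, the entry point name doubles as the shortcut, which is one possible answer to the custom-name question in the second bullet. Name clashes between packages would still need a policy.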

Interested to hear your feedback on this, which model loading solution you prefer and why, and what else you'd like to be able to do with spaCy models. Also curious about any experiences with entry points, gotchas and other things we should consider with this solution.

ines added the help wanted, models and 🌙 nightly labels Aug 22, 2017

ines commented Nov 9, 2017

Merging this with #1456!

ines closed this as completed Nov 9, 2017

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 8, 2018