💫 Improve model loading & recommended way: link vs. import #1284

ines · 2017-08-22T10:55:33Z

Inspired by #1283.

There are currently two ways of loading spaCy models: either by creating a shortcut link and loading them via spacy.load(), or by using the native import and then calling the model's load() method. Both of them have advantages and disadvantages in different situations.

import spacy
nlp = spacy.load('en')

import en_core_web_sm
nlp = en_core_web_sm.load()

However, especially for new users, model linking and loading often feel confusing, because they introduce too many concepts at once and keep the model loading process (and how models are organised) fairly opaque. Linking is also a common source of problems, as shortcut links create symlinks and require permissions to write to the spacy/data directory.

Going forward, we're thinking about improving the way model links are handled and created, and potentially encouraging the import syntax as the simple, recommended way of loading models (especially for beginners). spacy.load() will obviously still be the more flexible and customisable alternative.

Some background on `spacy link` vs. `import`

Originally, the symlinks were introduced so that users only had to download a model once and could then link it across environments (since the models are quite large, up to 1GB). It also allowed us to keep backwards compatibility on spacy.load('en'), even as more English models became available.

In spaCy v2.0, most models will be much smaller (~15MB for the default English model). So except for the word vectors (which will still be around 500MB), re-downloading models won't be such a big issue anymore. It'd then also be possible for us to upload the models to PyPI and conda, so users can do something like pip install spacy_en_core_web_sm.

Advantages of `import`

Easier debugging and better, native error messages. If importing a model fails because it's not installed, Python will tell you. Your code will also fail immediately, instead of somewhere down the line when you call spacy.load. This also makes testing and CI workflows more convenient (especially relevant for production users).
More transparency. It's clear which model packages are being used, and you always know exactly which model you're loading. Otherwise, you might end up with the wrong model linked as en and not notice until much later when you get unexpected results.
Better collaboration and dependency management. You don't have to rely on all of your team members having the same models linked to the same shortcuts. Instead, all you have to do is add the model packages to the requirements.txt and setup.py.
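
As a rough illustration of the fail-fast point, a project could check at startup (or in CI) that every required model package is importable. This is only a sketch: the helper name `missing_models` and the `REQUIRED_MODELS` list are hypothetical and not part of spaCy.

```python
import importlib.util

# Hypothetical list; in a real project this would mirror requirements.txt.
REQUIRED_MODELS = ["en_core_web_sm"]


def missing_models(package_names):
    """Return the model packages that cannot be found in this environment."""
    # find_spec() returns None when a top-level package is not installed.
    return [name for name in package_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_models(REQUIRED_MODELS)
    if missing:
        raise SystemExit("Missing model packages: " + ", ".join(missing))
```

Running the check once at startup surfaces a missing model right away, rather than later when the model is first loaded.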

Advantages of `spacy link` and `spacy.load()`

Shorter and easier to read.
Loading models programmatically from a string. For example, if your models are ['en', 'de'], you can pass the string names into spacy.load() when you need them, without having to import them first or use importlib workarounds.
Custom names for models.
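
For comparison, the importlib workaround mentioned above might look roughly like this. It is a minimal sketch that assumes each model package exposes a module-level load() function, as in the import example at the top of this issue. The helper name `load_model_package` is hypothetical.

```python
import importlib


def load_model_package(package_name):
    """Import an installed model package by its string name and load it."""
    try:
        module = importlib.import_module(package_name)
    except ImportError as err:
        # Re-raise with a message that names the missing model package.
        raise ImportError(
            "Model package %r is not installed" % package_name
        ) from err
    # Assumption: the package provides load(), like en_core_web_sm.load().
    return module.load()
```

With this, something like `load_model_package('en_core_web_sm')` would do the same as the two-line import version, but it only works with full package names, not with shortcuts like en.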

Solutions

Use entry points to make installed models available to spaCy (see here for a quick explanation of how this could work). spaCy would then define an entry point group spacy_models, and each model package could register the data it makes available in that group. This would also allow your custom models to hook into spaCy.
Find a solution to make the entry points work with custom names and shortcuts like en and de.
Update the documentation and quickstart guides to highlight the import syntax first (?)
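
To make the entry point idea a bit more concrete, here is a rough sketch of how lookup by name could work using the standard library's importlib.metadata. Only the group name spacy_models comes from the proposal above. The function names, the registration format in the comment, and the idea that each entry point resolves to a load() callable are assumptions for illustration, not an existing spaCy API.

```python
from importlib import metadata

GROUP = "spacy_models"  # group name proposed in this issue

# Assumption: a model package would register itself in its setup.py, e.g.
#   entry_points={"spacy_models": ["en_core_web_sm = en_core_web_sm:load"]}


def iter_model_entry_points(group=GROUP):
    """Return all entry points registered in the given group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        return list(eps.select(group=group))
    return list(eps.get(group, []))  # Python 3.8 / 3.9: plain dict


def load_by_name(name, entry_points=None):
    """Find the entry point called `name` and call the loader it points to."""
    if entry_points is None:
        entry_points = iter_model_entry_points()
    for ep in entry_points:
        if ep.name == name:
            loader = ep.load()  # imports the module and returns the callable
            return loader()
    raise LookupError("No model registered under the name %r" % name)
```

Custom names and shortcuts like en and de could in principle be extra entries in the same group pointing at the same loader, but how conflicts between packages would be resolved is left open here.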

Interested to hear your feedback on this, which model loading solution you prefer and why, and what else you'd like to be able to do with spaCy models. Also curious about any experiences with entry points, gotchas and other things we should consider with this solution.


ines · 2017-11-09T16:12:33Z

Merging this with #1456!

lock · 2018-05-08T09:27:51Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the labels help wanted (Contributions welcome!), models (Issues related to the statistical models) and 🌙 nightly (Discussion and contributions related to nightly builds) on Aug 22, 2017

ines mentioned this issue Oct 24, 2017

💫 Improve "spacy download" and support for different installation prefixes #1456

Closed

ines closed this as completed Nov 9, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
