💫 Improve model loading & recommended way: link vs. import #1284
Inspired by #1283.
There are currently two ways of loading spaCy models: either by creating a shortcut link and loading them via `spacy.load()`, or by using the native `import` and then calling the model's `load()` method. Both of them have advantages and disadvantages in different situations.

However, especially for new users, model linking and loading often feels confusing, because it introduces too many concepts at once, and keeps the model loading process (and how models are organised) fairly opaque. It's also a common source of problems, as shortcut links create symlinks and require permissions to write to the `spacy/data` directory.

Going forward, we're thinking about improving the way model links are handled and created, and potentially encouraging the `import` syntax as the simple, recommended way of loading models (especially for beginners). `spacy.load()` will obviously still be the more flexible and customisable alternative.

Some background on `spacy link` vs. `import`
Originally, the symlinks were introduced to allow users to only download a model once and then link it across environments (since the models are quite large and up to 1GB). It also allowed us to keep backwards compatibility on `spacy.load('en')`, even as more English models became available.

In spaCy v2.0, most models will be much smaller (~15MB for the default English model). So except for the word vectors (which will still be around 500MB), re-downloading models won't be such a big issue anymore. It'd then also be possible for us to upload the models to PyPI and conda, so users can do something like `pip install spacy_en_core_web_sm`.

Advantages of `import`
- If `import`ing a model fails because it's not installed, Python will tell you. Your code will also fail immediately, instead of somewhere down the line when you call `spacy.load`. This also makes testing and CI workflows more convenient (especially relevant for production users).
- You can't accidentally link the wrong model to `en` and not notice until much later, when you get unexpected results.
- Models can be listed as dependencies in `requirements.txt` and `setup.py`.
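A minimal sketch of the fail-fast behaviour described in this list, assuming models are installed as regular Python packages. The helper `require_models` and the package name in the usage note are hypothetical and not part of spaCy. It is roughly the programmatic equivalent of putting `import <model_package>` at the top of a module, which may be handy as a startup or CI check:

```python
import importlib


def require_models(package_names):
    """Import every model package up front and return the modules by name.

    Hypothetical helper: raises ImportError straight away if any package is
    missing, instead of failing later when the model is first used.
    """
    modules = {}
    missing = []
    for name in package_names:
        try:
            modules[name] = importlib.import_module(name)
        except ImportError:
            missing.append(name)
    if missing:
        raise ImportError("Missing model packages: " + ", ".join(missing))
    return modules
```

For example, calling `require_models(["en_core_web_sm"])` at startup (the package name is an assumption) would either return the imported module or stop the program with a clear error before any text is processed.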

Advantages of `spacy link` and `spacy.load()`
- If you have a list of model names like `['en', 'de']`, you can pass the string names into `spacy.load()` and load them when you need them, without having to import them first, or use `importlib` workarounds.

Solutions
- Entry points: spaCy could define an entry point group, for example `spacy_models`, and each model package could register the data it makes available in that group. This would also allow your custom models to hook into spaCy.
- [...] `en` and `de`.
- Recommend the `import` syntax first (?)

Interested to hear your feedback on this: which model loading solution you prefer and why, and what else you'd like to be able to do with spaCy models. Also curious about any experiences with entry points, gotchas and other things we should consider with this solution.
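To make the entry points idea a bit more concrete, here is a rough sketch of how discovery and loading by name might work. It only uses the standard library's `importlib.metadata`. The group name `spacy_models` comes from the proposal above; the functions `discover_models` and `load_model` are hypothetical, and it assumes each model package registers a callable loader (for instance via `entry_points={"spacy_models": ["en_core_web_sm = en_core_web_sm:load"]}` in its `setup.py`, which is also an assumption):

```python
from importlib.metadata import entry_points

# Group name taken from the proposal; not an existing spaCy API.
MODEL_GROUP = "spacy_models"


def discover_models(group=MODEL_GROUP):
    """Return {name: entry_point} for everything registered in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable collection of entry points
        found = eps.select(group=group)
    else:
        # Python 3.8 / 3.9: plain dict mapping group name -> entry points
        found = eps.get(group, [])
    return {ep.name: ep for ep in found}


def load_model(name, group=MODEL_GROUP):
    """Look up `name` among the registered models and call its loader."""
    models = discover_models(group)
    if name not in models:
        raise KeyError(
            "No model registered as %r (available: %s)" % (name, sorted(models))
        )
    # EntryPoint.load() imports the package and returns the registered object,
    # assumed here to be a zero-argument callable that returns the model.
    loader = models[name].load()
    return loader()
```

With something like this, loading by string name would not need symlinks or write access to `spacy/data`: any installed package that registers itself in the group becomes available, including custom models.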