The current pickling implementation was only supposed to be an exploratory kludge. However, I didn't leave a TODO and its status got lost.
`Vocab.__reduce__` currently writes state to temp files, which are then not cleaned up. Pickling therefore fills the disk, and really only pretends to work.
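To make the failure mode concrete, here's a minimal, self-contained sketch of the leaky pattern. This is a toy class, not spaCy's actual `Vocab` code; the loader function and attribute names are made up for illustration:

```python
import pickle
import tempfile


def _load_vocab(path):
    # Hypothetical module-level loader referenced by __reduce__ below.
    vocab = ToyVocab()
    with open(path, "rb") as f:
        vocab.data = f.read()
    return vocab


class ToyVocab:
    """Toy stand-in for the real class -- illustrates the leak only."""

    def __init__(self):
        self.data = b"\x00" * 1024  # pretend this is ~1GB of binary state

    def __reduce__(self):
        # State is dumped to a temp file that is never deleted, so every
        # pickle.dumps() call leaves another full copy of the data on disk.
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
        tmp.write(self.data)
        tmp.close()
        return (_load_vocab, (tmp.name,))


vocab2 = pickle.loads(pickle.dumps(ToyVocab()))  # "works", but leaks the temp file
```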
The root of the problem is that a number of spaCy classes carry large binary data structures. Common usage is to load this data and treat it as immutable; however, you can write to these models, e.g. to change the word vectors, and pickle should not silently discard those changes. On the other hand, it's harsh to assume we always need to write out the state. That would mean users who follow the pattern of keeping the data immutable have to write out ~1GB of data just to pickle the models, which makes typical usage with Spark etc. really problematic.
We could do this implicitly with copy-on-write semantics, but I don't think it's great to invoke some method that may or may not write out 1GB of data to disk, depending on the entire execution history of the program.
We could have a more explicit version of copy-on-write, where all the classes track whether they've been changed, and the models refuse to be pickled if the state is unclean. Users would then explicitly save the state after they change it. I think this is a recipe for having long-running processes suddenly die, though. Mostly, Python is designed around the assumption that things either can be pickled or they can't. It's surprising to find that your pickle works sometimes, depending on state, and then your long-running process dies because you didn't meet the assumed invariant. And the next time you run, you get an error in some other place in your code where the classes get pickled.
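For reference, the "refuse to pickle unclean state" variant would look roughly like this. Again a toy sketch: `set_vector`, `save` and the dirty flag are hypothetical names, not a proposed API:

```python
import pickle


class TrackedVocab:
    """Toy sketch of the dirty-flag alternative (hypothetical, not spaCy code)."""

    def __init__(self):
        self._vectors = {}
        self._dirty = False

    def set_vector(self, word, vector):
        self._vectors[word] = vector
        self._dirty = True              # any write marks the state unclean

    def save(self, path):
        # Pretend we persisted the full state to `path`, then clear the flag.
        self._dirty = False

    def __reduce__(self):
        if self._dirty:
            # The failure mode described above: a long-running job dies here,
            # far away from the write that made the state dirty.
            raise pickle.PicklingError("unsaved changes; call save() first")
        return (TrackedVocab, ())
```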
I've been thinking for a while that context managers are the idiomatic standard for dealing with this problem. The idea would be that if you want to write to any of this loaded data, you have to open it within a context manager, so that the changes are explicitly scoped, and you explicitly decide whether you want to save the changes or dump them.
Ignoring the naming of everything, this might look like:
```python
from spacy.en import English

nlp = English()

# Open a pre-trained model, do some more training, and save the changes
with nlp.entity.update_model(file_or_path_or_etc):
    for doc, labels in my_training_data:
        nlp.entity.train(doc, labels)

# Change the vector of 'submarines' to be the vector formed
# by "spaceships - space + ocean"
# When the context manager exits, revert the changes
with nlp.vocab.update_lexicon(revert=True):
    submarines = nlp.vocab[u'submarines']
    spaceships = nlp.vocab[u'spaceships']
    space = nlp.vocab[u'space']
    ocean = nlp.vocab[u'ocean']
    submarines.vector = (spaceships.vector - space.vector) + ocean.vector
```
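And a rough idea of how such a scoped-update manager could be implemented under the hood. Toy class only: `update_lexicon`, `_vectors` and the `revert` flag are illustrative names, not a real spaCy API:

```python
from contextlib import contextmanager


class ToyLexicon:
    """Toy sketch of scoped updates -- not a real spaCy class."""

    def __init__(self):
        self._vectors = {}        # stand-in for the large binary vector table
        self._dirty = False

    @contextmanager
    def update_lexicon(self, revert=False):
        # Snapshot on entry. Cheap for a dict; the real ~1GB structures would
        # need change-tracking or copy-on-write instead of a full copy.
        snapshot = dict(self._vectors)
        try:
            yield self
        finally:
            if revert:
                self._vectors = snapshot    # dump the changes on exit
            else:
                self._dirty = True          # keep them; state is now unclean


lex = ToyLexicon()
with lex.update_lexicon(revert=True):
    lex._vectors["submarines"] = [0.1, 0.2]
assert lex._vectors == {}                   # changes reverted on exit
```

The point is just that entering the block makes the mutation explicit, and exiting it forces an explicit decision about whether the changes are kept or thrown away.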