
Pickling relies on temp files, doesn't clean them up #228

Closed
honnibal opened this issue Jan 19, 2016 · 1 comment
Comments

@honnibal
Member

The current pickling implementation was only supposed to be an exploratory kludge. However, I didn't leave a TODO, and its status got lost.

Vocab.__reduce__ currently writes state to temp files, which are then not cleaned up. Pickling therefore fills the disk, and really only pretends to work.
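
To make the failure mode concrete, here's a minimal, hypothetical sketch of the pattern (not spaCy's actual code): a __reduce__ that round-trips state through a temp file which nobody ever deletes, so every pickle call leaks a file onto disk.

import os
import pickle
import tempfile

class Vocab(object):
    def __init__(self, data=b''):
        self.data = data  # stands in for the large binary state

    def __reduce__(self):
        # Write the (potentially large) state to a temp file...
        fd, path = tempfile.mkstemp()
        os.close(fd)
        with open(path, 'wb') as file_:
            file_.write(self.data)
        # ...and hand back only the path. The file is never removed, so
        # repeated pickling fills the disk, and a worker on another machine
        # can't see the file at all.
        return (_vocab_from_file, (path,))

def _vocab_from_file(path):
    with open(path, 'rb') as file_:
        return Vocab(file_.read())

vocab = Vocab(b'lots of binary state')
blob = pickle.dumps(vocab)  # leaks one temp file per call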

The root of the problem is that a number of spaCy classes carry large binary data structures. Common usage is to load this data and treat it as immutable; however, you can write to these models, e.g. to change the word vectors, and pickle should not silently drop those changes. On the other hand, it's harsh to assume we always need to write out the state. That would mean users who follow the pattern of keeping the data immutable still have to write out ~1GB of data to pickle the models, which makes typical usage with Spark etc. really problematic.

We could do this implicitly with copy-on-write semantics, but I don't think it's great to invoke a method that may or may not write out 1GB of data to disk, depending on the entire execution history of the program.

We could have a more explicit version of copy-on-write, where all the classes track whether they've been changed, and the models refuse to be pickled if the state is unclean. Users would then explicitly save the state after they change it. I think this is a recipe for having long-running processes suddenly die, though. Mostly, Python is designed around the assumption that things either can or can't be pickled. It's surprising to find that your pickle works sometimes, depending on state, and then your long-running process dies because you didn't meet the assumed invariant. And the next time you run, you get an error in that other place in your code where the classes get pickled.
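
For concreteness, a hedged sketch of that "refuse to pickle unclean state" alternative might look roughly like this (names are illustrative, not a proposed API):

import pickle

class Lexicon(object):
    def __init__(self, path):
        self.path = path        # where the canonical on-disk state lives
        self._dirty = False     # set whenever the in-memory state diverges

    def set_vector(self, word, vector):
        # ... mutate the in-memory model ...
        self._dirty = True

    def save(self):
        # ... write the state back out to self.path ...
        self._dirty = False

    def __reduce__(self):
        if self._dirty:
            raise pickle.PicklingError(
                "Lexicon has unsaved changes; call save() before pickling")
        # Clean state: pickle just the path and reload from disk on the
        # other side, instead of shipping ~1GB of binary data.
        return (Lexicon, (self.path,))

The raise inside __reduce__ is exactly the failure described above: it fires wherever pickling happens to occur, which may be far from the code that made the state dirty.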

I've been thinking for a while that context managers are the idiomatic standard for dealing with this problem. The idea would be that if you want to write to any of this loaded data, you have to open it within a context manager, so that the changes are explicitly scoped, and you explicitly decide whether you want to save the changes or dump them.

Ignoring the naming of everything, this might look like:

from spacy.en import English

nlp = English()

# Open a pre-trained model, do some more training, and save the changes
with nlp.entity.update_model(file_or_path_or_etc):
    for doc, labels in my_training_data:
        nlp.entity.train(doc, labels)

# Change the vector of 'submarines' to be the vector formed
# by "spaceships - space + ocean"
# When the context manager exits, revert the changes
with nlp.vocab.update_lexicon(revert=True):
    submarines = nlp.vocab[u'submarines']
    spaceships = nlp.vocab[u'spaceships']
    space = nlp.vocab[u'space']
    ocean = nlp.vocab[u'ocean']
    submarines.vector = (spaceships.vector - space.vector) + ocean.vector
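
For illustration, the revert-on-exit behaviour could be backed by something like the following (a rough sketch under assumed internals, not a worked-out design):

from contextlib import contextmanager
import copy

class Vocab(object):
    def __init__(self):
        self.vectors = {}  # word -> vector, standing in for the real data

    @contextmanager
    def update_lexicon(self, revert=False):
        # Snapshotting a dict is cheap here; for the real ~1GB structures a
        # copy-on-write or journalling scheme would be needed instead.
        snapshot = copy.deepcopy(self.vectors)
        try:
            yield self
        except Exception:
            self.vectors = snapshot  # an error inside the block always reverts
            raise
        else:
            if revert:
                self.vectors = snapshot  # caller asked to dump the changes
            # else: keep the changes; a real implementation would mark the
            # state as needing to be written out, or write it immediately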

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018