
Pickling relies on temp files, doesn't clean them up #228

Closed
honnibal opened this issue Jan 19, 2016 · 1 comment
Comments

@honnibal
Member

The current pickling implementation was only supposed to be an exploratory kludge. However, I didn't leave a TODO, and its status got lost.

Vocab.__reduce__ currently writes state to temp files, which are then not cleaned up. Pickling therefore fills the disk, and really only pretends to work.
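
To make the failure mode concrete, here's a minimal, hypothetical sketch of the pattern (not spaCy's actual code): a __reduce__ that round-trips state through a temp file which nobody ever deletes, so every pickle call leaks a file onto disk.

import os
import pickle
import tempfile

class Vocab(object):
    def __init__(self, data=b''):
        self.data = data  # stands in for the large binary state

    def __reduce__(self):
        # Write the (potentially large) state to a temp file...
        fd, path = tempfile.mkstemp()
        os.close(fd)
        with open(path, 'wb') as file_:
            file_.write(self.data)
        # ...and hand back only the path. The file is never removed, so
        # repeated pickling fills the disk, and a worker on another machine
        # can't see the file at all.
        return (_vocab_from_file, (path,))

def _vocab_from_file(path):
    with open(path, 'rb') as file_:
        return Vocab(file_.read())

vocab = Vocab(b'lots of binary state')
blob = pickle.dumps(vocab)  # leaks one temp file per call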

The root of the problem is that a number of spaCy classes carry large binary data structures. Common usage is to load this data and treat it as immutable; however, you can write to these models, e.g. to change the word vectors, and pickle should not silently drop those changes. On the other hand, it's harsh to assume we always need to write out the state. That would mean users who follow the pattern of keeping the data immutable still have to write out ~1GB of data to pickle the models, which makes typical usage with Spark etc. really problematic.

We could do this implicitly with copy-on-write semantics, but I don't think it's great to invoke a method that may or may not write out 1GB of data to disk, depending on the entire execution history of the program.

We could have a more explicit version of copy-on-write, where all the classes track whether they've been changed, and the models refuse to be pickled if the state is unclean. Users would then explicitly save the state after they change it. I think this is a recipe for having long-running processes suddenly die, though. Mostly, Python is designed around the assumption that things either can or can't be pickled. It's surprising to find that your pickle works sometimes, depending on state, and then your long-running process dies because you didn't meet the assumed invariant. And the next time you run, you get an error in that other place in your code where the classes get pickled.
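
For concreteness, a hedged sketch of that "refuse to pickle unclean state" alternative might look roughly like this (names are illustrative, not a proposed API):

import pickle

class Lexicon(object):
    def __init__(self, path):
        self.path = path        # where the canonical on-disk state lives
        self._dirty = False     # set whenever the in-memory state diverges

    def set_vector(self, word, vector):
        # ... mutate the in-memory model ...
        self._dirty = True

    def save(self):
        # ... write the state back out to self.path ...
        self._dirty = False

    def __reduce__(self):
        if self._dirty:
            raise pickle.PicklingError(
                "Lexicon has unsaved changes; call save() before pickling")
        # Clean state: pickle just the path and reload from disk on the
        # other side, instead of shipping ~1GB of binary data.
        return (Lexicon, (self.path,))

The raise inside __reduce__ is exactly the failure described above: it fires wherever pickling happens to occur, which may be far from the code that made the state dirty.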

I've been thinking for a while that context managers are the idiomatic standard for dealing with this problem. The idea would be that if you want to write to any of this loaded data, you have to open it within a context manager, so that the changes are explicitly scoped, and you explicitly decide whether you want to save the changes or dump them.

Ignoring the naming of everything, this might look like:

from spacy.en import English

nlp = English()

# Open a pre-trained model, do some more training, and save the changes
with nlp.entity.update_model(file_or_path_or_etc):
    for doc, labels in my_training_data:
        nlp.entity.train(doc, labels)

# Change the vector of 'submarines' to be the vector formed
# by "spaceships - space + ocean"
# When the context manager exits, revert the changes
with nlp.vocab.update_lexicon(revert=True):
    submarines = nlp.vocab[u'submarines']
    spaceships = nlp.vocab[u'spaceships']
    space = nlp.vocab[u'space']
    ocean = nlp.vocab[u'ocean']
    submarines.vector = (spaceships.vector - space.vector) + ocean.vector
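
For illustration, the revert-on-exit behaviour could be backed by something like the following (a rough sketch under assumed internals, not a worked-out design):

from contextlib import contextmanager
import copy

class Vocab(object):
    def __init__(self):
        self.vectors = {}  # word -> vector, standing in for the real data

    @contextmanager
    def update_lexicon(self, revert=False):
        # Snapshotting a dict is cheap here; for the real ~1GB structures a
        # copy-on-write or journalling scheme would be needed instead.
        snapshot = copy.deepcopy(self.vectors)
        try:
            yield self
        except Exception:
            self.vectors = snapshot  # an error inside the block always reverts
            raise
        else:
            if revert:
                self.vectors = snapshot  # caller asked to dump the changes
            # else: keep the changes; a real implementation would mark the
            # state as needing to be written out, or write it immediately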

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018