Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lda models load/save backward compatibility across Python versions #1039

Merged
merged 16 commits into from
Dec 22, 2016

Conversation

anmolgulati
Copy link
Contributor

@anmolgulati anmolgulati commented Dec 7, 2016

I've branched this off PR #913 . This PR is specifically for loading LDA models. There was some still some issue in loading(across python vecrsion) word2vec models in PR #913, so thought to separately tackle that issue. This PR looks ready to be merged for now.
@tmylk @piskvorky Please review.

@@ -995,7 +996,7 @@ def __getitem__(self, bow, eps=None):
"""
return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)

def save(self, fname, ignore=['state', 'dispatcher'], *args, **kwargs):
def save(self, fname, ignore=['state', 'dispatcher'], separately = None, *args, **kwargs):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: no separately=None (no spaces in argument params).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# Save the dictionary separately in json.
id2word_fname = utils.smart_extension(fname, '.bin')
try:
with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to open the file as binary and explicitly write binary (utf8) strings into it.

Copy link
Contributor Author

@anmolgulati anmolgulati Dec 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was saving the id2word dictionary separately in json format(following this. I've changed it now, and added id2word in the separately list, while saving the model. Though I still need to add tests to check this. Will add them first, and let you know.

with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout:
json.dump(id2word, fout)
except Exception as e:
logging.warning("failed to save id2words dictionary in %s: %s", id2word_fname, e)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warning, or exception? What's the user contract on storing this "id2words" dictionary?

Copy link
Contributor Author

@anmolgulati anmolgulati Dec 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it would be better to have it as an exception. Removed this now. It should return an exception if it fails to save the model. I still need to add tests to check this.

@@ -0,0 +1 @@
{"0": "interface", "1": "computer", "2": "human", "3": "response", "4": "time", "5": "survey", "6": "system", "7": "user", "8": "eps", "9": "trees", "10": "graph", "11": "minors"}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like a binary (extension .bin?) file.

return _pickle.loads(f.read())

if sys.version_info > (3, 0):
return _pickle.load(f, encoding='latin1')
Copy link
Owner

@piskvorky piskvorky Dec 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely not! What is this latin1?

The content is (and should be read as) binary.

Copy link
Contributor Author

@anmolgulati anmolgulati Dec 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as a fix for when loading objects in Python 3 which were pickled in Python 2, which gives an exception.
Basically, Python 3 attempts to convert the pickled py2 object into a str object, when we need it to be bytes and gives an exception. I used the latin1 encoding for as a work around for that. (Asked on Stackoverflow)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a code comment to explain this?

Copy link
Owner

@piskvorky piskvorky Dec 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this hack needs to be marked and explained thoroughly in a comment.

I'm not familiar with such py2/py3 pickling work arounds, but isn't there a cleaner way to achieve the same effect? This sticks out like a sore thumb. @tmylk @anmol01gulati

Copy link
Contributor Author

@anmolgulati anmolgulati Dec 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky Umm, I had actually searched quite a lot, and tried various things on my system. This is the only way(a hack actually), I found, through which it works. By the way, I felt, we would not want to have this functionality in the future and could do away with the backward compatibility, if majority of the users shift to one Python 3 later (it's not the case right now though).
I'll open up a new PR to add a comment in the code though.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am coding something entirely different and this solution is the only thing that worked for loading python2 pickles in python3... The creators claim that pickle is backwards compatible but apparently only if I pass latin1... Any other way just breaks and burns.

@anmolgulati anmolgulati changed the title Lda models load/save backward compatibility across Python versions [WIP] Lda models load/save backward compatibility across Python versions Dec 8, 2016
…lity of id2word dictionary in loading a model
@anmolgulati
Copy link
Contributor Author

I've made the necessary changes now. The id2word dictionary is saved in binary format.
We only create a new file and save the id2word dictionary when it's not marked as 'ignored'. So, when loading thedictionary from the file(in ldamodel), we check if the file exists, and only then load the dictionary.
To accomodate this, I had to change the testfile() function in test_ldamodel, so thus, a different file is created for each test. Does this sound good?
@piskvorky @tmylk We could merge it then, if it looks fine.

@anmolgulati anmolgulati changed the title [WIP] Lda models load/save backward compatibility across Python versions Lda models load/save backward compatibility across Python versions Dec 15, 2016
@tmylk tmylk merged commit e08af7b into piskvorky:develop Dec 22, 2016
@tmylk
Copy link
Contributor

tmylk commented Dec 22, 2016

A code comment on `latin`` is needed in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants