Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading and Saving LDA Models across Python 2 and 3. #913

Closed
wants to merge 9 commits into from

Conversation

anmolgulati
Copy link
Contributor

@anmolgulati anmolgulati commented Oct 3, 2016

Modified load and save methods in ldamodel.py to manage compatibility issues when loading and saving models across Python 2 and 3.
This PR tackles Isssue #853

@tmylk
Copy link
Contributor

tmylk commented Oct 3, 2016

Thanks. Could you please add tests for these methods? Maybe we need to add 2 pickled models to the test_data folder, one for Python 2 and one for Python 3?

@anmolgulati
Copy link
Contributor Author

Yes sure, I was gonna ask for suggestions on testing this actually.

@anmolgulati
Copy link
Contributor Author

anmolgulati commented Oct 4, 2016

So, I added two saved LDA models, one in Python 2.7 and other in Python 3.5 environments in test_data folder. The method I used to create these models is in test_ldamodel.py.
Also added a test to load these models and compare(both must be the same actually).
But right now, the test Fails in Python 2.7 environment and Gives an error (same as referred in FAQ).
@tmylk @jayantj Any suggestions?

@tmylk tmylk added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 4, 2016
@anmolgulati
Copy link
Contributor Author

anmolgulati commented Oct 4, 2016

So I got this working. The LDA model loads fine now across both in Python 2 and 3.
Right now, the test fails in checking expElogbeta array equality in the two saved models. But this is maybe due to precision differences, they are different when the model was created itself, i.e., the arrays are different when created across python versions. (I printed the arrays for both environments at time of creation). It's not a problem of persistence. We should probably remove this check, or add other checks for equality of models.
Otherwise, the tests pass.

@tmylk
Copy link
Contributor

tmylk commented Oct 5, 2016

How different are they? Can you add a comparison with an epsilon of each other?

@anmolgulati
Copy link
Contributor Author

Well, the epsilon is big for some indexes, I'm adding the exact arrays I printed.
(Python 2.7 expElogbeta: [[ 0.00685637, 0.01305742, 0.00683859, 0.09222717, 0.09222721,
0.09235907, 0.00893736, 0.07625897, 0.00682716, 0.14222405, 0.14247683, 0.09542658],
[ 0.10647068, 0.09757566, 0.10649878, 0.01035072, 0.01035068, 0.01023777, 0.2084948 , 0.07635718, 0.10651685, 0.00796197, 0.00776243, 0.00771203]])

(Python 3.5 expElogbeta: [[ 0.00683436, 0.00685224, 0.01325946, 0.00895329, 0.09220773, 0.0764139, 0.09233712, 0.09220773, 0.006823, 0.14213993, 0.14239321, 0.09537042]
[ 0.10656816, 0.10653988, 0.09735885, 0.20859865, 0.01032817, 0.07618502,
0.01021732, 0.01032817, 0.10658615, 0.00796839, 0.00776821, 0.00771765]])

Also, the id2word dictionary saved in both Python formats is not in the same order.

if (isinstance(self.eta, six.string_types) and self.eta == 'auto') or len(self.eta.shape) != 1:
separately_explicit.append('eta')
# Merge separately_explicit with separately.
if separately is not None and separately:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if separately is enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made changes as suggested.

@@ -1037,6 +1072,18 @@ def load(cls, fname, *args, **kwargs):
"""
kwargs['mmap'] = kwargs.get('mmap', None)
result = super(LdaModel, cls).load(fname, *args, **kwargs)
# Load the separately stored id2word dictionary saved in json.
id2word_fname = utils.smart_extension(fname, '.json')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make all files for one model in a special folder, so it is easy to keep track

@tmylk
Copy link
Contributor

tmylk commented Oct 5, 2016

This solution looks really complicated. Let's try something simpler. Let's save the Python version with the pickle. If the versions differ on load than raise an exception to use Dill. (assuming dill works across versions)

@tmylk tmylk mentioned this pull request Oct 5, 2016
@anmolgulati
Copy link
Contributor Author

I created the LDAModels with the same seed now and the tests pass.

@anmolgulati
Copy link
Contributor Author

anmolgulati commented Oct 5, 2016

@tmylk Actually, I think no more changes need to be made for compatibility in word2vec or doc2vec models. The load/save methods work fine now for them. We could probably use the present pickle methods itself.
I've added tests now to Check compatibility for Word2Vec Models. The models load without errors. But the tests fail though in testing equality of models across python versions. I also added the code I used to create the word2vec models in test_word2vec.py (commented code). Please review once.
I'll too dig deeper if something else might be an issue.

@tmylk
Copy link
Contributor

tmylk commented Oct 26, 2016

Add an epsilon for equality comparison.
Please git fetch, git merge develop, so the PR becomes reviewable (there are unrelated commits).

@tmylk tmylk closed this Oct 26, 2016
@tmylk tmylk reopened this Oct 26, 2016
@@ -474,11 +487,11 @@ def testRNG(self):

def models_equal(self, model, model2):
self.assertEqual(len(model.vocab), len(model2.vocab))
self.assertTrue(numpy.allclose(model.syn0, model2.syn0))
self.assertTrue(numpy.allclose(model.syn0, model2.syn0, atol=1e-4))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how big is the difference on your machine?

# logging.warning("Word2Vec model saved")

def testModelCompatibilityWithPythonVersions(self):
fname_model_2_7 = os.path.join(os.path.dirname(__file__), 'word2vecmodel_python_2_7')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use module_path and datapath defined above.

# Load the separately stored id2word dictionary saved in json.
id2word_fname = utils.smart_extension(fname, '.json')
try:
with utils.smart_open(id2word_fname, 'r') as fin:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open file as binary, decode as necessary (if necessary).

# If id2word is not already in ignore, then saving it separately in json.
id2word = None
if self.id2word is not None and 'id2word' not in ignore:
id2word = dict((k,v) for k,v in self.id2word.iteritems())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: space after comma.

id2word_fname = utils.smart_extension(fname, '.json')
try:
with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout:
json.dump(id2word, fout)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better open the output as binary and write encoded utf8 to it.

Actually, the json module already produces binary strings in dump AFAIK, so what is this even for?

# Because of loading from S3 load can't be used (missing readline in smart_open)
return _pickle.loads(f.read())

if sys.version_info > (3,0):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: space after comma.

@piskvorky piskvorky added feature Issue described a new feature and removed bug Issue described a bug labels Dec 7, 2016
@tmylk
Copy link
Contributor

tmylk commented Dec 26, 2016

Merged in #1039

@tmylk tmylk closed this Dec 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants