Reduce Phraser memory usage (drop frequencies) #2208

Merged 12 commits on Jan 11, 2019
6 changes: 4 additions & 2 deletions gensim/models/phrases.py
@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs):
        # update older models
        # if value in phrasegrams dict is a tuple, load only the scores.
        try:
            if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
                model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
            for components, scores in model.__dict__['phrasegrams'].items():
@piskvorky (Owner), Dec 3, 2018

Why this strange __dict__ access? Why not use the pattern I showed in my last review?

And what is the try for? What KeyError are we guarding against? Please add code comments.
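
The pattern from the earlier review is not shown here, so the following is only a guess at the kind of alternative being asked for. It sketches one plausible source of the KeyError (indexing `__dict__` on an object that never had the attribute set) and a `getattr` default that expresses the same guard without a `try`. `LegacyModel` is a hypothetical stand-in, not a gensim class.

```python
class LegacyModel:
    """Hypothetical stand-in for an old pickled model with no `phrasegrams` attribute."""


model = LegacyModel()

# Indexing __dict__ directly raises KeyError when the attribute was never set.
try:
    model.__dict__['phrasegrams']
    missing = False
except KeyError:
    missing = True

# getattr with a default covers the same case without try/except.
phrasegrams = getattr(model, 'phrasegrams', {})

print(missing, phrasegrams)
```

With the default in place, a loop over `phrasegrams` simply does nothing for a model that lacks the attribute.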

Does this really work? It mutates the collection it's iterating over, which is usually a bad idea.

As mentioned previously, I'd make a copy of the original keys (not items) and iterate over that, while mutating the original (large) dict.
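
A minimal sketch of the suggestion above: snapshot the keys, then rewrite the values in the original dict. The function name `drop_frequencies` and the sample data are hypothetical, and the tuple layout (frequency first, score second) is assumed from the `v[1]` indexing in the diff.

```python
def drop_frequencies(phrasegrams):
    """Replace tuple values with their second element (the score), in place."""
    # list(phrasegrams) copies only references to the keys, so it is
    # cheaper than copying items() when the dict is large.
    for key in list(phrasegrams):
        value = phrasegrams[key]
        if isinstance(value, tuple):
            phrasegrams[key] = value[1]
    return phrasegrams


# Hypothetical old-format data: key -> (frequency, score).
legacy = {
    (b'new', b'york'): (120, 11.5),
    (b'san', b'francisco'): (80, 9.25),
}
drop_frequencies(legacy)
print(legacy)
```

Overwriting the value of an existing key does not resize the dict, so the loop in the diff would likely run as written. Iterating over a key snapshot makes that safety explicit and keeps working if the loop is later changed to add or remove keys.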

                if isinstance(scores, tuple):
                    model.__dict__['phrasegrams'][components] = scores[1]
        except KeyError:
            pass

        # if no scoring parameter, use default scoring
        if not hasattr(model, 'scoring'):
            logger.info('older version of %s loaded without scoring function', cls.__name__)