-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug where saved Phrases model did not load its connector_words #3116
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't understand the motivation for this PR. What is the problem, why is this needed? Can you include a minimal reproducing example?
A reproducible example will include training data as well. Although an short example of bug will be, If you train a bigram and trigram model, and apply them on words like ["united, "states", "of", "america"], then this will output ['united_states_of_america']
But if you save these models and load them and then apply them on words like ["united, "states", "of", "america"], then this will output
|
Please add a unit test that fails before this PR, and succeeds after. That should demonstrate the issue clearly. |
Do you need the unit test changes in the same commit or different commit over this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! One minor code style fix and we're good to go.
gensim/test/test_phrases.py
Outdated
@@ -321,9 +321,10 @@ def test_save_load_custom_scorer(self): | |||
def test_save_load(self): | |||
"""Test saving and loading a Phrases object.""" | |||
with temporary_file("test.pkl") as fpath: | |||
bigram = Phrases(self.sentences, min_count=1, threshold=1) | |||
bigram = Phrases(self.sentences, min_count=1, threshold=1, connector_words=self.connector_words) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we changing the behavior of an existing, passing test?
Should we not be adding a new test case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to create a separate test, But after reading this whole file gensim/test/test_phrases.py, I found that this test function i.e "test_save_load" should ideally be capturing this bug. Give its name "test_save_load" I assumed that this test was designed to capture the saving & loading of the whole model and this test shouldn't pass if saved connector words are not loaded. Adding it separately will increase the number of lines of code and will increase the test execution time. Another thought can be to remove this test altogether i.e not testing loading & saving of connector words. Let me know I can change the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that this test function i.e "test_save_load" should ideally be capturing this bug
Not necessarily. The bug is loading an old model. So the new test could be equally be called test_save_load_old_model.
Adding it separately will increase the number of lines of code and will increase the test execution time.
The test is less than 10 lines long, so duplicating them isn't a huge issue. That leaves execution time.
How long does the test take to run?
Another thought can be to remove this test altogether i.e not testing loading & saving of connector words.
What would be the benefit of removing this test? Isn't it testing a valid case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't understand your first point. After reading your first point, It seems that you have misunderstood that this bug is loading an old model. This bug occurs when we save and load the model in the same version itself. I mean to say that "test_save_load_old_model" would not describe this bug related test. string like "test_save_load_with_connector_words" can describe this bug
I agree with all of the rest of your points.
It will take the same time as test "test_save_load"
Yes its testing a valid case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string like "test_save_load_with_connector_words" can describe this bug
Sure, if that's a more suitable name, please use that instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated PR with new test method and no behaviour changing of an existing, passing test
… model of version >= 4 Added tests for asserting persistence of phrases connector_words
@@ -5,6 +5,7 @@ Changes | |||
|
|||
- LsiModel: Only log top words that actually exist in the dictionary (PR [#3091](https://github.com/RaRe-Technologies/gensim/pull/3091), [@kmurphy4](https://github.com/kmurphy4)) | |||
- [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness)) | |||
- fix bug when loading saved Phrases model (PR [#3116](https://github.com/RaRe-Technologies/gensim/pull/3116), [@aloknayak29](https://github.com/aloknayak29)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have a script that spits out a formatted list of all merged PRs automatically, during a release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, but I've been keeping track of them as we merge them in smart_open, and I've found that way to be more manageable (less risk of missing PRs, and easier to see what has changed since the last release), so I'm trying out the same approach with gensim.
It's also easier to deal with this while the PR is still in my short term memory. When I'm doing the release, that memory is long gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, no problem. One advantage of the script is consistency in formatting – see for example the change here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've got a script that handles these changes, but multiple versions of it were floating around my work/home machines. It's now committed to the repo, so once we nail down the format of the entries, the formatting should no longer be an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was talking about this script: https://github.com/RaRe-Technologies/gensim/blob/develop/release/generate_changelog.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Once the two scripts output stuff in the same format, the consistency problem will be solved, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be misunderstanding, but why do we need two scripts? generate_changelog.py
seems to be doing what we need = generate formatted CHANGELOG from all new PRs since the last release. When / why would we use annotate_pr.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using annotate_pr.py for each PR (since a few week ago). I'll use generate_changelog.py for the next release. I'll keep whichever ends up being the most convenient - for now, it seems to be the former, but time will tell.
Thank you @aloknayak29 ! |
Bug description: If we have saved the trained phrases model in this version i.e gensim version 4.0.0 or 4.0.1, while giving non empty connector_words. And If we try to load the saved model in this version itself. Then the connector_words of the loaded phrases model will be wrongly assigned empty frozen set. This is a functional bug and will result in some of the words not getting grouped in ngram