[WIP] Computes training loss for Word2Vec model (fixes issue #999) #1201
Conversation
I can't see how a pure-Python implementation would be useful... I assume people who need to see progress need to see it on large data. And pure-Python on large data would be too slow to start with.
@piskvorky Yes, I agree completely. However, the manner in which we want to compute and use the loss value is itself not clear right now (ongoing discussion at #999). That is why I submitted a PR first with changes to the Python path for the skip-gram model, and plan to make corresponding changes to the other paths, for both skip-gram and CBOW, for Word2Vec (and also for Doc2Vec).
gensim/models/word2vec.py
@@ -140,7 +140,7 @@
 FAST_VERSION = -1
 MAX_WORDS_IN_BATCH = 10000

-def train_batch_sg(model, sentences, alpha, work=None):
+def train_batch_sg(model, sentences, alpha, work=None, print_freq=0):
Maybe rename `print_freq` to `eval_every`, as in LdaModel.
Sure Ivan. I'll update the parameter name.
A good start, I think. Although the pure-Python implementation can be slow (we need a benchmark to check this), if you log the model loss only rarely it should not slow down the learning process significantly (but the Cython version is the best choice).
I personally don't see this current implementation as useful enough to justify the code complication. The people who requested it likely don't use the pure-Python path; it's an advanced need, and advanced users depend on the optimized code. The discussion on #999 seemed to conclude that accumulating the error in the model was a superior approach, but this only logs the value of occasional single-pair errors. I believe this feature will require the active participation of someone who actually needs it, either to provide a good spec and review, or to implement it, to become a worthwhile addition.
@gojomo I am indeed planning to implement this for both the Python path and the optimized Cython path, so overall this should in fact help users get an idea of how the training of their model is progressing. That said, I agree that this may be just one of many use cases for which users need this functionality, so more user input before going forward with the implementation wouldn't hurt. cc: @tmylk @menshikh-iv
At the time of my comment, before commit 8949749, there wasn't yet any accumulation, only logging of occasional single-pair errors. Yes, a value for a whole epoch, which is thus comparable against another epoch, would be useful. (A running average of 'recent' error might also be useful.)
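As an illustration of the per-epoch total versus running-average idea (not part of this PR; the class and attribute names below are hypothetical), a minimal sketch:

```python
class RunningLoss:
    """Tracks a per-epoch loss total plus an exponentially weighted
    average of 'recent' per-pair errors. Hypothetical helper, not gensim API."""

    def __init__(self, smoothing=0.999):
        self.smoothing = smoothing    # closer to 1.0 means a longer memory
        self.epoch_total = 0.0
        self.recent = None

    def update(self, pair_loss):
        # Accumulate the epoch total and update the recent-error average.
        self.epoch_total += pair_loss
        if self.recent is None:
            self.recent = pair_loss
        else:
            self.recent = self.smoothing * self.recent + (1.0 - self.smoothing) * pair_loss

    def end_epoch(self):
        # Return the epoch total (comparable across epochs on the same corpus)
        # and reset for the next epoch.
        total, self.epoch_total = self.epoch_total, 0.0
        return total
```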
How does this extra branching / extra code in the critical path affect performance? @chinmayapancholi13 Can you post some benchmarks, before + after?
@piskvorky Sure. I plan to post benchmarks for the before vs. after cases as soon as I complete the coding part.
@@ -698,6 +698,13 @@ def test_reset_from(self):
         model.reset_from(other_model)
         self.assertEqual(model.wv.vocab, other_vocab)

+    def test_compute_training_loss(self):
+        model = word2vec.Word2Vec(min_count=1, sg=1, negative=5, hs=1)
Please add tests for more training modes.
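A sketch of what covering more training modes might look like, assuming the test lives in gensim's `test_word2vec.py` (where `word2vec` is already imported) and assuming the `compute_loss` flag and `get_latest_training_loss()` accessor that the later commits settled on; the parameter combinations are illustrative:

```python
    def test_compute_training_loss_all_modes(self):
        # Illustrative only: exercise skip-gram and CBOW, with hierarchical
        # softmax and/or negative sampling, and check a non-negative loss.
        corpus = [['human', 'interface', 'computer'], ['graph', 'trees', 'user']]
        for sg in (0, 1):
            for hs, negative in ((1, 0), (0, 5), (1, 5)):
                model = word2vec.Word2Vec(corpus, min_count=1, sg=sg, hs=hs,
                                          negative=negative, compute_loss=True)
                self.assertGreaterEqual(model.get_latest_training_loss(), 0.0)
```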
Benchmarks showing the effect of the loss-computation code on training time
Edit: The benchmarks posted here have only been run for test data sizes of 25 KB and 50 MB. The associated notebook (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb) has a section for reproducing the benchmark values for test data sizes of 25 KB, 1 MB, 10 MB, 50 MB and 100 MB.
The Lee corpus is too small/quick to give any feedback. From the text8 results, it looks like 'true' may cause a 0.5% to 3% slowdown. Running text8 with more iterations might increase the reliability of the result. A comparison against the code without even the switch-to-turn-it-on (without this feature at all) might also be relevant.
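A minimal sketch of the kind of before/after timing run being discussed (the corpus path and hyperparameter values are placeholders; parameter names such as `size` and `iter` follow the gensim API of that era, and `compute_loss` is the flag this PR adds):

```python
import time

from gensim.models import word2vec

corpus = word2vec.Text8Corpus('text8')  # placeholder path to the text8 corpus


def time_training(compute_loss, runs=3, iters=5):
    # Average wall-clock training time over several runs for robustness.
    timings = []
    for _ in range(runs):
        start = time.time()
        word2vec.Word2Vec(corpus, size=100, window=5, min_count=5,
                          workers=3, iter=iters, compute_loss=compute_loss)
        timings.append(time.time() - start)
    return sum(timings) / len(timings)


print('loss computation off:', time_training(False))
print('loss computation on :', time_training(True))
```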
@@ -133,6 +133,6 @@ check_files() {
 if [[ "$MODIFIED_FILES" == "no_match" ]]; then
     echo "No file has been modified"
 else
-    check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*"
+    check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*,*.c,*.pyx"
Are we sure `.pyx` should be here? I didn't see what kind of warnings flake8 was generating, but since Cython syntax is mostly Python, and most of our enforceable conventions should still be in effect, we may want some style enforcement there.
@gojomo flake8 can't correctly check `.pyx` files.
@gojomo We were getting errors like these from flake8. So although I agree there is some style checking we might want to do in `.pyx` files (in the Python-like code), to avoid errors in cases like the ones above I thought it would be better to exclude `.pyx` from the flake8 checks.
Ah, I see. There's an SO answer that implies it may be possible to turn off just certain warnings for `.pyx` files (https://stackoverflow.com/questions/31269527/running-pep8-or-pylint-on-cython-code), though the full example file is a broken link.
Thanks for sharing this link. :) I can try the config specified in that answer to check whether it turns off all the undesired warnings/errors.
While summing all the errors for a single call to |
@chinmayapancholi13 good progress! I'm most interested in seeing a performance comparison against the original. Comparison of the new switch on/off is interesting too, but we have to evaluate what we're losing by adding an extra switch to the critical path. And as @gojomo says, it's best to run the benchmark against a larger corpus and multiple times, for robustness.
@piskvorky When we say a "larger corpus", could you please give an approximate size? As I mentioned in my comment here, I also did this comparison between the old and new training times for the 100 MB text8 corpus (but not for all the possible 8 combinations of
I'd find a couple of tests that run for tens of minutes better than many tests that run for a couple of minutes (or just tens of seconds). Increasing the
Got it, @gojomo. So to conclude, we should use the text8 corpus with |
Analysis of the extra training time due to the loss-computation code
The most interesting comparison, to determine the overhead for people who won't use the feature, would be between the old code (no changes present) and the new code with `compute_loss` switched off.
@gojomo I agree. I have updated the table to include these values too.
* computes training loss for skip gram
* synced word2vec.py with gensim_main
* removed unnecessary keep_vocab_item import
* synced word2vec.py with gensim_main
* PEP8 changes
* added Python-only implementation for skip-gram model
* updated param name to 'compute_loss'
* removed 'raise ImportError' statement from prev commit
* [WIP] partial changes for loss computation for skipgram case
* [WIP] updated cython code
* added unit test for training loss computation
* added loss computation for neg sampling
* removed unnecessary 'raise ImportError' stmt
* added .c and .pyx to flake8 ignore list
* added loss computation for CBOW model in Python path
* added loss computation for CBOW model in Cython path
* PEP8 (F811) fix due to var 'prod'
* updated w2v ipynb for training loss computation and benchmarking
* updated .c files
* added benchmark results
AFAICT, whatever overhead there is from either the switch itself or enabling the switch disappears into the variance (in these short tests), so there's no evidence of a performance-related reason to hold back. But for the reasons in my comment #1201 (comment), I still think this was merged (and #999 closed) prematurely, because there's no confirmation of usefulness from the people who actually requested this. Also, while the titles still mention "skip-gram", the benchmarking included non-skip-gram modes... so does that mean what's been done works for CBOW, too?
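For reference, a minimal usage sketch of the feature as merged, assuming the `compute_loss` constructor flag and `get_latest_training_loss()` accessor described in the final commits:

```python
from gensim.models import word2vec

# Toy corpus; any iterable of tokenized sentences works.
sentences = [['first', 'sentence'], ['second', 'sentence']]

# compute_loss=True asks the model to accumulate the training loss.
model = word2vec.Word2Vec(sentences, min_count=1, sg=1, hs=1, negative=5,
                          compute_loss=True)

# The accumulated loss after training; comparable across runs on the same corpus.
print(model.get_latest_training_loss())
```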
This PR computes the training loss for each pair of words being trained, as per the skip-gram model. The loss value is computed in the function `train_sg_pair`, and the value is displayed after every `print_freq` (passed as a parameter to the function `train_batch_sg`) training pairs. To not show the loss value at all, set `print_freq` to 0 (which is the default value).
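A rough sketch of the mechanism described above (simplified; the real `train_batch_sg` and `train_sg_pair` in gensim do considerably more, `skipgram_pairs` is a hypothetical helper, and `train_sg_pair` returning a loss value is the change this PR proposes):

```python
import logging

logger = logging.getLogger(__name__)


def train_batch_sg(model, sentences, alpha, work=None, print_freq=0):
    """Simplified illustration: train skip-gram pairs and optionally
    log the pair loss every `print_freq` pairs (0 = never log)."""
    pair_count = 0
    for sentence in sentences:
        # skipgram_pairs is a hypothetical helper yielding (word, context) pairs.
        for word, context_word in skipgram_pairs(sentence, model.window):
            # Assumes train_sg_pair returns the loss for this pair (the PR's change).
            loss = train_sg_pair(model, word, context_word, alpha)
            pair_count += 1
            if print_freq and pair_count % print_freq == 0:
                logger.info("pair %d: training loss %.4f", pair_count, loss)
    return pair_count
```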