
[WIP] Computes training loss for Word2Vec model (fixes issue #999) #1201

Merged

Conversation

chinmayapancholi13
Contributor

This PR computes the training loss for each pair of words trained under the skip-gram model. The loss is computed in the function train_sg_pair, and the value is displayed after every print_freq training pairs (print_freq is passed as a parameter to train_batch_sg). To suppress the loss output entirely, set print_freq to 0 (the default).
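The idea can be sketched in pure Python (hypothetical names; the actual PR modifies gensim's `train_sg_pair`/`train_batch_sg`): compute the negative-sampling loss per (center, context) pair, accumulate it, and report it every `print_freq` pairs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sg_pair_loss(pos_score, neg_scores):
    # Negative-sampling loss for one (center, context) pair:
    # -log(sigmoid(positive score)) - sum(log(sigmoid(-negative score)))
    loss = -math.log(sigmoid(pos_score))
    loss -= sum(math.log(sigmoid(-s)) for s in neg_scores)
    return loss

def train_batch_sg_sketch(pair_scores, print_freq=0):
    # Accumulate loss over pairs; report every print_freq pairs (0 = never),
    # mirroring the print_freq parameter described above.
    total = 0.0
    for i, (pos, negs) in enumerate(pair_scores, start=1):
        total += sg_pair_loss(pos, negs)
        if print_freq and i % print_freq == 0:
            print("pairs=%d cumulative loss=%.4f" % (i, total))
    return total
```

This is only an illustration of the bookkeeping, not the PR's actual code; the real implementation operates on the model's weight vectors rather than precomputed scores.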

@chinmayapancholi13 chinmayapancholi13 changed the title computes training loss for skip gram Computes training loss for skip gram model. Fixes issue #999. Mar 9, 2017
@piskvorky
Owner

I can't see how a pure-Python implementation would be useful... I assume people who need to see progress need to see it on large data. And pure-Python on large data would be too slow to start with.

@chinmayapancholi13
Contributor Author

@piskvorky Yes, I agree completely. However, the manner in which we want to compute and use the loss value is itself not yet clear (ongoing discussion at #999). That is why I submitted a PR that first changes only the Python path for the skip-gram model; I plan to make the corresponding changes for the other paths, for both skip-gram and CBOW, in Word2Vec (and also in Doc2Vec).

@chinmayapancholi13 chinmayapancholi13 changed the title Computes training loss for skip gram model. Fixes issue #999. [WIP] Computes training loss for skip gram model. Fixes issue #999. Mar 16, 2017
@@ -140,7 +140,7 @@
FAST_VERSION = -1
MAX_WORDS_IN_BATCH = 10000

def train_batch_sg(model, sentences, alpha, work=None):
def train_batch_sg(model, sentences, alpha, work=None, print_freq=0):
Contributor

Maybe rename print_freq to eval_every, as in LdaModel.

Contributor Author

Sure, Ivan. I'll update the parameter name.

@menshikh-iv
Contributor

A good start, I think. The pure-Python implementation can be slow (we need a benchmark to check this), but if the model loss is logged only rarely, it should not slow the learning process significantly (though the Cython version is the best choice).

@gojomo
Collaborator

gojomo commented May 23, 2017

I personally don't see this current implementation as useful enough to justify the code-complication. The people who requested it likely don't use the pure-python path – it's an advanced need, and advanced users are dependent on the optimized code.

The discussion on #999 seemed to conclude that accumulating the error in the model was a superior approach – but this only logs the value of train_error_value, the error of a single (!) skip-pair training example, occasionally. (I think that number will jump all over, based on which 2 words happen to have been in the example when logging is chosen. I find it hard to imagine any use for such logged values.)

I believe this feature will require the active participation of someone who actively needs the feature – either to provide a good spec & review, or to implement – to become a worthwhile addition.

@chinmayapancholi13
Contributor Author

@gojomo I am indeed planning to implement this for both the Python path and the optimized Cython path.
Also, as per the latest discussion on #999, it was agreed that storing the sum of the training losses over all word-pairs would be a good start for addressing the issue. As mentioned in the comment by @dietmar here, this would let a user get the cumulative training loss for each call to the function train().

So overall this should in fact help users get an idea of how the training of their model is progressing. That said, I agree this may be just one of many use-cases for this functionality, so more user input before going forward with the implementation wouldn't hurt.

cc : @tmylk @menshikh-iv

@gojomo
Collaborator

gojomo commented May 24, 2017

At the time of my comment, before commit 8949749, there wasn't yet any accumulation – only logging of occasional single-pair errors. Yes, a value for a whole epoch, that is thus comparable against another epoch, would be useful. (A running average of 'recent' error might also be useful.)
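The "running average of recent error" mentioned above could be as simple as an exponential moving average over the per-example loss stream (a sketch with a hypothetical smoothing factor, not anything in this PR):

```python
def running_average(losses, alpha=0.1):
    # Exponential moving average of a noisy per-example loss stream;
    # alpha is a hypothetical smoothing factor (larger = more reactive).
    avg, out = None, []
    for loss in losses:
        avg = loss if avg is None else (1.0 - alpha) * avg + alpha * loss
        out.append(avg)
    return out
```

Unlike a per-epoch total, such a smoothed value is comparable at any point during training, which is closer to what tools that print a live loss do.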

@piskvorky
Owner

piskvorky commented Jun 13, 2017

How does this extra branching / extra code in the critical path affect performance?

@chinmayapancholi13 Can you post some benchmarks, before + after?

@chinmayapancholi13
Contributor Author

@piskvorky Sure. I plan to post the before vs. after benchmarks as soon as I finish the coding.

@@ -698,6 +698,13 @@ def test_reset_from(self):
model.reset_from(other_model)
self.assertEqual(model.wv.vocab, other_vocab)

def test_compute_training_loss(self):
model = word2vec.Word2Vec(min_count=1, sg=1, negative=5, hs=1)
Contributor

Please add tests for more training modes.

@chinmayapancholi13
Contributor Author

chinmayapancholi13 commented Jun 29, 2017

Benchmarks showing the effect of the loss-computation code on training time
text8_50000000 : 50 MB size
lee_background.cor : 25 KB size

| train_data | compute_loss | hs | sg | mean | std |
|---|---|---|---|---|---|
| text8_50000000 | True | 0 | 1 | 125.242767 | 0.442522 |
| text8_50000000 | False | 0 | 1 | 124.090732 | 1 |
| text8_50000000 | True | 1 | 1 | 252.800164 | 1.140344 |
| text8_50000000 | False | 1 | 1 | 245.065643 | 2.657392 |
| text8_50000000 | True | 0 | 0 | 43.812430 | 1.216697 |
| text8_50000000 | False | 0 | 0 | 42.815214 | 0.142814 |
| text8_50000000 | True | 1 | 0 | 74.801153 | 0.300728 |
| text8_50000000 | False | 1 | 0 | 74.236441 | 0.126426 |
| lee_background.cor | True | 0 | 1 | 0.560387 | 0.005805 |
| lee_background.cor | False | 0 | 1 | 0.687179 | 0.143629 |
| lee_background.cor | True | 1 | 1 | 1.126855 | 0.004407 |
| lee_background.cor | False | 1 | 1 | 1.135358 | 0.059161 |
| lee_background.cor | True | 0 | 0 | 0.316948 | 0.005148 |
| lee_background.cor | False | 0 | 0 | 0.319236 | 0.005792 |
| lee_background.cor | True | 1 | 0 | 0.429879 | 0.005373 |
| lee_background.cor | False | 1 | 0 | 0.429489 | 0.000756 |

Edit: The benchmarks posted here were run only for test data sizes of 25 KB and 50 MB. The associated notebook (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb) has a section for reproducing the benchmark values for test data of sizes 25 KB, 1 MB, 10 MB, 50 MB and 100 MB.
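The mean/std columns above can be produced with a simple timing harness like the following (a generic sketch, not the notebook's actual code; `run_once` stands in for one full training run):

```python
import statistics
import time

def time_training(run_once, repeats=3):
    # Run a training callable several times and report the mean and sample
    # standard deviation of the wall-clock seconds per run.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)
```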

@menshikh-iv menshikh-iv merged commit cdc5944 into piskvorky:develop Jun 29, 2017
@gojomo
Collaborator

gojomo commented Jun 29, 2017

The lee corpus is too small/quick to give any feedback. From the text8 results, it looks like 'true' may cause a 0.5% to 3% slowdown. Running text8 with more iterations might increase the reliability of the result. A comparison against the code without even the switch-to-turn-it-on (i.e. without this feature at all) might also be relevant.

@@ -133,6 +133,6 @@ check_files() {
if [[ "$MODIFIED_FILES" == "no_match" ]]; then
echo "No file has been modified"
else
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*"
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*,*.c,*.pyx"
Collaborator

Are we sure .pyx should be here? I didn't see what kind of warnings flake8 was generating, but since Cython syntax is mostly Python, and most of our enforceable conventions should still be in effect, we may want some style enforcement there.

Contributor

@gojomo flake8 can't correctly check .pyx files.

Contributor Author

@gojomo We were getting flake8 errors like these:

[screenshots of the flake8 errors omitted]

So although I agree that there is some style-checking we might want to do in .pyx files (in the Python-like code), to avoid errors in cases like the above I thought it would be better to exclude .pyx files from the flake8 checks.

Collaborator

Ah, I see. There's an SO answer that implies it may be possible to turn off just certain warnings for .pyx files – https://stackoverflow.com/questions/31269527/running-pep8-or-pylint-on-cython-code – though the full example file is a broken link.

Contributor Author

Thanks for sharing this link. :) I can try the config specified in that answer to check whether it turns off all the undesired warnings/errors.

@gojomo
Collaborator

gojomo commented Jun 29, 2017

While summing all the errors for a single call to train() is somewhat more useful, to a knowledgeable user, than the prior implementation, it's still not quite what users may find interpretable. (At least not compared to the running-average error printed during training by other tools like fasttext.) I also think the value might need to be passed through abs() somewhere, lest errors in opposite directions offset each other and make the total misleading. So I don't really see this as ready for merging, again preferring that someone who actively needs the feature review it.
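The cancellation concern raised above is easy to demonstrate (a toy illustration, not code from this PR): signed per-pair errors of opposite sign sum to zero, while absolute errors do not.

```python
def signed_and_absolute_totals(per_pair_errors):
    # Signed errors of opposite sign cancel when summed, which can make a
    # badly-fitting model look converged; absolute errors cannot cancel.
    return (sum(per_pair_errors),
            sum(abs(e) for e in per_pair_errors))
```

For example, errors of +0.5 and -0.5 give a signed total of 0.0 but an absolute total of 1.0.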

@piskvorky
Owner

piskvorky commented Jun 30, 2017

@chinmayapancholi13 good progress!

What do mean and std in the table refer to? Mean and std of what?

I'm most interested in seeing a performance comparison against the original. Comparison of the new switch on/off is interesting too, but we have to evaluate what we're losing by adding an extra switch to the critical path.

And like @gojomo says, it's best to run the benchmark against a larger corpus and multiple times, for robustness.

@chinmayapancholi13
Contributor Author

@piskvorky mean and std refer to the average and standard deviation of the time taken to train the word2vec model.

When we say a "larger corpus", could you please give an approximate size? As I mentioned in my comment here, I ran this comparison between the old and new training times on the 100 MB text8 corpus as well (though not for all 8 possible combinations of compute_loss, sg and hs values) and got similar results (i.e. the difference in training times was about the same percentage as in the results posted for the 50 MB test data).

@gojomo
Collaborator

gojomo commented Jun 30, 2017

I'd find a couple of tests that run for tens of minutes better than many tests that run for a couple of minutes (or just tens of seconds). Increasing iter to 20 or 40 might achieve that with text8. Also, setting hs=1 should also set negative=0 (to match the usual/recommended user choice of just one mode or the other, even though it is possible to run both at once and thus benchmark the combination).

@chinmayapancholi13
Contributor Author

Got it, @gojomo. So to conclude: we should use the text8 corpus with iter set to around 20-40 and compare the time taken by the new code with compute_loss=True against the original code (without any switch). I'll update the benchmarks as soon as I have these values. :)
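The benchmark grid being discussed can be enumerated mechanically (a sketch with hypothetical names; only the parameter names compute_loss, hs, sg, negative and iter come from the thread):

```python
from itertools import product

def benchmark_grid(iters=20):
    # Enumerate every combination of compute_loss, hs and sg; per the
    # review suggestion above, hs=1 implies negative=0 (one mode at a time).
    configs = []
    for compute_loss, hs, sg in product([True, False], [0, 1], [0, 1]):
        configs.append({"compute_loss": compute_loss, "hs": hs, "sg": sg,
                        "negative": 0 if hs else 5, "iter": iters})
    return configs
```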

@chinmayapancholi13
Contributor Author

chinmayapancholi13 commented Jul 1, 2017

@gojomo @piskvorky

Analysis of the extra time taken for training due to loss computation code

  • These values are for the full text8 corpus (100 MB) with iter=6.
  • Machine specs: 64 GB RAM, i7 processor, 3.38 GHz (up to 3.6 GHz).

| compute_loss | hs | sg | (mean, std) with switch | (mean, std) without switch |
|---|---|---|---|---|
| True | 1 | 1 | (341.065673, 0.722949) | - |
| False | 1 | 1 | (337.980214, 0.356567) | (338.080630, 0.541960) |
| True | 0 | 1 | (172.808942, 2.547450) | - |
| False | 0 | 1 | (171.416717, 1.786813) | (169.936202, 0.422062) |
| True | 1 | 0 | (103.503094, 0.930082) | - |
| False | 1 | 0 | (104.200277, 1.064672) | (102.959268, 1.517684) |
| True | 0 | 0 | (58.828162, 0.410273) | - |
| False | 0 | 0 | (58.280944, 0.300317) | (59.028026, 0.161999) |

@gojomo
Collaborator

gojomo commented Jul 1, 2017

The most interesting comparison, to determine the overhead for people who won't use the feature, would be between the old code (no changes present) and the new code with compute_loss=False.

@chinmayapancholi13
Contributor Author

@gojomo I agree. I have updated the table to include these values too.

saparina pushed a commit to saparina/gensim that referenced this pull request Jul 9, 2017

* computes training loss for skip gram

* synced word2vec.py with gensim_main

* removed unnecessary keep_bocab_item import

* synced word2vec.py with gensim_main

* PEP8 changes

* added Python-only implementation for skip-gram model

* updated param name to 'compute_loss'

* removed 'raise ImportError' statement from prev commit

* [WIP] partial changes for loss computation for skipgram case

* [WIP] updated cython code

* added unit test for training loss computation

* added loss computation for neg sampling

* removed unnecessary 'raise ImportError' stmt

* added .c and .pyx to flake8 ignore list

* added loss computation for CBOW model in Python path

* added loss computation for CBOW model in Cython path

* PEP8 (F811) fix due to var 'prod'

* updated w2v ipynb for training loss computation and benchmarking

* updated .c files

* added benchmark results
@gojomo
Collaborator

gojomo commented Jul 13, 2017

AFAICT, whatever overhead there is from the switch, or from enabling it, disappears into the variance (in these short tests), so there's no evidence of a performance-related reason to hold back.

But for the reasons in my comment #1201 (comment) I still think this was merged (and #999 closed) prematurely, because there's no confirmation of usefulness from the people who have actually requested this.

Also, while the titles still mention "skip-gram", the benchmarking included non-skip-gram-modes... so does that mean what's been done works for CBOW, too?

@chinmayapancholi13 chinmayapancholi13 changed the title [WIP] Computes training loss for skip gram model. Fixes issue #999. [WIP] Computes training loss for Word2Vec model (fixes issue #999) Jul 14, 2017
@chinmayapancholi13
Contributor Author

@gojomo Since the title of #999 mentions "skip-gram", this PR was originally only for the skip-gram mode, but later I decided to implement both modes in the same PR. Thanks for pointing this out. :) I have updated the title of the PR now.
