Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2 #1695

Merged
merged 21 commits into from
Nov 8, 2017

Conversation

jodevak
Copy link
Contributor

@jodevak jodevak commented Nov 6, 2017

As requested, this is a new pull request. Thanks

half_precision_model_kv = keyedvectors.KeyedVectors.load_word2vec_format(
testfile(), binary=True, datatype=np.float16
)
self.assertEqual(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)
self.assertEquals(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright

["minors", "survey", "minors", "survey", "minors"]
]
model = word2vec.Word2Vec(sentences, size=10, min_count=0, max_vocab_size=2, seed=42, hs=1, negative=0)
self.assertTrue(len(model.wv.vocab), 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you need assertEqual, not assertTrue, don't you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

.. [#taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
.. [#deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
.. [taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
.. [deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but why are you remove # in citate ? (#1633)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autopep8 tool did

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file is merged with an older version.

@@ -647,15 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No

Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
>>> model.build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, whats the problem with this ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces after :, , (in comment fixed variant)

@@ -647,15 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No

Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Model is undefined, please create model first (docstring should be executable, i.e. I can copy-paste this code to console and I expect that code run successfully) we plan to add doctests to our CI soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


self.corpus_count = corpus_count if corpus_count else 0
self.raw_vocab = vocab
self.corpus_count = corpus_count if corpus_count else 0 # Since no sentences are provided, this is to control the corpus_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 - two spaces before #

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are 2 space, arent they ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, really, sorry


self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule,update=update) # trim by min_count & precalculate downsampling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return previous variant

@@ -675,14 +679,14 @@ def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
type(sentence)
)
checked_string_types += 1
if sentence_no % progress_per == 0:
if sentence_no % progress_per == 0 and sentence_no != 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because 0% anything will equal to 0; so the logger will log a statement saying sentence 0 and processed 0.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we want that :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we want that :)

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test based on #1599 (comment) and fix log message based on this comment - #1599 (comment)

After this - I'll merge your PR

@@ -647,13 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No

Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
>>> from gensim.models.word2vec import Word2Vec
>>> model=Word2Vec()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 model = Word2Vec()

@jodevak
Copy link
Contributor Author

jodevak commented Nov 6, 2017

@menshikh-iv function testPruneVocab is already there .

@menshikh-iv
Copy link
Contributor

@jodevak need add test for total_words, because you change "counting logic"

@jodevak
Copy link
Contributor Author

jodevak commented Nov 6, 2017

@menshikh-iv Do you have any suggestions to test total_words, other than adding new attributes to the model object nor returning total_words as a value ?

@menshikh-iv
Copy link
Contributor

I see 3 variants

  1. Test for logger output (strange way, but why not)
  2. Make _total_words attr & check it after build_vocab
  3. Return total_words from build_vocab

@piskvorky what's variant looks best for you?

@jodevak
Copy link
Contributor Author

jodevak commented Nov 7, 2017

@menshikh-iv Choice 3 seems most convenient to me.

@menshikh-iv
Copy link
Contributor

Thank you @jodevak 👍

@menshikh-iv menshikh-iv merged commit 40b0417 into piskvorky:develop Nov 8, 2017
VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
…#1695)

* fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab

* Fixing Indentation

* Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace

* Remove trailing white spaces

* Adding test

* fix spaces

* iteration 2 on code

* iteration 2 on code

* Fixing old version of word2vec.py merge problems

* Fixing indent

* Fixing Styling

* Fixing Styling

* test

* test

* adding total words count test

* adding total words count test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants