Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][DNM] error-resistant train(). Fix #1052 #1139

Closed
wants to merge 2 commits into from

Conversation

gojomo
Copy link
Collaborator

@gojomo gojomo commented Feb 9, 2017

Changes train() to require explicit epochs argument, and explicit size of corpus (either total_examples or total_words) – thus forcing users to be conscious of train()s behavior in these regards, and (intentionally) breaking older code that might be operating under misconceptions.

Tests updated to be as explicit as needed.

To do:

  • Update any notebooks using train()

Also a potential related change: add an optional epoch_callback argument to train(), that could be called after each training epoch. That could completely obviate the need for users/tutorials running train() in their own loop to do extra steps (such as evaluation) before training is done.

@gojomo gojomo changed the title [WIP][DNM] error-proof train() [WIP][DNM] error-resistant train() Feb 9, 2017
@tmylk tmylk changed the title [WIP][DNM] error-resistant train() [WIP][DNM] error-resistant train(). Fix #1052 Mar 8, 2017
@robotcator
Copy link
Contributor

@gojomo Hello, when I fix the notebooks/docs to reflect the change of "train", I got an error in the notebooks/doc2vec-IMDB.ipynb:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-4bb56643bf06> in <module>()
     53 
     54         for txt in txt_files:
---> 55             with open(txt, 'r', encoding='utf-8') as t:
     56                 control_chars = [chr(0x85)]
     57                 t_clean = t.read()

TypeError: 'encoding' is an invalid keyword argument for this function

I use Ubuntu 14.04, python 2.7.6 and gensim version 1.0.1, the error was caused by 'encoding' is an invalid keyword argument for the open function in python2.7. Dose it need to change the code for support the python2?

@tmylk
Copy link
Contributor

tmylk commented Mar 12, 2017

The notebooks should support both Python 2 and 3, same as our source code. Unfortunately we have been lax about it before - fixing it as a part of this release would be a great contribution.

@prakhar2b
Copy link
Contributor

@tmylk will python 2 and 3 compatibility library six be a good choice here ?

@tmylk
Copy link
Contributor

tmylk commented Mar 12, 2017

Yes, gensim already uses six

@robotcator
Copy link
Contributor

robotcator commented Mar 12, 2017

@prakhar2b would you mind guide me to work through this? I think this is a good opportunity for me to get family with the process of contribution. My initial thought was library codecs.

@prakhar2b
Copy link
Contributor

@robotcator Definitely, that would be a pleasure. I suggest you post your doubts on gensim google group, the community is very active in responding. You can also mail me directly if you want - er.prakhar2b@gmail.com

@tmylk
Copy link
Contributor

tmylk commented Mar 17, 2017

Ping @robotcator for a status update

@robotcator
Copy link
Contributor

@tmylk sorry for late status update. I have fixed the compatibility between Python 3 and Python 2 by using codecs library. But the code runs in Python 2 seems inefficiently, I am stuck in finding a solution to optimizing python2 code.

@tmylk
Copy link
Contributor

tmylk commented Mar 17, 2017

Could you please update the pr with the new code?

@robotcator
Copy link
Contributor

@tmylk sorry for late reply. I have make an silly mistake on the git. I create an pull request#1220. Since I am not familiar with git and I am curious about why there are so many commits and comments on my pull request.

@robotcator
Copy link
Contributor

@tmylk Of course.

@robotcator
Copy link
Contributor

robotcator commented Mar 23, 2017

@tmylk I have create pr 1237 to to make sure that ValueError is thrown when params are not supplied. I don't know why there are conflicts with the base branch.

tmylk pushed a commit that referenced this pull request Mar 30, 2017
…Fix #1052. (#1237)

* fix the compatibility between python2 & 3

* require explicit corpus size, epochs for train()

* make all train() calls use explicit count, epochs

* add tests to make sure that ValueError is indeed thrown

* update test

* fix the word2vec's reset_from()

* require explicit corpus size, epochs for train()

* make all train() calls use explicit count, epochs

* fix some error

* fix test error
@tmylk
Copy link
Contributor

tmylk commented Apr 3, 2017

@gojomo Could you please write a note for users on how to upgrade their code for this change? It will be the breaking change to warrant a major release to gensim 2.0.0

@@ -448,7 +448,8 @@ def __init__(
if isinstance(sentences, GeneratorType):
raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)
self.train(sentences, total_examples=self.corpus_count, epochs=self.iter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent please

@gojomo
Copy link
Collaborator Author

gojomo commented Apr 3, 2017

The note could be:

Any direct calls to method train() of Word2Vec/Doc2Vec now require an explicit epochs parameter and explicit estimate of corpus size (usually as total_examples). See the [method documentation] for more information.

I'm in favor of generous version-incrments, but if considering a "2.0.0" release to signal a break in backward compatibility, so soon after the "1.0.0" release, we may want to look at any other prposed/pending/desirable changes in names/defaults/parameters, to include as well.

Some things that come to mind:

  • infer_vector() defaults & parameter names
  • KeyedVector property/method names still retain lots of word2vec-specific details – like word where the vectors might be more general, or syn0 when it's no longer necessarily the sort of NN-layer that has that name
  • Doc2Vec's DocvecsArray may be mergeable with KeyedVectors, eliminating some code duplication, adding some other flexibility (at cost of complexity) to KeyedVectors, maybe making model.wv and model.dv perfectly-parallel ways to access word-vectors-by-key and doc-vectors-by-key, respectively.

@gojomo
Copy link
Collaborator Author

gojomo commented Apr 3, 2017

We could use a new label to mark issues/PRs where breaks with backward-compatibility are happening (or proposed), requiring sync w/ version numbering. I've created one for that purpose, and applied it here to get things started.

@gojomo gojomo added the breaks backward-compatibility Change breaks backward compatibility label Apr 3, 2017
@tmylk
Copy link
Contributor

tmylk commented Apr 4, 2017

Thanks for suggesting more ways to make the API and code more streamlined.

Unfortunately SemVer adoption means that (contrary to topic modelling) "Version numbers are not for humans"

Small changes in releases are better than big releases, so if we get to version 10.1.0 by the end of the year then it is ok.

@gojomo
Copy link
Collaborator Author

gojomo commented Apr 4, 2017

The idea of batching together sets of breaking changes, to help minimize user upgrade efforts, seems orthogonal to SemVer adoption to me. But if you'd like to move to a more rapid tempo, sure.

That style of numbering/rapid-incremental-release might also justify eventually splitting gensim into narrow single-algorithm subprojects (with some shared core dependencies). Then small changes, not affecting most users, would only trigger a release/version-increment/need-to-review-release-notes among the users of that one module.

@tmylk
Copy link
Contributor

tmylk commented Apr 10, 2017

Merged in #1237

@tmylk tmylk closed this Apr 10, 2017
@piskvorky
Copy link
Owner

@gojomo any suggestions where those "subproject split lines" ought to be?

@gojomo
Copy link
Collaborator Author

gojomo commented Apr 11, 2017

@piskvorky In the very-small, very-focused packaging style of (say) node/npm, almost every module would be a different package. There'd be one or a few core or util packages, but then nearly every 'model' would be its own package. In the extreme, even Word2Vec and Doc2Vec would be different packages (with Doc2Vec depending on Word2Vec).

Still mulling over whether I'd advocate this – it's less common in the Python world than the JS world, and may result in a bewildering variety of projects each with their own version-number – but it does mesh with rigorous SemVer and rapid, high-resolution release cycles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaks backward-compatibility Change breaks backward compatibility
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants