Re-design "*2vec" implementations #1777

Merged
merged 119 commits into from
Feb 1, 2018
Changes from 77 commits
31943ae
first design draft
manneshiva Dec 11, 2017
d7209f4
adds public interfaces
manneshiva Dec 13, 2017
fe19b9a
adds VocabItem and cleans BaseKeyedVectors
manneshiva Dec 13, 2017
fece94f
adds explicit parameters
manneshiva Dec 13, 2017
e310dbf
implements `train` and adds `Callback` functionality
manneshiva Dec 14, 2017
30872ac
refactors `train`, adds classes for vocabulary building and trainable…
manneshiva Dec 18, 2017
2892f37
changes function parameters
manneshiva Dec 19, 2017
4b1e7f8
fixes minor errors
manneshiva Dec 19, 2017
68ac5bc
starts refactoring `Word2Vec` based on new design
manneshiva Dec 19, 2017
7f60a47
removes `build_vocab_from_freq`, corrects `reset_from`
manneshiva Dec 19, 2017
abc5702
changes attribute names
manneshiva Dec 19, 2017
b60a9d5
adds saving/loading from word2vec format
manneshiva Dec 19, 2017
ca1eae9
refactors/renames variables based on new design
manneshiva Dec 19, 2017
dab7b99
fixes **not** storing normalized vectors and recalculable tables
manneshiva Dec 19, 2017
d249668
replaces `syn0` with `vectors`, adds `estimate_memory`
manneshiva Dec 20, 2017
99cf2ad
fixes indents
manneshiva Dec 20, 2017
267c682
starts `FastText` refactoring based on new design
manneshiva Dec 20, 2017
c2bbb20
refactors to call common methods from `word2vec_utils`, removes depre…
manneshiva Dec 21, 2017
7d774d7
refactors `FastText`
manneshiva Dec 21, 2017
9b156f5
adds common methods in `word2vec_utils`
manneshiva Dec 21, 2017
0db83f1
refactors keyedvectors for FT & W2V by creating a common base class
manneshiva Dec 21, 2017
b761dff
creates a common base class for Word2Vec and FastText
manneshiva Dec 24, 2017
817f71b
deletes word2vec_utils.py
manneshiva Dec 24, 2017
75892cc
extracts logging to separate methods
manneshiva Dec 25, 2017
61c4e5e
corrects alpha decay, modifies `_get_thread_working_mem` to support d…
manneshiva Dec 26, 2017
707aef3
refactors doc2vec initialization and training
manneshiva Dec 26, 2017
e370314
minor fixes to support doc2vec
manneshiva Dec 26, 2017
45347f3
corrects parameter setting while calling `train`
manneshiva Dec 26, 2017
ab8dd4b
deletes `callbacks`, fixes alpha setting and degradation from `train`
manneshiva Dec 26, 2017
679e82f
adds post training methods and keyedvectors for docvecs
manneshiva Dec 26, 2017
1f488a5
extracts common methods as functions, discard unnecessary function call
manneshiva Dec 27, 2017
0f666f4
shifts adding null word from trainables to vocab class
manneshiva Dec 27, 2017
6a9171d
unifies variable naming
manneshiva Dec 27, 2017
1246d13
moves corpus_count from vocabulary to model attribute
manneshiva Dec 27, 2017
4bba589
refactors test cases and corrects failing cases
manneshiva Dec 27, 2017
26b9b06
removes old import
manneshiva Dec 27, 2017
a923e7e
fixes errors
manneshiva Dec 27, 2017
51df908
creates separate class for callbacks, adds saving and loss capturing …
manneshiva Dec 27, 2017
bfae0e7
refactors poincare keyedvectors base and related changes
manneshiva Dec 28, 2017
9c261e0
extracts save/load_word2vec_format as functions to avoid code repitio…
manneshiva Dec 29, 2017
d8d22bd
removes model initialization to None
manneshiva Dec 29, 2017
8301b03
shifts cum_tables, make_cum_table & create_binary_tree from trainable…
manneshiva Jan 3, 2018
36e6a30
adds fasttext test cases
manneshiva Jan 3, 2018
ae60bd8
adds doc strings for public APIs for D2V, W2V & FT
manneshiva Jan 4, 2018
eddd24e
adds docstrings for keyedvectors
manneshiva Jan 4, 2018
f3d76cf
resolves failing test cases
manneshiva Jan 5, 2018
9367dc6
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
manneshiva Jan 5, 2018
a721aac
updates cython generated .c files
manneshiva Jan 5, 2018
65b8821
corrects error statement when failing to import FAST VERSION
manneshiva Jan 5, 2018
9f1103e
betters logging
manneshiva Jan 5, 2018
52d1e5f
deletes fasttext wrapper
manneshiva Jan 7, 2018
6941e1e
fixes PEP8 long lines error
manneshiva Jan 7, 2018
8574055
fixes non-any2vec failing test cases
manneshiva Jan 7, 2018
173a8e9
deletes testing pure python any2vec implementations from tox
manneshiva Jan 7, 2018
be73e0b
fixes test_similarities failing test cases
manneshiva Jan 7, 2018
0fc8340
fixes PEP8 errors
manneshiva Jan 8, 2018
673086d
fixes python3 failing test cases
manneshiva Jan 8, 2018
ce0dee9
renames syn0 to vectors in keras integration test
manneshiva Jan 8, 2018
f300088
fixes annoy notebook failure
manneshiva Jan 8, 2018
211c286
adds property aliases for backward compatibility
manneshiva Jan 8, 2018
b4700ed
adds properties and methods for backward compatibility
manneshiva Jan 8, 2018
142c8a6
removes trainables save
manneshiva Jan 8, 2018
74ce823
minor changes to test cases
manneshiva Jan 8, 2018
3281a73
shifts epoch saver callback to an example in docstring
manneshiva Jan 9, 2018
b1a7390
adds deleters for syn1 & syn1neg
manneshiva Jan 9, 2018
995d1cf
deprecates old KeyedVectors in favour of Word2VecKeyedVectors
manneshiva Jan 9, 2018
fc9e77f
reverts word2vec_pre_kv_py2 saved models to original
manneshiva Jan 10, 2018
9fba59f
adds deprecated models and dependent python files
manneshiva Jan 10, 2018
c9d9ec8
adds unit tests for loading old models
manneshiva Jan 10, 2018
883cb81
imports deprecated in model.__init__
manneshiva Jan 10, 2018
7fcc3a4
removes .wv.most_similar calls
manneshiva Jan 10, 2018
9001abe
adds code to support loading old models
manneshiva Jan 10, 2018
6c42905
adds cython auto generated .c files
manneshiva Jan 10, 2018
0bea623
fixes PEP8 failures & fetching attributes from pre_kv word2vec models
manneshiva Jan 10, 2018
09f9bdf
fixes num_ngram_vectors
manneshiva Jan 10, 2018
4b142a0
fixes estimate_memory, shifts BaseKeyedVectors to keyedvectors.py
manneshiva Jan 11, 2018
710c124
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
manneshiva Jan 11, 2018
6f1c522
Merge branch 'develop' into refactor_any2vec
manneshiva Jan 12, 2018
60db35d
fixes review comments -- typos, indents, adding deprecated. No design…
manneshiva Jan 12, 2018
922ae60
Merge branch 'refactor_any2vec' of https://github.com/manneshiva/gens…
manneshiva Jan 12, 2018
f3e2259
fixes PEP8
manneshiva Jan 12, 2018
0a76c2a
shifts *KeyedVectors to keyedvectors.py
manneshiva Jan 13, 2018
06e03ef
de-duplicates data between keyedvectors, vocabulary, trainables and r…
manneshiva Jan 15, 2018
9aa9b66
fixes failing cases
manneshiva Jan 15, 2018
cbffa32
removes unused vocabulary parameter from methods
manneshiva Jan 15, 2018
4caa3f4
removes base classes for vocabulary & trainables, cleans code
manneshiva Jan 16, 2018
31f9943
removes build_vocab from BaseAny2VecModel
manneshiva Jan 16, 2018
5650bab
fixes vector size for doc2vec
manneshiva Jan 21, 2018
da539e2
Fix typo in classname
menshikh-iv Jan 23, 2018
818439d
remove docs for fasttext wrapper
menshikh-iv Jan 23, 2018
54c9b2e
update docstrings for callback
menshikh-iv Jan 23, 2018
bb54290
Merge remote-tracking branch 'upstream/develop' into refactor_any2vec
menshikh-iv Jan 23, 2018
0d1c48c
Fix documentation build
menshikh-iv Jan 23, 2018
13f5ea9
light cleanup for docstrings
menshikh-iv Jan 25, 2018
8cc2bf6
renames private util_any2vec functions
manneshiva Jan 28, 2018
ac2d01f
adds deprecated warning for attributes
manneshiva Jan 28, 2018
0fae977
adds deprecated warnings.warn for old doc2vec parameters
manneshiva Jan 28, 2018
d58dc41
shifts any2vec callback under gensim/models
manneshiva Jan 28, 2018
2422994
adds pure python implementations
manneshiva Jan 28, 2018
401d46e
fixes PEP8 errors
manneshiva Jan 28, 2018
46b0b3a
changes build_vocab method signature
manneshiva Jan 28, 2018
902aed7
fixes vocabulary trimming error
manneshiva Jan 29, 2018
3562818
fixes long line
manneshiva Jan 29, 2018
83374be
removes deprecated/utils
manneshiva Jan 30, 2018
d8455fa
adds old_saveload to deprecated
manneshiva Jan 30, 2018
1f38dc7
removes unused import
manneshiva Jan 30, 2018
cd4e22d
returns fasttext wrapper
manneshiva Feb 1, 2018
fd2e697
adds alias iter setter
manneshiva Feb 1, 2018
02072b1
fixes fasttext load error
manneshiva Feb 1, 2018
114ab5f
ignores PEP8 unused import
manneshiva Feb 1, 2018
0179835
Return fasttext wrapper rst
menshikh-iv Feb 1, 2018
0601c69
Add rst for deprecated stuff
menshikh-iv Feb 1, 2018
572c960
Add all needed deprecations, upd *.rst.
menshikh-iv Feb 1, 2018
62b0852
add description for deprecated package
menshikh-iv Feb 1, 2018
e9ebaa8
add missing import + return env war to tox config
menshikh-iv Feb 1, 2018
d7cee63
drop useless import
menshikh-iv Feb 1, 2018
79c1263
adds num_ngrams_vectors property
manneshiva Feb 1, 2018
19f2ee5
reverts to calling old attributes in all tests
manneshiva Feb 1, 2018
7a32739
fixes PEP8
manneshiva Feb 1, 2018
2 changes: 1 addition & 1 deletion gensim/__init__.py
@@ -3,7 +3,7 @@
similarities within a corpus of documents.
"""

-from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils # noqa:F401
+from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils, callbacks # noqa:F401
Contributor

I think the callbacks module should live at least under gensim.models; I see no reason for it to be in the root of gensim.

Contributor Author

gensim.models.callbacks already exists. I can add my Callback base class to that file, but I would have to rename it, because another class with the same name already exists there (link).

import logging

__version__ = '3.2.0'
49 changes: 49 additions & 0 deletions gensim/callbacks.py
@@ -0,0 +1,49 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Abstract base class to build callbacks. Callbacks are used to apply custom functions over the model
Contributor

This isn't "abstract"; maybe you should use abc for this purpose.

Contributor Author

It may not make sense to make this inherit from ABC and use @abstractmethod, since not all the methods (on_epoch_end, on_train_begin, etc.) need to be re-implemented, only the ones the user requires. Changing 'Abstract base class' to 'Base class' in the docstring.
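
The trade-off discussed in this thread can be sketched as follows. This is an illustrative comparison, not code from the PR: `HookBase`, `StrictHookBase`, `EpochCounter`, and `PartialStrict` are hypothetical names. With plain no-op hooks a subclass overrides only what it needs, whereas `abc.abstractmethod` would require every hook to be implemented before the subclass can be instantiated.

```python
from abc import ABC, abstractmethod


class HookBase(object):
    """Plain base class (hypothetical): every hook is a no-op by default."""

    def on_train_begin(self, model):
        pass

    def on_epoch_end(self, model):
        pass


class StrictHookBase(ABC):
    """ABC variant (hypothetical): every hook must be overridden."""

    @abstractmethod
    def on_train_begin(self, model):
        pass

    @abstractmethod
    def on_epoch_end(self, model):
        pass


class EpochCounter(HookBase):
    """Overrides a single hook; the other keeps its no-op default."""

    def __init__(self):
        self.epochs = 0

    def on_epoch_end(self, model):
        self.epochs += 1


class PartialStrict(StrictHookBase):
    """Overrides only one of the two abstract hooks."""

    def on_epoch_end(self, model):
        pass


counter = EpochCounter()
counter.on_train_begin(model=None)  # inherited no-op
counter.on_epoch_end(model=None)    # overridden hook increments the counter

try:
    PartialStrict()  # on_train_begin is still abstract, so this raises
    strict_instantiable = True
except TypeError:
    strict_instantiable = False
```

The no-op variant matches what the author proposes here: users implement only the hooks they care about.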

at specific points during training (epoch start, batch end etc.). To implement a Callback, subclass
:class:`~gensim.callbacks.Callback`, look at the example below which creates a callback to save a training model
after each epoch:

>>> from gensim.test.utils import common_texts as sentences
Contributor

Good example, I like it 👍

>>> from gensim.callbacks import Callback
>>> from gensim.models import word2vec
>>> class ModelEpochSaver(Callback): # Callback to save model after every epoch
>>>     def __init__(self, path_prefix):
>>>         self.path_prefix = path_prefix
>>>     def on_epoch_end(self, model):
>>>         model.save('{}_epoch{}'.format(self.path_prefix, self.cur_epoch))
>>>         self.cur_epoch += 1
>>>     def on_train_begin(self, model):
>>>         self.cur_epoch = 0
>>> epoch_saver = ModelEpochSaver('axax')
>>> model = word2vec.Word2Vec(sentences, iter=5, size=10, min_count=0, seed=42, callbacks=[epoch_saver])

"""


class Callback(object):
    """Abstract base class used to build new callbacks."""
Contributor

Please remove "Abstract", since the class is not really abstract.


    def __init__(self):
Contributor

redundant

Contributor

Need to add *args, **kwargs, because the user will probably want to store something in this class, for example for evaluation purposes.

Contributor Author

We can take out __init__ and let the user define it.

Contributor

@janpom, Jan 15, 2018

@menshikh-iv Not sure what you mean. If anyone wanted to use Callback, they would subclass it and would add their own __init__ with any parameters they need. I can't see how *args, **kwargs in the parent __init__ would help.

        pass

    def on_epoch_begin(self, model):
        pass

    def on_epoch_end(self, model):
        pass

    def on_batch_begin(self, model):
        pass

    def on_batch_end(self, model):
        pass

    def on_train_begin(self, model):
        pass

    def on_train_end(self, model):
        pass
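
A minimal sketch of how a training loop might drive these hooks, together with a subclass that defines its own `__init__` to hold state, as suggested in the review thread on `__init__` above. The `run_training` function and the `EventRecorder` class are hypothetical and only illustrate one plausible call order; they are not gensim's actual training code. `Callback` is redefined locally so that the block is self-contained.

```python
class Callback(object):
    """Local stand-in that mirrors the hook names of the class in this diff."""

    def on_epoch_begin(self, model):
        pass

    def on_epoch_end(self, model):
        pass

    def on_batch_begin(self, model):
        pass

    def on_batch_end(self, model):
        pass

    def on_train_begin(self, model):
        pass

    def on_train_end(self, model):
        pass


class EventRecorder(Callback):
    """Hypothetical subclass: its own __init__ stores whatever state it needs."""

    def __init__(self, label):
        self.label = label
        self.events = []

    def on_train_begin(self, model):
        self.events.append('train_begin')

    def on_epoch_end(self, model):
        self.events.append('epoch_end')

    def on_train_end(self, model):
        self.events.append('train_end')


def run_training(model, epochs, batches_per_epoch, callbacks=()):
    """Hypothetical loop showing one plausible order of hook calls."""
    for cb in callbacks:
        cb.on_train_begin(model)
    for _ in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(model)
        for _ in range(batches_per_epoch):
            for cb in callbacks:
                cb.on_batch_begin(model)
            # A real trainer would update the model on one batch here.
            for cb in callbacks:
                cb.on_batch_end(model)
        for cb in callbacks:
            cb.on_epoch_end(model)
    for cb in callbacks:
        cb.on_train_end(model)


recorder = EventRecorder('demo')
run_training(model=None, epochs=2, batches_per_epoch=3, callbacks=[recorder])
```

Because the base class needs no constructor arguments, the subclass is free to choose its own `__init__` signature without the parent accepting `*args, **kwargs`.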
3 changes: 2 additions & 1 deletion gensim/models/__init__.py
@@ -13,7 +13,7 @@
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec # noqa:F401
from .doc2vec import Doc2Vec # noqa:F401
-from .keyedvectors import KeyedVectors # noqa:F401
+from .word2vec import Word2VecKeyedVectors as KeyedVectors # noqa:F401
Contributor

The old variant (like the one from @jayantj), with KeyedVectors = SomeNewClass, is better for documentation reasons: with this import, KeyedVectors does not exist as a class, so it will never appear in the documentation.

from .ldamulticore import LdaMulticore # noqa:F401
from .phrases import Phrases # noqa:F401
from .normmodel import NormModel # noqa:F401
@@ -23,6 +23,7 @@
from .translation_matrix import TranslationMatrix, BackMappingTranslationMatrix # noqa:F401

from . import wrappers # noqa:F401
+from . import deprecated # noqa:F401

from gensim import interfaces, utils
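
The aliasing point raised in the review of this file can be illustrated in plain Python. In this sketch `Word2VecKeyedVectors` is an empty stand-in class, not gensim's implementation. An alias, whether created with `import ... as` or with a module-level assignment, binds a second name to the same class object, so no separate `KeyedVectors` class exists and the class keeps its original `__name__`. The reviewer's concern is about where the alias is defined and whether documentation tooling can find it there.

```python
class Word2VecKeyedVectors(object):
    """Stand-in for the real class; only the name matters here."""


# Module-level alias, the form the reviewer prefers.
KeyedVectors = Word2VecKeyedVectors

same_object = KeyedVectors is Word2VecKeyedVectors
alias_name = KeyedVectors.__name__  # the class keeps its original name
```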
