Improve FastText documentation #2353
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,9 @@ | |
This module contains a fast native C implementation of Fasttext with Python interfaces. It is **not** only a wrapper | ||
around Facebook's implementation. | ||
|
||
This module supports loading models trained with Facebook's fastText implementation. | ||
It also supports continuing training from such models. | ||
|
||
For a tutorial see `this notebook | ||
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb>`_. | ||
|
||
|
@@ -31,6 +34,15 @@ | |
>>> from gensim.models import FastText | ||
>>> | ||
>>> model = FastText(common_texts, size=4, window=3, min_count=1, iter=10) | ||
>>> sentences = [ | ||
... ['computer', 'artificial', 'intelligence'], | ||
... ['artificial', 'trees'], | ||
... ['human', 'intelligence'], | ||
... ['artificial', 'graph'], | ||
... ['intelligence'], | ||
... ['artificial', 'intelligence', 'system'] | ||
... ] | ||
>>> model.train(sentences, total_examples=len(sentences), epochs=model.epochs) | ||
|
||
Persist a model to disk with: | ||
|
||
|
@@ -41,7 +53,49 @@ | |
>>> fname = get_tmpfile("fasttext.model") | ||
>>> | ||
>>> model.save(fname) | ||
>>> model = FastText.load(fname) # you can continue training with the loaded model! | ||
>>> model = FastText.load(fname) | ||
|
||
Once loaded, such models behave identically to those created from scratch. | ||
For example, you can continue training the loaded model: | ||
|
||
>>> new_sentences = [ | ||
... ['sweet', 'child', 'of', 'mine'], | ||
... ['rocket', 'queen'], | ||
... ['you', 'could', 'be', 'mine'], | ||
... ['november', 'rain'], | ||
... ] | ||
>>> 'rocket' in model.wv | ||
False | ||
>>> model.train(new_sentences, total_examples=len(sentences), epochs=model.epochs) | ||
Why

It's correct. I agree that is confusing. The docstring for the train function attempts to clarify the situation. Personally, I think if neither total_examples nor total_words is specified, we should try to determine sensible defaults by looking at e.g. len(sentences). WDYT @menshikh-iv ?

Are you sure? I read the linked docs and still don't get why it's not
Please include some top-level intuition here. A short sentence on why this parameter is mandatory, and what should be its value, because it looks really strange and superfluous. +1 for sensible defaults.

For

@gojomo If the corpus does self-report its length though, should we use that instead? If yes, which should we do:
If the corpus does not self-report its length, then we could raise an exception with a helpful message. WDYT?

If it's able to self-report its length, in count of texts, then yes, that would work as

Also, while we're talking about simplifying the API, what do you think about removing the sentences and corpus_file parameters from the constructor? Currently, we have an inconsistency: in the constructor, we just pass sentences/corpus_file without total_examples and total_words parameters. In the train function, we include those additional parameters. Instead of passing sentences in the constructor, the user can pass them in separately via the train function.
Pros:
Cons:
@menshikh-iv @piskvorky @gojomo What do you think? The same thing also applies to the callbacks parameters.

@gojomo Why do you think it is inappropriate? We could do something like this:

    if total_examples or total_words:
        pass  # nothing to do here
    elif sentences and hasattr(sentences, '__len__'):  # could also check for callable, if necessary
        total_examples = len(sentences)
    elif data_corpus and hasattr(data_corpus, '__len__'):
        total_examples = len(data_corpus)
    else:
        raise ValueError(
            'unable to infer total_examples or total_words from the training source, '
            'please pass one of them explicitly'
        )

It looks ugly, but it allows the user to do something like:

    model.train(sentences)

instead of

    model.train(sentences, total_examples=len(sentences))

I feel the former is more Pythonic. Finally, I think having two separate keyword parameters for the input is confusing for the user. In my opinion, it would look a lot simpler if we unified the two parameters, and dealt with untangling them in the implementation.

@mpenkov yes, I'd consider that "sensible" defaults. Thanks. Agreed on unifying

@piskvorky OK. I think it's worth dealing with API refactoring in a separate PR, for two reasons:
To answer your question, I think it makes sense to deprecate iter from the constructor. It's a poor name for a parameter, for three reasons:
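
As a rough illustration of the "sensible defaults" idea discussed in this thread, the inference step could be pulled out into a small standalone helper. This is only a sketch: `infer_total_examples` and its `corpus` parameter are hypothetical names, not part of gensim, and it checks `is not None` rather than truthiness so that an explicit zero is still treated as explicit.

```python
def infer_total_examples(corpus=None, total_examples=None, total_words=None):
    """Return (total_examples, total_words), inferring total_examples if possible.

    Hypothetical helper for illustration only; not part of gensim.
    """
    if total_examples is not None or total_words is not None:
        # The caller was explicit, so there is nothing to infer.
        return total_examples, total_words
    if corpus is not None and hasattr(corpus, '__len__'):
        # Lists and other sized containers self-report their length.
        return len(corpus), None
    # Generators and other one-shot iterables cannot report a length.
    raise ValueError(
        'unable to infer total_examples or total_words from the training source, '
        'please pass one of them explicitly'
    )
```

With a helper like this, a list corpus would need no explicit count, while a generator would still fail early with a clear message instead of training with a wrong progress estimate.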
|
||
>>> 'rocket' in model.wv | ||
True | ||
|
||
You can also load models trained with Facebook's fastText implementation: | ||
|
||
.. sourcecode:: pycon | ||
|
||
>>> from gensim.test.utils import datapath | ||
>>> cap_path = datapath("crime-and-punishment.bin") | ||
>>> # Partial model: loads quickly, uses less RAM, but cannot continue training | ||
>>> fb_partial = FastText.load_fasttext_format(cap_path, full_model=False) | ||
>>> # Full model: loads slowly, consumes RAM, but can continue training (see below) | ||
>>> fb_full = FastText.load_fasttext_format(cap_path, full_model=True) | ||
|
||
Once loaded, such models behave identically to those trained from scratch. | ||
You may continue training them on new data: | ||
|
||
.. sourcecode:: pycon | ||
|
||
>>> 'computer' in fb_full.wv.vocab # New word, currently out of vocab | ||
False | ||
>>> 'rocket' in fb_full.wv.vocab | ||
False | ||
>>> fb_full.train(sentences, total_examples=len(sentences), epochs=model.epochs) | ||
>>> fb_full.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs) | ||
piskvorky marked this conversation as resolved.
|
||
>>> 'computer' in fb_full.wv.vocab # We have learned this word now | ||
True | ||
>>> 'rocket' in fb_full.wv.vocab | ||
True | ||
|
||
Retrieve word-vector for vocab and out-of-vocab word: | ||
|
||
|
@@ -85,6 +139,28 @@ | |
|
||
>>> analogies_result = model.wv.evaluate_word_analogies(datapath('questions-words.txt')) | ||
|
||
Implementation Notes | ||
-------------------- | ||
|
||
These notes may help developers navigate our fastText implementation. | ||
Our FastText implementation is split across several submodules: | ||
|
||
- :py:mod:`gensim.models.fasttext`: This module. Contains FastText-specific functionality only. | ||
- :py:mod:`gensim.models.keyedvectors`: Implements both generic and FastText-specific functionality. | ||
- :py:mod:`gensim.models.word2vec`: | ||
piskvorky marked this conversation as resolved.
|
||
- :py:mod:`gensim.models.base_any2vec`: | ||
- :py:mod:`gensim.models.utils_any2vec`: Wrapper over Cython extensions. | ||
|
||
Our implementation relies heavily on inheritance. | ||
It consists of several important classes: | ||
|
||
- :py:class:`FastTextVocab`: the vocabulary. Redundant, simply wraps its superclass. | ||
mpenkov marked this conversation as resolved.
|
||
- :py:class:`~gensim.models.keyedvectors.FastTextKeyedVectors`: the vectors. | ||
Once training is complete, this class is sufficient for calculating embeddings. | ||
- :py:class:`FastTextTrainables`: the underlying neural network. The implementation | ||
uses this class to *learn* the word embeddings. | ||
- :py:class:`FastText`: ties everything together. | ||
|
||
""" | ||
|
||
import logging | ||
|
@@ -759,7 +835,8 @@ def load_fasttext_format(cls, model_file, encoding='utf8'): | |
|
||
Notes | ||
------ | ||
Due to limitations in the FastText API, you cannot continue training with a model loaded this way. | ||
This function effectively ignores `.vec` output file. | ||
piskvorky marked this conversation as resolved.
|
||
It only needs the `.bin` file. | ||
|
||
Parameters | ||
---------- | ||
|
@@ -773,7 +850,7 @@ def load_fasttext_format(cls, model_file, encoding='utf8'): | |
|
||
Returns | ||
------- | ||
:class: `~gensim.models.fasttext.FastText` | ||
gensim.models.fasttext.FastText | ||
The loaded model. | ||
|
||
""" | ||
|
@@ -862,15 +939,44 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=None, case_inse | |
return self.wv.accuracy(questions, restrict_vocab, most_similar, case_insensitive) | ||
|
||
|
||
# | ||
# Keep for backward compatibility. | ||
# | ||
class FastTextVocab(Word2VecVocab): | ||
"""This is a redundant class. It exists only to maintain backwards compatibility | ||
with older gensim versions.""" | ||
pass | ||
|
||
|
||
class FastTextTrainables(Word2VecTrainables): | ||
"""Represents the inner shallow neural network used to train :class:`~gensim.models.fasttext.FastText`.""" | ||
"""Represents the inner shallow neural network used to train :class:`~gensim.models.fasttext.FastText`. | ||
|
||
Mostly inherits from its parent (:py:class:`gensim.models.word2vec.Word2VecTrainables`). | ||
Adds logic for calculating and maintaining ngram weights. | ||
|
||
Attributes | ||
---------- | ||
|
||
menshikh-iv marked this conversation as resolved.
|
||
hashfxn : function | ||
Used for randomly initializing weights. Defaults to the built-in hash() | ||
piskvorky marked this conversation as resolved.
|
||
layer1_size : int | ||
The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the :py:class:`gensim.models.word2vec.Word2VecTrainables` constructor. | ||
seed : float | ||
The random generator seed used in reset_weights and update_weights | ||
syn1 : numpy.array | ||
The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical sampling is used. | ||
syn1neg : numpy.array | ||
Similar to syn1, but only set if negative sampling is used. | ||
vectors_lockf : numpy.array | ||
A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones. | ||
vectors_vocab_lockf : numpy.array | ||
Similar to vectors_vocab_lockf, ones(len(model.trainables.vectors), dtype=REAL) | ||
vectors_ngrams_lockf : numpy.array | ||
np.ones((self.bucket, wv.vector_size), dtype=REAL) | ||
|
||
Notes | ||
----- | ||
|
||
The lockf stuff looks like it gets used by the fast C implementation. | ||
piskvorky marked this conversation as resolved.
|
||
|
||
""" | ||
def __init__(self, vector_size=100, seed=1, hashfxn=hash, bucket=2000000): | ||
super(FastTextTrainables, self).__init__( | ||
vector_size=vector_size, seed=seed, hashfxn=hashfxn) | ||
|
Where is this model.epochs coming from? The model instantiation above shows no such variable.

Yes, the epochs parameter is optional, so not included during instantiation.
Unfortunately, there is also some confusion about its name: the FastText constructor uses iter to specify the number of epochs, whereas the superclass uses the proper name epochs.
The presence of the epochs parameter to the train function (which seems to override the one set in the constructor) also complicates matters.

Hm. If it's optional, let's not use it in train. Or if we use it in train, let's instantiate it explicitly. This neither-here-nor-there example is confusing ("where does this value come from?").
Regarding iter/epochs -- can you please rename it to epochs, consistently? I remember some discussion around this (cc @menshikh-iv @gojomo ), but can't imagine why we'd want both. At most we could support iter for a while as an alias, but with a clear deprecation warning.
This is a perfect opportunity to clean up some of the API mess, rather than piling on.

I agree regarding the cleanup. My preference would be to leave epochs/iter out of the constructor. The model doesn't need that parameter until training time.

Models in Gensim generally allow the trained_model = Constructor(params_including_training_params) pattern. So breaking that could be confusing to existing users (and a big backward-incompatible change).
I'm not totally opposed though, especially if we still allow constructor params for a while with "deprecated" warnings. The API needs a clean up, and now is a good time.
Not a big priority though, and the documentation examples can already promote the "instantiate then train, as 2 steps" pattern.

It's not just training parameters you need to include in the constructor. It's also parameters for vocabulary creation. So you're managing at least 3 sets of separate parameters, 2 of which are duplicated by other methods of the class.

Yes, we should promote them as separate steps in docs. Question is, do we deprecate (certainly not remove) them from the constructor?

I understand your motivation in not removing them (backward compatibility). Unfortunately, the current mess won't go away until we remove things like this.
I think the first step should be to deprecate them. After a while, we can remove them, perhaps in time for a major release.
If we want a one-liner way to instantiate and train, we can always write a pure function and promote that. That should make it easier for users to cut over to the cleaner API.

Yes, deprecation is what I suggest.
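
For readers following the deprecation discussion, here is a minimal sketch of the two ideas raised above: keeping iter as a deprecated alias for epochs, and offering a pure function for one-line instantiate-and-train. Everything here is hypothetical: `ToyModel` and `train_toy_model` are illustrative stand-ins, not gensim classes or functions, and the "training" is only a counter.

```python
import warnings


class ToyModel:
    """Hypothetical stand-in for a trainable model; not gensim's FastText."""

    def __init__(self, epochs=5, iter=None):
        if iter is not None:
            # Accept the old name for now, but steer callers to the new one.
            warnings.warn(
                "the 'iter' parameter is deprecated, use 'epochs' instead",
                DeprecationWarning,
                stacklevel=2,
            )
            epochs = iter
        self.epochs = epochs
        self.examples_seen = 0

    def train(self, sentences, total_examples=None, epochs=None):
        # Fall back to the corpus length and the constructor's epoch count.
        if total_examples is None:
            total_examples = len(sentences)
        if epochs is None:
            epochs = self.epochs
        # Placeholder for real training: just count processed examples.
        self.examples_seen += total_examples * epochs


def train_toy_model(sentences, epochs=5):
    """Instantiate and train in one call, as two explicit steps inside."""
    model = ToyModel(epochs=epochs)
    model.train(sentences, total_examples=len(sentences))
    return model
```

Under this sketch, existing callers passing iter keep working but see a DeprecationWarning, and the convenience function preserves the one-liner without overloading the constructor.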