Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve FastText documentation #2353

Merged
merged 40 commits into from
Jan 29, 2019
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
18f5302
WIP: doco improvements
mpenkov Jan 24, 2019
b57a086
more doco
mpenkov Jan 25, 2019
5fa0c9f
Merge remote-tracking branch 'upstream/develop' into doc-improv
mpenkov Jan 25, 2019
d66f55c
flake8-docs updates
mpenkov Jan 25, 2019
9696657
adding fixmes
mpenkov Jan 25, 2019
3019bea
minor fixup
mpenkov Jan 26, 2019
74b740c
review response
mpenkov Jan 26, 2019
09ab630
Remove magic constant
mpenkov Jan 26, 2019
2e728cb
deprecate the iter parameter to the FastText constructor
mpenkov Jan 26, 2019
c435a8e
minor documentation fixes
mpenkov Jan 26, 2019
c688877
review response: use absolute references
mpenkov Jan 26, 2019
677679c
review response
mpenkov Jan 26, 2019
29c5210
fix unit test
mpenkov Jan 26, 2019
044d699
Revert "deprecate the iter parameter to the FastText constructor"
mpenkov Jan 26, 2019
f9df136
Revert "fix unit test"
mpenkov Jan 26, 2019
cdc727a
more documentation improvements
mpenkov Jan 27, 2019
4ea3f06
comment out pesky import
mpenkov Jan 27, 2019
e532d62
fix typo
mpenkov Jan 27, 2019
931d3d7
improve tutorial notebook
mpenkov Jan 27, 2019
177c712
minor documentation update
mpenkov Jan 27, 2019
bd83886
flake8-docs
mpenkov Jan 27, 2019
11fabca
more doco fixes
mpenkov Jan 27, 2019
2d490f0
fix example
mpenkov Jan 27, 2019
9b6f8bb
git rm docs/fasttext-notes.md
mpenkov Jan 27, 2019
9b5e161
review response: include _fasttext_bin in docs
mpenkov Jan 27, 2019
6aa013a
review response: make examples more readable
mpenkov Jan 27, 2019
7d2b562
review response: remove blank line
mpenkov Jan 27, 2019
25b24c7
review response: add emphasis
mpenkov Jan 27, 2019
b4e8405
review response: add comment
mpenkov Jan 27, 2019
1fc9bf2
review response: add example
mpenkov Jan 27, 2019
72ec312
review response: remove redundant line
mpenkov Jan 27, 2019
29c4faf
review response: update comment
mpenkov Jan 28, 2019
74410fc
Update gensim/models/fasttext.py
piskvorky Jan 28, 2019
a3456a4
review response: improve examples
mpenkov Jan 28, 2019
96eab08
clarify example
mpenkov Jan 28, 2019
ff72185
review response: improve example
mpenkov Jan 28, 2019
9140cf6
review response: improve tokenization in example
mpenkov Jan 28, 2019
31c79c3
flake8
mpenkov Jan 29, 2019
2f479ca
fix long lines
mpenkov Jan 29, 2019
c48a7f3
fixup: use correct parameter name
mpenkov Jan 29, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions gensim/models/_fasttext_bin.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
# -*- coding: utf-8 -*-
"""Load models from the native binary format released by Facebook.

The main entry point is the :py:func:`load` function.
It returns a :py:class:`Model` namedtuple containing everything loaded from the binary.

Examples
--------

Expand Down
120 changes: 113 additions & 7 deletions gensim/models/fasttext.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,9 @@
This module contains a fast native C implementation of Fasttext with Python interfaces. It is **not** only a wrapper
around Facebook's implementation.

This module supports loading models trained with Facebook's fastText implementation.
It also supports continuing training from such models.

For a tutorial see `this notebook
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb>`_.

Expand All @@ -31,6 +34,15 @@
>>> from gensim.models import FastText
>>>
>>> model = FastText(common_texts, size=4, window=3, min_count=1, iter=10)
>>> sentences = [
... ['computer', 'artificial', 'intelligence'],
... ['artificial', 'trees'],
... ['human', 'intelligence'],
... ['artificial', 'graph'],
... ['intelligence'],
... ['artificial', 'intelligence', 'system']
... ]
>>> model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this model.epochs coming from? The model instantiation above shows no such variable.

Copy link
Collaborator Author

@mpenkov mpenkov Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the epochs parameter is optional, so not included during instantiation.

Unfortunately, there is also some confusion about its name: the FastText constructor uses iter to specify the number of epochs, whereas the superclass uses the proper name epochs.

The presence of the epochs parameter to the train function (which seems to override the one set in the constructor) also complicates matters.

Copy link
Owner

@piskvorky piskvorky Jan 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. If it's optional, let's not use it in train. Or if we use it in train, let's instantiate it explicitly. This neither-here-nor-there example is confusing ("where is this value come from?").

Regarding iter / epochs -- can you please rename it to epochs, consistently? I remember some discussion around this (cc @menshikh-iv @gojomo ), but can't imagine why we'd want both. At most we could support iter for a while as an alias, but with a clear deprecation warning.

This is a perfect opportunity to clean up some of the API mess, rather than piling on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree regarding the cleanup. My preference would be to leave epochs/iter out of the constructor. The model doesn't need that parameter until training time.

Copy link
Owner

@piskvorky piskvorky Jan 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Models in Gensim generally allow the trained_model = Constructor(params_including_training_params) pattern. So breaking that could be confusing to existing users (and a big change to backward incompatibility).

I'm not totally opposed though, especially if we still allow ctr params for a while with "deprecated" warnings. The API needs a clean up, and now is a good time.

Not a big priority though, and the documentation examples can already promote the instantiate then train, as 2 steps pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just training parameters you need to include in the constructor. It's also parameters for vocabulary creation. So you're managing at least 3 sets of separate parameters, 2 of which are duplicated by other methods of the class.

Copy link
Owner

@piskvorky piskvorky Jan 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should promote them as separate steps in docs. Question is, do we deprecate (certainly not remove) them from ctr?

Copy link
Collaborator Author

@mpenkov mpenkov Jan 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your motivation in not removing them (backward compatibility). Unfortunately, the current mess won't go away until we remove things like this.

I think the first step should be to deprecate them. After a while, we can remove them, perhaps in time for a major release.

If we want a one-liner way to instantiate and train, we can always write a pure function and promote that. That should make it easier for users to cut over to the cleaner API.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, deprecation is what I suggest.


Persist a model to disk with:

Expand All @@ -41,7 +53,49 @@
>>> fname = get_tmpfile("fasttext.model")
>>>
>>> model.save(fname)
>>> model = FastText.load(fname) # you can continue training with the loaded model!
>>> model = FastText.load(fname)

Once loaded, such models behave identically to those created from scratch.
For example, you can continue training the loaded model:

>>> new_sentences = [
... ['sweet', 'child', 'of', 'mine'],
... ['rocket', 'queen'],
... ['you', 'could', 'be', 'mine'],
... ['november', 'rain'],
... ]
>>> 'rocket' in model.wv
False
>>> model.train(new_sentences, total_examples=len(sentences), epochs=model.epochs)
Copy link
Owner

@piskvorky piskvorky Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why total_example=len(sentences)? Even if correct, it looks strange… would be my first question as a user.
Let's provide the answer here in the docs why total_examples is needed, what is its role.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's correct. I agree that is confusing. The docstring for the train function attempts to clarify the situation.

Personally, I think if neither total_examples and total_words are specified, we should try to determine sensible defaults by looking at e.g. len(sentences). WDYT @menshikh-iv ?

Copy link
Owner

@piskvorky piskvorky Jan 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I read the linked docs and still don't get why it's not len(new_sentences).

Please include some top-level intuition here. A short sentence on why this parameter is mandatory, and what should be its value, because it looks really strange and superfluous. +1 for sensible defaults.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For train() to manage alpha correctly, and show meaningful progress estimates, it needs a good estimate of the size of the supplied-corpus - even when the corpus (as an iterable) may not self-report its length. In the typical case where the same corpus was just surveyed for its vocabulary, this value should be handy. (In the Word2Vec case, the count from the vocab-scan is cached inside the model for later consultation – unsure if the FT paths do this.) In the case where other/new data is being supplied, the caller should supply the right counts for the current data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo If the corpus does self-report its length though, should we use that instead? If yes, which should we do:

  1. total_examples = len(corpus)
  2. total_words = len(corpus)

If the corpus does not self-report its length, then we could raise an exception with a heplful message. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's able to self-report its length, in count of texts, then yes, that would work as total_examples. But because streaming a corpus from an iterable is the motivating case for this interface, relying on that inside the method seems inappropriate to me. Leaving it the responsibility of the caller seems fine to me – and if they're lucky enough to have a corpus that reports its own length, they can supply it easily.

Copy link
Collaborator Author

@mpenkov mpenkov Jan 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, while we're talking about simplifying the API, what do you think about removing the sentences and corpus_file parameters from the constructor? Currently, we have an inconsistency: in the constructor, we just pass sentences/corpus_file without total_examples and total_words parameters. In the train function, we include those additional parameters.

Instead of passing sentences in the constructor, the user can pass them in separately via the train function.

Pros:

  • Simpler constructor. There will be less parameters.
  • Easier to understand and consistent API.

Cons:

  • Getting a trained model now takes two steps instead of one (instantiate, train). Not sure how much of a con this is given that the train function is more powerful than the constructor anyway - it has parameters that the constructor doesn't.

@menshikh-iv @piskvorky @gojomo What do you think?

The same thing also applies to the callbacks parameters.

Copy link
Collaborator Author

@mpenkov mpenkov Jan 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But because streaming a corpus from an iterable is the motivating case for this interface, relying on that inside the method seems inappropriate to me.

@gojomo Why do you think it is inappropriate? We could do something like this:

if total_examples or total_words:
    pass  # nothing to do here
elif sentences and hasattr(sentences, '__len__'):  # could also check for callable, if necessary
    total_examples = len(sentences)
elif data_corpus and hasattr(data_corpus, '__len__'):
    total_examples = len(data_corpus)
else:
    raise ValueError(
        'unable to infer total_examples or total_words from the training source, '
        'please pass one of them explicitly'
    )

It looks ugly, but it allows to user to do something like:

model.train(sentences)

instead of

model.train(sentences, num_examples=len(sentences))

I feel the former is more Pythonic.

Finally, I think having two separate keyword parameters for the input is confusing for the user. In my opinion, it would look a lot simpler if we unified the two parameters, and dealt with untangling them in the implementation.

Copy link
Owner

@piskvorky piskvorky Jan 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov yes, I'd consider that "sensible" defaults. Thanks.

Agreed on unifying iter/epoch. How would that work? Keep both as acceptable input (internally unify to one), no deprecations? Deprecate one? Which one?

Copy link
Collaborator Author

@mpenkov mpenkov Jan 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky OK. I think it's worth dealing with API refactoring in a separate PR, for two reasons:

  1. Such changes risk introducing regressions, and I'd rather not have them together with the relatively safe documentation changes.
  2. We've already spent much effort refactoring FastText, and it may be prudent to wait for things to stabilize (e.g. fix introduced regressions) before proceeding.

To answer your question, I think it makes sense to deprecate iter from the constructor. It's a poor name for a parameter, for three reasons:

  1. It masks the built-in iter function.
  2. It isn't obvious.
  3. It's inconsistent. We use epochs everywhere else outside of the constructor.

>>> 'rocket' in model.wv
True

You can also load models trained with Facebook's fastText implementation:

.. sourcecode:: pycon

>>> from gensim.test.utils import datapath
>>> cap_path = datapath("crime-and-punishment.bin")
>>> # Partial model: loads quickly, uses less RAM, but cannot continue training
>>> fb_partial = FastText.load_fasttext_format(cap_path, full_model=False)
>>> # Full model: loads slowly, consumes RAM, but can continue training (see below)
>>> fb_full = FastText.load_fasttext_format(cap_path, full_model=True)

Once loaded, such models behave identically to those trained from scratch.
You may continue training them on new data:

.. sourcecode:: pycon

>>> 'computer' in fb_full.wv.vocab # New word, currently out of vocab
False
>>> 'rocket' in fb_full.wv.vocab
False
>>> fb_full.train(sentences, total_examples=len(sentences), epochs=model.epochs)
>>> fb_full.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
piskvorky marked this conversation as resolved.
Show resolved Hide resolved
>>> 'computer' in fb_full.wv.vocab # We have learned this word now
True
>>> 'rocket' in fb_full.wv.vocab
True

Retrieve word-vector for vocab and out-of-vocab word:

Expand Down Expand Up @@ -85,6 +139,28 @@

>>> analogies_result = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

Implementation Notes
--------------------

These notes may help developers navigate our fastText implementation.
Our FastText implementation is split across several submodules:

- :py:mod:`gensim.models.fasttext`: This module. Contains FastText-specific functionality only.
- :py:mod:`gensim.models.keyedvectors`: Implements both generic and FastText-specific functionality.
- :py:mod:`gensim.models.word2vec`:
piskvorky marked this conversation as resolved.
Show resolved Hide resolved
- :py:mod:`gensim.models.base_any2vec`:
- :py:mod:`gensim.models.utils_any2vec`: Wrapper over Cython extensions.

Our implementation relies heavily on inheritance.
It consists of several important classes:

- :py:class:`FastTextVocab`: the vocabulary. Redundant, simply wraps its superclass.
mpenkov marked this conversation as resolved.
Show resolved Hide resolved
- :py:class:`~gensim.models.keyedvectors.FastTextKeyedVectors`: the vectors.
Once training is complete, this class is sufficient for calculating embeddings.
- :py:class:`FastTextTrainables`: the underlying neural network. The implementation
uses this class to *learn* the word embeddings.
- :py:class:`FastText`: ties everything together.

"""

import logging
Expand Down Expand Up @@ -759,7 +835,8 @@ def load_fasttext_format(cls, model_file, encoding='utf8'):

Notes
------
Due to limitations in the FastText API, you cannot continue training with a model loaded this way.
This function effectively ignores `.vec` output file.
piskvorky marked this conversation as resolved.
Show resolved Hide resolved
It only needs the `.bin` file.

Parameters
----------
Expand All @@ -773,7 +850,7 @@ def load_fasttext_format(cls, model_file, encoding='utf8'):

Returns
-------
:class: `~gensim.models.fasttext.FastText`
gensim.models.fasttext.FastText
The loaded model.

"""
Expand Down Expand Up @@ -862,15 +939,44 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=None, case_inse
return self.wv.accuracy(questions, restrict_vocab, most_similar, case_insensitive)


#
# Keep for backward compatibility.
#
class FastTextVocab(Word2VecVocab):
"""This is a redundant class. It exists only to maintain backwards compatibility
with older gensim versions."""
pass


class FastTextTrainables(Word2VecTrainables):
"""Represents the inner shallow neural network used to train :class:`~gensim.models.fasttext.FastText`."""
"""Represents the inner shallow neural network used to train :class:`~gensim.models.fasttext.FastText`.

Mostly inherits from its parent (:py:class:`gensim.models.word2vec.Word2VecTrainables`).
Adds logic for calculating and maintaining ngram weights.

Attributes
----------

menshikh-iv marked this conversation as resolved.
Show resolved Hide resolved
hashfxn : function
Used for randomly initializing weights. Defaults to the built-in hash()
piskvorky marked this conversation as resolved.
Show resolved Hide resolved
layer1_size : int
The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the :py:class:`gensim.models.word2vec.Word2VecTrainables` constructor.
seed : float
The random generator seed used in reset_weights and update_weights
syn1 : numpy.array
The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical sampling is used.
syn1neg : numpy.array
Similar to syn1, but only set if negative sampling is used.
vectors_lockf : numpy.array
A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
vectors_vocab_lockf : numpy.array
Similar to vectors_vocab_lockf, ones(len(model.trainables.vectors), dtype=REAL)
vectors_ngrams_lockf : numpy.array
np.ones((self.bucket, wv.vector_size), dtype=REAL)

Notes
-----

The lockf stuff looks like it gets used by the fast C implementation.
piskvorky marked this conversation as resolved.
Show resolved Hide resolved

"""
def __init__(self, vector_size=100, seed=1, hashfxn=hash, bucket=2000000):
super(FastTextTrainables, self).__init__(
vector_size=vector_size, seed=seed, hashfxn=hashfxn)
Expand Down
10 changes: 8 additions & 2 deletions gensim/models/keyedvectors.py
Original file line number Diff line number Diff line change
Expand Up @@ -1936,14 +1936,19 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
If True, uses the Facebook-compatible hash function instead of the
Gensim backwards-compatible hash function.

Some important attributes:
menshikh-iv marked this conversation as resolved.
Show resolved Hide resolved

Attributes
----------
vectors_vocab : np.array
A vector for each entity in the vocabulary.
Each row corresponds to a vector for an entity in the vocabulary.
Columns correspond to vector dimensions.
vectors_vocab_norm : np.array
Same as vectors_vocab, but the vectors are L2 normalized.
vectors_ngrams : np.array
A vector for each ngram across all entities in the vocabulary.
Each row is a vector that corresponds to a bucket.
Columns correspond to vector dimensions.
vectors_ngrams_norm : np.array
Same as vectors_ngrams, but the vectors are L2 normalized.
Under some conditions, may actually be the same matrix as
Expand All @@ -1957,7 +1962,8 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
bucket to an index, and then indexing into vectors_ngrams (in other
words, vectors_ngrams[hash2index[hash_fn(ngram) % bucket]].
num_ngram_vectors : int
TODO
The number of vectors that correspond to ngrams, as opposed to terms
(full words).

"""
def __init__(self, vector_size, min_n, max_n, bucket, compatible_hash):
Expand Down