KeyedVectors & *2Vec API streamlining, consistency #2698

Merged: 64 commits, Jul 19, 2020
Commits
7e642a2
slim low-value warnings
gojomo Dec 5, 2019
b8de987
clarify vectors/vectors_vocab relationship; fix lockf & nonsense ngra…
gojomo Dec 5, 2019
38343d6
mv FT, KV tests to right place
gojomo Dec 6, 2019
a255e8c
rm deprecations, obsolete refs/tests, delete_temporary_training_data,…
gojomo Dec 5, 2019
4e334c1
update usages, tests, flake8 cleanup
gojomo Dec 7, 2019
a16cec5
expand KeyedVectors to obviate Doc2VecKeyedVectors; upconvert old off…
gojomo Dec 12, 2019
d4267f8
fix docstring warnings; update usages
gojomo Dec 12, 2019
f6e7aa6
rm unused old plain-python codepaths
gojomo Dec 13, 2019
470b119
unify class comments under __init__ for consistency w/ api doc present…
gojomo Dec 14, 2019
cd02b8b
name/comment harmonization (rm 'entity', lessen 'word'-centricity)
gojomo Dec 17, 2019
0c77ae4
table formatting
gojomo Dec 17, 2019
cfa723d
return pyemd to linux test env
gojomo Dec 17, 2019
a4f7b77
split backcompat tests for better resolution
gojomo Dec 18, 2019
4412696
convert Vocab & related data items to use dataclasses
gojomo Dec 18, 2019
65c2b2d
rm obsolete Vocab/Trainable/abstract/Wrapper classes, persistent call…
gojomo Dec 18, 2019
1d0f52f
tune tests for stability, runtimes; rm auto reruns that hide flakiness
gojomo Jan 15, 2020
8123596
fix numpy FutureWarning: arrays to stack must be sequence
gojomo Dec 26, 2019
c5efb24
(commented-out) deoptimization option
gojomo Jan 22, 2020
2c234dd
stronger FB model testing; no _unpack_copy test
gojomo Jan 22, 2020
9910404
merge redundant methods; rm duplicated imports/defs
gojomo Jan 22, 2020
658813f
rationalize _lockf, buckets_word behaviors
gojomo Jan 22, 2020
3cdb1d6
rename .docvecs to .dv
gojomo Jan 24, 2020
10d9f55
update usages; rm obsolete tests; restore gensim.utils import
gojomo Jan 28, 2020
79af68e
intensify FT tests (more epochs, more buckets)
gojomo May 12, 2020
8875d8b
flake8-3.8.0 style fixes - but also pin flake8-3.7.9 vs 3.8.0 'output…
gojomo May 12, 2020
4b7566e
replace vectors_norm with 1d norms
gojomo May 12, 2020
1baab2a
tighten testParallel
gojomo May 13, 2020
8d2f1fe
rm .vocab & 'Vocab' classes; add expandable 'vecattrs'
gojomo May 14, 2020
fc65525
update usages (no vocabs)
gojomo May 15, 2020
4657b14
enable running inside '-m mtprof' (or cProfile) via explicit unittest…
gojomo May 15, 2020
b5ff29b
faster sample_int reads
gojomo May 15, 2020
098119b
load_word2vec_format(.., no_header=True) to support GLoVe text vectors
gojomo May 19, 2020
318a858
refactor & comment lockf feature; allow single-element lockf
gojomo May 26, 2020
fe3ae31
improve FT comment
gojomo May 26, 2020
d503205
rm deprecated/unneeded init_sims calls
gojomo May 26, 2020
679dde9
Merge branch 'develop' into kv_cleanup
piskvorky Jul 5, 2020
411473b
fixes to code style
piskvorky Jul 6, 2020
45fd5f6
flake8: fix overlong lines
piskvorky Jul 6, 2020
5acc5f5
Merge branch 'develop' into kv_cleanup
gojomo Jul 6, 2020
5764f8c
rm stray merge error
gojomo Jul 6, 2020
e49ae4c
rm duplicated , old nonstandard hash workarounds
gojomo Jul 6, 2020
278c2bd
use numpy-recommended PRNG constructor
gojomo Jul 6, 2020
5c7eb1c
add sg to FastTextConfig & consult it; rm remaining broken-hash cruft
gojomo Jul 6, 2020
23805d1
reorg conditional packages for clarity
gojomo Jul 6, 2020
f5b902c
comments, names, refactoring, randomization
gojomo Jul 7, 2020
7b571b2
Apply suggestions from code review
gojomo Jul 7, 2020
87860c5
fix cruft left from suggestion
gojomo Jul 7, 2020
39fe128
fix numpy-32bit-on-Windows; executable docs
gojomo Jul 7, 2020
15152ff
mv lee_corpus to utils; cleanup
gojomo Jul 7, 2020
3d424a2
update poincare for latest KV __init__ signature
gojomo Jul 7, 2020
99f7009
restore word_vec method for proper overriding, but rm usages
gojomo Jul 7, 2020
2bb8abf
Apply suggestions from code review
gojomo Jul 7, 2020
33c6508
adjust testParallel against failure risk
gojomo Jul 8, 2020
8f17d6d
merge ~piskvorky's /pull/10 cleanups
gojomo Jul 10, 2020
cb33e46
intensify training for an occasionally failing test
gojomo Jul 11, 2020
581ef06
clarify word/char ngrams handling; rm outdated comments
gojomo Jul 14, 2020
9f21cba
mostly avoid duplicating FastTextConfig fields into locals
gojomo Jul 16, 2020
d912616
avoid copies/pointers for no-bucket (FT as W2V) case
gojomo Jul 16, 2020
583bbe6
rm obsolete test (already skipped & somewhat originally misguided)
gojomo Jul 16, 2020
0330cfc
simpler/faster .get(..., default) (avoids exception-catching in has_i…
gojomo Jul 16, 2020
9caf217
add default option to get_index; avoid exception in has_index_for
gojomo Jul 16, 2020
14dd9f5
chained range check
gojomo Jul 16, 2020
8674949
Merge branch 'develop' into kv_cleanup
mpenkov Jul 19, 2020
0d2679a
Update CHANGELOG.md
mpenkov Jul 19, 2020
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -3,8 +3,11 @@ Changes

## Unreleased

This release contains a major refactoring.
mpenkov marked this conversation as resolved.

### :+1: Improvements

* KeyedVectors & X2Vec API streamlining, consistency (PR [#2698](https://github.com/RaRe-Technologies/gensim/pull/2698), __[@gojomo](https://github.com/gojomo)__)
* No more wheels for x32 platforms (if you need x32 binaries, please build them yourself).
(__[menshikh-iv](https://github.com/menshikh-iv)__, [#6](https://github.com/RaRe-Technologies/gensim-wheels/pull/6))
* Speed up random number generation in word2vec model (PR [#2864](https://github.com/RaRe-Technologies/gensim/pull/2864), __[@zygm0nt](https://github.com/zygm0nt)__)
2 changes: 0 additions & 2 deletions MANIFEST.in
@@ -28,8 +28,6 @@ include gensim/models/fasttext_inner.pxd
include gensim/models/fasttext_corpusfile.cpp
include gensim/models/fasttext_corpusfile.pyx

include gensim/models/_utils_any2vec.c
include gensim/models/_utils_any2vec.pyx
include gensim/corpora/_mmreader.c
include gensim/corpora/_mmreader.pyx
include gensim/_matutils.c
9 changes: 0 additions & 9 deletions docs/src/apiref.rst
@@ -53,8 +53,6 @@ Modules:
models/coherencemodel
models/basemodel
models/callbacks
models/utils_any2vec
models/_utils_any2vec
models/word2vec_inner
models/doc2vec_inner
models/fasttext_inner
@@ -63,13 +61,6 @@ Modules:
models/wrappers/ldavowpalwabbit.rst
models/wrappers/wordrank
models/wrappers/varembed
models/wrappers/fasttext
models/deprecated/doc2vec
models/deprecated/fasttext
models/deprecated/word2vec
models/deprecated/keyedvectors
models/deprecated/fasttext_wrapper
models/base_any2vec
similarities/docsim
similarities/termsim
similarities/index
2 changes: 1 addition & 1 deletion docs/src/auto_examples/tutorials/run_fasttext.rst
@@ -479,7 +479,7 @@ The example training corpus is a toy corpus, results are not expected to be good
.. code-block:: none

/Volumes/work/workspace/gensim_misha/gensim/models/keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)
vectors = vstack(self.get_vector(word, use_norm=True) for word in used_words).astype(REAL)
'breakfast'


2 changes: 1 addition & 1 deletion docs/src/auto_examples/tutorials/run_word2vec.rst
@@ -308,7 +308,7 @@ Which of the below does not belong in the sequence?
.. code-block:: none

/home/misha/git/gensim/gensim/models/keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)
vectors = vstack(self.get_vector(word, use_norm=True) for word in used_words).astype(REAL)
car
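
Note on the FutureWarning quoted in this output: it is triggered by handing `vstack` a generator rather than a list or tuple. The sketch below shows the sequence-based form the warning asks for. It assumes NumPy is installed; `stack_rows` is a hypothetical helper written for illustration and is not part of gensim.

```python
import numpy as np


def stack_rows(rows):
    """Stack an iterable of 1-D vectors into a 2-D float32 array.

    Hypothetical helper: the iterable is materialized into a list first,
    because np.vstack expects a sequence, not a generator.
    """
    return np.vstack(list(rows)).astype(np.float32)


# A generator of four length-3 vectors, turned into a (4, 3) matrix.
vectors = stack_rows(np.full(3, i) for i in range(4))
print(vectors.shape)
```

Materializing the generator costs one extra list of row references, which is small next to the stacked array itself.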


9 changes: 0 additions & 9 deletions docs/src/models/_utils_any2vec.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/models/base_any2vec.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/deprecated/doc2vec.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/models/deprecated/fasttext.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/models/deprecated/fasttext_wrapper.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/deprecated/keyedvectors.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/deprecated/word2vec.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/utils_any2vec.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/wrappers/fasttext.rst

This file was deleted.

82 changes: 41 additions & 41 deletions gensim/corpora/sharded_corpus.py
@@ -22,10 +22,10 @@
import logging
import os
import math
import numpy
import scipy.sparse as sparse
import time

import numpy
import scipy.sparse as sparse
from six.moves import range

import gensim
@@ -263,9 +263,7 @@ def init_shards(self, output_prefix, corpus, shardsize=4096, dtype=_default_dtyp

is_corpus, corpus = gensim.utils.is_corpus(corpus)
if not is_corpus:
raise ValueError(
"Cannot initialize shards without a corpus to read from! (Got corpus type: {0})".format(type(corpus))
)
raise ValueError("Cannot initialize shards without a corpus to read from! Corpus type: %s" % type(corpus))

proposed_dim = self._guess_n_features(corpus)
if proposed_dim != self.dim:
@@ -360,7 +358,7 @@ def load_shard(self, n):

filename = self._shard_name(n)
if not os.path.isfile(filename):
raise ValueError('Attempting to load nonexistent shard no. {0}'.format(n))
raise ValueError('Attempting to load nonexistent shard no. %s' % n)
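
Note on the `.format()` to `%` conversions in this file: for a scalar argument the two spellings produce the same message. One difference worth keeping in mind is that a bare tuple on the right of `%` is unpacked as multiple arguments. A minimal sketch, assuming `n` is an integer shard number as in `load_shard`; `safe_message` is a hypothetical helper, not gensim code.

```python
n = 7

# Both spellings yield the same text when the argument is a scalar.
old = 'Attempting to load nonexistent shard no. {0}'.format(n)
new = 'Attempting to load nonexistent shard no. %s' % n
assert old == new


def safe_message(value):
    """Format any value, including a tuple, without unpacking it.

    Hypothetical helper: wrapping the value in a 1-tuple keeps `%` from
    treating a tuple value as several separate arguments.
    """
    return 'Attempting to load nonexistent shard no. %s' % (value,)


print(safe_message(n))
print(safe_message((1, 2)))
```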
shard = gensim.utils.unpickle(filename)

self.current_shard = shard
@@ -387,11 +385,9 @@ def shard_by_offset(self, offset):
"""
k = int(offset / self.shardsize)
if offset >= self.n_docs:
raise ValueError('Too high offset specified ({0}), available '
'docs: {1}'.format(offset, self.n_docs))
raise ValueError('Too high offset specified (%s), available docs: %s' % (offset, self.n_docs))
if offset < 0:
raise ValueError('Negative offset {0} currently not'
' supported.'.format(offset))
raise ValueError('Negative offset %s currently not supported.' % offset)
return k
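
Note on `shard_by_offset`: the lookup is an integer division of the document offset by the shard size, guarded by two range checks. The standalone sketch below mirrors that logic under the assumption that all shards except possibly the last hold exactly `shardsize` documents. It is a free function for illustration, not the method itself, and it uses `//` where the method uses `int(offset / self.shardsize)`.

```python
def shard_by_offset(offset, shardsize, n_docs):
    """Return the index of the shard that holds document `offset`.

    Illustrative free-function version; the real method reads
    `shardsize` and `n_docs` from the corpus object.
    """
    if offset >= n_docs:
        raise ValueError('Too high offset specified (%s), available docs: %s' % (offset, n_docs))
    if offset < 0:
        raise ValueError('Negative offset %s currently not supported.' % offset)
    # Floor division equals int(offset / shardsize) for non-negative ints.
    return offset // shardsize


# With 10000 documents in shards of 4096, offsets map to shards 0, 1 and 2.
for probe in (0, 4095, 4096, 9999):
    print(probe, shard_by_offset(probe, 4096, 10000))
```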

def in_current(self, offset):
@@ -411,7 +407,7 @@ def in_next(self, offset):
"""
if self.current_shard_n == self.n_shards:
return False # There's no next shard.
return (self.offsets[self.current_shard_n + 1] <= offset) and (offset < self.offsets[self.current_shard_n + 2])
return self.offsets[self.current_shard_n + 1] <= offset and offset < self.offsets[self.current_shard_n + 2]
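
Note on the rewritten range test: dropping the parentheses leaves the behavior unchanged, and the same half-open check could also be spelled as a chained comparison. A minimal sketch; `in_half_open_range` and the sample bounds are hypothetical and only demonstrate the equivalence.

```python
def in_half_open_range(lo, offset, hi):
    """Return True when lo <= offset < hi, using a chained comparison."""
    return lo <= offset < hi


# Hypothetical bounds of the "next" shard: documents 4 through 7.
lo, hi = 4, 8
for offset in range(0, 12):
    explicit = lo <= offset and offset < hi
    assert in_half_open_range(lo, offset, hi) == explicit
```

The chained form evaluates `offset` once, which matters only when the middle operand is an expensive expression.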

def resize_shards(self, shardsize):
"""
@@ -440,9 +436,8 @@ def resize_shards(self, shardsize):
if new_stop > self.n_docs:
# Sanity check
assert new_shard_idx == n_new_shards - 1, \
'Shard no. {0} that ends at {1} over last document' \
' ({2}) is not the last projected shard ({3})???' \
''.format(new_shard_idx, new_stop, self.n_docs, n_new_shards)
'Shard no. %r that ends at %r over last document (%r) is not the last projected shard (%r)' % (
new_shard_idx, new_stop, self.n_docs, n_new_shards)
new_stop = self.n_docs

new_shard = self[new_start:new_stop]
@@ -466,9 +461,9 @@ def resize_shards(self, shardsize):
for old_shard_n, old_shard_name in enumerate(old_shard_names):
os.remove(old_shard_name)
except Exception as e:
logger.error(
'Exception occurred during old shard no. %d removal: %s.\nAttempting to at least move new shards in.',
old_shard_n, str(e)
logger.exception(
'Error during old shard no. %d removal: %s.\nAttempting to at least move new shards in.',
old_shard_n, str(e),
)
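
Note on switching `logger.error` to `logger.exception`: both log at ERROR level, but `logger.exception` also appends the active traceback when called from an `except` block. A minimal sketch using only the standard library; the logger name `shard_demo` and the failing `remove_shard` function are hypothetical stand-ins for the shard cleanup code.

```python
import io
import logging

# Capture log output in memory so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("shard_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False


def remove_shard(n):
    """Hypothetical stand-in for os.remove() failing on a shard file."""
    raise OSError("shard %d is locked" % n)


try:
    remove_shard(3)
except Exception as e:
    # Logs the message at ERROR level and appends the traceback.
    logger.exception("Error during old shard no. %d removal: %s.", 3, e)

log_text = stream.getvalue()
print(log_text)
```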
finally:
# If something happens with cleaning up - try to at least get the
@@ -479,7 +474,7 @@ def resize_shards(self, shardsize):
# If something happens when we're in this stage, we're screwed.
except Exception as e:
logger.exception(e)
raise RuntimeError('Resizing completely failed for some reason. Sorry, dataset is probably ruined...')
raise RuntimeError('Resizing completely failed. Sorry, dataset is probably ruined...')
finally:
# Sets the new shard stats.
self.n_shards = n_new_shards
@@ -524,19 +519,18 @@ def _guess_n_features(self, corpus):
else:
if not self.dim:
raise TypeError(
"Couldn't find number of features, refusing to guess "
"(dimension set to {0}, type of corpus: {1})."
.format(self.dim, type(corpus))
"Couldn't find number of features, refusing to guess. Dimension: %s, corpus: %s)" % (
self.dim, type(corpus),
)
)
else:
logger.warning("Couldn't find number of features, trusting supplied dimension (%d)", self.dim)
n_features = self.dim
logger.warning("Couldn't find number of features, trusting supplied dimension (%d)", self.dim)
n_features = self.dim

if self.dim and n_features != self.dim:
logger.warning(
"Discovered inconsistent dataset dim (%d) and feature count from corpus (%d). "
"Coercing to dimension given by argument.",
self.dim, n_features
self.dim, n_features,
)

return n_features
@@ -591,7 +585,7 @@ def __getitem__(self, offset):
start = offset.start
stop = offset.stop
if stop > self.n_docs:
raise IndexError('Requested slice offset {0} out of range ({1} docs)'.format(stop, self.n_docs))
raise IndexError('Requested slice offset %s out of range (%s docs)' % (stop, self.n_docs))

# - get range of shards over which to iterate
first_shard = self.shard_by_offset(start)
@@ -674,21 +668,23 @@ def __getitem__(self, offset):

def __add_to_slice(self, s_result, result_start, result_stop, start, stop):
"""
Add the rows of the current shard from `start` to `stop`
Add rows of the current shard from `start` to `stop`
into rows `result_start` to `result_stop` of `s_result`.

Operation is based on the self.sparse_serialize setting. If the shard
Operation is based on the ``self.sparse_serialize`` setting. If the shard
contents are dense, then s_result is assumed to be an ndarray that
already supports row indices `result_start:result_stop`. If the shard
contents are sparse, assumes that s_result has `result_start` rows
and we should add them up to `result_stop`.

Returns the resulting s_result.
Return the resulting ``s_result``.

"""
if (result_stop - result_start) != (stop - start):
raise ValueError(
'Result start/stop range different than stop/start range (%d - %d vs. %d - %d)'
% (result_start, result_stop, start, stop)
'Result start/stop range different than stop/start range (%s - %s vs. %s - %s)' % (
result_start, result_stop, start, stop,
)
)

# Dense data: just copy using numpy's slice notation
@@ -699,16 +695,16 @@ def __add_to_slice(self, s_result, result_start, result_stop, start, stop):

# A bit more difficult, we're using a different structure to build the
# result.
else:
if s_result.shape != (result_start, self.dim):
raise ValueError(
'Assuption about sparse s_result shape invalid: {0} expected rows, {1} real rows.'
.format(result_start, s_result.shape[0])
if s_result.shape != (result_start, self.dim):
raise ValueError(
'Assuption about sparse s_result shape invalid: %s expected rows, %s real rows.' % (
result_start, s_result.shape[0],
)
)

tmp_matrix = self.current_shard[start:stop]
s_result = sparse.vstack([s_result, tmp_matrix])
return s_result
tmp_matrix = self.current_shard[start:stop]
s_result = sparse.vstack([s_result, tmp_matrix])
return s_result

def _getitem_format(self, s_result):
if self.sparse_serialization:
@@ -817,5 +813,9 @@ def serialize(serializer, fname, corpus, id2word=None, index_fname=None, progres

Ignore the parameters id2word, index_fname, progress_cnt, labels
and metadata. They currently do nothing and are here only to
provide a compatible method signature with superclass."""
serializer.save_corpus(fname, corpus, id2word=id2word, progress_cnt=progress_cnt, metadata=metadata, **kwargs)
provide a compatible method signature with superclass.

"""
serializer.save_corpus(
fname, corpus, id2word=id2word, progress_cnt=progress_cnt, metadata=metadata, **kwargs,
)
3 changes: 1 addition & 2 deletions gensim/models/__init__.py
@@ -13,7 +13,7 @@
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec # noqa:F401
from .doc2vec import Doc2Vec # noqa:F401
from .keyedvectors import KeyedVectors, WordEmbeddingSimilarityIndex # noqa:F401
from .keyedvectors import KeyedVectors # noqa:F401
from .ldamulticore import LdaMulticore # noqa:F401
from .phrases import Phrases # noqa:F401
from .normmodel import NormModel # noqa:F401
@@ -23,7 +23,6 @@
from .translation_matrix import TranslationMatrix, BackMappingTranslationMatrix # noqa:F401

from . import wrappers # noqa:F401
from . import deprecated # noqa:F401

from gensim import interfaces, utils
