KeyedVectors & *2Vec API streamlining, consistency #2698

Merged 64 commits on Jul 19, 2020
Changes from 35 commits
7e642a2
slim low-value warnings
gojomo Dec 5, 2019
b8de987
clarify vectors/vectors_vocab relationship; fix lockf & nonsense ngra…
gojomo Dec 5, 2019
38343d6
mv FT, KV tests to right place
gojomo Dec 6, 2019
a255e8c
rm deprecations, obsolete refs/tests, delete_temporary_training_data,…
gojomo Dec 5, 2019
4e334c1
update usages, tests, flake8 cleanup
gojomo Dec 7, 2019
a16cec5
expand KeyedVectors to obviate Doc2VecKeyedVectors; upconvert old off…
gojomo Dec 12, 2019
d4267f8
fix docstring warnings; update usages
gojomo Dec 12, 2019
f6e7aa6
rm unused old plain-python codepaths
gojomo Dec 13, 2019
470b119
unify class comments under __init__ for consistncy w/ api doc present…
gojomo Dec 14, 2019
cd02b8b
name/comment harmonization (rm 'entity', lessen 'word'-centricity)
gojomo Dec 17, 2019
0c77ae4
table formatting
gojomo Dec 17, 2019
cfa723d
return pyemd to linux test env
gojomo Dec 17, 2019
a4f7b77
split backcompat tests for better resolution
gojomo Dec 18, 2019
4412696
convert Vocab & related data items to use dataclasses
gojomo Dec 18, 2019
65c2b2d
rm obsolete Vocab/Trainable/abstract/Wrapper classes, persistent call…
gojomo Dec 18, 2019
1d0f52f
tune tests for stability, runtimes; rm auto reruns that hide flakiness
gojomo Jan 15, 2020
8123596
fix numpy FutureWarning: arrays to stack must be sequence
gojomo Dec 26, 2019
c5efb24
(commented-out) deoptimization option
gojomo Jan 22, 2020
2c234dd
stronger FB model testing; no _unpack_copy test
gojomo Jan 22, 2020
9910404
merge redundant methods; rm duplicated imports/defs
gojomo Jan 22, 2020
658813f
rationalize _lockf, buckets_word behaviors
gojomo Jan 22, 2020
3cdb1d6
rename .docvecs to .dv
gojomo Jan 24, 2020
10d9f55
update usages; rm obsolete tests; restore gensim.utils import
gojomo Jan 28, 2020
79af68e
intensify FT tests (more epochs, more buckets)
gojomo May 12, 2020
8875d8b
flake8-3.8.0 style fixes - but also pin flake8-3.7.9 vs 3.8.0 'output…
gojomo May 12, 2020
4b7566e
replace vectors_norm with 1d norms
gojomo May 12, 2020
1baab2a
tighten testParallel
gojomo May 13, 2020
8d2f1fe
rm .vocab & 'Vocab' classes; add expandable 'vecattrs'
gojomo May 14, 2020
fc65525
update usages (no vocabs)
gojomo May 15, 2020
4657b14
enable running inside '-m mtprof' (or cProfile) via explicit unittest…
gojomo May 15, 2020
b5ff29b
faster sample_int reads
gojomo May 15, 2020
098119b
load_word2vec_format(.., no_header=True) to support GLoVe text vectors
gojomo May 19, 2020
318a858
refactor & comment lockf feature; allow single-element lockf
gojomo May 26, 2020
fe3ae31
improve FT comment
gojomo May 26, 2020
d503205
rm deprecated/unneded init_sims calls
gojomo May 26, 2020
679dde9
Merge branch 'develop' into kv_cleanup
piskvorky Jul 5, 2020
411473b
fixes to code style
piskvorky Jul 6, 2020
45fd5f6
flake8: fix overlong lines
piskvorky Jul 6, 2020
5acc5f5
Merge branch 'develop' into kv_cleanup
gojomo Jul 6, 2020
5764f8c
rm stray merge error
gojomo Jul 6, 2020
e49ae4c
rm duplicated , old nonstandard hash workarounds
gojomo Jul 6, 2020
278c2bd
use numpy-recommended PRNG constructor
gojomo Jul 6, 2020
5c7eb1c
add sg to FastTextConfig & consult it; rm remaining broken-hash cruft
gojomo Jul 6, 2020
23805d1
reorg conditional packages for clarity
gojomo Jul 6, 2020
f5b902c
comments, names, refactoring, randomization
gojomo Jul 7, 2020
7b571b2
Apply suggestions from code review
gojomo Jul 7, 2020
87860c5
fix cruft left from suggestion
gojomo Jul 7, 2020
39fe128
fix numpy-32bit-on-Windows; executable docs
gojomo Jul 7, 2020
15152ff
mv lee_corpus to utils; cleanup
gojomo Jul 7, 2020
3d424a2
update poincare for latest KV __init__ signature
gojomo Jul 7, 2020
99f7009
restore word_vec method for proper overriding, but rm usages
gojomo Jul 7, 2020
2bb8abf
Apply suggestions from code review
gojomo Jul 7, 2020
33c6508
adjust testParallel against failure risk
gojomo Jul 8, 2020
8f17d6d
merge ~piskvorky's /pull/10 cleanups
gojomo Jul 10, 2020
cb33e46
intensify training for an occasionally failing test
gojomo Jul 11, 2020
581ef06
clarify word/char ngrams handling; rm outdated comments
gojomo Jul 14, 2020
9f21cba
mostly avoid duplciating FastTextConfig fields into locals
gojomo Jul 16, 2020
d912616
avoid copies/pointers for no-bucket (FT as W2V) case
gojomo Jul 16, 2020
583bbe6
rm obsolete test (already skipped & somewhat originally misguided)
gojomo Jul 16, 2020
0330cfc
simpler/faster .get(..., default) (avoids exception-catching in has_i…
gojomo Jul 16, 2020
9caf217
add default option to get_index; avoid exception in has_index_for
gojomo Jul 16, 2020
14dd9f5
chained range check
gojomo Jul 16, 2020
8674949
Merge branch 'develop' into kv_cleanup
mpenkov Jul 19, 2020
0d2679a
Update CHANGELOG.md
mpenkov Jul 19, 2020
2 changes: 0 additions & 2 deletions MANIFEST.in
@@ -28,8 +28,6 @@ include gensim/models/fasttext_inner.pxd
include gensim/models/fasttext_corpusfile.cpp
include gensim/models/fasttext_corpusfile.pyx

- include gensim/models/_utils_any2vec.c
- include gensim/models/_utils_any2vec.pyx
include gensim/corpora/_mmreader.c
include gensim/corpora/_mmreader.pyx
include gensim/_matutils.c
9 changes: 0 additions & 9 deletions docs/src/apiref.rst
@@ -53,8 +53,6 @@ Modules:
models/coherencemodel
models/basemodel
models/callbacks
- models/utils_any2vec
- models/_utils_any2vec
models/word2vec_inner
models/doc2vec_inner
models/fasttext_inner
@@ -63,13 +61,6 @@ Modules:
models/wrappers/ldavowpalwabbit.rst
models/wrappers/wordrank
models/wrappers/varembed
- models/wrappers/fasttext
- models/deprecated/doc2vec
- models/deprecated/fasttext
- models/deprecated/word2vec
- models/deprecated/keyedvectors
- models/deprecated/fasttext_wrapper
- models/base_any2vec
similarities/docsim
similarities/termsim
similarities/index
9 changes: 0 additions & 9 deletions docs/src/models/_utils_any2vec.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/models/base_any2vec.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/deprecated/doc2vec.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/models/deprecated/fasttext.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/models/deprecated/fasttext_wrapper.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/deprecated/keyedvectors.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/deprecated/word2vec.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/utils_any2vec.rst

This file was deleted.

9 changes: 0 additions & 9 deletions docs/src/models/wrappers/fasttext.rst

This file was deleted.

4 changes: 2 additions & 2 deletions gensim/corpora/sharded_corpus.py
@@ -687,8 +687,8 @@ def __add_to_slice(self, s_result, result_start, result_stop, start, stop):
"""
if (result_stop - result_start) != (stop - start):
raise ValueError(
- 'Result start/stop range different than stop/start range (%d - %d vs. %d - %d)'
- % (result_start, result_stop, start, stop)
+ 'Result start/stop range different than stop/start range ({0} - {1} vs. {2} - {3})'
+ .format(result_start, result_stop, start, stop)
Collaborator

I think the general preference in gensim is to avoid format.

@piskvorky Is that still the case?

Owner
@piskvorky Jun 25, 2020

Yes. Preference is %, and then move to f"" which is infinitely nicer once we drop py3.6.

)

# Dense data: just copy using numpy's slice notation
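
The formatting preference raised in the review thread on this hunk can be seen side by side in a small sketch. The helper name `range_mismatch_messages` is hypothetical and is not part of gensim; it only builds the same error text with `%`, `.format`, and an f-string so the three styles can be compared.

```python
def range_mismatch_messages(result_start, result_stop, start, stop):
    """Build the same error text three ways (hypothetical helper, not gensim code)."""
    # %-formatting: the style the review asks to keep for now.
    with_percent = (
        'Result start/stop range different than stop/start range (%d - %d vs. %d - %d)'
        % (result_start, result_stop, start, stop)
    )
    # str.format: the style this hunk introduced.
    with_format = (
        'Result start/stop range different than stop/start range ({0} - {1} vs. {2} - {3})'
        .format(result_start, result_stop, start, stop)
    )
    # f-string (needs Python 3.6 or newer): the style the review wants to move to later.
    with_fstring = (
        'Result start/stop range different than stop/start range '
        f'({result_start} - {result_stop} vs. {start} - {stop})'
    )
    return with_percent, with_format, with_fstring


messages = range_mismatch_messages(0, 10, 5, 20)
print(messages[0])
```

For integer arguments all three produce identical text, so the choice here is about readability and supported Python versions rather than behavior.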
3 changes: 1 addition & 2 deletions gensim/models/__init__.py
@@ -13,7 +13,7 @@
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec # noqa:F401
from .doc2vec import Doc2Vec # noqa:F401
- from .keyedvectors import KeyedVectors, WordEmbeddingSimilarityIndex # noqa:F401
+ from .keyedvectors import KeyedVectors # noqa:F401
from .ldamulticore import LdaMulticore # noqa:F401
from .phrases import Phrases # noqa:F401
from .normmodel import NormModel # noqa:F401
@@ -23,7 +23,6 @@
from .translation_matrix import TranslationMatrix, BackMappingTranslationMatrix # noqa:F401

from . import wrappers # noqa:F401
- from . import deprecated # noqa:F401

from gensim import interfaces, utils

18 changes: 9 additions & 9 deletions gensim/models/_fasttext_bin.py
@@ -435,7 +435,7 @@ def _get_field_from_model(model, field):
requested field name, fields are listed in the `_NEW_HEADER_FORMAT` list
"""
if field == 'bucket':
- return model.trainables.bucket
+ return model.bucket
elif field == 'dim':
return model.vector_size
elif field == 'epoch':
@@ -457,7 +457,7 @@ def _get_field_from_model(model, field):
elif field == 'minn':
return model.wv.min_n
elif field == 'min_count':
- return model.vocabulary.min_count
+ return model.min_count
elif field == 'model':
# `model` => cbow:1, sg:2, sup:3
# cbow = continous bag of words (default)
@@ -467,7 +467,7 @@ def _get_field_from_model(model, field):
elif field == 'neg':
return model.negative
elif field == 't':
- return model.vocabulary.sample
+ return model.sample
elif field == 'word_ngrams':
# This is skipped in gensim loading setting, using the default from FB C++ code
return 1
@@ -531,9 +531,9 @@ def _dict_save(fout, model, encoding):
# In the unsupervised case we have only words (no labels). Hence both fields
# are equal.

- fout.write(np.int32(len(model.wv.vocab)).tobytes())
+ fout.write(np.int32(len(model.wv)).tobytes())

- fout.write(np.int32(len(model.wv.vocab)).tobytes())
+ fout.write(np.int32(len(model.wv)).tobytes())

# nlabels=0 <- no labels we are in unsupervised mode
fout.write(np.int32(0).tobytes())
@@ -544,7 +544,7 @@ def _dict_save(fout, model, encoding):
fout.write(np.int64(-1))

for word in model.wv.index2word:
- word_count = model.wv.vocab[word].count
+ word_count = model.wv.get_vecattr(word, 'count')
Owner

Check: what's going on here, what's this API?

Collaborator Author

Discussed more following #2698 (comment)
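
To make the question in this thread concrete, here is a rough sketch of the idea behind replacing per-word `Vocab` objects with named per-key attributes. `ToyKeyedVectors` and all of its attribute names are hypothetical and are not gensim's implementation; only the call shapes `get_vecattr(word, 'count')` and `len(model.wv)` are taken from the diff above, and `set_vecattr` is assumed here purely for illustration.

```python
class ToyKeyedVectors:
    """Hypothetical stand-in showing the pattern; NOT gensim's KeyedVectors."""

    def __init__(self):
        self.keys = []        # position -> key
        self.key_index = {}   # key -> position
        self.vecattrs = {}    # attribute name -> list aligned with self.keys

    def add_key(self, key):
        self.key_index[key] = len(self.keys)
        self.keys.append(key)
        # Keep every attribute list the same length as the key list.
        for values in self.vecattrs.values():
            values.append(None)

    def set_vecattr(self, key, attr, value):
        # Create the attribute column on first use, padded for existing keys.
        values = self.vecattrs.setdefault(attr, [None] * len(self.keys))
        values[self.key_index[key]] = value

    def get_vecattr(self, key, attr):
        return self.vecattrs[attr][self.key_index[key]]

    def __len__(self):
        # Mirrors the diff's switch from len(model.wv.vocab) to len(model.wv).
        return len(self.keys)


kv = ToyKeyedVectors()
for word, count in [('the', 120), ('cat', 7)]:
    kv.add_key(word)
    kv.set_vecattr(word, 'count', count)

print(len(kv), kv.get_vecattr('cat', 'count'))
```

In this sketch a new attribute is just another column keyed by name, which is one way to read the commit message "add expandable 'vecattrs'".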

fout.write(word.encode(encoding))
fout.write(_END_OF_WORD_MARKER)
fout.write(np.int64(word_count).tobytes())
@@ -572,7 +572,7 @@ def _input_save(fout, model):
ngrams_n, ngrams_dim = model.wv.vectors_ngrams.shape

assert vocab_dim == ngrams_dim
- assert vocab_n == len(model.wv.vocab)
+ assert vocab_n == len(model.wv)
assert ngrams_n == model.wv.bucket

fout.write(struct.pack('@2q', vocab_n + ngrams_n, vocab_dim))
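
As a small illustration of the header write in this hunk: `struct.pack('@2q', ...)` emits two signed 64-bit integers in native byte order, here the total row count (vocabulary plus ngram buckets) and the vector dimension. The helper names below are hypothetical and the round trip through `io.BytesIO` is only a sketch, not the gensim save path.

```python
import io
import struct

HEADER_FORMAT = '@2q'  # two native-order signed 64-bit ints, as in the hunk above


def write_matrix_header(fout, n_rows, n_cols):
    """Write the (rows, columns) header for a matrix (hypothetical helper)."""
    fout.write(struct.pack(HEADER_FORMAT, n_rows, n_cols))


def read_matrix_header(fin):
    """Read back the (rows, columns) header written above (hypothetical helper)."""
    size = struct.calcsize(HEADER_FORMAT)
    return struct.unpack(HEADER_FORMAT, fin.read(size))


# Illustrative sizes only: 3 vocabulary rows, 5 ngram buckets, 4 dimensions.
vocab_n, ngrams_n, vocab_dim = 3, 5, 4

buf = io.BytesIO()
write_matrix_header(buf, vocab_n + ngrams_n, vocab_dim)
buf.seek(0)
header = read_matrix_header(buf)
print(header)
```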
@@ -596,9 +596,9 @@ def _output_save(fout, model):
saved model
"""
if model.hs:
- hidden_output = model.trainables.syn1
+ hidden_output = model.syn1
if model.negative:
- hidden_output = model.trainables.syn1neg
+ hidden_output = model.syn1neg

hidden_n, hidden_dim = hidden_output.shape
fout.write(struct.pack('@2q', hidden_n, hidden_dim))
147 changes: 0 additions & 147 deletions gensim/models/_utils_any2vec.pyx

This file was deleted.
