Support for fastText, word2vec, and text embeddings
The largest change in this release is support for reading fastText, word2vec, and text embeddings, in addition to finalfusion embeddings.
- Add support for reading fastText (`Embeddings.read_fasttext()`), text (`Embeddings.read_text()`), textdims (`Embeddings.read_text_dims()`), and word2vec (`Embeddings.read_word2vec()`) formats.
- The reader for each of these newly supported formats accepts a keyword argument `lossy`. If set, the embeddings are read lossily, permitting invalid UTF-8 in words.
- Add the `embedding_similarity` method, which looks up words that are similar to a given embedding. The method for traditional word-based lookups has been renamed from `similarity` to `word_similarity`.
- Iteration over embeddings returned tuples `(word, embedding)` in previous releases. Instances of the `Embedding` class are now returned, which provide `word`, `embedding`, and `norm` properties. `norm` is the l2 norm of the embedding before normalization.
- Add support for memory mapping quantized embedding matrices.
- Add the `ngram_indices` and `subword_indices` methods to the `Vocab` class. These methods return the subword indices for a given word, which can be used to retrieve the subword embeddings individually. The `ngram_indices` method returns each subword with its index, whereas `subword_indices` only returns the indices.
- Update to pyo3 0.8.
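
To make the `lossy` keyword argument more concrete, here is a minimal sketch in plain Python. It does not call finalfusion. It assumes that lossy reading replaces invalid byte sequences with U+FFFD, as lossy UTF-8 conversion conventionally does; the notes above do not spell out the exact replacement behavior. `decode_word` is a hypothetical helper, not part of the library.

```python
# Illustration only: plain Python, not the finalfusion reader.
# decode_word is a hypothetical helper introduced for this sketch.

def decode_word(raw: bytes, lossy: bool = False) -> str:
    """Decode a vocabulary entry, optionally tolerating invalid UTF-8."""
    # Assumption: lossy mode substitutes U+FFFD for invalid byte sequences.
    return raw.decode("utf-8", errors="replace" if lossy else "strict")


# "café" encoded as Latin-1; the lone 0xE9 byte is not valid UTF-8.
raw_word = b"caf\xe9"

try:
    decode_word(raw_word)
except UnicodeDecodeError as err:
    print(f"strict read fails: {err.reason}")

# The lossy read succeeds, but the damaged character is replaced.
print(repr(decode_word(raw_word, lossy=True)))
```

The trade-off is that a lossy read keeps the embedding but stores it under an altered word, so lookups with the original spelling may no longer match.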
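
The `norm` property is useful because a normalized embedding has lost its original magnitude. The sketch below uses plain Python lists rather than finalfusion objects; `normalize` is a hypothetical helper, and the sketch assumes embeddings are stored as unit vectors, as the description of `norm` implies.

```python
import math

# Illustration only: plain Python, not finalfusion's Embedding class.


def normalize(vector):
    """Return (unit_vector, norm), where norm is the l2 norm before scaling."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm


unit, norm = normalize([3.0, 4.0])

# Multiplying the unit vector by the stored norm recovers the
# original, unnormalized embedding (up to floating-point error).
restored = [x * norm for x in unit]
print(unit, norm, restored)
```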
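
The difference between the two new `Vocab` methods can be sketched as follows. This is a toy model, not the library's indexer: the n-gram lengths, the FNV-1a hash, the bucket count, and all function names here are assumptions chosen for illustration, and the indices it produces will not match those of real finalfusion or fastText embeddings.

```python
# Toy model of subword indexing. All names and parameters below are
# hypothetical; finalfusion's actual indexer may differ in every detail.


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary marks."""
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash, used here only to map n-grams to buckets."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def toy_ngram_indices(word, n_buckets=2 ** 21):
    """Each subword paired with its bucket index."""
    return [
        (ngram, fnv1a_32(ngram.encode("utf-8")) % n_buckets)
        for ngram in char_ngrams(word)
    ]


def toy_subword_indices(word, n_buckets=2 ** 21):
    """Only the bucket indices, in the same order as toy_ngram_indices."""
    return [index for _, index in toy_ngram_indices(word, n_buckets)]


for ngram, index in toy_ngram_indices("tea", n_buckets=1000):
    print(ngram, index)
print(toy_subword_indices("tea", n_buckets=1000))
```

The paired form is convenient for inspecting which subwords contribute to a word; the index-only form is enough when the goal is simply to look up rows of the embedding matrix.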