Package for interfacing with Stanford's C implementation of GloVe from Python.
Install glovpy from PyPI:

```bash
pip install glovpy
```
Additionally, the first time you import glovpy, it will build GloVe from source on your system.
We highly recommend that you use a Unix-based system, preferably a variant of Debian.
The package needs `git`, `make`, and a C compiler (`clang` or `gcc`) installed.
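On Debian-based systems, for example, the prerequisites can usually be installed in one step (package names may vary on other distributions; this is just a suggested setup):

```bash
# git for fetching the GloVe sources; build-essential provides make and gcc.
sudo apt-get install git build-essential
```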
Otherwise the implementation is as bare-bones as it gets: only the Python standard library and Gensim are used (Gensim only for producing KeyedVectors).
Here's a quick example of how to train GloVe on the 20 Newsgroups dataset using Gensim's tokenizer.
```python
from gensim.utils import tokenize
from sklearn.datasets import fetch_20newsgroups

from glovpy import GloVe

# Fetch the corpus and tokenize every document.
texts = fetch_20newsgroups().data
corpus = [list(tokenize(text, lowercase=True, deacc=True)) for text in texts]

# Train 25-dimensional GloVe embeddings.
model = GloVe(vector_size=25)
model.train(corpus)

# Print the words most similar to "god".
for word, similarity in model.wv.most_similar("god"):
    print(f"{word}, sim: {similarity}")
```
| word | similarity |
|---|---|
| existence | 0.9156746864 |
| jesus | 0.8746870756 |
| lord | 0.8555182219 |
| christ | 0.8517201543 |
| bless | 0.8298447728 |
| faith | 0.8237065077 |
| saying | 0.8204566240 |
| therefore | 0.8177698255 |
| desires | 0.8094088435 |
| telling | 0.8083973527 |
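Since `model.wv` is an ordinary Gensim KeyedVectors object, you can also look up vectors and pairwise similarities directly. A small sketch continuing the example above (the word pair is just for illustration):

```python
# Raw 25-dimensional embedding for a single word.
vector = model.wv["god"]
print(vector.shape)  # (25,)

# Cosine similarity between two in-vocabulary words.
print(model.wv.similarity("god", "jesus"))
```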
```
class glovpy.GloVe(vector_size, window_size, symmetric, distance_weighting, alpha, min_count, iter, initial_learning_rate, threads, memory)
```
Wrapper around the original C implementation of GloVe.
| Parameter | Type | Description | Default |
|---|---|---|---|
| vector_size | int | Number of dimensions the trained word vectors should have. | 50 |
| window_size | int | Number of context words to the left (and to the right, if symmetric is True). | 15 |
| symmetric | bool | If True, both past and future words are used as context; otherwise only past words. | True |
| distance_weighting | bool | If True, each cooccurrence count is weighted by the inverse of the distance between the target word and the context word; if False, counts are not weighted by distance. | True |
| alpha | float | Exponent of the weighting function. | 0.75 |
| min_count | int | Minimum number of times a token has to appear to be kept in the vocabulary. | 5 |
| iter | int | Number of training iterations. | 25 |
| initial_learning_rate | float | Initial learning rate for training. | 0.05 |
| threads | int | Number of threads to use for training. | 8 |
| memory | float | Soft limit for memory consumption, in GB (based on a simple heuristic, so not extremely accurate). | 4.0 |
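All of these can be passed as keyword arguments. A purely illustrative configuration (the values below are arbitrary examples, not recommendations):

```python
from glovpy import GloVe

# Every keyword corresponds to a parameter documented above;
# the specific values are arbitrary, not tuned defaults.
model = GloVe(
    vector_size=100,  # larger embeddings
    window_size=10,   # narrower context window
    symmetric=True,   # use context on both sides of the target word
    min_count=10,     # prune rare tokens
    iter=50,          # more training iterations
    threads=4,        # limit CPU usage
)
```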
Attributes:

| Name | Type | Description |
|---|---|---|
| wv | KeyedVectors | Token embeddings in the form of Gensim keyed vectors. |
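Because `wv` is a standard Gensim KeyedVectors instance, the trained embeddings can be saved and reloaded with Gensim's own I/O. A quick sketch (the file name is arbitrary):

```python
from gensim.models import KeyedVectors

# Persist the trained embeddings...
model.wv.save("glove.kv")

# ...and load them back later without retraining.
word_vectors = KeyedVectors.load("glove.kv")
print(word_vectors.most_similar("god"))
```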
`glovpy.GloVe.train(tokens)`

Train the model on a stream of texts.
| Parameter | Type | Description |
|---|---|---|
| tokens | Iterable[list[str]] | Stream of documents in the form of lists of tokens. The stream has to be reusable, as the model needs at least two passes over the corpus. |
`glovpy.utils.reusable(gen_func)`

Function decorator that turns your generator function into a reusable iterator, so that multiple passes can be made over its output.
| Parameter | Type | Description |
|---|---|---|
| gen_func | Callable | Generator function that you want to be reusable. |

| Returns | Type | Description |
|---|---|---|
| _multigen | Callable | Iterator class wrapping the generator function. |
Here's how to stream a very long file line by line in a reusable manner.
```python
from gensim.utils import tokenize

from glovpy import GloVe
from glovpy.utils import reusable

# The decorator makes the stream reusable, so GloVe can make
# the multiple passes it needs over the corpus.
@reusable
def stream_lines():
    with open("very_long_text_file.txt") as f:
        for line in f:
            yield list(tokenize(line))

model = GloVe()
model.train(stream_lines())
```
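To see why the decorator is needed: a plain generator is exhausted after a single pass, while a `reusable`-wrapped generator function restarts on every iteration, as the documented contract requires. A minimal toy sketch:

```python
from glovpy.utils import reusable

def plain():
    yield from [["hello", "world"]]

g = plain()
print(list(g))  # [['hello', 'world']]
print(list(g))  # [] -- exhausted after the first pass

@reusable
def wrapped():
    yield from [["hello", "world"]]

rg = wrapped()
print(list(rg))  # [['hello', 'world']]
print(list(rg))  # [['hello', 'world']] -- each pass restarts the generator
```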