If you are working on a handful of different NLP tasks, or keep tuning a single NLP pipeline, you probably don't want to recompute embeddings every run. Hence, we cache them.
```sh
pip install embeddingcache
```
```python
from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)
```
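`get_embeddings` returns one embedding per input string (with all-MiniLM-L6-v2, each vector has 384 dimensions). Strings already present in the cache are read from disk rather than re-embedded.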
We use SQLite3 to cache embeddings. [This could easily be adapted to another backend, since we use SQLAlchemy.]
We assume read-heavy loads with a single concurrent writer. (However, we retry on write failures; see the sketch below.)
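The retry behavior might look roughly like the following hypothetical helper; `retry_write`, `attempts`, and `base_delay` are illustrative names, not part of the package's API:

```python
import time

from sqlalchemy.exc import OperationalError


def retry_write(write_fn, attempts=5, base_delay=0.1):
    """Retry a write callable on transient 'database is locked' errors."""
    for attempt in range(attempts):
        try:
            return write_fn()
        except OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt)  # exponential backoff
```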
We shard SQLite3 into two databases:

- `hashstring.db`: `hashstring` table. Each row maps a (unique, primary-key) SHA512 hash to its text (also unique). Both fields are indexed.
- `[embedding_model_name].db`: `embedding` table. Each row maps a (unique, primary-key) SHA512 hash to a 1-D numpy (float32) vector, which we serialize to the table as bytes.
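For illustration only, here is a minimal SQLAlchemy sketch of that schema, plus the hash and serialization round-trip; the class, table, and column names are assumptions, not the package's actual definitions:

```python
import hashlib

import numpy as np
from sqlalchemy import Column, LargeBinary, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class HashString(Base):
    """hashstring.db: SHA512 digest -> original text (illustrative names)."""
    __tablename__ = "hashstring"
    hash = Column(LargeBinary(64), primary_key=True)  # SHA512 digests are 64 bytes
    text = Column(Text, unique=True, index=True, nullable=False)


class Embedding(Base):
    """[embedding_model_name].db: SHA512 digest -> serialized float32 vector."""
    __tablename__ = "embedding"
    hash = Column(LargeBinary(64), primary_key=True)
    vector = Column(LargeBinary, nullable=False)  # raw float32 bytes


# Key computation (assuming UTF-8 encoding) and the vector round-trip:
digest = hashlib.sha512("I love Berlin.".encode("utf-8")).digest()
blob = np.asarray([0.1, 0.2, 0.3], dtype=np.float32).tobytes()
restored = np.frombuffer(blob, dtype=np.float32)  # back to a 1-D float32 array
```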
For development:

```sh
pre-commit install
pip install -e .
pytest
```
TODO:

- Update pyproject.toml
- Add tests
- Consider other hash functions?
- float32 and float64 support
- Consider adding optional joblib for caching?
- Different ways of computing embeddings (e.g. using an API) rather than locally
- S3 backup and/or:
  - WAL
  - Litestream
- Retry on write errors
- Other DB backends
- Best practices: give a specific OpenAI model version number.
- RocksDB / RocksDB-cloud?
- Include the model name in the DB as a sanity check on slugification.
- Validate the numpy array size.
- Validate BLOB size for hashes.
- Add openai and sentence-transformers as optional dependencies
- Also consider other embedding providers, e.g. Cohere
- Split out dev-only libraries similarly
- Consider the max_length of each text to embed, and warn if we exceed it
- Add API docs with pdoc3 and/or Sphinx
- Normalize embeddings by default, but add an option to disable it (see the sketch after this list)
- Option to return torch tensors
- Consider reusing the same DB connection instead of creating it from scratch every time.
- Add batch_size parameter?
- Add a test that checks for hash collisions
- Use logging instead of verbose output.
- Rewrite using classes.
- Fix Dependabot.
- Don't keep re-creating the DB session; store it in the class or a global
- DRY.
- Suggest using a versioned OpenAI model
- Add a device option for sentence-transformers
- Allow fast_sentence_transformers
- Test that things work if there are duplicate strings
- Remove DBs after tests
- Do we have to have the nested embeddingcache.embeddingcache import for all calls?
- Add Codecov and code quality shields
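For the normalization item above, here is a minimal sketch of what default L2-normalization with an opt-out might look like; the `normalize` flag is hypothetical, not part of the current API:

```python
import numpy as np


def maybe_normalize(embeddings: np.ndarray, normalize: bool = True) -> np.ndarray:
    """L2-normalize each row; `normalize` is a hypothetical opt-out flag."""
    if not normalize:
        return embeddings
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against zero vectors to avoid division by zero.
    return embeddings / np.maximum(norms, 1e-12)
```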