turian/embeddingcache

embeddingcache

Retrieve text embeddings, but cache them locally if we have already computed them.

Motivation

If you are doing a handful of different NLP tasks, or have a single NLP pipeline you keep tuning, you probably don't want to recompute embeddings. Hence, we cache them.

Quickstart

pip install embeddingcache

from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

# Embeddings already in the local cache are reused; the rest are computed.
embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)

Design assumptions

We use SQLite3 to cache embeddings. (This could easily be adapted to another database backend, since we use SQLAlchemy.)
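To make the caching idea concrete, here is a minimal sketch of a read-through cache on top of SQLite. It uses the standard-library `sqlite3` module rather than SQLAlchemy to stay self-contained, and it is not the package's actual implementation: the function name `cached_lookup`, the `compute` callback, and the table and column names are all assumptions made for illustration.

```python
import hashlib
import sqlite3
from typing import Callable, Dict, List


def cached_lookup(
    conn: sqlite3.Connection,
    strs: List[str],
    compute: Callable[[List[str]], List[bytes]],
) -> List[bytes]:
    """Return one blob per input string, computing only the cache misses.

    `compute` is a hypothetical callback that turns a batch of strings
    into one serialized embedding (bytes) per string.
    """
    # Hypothetical table layout: SHA512 digest -> serialized vector.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embedding "
        "(hash BLOB PRIMARY KEY, vec BLOB NOT NULL)"
    )
    hashes = [hashlib.sha512(s.encode("utf-8")).digest() for s in strs]

    # Look up which hashes are already cached.
    found: Dict[bytes, bytes] = {}
    for h in set(hashes):
        row = conn.execute(
            "SELECT vec FROM embedding WHERE hash = ?", (h,)
        ).fetchone()
        if row is not None:
            found[h] = row[0]

    # Compute each missing string once, preserving first-seen order.
    missing = list(
        dict.fromkeys(s for s, h in zip(strs, hashes) if h not in found)
    )
    if missing:
        for s, vec in zip(missing, compute(missing)):
            h = hashlib.sha512(s.encode("utf-8")).digest()
            conn.execute(
                "INSERT OR IGNORE INTO embedding (hash, vec) VALUES (?, ?)",
                (h, vec),
            )
            found[h] = vec
        conn.commit()

    # Results line up with the inputs, duplicates included.
    return [found[h] for h in hashes]
```

Deduplicating the misses before calling `compute` means a string that appears several times in one request is embedded only once.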

We assume read-heavy loads, with at most one writer at a time. (However, we retry on write failures.)
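One way to retry on write failures is to wrap the write in a small backoff loop. The sketch below is an assumption about how this could look, not the package's own retry logic: `retry_on_locked`, its parameters, and the choice of exponential backoff are hypothetical, and it catches `sqlite3.OperationalError`, which is what the standard-library driver raises when the database is locked.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_locked(
    write: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
) -> T:
    """Call write(), retrying with exponential backoff if SQLite reports an error.

    Hypothetical helper: the final failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return write()
        except sqlite3.OperationalError:
            # Out of retries: surface the original error.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A caller would pass a closure that performs the insert and commit, for example `retry_on_locked(lambda: save_rows(conn, rows))`, where `save_rows` is whatever write function the application defines.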

We shard SQLite3 into two databases:

  • hashstring.db: hashstring table. Each row maps a SHA512 hash (unique, primary key) to its text (also unique). Both fields are indexed.
  • [embedding_model_name].db: embedding table. Each row maps a SHA512 hash (unique, primary key) to a 1-dimensional NumPy float32 vector, which we serialize to the table as bytes.
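The two-table layout and the byte serialization can be sketched as follows. This is an illustration under assumptions, not the package's schema: the column names, the helper names `text_hash`, `to_blob`, and `from_blob`, and the use of in-memory databases with the standard-library `sqlite3` module (instead of SQLAlchemy and on-disk files) are all hypothetical.

```python
import hashlib
import sqlite3

import numpy as np


def text_hash(text: str) -> bytes:
    """SHA512 digest of the UTF-8 encoded text (64 bytes)."""
    return hashlib.sha512(text.encode("utf-8")).digest()


def to_blob(vec: np.ndarray) -> bytes:
    """Serialize a 1-dimensional vector as raw float32 bytes."""
    return np.asarray(vec, dtype=np.float32).tobytes()


def from_blob(blob: bytes) -> np.ndarray:
    """Rebuild the float32 vector; the result is a read-only view of the bytes."""
    return np.frombuffer(blob, dtype=np.float32)


# Stand-ins for hashstring.db and [embedding_model_name].db.
hash_db = sqlite3.connect(":memory:")
hash_db.execute(
    "CREATE TABLE hashstring (hash BLOB PRIMARY KEY, text TEXT NOT NULL UNIQUE)"
)
emb_db = sqlite3.connect(":memory:")
emb_db.execute(
    "CREATE TABLE embedding (hash BLOB PRIMARY KEY, vec BLOB NOT NULL)"
)

text = "I love Berlin."
vec = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # toy 3-dim "embedding"
key = text_hash(text)

# The same hash is the key in both databases.
hash_db.execute("INSERT INTO hashstring (hash, text) VALUES (?, ?)", (key, text))
emb_db.execute("INSERT INTO embedding (hash, vec) VALUES (?, ?)", (key, to_blob(vec)))

(blob,) = emb_db.execute(
    "SELECT vec FROM embedding WHERE hash = ?", (key,)
).fetchone()
restored = from_blob(blob)
```

Raw bytes carry neither dtype nor length, so the reader has to know both in advance; this is why validating the array size on read, as listed in the TODO section, is worth doing.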

Developer instructions

pre-commit install
pip install -e .
pytest

TODO

  • Update pyproject.toml
  • Add tests
  • Consider other hash functions?
  • float32 and float64 support
  • Consider adding optional joblib for caching?
  • Different ways of computing embeddings (e.g. using an API) rather than locally
  • S3 backup and/or
  • WAL
  • LiteStream
  • Retry on write errors
  • Other DB backends
  • Best practices: Give specific OpenAI version number.
  • RocksDB / RocksDB-cloud?
  • Include model name in DB for sanity check on slugify.
  • Validate on numpy array size.
  • Validate BLOB size for hashes.
  • Add optional libraries like openai and sentence-transformers
    • Also consider other embedding providers, e.g. cohere
    • And libs just for devs
  • Consider the max_length of each text to embed, warn if we exceed
  • pdoc3 and/or sphinx
  • Normalize embeddings by default, but add option
  • Option to return torch tensors
  • Consider reusing the same DB connection instead of creating it from scratch every time.
  • Add batch_size parameter?
  • Test check for collisions
  • Use logging not verbose output.
  • Rewrite using classes.
  • Fix dependabot.
  • Don't keep re-using DB session, store it in the class or global
  • DRY.
  • Suggest to use versioned OpenAI model
  • Add device to sentence transformers
  • Allow fast_sentence_transformers
  • Test that things work if there are duplicate strings
  • Remove DBs after test
  • Do we have to have nested embedding.embedding for all calls?
  • codecov and code quality shields