If you are working on a handful of different NLP tasks, or keep tuning a single NLP pipeline, you probably don't want to recompute embeddings every run. Hence, we cache them.
```sh
pip install embeddingcache
```
```python
from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)
```
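`get_embeddings` returns one embedding per input string (with all-MiniLM-L6-v2, each vector has 384 dimensions). Strings already present in the cache are read from disk rather than re-embedded.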
We use SQLite3 to cache embeddings. [This could easily be adapted to another backend, since we use SQLAlchemy.]
We assume read-heavy loads with a single concurrent writer. (However, we retry on write failures; see the sketch below.)
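The retry behavior might look roughly like the following hypothetical helper; `retry_write`, `attempts`, and `base_delay` are illustrative names, not part of the package's API:

```python
import time

from sqlalchemy.exc import OperationalError


def retry_write(write_fn, attempts=5, base_delay=0.1):
    """Retry a write callable on transient 'database is locked' errors."""
    for attempt in range(attempts):
        try:
            return write_fn()
        except OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt)  # exponential backoff
```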
We shard SQLite3 into two databases:

- `hashstring.db`: `hashstring` table. Each row maps a (unique, primary-key) SHA512 hash to its text (also unique). Both fields are indexed.
- `[embedding_model_name].db`: `embedding` table. Each row maps a (unique, primary-key) SHA512 hash to a 1-D numpy (float32) vector, which we serialize to the table as bytes.
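For illustration only, here is a minimal SQLAlchemy sketch of that schema, plus the hash and serialization round-trip; the class, table, and column names are assumptions, not the package's actual definitions:

```python
import hashlib

import numpy as np
from sqlalchemy import Column, LargeBinary, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class HashString(Base):
    """hashstring.db: SHA512 digest -> original text (illustrative names)."""
    __tablename__ = "hashstring"
    hash = Column(LargeBinary(64), primary_key=True)  # SHA512 digests are 64 bytes
    text = Column(Text, unique=True, index=True, nullable=False)


class Embedding(Base):
    """[embedding_model_name].db: SHA512 digest -> serialized float32 vector."""
    __tablename__ = "embedding"
    hash = Column(LargeBinary(64), primary_key=True)
    vector = Column(LargeBinary, nullable=False)  # raw float32 bytes


# Key computation (assuming UTF-8 encoding) and the vector round-trip:
digest = hashlib.sha512("I love Berlin.".encode("utf-8")).digest()
blob = np.asarray([0.1, 0.2, 0.3], dtype=np.float32).tobytes()
restored = np.frombuffer(blob, dtype=np.float32)  # back to a 1-D float32 array
```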
For development:

```sh
pre-commit install
pip install -e .
pytest
```
TODO:

- Update pyproject.toml
- Add tests
- Consider other hash functions?
- float32 and float64 support
- Consider adding optional joblib for caching?
- Different ways of computing embeddings (e.g. using an API) rather than locally
- S3 backup and/or:
  - WAL
  - Litestream
- Retry on write errors
- Other DB backends
- Best practices: give a specific OpenAI model version number.
- RocksDB / RocksDB-cloud?
- Include the model name in the DB as a sanity check on slugification.
- Validate the numpy array size.
- Validate BLOB size for hashes.
- Add openai and sentence-transformers as optional dependencies
- Also consider other embedding providers, e.g. Cohere
- Split out dev-only libraries similarly
- Consider the max_length of each text to embed, and warn if we exceed it
- Add API docs with pdoc3 and/or Sphinx
- Normalize embeddings by default, but add an option to disable it (see the sketch after this list)
- Option to return torch tensors
- Consider reusing the same DB connection instead of creating it from scratch every time.
- Add batch_size parameter?
- Add a test that checks for hash collisions
- Use logging instead of verbose output.
- Rewrite using classes.
- Fix Dependabot.
- Don't keep re-creating the DB session; store it in the class or a global
- DRY.
- Suggest using a versioned OpenAI model
- Add a device option for sentence-transformers
- Allow fast_sentence_transformers
- Test that things work if there are duplicate strings
- Remove DBs after tests
- Do we have to have the nested embeddingcache.embeddingcache import for all calls?
- Add Codecov and code quality shields
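For the normalization item above, here is a minimal sketch of what default L2-normalization with an opt-out might look like; the `normalize` flag is hypothetical, not part of the current API:

```python
import numpy as np


def maybe_normalize(embeddings: np.ndarray, normalize: bool = True) -> np.ndarray:
    """L2-normalize each row; `normalize` is a hypothetical opt-out flag."""
    if not normalize:
        return embeddings
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against zero vectors to avoid division by zero.
    return embeddings / np.maximum(norms, 1e-12)
```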