Note: This currently relies on using the --output_format="webdataset"
option from img2dataset. If your images are not inside .tar
files, this will not work correctly.
CLIP embeddings generated by clip-retrieval are not ordered the same as the webdataset they are generated from. This tool can reorder large CLIP embedding datasets such that they match the order of the image dataset they were generated from.
git clone https://github.com/Veldrovive/embedding-dataset-reordering
cd embedding-dataset-reordering
pip install -e .
This module exposes three functions. Example commands are meant to be run from inside the examples folder.
For example, to download the test dataset with img2dataset, navigate to the root directory and run cd examples && reorder-embeddings download-data.
To generate embeddings with clip-retrieval for this test data, run reorder-embeddings clip-inference from the examples folder.
reorder: Takes as input an unordered embedding dataset along with the metadata generated by clip-retrieval and reorders the embeddings to match the order of the image dataset.
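The idea behind reordering can be illustrated with a minimal sketch. This is not the tool's implementation: the data is made up, the function name reorder_by_key is hypothetical, and it assumes each embedding comes with the key of the image it was computed from and that keys are fixed-width, zero-padded strings.

```python
# Hypothetical data: (key, embedding) pairs in the shuffled order an
# inference run might emit them. Keys are fixed-width, zero-padded strings.
unordered = [
    ("0000114", [0.3, 0.1]),
    ("0000022", [0.9, 0.4]),
    ("0000209", [0.2, 0.8]),
    ("0000031", [0.5, 0.5]),
]


def reorder_by_key(pairs):
    """Return the pairs sorted so they follow the key order of the dataset."""
    # Every key has the same width, so lexicographic order on the strings
    # is the same as numeric order on (shard, index).
    return sorted(pairs, key=lambda pair: pair[0])


ordered = reorder_by_key(unordered)
print([key for key, _ in ordered])
```

Sorting on the string key only works because the keys are zero-padded to a constant width, which is why the widths matter for the real tool.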
Note: Before starting, you need to find the shard string width and index string width of your dataset. This is a manual step, but the values are easy to find. Navigate to the metadata directory of your embedding dataset and run reorder-embeddings example_key.
This will print something similar to:
Example Keys:
Shard 3 has keys ['0000309', '0000321']
Shard 2 has keys ['0000209', '0000237']
Shard 0 has keys ['0000022', '0000031']
Shard 1 has keys ['0000114', '0000123']
By inspection, we can see that the first 5 characters represent the index of the shard (i.e. the keys for shard 3 start with 00003). The remaining characters represent the index of the sample within the shard. For the 7-character keys shown above, that leaves 2 digits, so the index width is 2.
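The same inspection can be sketched in a few lines of Python. The helper split_key is hypothetical and not part of this package. It assumes a key is simply the shard string followed by the index string, and that you already know the shard width (5 in the output above).

```python
from typing import Tuple


def split_key(key: str, shard_width: int) -> Tuple[int, int, int]:
    """Split a key into (shard, index, index_width).

    Assumes the first `shard_width` characters are the shard string and
    everything after them is the index string.
    """
    shard_part, index_part = key[:shard_width], key[shard_width:]
    return int(shard_part), int(index_part), len(index_part)


for key in ["0000309", "0000321", "0000022"]:
    shard, index, index_width = split_key(key, shard_width=5)
    print(key, "-> shard", shard, "index", index, "index width", index_width)
```

If the shard number printed here does not match the shard reported by example_key, the assumed shard width is wrong.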
Parameters
embeddings_folder: Path to the folder containing the embedding .npy files.
metadata_folder: Path to the folder containing the .parquet metadata files.
output_folder: Path to the folder where the reordered .npy files will be saved.
index_width: The index width found above.
output_shard_width: The width of the shard string for the output files. Should be the same as the shard width for the webdataset.
limit: The number of shards to reorder.
run-concurrent: The number of workers to use during reordering.
verbose: Whether to print expanded logging.
tmp-folder: With many workers, the temporary file directories get very large. If this is a problem, reduce the number of workers or set tmp-folder to a location with more space available.
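To make the shard width concrete, here is a small sketch of how a shard number is rendered as a fixed-width string. The helper shard_name is hypothetical, and the .tar name assumes the webdataset shards are named after their zero-padded shard number.

```python
def shard_name(shard: int, shard_width: int) -> str:
    """Render a shard number as a zero-padded string of the given width."""
    return f"{shard:0{shard_width}d}"


# With a shard width of 5, shard 3 lines up with a tar file named 00003.tar.
print(shard_name(3, 5) + ".tar")  # prints 00003.tar
```

Using the same width for the output files keeps each reordered file name aligned with the shard it corresponds to.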
download-data: Uses img2dataset to download a test dataset. Run this from the examples directory to download the default one.
clip-inference: Uses clip-retrieval to generate embeddings for the test dataset. Run this from the examples directory after downloading the test dataset.
Set up a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
To run tests:
pip install -r requirements-test.txt
then
make lint
make test
You can use make black to reformat the code.