This is minimal demo/skeleton code for CLIP curation; please check Algorithm 1 in the MetaCLIP paper. It is not the pipeline used to collect the data in the paper; see the README for details.
The key function for sub-string matching is in substr_matching.py.
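As a rough illustration (not the exact implementation in substr_matching.py), sub-string matching scans each alt-text for occurrences of metadata entries and records which entries matched. A minimal sketch, assuming `metadata` is a list of entry strings and `match_entries` is an illustrative helper name:

```python
def match_entries(text, metadata):
    # Minimal sketch: return ids of metadata entries that occur in the text.
    # Single-word entries are matched on whitespace boundaries; punctuation
    # handling and fast indexing (needed for ~500k entries) are omitted.
    spaced_text = f" {text.lower()} "
    matched_entry_ids = []
    for entry_id, entry in enumerate(metadata):
        entry = entry.lower()
        needle = entry if " " in entry else f" {entry} "
        if needle in spaced_text:
            matched_entry_ids.append(entry_id)
    return matched_entry_ids
```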
We also include a CommonCrawl WARC parser (in cc_matching.py) that requires the following packages for WARC reading, fast HTML parsing, and language identification:
pip install warcio
pip install selectolax
pip install fasttext-langdetect # for LID
pip install tqdm
For a customized HTML parser, check the selectolax documentation for more details.
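For reference, a minimal sketch of pulling image-text candidates out of an HTML page with selectolax; the attribute choice and filtering here are illustrative, not the exact logic in cc_matching.py:

```python
from selectolax.parser import HTMLParser

def extract_img_alt_pairs(html):
    # Collect (img src, alt text) candidates from a parsed HTML page.
    pairs = []
    for node in HTMLParser(html).css("img"):
        src = node.attributes.get("src")
        alt = node.attributes.get("alt")
        if src and alt and alt.strip():
            pairs.append((src, alt.strip()))
    return pairs
```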
The parser supports both WAT and WARC formats (WAT is a pre-parsed format derived from WARC; we expect 1% loss of image-text pairs). Get a CommonCrawl WAT/WARC file; S3 is recommended, but here is a quick HTTP example:
mkdir -p data/CC/warc; mkdir -p data/CC/wat; mkdir -p data/CC/matched
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz -O data/CC/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
Then run:
python metaclip/cc_matching.py data/CC/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz data/CC/matched/CC-MAIN-20180420081400-20180420101400-00000.warc.gz.json
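Conceptually, the parser walks WARC records, parses HTML responses, runs language identification on the alt-texts, and keeps pairs that match a metadata entry. A minimal sketch with warcio and fasttext-langdetect, reusing the illustrative helpers sketched above (`extract_img_alt_pairs`, `match_entries`); the English-only filter is an example, not necessarily the exact logic in cc_matching.py:

```python
from warcio.archiveiterator import ArchiveIterator
from ftlangdetect import detect

def iter_matched_pairs(warc_path, metadata):
    with open(warc_path, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            for src, alt in extract_img_alt_pairs(html):
                # fasttext cannot handle newlines, so strip them before LID
                if detect(alt.replace("\n", " "))["lang"] != "en":
                    continue
                entry_ids = match_entries(alt, metadata)
                if entry_ids:
                    yield src, alt, entry_ids
```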
Want a distributed system to parse the full CommonCrawl and download a dataset? Consider integrating substr_matching.py and balancing.py into an open-source system such as cc2dataset and img2dataset.
mkdir -p data/CC/balanced
python metaclip/balancing.py data/CC/matched data/CC/balanced 20000 # the magic 20k !
We expect balancing to be the last step, to ensure the training data distribution. If you want to run it before image downloading/NSFW filtering/dedup etc., increase 20000 to a larger number and rerun balancing after the images are downloaded, to accommodate the loss of URL-text pairs.
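As a rough sketch of the balancing rule (see Algorithm 1 in the paper for the exact formulation): each metadata entry is kept with probability min(1, t / count), and here a pair survives if one draw beats the probability of its best matched entry. `matched_entry_ids` (per-pair lists of matched entry ids) and `entry_count` (per-entry match counts) are illustrative names:

```python
import numpy as np

def balance_sampling(matched_entry_ids, entry_count, t=20000):
    # Tail entries (count < t) are always kept; head entries are
    # subsampled down to roughly t pairs each.
    entry_count = np.maximum(entry_count, t)
    entry_prob = t / entry_count
    kept = []
    for pair_idx, entry_ids in enumerate(matched_entry_ids):
        # keep a pair if a single draw beats its highest matched-entry probability
        if len(entry_ids) > 0 and np.random.random() < entry_prob[entry_ids].max():
            kept.append(pair_idx)
    return kept
```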
We also provide a numpy impl. of the algorithm, which is close to the impl. in the paper.
python metaclip/pipeline.py metaclip_400m substr_indexing
python metaclip/pipeline.py metaclip_400m entry_count
python metaclip/pipeline.py metaclip_400m balance_sampling
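For the entry_count stage, a minimal numpy sketch of aggregating how many pairs matched each metadata entry (again with illustrative names, not the exact code in pipeline.py):

```python
import numpy as np

def count_entries(matched_entry_ids, num_entries):
    # Concatenate matched entry ids over all pairs and count occurrences per entry.
    flat = [np.asarray(ids, dtype=np.int64) for ids in matched_entry_ids if len(ids)]
    if not flat:
        return np.zeros(num_entries, dtype=np.int64)
    return np.bincount(np.concatenate(flat), minlength=num_entries)
```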
- integrate numpy impl. w/ WARC/WAT parser