
fix large index size #23

Merged
merged 1 commit into facebookresearch:main on Nov 23, 2023
Conversation

littlewine
Contributor

Fix bug #22 (and #18 partially): passage embeddings were saved to disk multiple times (× num_shards)
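For context, the failure mode looks roughly like this (a minimal sketch, not the actual repo code; the function names and the chunking dimension are assumptions):

import torch

# Buggy pattern (illustrative): every shard writes the FULL embedding
# matrix, so the total on-disk size is multiplied by num_shards.
def save_embeddings_buggy(embeddings: torch.Tensor, num_shards: int) -> None:
    for shard_id in range(num_shards):
        torch.save(embeddings, f"embeddings.{shard_id}.pt")  # whole tensor, every time

# Fixed pattern: each shard writes only its own slice.
def save_embeddings_fixed(embeddings: torch.Tensor, num_shards: int) -> None:
    # dim=1 indexes passages here, matching the [dim, num_passages] layout seen below
    for shard_id, shard in enumerate(torch.chunk(embeddings, num_shards, dim=1)):
        torch.save(shard, f"embeddings.{shard_id}.pt")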

@facebook-github-bot added the CLA Signed label on Nov 2, 2023
@mlomeli1
Contributor

mlomeli1 commented Nov 6, 2023

Thank you for the fix @littlewine. Could you provide more details on how you tested? E.g., you could compute the exact expected size of the embeddings before and after the change.

@littlewine
Contributor Author

littlewine commented Nov 9, 2023

Hi,
as I mention in #22, the index of MSMarco-passage-v1 (8.8M passages) was 1.6TB. This is caused by this issue, which I verified by running the following code (on the inflated embeddings):

>>> import torch, pickle
>>> e = torch.load('embeddings.0.pt')
>>> e.shape
torch.Size([768, 69077])

>>> p = pickle.load(open('passages.0.pt', 'rb'))
>>> len(p)
69077

>>> e.dtype
torch.float16  # 2 bytes per element
>>> e.shape[0] * e.shape[1] * 2
106102272
>>> e.shape[0] * e.shape[1] * 2 / 1024 / 1024
101.18701171875  # ~101MB, while the file size on disk is 13GB
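As a rough cross-check (the 13GB figure is approximate), the inflation factor matches the implied number of shards:

>>> round(13 * 1024 / 101.187)   # observed size / expected size per shard (MB)
132
>>> round(8_800_000 / 69077)     # total passages / passages per shard
127

Both land near the same value, which is consistent with every shard writing the full embedding matrix.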

After applying the fix in my own code, I re-created the embeddings (using --index_mode faiss --faiss_index_type ivfpq --faiss_code_size 16) and verified that the total size is now 15G.

I'm not sure why the tests are failing, but I don't think it's related to my fix.

@mlomeli1
Contributor

mlomeli1 commented Nov 9, 2023

@littlewine thank you! Just to clarify something you asked in #22: the code saves the original embeddings, not the compressed version (hence you don't see a size reduction even though you used --index_mode faiss --faiss_index_type ivfpq --faiss_code_size 16 in your repro).
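(Rough numbers to illustrate, assuming 8.8M passages of dimension 768 as above: the PQ codes at code_size 16 would be tiny next to the raw fp16 embeddings, which is why the raw embeddings dominate the on-disk size.)

>>> round(8_800_000 * 16 / 1024**3, 2)        # IVFPQ codes at 16 bytes/vector (GiB)
0.13
>>> round(8_800_000 * 768 * 2 / 1024**3, 1)   # raw fp16 embeddings (GiB)
12.6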

Also, could you run black to format your code using the line length specified in the README so the lint test passes? Thanks

@littlewine
Contributor Author

@mlomeli1 thanks for the clarification. I still believe GPU memory usage is larger than it should be; I only managed to run retrieval on this index on 4x12G GPUs with the DistributedIndex (which I think shards the embeddings across the 4 devices). This doesn't block me right now, but I still wonder why this is the case; I might take a closer look when I find more time.
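As a back-of-envelope check (assuming the same 8.8M × 768 fp16 embeddings and an even split by the DistributedIndex), the embeddings alone would take roughly:

>>> round(8_800_000 * 768 * 2 / 1024**3 / 4, 1)   # fp16 embeddings split over 4 GPUs (GiB)
3.1

per GPU for the index alone; anything beyond that would have to come from the model weights, activations, and retrieval buffers.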

I'm currently traveling but I'll try to do the formatting soon!

@mlomeli1 mlomeli1 merged commit 23639ea into facebookresearch:main Nov 23, 2023
1 of 2 checks passed
@mlomeli1
Contributor

I've merged this; the linter can be fixed separately.
