
fix large index size #23

Merged
merged 1 commit into facebookresearch:main on Nov 23, 2023
Conversation

littlewine
Contributor

Fix bug #22 (and #18 partially): passage embeddings were saved to disk multiple times (× num_shards)
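For context, the failure mode looks roughly like this (a minimal sketch, not the actual repo code; the function names and the chunking dimension are assumptions):

import torch

# Buggy pattern (illustrative): every shard writes the FULL embedding
# matrix, so the total on-disk size is multiplied by num_shards.
def save_embeddings_buggy(embeddings: torch.Tensor, num_shards: int) -> None:
    for shard_id in range(num_shards):
        torch.save(embeddings, f"embeddings.{shard_id}.pt")  # whole tensor, every time

# Fixed pattern: each shard writes only its own slice.
def save_embeddings_fixed(embeddings: torch.Tensor, num_shards: int) -> None:
    # dim=1 indexes passages here, matching the [dim, num_passages] layout seen below
    for shard_id, shard in enumerate(torch.chunk(embeddings, num_shards, dim=1)):
        torch.save(shard, f"embeddings.{shard_id}.pt")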

@facebook-github-bot added the CLA Signed label on Nov 2, 2023
@mlomeli1
Contributor

mlomeli1 commented Nov 6, 2023

Thank you for the fix @littlewine. Could you provide more details on how you tested? E.g., you could compute the exact expected size of the embeddings before and after the change.

@littlewine
Contributor Author

littlewine commented Nov 9, 2023

Hi,
as I mention in #22, the index of MSMarco-passage-v1 (8.8M passages) was 1.6TB. This is caused by this issue, which I verified by running the following code (on the inflated embeddings):

>>> import torch, pickle
>>> e = torch.load('embeddings.0.pt')
>>> e.shape
torch.Size([768, 69077])

>>> p = pickle.load(open('passages.0.pt', 'rb'))
>>> len(p)
69077

>>> e.dtype
torch.float16  # 2 bytes per element
>>> e.shape[0] * e.shape[1] * 2
106102272
>>> e.shape[0] * e.shape[1] * 2 / 1024 / 1024
101.18701171875  # ~101MB, while the file size on disk is 13GB
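As a rough cross-check (the 13GB figure is approximate), the inflation factor matches the implied number of shards:

>>> round(13 * 1024 / 101.187)   # observed size / expected size per shard (MB)
132
>>> round(8_800_000 / 69077)     # total passages / passages per shard
127

Both land near the same value, which is consistent with every shard writing the full embedding matrix.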

After applying the fix in my own code, I re-created the embeddings (using --index_mode faiss --faiss_index_type ivfpq --faiss_code_size 16) and verified that the total size is now 15G.

I'm not sure why the tests are failing, but I don't think it's related to my fix.

@mlomeli1
Contributor

mlomeli1 commented Nov 9, 2023

@littlewine thank you! Just to clarify something you asked in #22: the code saves the original embeddings, not the compressed version (hence you don't see a size reduction even though you used --index_mode faiss --faiss_index_type ivfpq --faiss_code_size 16 in your repro).
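(Rough numbers to illustrate, assuming 8.8M passages of dimension 768 as above: the PQ codes at code_size 16 would be tiny next to the raw fp16 embeddings, which is why the raw embeddings dominate the on-disk size.)

>>> round(8_800_000 * 16 / 1024**3, 2)        # IVFPQ codes at 16 bytes/vector (GiB)
0.13
>>> round(8_800_000 * 768 * 2 / 1024**3, 1)   # raw fp16 embeddings (GiB)
12.6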

Also, could you run black to format your code using the line length specified in the README so the lint test passes? Thanks

@littlewine
Contributor Author

@mlomeli1 thanks for the clarification. I still believe GPU memory usage is larger than it should be; I only managed to run retrieval on this index on 4x12G GPUs with the DistributedIndex (which I think shards the embeddings across the 4 devices). This doesn't block me right now, but I still wonder why this is the case; I might take a closer look when I find more time.
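As a back-of-envelope check (assuming the same 8.8M × 768 fp16 embeddings and an even split by the DistributedIndex), the embeddings alone would take roughly:

>>> round(8_800_000 * 768 * 2 / 1024**3 / 4, 1)   # fp16 embeddings split over 4 GPUs (GiB)
3.1

per GPU for the index alone; anything beyond that would have to come from the model weights, activations, and retrieval buffers.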

I'm currently traveling but I'll try to do the formatting soon!

@mlomeli1 mlomeli1 merged commit 23639ea into facebookresearch:main Nov 23, 2023
1 of 2 checks passed
@mlomeli1
Contributor

I've merged this; the linter can be fixed separately.
