create_kb step using an unfiltered dump runs out of memory #32
Hi @c-koster! Thanks for bringing that up. We indeed ran this on a machine with a lot of memory (120 GB), so it's possible we overlooked memory bottlenecks. Can you post the complete stack trace? One way we could fix this would be to turn these two calls into generators.
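A minimal sketch of what such a generator refactor could look like, assuming a hypothetical `entities` table with a `description` column and using `en_core_web_lg` purely as an example model — the actual wikid schema and pipeline differ in the details:

```python
import sqlite3
from typing import Iterator

import spacy


def iter_entity_descriptions(db_path: str, batch_size: int = 10_000) -> Iterator[str]:
    """Stream entity descriptions from the SQLite dump instead of loading them all at once."""
    # The table and column names here are assumptions, not the actual wikid schema.
    conn = sqlite3.connect(db_path)
    cursor = conn.execute("SELECT description FROM entities")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from (row[0] for row in rows)
    conn.close()


def iter_description_vectors(db_path: str, model: str = "en_core_web_lg") -> Iterator:
    """Yield description vectors one at a time; nlp.pipe() batches lazily under the hood."""
    nlp = spacy.load(model)
    for doc in nlp.pipe(iter_entity_descriptions(db_path), batch_size=256):
        yield doc.vector
```

With both stages streaming, downstream code can consume vectors without ever holding every description or vector in memory at once.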
Hi, @rmitsch! I re-ran the script with
I'm currently looking into limiting the results based on link count. A threshold of 20 links (which I got from this spaCy video on entity linking) yields about 1.8M entities; a rough sketch of that kind of query is included below.

Also, re: making the
Thanks for your help!
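For reference, a rough sketch of the kind of link-count filter described above, assuming a hypothetical `links` table keyed by `entity_id`; the real `wiki.sqlite3` schema may look quite different:

```python
import sqlite3

# A magic number taken from the discussion above, not a recommendation.
MIN_LINKS = 20

# Hypothetical sketch: keep only entities referenced by at least MIN_LINKS links.
conn = sqlite3.connect("en/wiki.sqlite3")
cursor = conn.execute(
    """
    SELECT entity_id
    FROM links
    GROUP BY entity_id
    HAVING COUNT(*) >= ?
    """,
    (MIN_LINKS,),
)
frequent_entities = {row[0] for row in cursor}
conn.close()
print(f"{len(frequent_entities)} entities meet the link threshold")
```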
Yes, these are both good options. Ideally we'd get this working without needing a link-count threshold, as that's a magic number that might be hard for users to set properly. Anyway, this will require a deeper look into the memory bottlenecks in
Hello @rmitsch! I did some investigating of the `create_kb` step. First, here are some lines from a profile of the step:
Something I notice is that, aside from the `spacy.load` step (which will be constant for the unfiltered dumps), the

I also did a time-based analysis of the full (unfiltered) step, but it made a lot less sense: the memory used spikes to about 20 GB and then falls off. I suspect this is related to a very large group-by query being run locally, but I haven't tested this thoroughly. Some questions:
Thanks!
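For anyone reproducing this kind of investigation, a minimal sketch of allocation profiling with the standard library's tracemalloc; the wrapped step is a placeholder callable, not a real wikid entry point:

```python
import tracemalloc
from typing import Callable


def profile_step(step: Callable[[], None], top: int = 10) -> None:
    """Run `step` under tracemalloc and print the largest allocation sites."""
    tracemalloc.start()
    step()  # placeholder: e.g. a small wrapper that invokes the create_kb step
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:top]:
        print(stat)
    current, peak = tracemalloc.get_traced_memory()
    print(f"current={current / 1e9:.2f} GB, peak={peak / 1e9:.2f} GB")
    tracemalloc.stop()
```

Note that tracemalloc only tracks allocations made through Python's allocator, so memory held inside native code (e.g. SQLite's own caches) won't show up here.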
Hello!
I am working to create a knowledge base using the latest (unfiltered) English wiki dumps. I've successfully followed the steps in benchmarks/nel up to `wikid_parse` to make a 20 GB `en/wiki.sqlite3` file. However, when I run the next step, `wikid_create_kb`, my machine runs out of memory in two places:

1. retrieving entities here, which I think I resolved by modifying `PRAGMA mmap_size` on the SQLite database (sketched below), and
2. computing description vectors for all the entities here.

What kind of machine did y'all get this to work on? My estimate says that 16 GB of memory should be fine, but this step quickly crashes my computer.
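A sketch of how that pragma can be set from Python, for anyone following along; the 8 GiB value is an arbitrary example, not the setting wikid uses, and whether raising or lowering it helps will depend on the machine:

```python
import sqlite3

# Hypothetical sketch: adjust SQLite's memory-map limit. The value is in bytes;
# 8 GiB here is only an example.
conn = sqlite3.connect("en/wiki.sqlite3")
conn.execute("PRAGMA mmap_size = 8589934592;")
conn.close()
```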
Here is a spacy info dump:
Thank you!