Slower-than-expected LMDB import; tuning? #36

Open
bnewbold opened this issue Jun 27, 2019 · 1 comment

When experimenting with an import of the fatcat release metadata corpus (about 97 million records, similar in size/scope to the Crossref corpus with abstracts and references removed), I found the Java LMDB import slower than expected:

    java -jar lookup/build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar fatcat --input /srv/biblio-glutton/datasets/release_export_expanded.json.gz /srv/biblio-glutton/config/biblio-glutton.yaml

    [...]
    -- Meters ----------------------------------------------------------------------
    fatcatLookup
                 count = 1146817
             mean rate = 8529.29 events/second
         1-minute rate = 8165.53 events/second
         5-minute rate = 6900.17 events/second
        15-minute rate = 6358.60 events/second

    [...] RAN OVER NIGHT

    6/26/19 4:32:11 PM =============================================================

    -- Meters ----------------------------------------------------------------------
    fatcatLookup
                 count = 37252787
             mean rate = 1474.81 events/second
         1-minute rate = 1022.73 events/second
         5-minute rate = 1022.72 events/second
        15-minute rate = 1005.36 events/second
    [...]

I cut it off soon after, when the rate dropped further to ~900/second.

This isn't crazy slow (it would finish in another day or two), but for comparison, the node/elasticsearch ingest of the same corpus completed pretty quickly on the same machine:


    [...]
    Loaded 2131000 records in 296.938 s (8547.008547008547 record/s)
    Loaded 2132000 records in 297.213 s (7142.857142857142 record/s)
    Loaded 2133000 records in 297.219 s (3816.793893129771 record/s)
    Loaded 2134000 records in 297.265 s (12987.012987012988 record/s)
    Loaded 2135000 records in 297.364 s (13513.513513513513 record/s)
    [...]
    Loaded 98076000 records in 22536.231 s (9433.962264150943 record/s)
    Loaded 98077000 records in 22536.495 s (9090.90909090909 record/s)

This is a ~30-thread machine with 50 GByte of RAM and a consumer-grade Samsung 2 TByte SSD. I don't seem to have any LMDB libraries installed system-wide, so I guess they are vendored in. In my config I have (truncated to the relevant bits):

storage: /srv/biblio-glutton/data/db
batchSize: 10000
maxAcceptedRequests: -1

server:
  type: custom
  applicationConnectors:
  - type: http
    port: 8080
  adminConnectors:
  - type: http
    port: 8081
  registerDefaultExceptionMappers: false
  maxThreads: 2048
  maxQueuedRequests: 2048
  acceptQueueSize: 2048

I'm wondering what kind of performance others are seeing by the "end" of a full Crossref corpus import, and whether there is other tuning I should do.
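
For reference, here is a minimal sketch of the kind of environment tuning that usually dominates LMDB bulk-write throughput, assuming the lookup path goes through lmdbjava (which I believe is what's vendored in). The map size, flag choices, and database name below are illustrative assumptions for the sketch, not the project's actual settings:

    import org.lmdbjava.Dbi;
    import org.lmdbjava.DbiFlags;
    import org.lmdbjava.Env;
    import org.lmdbjava.EnvFlags;
    import org.lmdbjava.Txn;

    import java.io.File;
    import java.nio.ByteBuffer;
    import static java.nio.charset.StandardCharsets.UTF_8;

    public class LmdbImportTuningSketch {
        public static void main(String[] args) {
            // Directory must already exist; matches the "storage" path from the config above.
            File path = new File("/srv/biblio-glutton/data/db");

            // These flags trade durability for import speed: with MDB_NOSYNC/MDB_WRITEMAP a
            // crash can corrupt the environment, so only use them for a rebuildable bulk load.
            Env<ByteBuffer> env = Env.create()
                    .setMapSize(600L * 1024 * 1024 * 1024) // illustrative; must exceed final DB size
                    .setMaxDbs(8)
                    .open(path,
                            EnvFlags.MDB_WRITEMAP,
                            EnvFlags.MDB_MAPASYNC,
                            EnvFlags.MDB_NOSYNC,
                            EnvFlags.MDB_NORDAHEAD);

            // Hypothetical database name, just for the sketch.
            Dbi<ByteBuffer> db = env.openDbi("fatcat_metadata", DbiFlags.MDB_CREATE);

            // lmdbjava wants direct ByteBuffers for keys and values.
            ByteBuffer key = ByteBuffer.allocateDirect(env.getMaxKeySize());
            ByteBuffer val = ByteBuffer.allocateDirect(4096);
            key.put("10.1000/example".getBytes(UTF_8)).flip();
            val.put("{\"title\": \"...\"}".getBytes(UTF_8)).flip();

            // Write in large transactions (e.g. the configured batchSize of 10000) rather than
            // one transaction per record, so the commit cost is amortized over the whole batch.
            try (Txn<ByteBuffer> txn = env.txnWrite()) {
                db.put(txn, key, val); // repeated for every record in the batch
                txn.commit();
            }

            env.close();
        }
    }

If the keys happen to be written in sorted order, lmdbjava's PutFlags.MDB_APPEND avoids most page splits and tends to keep the insert rate from collapsing as the tree grows, which may be relevant here given that the rate seems to fall off as the database gets larger.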

For my particular use-case (fatcat matching), it is tempting to redirect to an HTTP REST API which can handle at least hundreds of requests/sec at a couple of ms of latency; this would keep the returned data "fresh" without needing a pipeline to rebuild the LMDB snapshots periodically or continuously. That's probably not worth it for most users and most cases. I do think there is a "universal bias" towards the most recently published works, though: most people read and process new papers, and new papers tend to cite recent (or forthcoming) papers, so having the matching corpus even a month or two out of date could be sub-optimal. The same "freshness" issue would exist with elasticsearch anyway, though.

karatekaneen (Contributor) commented Dec 17, 2019

I am currently facing the same issue, although on a less powerful machine.
When I started the import into LMDB a couple of days ago it was doing a couple of thousand records per second, which was going to take a long time, but I'm not in that big of a rush and let it run over the weekend.

Now when I checked on the progress it's doing 25 records per second, so it probably won't finish this side of Christmas.
I'm running everything with default settings in a Docker container on GCP. I ran the import via nohup so it would keep running after I disconnected from the shell; I don't know whether that affects performance somehow.

Also worth mentioning: I faced the same issue when importing the DOI<->PMID mapping dataset, where the rate decreased quite rapidly, but due to the smaller size of the dataset it finished quite quickly anyway.
