
Investigate performance as database grows #1425

Closed
punktilious opened this issue Aug 18, 2020 · 9 comments
@punktilious
Collaborator

Is your feature request related to a problem? Please describe.
Load large volumes of data and track how ingestion performance varies over time.

Assess how access and search performance varies.

Understand scaling, such as memory and disk requirements for a large volume of data.

Use the new fhir-bucket project to load resources stored in COS.

@punktilious punktilious added the performance performance label Aug 18, 2020
@punktilious punktilious self-assigned this Aug 18, 2020
@chgl
Contributor

chgl commented Aug 19, 2020

A few months back I ran a quick test to measure the throughput of the IBM FHIR server over time. Big disclaimer: these results should be taken with a grain of salt since they were mostly used as a quick benchmark, but here's the throughput in FHIR resources/s for a total of 6 million resources:

[chart: resource throughput, IBM FHIR server on PostgreSQL 12.3]

The resources were sent to the server as update-as-create Bundles, each containing between 10 and 100 resources (each bundle contained one Encounter and some associated Conditions, Procedures, and Observations). The left side shows the throughput using Postgres 12.3 with default options; the right, with the attached, optimized postgres.conf.txt active. DB, server, and client were run on the same Ubuntu 18.04 VM with 16GB of RAM and 8 vCPUs. If I recall correctly, the CPUs weren't entirely saturated. The containers were reset between the runs.
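To make the "update-as-create" pattern concrete, here is a minimal sketch of how such a transaction Bundle is shaped: each entry carries a client-assigned id and a PUT request, so the server creates the resource if it doesn't exist yet. The resource ids, statuses, and helper name here are hypothetical, not taken from chgl's actual job.

```python
import json

def build_update_as_create_bundle(encounter_id, condition_ids):
    """Build a FHIR transaction Bundle whose entries use PUT with
    client-assigned ids (update-as-create). Ids here are illustrative."""
    entries = [{
        "resource": {"resourceType": "Encounter",
                     "id": encounter_id,
                     "status": "finished"},
        "request": {"method": "PUT", "url": f"Encounter/{encounter_id}"},
    }]
    for cid in condition_ids:
        entries.append({
            "resource": {"resourceType": "Condition",
                         "id": cid,
                         "encounter": {"reference": f"Encounter/{encounter_id}"}},
            "request": {"method": "PUT", "url": f"Condition/{cid}"},
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}

bundle = build_update_as_create_bundle("enc-1", ["cond-1", "cond-2"])
print(json.dumps(bundle, indent=2)[:80])
```

In a real client, the resulting JSON would be POSTed to the server's base endpoint; the snippet only shows the Bundle shape.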

The sender/client is a Spring Boot Batch app that reads data from a DB, maps it to FHIR Bundles, and sends them via REST using 8 threads in parallel. When sending the resources to S3 (minio) instead, the throughput was around 10,000 resources/s.

I ran the same tests against a HAPI FHIR server (v4, R4) and averaged 800 resources/s, but it remained more consistent over time.

Definitely interested in some sort of "FHIRBench" which can benchmark servers in various scenarios:

  • single-resource posts
  • multi-resource bundle posts
  • conditional creates
  • conditional updates
  • ... the same for reading
  • bulk im/export
  • horizontal scalability (do YugaByte/CockroachDB help)

(each one multi-threaded, at various fill-stages of the DB, etc.)
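The core of any of those "FHIRBench" scenarios is the same multi-threaded throughput measurement; a toy skeleton might look like the following, where `send_bundle` is a stand-in for a real HTTP client call (everything here is a hypothetical sketch, not an existing tool):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(send_bundle, n_bundles, resources_per_bundle, threads=8):
    """Submit n_bundles via a pool of worker threads and return the
    observed throughput in resources/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each task sends one bundle; list() forces all futures to complete.
        list(pool.map(lambda _: send_bundle(resources_per_bundle),
                      range(n_bundles)))
    elapsed = time.perf_counter() - start
    return (n_bundles * resources_per_bundle) / elapsed

# Example run with a no-op sender (swap in a real REST call to benchmark):
rate = run_benchmark(lambda n: None, n_bundles=100, resources_per_bundle=50)
print(f"{rate:.0f} resources/s")
```

The different scenarios (single-resource posts, conditional creates, etc.) would only vary what `send_bundle` does and at what DB fill level it runs.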

jvm.options.txt
postgres.conf.txt
docker-compose.yml.txt

@punktilious
Collaborator Author

@chgl thank you for this - really useful. There's a new project called fhir-bucket (not yet merged to master) which periodically scans S3 buckets looking for new files (NDJSON and JSON) to load, and is designed to do so with a lot of parallelism (loading activity is coordinated across several instances). This covers bulk-load type scenarios. The FHIRBench scenarios you described are also being discussed, and contributions are welcome if you're able to help.

@punktilious
Collaborator Author

punktilious commented Aug 19, 2020

@chgl also, we recently introduced a change to our logical id generation to force right-hand index inserts. This change is included in our 4.3.3 release. This particularly benefits PostgreSQL data sources because it reduces the number of blocks being modified and therefore the number of full page writes being logged, which otherwise gets worse as the database grows.
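As an illustration of the idea behind that change (this is not the IBM FHIR server's actual algorithm): if new logical ids sort after all existing ones, B-tree index inserts always land on the rightmost leaf page, so far fewer pages are dirtied per insert than with random UUIDs. A minimal sketch using a timestamp prefix:

```python
import time
import uuid

def time_ordered_id():
    """Generate an id whose fixed-width hex timestamp prefix makes later
    ids sort lexicographically after earlier ones (right-hand inserts).
    Illustrative only; the server's real scheme may differ."""
    millis = int(time.time() * 1000)
    return f"{millis:013x}-{uuid.uuid4().hex[:12]}"

a = time_ordered_id()
time.sleep(0.002)          # ensure the millisecond prefix advances
b = time_ordered_id()
assert a < b               # later id sorts after the earlier one
```

With random UUIDs, by contrast, each insert lands on an arbitrary leaf page, which on PostgreSQL inflates the number of full page writes in the WAL as the index grows.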

@lmsurpre
Member

lmsurpre commented Aug 19, 2020

It's probably also worth noting that we've made a lot of performance updates since 4.3.0 as we started testing in earnest. I'd be really curious to see how 4.3.3 compares to what you saw on 4.3.0.
@punktilious can go into the details, but here are some highlights from the release notes:

A couple of those were db2-specific, but most will benefit PostgreSQL as well.

@lmsurpre
Member

Oops, I should have refreshed before submitting that :-)

@chgl
Contributor

chgl commented Aug 19, 2020

Thanks for the information! I've seen the performance improvements in the release notes and am curious as well :) - I will hopefully find some time in the next two weeks.

@chgl
Contributor

chgl commented Aug 28, 2020

An update, running on v4.3.3: I've experimented with the memory limits a bit more, because I believe the drops in throughput are caused by swapping, GC, or both - note that these runs were executed with just 8GB of RAM. It's still not fully optimized, but I made another interesting discovery: the job I use supports sending either transactions with conditional updates or update-as-create bundles.

[chart: transactions (RAM-limited) vs. update-as-create throughput]

The left shows the throughput when using transactions, the right when using update-as-create. I assumed update-as-create would generally achieve higher throughput since it requires less processing. Do you have any insights/recommendations here? (quick edit: one factor might actually be #1362, which doesn't apply when setting custom resource ids)
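For readers unfamiliar with the distinction: the two bundle styles differ only in each entry's `request` element. A conditional update PUTs against a search expression, so the server must resolve a search per entry before deciding between create and update, while update-as-create PUTs directly to a client-assigned id. A sketch of the two request shapes (the identifier system and values are hypothetical):

```python
# update-as-create: direct PUT to a client-assigned id; the server
# creates the resource if that id does not exist yet.
update_as_create = {"method": "PUT", "url": "Patient/pat-1"}

# conditional update: PUT against a search URL; the server first runs
# the search, then updates the match or creates a new resource.
conditional = {"method": "PUT",
               "url": "Patient?identifier=https://example.org/mrn|12345"}
```

The extra per-entry search is one plausible reason the conditional-update runs could behave differently from the update-as-create runs in the charts above.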

CPU and RAM stats:

[chart: CPU and RAM usage during the runs]

@punktilious
Collaborator Author

@chgl I'm actively looking into performance right now, so hopefully I'll get a chance to take a look at what you're showing here. Are you running with OpenJDK? JDK Mission Control is pretty useful for finding GC pauses. Add something like this to the jvm.options file in your WLPHOME/usr/servers/fhir-server directory:

-XX:StartFlightRecording=disk=true,dumponexit=true,filename=run1.jfr,settings=profile,path-to-gc-roots=true
-XX:FlightRecorderOptions=stackdepth=96

Then you can load the run1.jfr file with JDK Mission Control.

My personal suspicion is that the dips could be related to database checkpoints. If you have the database files and the WAL on different mount points, then it's easy to tell from iostat whether a surge in writes to the database files coincides with a drop in writes to the WAL. I'll cook up some queries to try to glean more info from the db stats tables.
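That iostat heuristic can be sketched in a few lines: given per-interval write rates (e.g. MB/s) for the data-file mount and the WAL mount, flag intervals where data-file writes surge while WAL writes drop. The thresholds and sample numbers below are illustrative, not measured values from this thread.

```python
def checkpoint_like_intervals(data_writes, wal_writes, surge=2.0, drop=0.5):
    """Return the indexes of sampling intervals that look checkpoint-like:
    data-file write rate well above its average while the WAL write rate
    falls well below its average. Thresholds are arbitrary defaults."""
    data_avg = sum(data_writes) / len(data_writes)
    wal_avg = sum(wal_writes) / len(wal_writes)
    return [i for i, (d, w) in enumerate(zip(data_writes, wal_writes))
            if d > surge * data_avg and w < drop * wal_avg]

data = [10, 12, 11, 95, 90, 12, 10]   # data-file MB/s: spike at samples 3-4
wal  = [40, 42, 41,  8, 10, 43, 40]   # WAL MB/s: dips at the same samples
print(checkpoint_like_intervals(data, wal))  # → [3, 4]
```

If the flagged intervals line up with the throughput dips in the charts, checkpoints become a strong suspect; PostgreSQL's checkpoint-related settings (e.g. checkpoint spacing and completion target) would then be the knobs to examine.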

@lmsurpre lmsurpre added this to the Sprint 17 milestone Sep 15, 2020
punktilious added a commit that referenced this issue Sep 15, 2020
Signed-off-by: Robin Arnold <robin.arnold23@ibm.com>
@punktilious
Collaborator Author

With over 70 million resources resident in the target database (PostgreSQL 12), a bulk load of resources using the fhir-bucket project was able to ingest over 700 resources/second using a concurrency of 100 threads processing bundles containing up to 100 resources each. We are currently planning a refactor of the part of the schema used to store references and this should improve scaling by reducing the size of our indexes significantly.
