
Rust - LMDB Cache back-end #9827

Closed · wants to merge 40 commits

Conversation

@yamadapc (Contributor) commented Jul 1, 2024

  • Adds a Rust LMDB wrapper and a benchmark
  • Sets up bincode for our types (see the sketch after this list)
  • rkyv had been added, but was removed in 71e7fc269881a44bab0d44b054e53ce8beabc477
  • Adds docs/rfcs/parcel-v3/0001_cache_implementation.md, which discusses some cache-related ideas and measurements taken from this branch
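
As a rough illustration of the bincode setup mentioned above, here is a minimal sketch, assuming bincode 1.x with serde derive; the `Request` struct is hypothetical and only stands in for the real types in `crates/parcel_core`:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in for a request type; the real types live in parcel_core.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Request {
    id: u64,
    file_path: String,
    invalidations: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let request = Request {
        id: 1,
        file_path: "src/index.js".into(),
        invalidations: vec!["src/util.js".into()],
    };

    // Serialize with the bincode 1.x serde API and round-trip it back.
    let bytes: Vec<u8> = bincode::serialize(&request)?;
    let decoded: Request = bincode::deserialize(&bytes)?;
    assert_eq!(request, decoded);

    Ok(())
}
```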

- Serializing a single request - **165ns** with `bincode` and 288ns with `rkyv`
- Read a single request and deserialize - **821ns**
- Write 1 request with async writes - **15204ns** / **15us**
- Write 1 request with sync writes - **3000000ns** / **3000us** / **3ms**

Member commented:

That's expensive! That means if you have 100,000 nodes and 500,000 edges in the RequestGraph (not unreasonable, my current benchmark has 87k nodes and 400k edges, and I'm sure Jira has 10x more than that), just writing them alone will take 30 minutes.

Even with async writes, that would be 9 seconds added to the build time just to write request nodes to the cache. That's 11x slower than performing a complete build from scratch with no cache.
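
For reference, a back-of-the-envelope check of these figures, using the hypothetical 100k-node / 500k-edge graph above and the per-write timings from the benchmark:

```rust
fn main() {
    let entries: u64 = 100_000 + 500_000; // nodes + edges
    let sync_write_ns: u64 = 3_000_000;   // ~3 ms per synchronous write
    let async_write_ns: u64 = 15_204;     // ~15 µs per asynchronous write

    let sync_total_s = entries * sync_write_ns / 1_000_000_000;
    let async_total_s = (entries * async_write_ns) as f64 / 1e9;

    println!("sync:  {} s (~{} min)", sync_total_s, sync_total_s / 60); // 1800 s, ~30 min
    println!("async: {:.1} s", async_total_s);                          // ~9.1 s
}
```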

Contributor Author (@yamadapc) replied:

Note we'd write only request nodes, so those estimates are about 6x higher than reality.

It's quite expensive, though. I think there is likely still some tuning that could speed it up a lot, or something my benchmark does incorrectly.

@MonicaOlejniczak (Contributor) commented Jul 8, 2024:

Considering the cache is still unused, let's iterate on the implementation and keep the discussion open over the next few weeks as we get closer to having an E2E build.

@devongovett (Member) commented Jul 1, 2024

I don't think we need a database for this. We don't need any of the features that a database offers, e.g. durability, fault tolerance, transactions, indexes, etc. We don't care about data loss - a cache can simply be rebuilt from scratch if something goes wrong. All of those features have significant overhead and added complexity.

The history here is that we started with a file system based cache in Parcel. We used this cache not only as an actual cache, but as temporary storage within a build to pass data from one phase to another (e.g. transformation to packaging), which may be on different threads. In the JS implementation of Parcel, this was necessary due to the lack of shared memory between threads, as well as v8 heap size limits that we ran into while building large apps. We eventually moved to LMDB because it was faster than the FS cache due to batching, but avoiding disk IO during builds entirely would be much faster still.

In Rust, neither of these is a problem anymore. We can keep as much as we want in memory (even more than the total physical system memory if necessary, since virtual memory is already paged by the OS), and we don't need to write to disk in order to share memory between threads. Avoiding writes to disk until Parcel shuts down (or maybe during idle time) will be much faster; I think your benchmarks show the cost well. This is worth doing because the common case is a dev server that does many builds over its lifetime, none of which need to be written to disk, so we should optimize that dev loop. Not only is this faster, it's also simpler to implement, and it avoids the inconsistent-state problems that could occur when data structures are sharded.
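
To make the deferred-write idea concrete, here is a minimal, hypothetical sketch (not Parcel's actual API): entries live in a plain in-memory map during the build and are only flushed to disk at shutdown or during idle time.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

#[derive(Default)]
struct InMemoryCache {
    entries: HashMap<String, Vec<u8>>,
}

impl InMemoryCache {
    fn set(&mut self, key: String, value: Vec<u8>) {
        self.entries.insert(key, value);
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.entries.get(key).map(Vec::as_slice)
    }

    // One write pass at shutdown (or during idle time) instead of a disk
    // write per request during the build. Keys are assumed to be valid
    // file names in this sketch.
    fn flush_to_disk(&self, dir: &Path) -> io::Result<()> {
        fs::create_dir_all(dir)?;
        for (key, value) in &self.entries {
            fs::write(dir.join(key), value)?;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut cache = InMemoryCache::default();
    cache.set("request-1".into(), b"serialized request".to_vec());
    assert!(cache.get("request-1").is_some());
    cache.flush_to_disk(Path::new(".parcel-cache"))?;
    Ok(())
}
```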

@yamadapc (Contributor, Author) commented Jul 2, 2024

We might care about transactions / consistency for some types of entries, but generally I agree that most of the time we just want to write the transformation outputs (or other intermediary assets) into files. That's also the path to supporting remote caching.

To do that, we'd need individual cache entries for each of these intermediary assets.
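
As a hypothetical sketch of what "individual cache entries" could mean in practice, each intermediary artifact could get its own key derived from the asset path and its content. The helper below is illustrative only; a real implementation would use a stable content hash rather than `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: DefaultHasher is not stable across processes, so a real
// cache would use a stable content hash instead.
fn cache_key(asset_path: &str, content: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    asset_path.hash(&mut hasher);
    content.hash(&mut hasher);
    format!("{}-{:016x}", asset_path.replace('/', "_"), hasher.finish())
}

fn main() {
    let key = cache_key("src/index.js", b"transformed output");
    println!("{key}"); // e.g. src_index.js-<hash> (hash varies per run)
}
```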

),
|b| {
let mut scratch = vec![];
scratch.resize(10000, 0); // pre-alloc

Contributor commented:

Isn't this equivalent to `vec![0; 10000]` on a single line?

@devongovett (Member) commented:

I'm not sure remote caching will be beneficial anymore, tbh. If done per file, the network latency alone will be an order of magnitude slower than the CPU time needed to simply recompile that file from scratch (a couple of milliseconds). Even if you downloaded the entire cache at once up front, it's questionable whether that would be faster than recompiling without a cache, depending on your internet speed.

@yamadapc (Contributor, Author) commented Jul 8, 2024

I don't think that's how it'll work, but that is a good point.

I reckon remote caching would work through downloading large incremental intermediary artifacts that are combined into the output files. That would surely be faster than rebuilding locally.

Currently there's a large amount of complexity related to cache writes, which works around the fact that there's a single cache entry. We'd like to avoid this complexity while also addressing a number of pitfalls it causes.

Regarding cache write times, I think we'll be able to address this better once we have an E2E build. My estimate was quite a bit lower, but since we don't need to wait on these writes, we can also offload any writes we do want to make.

I do agree about storing transformation outputs as files.

@devongovett (Member) commented:

> That would surely be faster than rebuilding locally.

I'm not so sure. You might have to download a lot of data. Say you start with no cache. Just downloading a large cache folder over a typical internet connection might be slower than building from scratch.

> once we have an E2E build

Just pointing out that we have had this for a while on the core-rs2 branch. You're welcome to test stuff there. Not sure why we need to rewrite everything again... 😕

@yamadapc (Contributor, Author) commented:

We've decided to split the code from the markdown RFC on this branch.

Therefore I'll remove all code from this PR, and we can keep it open as a draft while we gather information and discuss how we'd like to implement persistence of "requests" or transformation outputs.

Pending information would be:

  • Performance of builds after the rewrites
  • Combined size of all serialized requests for large builds
  • Combined runtime impact of reading and writing requests individually as opposed to all in one go

It'll be possible to gather this information once E2E builds are working on v2 with the v3 feature flag turned on.

@yamadapc (Contributor, Author) commented:

It will also be useful to clarify the meaning of, and the process around, RFC documents added to this project, which is something we can do soon.
