What is the best way to bulk load redb? #822

Closed

marvin-j97 opened this issue Jul 9, 2024 · 2 comments

Comments

@marvin-j97 (Contributor)

I am benchmarking large data sets in https://github.com/marvin-j97/rust-storage-bench, so I want to load a lot of data very quickly, regardless of durability.

If I use Durability::None, the disk size bloats, which has already caused me to run out of disk space more than once, so I started doing an Immediate flush every so often:

// Note: the keys are written in monotonically increasing order.
// `key`, `value`, and `args` come from the benchmark harness (elided here);
// the boolean argument to `insert` toggles an immediate (durable) flush.
for x in 0..item_count {
    db.insert(
        key,
        value,
        // NOTE: Flush every 100k items to avoid too much disk space building up.
        args.backend == Backend::Redb && x % 100_000 == 0,
    );
}
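For reference, here is a minimal sketch of what that periodic-flush pattern looks like against redb's public API directly, assuming the redb 2.x API current at the time of the issue. The table name, key encoding, and placeholder value are illustrative assumptions, and each insert gets its own transaction, mirroring the per-insert wrapper above:

use redb::{Database, Durability, TableDefinition};

// Hypothetical table definition, for illustration only.
const TABLE: TableDefinition<&[u8], &[u8]> = TableDefinition::new("bench");

fn bulk_load_periodic_flush(db: &Database, item_count: u64) -> Result<(), redb::Error> {
    for x in 0..item_count {
        let key = (x as u128).to_be_bytes(); // 16-byte, monotonically increasing key
        let value = [0u8; 64]; // 64-byte placeholder value

        let mut txn = db.begin_write()?;
        // Durability::None skips the fsync on commit; every 100k items,
        // pay for one Immediate commit so the accumulated state becomes durable.
        txn.set_durability(if x % 100_000 == 0 {
            Durability::Immediate
        } else {
            Durability::None
        });
        {
            let mut table = txn.open_table(TABLE)?;
            table.insert(key.as_slice(), value.as_slice())?;
        }
        txn.commit()?;
    }
    Ok(())
}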

With the above code, writing 100M 16-byte keys and 64-byte values takes 130 minutes, about 78 µs per insert. That is slower than the fsync latency of my SSD (PM9A3), so there appears to be no advantage to writing with Durability::None. Is there any point in using None at all?

Additionally, for this comparatively small data set (~8 GB of user data), redb has written 4.4 TB (write amplification ≈ 540), with the resulting .redb file being ~28 GB.

As a comparison:

  • sled 0.x takes 5 minutes and uses comparable disk space
  • fjall takes 3 minutes and uses 8 GB (to be expected, because LSM)

What is the best way to write a lot of KVs without bloating disk space, while keeping inserts somewhat fast?

@cberner (Owner) commented Jul 10, 2024

It's best to insert them all in a single transaction, and then the durability won't matter. Here's an example:

// Note: `write_transaction` and `get_inserter` appear to be wrappers from
// redb's benchmark harness rather than the core API.
let mut txn = db.write_transaction();
let mut inserter = txn.get_inserter();
for _ in 0..ELEMENTS {
    let (key, value) = gen_pair(&mut rng);
    inserter.insert(&key, &value).unwrap();
}
drop(inserter);
txn.commit().unwrap();

If you're able to insert them in sorted order, that might improve write amplification. Alternatively, you can adjust the cache size if you have enough RAM.
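Putting those suggestions together, here is a minimal self-contained sketch of a single-transaction bulk load using redb's public API, including the cache-size knob mentioned above. The table name, cache size, and key/value encoding are illustrative assumptions, not values from this thread:

use redb::{Builder, TableDefinition};

// Hypothetical table definition, for illustration only.
const TABLE: TableDefinition<&[u8], &[u8]> = TableDefinition::new("bench");

fn bulk_load_single_txn() -> Result<(), redb::Error> {
    // A larger cache lets more dirty pages stay in memory before being
    // written back, which can reduce write amplification.
    let db = Builder::new()
        .set_cache_size(4 * 1024 * 1024 * 1024) // 4 GiB, assuming enough RAM
        .create("bulk.redb")?;

    let txn = db.begin_write()?;
    {
        let mut table = txn.open_table(TABLE)?;
        for x in 0..100_000_000u64 {
            let key = (x as u128).to_be_bytes(); // sorted order helps write amplification
            table.insert(key.as_slice(), [0u8; 64].as_slice())?;
        }
    }
    // One commit at the end; the durability of intermediate state never matters.
    txn.commit()?;
    Ok(())
}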

@marvin-j97 (Contributor, Author)

That works better: down to ~2.74 µs per item.
