
core, eth, trie: write nodebuffer asynchronously to disk #28471

Open · joeylichang wants to merge 2 commits into master from async_nodebuffer
Conversation

joeylichang

Pathdb writes trie nodes to disk in one batch once the nodebuffer is full. This write blocks block execution, which shows up as performance glitches in monitoring. This PR proposes asyncnodebuffer to address that glitch: an optimization of nodebuffer that writes to disk asynchronously. It consists of two nodebuffers. When the current (front) nodebuffer is full, it becomes immutable and is moved to the background to be written to disk, while a fresh front nodebuffer takes over for subsequent block processing. The immutable nodebuffer of course remains readable.
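A minimal sketch of the double-buffer scheme described above (hypothetical, simplified names, not the actual code in this PR):

package pathdb

import "sync"

// nodeBuffer is a simplified stand-in for the real trie-node buffer: it only
// aggregates key/value writes and tracks their total size.
type nodeBuffer struct {
	nodes map[string][]byte
	size  int
}

func newNodeBuffer() *nodeBuffer {
	return &nodeBuffer{nodes: make(map[string][]byte)}
}

// asyncBuffer holds two nodeBuffers: current accepts new writes, background is
// the frozen buffer being flushed to disk without blocking the caller.
type asyncBuffer struct {
	mu         sync.RWMutex
	current    *nodeBuffer
	background *nodeBuffer
	flushing   bool
	limit      int
	flushFn    func(map[string][]byte) // e.g. a batched database write
}

func (a *asyncBuffer) commit(key string, val []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()

	a.current.nodes[key] = append([]byte(nil), val...)
	a.current.size += len(key) + len(val)

	if a.current.size >= a.limit && !a.flushing {
		// Freeze the full buffer, rotate it to the background and flush it
		// asynchronously; block execution keeps writing to a fresh buffer.
		a.background, a.current = a.current, newNodeBuffer()
		a.flushing = true
		frozen := a.background
		go func() {
			a.flushFn(frozen.nodes) // the blocking disk write is off the hot path
			a.mu.Lock()
			a.flushing = false
			a.mu.Unlock()
		}()
	}
}

// node reads the mutable buffer first, then the frozen one, which stays
// readable while it is being written out.
func (a *asyncBuffer) node(key string) ([]byte, bool) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	if v, ok := a.current.nodes[key]; ok {
		return v, true
	}
	if a.background != nil {
		v, ok := a.background.nodes[key]
		return v, ok
	}
	return nil, false
}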

@joeylichang
Author

@rjl493456442 Can I invite you to help evaluate this PR?

@joeylichang joeylichang force-pushed the async_nodebuffer branch 2 times, most recently from e9d0659 to 288b3c9 Compare November 7, 2023 03:39
@rjl493456442
Member

I once had a similar idea, but I abandoned it due to certain limitations in LevelDB. Specifically, the following write operations will remain blocked even if we flush the buffer in the background. Nevertheless, it's certainly worth considering this approach within the context of Pebble.

As you can see, it involves a non-trivial complexity. Do you have any performance data before we dive into details?

@joeylichang
Author

joeylichang commented Nov 7, 2023

I once had a similar idea, but I abandoned it due to certain limitations in LevelDB. Specifically, the following write operations will remain blocked even if we flush the buffer in the background. Nevertheless, it's certainly worth considering this approach within the context of Pebble.

As you can see, it involves a non-trivial complexity. Do you have any performance data before we dive into details?

Yes, writes that hit LevelDB's limits will still block. But since the asynchronous flush no longer blocks the main process, subsequent block processing (chasing) is still much smoother thanks to cache hits, so the approach is still recommended.

Intel x86 EC2 instance: 16 cores, 64 GB RAM, 3000 IOPS, 128 MB/s.
~1200 tx/block; flushing the 256 MB nodebuffer causes glitches of ~30 s during LevelDB compaction.

@joeylichang
Author


@rjl493456442 Do you think this PR is necessary?

@rjl493456442
Member

@joeylichang I will try to do some benchmarks first.

@rjl493456442
Member

Deployed on our benchmark machine, will post the result once it's available.

@rjl493456442
Member

rjl493456442 commented Nov 19, 2023

The first wave of benchmark data.

[screenshot: benchmark metrics, 2023-11-19]

After running for a few hours, it turns out this pull request is slightly faster than master. (Note: I rebased your pull request against the latest master branch before deployment.)

07 is the pull request
08 is master

[screenshot: benchmark metrics, 2023-11-19]

The worst-case execution time drops from 3 s to 1 s.


From the logs I can tell that, during flushing, overall execution performance is still affected, even with the write moved to the background. We can see mgasps drop on both machines when flushing.

[screenshots: mgasps during flushing on both machines, 2023-11-19]

Let's keep them running for a few more days then.

@joeylichang
Author

Let's keep them running for a few more days then.

cool~, looking forward to the result.

Regarding "execution performance is still affected": according to our observations it is mainly due to disk bandwidth, which blocks transaction execution.
We have run a comparative test of Pebble vs. LevelDB, and Pebble performs better.

@fynnss

fynnss commented Nov 20, 2023

From the logs I can tell that, during flushing, overall execution performance is still affected, even with the write moved to the background. We can see mgasps drop on both machines when flushing.

@rjl493456442 Can you also show the disk read/write metrics here? Maybe disk bandwidth is the bottleneck.

@rjl493456442
Member

After running for a few more days, this pull request is consistently faster than master.

[screenshots: benchmark metrics, 2023-11-20]

Given that accumulating 256 MB of state in memory takes roughly 1.5 min in full sync, the theoretical speedup is ~1% (average_buffer_flush_time / 1.5 min; e.g. an average flush of ~1 s gives 1 s / 90 s ≈ 1%). So the speedup is understandably not that obvious in the metrics.

Contributor

@holiman holiman left a comment


Can't really say much about the overall idea at this point, I don't fully understand it yet. Some things look a bit flaky to me.

@@ -151,16 +151,16 @@ func (db *Database) Commit(root common.Hash, report bool) error {
// Size returns the storage size of diff layer nodes above the persistent disk
// layer, the dirty nodes buffered within the disk layer, and the size of cached
// preimages.
-func (db *Database) Size() (common.StorageSize, common.StorageSize, common.StorageSize) {
+func (db *Database) Size() (common.StorageSize, common.StorageSize, common.StorageSize, common.StorageSize) {
Contributor


Please update the method docs. With four return values, might be time to split it into a bullet-list, e.g. something like

// Size returns the sizes of: 
// - the diff layer nodes above the persistent disk layer, 
// - the dirty nodes buffered within the disk layer, 
// - the immutable nodes in the disk layer, 
// - the cached preimages

Author


added the docs.

@@ -294,14 +337,15 @@ func (dl *diskLayer) setBufferSize(size int) error {
}

// size returns the approximate size of cached nodes in the disk layer.
-func (dl *diskLayer) size() common.StorageSize {
+func (dl *diskLayer) size() (common.StorageSize, common.StorageSize) {
Contributor


update method doc

Author


added the docs.

// empty returns an indicator if trienodebuffer contains any state transition inside.
empty() bool

// getSize return the trienodebuffer used size.
Contributor


It returns two numbers, please document what they mean, respectively
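For example, something along these lines (my reading is that the two values are the current and the background buffer sizes):

// getSize returns the size of the mutable (front) node buffer and the size
// of the immutable (background, being-flushed) node buffer, respectively.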

Author


added the docs.

// asyncnodebuffer implement trienodebuffer interface, and aysnc the nodecache
// to disk.
type asyncnodebuffer struct {
mux sync.RWMutex
Contributor


We typically name mutexes mu rather than mux. Not saying it's better, but please let's stick to the convention :)

Contributor


Please also document what the mutex covers. In this case, I assume it is used as a concurrency-guard for operations on both current and background?

Author


Yes, changed to mu, and added docs for 'asyncnodebuffer' and the 'flush' method.

Comment on lines 154 to 158
if atomic.LoadUint64(&a.background.immutable) == 1 {
return nil
}

atomic.StoreUint64(&a.current.immutable, 1)
Contributor


This does not look concurrency-safe: first you check, then you store (blindly). What you probably want here is CompareAndSwap, and why not make immutable an atomic.Bool rather than using atomic.LoadUint64?
It would become something like

if !a.background.immutable.CompareAndSwap(false, true) {
    return nil
}

Author


atomic.Bool is better. CompareAndSwap is not appropriate though: it would set the background buffer's flag, but that buffer is about to be switched to become 'current', where immutable must stay false.
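Roughly, with atomic.Bool the switch would look like this (simplified sketch, hypothetical method and helper names, not the exact code in this PR):

// switchBuffers freezes the current buffer and rotates it to the background
// for an asynchronous flush. Only the background flag is checked and only the
// current buffer is frozen, so a CompareAndSwap on background would wrongly
// mark as immutable the very buffer that is about to become mutable again.
func (a *asyncnodebuffer) switchBuffers() error {
	a.mu.Lock()
	defer a.mu.Unlock()

	if a.background.immutable.Load() {
		// The previous flush is still running; keep writing to current.
		return nil
	}
	a.current.immutable.Store(true) // freeze the front buffer
	a.current, a.background = a.background, a.current

	// Flush the frozen buffer in the background; the flush routine resets its
	// immutable flag to false once the write is done (hypothetical helper).
	go a.flushBackground()
	return nil
}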

size uint64 // The size of aggregated writes
limit uint64 // The maximum memory allowance in bytes
nodes map[common.Hash]map[string]*trienode.Node // The dirty node set, mapped by owner and path
immutable uint64 // The flag equal 1, flush nodes to disk background
Contributor


Use a bool or even better atomic.Bool

Contributor


"The flag equal 1, flush nodes to disk background" -- please expand a bit on what the meaning of this field is.
For example (and I might be incorrect since I've still not fully understood this PR)

"If this is set to true, then some other thread is performing a flush in the background, and thus the ..."

or

"If this is set to true, then this nodebuffer is immutable and any write-operations to it will exit with error"

Author


fixed

Comment on lines 253 to 255
if atomic.LoadUint64(&b.immutable) == 1 {
return errWriteImmutable
}
Contributor


I'm not sure this is sufficient. What if someone is performing a commit, so is in the for-loop below, and some other thread then sets immutable to true? This seems not thread-safe to me, but maybe I don't fully understand what the concurrency looks like.
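To illustrate the concern (simplified, hypothetical sketch of the shape of the code, not the PR's actual commit method):

// The flag is only checked once up front; nothing stops another goroutine from
// flipping it to 1 while the loop below is still inserting nodes, so nodes can
// be written into a buffer that is already considered immutable.
func (b *nodebuffer) commit(nodes map[string][]byte) error {
	if atomic.LoadUint64(&b.immutable) == 1 {
		return errWriteImmutable
	}
	for path, blob := range nodes { // immutable may become 1 mid-loop
		b.nodes[path] = blob
	}
	return nil
}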

@joeylichang
Author

@rjl493456442 Any further results?
