
core, eth, trie: write nodebuffer asynchronously to disk #28471

Open · joeylichang wants to merge 2 commits into master from async_nodebuffer
Conversation

joeylichang

Pathdb writes trie nodes to disk in one batch once the nodebuffer is full. This write blocks block execution, which shows up as performance glitches in monitoring. This PR proposes asyncnodebuffer to address that glitch: an optimization of nodebuffer that writes to disk asynchronously. It consists of two nodebuffers. When the current (front) nodebuffer is full, it becomes immutable and is moved to the background to be written to disk, while a fresh front nodebuffer takes over for subsequent block processing. The immutable nodebuffer of course remains readable.
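A minimal sketch of the double-buffer scheme described above (hypothetical, simplified names, not the actual code in this PR):

package pathdb

import "sync"

// nodeBuffer is a simplified stand-in for the real trie-node buffer: it only
// aggregates key/value writes and tracks their total size.
type nodeBuffer struct {
	nodes map[string][]byte
	size  int
}

func newNodeBuffer() *nodeBuffer {
	return &nodeBuffer{nodes: make(map[string][]byte)}
}

// asyncBuffer holds two nodeBuffers: current accepts new writes, background is
// the frozen buffer being flushed to disk without blocking the caller.
type asyncBuffer struct {
	mu         sync.RWMutex
	current    *nodeBuffer
	background *nodeBuffer
	flushing   bool
	limit      int
	flushFn    func(map[string][]byte) // e.g. a batched database write
}

func (a *asyncBuffer) commit(key string, val []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()

	a.current.nodes[key] = append([]byte(nil), val...)
	a.current.size += len(key) + len(val)

	if a.current.size >= a.limit && !a.flushing {
		// Freeze the full buffer, rotate it to the background and flush it
		// asynchronously; block execution keeps writing to a fresh buffer.
		a.background, a.current = a.current, newNodeBuffer()
		a.flushing = true
		frozen := a.background
		go func() {
			a.flushFn(frozen.nodes) // the blocking disk write is off the hot path
			a.mu.Lock()
			a.flushing = false
			a.mu.Unlock()
		}()
	}
}

// node reads the mutable buffer first, then the frozen one, which stays
// readable while it is being written out.
func (a *asyncBuffer) node(key string) ([]byte, bool) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	if v, ok := a.current.nodes[key]; ok {
		return v, true
	}
	if a.background != nil {
		v, ok := a.background.nodes[key]
		return v, ok
	}
	return nil, false
}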

@joeylichang
Author

@rjl493456442 Can I invite you to help evaluate this PR?

@joeylichang joeylichang force-pushed the async_nodebuffer branch 2 times, most recently from e9d0659 to 288b3c9 Compare November 7, 2023 03:39
@rjl493456442
Member

I once had a similar idea, but I abandoned it due to certain limitations in LevelDB. Specifically, the following write operations will remain blocked even if we flush the buffer in the background. Nevertheless, it's certainly worth considering this approach within the context of Pebble.

As you can see, it involves a non-trivial complexity. Do you have any performance data before we dive into details?

@joeylichang
Author

joeylichang commented Nov 7, 2023

I once had a similar idea, but I abandoned it due to certain limitations in LevelDB. Specifically, the following write operations will remain blocked even if we flush the buffer in the background. Nevertheless, it's certainly worth considering this approach within the context of Pebble.

As you can see, it involves a non-trivial complexity. Do you have any performance data before we dive into details?

Yes, writes that hit LevelDB's limits will still block. But since the asynchronous flush no longer blocks the main process, subsequent block processing (chasing) is still much smoother thanks to cache hits, so the approach is still recommended.

Intel x86 EC2 instance: 16 cores, 64 GB RAM, 3000 IOPS, 128 MB/s.
~1200 tx/block; flushing the 256 MB nodebuffer causes glitches of ~30 s during LevelDB compaction.

@joeylichang
Author


@rjl493456442 Do you think this PR is necessary?

@rjl493456442
Member

@joeylichang I will try to do some benchmarks first.

@rjl493456442
Member

Deployed on our benchmark machine, will post the result once it's available.

@rjl493456442
Member

rjl493456442 commented Nov 19, 2023

The first wave of benchmark data.

[screenshot: benchmark metrics, 2023-11-19]

After running for a few hours, it turns out this pull request is slightly faster than master. (Note: I rebased your pull request against the latest master branch before deployment.)

07 is the pull request
08 is master

[screenshot: benchmark metrics, 2023-11-19]

The worst-case execution time drops from 3 s to 1 s.


From the logs I can tell that, during flushing, overall execution performance is still affected, even with the write moved to the background. We can see mgasps drop on both machines when flushing.

[screenshots: mgasps during flushing on both machines, 2023-11-19]

Let's keep them running for a few more days then.

@joeylichang
Author

Let's keep them running for a few more days then.

cool~, looking forward to the result.

Regarding "execution performance is still affected": according to our observations it is mainly due to disk bandwidth, which blocks transaction execution.
We have run a comparative test of Pebble vs. LevelDB, and Pebble performs better.

@fynnss

fynnss commented Nov 20, 2023

From the logs I can tell that, during flushing, overall execution performance is still affected, even with the write moved to the background. We can see mgasps drop on both machines when flushing.

@rjl493456442 Can you also show the disk read/write metrics here? Maybe disk bandwidth is the bottleneck.

@rjl493456442
Member

After running for a few more days, this pull request is consistently faster than master.

[screenshots: benchmark metrics, 2023-11-20]

Given that accumulating 256 MB of state in memory takes roughly 1.5 min in full sync, the theoretical speedup is ~1% (average_buffer_flush_time / 1.5 min; e.g. an average flush of ~1 s gives 1 s / 90 s ≈ 1%). So the speedup is understandably not that obvious in the metrics.

Contributor

@holiman holiman left a comment


Can't really say much about the overall idea at this point, I don't fully understand it yet. Some things look a bit flaky to me.

@@ -151,16 +151,16 @@ func (db *Database) Commit(root common.Hash, report bool) error {
// Size returns the storage size of diff layer nodes above the persistent disk
// layer, the dirty nodes buffered within the disk layer, and the size of cached
// preimages.
-func (db *Database) Size() (common.StorageSize, common.StorageSize, common.StorageSize) {
+func (db *Database) Size() (common.StorageSize, common.StorageSize, common.StorageSize, common.StorageSize) {
Contributor


Please update the method docs. With four return values, might be time to split it into a bullet-list, e.g. something like

// Size returns the sizes of: 
// - the diff layer nodes above the persistent disk layer, 
// - the dirty nodes buffered within the disk layer, 
// - the immutable nodes in the disk layer, 
// - the cached preimages

Author


added the docs.

@@ -294,14 +337,15 @@ func (dl *diskLayer) setBufferSize(size int) error {
}

// size returns the approximate size of cached nodes in the disk layer.
-func (dl *diskLayer) size() common.StorageSize {
+func (dl *diskLayer) size() (common.StorageSize, common.StorageSize) {
Contributor


update method doc

Author


added the docs.

// empty returns an indicator if trienodebuffer contains any state transition inside.
empty() bool

// getSize return the trienodebuffer used size.
Contributor


It returns two numbers, please document what they mean, respectively
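For example, something along these lines (my reading is that the two values are the current and the background buffer sizes):

// getSize returns the size of the mutable (front) node buffer and the size
// of the immutable (background, being-flushed) node buffer, respectively.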

Author


added the docs.

// asyncnodebuffer implement trienodebuffer interface, and aysnc the nodecache
// to disk.
type asyncnodebuffer struct {
mux sync.RWMutex
Contributor


We typically name mutexes mu rather than mux. Not saying it's better, but please let's stick to the convention :)

Contributor


Please also document what the mutex covers. In this case, I assume it is used as a concurrency-guard for operations on both current and background?

Author


Yes, changed to mu, and added docs for 'asyncnodebuffer' and the 'flush' method.

Comment on lines 154 to 158
if atomic.LoadUint64(&a.background.immutable) == 1 {
return nil
}

atomic.StoreUint64(&a.current.immutable, 1)
Contributor


This does not look concurrency-safe: first you check, then you store (blindly). What you probably want here is CompareAndSwap, and why not make immutable an atomic.Bool rather than using atomic.LoadUint64?
It would become something like

if !a.background.immutable.CompareAndSwap(false, true) {
    return nil
}

Author


atomic.Bool is better. CompareAndSwap is not appropriate though: it would set the background buffer's flag, but that buffer is about to be switched to become 'current', where immutable must stay false.
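Roughly, with atomic.Bool the switch would look like this (simplified sketch, hypothetical method and helper names, not the exact code in this PR):

// switchBuffers freezes the current buffer and rotates it to the background
// for an asynchronous flush. Only the background flag is checked and only the
// current buffer is frozen, so a CompareAndSwap on background would wrongly
// mark as immutable the very buffer that is about to become mutable again.
func (a *asyncnodebuffer) switchBuffers() error {
	a.mu.Lock()
	defer a.mu.Unlock()

	if a.background.immutable.Load() {
		// The previous flush is still running; keep writing to current.
		return nil
	}
	a.current.immutable.Store(true) // freeze the front buffer
	a.current, a.background = a.background, a.current

	// Flush the frozen buffer in the background; the flush routine resets its
	// immutable flag to false once the write is done (hypothetical helper).
	go a.flushBackground()
	return nil
}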

size uint64 // The size of aggregated writes
limit uint64 // The maximum memory allowance in bytes
nodes map[common.Hash]map[string]*trienode.Node // The dirty node set, mapped by owner and path
immutable uint64 // The flag equal 1, flush nodes to disk background
Contributor


Use a bool or even better atomic.Bool

Contributor


"The flag equal 1, flush nodes to disk background" -- please expand a bit on what the meaning of this field is.
For example (and I might be incorrect since I've still not fully understood this PR)

"If this is set to true, then some other thread is performing a flush in the background, and thus the ..."

or

"If this is set to true, then this nodebuffer is immutable and any write-operations to it will exit with error"

Author


fixed

Comment on lines 253 to 255
if atomic.LoadUint64(&b.immutable) == 1 {
return errWriteImmutable
}
Contributor


I'm not sure this is sufficient. What if someone is performing a commit, so is in the for-loop below, and some other thread then sets immutable to true? This seems not thread-safe to me, but maybe I don't fully understand what the concurrency looks like.
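To illustrate the concern (simplified, hypothetical sketch of the shape of the code, not the PR's actual commit method):

// The flag is only checked once up front; nothing stops another goroutine from
// flipping it to 1 while the loop below is still inserting nodes, so nodes can
// be written into a buffer that is already considered immutable.
func (b *nodebuffer) commit(nodes map[string][]byte) error {
	if atomic.LoadUint64(&b.immutable) == 1 {
		return errWriteImmutable
	}
	for path, blob := range nodes { // immutable may become 1 mid-loop
		b.nodes[path] = blob
	}
	return nil
}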

@joeylichang
Author

@rjl493456442 Any further results?
