Refactor stream implementation #499

danlaine · 2024-01-14T12:37:18Z

This PR refactors the stream implementation. The iterator maintains a stack where each element represents an ongoing iteration of a branch node. In each iteration, the element at the top of the stack tells us which key-value pair to traverse next.

The special casing in the "setup" code that deals with branch/extension nodes which are the last element of path_to_key is kind of annoying and not ideal, but we can probably refactor this further if/when we implement a function that returns an iterator over all nodes on the path to a given key.
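
A minimal sketch of the stack-based traversal described above, assuming a toy in-memory trie rather than firewood's on-disk nodes. `ToyNode`, `Frame`, and `ToyIter` are hypothetical names for illustration and do not appear in the PR.

```rust
use std::collections::btree_map;
use std::collections::BTreeMap;

/// Toy stand-in for a branch node: an optional value plus children keyed by nibble.
#[derive(Default)]
struct ToyNode {
    value: Option<Vec<u8>>,
    children: BTreeMap<u8, ToyNode>,
}

/// One stack element: an ongoing iteration over the children of one node.
struct Frame<'a> {
    key_nibbles: Vec<u8>,
    children_iter: btree_map::Iter<'a, u8, ToyNode>,
}

/// Yields (key nibbles, value) pairs in ascending key order.
struct ToyIter<'a> {
    stack: Vec<Frame<'a>>,
    root_value: Option<&'a [u8]>,
}

impl<'a> ToyIter<'a> {
    fn new(root: &'a ToyNode) -> Self {
        Self {
            stack: vec![Frame { key_nibbles: Vec::new(), children_iter: root.children.iter() }],
            root_value: root.value.as_deref(),
        }
    }
}

impl<'a> Iterator for ToyIter<'a> {
    type Item = (Vec<u8>, &'a [u8]);

    fn next(&mut self) -> Option<Self::Item> {
        // The root's own value, if any, sorts before every other key.
        if let Some(value) = self.root_value.take() {
            return Some((Vec::new(), value));
        }
        loop {
            // The top of the stack decides what to traverse next; an empty stack means done.
            let top = self.stack.last_mut()?;
            let Some((&pos, child)) = top.children_iter.next() else {
                // This node's children are exhausted; resume its parent.
                self.stack.pop();
                continue;
            };
            let key_nibbles: Vec<u8> =
                top.key_nibbles.iter().copied().chain(Some(pos)).collect();
            let value = child.value.as_deref();
            // Descend: the child's own children are visited before its siblings.
            self.stack.push(Frame {
                key_nibbles: key_nibbles.clone(),
                children_iter: child.children.iter(),
            });
            if let Some(value) = value {
                return Some((key_nibbles, value));
            }
        }
    }
}
```

Each call to `next` only touches the top frame, so the iterator can pause between items without recursion, which is what a `Stream` implementation needs.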

richardpringle

Pretty good! I like the pattern. I'm pretty sure we can use a concrete type for children_iter though.

richardpringle · 2024-01-15T16:03:21Z

firewood/src/merkle/stream.rs

-enum IteratorState<'a> {
+struct BranchIterator {
+    // The nibbles of the key at this node.
+    key_nibbles: Vec<u8>,


I think it would make sense to keep this as a Box<[u8]>. It's a little bit smaller (no capacity) and ensures that no one tries to grow or shrink the key.

This might also be a good opportunity to use bytes::Bytes to cut down on allocations.

I tried making this a Box<[u8]> but ran into some compiler issues. Maybe you could help me out with that next time we chat?

One way to fix it here
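
Separately from the fix linked above, here is a minimal sketch of the `Box<[u8]>` idea, assuming a simplified struct. `BranchIteratorSketch`, `child_key`, and `saved_bytes_per_key` are hypothetical names, not code from the PR.

```rust
use std::mem::size_of;

/// Hypothetical variant of the struct under review, with a fixed-length key.
struct BranchIteratorSketch {
    // A boxed slice has no `push`, so the key cannot grow or shrink in place.
    key_nibbles: Box<[u8]>,
}

/// Builds a child's key by appending `pos` to the parent's key.
/// `Box<[u8]>` implements `FromIterator<u8>`, so `collect` works directly.
fn child_key(parent: &[u8], pos: u8) -> Box<[u8]> {
    parent.iter().copied().chain(Some(pos)).collect()
}

/// Extra bytes a `Vec<u8>` field costs over a `Box<[u8]>` field.
/// On current toolchains this is the one word used for the capacity.
fn saved_bytes_per_key() -> usize {
    size_of::<Vec<u8>>() - size_of::<Box<[u8]>>()
}
```

An existing `Vec<u8>` can also be converted with `into_boxed_slice()`, which may reallocate if the vector has spare capacity.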

richardpringle · 2024-01-15T16:24:07Z

firewood/src/merkle/stream.rs

+    key_nibbles: Vec<u8>,
+    // Returns the non-empty children of this node
+    // and their positions in the node's children array.
+    children_iter: Box<dyn Iterator<Item = (DiskAddress, u8)> + Send>,


I think getting rid of this dynamic dispatch should be fairly straightforward... I'll try to remember to come back up here and comment after reading the rest of the code.
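
One possible shape for a concrete `children_iter` type, as a sketch only. It assumes a branch node stores its children as a fixed array of 16 optional addresses; `Addr` stands in for `DiskAddress`, and `ChildrenIter` is a hypothetical name.

```rust
/// Hypothetical stand-in for firewood's `DiskAddress`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Addr(u64);

/// Assumed number of child slots in a branch node (one per nibble).
const NUM_CHILDREN: usize = 16;

/// Concrete iterator over the non-empty children of a branch node.
/// It owns a copy of the child array, so it has no lifetime parameter
/// and is `Send` without boxing.
struct ChildrenIter {
    children: [Option<Addr>; NUM_CHILDREN],
    next_pos: usize,
}

impl ChildrenIter {
    fn new(children: [Option<Addr>; NUM_CHILDREN]) -> Self {
        Self { children, next_pos: 0 }
    }
}

impl Iterator for ChildrenIter {
    type Item = (Addr, u8);

    fn next(&mut self) -> Option<Self::Item> {
        // Skip empty slots; yield each occupied slot with its position.
        while self.next_pos < NUM_CHILDREN {
            let pos = self.next_pos;
            self.next_pos += 1;
            if let Some(addr) = self.children[pos] {
                return Some((addr, pos as u8));
            }
        }
        None
    }
}
```

Because the type is nameable, it can be stored in a struct field directly instead of behind `Box<dyn Iterator<...> + Send>`.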

richardpringle · 2024-01-15T17:58:43Z

firewood/src/merkle/stream.rs

+                    let mut child_key_nibbles = branch_iter.key_nibbles.clone(); // TODO reduce cloning
+                    child_key_nibbles.push(pos);


I don't think there's any way to actually reduce cloning here without storing partial-paths as some sort of borrow. However, in theory, that push could cause the new Vec to double in size. If you use iterators, the compiler is more likely to know the exact size to allocate:

Suggested change

-let mut child_key_nibbles = branch_iter.key_nibbles.clone(); // TODO reduce cloning
-child_key_nibbles.push(pos);
+let child_key_nibbles = branch_iter.key_nibbles.iter().copied().chain(Some(pos)).collect();

In this case, it's likely that the compiler would have known to allocate the proper size to begin with. There are articles out there that explain this, but my keyword game wasn't quite good enough today to find any quickly. Here's a thread with some explanation though:

https://users.rust-lang.org/t/performance-difference-between-iterator-and-for-loop/50254

I'm not 100% sure, but I think the Alice from the thread is Alice Ryhl, who's one of the core Tokio contributors. She's got some of the better answers and explanations on that platform.
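
A small sketch comparing the two ways of building the child key. The function names are hypothetical; the behavior noted in the comments reflects the standard library's size hints, not a measurement of this PR.

```rust
/// Clone-then-push: the copy is typically sized for the parent's length,
/// so the following `push` may need to grow the allocation.
fn child_key_clone_push(parent: &[u8], pos: u8) -> Vec<u8> {
    let mut key = parent.to_vec();
    key.push(pos);
    key
}

/// Iterator chain: the chained iterator reports an exact size hint of
/// `parent.len() + 1`, which `collect` can use to allocate once.
fn child_key_chain(parent: &[u8], pos: u8) -> Vec<u8> {
    parent.iter().copied().chain(Some(pos)).collect()
}
```

Both functions return the same contents; they differ only in how the allocation is sized.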

richardpringle · 2024-01-15T18:01:26Z

firewood/src/merkle/stream.rs

+                            if let Some(value) = branch.value.as_ref() {
+                                let value = value.to_vec();
+                                return Poll::Ready(Some(Ok((
+                                    key_from_nibble_iter(child_key_nibbles.into_iter().skip(1)), // skip the sentinel node leading 0


Can we just construct child_key_nibbles without the leading 0 in the first place?

We'd have to special case so that we don't push the 0 index that points from the sentinel node to the root but it's certainly possible
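
A rough sketch of that special case, under stated assumptions. `key_from_nibbles` is a hypothetical stand-in for `key_from_nibble_iter`, and the `parent_is_sentinel` flag is an illustrative parameter, not something in the PR.

```rust
/// Hypothetical stand-in for `key_from_nibble_iter`: packs pairs of nibbles
/// into bytes. Assumes an even number of nibbles, each in 0..=15.
fn key_from_nibbles(nibbles: impl IntoIterator<Item = u8>) -> Vec<u8> {
    let mut nibbles = nibbles.into_iter();
    let mut key = Vec::new();
    while let (Some(hi), Some(lo)) = (nibbles.next(), nibbles.next()) {
        key.push((hi << 4) | lo);
    }
    key
}

/// Appends `pos` to `parent`, except on the edge from the sentinel to the root.
fn child_key_nibbles(parent: &[u8], pos: u8, parent_is_sentinel: bool) -> Vec<u8> {
    if parent_is_sentinel {
        // The index that points from the sentinel to the root is not part of any key.
        return Vec::new();
    }
    parent.iter().copied().chain(Some(pos)).collect()
}
```

With the special case in place, the `skip(1)` at the point where keys are emitted is no longer needed.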

xinifinity · 2024-02-01T08:11:39Z

Do you mind sharing what is the major motivation for the refactor? Is it for readability or for better performance?

danlaine · 2024-02-01T14:31:04Z

Just readability. I haven't tested the performance of this relative to the existing implementation.

richardpringle · 2024-02-01T16:59:17Z

How hard do you think it would be to pop the starting logic into its own method?

impl<'a, S: ShaleStore<Node> + Send + Sync, T> Stream for MerkleKeyValueStream<'a, S, T> {
    type Item = Result<(Key, Value), api::Error>;

    fn poll_next(
        mut self: std::pin::Pin<&mut Self>,
        _cx: &mut std::task::Context<'_>,
    ) -> Poll<Option<Self::Item>> {
        // destructuring is necessary here because we need mutable access to `key_state`
        // at the same time as immutable access to `merkle`
        let Self {
            key_state,
            merkle_root,
            merkle,
        } = &mut *self;

        match key_state {
            IteratorState::StartAtKey(key) => {
                self.start_at(key)?; // might look slightly different than this
                self.poll_next(_cx)
            }
            IteratorState::Iterating { branch_iter_stack } => {
                // ...
            }
        }
    }
}

This is definitely something that I should have done in the previous pass.
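
The same two-state pattern, sketched on a synchronous iterator over a sorted key list so it stays self-contained. `State`, `SortedKeys`, and this `start_at` are hypothetical simplifications; the real method would walk the trie and return a `Result`.

```rust
/// Hypothetical two-state iterator, standing in for the stream's key state.
enum State {
    /// Not yet positioned; holds the key to start at.
    StartAtKey(Vec<u8>),
    /// Positioned; holds the index of the next key to yield.
    Iterating { next: usize },
}

struct SortedKeys<'a> {
    keys: &'a [Vec<u8>],
    state: State,
}

/// The starting logic in its own function: index of the first key >= `start`.
fn start_at(keys: &[Vec<u8>], start: &[u8]) -> usize {
    keys.partition_point(|k| k.as_slice() < start)
}

impl<'a> Iterator for SortedKeys<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        // Copy the shared reference so yielded items borrow for 'a, not from `self`.
        let keys = self.keys;
        let next = match &self.state {
            State::StartAtKey(start) => start_at(keys, start),
            State::Iterating { next } => *next,
        };
        let item = keys.get(next).map(Vec::as_slice);
        // Advance only if something was yielded.
        self.state = State::Iterating { next: next + usize::from(item.is_some()) };
        item
    }
}
```

Keeping the positioning step separate leaves the main `match` with one short arm per state.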

xinifinity · 2024-02-01T20:00:45Z

firewood/src/v2/api.rs

+    #[error("Child not found")]
+    ChildNotFound,
+
+    #[error("Extension node is at the root; should be branch node")]


Does this refer only to the sentinel root? This can happen for the real root, right? We force the sentinel root to be a branch node.

updated name

xinifinity · 2024-02-01T20:02:47Z

firewood/src/merkle/stream.rs

@@ -465,6 +444,84 @@ mod tests {
        check_stream_is_done(merkle.iter(root)).await;
    }

+    #[tokio::test]
+    async fn no_start_key() {


Does this overlap with the test at L407? If so, can we merge them into one?

xinifinity

This PR will also work with the extension node removal PR, right?

danlaine · 2024-02-02T15:59:38Z

This PR will also work with the extension node removal PR, right?

Not as it is right now. We will need to update the traversal logic in this PR to account for the fact that branch nodes may have partial paths in them.

rkuris

Partial review, will complete the remainder from L130 on after standups

rkuris · 2024-02-05T17:14:51Z

firewood/src/merkle/stream.rs

-enum IteratorState<'a> {
+struct BranchIterator {
+    // The nibbles of the key at this node.
+    key_nibbles: Vec<u8>,


One way to fix it here

rkuris · 2024-02-05T17:22:27Z

firewood/src/merkle/stream.rs

+) -> Result<IteratorState, api::Error> {
+    let root_node = merkle
+        .get_node(root_node)
+        .map_err(|e| api::Error::InternalError(Box::new(e)))?;


The error handling is poor here; if there is an IO error it gets mapped to an internal error. The right fix for this is to write From implementations for MerkleError and ShaleError for v2::api::Error.

This should be done in a follow up though.
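
A sketch of what such `From` implementations could look like, using simplified stand-in enums. The `*Sketch` types and their variants are assumptions for illustration and do not mirror firewood's actual error definitions.

```rust
use std::io;

/// Hypothetical stand-in for `ShaleError`.
#[derive(Debug)]
enum ShaleErrorSketch {
    Io(io::Error),
}

/// Hypothetical stand-in for `MerkleError`.
#[derive(Debug)]
enum MerkleErrorSketch {
    Shale(ShaleErrorSketch),
    NodeMissing,
}

/// Hypothetical stand-in for `v2::api::Error`.
#[derive(Debug)]
enum ApiErrorSketch {
    Io(io::Error),
    Internal(String),
}

impl From<ShaleErrorSketch> for ApiErrorSketch {
    fn from(e: ShaleErrorSketch) -> Self {
        match e {
            // An IO failure stays an IO failure instead of becoming "internal".
            ShaleErrorSketch::Io(io) => ApiErrorSketch::Io(io),
        }
    }
}

impl From<MerkleErrorSketch> for ApiErrorSketch {
    fn from(e: MerkleErrorSketch) -> Self {
        match e {
            MerkleErrorSketch::Shale(inner) => inner.into(),
            MerkleErrorSketch::NodeMissing => ApiErrorSketch::Internal("node missing".into()),
        }
    }
}

/// Toy lookup that can fail with an IO error.
fn get_node(fail_with_io: bool) -> Result<u64, MerkleErrorSketch> {
    if fail_with_io {
        let io = io::Error::new(io::ErrorKind::Other, "disk read failed");
        return Err(MerkleErrorSketch::Shale(ShaleErrorSketch::Io(io)));
    }
    Ok(7)
}

/// With the `From` impls in place, `?` replaces the `map_err` call.
fn root_node(fail_with_io: bool) -> Result<u64, ApiErrorSketch> {
    let node = get_node(fail_with_io)?;
    Ok(node)
}
```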

rkuris · 2024-02-05T17:28:24Z

firewood/src/merkle/stream.rs

+
+    // Get each node in [path_to_key]. Mark the last node as being the last
+    // so we can use that information in the while loop below.
+    let path_to_key = path_to_key


There is a much easier way than doing all this work up front, see next two comments.

rkuris · 2024-02-05T17:28:50Z

firewood/src/merkle/stream.rs

+
+    let mut matched_key_nibbles = vec![];
+
+    for node in path_to_key {


Suggested change

-for node in path_to_key {
+let mut path_iterator = path_to_key.into_iter().peekable();
+while let Some(node) = path_iterator.next() {

rkuris · 2024-02-05T17:29:46Z

firewood/src/merkle/stream.rs

+
+        match node.inner() {
+            NodeType::Leaf(_) => (),
+            NodeType::Branch(branch) if is_last => {


Suggested change

-NodeType::Branch(branch) if is_last => {
+NodeType::Branch(branch) if path_iterator.peek().is_none() => {

Once you do this, I'd argue the iterator shouldn't actually pre-fetch all the nodes and should just be the iterator over path_to_key, reading nodes as it needs them.
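
A minimal sketch of the `peekable` pattern from the two suggestions above, on a generic path. `label_last` is a hypothetical helper used only to make the behavior observable.

```rust
/// Labels each element of a path with whether it is the last one,
/// using `peek` instead of a precomputed `is_last` flag.
fn label_last<T>(path_to_key: Vec<T>) -> Vec<(T, bool)> {
    let mut path_iterator = path_to_key.into_iter().peekable();
    let mut labeled = Vec::new();
    while let Some(node) = path_iterator.next() {
        // `peek` looks at the next element without consuming it.
        let is_last = path_iterator.peek().is_none();
        labeled.push((node, is_last));
    }
    labeled
}
```

The `while let` form is needed here because a `for` loop would hold the iterator for the whole loop, leaving no way to call `peek` inside the body.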

rkuris · 2024-02-05T17:32:15Z

firewood/src/merkle/stream.rs

+            NodeType::Branch(branch) if is_last => {
+                // This is the last node in the path to [key].
+                // Figure out whether to start iterating over this node's
+                // children at [pos] or [pos + 1].


🏆 level commenting

rkuris · 2024-02-05T17:35:30Z

firewood/src/merkle/stream.rs

+                // Get the child at [pos], if any.
+                let Some(child) = branch.children.get(pos as usize) else {
+                    // This should never happen -- [pos] should never be OOB.
+                    return Err(api::Error::InternalError(Box::new(
+                        api::Error::ChildNotFound,
+                    )));
+                };


This is a valid case for a panic, as the code prevents this from ever happening.

See https://blog.burntsushi.net/unwrap/#so-one-should-never-use-unwrap-or-expect

We have a rule for prohibiting unwrap() but not expect() exactly for this case.

Suggested change

-// Get the child at [pos], if any.
-let Some(child) = branch.children.get(pos as usize) else {
-    // This should never happen -- [pos] should never be OOB.
-    return Err(api::Error::InternalError(Box::new(
-        api::Error::ChildNotFound,
-    )));
-};
+// Get the child at [pos], if any.
+let child = branch.children.get(pos as usize).expect("pos cannot be OOB");
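
A self-contained sketch of the `expect` form, assuming children are stored in a 16-slot array and addresses are plain `u64` values. `child_at` is a hypothetical helper.

```rust
/// Returns the child slot at `pos`.
/// `pos` is assumed to be a nibble, so an out-of-bounds index would be a bug
/// in the caller rather than a recoverable runtime error.
fn child_at(children: &[Option<u64>; 16], pos: u8) -> Option<u64> {
    // `get` returns `None` only when `pos` is out of bounds; an empty slot
    // is `Some(&None)`, which the dereference turns into `None`.
    *children.get(pos as usize).expect("pos cannot be OOB")
}
```

The two kinds of absence stay distinct: an empty slot is an ordinary `None` result, while an impossible index panics with a message that names the broken invariant.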

rkuris · 2024-02-05T17:39:06Z

firewood/src/merkle/stream.rs

+                let child = child
+                    .map(|child_addr| merkle.get_node(child_addr))
+                    .transpose()
+                    .map_err(|e| api::Error::InternalError(Box::new(e)))?;


same comment about errors as above. Should be fixed in a followup

rkuris · 2024-02-05T17:44:19Z

firewood/src/merkle/stream.rs

+                let comparer = if child.is_none() {
+                    // The child doesn't exist; we don't need to iterate over [pos].
+                    <u8 as PartialOrd>::gt
+                } else {
+                    // The child does exist; the first key to iterate over must be at [pos].
+                    <u8 as PartialOrd>::ge
+                };
+
+                let children_iter = get_children_iter(branch)
+                    .filter(move |(_, child_pos)| comparer(child_pos, &pos));


Comments significantly improve readability but in this case I'd just include the logic in the closure, which makes it even more readable:

Suggested change

-let comparer = if child.is_none() {
-    // The child doesn't exist; we don't need to iterate over [pos].
-    <u8 as PartialOrd>::gt
-} else {
-    // The child does exist; the first key to iterate over must be at [pos].
-    <u8 as PartialOrd>::ge
-};
-
-let children_iter = get_children_iter(branch)
-    .filter(move |(_, child_pos)| comparer(child_pos, &pos));
+let children_iter = get_children_iter(branch).filter(move |(_, child_pos)| {
+    if child.is_none() {
+        child_pos > &pos
+    } else {
+        child_pos >= &pos
+    }
+});
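
The closure form, sketched on plain data so it can be run in isolation. `children_from` is a hypothetical helper; `children` stands in for the output of `get_children_iter`, and `child_exists` for the `child.is_none()` check.

```rust
/// Keeps the children to visit when iteration starts at `pos`.
/// `children` holds (address, position) pairs for non-empty slots.
fn children_from(children: Vec<(u64, u8)>, pos: u8, child_exists: bool) -> Vec<(u64, u8)> {
    children
        .into_iter()
        .filter(move |(_, child_pos)| {
            if child_exists {
                // The child exists; the first key to iterate over is at `pos`.
                *child_pos >= pos
            } else {
                // No child at `pos`; start strictly after it.
                *child_pos > pos
            }
        })
        .collect()
}
```

If the helper only yields non-empty children, as its comment in the diff says, the two branches select the same set whenever no child exists at `pos`, so the branch mainly documents intent.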

danlaine · 2024-02-05T19:27:08Z

Marking as WIP. I'm going to try implementing this with a node iterator. See #399.

danlaine · 2024-02-07T16:54:52Z

closing in favor of #517

Dan Laine added 19 commits January 11, 2024 16:29

WIP

896e8bd

WIP

0e96883

WIP

6c6581c

WIP

9669108

WIP

53a3f41

WIP

1ea329e

compiles

c721e64

put child index in key

84163c8

make key vector in node iterator

dd050e7

no-op implementation for iterator initialization

56ab548

cleanup; add test no_start_key

8c3af4a

improve no_start_key test

49a93c6

use get_children_iter helper

4a68282

add comment

009bfe9

comments and cleanup

b8dddb4

comments and cleanup

54fa043

WIP

8dfd728

comment WIP code

082a797

uncomment WIP code

fe2de9b

richardpringle reviewed Jan 15, 2024

rkuris assigned danlaine Jan 17, 2024

Dan Laine added 9 commits January 29, 2024 10:28

Merge remote-tracking branch 'origin/main' into fix-stream-3

0105df4

WIP fixing initialization code

e3e5adc

WIP fixing initialization code

c1fd906

18 of 20 UT pass

66a715e

readability nit

9b82d90

nit remove useless assignment

902afe0

comments and nits

580762a

all UT pass

bdfd326

appease linter

a8a5cf8

Dan Laine added 2 commits January 31, 2024 13:25

appease linter

001874b

Merge remote-tracking branch 'origin/main' into fix-stream-3

c5b87fd

danlaine marked this pull request as ready for review January 31, 2024 20:33

danlaine requested review from xinifinity and rkuris as code owners January 31, 2024 20:33

danlaine changed the title from "[WIP] Refactor stream implementation" to "Refactor stream implementation" Jan 31, 2024

Merge remote-tracking branch 'origin/main' into fix-stream-3

e0728ee

xinifinity reviewed Feb 1, 2024

Dan Laine added 8 commits February 2, 2024 09:14

separate out initialization of key state into function

6f2de3c

naming nits

a32e770

&mut --> &

7799284

appease linter

15eaa77

rename error

ba65fda

appease linter

d3f519f

remove errantly added WIP code

9f82812

test cleanup

60c6c39

Dan Laine added 2 commits February 2, 2024 11:19

fix upper loop bound in test

25cbdbb

Merge branch 'main' into fix-stream-3

1752650

rkuris requested changes Feb 5, 2024

danlaine marked this pull request as draft February 5, 2024 19:26

danlaine mentioned this pull request Feb 6, 2024

Implement iterator over nodes; refactor key-value iterator #517

Merged

danlaine closed this Feb 7, 2024

danlaine deleted the fix-stream-3 branch April 2, 2024 19:27
