BTreeMap insert does not honor the B-tree invariant #74834

ssomers · 2020-07-27T18:53:03Z

Inserting into a BTreeMap, where the key lands in a node that is full (has 11 elements), cause that node to keep 6 elements and split off 5 elements, of which one ends up in the parent node, and the other 4 in a new node. If the key inserted is not big and belongs among the left side of the keys in the node, the left child comes out with 7 elements and the right child has only 4 elements.

I don't think(*) this causes any observable ugliness so it's impossible to demonstrate using outsider testing code. Maybe through memory footprint or performance analysis. I have a commit peeking at the internals to prove it.

Is this really a bug at all?

The documentation writes that "each node contain B-1 to 2B-1 elements", like other texts on B trees
(where B-1 = MIN_LEN = 5).
The code too wants to assure that any non-root node is not underfull, i.e. carrying less than MIN_LEN elements.
But it's possible that this strategy enhances performance, because often keys inserted are generally increasing, so they tend to land in the right child and leave less wasted space in the leftmost nodes.

(*) For instance, you can't reduce the number of elements further. remove_kv_tracking could temporarily reduce it down to 3, but it will then either merge with a sibling or be stocked back up to 4 elements.

The text was updated successfully, but these errors were encountered:

ssomers · 2020-08-03T11:53:50Z

Let's say we want to insert in a full node (len() = CAPACITY = 2B-1) at edge-index IN (which is in 0..=CAPACITY), and we accomplish this by splitting the node at kv-index S (which is in 0..CAPACITY), moving the element (key/value pair) at kv-index S into a parent node, keeping the S elements on the left in the left node, and moving the (CAPACITY-S-1) elements on the right to a new right child. The current code chooses S = B leaving the B elements on the left in a node and shipping out the B-2 elements on the right to another node, and then inserts the new element either left or right.

The simplest, symmetric rule is to choose S = B-1, the element right in the middle, leaving B-1 elements in the left child and B-1 elements in the right child, and then inserting in either of the children.

Possible exceptions

If IN > B, choose S = B, leaving B elements in the left child and (after the insertion) B-1 in the right child. Compared to the simple, symmetric rule:
- +1 One fewer element to move over to the new right node.
- +1 One fewer element to shift again while inserting into that node (only if IN < CAPACITY, and you might argue this step could have been anticipated).
- +1 More efficient use of space, if we assume the insertion happens in the rightmost node of the tree, and IN > B is a sign that the keys inserted are increasing.
- +1 Same as the current code, when keys are generally increasing.
For the sake of symmetry with rule 1, if IN < B-1, choose S = B-2, leaving (after the insertion) B-1 elements in the left child and B in the right child. Compared to the simple, symmetric rule:
- +1 One fewer element to shift while inserting into the left node.
- -1 One more element to move over to the new right node.
- +1 More efficient use of space, if we assume the insertion happens in the leftmost node of the tree, and IN < B-1 is a sign that the keys inserted are decreasing. However, if keys are generally decreasing, the left-aligned arrays inside nodes imply that every insert will hurt, and if that is more than a bad streak and happens all the time, you should swap the map's ordering, so this doesn't seem as likely or useful as in rule 1.
If IN = B (the right edge of the central element), normally S = B-1 is the only choice. Splitting at kv-index B would require insertion on the left and leave an underfull right child. But we could instead split at edge-index B: keep the first B elements, move the last B-1 elements to the right node, and place the inserted key/value at the parent node. Compared to the simple, symmetric rule:
- +1 Not moving the element at kv-index B.
- +1 No shifting while inserting (though this step could have been anticipated).
If IN = B-1 (the left edge of the central element), normally S = B-1 is the only choice. But like rule 3, we could split at edge-index B-1 and insert into the parent node. Compared to the simple, symmetric rule:
- No genuine difference

Each extra rule has at least two more disadvantages:

-1 Cognitive overhead in already complicated code
-1 Runtime overhead due to a comparison IN <> B (where IN is readily available and B constant) and a jump.

My conclusion is that, beyond the simple, symmetric rule, only exception 1 might be worth implementing.

ssomers added the C-bug Category: This is a bug. label Jul 27, 2020

jonas-schievink added A-collections Area: `std::collection` T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jul 27, 2020

This was referenced Aug 3, 2020

Easy way to respect BTreeMap's minimum node length #75104

Closed

Hard way to respect BTreeMap's minimum node length #75105

Merged

bors closed this as completed in 81dc88f Aug 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BTreeMap insert does not honor the B-tree invariant #74834

BTreeMap insert does not honor the B-tree invariant #74834

ssomers commented Jul 27, 2020 •

edited

Loading

ssomers commented Aug 3, 2020 •

edited

Loading

BTreeMap insert does not honor the B-tree invariant #74834

BTreeMap insert does not honor the B-tree invariant #74834

Comments

ssomers commented Jul 27, 2020 • edited Loading

ssomers commented Aug 3, 2020 • edited Loading

Possible exceptions

ssomers commented Jul 27, 2020 •

edited

Loading

ssomers commented Aug 3, 2020 •

edited

Loading