Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTreeMap insert does not honor the B-tree invariant #74834

Closed
ssomers opened this issue Jul 27, 2020 · 1 comment · Fixed by #75105
Closed

BTreeMap insert does not honor the B-tree invariant #74834

ssomers opened this issue Jul 27, 2020 · 1 comment · Fixed by #75105
Labels
A-collections Area: `std::collection` C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@ssomers
Copy link
Contributor

ssomers commented Jul 27, 2020

Inserting into a BTreeMap, where the key lands in a node that is full (has 11 elements), cause that node to keep 6 elements and split off 5 elements, of which one ends up in the parent node, and the other 4 in a new node. If the key inserted is not big and belongs among the left side of the keys in the node, the left child comes out with 7 elements and the right child has only 4 elements.

I don't think(*) this causes any observable ugliness so it's impossible to demonstrate using outsider testing code. Maybe through memory footprint or performance analysis. I have a commit peeking at the internals to prove it.

Is this really a bug at all?

  • The documentation writes that "each node contain B-1 to 2B-1 elements", like other texts on B trees
    (where B-1 = MIN_LEN = 5).
  • The code too wants to assure that any non-root node is not underfull, i.e. carrying less than MIN_LEN elements.
  • But it's possible that this strategy enhances performance, because often keys inserted are generally increasing, so they tend to land in the right child and leave less wasted space in the leftmost nodes.

(*) For instance, you can't reduce the number of elements further. remove_kv_tracking could temporarily reduce it down to 3, but it will then either merge with a sibling or be stocked back up to 4 elements.

@ssomers ssomers added the C-bug Category: This is a bug. label Jul 27, 2020
@jonas-schievink jonas-schievink added A-collections Area: `std::collection` T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jul 27, 2020
@ssomers
Copy link
Contributor Author

ssomers commented Aug 3, 2020

Let's say we want to insert in a full node (len() = CAPACITY = 2B-1) at edge-index IN (which is in 0..=CAPACITY), and we accomplish this by splitting the node at kv-index S (which is in 0..CAPACITY), moving the element (key/value pair) at kv-index S into a parent node, keeping the S elements on the left in the left node, and moving the (CAPACITY-S-1) elements on the right to a new right child. The current code chooses S = B leaving the B elements on the left in a node and shipping out the B-2 elements on the right to another node, and then inserts the new element either left or right.

The simplest, symmetric rule is to choose S = B-1, the element right in the middle, leaving B-1 elements in the left child and B-1 elements in the right child, and then inserting in either of the children.

Possible exceptions

  1. If IN > B, choose S = B, leaving B elements in the left child and (after the insertion) B-1 in the right child. Compared to the simple, symmetric rule:

    • +1 One fewer element to move over to the new right node.
    • +1 One fewer element to shift again while inserting into that node (only if IN < CAPACITY, and you might argue this step could have been anticipated).
    • +1 More efficient use of space, if we assume the insertion happens in the rightmost node of the tree, and IN > B is a sign that the keys inserted are increasing.
    • +1 Same as the current code, when keys are generally increasing.
  2. For the sake of symmetry with rule 1, if IN < B-1, choose S = B-2, leaving (after the insertion) B-1 elements in the left child and B in the right child. Compared to the simple, symmetric rule:

    • +1 One fewer element to shift while inserting into the left node.
    • -1 One more element to move over to the new right node.
    • +1 More efficient use of space, if we assume the insertion happens in the leftmost node of the tree, and IN < B-1 is a sign that the keys inserted are decreasing. However, if keys are generally decreasing, the left-aligned arrays inside nodes imply that every insert will hurt, and if that is more than a bad streak and happens all the time, you should swap the map's ordering, so this doesn't seem as likely or useful as in rule 1.
  3. If IN = B (the right edge of the central element), normally S = B-1 is the only choice. Splitting at kv-index B would require insertion on the left and leave an underfull right child. But we could instead split at edge-index B: keep the first B elements, move the last B-1 elements to the right node, and place the inserted key/value at the parent node. Compared to the simple, symmetric rule:

    • +1 Not moving the element at kv-index B.
    • +1 No shifting while inserting (though this step could have been anticipated).
  4. If IN = B-1 (the left edge of the central element), normally S = B-1 is the only choice. But like rule 3, we could split at edge-index B-1 and insert into the parent node. Compared to the simple, symmetric rule:

    • No genuine difference

Each extra rule has at least two more disadvantages:

  • -1 Cognitive overhead in already complicated code
  • -1 Runtime overhead due to a comparison IN <> B (where IN is readily available and B constant) and a jump.

My conclusion is that, beyond the simple, symmetric rule, only exception 1 might be worth implementing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Area: `std::collection` C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
2 participants