Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754

Open
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

goldvitaly
Copy link
Contributor

This is simplification that allows to remove ClearDeleted and intended
to have similar performance.

Benchmarks are close to noise. Leaning positive on X86 and leaning
negative on ARM.

  1. On Arm:
    &= ~(uint64_t{0xff} << (byte_index * 8)) -> |= (uint64_t{tag | 0x80} << (byte_index * 8)).
    This operation is even reducing latency from 4 -> 3.
    But we have another shift operation to compute the tag. Theoretically it shouldn't be slower.

  2. On x86:
    new_metadata[new_index ^ old_size] = 0; -> new_metadata[new_index] = tag | MetadataGroup::PresentMask;.
    Here we have the same latency, but we need an extra shift to compute the tag.

Benchmarks on X86:

name                                                  old cpu/op   new cpu/op   delta
BM_MapInsertSeq<Map<int, int>>/1                      32.7ns ± 4%  32.8ns ± 3%     ~     (p=0.354 n=52+52)
BM_MapInsertSeq<Map<int, int>>/2                      42.3ns ± 3%  40.4ns ± 4%   -4.56%  (p=0.000 n=54+55)
BM_MapInsertSeq<Map<int, int>>/3                      50.5ns ± 3%  51.1ns ± 3%   +1.22%  (p=0.000 n=53+56)
BM_MapInsertSeq<Map<int, int>>/4                      59.9ns ± 4%  60.8ns ± 4%   +1.45%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<int, int>>/8                      98.8ns ± 4%  98.3ns ± 3%   -0.50%  (p=0.044 n=56+56)
BM_MapInsertSeq<Map<int, int>>/16                      163ns ± 4%   158ns ± 4%   -2.77%  (p=0.000 n=55+51)
BM_MapInsertSeq<Map<int, int>>/32                      281ns ± 4%   257ns ± 3%   -8.30%  (p=0.000 n=55+55)
BM_MapInsertSeq<Map<int, int>>/64                      646ns ± 4%   624ns ± 3%   -3.41%  (p=0.000 n=55+55)
BM_MapInsertSeq<Map<int, int>>/256                    2.42µs ± 4%  2.38µs ± 3%   -1.57%  (p=0.000 n=53+56)
BM_MapInsertSeq<Map<int, int>>/4096                   38.4µs ± 3%  38.6µs ± 3%     ~     (p=0.118 n=54+52)
BM_MapInsertSeq<Map<int, int>>/65536                   968µs ± 3%   972µs ± 3%     ~     (p=0.333 n=56+57)
BM_MapInsertSeq<Map<int, int>>/1048576                16.3ms ± 5%  16.3ms ± 6%     ~     (p=0.269 n=56+55)
BM_MapInsertSeq<Map<int, int>>/16777216                646ms ± 3%   646ms ± 4%     ~     (p=0.811 n=57+55)
BM_MapInsertSeq<Map<int, int>>/56                      436ns ± 3%   423ns ± 3%   -2.89%  (p=0.000 n=56+57)
BM_MapInsertSeq<Map<int, int>>/224                    1.66µs ± 4%  1.63µs ± 4%   -2.36%  (p=0.000 n=55+57)
BM_MapInsertSeq<Map<int, int>>/3584                   25.1µs ± 4%  25.4µs ± 5%   +1.03%  (p=0.002 n=55+57)
BM_MapInsertSeq<Map<int, int>>/57344                   559µs ± 3%   567µs ± 4%   +1.31%  (p=0.000 n=56+57)
BM_MapInsertSeq<Map<int, int>>/917504                 10.4ms ± 4%  10.4ms ± 4%     ~     (p=0.417 n=56+55)
BM_MapInsertSeq<Map<int, int>>/14680064                420ms ± 3%   420ms ± 3%     ~     (p=0.740 n=57+55)
BM_MapInsertSeq<Map<int*, int*>>/1                    33.8ns ± 3%  33.9ns ± 3%     ~     (p=0.191 n=55+56)
BM_MapInsertSeq<Map<int*, int*>>/2                    37.0ns ± 3%  37.0ns ± 4%     ~     (p=0.948 n=55+56)
BM_MapInsertSeq<Map<int*, int*>>/3                    41.5ns ± 4%  41.6ns ± 4%     ~     (p=0.503 n=56+56)
BM_MapInsertSeq<Map<int*, int*>>/4                    46.1ns ± 4%  47.2ns ± 4%   +2.31%  (p=0.000 n=56+56)
BM_MapInsertSeq<Map<int*, int*>>/8                    63.6ns ± 4%  65.4ns ± 3%   +2.89%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/16                    133ns ± 4%   128ns ± 4%   -3.74%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/32                    237ns ± 3%   236ns ± 4%     ~     (p=0.082 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/64                    597ns ± 3%   631ns ± 3%   +5.73%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/256                  2.78µs ± 3%  2.89µs ± 4%   +3.82%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/4096                 51.9µs ± 3%  54.2µs ± 4%   +4.46%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/65536                1.16ms ± 3%  1.18ms ± 4%   +1.10%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/1048576              28.9ms ± 4%  29.0ms ± 4%     ~     (p=0.549 n=57+54)
BM_MapInsertSeq<Map<int*, int*>>/16777216              914ms ± 3%   912ms ± 3%     ~     (p=0.147 n=57+54)
BM_MapInsertSeq<Map<int*, int*>>/56                    366ns ± 3%   407ns ± 3%  +11.27%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/224                  1.86µs ± 4%  1.96µs ± 3%   +5.18%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/3584                 33.9µs ± 4%  35.9µs ± 3%   +6.04%  (p=0.000 n=57+52)
BM_MapInsertSeq<Map<int*, int*>>/57344                 763µs ± 3%   772µs ± 4%   +1.25%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/917504               16.8ms ±11%  16.6ms ± 6%     ~     (p=0.454 n=57+52)
BM_MapInsertSeq<Map<int*, int*>>/14680064              610ms ± 2%   608ms ± 3%     ~     (p=0.127 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1          34.6ns ± 4%  34.5ns ± 3%     ~     (p=0.331 n=56+52)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/2          43.9ns ± 4%  37.8ns ± 3%  -13.78%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3          47.9ns ± 4%  48.9ns ± 3%   +2.16%  (p=0.000 n=56+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4          53.6ns ± 3%  54.0ns ± 3%   +0.72%  (p=0.012 n=56+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/8          84.6ns ± 4%  81.0ns ± 3%   -4.21%  (p=0.000 n=57+50)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16          142ns ± 3%   136ns ± 3%   -4.22%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/32          258ns ± 4%   245ns ± 3%   -4.96%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/64          699ns ± 4%   692ns ± 3%   -1.02%  (p=0.000 n=50+50)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/256        2.96µs ± 6%  2.97µs ± 4%     ~     (p=0.098 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096       50.4µs ± 3%  50.7µs ± 2%   +0.53%  (p=0.030 n=54+53)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536      1.38ms ± 2%  1.42ms ± 3%   +2.76%  (p=0.000 n=54+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576    37.8ms ± 4%  38.1ms ± 4%   +0.78%  (p=0.016 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216    1.19s ± 2%   1.20s ± 3%     ~     (p=0.055 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/56          430ns ± 3%   413ns ± 3%   -4.10%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/224        1.94µs ± 8%  1.93µs ± 6%     ~     (p=0.587 n=57+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584       32.1µs ± 4%  32.2µs ± 3%     ~     (p=0.403 n=56+57)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344       746µs ± 3%   760µs ± 3%   +1.93%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504     22.0ms ± 6%  22.2ms ± 5%   +0.90%  (p=0.025 n=57+53)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064    734ms ± 2%   736ms ± 3%     ~     (p=0.107 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1          40.8ns ± 4%  40.8ns ± 4%     ~     (p=0.685 n=54+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/2          49.5ns ± 4%  49.6ns ± 7%     ~     (p=0.927 n=56+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3          58.1ns ± 4%  58.2ns ± 4%     ~     (p=0.632 n=55+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4          66.7ns ± 4%  67.0ns ± 5%     ~     (p=0.235 n=55+56)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/8           104ns ± 6%   105ns ± 6%     ~     (p=0.171 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16          196ns ± 5%   197ns ± 6%     ~     (p=0.189 n=54+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/32          361ns ± 8%   364ns ± 7%     ~     (p=0.156 n=53+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/64         1.05µs ± 6%  1.04µs ± 6%     ~     (p=0.069 n=56+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/256        5.32µs ± 5%  5.24µs ± 4%   -1.46%  (p=0.000 n=57+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096        148µs ± 4%   147µs ± 4%   -0.65%  (p=0.028 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536      3.17ms ± 3%  3.13ms ± 2%   -1.49%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576    99.1ms ± 3%  98.3ms ± 3%   -0.89%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216    2.40s ± 3%   2.40s ± 2%     ~     (p=0.247 n=56+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/56          612ns ± 6%   612ns ± 8%     ~     (p=0.968 n=52+56)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/224        3.56µs ± 7%  3.51µs ± 4%   -1.40%  (p=0.006 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584       87.3µs ± 5%  88.5µs ± 6%   +1.30%  (p=0.011 n=57+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344      1.95ms ± 3%  1.93ms ± 4%   -0.77%  (p=0.000 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504     61.2ms ± 4%  61.0ms ± 3%     ~     (p=0.059 n=57+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064    1.50s ± 3%   1.50s ± 3%     ~     (p=0.609 n=57+55)

Benchmarks on ARM:

name                                                  old cpu/op   new cpu/op   delta
BM_MapInsertSeq<Map<int, int>>/1                      39.6ns ± 1%  39.6ns ± 1%    ~     (p=0.416 n=155+156)
BM_MapInsertSeq<Map<int, int>>/2                      44.6ns ± 1%  44.6ns ± 1%    ~     (p=0.993 n=155+157)
BM_MapInsertSeq<Map<int, int>>/3                      50.1ns ± 3%  50.1ns ± 2%    ~     (p=0.585 n=156+157)
BM_MapInsertSeq<Map<int, int>>/4                      55.7ns ± 3%  55.4ns ± 1%  -0.61%  (p=0.000 n=156+119)
BM_MapInsertSeq<Map<int, int>>/8                      78.0ns ± 5%  77.2ns ± 1%  -1.01%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/16                      123ns ± 5%   121ns ± 0%  -1.32%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/32                      216ns ± 6%   213ns ± 0%  -1.51%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/64                      644ns ± 4%   659ns ± 3%  +2.19%  (p=0.000 n=157+146)
BM_MapInsertSeq<Map<int, int>>/256                    3.22µs ± 4%  3.30µs ± 5%  +2.56%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/4096                   53.8µs ± 3%  55.5µs ± 2%  +3.18%  (p=0.000 n=157+144)
BM_MapInsertSeq<Map<int, int>>/65536                  1.29ms ± 3%  1.33ms ± 4%  +2.97%  (p=0.000 n=155+155)
BM_MapInsertSeq<Map<int, int>>/1048576                30.2ms ± 8%  31.0ms ± 9%  +2.58%  (p=0.000 n=157+156)
BM_MapInsertSeq<Map<int, int>>/16777216                902ms ±14%   924ms ±17%  +2.47%  (p=0.005 n=157+157)
BM_MapInsertSeq<Map<int, int>>/56                      350ns ± 6%   345ns ± 0%  -1.64%  (p=0.023 n=157+119)
BM_MapInsertSeq<Map<int, int>>/224                    2.09µs ± 5%  2.14µs ± 5%  +2.01%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/3584                   36.6µs ± 4%  37.3µs ± 3%  +1.76%  (p=0.000 n=157+142)
BM_MapInsertSeq<Map<int, int>>/57344                   903µs ± 5%   923µs ± 4%  +2.23%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/917504                 21.7ms ± 8%  22.0ms ± 8%  +1.52%  (p=0.000 n=157+153)
BM_MapInsertSeq<Map<int, int>>/14680064                678ms ±18%   692ms ±18%  +2.03%  (p=0.037 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/1                    40.7ns ± 1%  40.7ns ± 1%    ~     (p=0.216 n=152+153)
BM_MapInsertSeq<Map<int*, int*>>/2                    45.5ns ± 1%  45.5ns ± 1%    ~     (p=0.403 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/3                    51.1ns ± 1%  51.1ns ± 1%    ~     (p=0.962 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/4                    57.5ns ± 4%  59.7ns ± 1%  +3.86%  (p=0.000 n=157+149)
BM_MapInsertSeq<Map<int*, int*>>/8                    80.3ns ± 4%  82.3ns ± 1%  +2.52%  (p=0.000 n=157+108)
BM_MapInsertSeq<Map<int*, int*>>/16                    127ns ± 4%   129ns ± 4%  +1.57%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/32                    229ns ± 4%   230ns ± 4%  +0.49%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/64                    695ns ± 3%   705ns ± 4%  +1.37%  (p=0.000 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/256                  3.64µs ± 7%  3.46µs ± 3%  -4.93%  (p=0.000 n=157+154)
BM_MapInsertSeq<Map<int*, int*>>/4096                 58.8µs ± 2%  60.8µs ± 2%  +3.46%  (p=0.000 n=157+148)
BM_MapInsertSeq<Map<int*, int*>>/65536                1.14ms ± 2%  1.17ms ± 3%  +3.23%  (p=0.000 n=156+157)
BM_MapInsertSeq<Map<int*, int*>>/1048576              41.4ms ±12%  42.3ms ±13%  +2.13%  (p=0.001 n=157+156)
BM_MapInsertSeq<Map<int*, int*>>/16777216              1.10s ±20%   1.12s ±22%    ~     (p=0.082 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/56                    377ns ± 5%   372ns ± 7%  -1.19%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/224                  2.40µs ± 9%  2.23µs ± 4%  -6.97%  (p=0.000 n=157+155)
BM_MapInsertSeq<Map<int*, int*>>/3584                 39.6µs ± 3%  40.0µs ± 2%  +1.01%  (p=0.000 n=157+121)
BM_MapInsertSeq<Map<int*, int*>>/57344                 776µs ± 3%   795µs ± 3%  +2.40%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/917504               31.0ms ±13%  31.4ms ±14%    ~     (p=0.055 n=157+156)
BM_MapInsertSeq<Map<int*, int*>>/14680064              776ms ±22%   783ms ±25%    ~     (p=0.231 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1          41.2ns ± 1%  41.2ns ± 1%    ~     (p=0.516 n=155+154)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/2          46.8ns ± 1%  46.8ns ± 1%    ~     (p=0.324 n=155+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3          52.7ns ± 2%  52.7ns ± 2%    ~     (p=0.122 n=155+154)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4          58.8ns ± 3%  58.7ns ± 1%    ~     (p=0.142 n=156+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/8          82.7ns ± 4%  82.2ns ± 1%  -0.66%  (p=0.004 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16          131ns ± 4%   130ns ± 0%  -0.85%  (p=0.001 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/32          231ns ± 4%   229ns ± 0%  -0.94%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/64          727ns ± 3%   744ns ± 3%  +2.28%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/256        3.71µs ± 3%  3.79µs ± 3%  +2.24%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096       64.4µs ± 2%  66.2µs ± 2%  +2.75%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536      1.64ms ± 4%  1.68ms ± 3%  +2.11%  (p=0.000 n=149+145)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576    39.5ms ±11%  40.5ms ±13%  +2.32%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216    1.14s ±16%   1.16s ±17%    ~     (p=0.082 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/56          375ns ± 4%   371ns ± 0%  -1.03%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/224        2.35µs ± 3%  2.40µs ± 4%  +1.84%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584       42.4µs ± 2%  43.2µs ± 2%  +1.76%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344      1.15ms ± 3%  1.16ms ± 3%  +1.50%  (p=0.000 n=148+144)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504     27.4ms ± 8%  27.8ms ±10%  +1.44%  (p=0.002 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064    786ms ±19%   791ms ±20%    ~     (p=0.239 n=157+157)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1          56.3ns ± 1%  56.4ns ± 1%    ~     (p=0.363 n=151+149)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/2          69.7ns ± 0%  69.7ns ± 0%    ~     (p=0.330 n=153+150)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3          83.2ns ± 0%  83.1ns ± 0%    ~     (p=0.129 n=149+154)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4          96.5ns ± 0%  96.5ns ± 0%    ~     (p=0.619 n=153+153)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/8           154ns ± 0%   154ns ± 0%    ~     (p=0.769 n=147+149)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16          292ns ± 1%   292ns ± 1%    ~     (p=0.511 n=146+146)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/32          542ns ± 4%   542ns ± 4%    ~     (p=0.729 n=157+156)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/64         1.61µs ± 6%  1.62µs ± 4%  +0.62%  (p=0.003 n=157+156)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/256        8.60µs ± 5%  8.69µs ± 5%  +1.02%  (p=0.000 n=148+151)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096        195µs ± 3%   195µs ± 3%    ~     (p=0.668 n=154+155)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536      4.45ms ± 4%  4.45ms ± 4%    ~     (p=0.698 n=148+141)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576     126ms ±11%   130ms ±10%  +3.06%  (p=0.000 n=157+142)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216    3.28s ±12%   3.43s ± 9%  +4.72%  (p=0.000 n=157+137)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/56          942ns ± 5%   939ns ± 5%    ~     (p=0.284 n=156+157)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/224        5.82µs ± 5%  5.86µs ± 5%  +0.63%  (p=0.008 n=153+155)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584        122µs ± 4%   122µs ± 3%    ~     (p=0.822 n=156+154)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344      2.82ms ± 3%  2.82ms ± 3%    ~     (p=0.765 n=150+144)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504     79.4ms ± 7%  80.0ms ± 7%  +0.83%  (p=0.008 n=149+148)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064    2.17s ±15%   2.25s ±14%  +3.51%  (p=0.000 n=157+146)

This is simplification that allows to remove `ClearDeleted` and intended
to have similar performance.

Benchmarks are close to noise. Leaning positive on X86 and leaning
negative on ARM.

1) On Arm:
`&= ~(uint64_t{0xff} << (byte_index * 8))` -> `|= (uint64_t{tag | 0x80} << (byte_index * 8))`.
This operation is even reducing latency from 4 -> 3.
But we have another shift operation to compute the tag. Theoretically it shouldn't be slower.

2) On x86:
`new_metadata[new_index ^ old_size] = 0;` -> `new_metadata[new_index] = tag | MetadataGroup::PresentMask;`.
Here we have the same latency, but we need an extra shift to compute the tag.

Benchmarks on X86:
```
name                                                  old cpu/op   new cpu/op   delta
BM_MapInsertSeq<Map<int, int>>/1                      32.7ns ± 4%  32.8ns ± 3%     ~     (p=0.354 n=52+52)
BM_MapInsertSeq<Map<int, int>>/2                      42.3ns ± 3%  40.4ns ± 4%   -4.56%  (p=0.000 n=54+55)
BM_MapInsertSeq<Map<int, int>>/3                      50.5ns ± 3%  51.1ns ± 3%   +1.22%  (p=0.000 n=53+56)
BM_MapInsertSeq<Map<int, int>>/4                      59.9ns ± 4%  60.8ns ± 4%   +1.45%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<int, int>>/8                      98.8ns ± 4%  98.3ns ± 3%   -0.50%  (p=0.044 n=56+56)
BM_MapInsertSeq<Map<int, int>>/16                      163ns ± 4%   158ns ± 4%   -2.77%  (p=0.000 n=55+51)
BM_MapInsertSeq<Map<int, int>>/32                      281ns ± 4%   257ns ± 3%   -8.30%  (p=0.000 n=55+55)
BM_MapInsertSeq<Map<int, int>>/64                      646ns ± 4%   624ns ± 3%   -3.41%  (p=0.000 n=55+55)
BM_MapInsertSeq<Map<int, int>>/256                    2.42µs ± 4%  2.38µs ± 3%   -1.57%  (p=0.000 n=53+56)
BM_MapInsertSeq<Map<int, int>>/4096                   38.4µs ± 3%  38.6µs ± 3%     ~     (p=0.118 n=54+52)
BM_MapInsertSeq<Map<int, int>>/65536                   968µs ± 3%   972µs ± 3%     ~     (p=0.333 n=56+57)
BM_MapInsertSeq<Map<int, int>>/1048576                16.3ms ± 5%  16.3ms ± 6%     ~     (p=0.269 n=56+55)
BM_MapInsertSeq<Map<int, int>>/16777216                646ms ± 3%   646ms ± 4%     ~     (p=0.811 n=57+55)
BM_MapInsertSeq<Map<int, int>>/56                      436ns ± 3%   423ns ± 3%   -2.89%  (p=0.000 n=56+57)
BM_MapInsertSeq<Map<int, int>>/224                    1.66µs ± 4%  1.63µs ± 4%   -2.36%  (p=0.000 n=55+57)
BM_MapInsertSeq<Map<int, int>>/3584                   25.1µs ± 4%  25.4µs ± 5%   +1.03%  (p=0.002 n=55+57)
BM_MapInsertSeq<Map<int, int>>/57344                   559µs ± 3%   567µs ± 4%   +1.31%  (p=0.000 n=56+57)
BM_MapInsertSeq<Map<int, int>>/917504                 10.4ms ± 4%  10.4ms ± 4%     ~     (p=0.417 n=56+55)
BM_MapInsertSeq<Map<int, int>>/14680064                420ms ± 3%   420ms ± 3%     ~     (p=0.740 n=57+55)
BM_MapInsertSeq<Map<int*, int*>>/1                    33.8ns ± 3%  33.9ns ± 3%     ~     (p=0.191 n=55+56)
BM_MapInsertSeq<Map<int*, int*>>/2                    37.0ns ± 3%  37.0ns ± 4%     ~     (p=0.948 n=55+56)
BM_MapInsertSeq<Map<int*, int*>>/3                    41.5ns ± 4%  41.6ns ± 4%     ~     (p=0.503 n=56+56)
BM_MapInsertSeq<Map<int*, int*>>/4                    46.1ns ± 4%  47.2ns ± 4%   +2.31%  (p=0.000 n=56+56)
BM_MapInsertSeq<Map<int*, int*>>/8                    63.6ns ± 4%  65.4ns ± 3%   +2.89%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/16                    133ns ± 4%   128ns ± 4%   -3.74%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/32                    237ns ± 3%   236ns ± 4%     ~     (p=0.082 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/64                    597ns ± 3%   631ns ± 3%   +5.73%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/256                  2.78µs ± 3%  2.89µs ± 4%   +3.82%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/4096                 51.9µs ± 3%  54.2µs ± 4%   +4.46%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/65536                1.16ms ± 3%  1.18ms ± 4%   +1.10%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/1048576              28.9ms ± 4%  29.0ms ± 4%     ~     (p=0.549 n=57+54)
BM_MapInsertSeq<Map<int*, int*>>/16777216              914ms ± 3%   912ms ± 3%     ~     (p=0.147 n=57+54)
BM_MapInsertSeq<Map<int*, int*>>/56                    366ns ± 3%   407ns ± 3%  +11.27%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/224                  1.86µs ± 4%  1.96µs ± 3%   +5.18%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/3584                 33.9µs ± 4%  35.9µs ± 3%   +6.04%  (p=0.000 n=57+52)
BM_MapInsertSeq<Map<int*, int*>>/57344                 763µs ± 3%   772µs ± 4%   +1.25%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/917504               16.8ms ±11%  16.6ms ± 6%     ~     (p=0.454 n=57+52)
BM_MapInsertSeq<Map<int*, int*>>/14680064              610ms ± 2%   608ms ± 3%     ~     (p=0.127 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1          34.6ns ± 4%  34.5ns ± 3%     ~     (p=0.331 n=56+52)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/2          43.9ns ± 4%  37.8ns ± 3%  -13.78%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3          47.9ns ± 4%  48.9ns ± 3%   +2.16%  (p=0.000 n=56+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4          53.6ns ± 3%  54.0ns ± 3%   +0.72%  (p=0.012 n=56+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/8          84.6ns ± 4%  81.0ns ± 3%   -4.21%  (p=0.000 n=57+50)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16          142ns ± 3%   136ns ± 3%   -4.22%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/32          258ns ± 4%   245ns ± 3%   -4.96%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/64          699ns ± 4%   692ns ± 3%   -1.02%  (p=0.000 n=50+50)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/256        2.96µs ± 6%  2.97µs ± 4%     ~     (p=0.098 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096       50.4µs ± 3%  50.7µs ± 2%   +0.53%  (p=0.030 n=54+53)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536      1.38ms ± 2%  1.42ms ± 3%   +2.76%  (p=0.000 n=54+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576    37.8ms ± 4%  38.1ms ± 4%   +0.78%  (p=0.016 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216    1.19s ± 2%   1.20s ± 3%     ~     (p=0.055 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/56          430ns ± 3%   413ns ± 3%   -4.10%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/224        1.94µs ± 8%  1.93µs ± 6%     ~     (p=0.587 n=57+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584       32.1µs ± 4%  32.2µs ± 3%     ~     (p=0.403 n=56+57)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344       746µs ± 3%   760µs ± 3%   +1.93%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504     22.0ms ± 6%  22.2ms ± 5%   +0.90%  (p=0.025 n=57+53)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064    734ms ± 2%   736ms ± 3%     ~     (p=0.107 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1          40.8ns ± 4%  40.8ns ± 4%     ~     (p=0.685 n=54+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/2          49.5ns ± 4%  49.6ns ± 7%     ~     (p=0.927 n=56+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3          58.1ns ± 4%  58.2ns ± 4%     ~     (p=0.632 n=55+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4          66.7ns ± 4%  67.0ns ± 5%     ~     (p=0.235 n=55+56)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/8           104ns ± 6%   105ns ± 6%     ~     (p=0.171 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16          196ns ± 5%   197ns ± 6%     ~     (p=0.189 n=54+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/32          361ns ± 8%   364ns ± 7%     ~     (p=0.156 n=53+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/64         1.05µs ± 6%  1.04µs ± 6%     ~     (p=0.069 n=56+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/256        5.32µs ± 5%  5.24µs ± 4%   -1.46%  (p=0.000 n=57+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096        148µs ± 4%   147µs ± 4%   -0.65%  (p=0.028 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536      3.17ms ± 3%  3.13ms ± 2%   -1.49%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576    99.1ms ± 3%  98.3ms ± 3%   -0.89%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216    2.40s ± 3%   2.40s ± 2%     ~     (p=0.247 n=56+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/56          612ns ± 6%   612ns ± 8%     ~     (p=0.968 n=52+56)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/224        3.56µs ± 7%  3.51µs ± 4%   -1.40%  (p=0.006 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584       87.3µs ± 5%  88.5µs ± 6%   +1.30%  (p=0.011 n=57+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344      1.95ms ± 3%  1.93ms ± 4%   -0.77%  (p=0.000 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504     61.2ms ± 4%  61.0ms ± 3%     ~     (p=0.059 n=57+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064    1.50s ± 3%   1.50s ± 3%     ~     (p=0.609 n=57+55)
```

Benchmarks on ARM:
```
name                                                  old cpu/op   new cpu/op   delta
BM_MapInsertSeq<Map<int, int>>/1                      39.6ns ± 1%  39.6ns ± 1%    ~     (p=0.416 n=155+156)
BM_MapInsertSeq<Map<int, int>>/2                      44.6ns ± 1%  44.6ns ± 1%    ~     (p=0.993 n=155+157)
BM_MapInsertSeq<Map<int, int>>/3                      50.1ns ± 3%  50.1ns ± 2%    ~     (p=0.585 n=156+157)
BM_MapInsertSeq<Map<int, int>>/4                      55.7ns ± 3%  55.4ns ± 1%  -0.61%  (p=0.000 n=156+119)
BM_MapInsertSeq<Map<int, int>>/8                      78.0ns ± 5%  77.2ns ± 1%  -1.01%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/16                      123ns ± 5%   121ns ± 0%  -1.32%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/32                      216ns ± 6%   213ns ± 0%  -1.51%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/64                      644ns ± 4%   659ns ± 3%  +2.19%  (p=0.000 n=157+146)
BM_MapInsertSeq<Map<int, int>>/256                    3.22µs ± 4%  3.30µs ± 5%  +2.56%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/4096                   53.8µs ± 3%  55.5µs ± 2%  +3.18%  (p=0.000 n=157+144)
BM_MapInsertSeq<Map<int, int>>/65536                  1.29ms ± 3%  1.33ms ± 4%  +2.97%  (p=0.000 n=155+155)
BM_MapInsertSeq<Map<int, int>>/1048576                30.2ms ± 8%  31.0ms ± 9%  +2.58%  (p=0.000 n=157+156)
BM_MapInsertSeq<Map<int, int>>/16777216                902ms ±14%   924ms ±17%  +2.47%  (p=0.005 n=157+157)
BM_MapInsertSeq<Map<int, int>>/56                      350ns ± 6%   345ns ± 0%  -1.64%  (p=0.023 n=157+119)
BM_MapInsertSeq<Map<int, int>>/224                    2.09µs ± 5%  2.14µs ± 5%  +2.01%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/3584                   36.6µs ± 4%  37.3µs ± 3%  +1.76%  (p=0.000 n=157+142)
BM_MapInsertSeq<Map<int, int>>/57344                   903µs ± 5%   923µs ± 4%  +2.23%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/917504                 21.7ms ± 8%  22.0ms ± 8%  +1.52%  (p=0.000 n=157+153)
BM_MapInsertSeq<Map<int, int>>/14680064                678ms ±18%   692ms ±18%  +2.03%  (p=0.037 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/1                    40.7ns ± 1%  40.7ns ± 1%    ~     (p=0.216 n=152+153)
BM_MapInsertSeq<Map<int*, int*>>/2                    45.5ns ± 1%  45.5ns ± 1%    ~     (p=0.403 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/3                    51.1ns ± 1%  51.1ns ± 1%    ~     (p=0.962 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/4                    57.5ns ± 4%  59.7ns ± 1%  +3.86%  (p=0.000 n=157+149)
BM_MapInsertSeq<Map<int*, int*>>/8                    80.3ns ± 4%  82.3ns ± 1%  +2.52%  (p=0.000 n=157+108)
BM_MapInsertSeq<Map<int*, int*>>/16                    127ns ± 4%   129ns ± 4%  +1.57%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/32                    229ns ± 4%   230ns ± 4%  +0.49%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/64                    695ns ± 3%   705ns ± 4%  +1.37%  (p=0.000 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/256                  3.64µs ± 7%  3.46µs ± 3%  -4.93%  (p=0.000 n=157+154)
BM_MapInsertSeq<Map<int*, int*>>/4096                 58.8µs ± 2%  60.8µs ± 2%  +3.46%  (p=0.000 n=157+148)
BM_MapInsertSeq<Map<int*, int*>>/65536                1.14ms ± 2%  1.17ms ± 3%  +3.23%  (p=0.000 n=156+157)
BM_MapInsertSeq<Map<int*, int*>>/1048576              41.4ms ±12%  42.3ms ±13%  +2.13%  (p=0.001 n=157+156)
BM_MapInsertSeq<Map<int*, int*>>/16777216              1.10s ±20%   1.12s ±22%    ~     (p=0.082 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/56                    377ns ± 5%   372ns ± 7%  -1.19%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/224                  2.40µs ± 9%  2.23µs ± 4%  -6.97%  (p=0.000 n=157+155)
BM_MapInsertSeq<Map<int*, int*>>/3584                 39.6µs ± 3%  40.0µs ± 2%  +1.01%  (p=0.000 n=157+121)
BM_MapInsertSeq<Map<int*, int*>>/57344                 776µs ± 3%   795µs ± 3%  +2.40%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/917504               31.0ms ±13%  31.4ms ±14%    ~     (p=0.055 n=157+156)
BM_MapInsertSeq<Map<int*, int*>>/14680064              776ms ±22%   783ms ±25%    ~     (p=0.231 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1          41.2ns ± 1%  41.2ns ± 1%    ~     (p=0.516 n=155+154)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/2          46.8ns ± 1%  46.8ns ± 1%    ~     (p=0.324 n=155+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3          52.7ns ± 2%  52.7ns ± 2%    ~     (p=0.122 n=155+154)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4          58.8ns ± 3%  58.7ns ± 1%    ~     (p=0.142 n=156+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/8          82.7ns ± 4%  82.2ns ± 1%  -0.66%  (p=0.004 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16          131ns ± 4%   130ns ± 0%  -0.85%  (p=0.001 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/32          231ns ± 4%   229ns ± 0%  -0.94%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/64          727ns ± 3%   744ns ± 3%  +2.28%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/256        3.71µs ± 3%  3.79µs ± 3%  +2.24%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096       64.4µs ± 2%  66.2µs ± 2%  +2.75%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536      1.64ms ± 4%  1.68ms ± 3%  +2.11%  (p=0.000 n=149+145)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576    39.5ms ±11%  40.5ms ±13%  +2.32%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216    1.14s ±16%   1.16s ±17%    ~     (p=0.082 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/56          375ns ± 4%   371ns ± 0%  -1.03%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/224        2.35µs ± 3%  2.40µs ± 4%  +1.84%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584       42.4µs ± 2%  43.2µs ± 2%  +1.76%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344      1.15ms ± 3%  1.16ms ± 3%  +1.50%  (p=0.000 n=148+144)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504     27.4ms ± 8%  27.8ms ±10%  +1.44%  (p=0.002 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064    786ms ±19%   791ms ±20%    ~     (p=0.239 n=157+157)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1          56.3ns ± 1%  56.4ns ± 1%    ~     (p=0.363 n=151+149)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/2          69.7ns ± 0%  69.7ns ± 0%    ~     (p=0.330 n=153+150)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3          83.2ns ± 0%  83.1ns ± 0%    ~     (p=0.129 n=149+154)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4          96.5ns ± 0%  96.5ns ± 0%    ~     (p=0.619 n=153+153)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/8           154ns ± 0%   154ns ± 0%    ~     (p=0.769 n=147+149)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16          292ns ± 1%   292ns ± 1%    ~     (p=0.511 n=146+146)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/32          542ns ± 4%   542ns ± 4%    ~     (p=0.729 n=157+156)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/64         1.61µs ± 6%  1.62µs ± 4%  +0.62%  (p=0.003 n=157+156)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/256        8.60µs ± 5%  8.69µs ± 5%  +1.02%  (p=0.000 n=148+151)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096        195µs ± 3%   195µs ± 3%    ~     (p=0.668 n=154+155)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536      4.45ms ± 4%  4.45ms ± 4%    ~     (p=0.698 n=148+141)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576     126ms ±11%   130ms ±10%  +3.06%  (p=0.000 n=157+142)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216    3.28s ±12%   3.43s ± 9%  +4.72%  (p=0.000 n=157+137)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/56          942ns ± 5%   939ns ± 5%    ~     (p=0.284 n=156+157)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/224        5.82µs ± 5%  5.86µs ± 5%  +0.63%  (p=0.008 n=153+155)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584        122µs ± 4%   122µs ± 3%    ~     (p=0.822 n=156+154)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344      2.82ms ± 3%  2.82ms ± 3%    ~     (p=0.765 n=150+144)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504     79.4ms ± 7%  80.0ms ± 7%  +0.83%  (p=0.008 n=149+148)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064    2.17s ±15%   2.25s ±14%  +3.51%  (p=0.000 n=157+146)
```
@github-actions github-actions bot requested a review from zygoloid January 3, 2025 13:06
@chandlerc chandlerc self-requested a review January 3, 2025 23:50
Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being slow here -- I've been sick for over a week now...

Looking at this, I'm a bit worried about the regression on ARM CPUs... I think there is a real regression there.

For example, on my M1, I'm seeing:

----------------------------------------------------------------------------------------------------------
Benchmark                                              CPU   Iterations     CYCLES INSTRUCTIONS    KeyRate
----------------------------------------------------------------------------------------------------------
BM_MapInsertSeq<Map<int, int>>/4096               29878 ns        23383   95.6939k      460.51k 137.092M/s
BM_MapInsertSeq<Map<int, int>>/65536             675704 ns         1023   2.16291M     7.37218M 96.9892M/s
BM_MapInsertSeq<Map<int, int>>/1048576         11951256 ns           59    38.268M     114.816M 87.7377M/s
BM_MapInsertSeq<Map<int, int>>/16777216       442136773 ns            2   1.36822G     1.89526G 37.9458M/s

Vs. without this patch:

----------------------------------------------------------------------------------------------------------
Benchmark                                              CPU   Iterations     CYCLES INSTRUCTIONS    KeyRate
----------------------------------------------------------------------------------------------------------
BM_MapInsertSeq<Map<int, int>>/4096               29342 ns        23815   93.9735k     451.537k 139.593M/s
BM_MapInsertSeq<Map<int, int>>/65536             668646 ns         1035   2.13683M     7.19312M  98.013M/s
BM_MapInsertSeq<Map<int, int>>/1048576         12345606 ns           57   39.5332M     113.325M 84.9352M/s
BM_MapInsertSeq<Map<int, int>>/16777216       436957105 ns            2   1.35579G      1.8614G 38.3956M/s

I've zoomed in on the relevant columns. Note the higher instruction count without this patch consistently. This is particularly surprising with the <int, int> map because that one should be entirely deterministic -- there shouldn't be any noise or fluctuations here outside of things like context switches, and so the difference is very likely specific to this change.

And it doesn't seem to be that the latency or throughput is improved, as the cycle count for these benchmarks also seems to consistently regress with this change.

I've not yet had a chance to specifically look at the instruction trace of the grow routine before/after to try and spot why this doesn't work out as a win on ARM, but I think that's the next step here.

Comment on lines 1455 to 1464
// Toggle the newly added bit of the index to get to the other possible
// target index.
if constexpr (MetadataGroup::FastByteClear) {
(new_index == old_hash_index ? high_g : low_g).ClearByte(byte_index);
if constexpr (MetadataGroup::FastByteSet) {
(new_index == old_hash_index ? low_g : high_g)
.OverwriteEmptyByte(byte_index, tag | MetadataGroup::PresentMask);
new_index += byte_index;
} else {
new_index += byte_index;
new_metadata[new_index ^ old_size] = MetadataGroup::Empty;
new_metadata[new_index] = tag | MetadataGroup::PresentMask;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into the assembly on ARM and I think I see why this regresses.

The code for this operation is quite different before and after. I think I have isolated the correct instruction sequence (but my debug info is broken at the moment on Linux ARM, so having to guess a bit). These are extracted from running InsertSeq<Map<int, int>> up to 1M size under perf. The cycle sample counts are shown for the snippet -- these aren't directly comparable even though the # of iterations was fixed because the sampling could have randomly been different. But the relative weight from one instruction to the next should be.

Before:

   1.15 │       and  x8, x21, x8
        │       and  x9, x24, #0x38
        │       lsl  x9, x17, x9
        │       cmp  x8, x25
        │       csel x8, x20, x26, eq
   1.06 │       bic  x8, x8, x9
   0.49 │       csel x20, x8, x20, eq
  25.51 │       csel x26, x26, x8, eq

The byte clear is done with the lsl instruction (moving 0xff stored in x16 to the byte position to clear), and bic instruction that does x8 ANDN x9 is essence. But cleverly, x8 is whichever of high_g or low_g this if selects, and the result is moved back with the next two csels. At least on the M1, this also has pretty excellent ILP.

After:

   0.96 │       mov   w10, #0x80
        │       and   x9, x27, x9
        │       and   x8, x8, #0x38
        │       bfxil x10, x24, #0, #7
        │       cmp   x9, x25
        │       csel  x9, x20, x22, eq
        │       lsl   x8, x10, x8
  30.47 │       orr   x8, x9, x8
   3.12 │       csel  x22, x22, x8, eq
   2.95 │       csel  x20, x8, x20, eq

First, we unfortunately need to materialize the 0x80 used to mark the tag as present here, inside the loop. And then the code sequence to form the present is longer: bfxil + lsl + orr instead of just lsl + bic to clear. Not sure if the samples landing on orr represent less ability to pipeline or not, would need to put this into llvm-mca to really figure that out, but I've not done the work to thread it into that tool. Even if not a specific ILP problem, it adds up to 4 instructions in the loop where the old code got away with 2 instructions.

Anyways, it does seem to be a non-trivial regression on ARM, although probably very specifically because of how the method is called. This is one of the annoying things about the metadata group operations -- their performance is so deeply dependent on the context in which they are used and how the compiler ends up lowering things.

Hope this detail helps though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants