OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754

goldvitaly · 2025-01-03T13:05:42Z

This is simplification that allows to remove ClearDeleted and intended
to have similar performance.

Benchmarks are close to noise. Leaning positive on X86 and leaning
negative on ARM.

On Arm:
&= ~(uint64_t{0xff} << (byte_index * 8)) -> |= (uint64_t{tag | 0x80} << (byte_index * 8)).
This operation is even reducing latency from 4 -> 3.
But we have another shift operation to compute the tag. Theoretically it shouldn't be slower.
On x86:
new_metadata[new_index ^ old_size] = 0; -> new_metadata[new_index] = tag | MetadataGroup::PresentMask;.
Here we have the same latency, but we need an extra shift to compute the tag.

Benchmarks on X86:

name                                                  old cpu/op   new cpu/op   delta
BM_MapInsertSeq<Map<int, int>>/1                      32.7ns ± 4%  32.8ns ± 3%     ~     (p=0.354 n=52+52)
BM_MapInsertSeq<Map<int, int>>/2                      42.3ns ± 3%  40.4ns ± 4%   -4.56%  (p=0.000 n=54+55)
BM_MapInsertSeq<Map<int, int>>/3                      50.5ns ± 3%  51.1ns ± 3%   +1.22%  (p=0.000 n=53+56)
BM_MapInsertSeq<Map<int, int>>/4                      59.9ns ± 4%  60.8ns ± 4%   +1.45%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<int, int>>/8                      98.8ns ± 4%  98.3ns ± 3%   -0.50%  (p=0.044 n=56+56)
BM_MapInsertSeq<Map<int, int>>/16                      163ns ± 4%   158ns ± 4%   -2.77%  (p=0.000 n=55+51)
BM_MapInsertSeq<Map<int, int>>/32                      281ns ± 4%   257ns ± 3%   -8.30%  (p=0.000 n=55+55)
BM_MapInsertSeq<Map<int, int>>/64                      646ns ± 4%   624ns ± 3%   -3.41%  (p=0.000 n=55+55)
BM_MapInsertSeq<Map<int, int>>/256                    2.42µs ± 4%  2.38µs ± 3%   -1.57%  (p=0.000 n=53+56)
BM_MapInsertSeq<Map<int, int>>/4096                   38.4µs ± 3%  38.6µs ± 3%     ~     (p=0.118 n=54+52)
BM_MapInsertSeq<Map<int, int>>/65536                   968µs ± 3%   972µs ± 3%     ~     (p=0.333 n=56+57)
BM_MapInsertSeq<Map<int, int>>/1048576                16.3ms ± 5%  16.3ms ± 6%     ~     (p=0.269 n=56+55)
BM_MapInsertSeq<Map<int, int>>/16777216                646ms ± 3%   646ms ± 4%     ~     (p=0.811 n=57+55)
BM_MapInsertSeq<Map<int, int>>/56                      436ns ± 3%   423ns ± 3%   -2.89%  (p=0.000 n=56+57)
BM_MapInsertSeq<Map<int, int>>/224                    1.66µs ± 4%  1.63µs ± 4%   -2.36%  (p=0.000 n=55+57)
BM_MapInsertSeq<Map<int, int>>/3584                   25.1µs ± 4%  25.4µs ± 5%   +1.03%  (p=0.002 n=55+57)
BM_MapInsertSeq<Map<int, int>>/57344                   559µs ± 3%   567µs ± 4%   +1.31%  (p=0.000 n=56+57)
BM_MapInsertSeq<Map<int, int>>/917504                 10.4ms ± 4%  10.4ms ± 4%     ~     (p=0.417 n=56+55)
BM_MapInsertSeq<Map<int, int>>/14680064                420ms ± 3%   420ms ± 3%     ~     (p=0.740 n=57+55)
BM_MapInsertSeq<Map<int*, int*>>/1                    33.8ns ± 3%  33.9ns ± 3%     ~     (p=0.191 n=55+56)
BM_MapInsertSeq<Map<int*, int*>>/2                    37.0ns ± 3%  37.0ns ± 4%     ~     (p=0.948 n=55+56)
BM_MapInsertSeq<Map<int*, int*>>/3                    41.5ns ± 4%  41.6ns ± 4%     ~     (p=0.503 n=56+56)
BM_MapInsertSeq<Map<int*, int*>>/4                    46.1ns ± 4%  47.2ns ± 4%   +2.31%  (p=0.000 n=56+56)
BM_MapInsertSeq<Map<int*, int*>>/8                    63.6ns ± 4%  65.4ns ± 3%   +2.89%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/16                    133ns ± 4%   128ns ± 4%   -3.74%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/32                    237ns ± 3%   236ns ± 4%     ~     (p=0.082 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/64                    597ns ± 3%   631ns ± 3%   +5.73%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/256                  2.78µs ± 3%  2.89µs ± 4%   +3.82%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/4096                 51.9µs ± 3%  54.2µs ± 4%   +4.46%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/65536                1.16ms ± 3%  1.18ms ± 4%   +1.10%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/1048576              28.9ms ± 4%  29.0ms ± 4%     ~     (p=0.549 n=57+54)
BM_MapInsertSeq<Map<int*, int*>>/16777216              914ms ± 3%   912ms ± 3%     ~     (p=0.147 n=57+54)
BM_MapInsertSeq<Map<int*, int*>>/56                    366ns ± 3%   407ns ± 3%  +11.27%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int*, int*>>/224                  1.86µs ± 4%  1.96µs ± 3%   +5.18%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/3584                 33.9µs ± 4%  35.9µs ± 3%   +6.04%  (p=0.000 n=57+52)
BM_MapInsertSeq<Map<int*, int*>>/57344                 763µs ± 3%   772µs ± 4%   +1.25%  (p=0.000 n=57+57)
BM_MapInsertSeq<Map<int*, int*>>/917504               16.8ms ±11%  16.6ms ± 6%     ~     (p=0.454 n=57+52)
BM_MapInsertSeq<Map<int*, int*>>/14680064              610ms ± 2%   608ms ± 3%     ~     (p=0.127 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1          34.6ns ± 4%  34.5ns ± 3%     ~     (p=0.331 n=56+52)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/2          43.9ns ± 4%  37.8ns ± 3%  -13.78%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3          47.9ns ± 4%  48.9ns ± 3%   +2.16%  (p=0.000 n=56+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4          53.6ns ± 3%  54.0ns ± 3%   +0.72%  (p=0.012 n=56+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/8          84.6ns ± 4%  81.0ns ± 3%   -4.21%  (p=0.000 n=57+50)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16          142ns ± 3%   136ns ± 3%   -4.22%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/32          258ns ± 4%   245ns ± 3%   -4.96%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/64          699ns ± 4%   692ns ± 3%   -1.02%  (p=0.000 n=50+50)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/256        2.96µs ± 6%  2.97µs ± 4%     ~     (p=0.098 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096       50.4µs ± 3%  50.7µs ± 2%   +0.53%  (p=0.030 n=54+53)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536      1.38ms ± 2%  1.42ms ± 3%   +2.76%  (p=0.000 n=54+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576    37.8ms ± 4%  38.1ms ± 4%   +0.78%  (p=0.016 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216    1.19s ± 2%   1.20s ± 3%     ~     (p=0.055 n=57+54)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/56          430ns ± 3%   413ns ± 3%   -4.10%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/224        1.94µs ± 8%  1.93µs ± 6%     ~     (p=0.587 n=57+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584       32.1µs ± 4%  32.2µs ± 3%     ~     (p=0.403 n=56+57)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344       746µs ± 3%   760µs ± 3%   +1.93%  (p=0.000 n=57+56)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504     22.0ms ± 6%  22.2ms ± 5%   +0.90%  (p=0.025 n=57+53)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064    734ms ± 2%   736ms ± 3%     ~     (p=0.107 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1          40.8ns ± 4%  40.8ns ± 4%     ~     (p=0.685 n=54+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/2          49.5ns ± 4%  49.6ns ± 7%     ~     (p=0.927 n=56+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3          58.1ns ± 4%  58.2ns ± 4%     ~     (p=0.632 n=55+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4          66.7ns ± 4%  67.0ns ± 5%     ~     (p=0.235 n=55+56)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/8           104ns ± 6%   105ns ± 6%     ~     (p=0.171 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16          196ns ± 5%   197ns ± 6%     ~     (p=0.189 n=54+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/32          361ns ± 8%   364ns ± 7%     ~     (p=0.156 n=53+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/64         1.05µs ± 6%  1.04µs ± 6%     ~     (p=0.069 n=56+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/256        5.32µs ± 5%  5.24µs ± 4%   -1.46%  (p=0.000 n=57+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096        148µs ± 4%   147µs ± 4%   -0.65%  (p=0.028 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536      3.17ms ± 3%  3.13ms ± 2%   -1.49%  (p=0.000 n=54+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576    99.1ms ± 3%  98.3ms ± 3%   -0.89%  (p=0.000 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216    2.40s ± 3%   2.40s ± 2%     ~     (p=0.247 n=56+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/56          612ns ± 6%   612ns ± 8%     ~     (p=0.968 n=52+56)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/224        3.56µs ± 7%  3.51µs ± 4%   -1.40%  (p=0.006 n=57+55)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584       87.3µs ± 5%  88.5µs ± 6%   +1.30%  (p=0.011 n=57+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344      1.95ms ± 3%  1.93ms ± 4%   -0.77%  (p=0.000 n=55+57)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504     61.2ms ± 4%  61.0ms ± 3%     ~     (p=0.059 n=57+54)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064    1.50s ± 3%   1.50s ± 3%     ~     (p=0.609 n=57+55)

Benchmarks on ARM:

name                                                  old cpu/op   new cpu/op   delta
BM_MapInsertSeq<Map<int, int>>/1                      39.6ns ± 1%  39.6ns ± 1%    ~     (p=0.416 n=155+156)
BM_MapInsertSeq<Map<int, int>>/2                      44.6ns ± 1%  44.6ns ± 1%    ~     (p=0.993 n=155+157)
BM_MapInsertSeq<Map<int, int>>/3                      50.1ns ± 3%  50.1ns ± 2%    ~     (p=0.585 n=156+157)
BM_MapInsertSeq<Map<int, int>>/4                      55.7ns ± 3%  55.4ns ± 1%  -0.61%  (p=0.000 n=156+119)
BM_MapInsertSeq<Map<int, int>>/8                      78.0ns ± 5%  77.2ns ± 1%  -1.01%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/16                      123ns ± 5%   121ns ± 0%  -1.32%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/32                      216ns ± 6%   213ns ± 0%  -1.51%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, int>>/64                      644ns ± 4%   659ns ± 3%  +2.19%  (p=0.000 n=157+146)
BM_MapInsertSeq<Map<int, int>>/256                    3.22µs ± 4%  3.30µs ± 5%  +2.56%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/4096                   53.8µs ± 3%  55.5µs ± 2%  +3.18%  (p=0.000 n=157+144)
BM_MapInsertSeq<Map<int, int>>/65536                  1.29ms ± 3%  1.33ms ± 4%  +2.97%  (p=0.000 n=155+155)
BM_MapInsertSeq<Map<int, int>>/1048576                30.2ms ± 8%  31.0ms ± 9%  +2.58%  (p=0.000 n=157+156)
BM_MapInsertSeq<Map<int, int>>/16777216                902ms ±14%   924ms ±17%  +2.47%  (p=0.005 n=157+157)
BM_MapInsertSeq<Map<int, int>>/56                      350ns ± 6%   345ns ± 0%  -1.64%  (p=0.023 n=157+119)
BM_MapInsertSeq<Map<int, int>>/224                    2.09µs ± 5%  2.14µs ± 5%  +2.01%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/3584                   36.6µs ± 4%  37.3µs ± 3%  +1.76%  (p=0.000 n=157+142)
BM_MapInsertSeq<Map<int, int>>/57344                   903µs ± 5%   923µs ± 4%  +2.23%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, int>>/917504                 21.7ms ± 8%  22.0ms ± 8%  +1.52%  (p=0.000 n=157+153)
BM_MapInsertSeq<Map<int, int>>/14680064                678ms ±18%   692ms ±18%  +2.03%  (p=0.037 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/1                    40.7ns ± 1%  40.7ns ± 1%    ~     (p=0.216 n=152+153)
BM_MapInsertSeq<Map<int*, int*>>/2                    45.5ns ± 1%  45.5ns ± 1%    ~     (p=0.403 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/3                    51.1ns ± 1%  51.1ns ± 1%    ~     (p=0.962 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/4                    57.5ns ± 4%  59.7ns ± 1%  +3.86%  (p=0.000 n=157+149)
BM_MapInsertSeq<Map<int*, int*>>/8                    80.3ns ± 4%  82.3ns ± 1%  +2.52%  (p=0.000 n=157+108)
BM_MapInsertSeq<Map<int*, int*>>/16                    127ns ± 4%   129ns ± 4%  +1.57%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/32                    229ns ± 4%   230ns ± 4%  +0.49%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/64                    695ns ± 3%   705ns ± 4%  +1.37%  (p=0.000 n=155+156)
BM_MapInsertSeq<Map<int*, int*>>/256                  3.64µs ± 7%  3.46µs ± 3%  -4.93%  (p=0.000 n=157+154)
BM_MapInsertSeq<Map<int*, int*>>/4096                 58.8µs ± 2%  60.8µs ± 2%  +3.46%  (p=0.000 n=157+148)
BM_MapInsertSeq<Map<int*, int*>>/65536                1.14ms ± 2%  1.17ms ± 3%  +3.23%  (p=0.000 n=156+157)
BM_MapInsertSeq<Map<int*, int*>>/1048576              41.4ms ±12%  42.3ms ±13%  +2.13%  (p=0.001 n=157+156)
BM_MapInsertSeq<Map<int*, int*>>/16777216              1.10s ±20%   1.12s ±22%    ~     (p=0.082 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/56                    377ns ± 5%   372ns ± 7%  -1.19%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/224                  2.40µs ± 9%  2.23µs ± 4%  -6.97%  (p=0.000 n=157+155)
BM_MapInsertSeq<Map<int*, int*>>/3584                 39.6µs ± 3%  40.0µs ± 2%  +1.01%  (p=0.000 n=157+121)
BM_MapInsertSeq<Map<int*, int*>>/57344                 776µs ± 3%   795µs ± 3%  +2.40%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int*, int*>>/917504               31.0ms ±13%  31.4ms ±14%    ~     (p=0.055 n=157+156)
BM_MapInsertSeq<Map<int*, int*>>/14680064              776ms ±22%   783ms ±25%    ~     (p=0.231 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1          41.2ns ± 1%  41.2ns ± 1%    ~     (p=0.516 n=155+154)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/2          46.8ns ± 1%  46.8ns ± 1%    ~     (p=0.324 n=155+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3          52.7ns ± 2%  52.7ns ± 2%    ~     (p=0.122 n=155+154)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4          58.8ns ± 3%  58.7ns ± 1%    ~     (p=0.142 n=156+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/8          82.7ns ± 4%  82.2ns ± 1%  -0.66%  (p=0.004 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16          131ns ± 4%   130ns ± 0%  -0.85%  (p=0.001 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/32          231ns ± 4%   229ns ± 0%  -0.94%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/64          727ns ± 3%   744ns ± 3%  +2.28%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/256        3.71µs ± 3%  3.79µs ± 3%  +2.24%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096       64.4µs ± 2%  66.2µs ± 2%  +2.75%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536      1.64ms ± 4%  1.68ms ± 3%  +2.11%  (p=0.000 n=149+145)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576    39.5ms ±11%  40.5ms ±13%  +2.32%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216    1.14s ±16%   1.16s ±17%    ~     (p=0.082 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/56          375ns ± 4%   371ns ± 0%  -1.03%  (p=0.000 n=157+119)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/224        2.35µs ± 3%  2.40µs ± 4%  +1.84%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584       42.4µs ± 2%  43.2µs ± 2%  +1.76%  (p=0.000 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344      1.15ms ± 3%  1.16ms ± 3%  +1.50%  (p=0.000 n=148+144)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504     27.4ms ± 8%  27.8ms ±10%  +1.44%  (p=0.002 n=157+157)
BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064    786ms ±19%   791ms ±20%    ~     (p=0.239 n=157+157)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1          56.3ns ± 1%  56.4ns ± 1%    ~     (p=0.363 n=151+149)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/2          69.7ns ± 0%  69.7ns ± 0%    ~     (p=0.330 n=153+150)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3          83.2ns ± 0%  83.1ns ± 0%    ~     (p=0.129 n=149+154)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4          96.5ns ± 0%  96.5ns ± 0%    ~     (p=0.619 n=153+153)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/8           154ns ± 0%   154ns ± 0%    ~     (p=0.769 n=147+149)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16          292ns ± 1%   292ns ± 1%    ~     (p=0.511 n=146+146)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/32          542ns ± 4%   542ns ± 4%    ~     (p=0.729 n=157+156)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/64         1.61µs ± 6%  1.62µs ± 4%  +0.62%  (p=0.003 n=157+156)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/256        8.60µs ± 5%  8.69µs ± 5%  +1.02%  (p=0.000 n=148+151)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096        195µs ± 3%   195µs ± 3%    ~     (p=0.668 n=154+155)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536      4.45ms ± 4%  4.45ms ± 4%    ~     (p=0.698 n=148+141)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576     126ms ±11%   130ms ±10%  +3.06%  (p=0.000 n=157+142)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216    3.28s ±12%   3.43s ± 9%  +4.72%  (p=0.000 n=157+137)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/56          942ns ± 5%   939ns ± 5%    ~     (p=0.284 n=156+157)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/224        5.82µs ± 5%  5.86µs ± 5%  +0.63%  (p=0.008 n=153+155)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584        122µs ± 4%   122µs ± 3%    ~     (p=0.822 n=156+154)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344      2.82ms ± 3%  2.82ms ± 3%    ~     (p=0.765 n=150+144)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504     79.4ms ± 7%  80.0ms ± 7%  +0.83%  (p=0.008 n=149+148)
BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064    2.17s ±15%   2.25s ±14%  +3.51%  (p=0.000 n=157+146)

This is simplification that allows to remove `ClearDeleted` and intended to have similar performance. Benchmarks are close to noise. Leaning positive on X86 and leaning negative on ARM. 1) On Arm: `&= ~(uint64_t{0xff} << (byte_index * 8))` -> `|= (uint64_t{tag | 0x80} << (byte_index * 8))`. This operation is even reducing latency from 4 -> 3. But we have another shift operation to compute the tag. Theoretically it shouldn't be slower. 2) On x86: `new_metadata[new_index ^ old_size] = 0;` -> `new_metadata[new_index] = tag | MetadataGroup::PresentMask;`. Here we have the same latency, but we need an extra shift to compute the tag. Benchmarks on X86: ``` name old cpu/op new cpu/op delta BM_MapInsertSeq<Map<int, int>>/1 32.7ns ± 4% 32.8ns ± 3% ~ (p=0.354 n=52+52) BM_MapInsertSeq<Map<int, int>>/2 42.3ns ± 3% 40.4ns ± 4% -4.56% (p=0.000 n=54+55) BM_MapInsertSeq<Map<int, int>>/3 50.5ns ± 3% 51.1ns ± 3% +1.22% (p=0.000 n=53+56) BM_MapInsertSeq<Map<int, int>>/4 59.9ns ± 4% 60.8ns ± 4% +1.45% (p=0.000 n=54+57) BM_MapInsertSeq<Map<int, int>>/8 98.8ns ± 4% 98.3ns ± 3% -0.50% (p=0.044 n=56+56) BM_MapInsertSeq<Map<int, int>>/16 163ns ± 4% 158ns ± 4% -2.77% (p=0.000 n=55+51) BM_MapInsertSeq<Map<int, int>>/32 281ns ± 4% 257ns ± 3% -8.30% (p=0.000 n=55+55) BM_MapInsertSeq<Map<int, int>>/64 646ns ± 4% 624ns ± 3% -3.41% (p=0.000 n=55+55) BM_MapInsertSeq<Map<int, int>>/256 2.42µs ± 4% 2.38µs ± 3% -1.57% (p=0.000 n=53+56) BM_MapInsertSeq<Map<int, int>>/4096 38.4µs ± 3% 38.6µs ± 3% ~ (p=0.118 n=54+52) BM_MapInsertSeq<Map<int, int>>/65536 968µs ± 3% 972µs ± 3% ~ (p=0.333 n=56+57) BM_MapInsertSeq<Map<int, int>>/1048576 16.3ms ± 5% 16.3ms ± 6% ~ (p=0.269 n=56+55) BM_MapInsertSeq<Map<int, int>>/16777216 646ms ± 3% 646ms ± 4% ~ (p=0.811 n=57+55) BM_MapInsertSeq<Map<int, int>>/56 436ns ± 3% 423ns ± 3% -2.89% (p=0.000 n=56+57) BM_MapInsertSeq<Map<int, int>>/224 1.66µs ± 4% 1.63µs ± 4% -2.36% (p=0.000 n=55+57) BM_MapInsertSeq<Map<int, int>>/3584 25.1µs ± 4% 25.4µs ± 5% +1.03% (p=0.002 n=55+57) BM_MapInsertSeq<Map<int, int>>/57344 559µs ± 3% 567µs ± 4% +1.31% (p=0.000 n=56+57) BM_MapInsertSeq<Map<int, int>>/917504 10.4ms ± 4% 10.4ms ± 4% ~ (p=0.417 n=56+55) BM_MapInsertSeq<Map<int, int>>/14680064 420ms ± 3% 420ms ± 3% ~ (p=0.740 n=57+55) BM_MapInsertSeq<Map<int*, int*>>/1 33.8ns ± 3% 33.9ns ± 3% ~ (p=0.191 n=55+56) BM_MapInsertSeq<Map<int*, int*>>/2 37.0ns ± 3% 37.0ns ± 4% ~ (p=0.948 n=55+56) BM_MapInsertSeq<Map<int*, int*>>/3 41.5ns ± 4% 41.6ns ± 4% ~ (p=0.503 n=56+56) BM_MapInsertSeq<Map<int*, int*>>/4 46.1ns ± 4% 47.2ns ± 4% +2.31% (p=0.000 n=56+56) BM_MapInsertSeq<Map<int*, int*>>/8 63.6ns ± 4% 65.4ns ± 3% +2.89% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/16 133ns ± 4% 128ns ± 4% -3.74% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/32 237ns ± 3% 236ns ± 4% ~ (p=0.082 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/64 597ns ± 3% 631ns ± 3% +5.73% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/256 2.78µs ± 3% 2.89µs ± 4% +3.82% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/4096 51.9µs ± 3% 54.2µs ± 4% +4.46% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/65536 1.16ms ± 3% 1.18ms ± 4% +1.10% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/1048576 28.9ms ± 4% 29.0ms ± 4% ~ (p=0.549 n=57+54) BM_MapInsertSeq<Map<int*, int*>>/16777216 914ms ± 3% 912ms ± 3% ~ (p=0.147 n=57+54) BM_MapInsertSeq<Map<int*, int*>>/56 366ns ± 3% 407ns ± 3% +11.27% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/224 1.86µs ± 4% 1.96µs ± 3% +5.18% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/3584 33.9µs ± 4% 35.9µs ± 3% +6.04% (p=0.000 n=57+52) BM_MapInsertSeq<Map<int*, int*>>/57344 763µs ± 3% 772µs ± 4% +1.25% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/917504 16.8ms ±11% 16.6ms ± 6% ~ (p=0.454 n=57+52) BM_MapInsertSeq<Map<int*, int*>>/14680064 610ms ± 2% 608ms ± 3% ~ (p=0.127 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1 34.6ns ± 4% 34.5ns ± 3% ~ (p=0.331 n=56+52) BM_MapInsertSeq<Map<int, llvm::StringRef>>/2 43.9ns ± 4% 37.8ns ± 3% -13.78% (p=0.000 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3 47.9ns ± 4% 48.9ns ± 3% +2.16% (p=0.000 n=56+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4 53.6ns ± 3% 54.0ns ± 3% +0.72% (p=0.012 n=56+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/8 84.6ns ± 4% 81.0ns ± 3% -4.21% (p=0.000 n=57+50) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16 142ns ± 3% 136ns ± 3% -4.22% (p=0.000 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/32 258ns ± 4% 245ns ± 3% -4.96% (p=0.000 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/64 699ns ± 4% 692ns ± 3% -1.02% (p=0.000 n=50+50) BM_MapInsertSeq<Map<int, llvm::StringRef>>/256 2.96µs ± 6% 2.97µs ± 4% ~ (p=0.098 n=57+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096 50.4µs ± 3% 50.7µs ± 2% +0.53% (p=0.030 n=54+53) BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536 1.38ms ± 2% 1.42ms ± 3% +2.76% (p=0.000 n=54+56) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576 37.8ms ± 4% 38.1ms ± 4% +0.78% (p=0.016 n=57+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216 1.19s ± 2% 1.20s ± 3% ~ (p=0.055 n=57+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/56 430ns ± 3% 413ns ± 3% -4.10% (p=0.000 n=54+57) BM_MapInsertSeq<Map<int, llvm::StringRef>>/224 1.94µs ± 8% 1.93µs ± 6% ~ (p=0.587 n=57+56) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584 32.1µs ± 4% 32.2µs ± 3% ~ (p=0.403 n=56+57) BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344 746µs ± 3% 760µs ± 3% +1.93% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504 22.0ms ± 6% 22.2ms ± 5% +0.90% (p=0.025 n=57+53) BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064 734ms ± 2% 736ms ± 3% ~ (p=0.107 n=57+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1 40.8ns ± 4% 40.8ns ± 4% ~ (p=0.685 n=54+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/2 49.5ns ± 4% 49.6ns ± 7% ~ (p=0.927 n=56+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3 58.1ns ± 4% 58.2ns ± 4% ~ (p=0.632 n=55+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4 66.7ns ± 4% 67.0ns ± 5% ~ (p=0.235 n=55+56) BM_MapInsertSeq<Map<llvm::StringRef, int>>/8 104ns ± 6% 105ns ± 6% ~ (p=0.171 n=55+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16 196ns ± 5% 197ns ± 6% ~ (p=0.189 n=54+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/32 361ns ± 8% 364ns ± 7% ~ (p=0.156 n=53+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/64 1.05µs ± 6% 1.04µs ± 6% ~ (p=0.069 n=56+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/256 5.32µs ± 5% 5.24µs ± 4% -1.46% (p=0.000 n=57+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096 148µs ± 4% 147µs ± 4% -0.65% (p=0.028 n=55+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536 3.17ms ± 3% 3.13ms ± 2% -1.49% (p=0.000 n=54+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576 99.1ms ± 3% 98.3ms ± 3% -0.89% (p=0.000 n=57+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216 2.40s ± 3% 2.40s ± 2% ~ (p=0.247 n=56+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/56 612ns ± 6% 612ns ± 8% ~ (p=0.968 n=52+56) BM_MapInsertSeq<Map<llvm::StringRef, int>>/224 3.56µs ± 7% 3.51µs ± 4% -1.40% (p=0.006 n=57+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584 87.3µs ± 5% 88.5µs ± 6% +1.30% (p=0.011 n=57+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344 1.95ms ± 3% 1.93ms ± 4% -0.77% (p=0.000 n=55+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504 61.2ms ± 4% 61.0ms ± 3% ~ (p=0.059 n=57+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064 1.50s ± 3% 1.50s ± 3% ~ (p=0.609 n=57+55) ``` Benchmarks on ARM: ``` name old cpu/op new cpu/op delta BM_MapInsertSeq<Map<int, int>>/1 39.6ns ± 1% 39.6ns ± 1% ~ (p=0.416 n=155+156) BM_MapInsertSeq<Map<int, int>>/2 44.6ns ± 1% 44.6ns ± 1% ~ (p=0.993 n=155+157) BM_MapInsertSeq<Map<int, int>>/3 50.1ns ± 3% 50.1ns ± 2% ~ (p=0.585 n=156+157) BM_MapInsertSeq<Map<int, int>>/4 55.7ns ± 3% 55.4ns ± 1% -0.61% (p=0.000 n=156+119) BM_MapInsertSeq<Map<int, int>>/8 78.0ns ± 5% 77.2ns ± 1% -1.01% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, int>>/16 123ns ± 5% 121ns ± 0% -1.32% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, int>>/32 216ns ± 6% 213ns ± 0% -1.51% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, int>>/64 644ns ± 4% 659ns ± 3% +2.19% (p=0.000 n=157+146) BM_MapInsertSeq<Map<int, int>>/256 3.22µs ± 4% 3.30µs ± 5% +2.56% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, int>>/4096 53.8µs ± 3% 55.5µs ± 2% +3.18% (p=0.000 n=157+144) BM_MapInsertSeq<Map<int, int>>/65536 1.29ms ± 3% 1.33ms ± 4% +2.97% (p=0.000 n=155+155) BM_MapInsertSeq<Map<int, int>>/1048576 30.2ms ± 8% 31.0ms ± 9% +2.58% (p=0.000 n=157+156) BM_MapInsertSeq<Map<int, int>>/16777216 902ms ±14% 924ms ±17% +2.47% (p=0.005 n=157+157) BM_MapInsertSeq<Map<int, int>>/56 350ns ± 6% 345ns ± 0% -1.64% (p=0.023 n=157+119) BM_MapInsertSeq<Map<int, int>>/224 2.09µs ± 5% 2.14µs ± 5% +2.01% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, int>>/3584 36.6µs ± 4% 37.3µs ± 3% +1.76% (p=0.000 n=157+142) BM_MapInsertSeq<Map<int, int>>/57344 903µs ± 5% 923µs ± 4% +2.23% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, int>>/917504 21.7ms ± 8% 22.0ms ± 8% +1.52% (p=0.000 n=157+153) BM_MapInsertSeq<Map<int, int>>/14680064 678ms ±18% 692ms ±18% +2.03% (p=0.037 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/1 40.7ns ± 1% 40.7ns ± 1% ~ (p=0.216 n=152+153) BM_MapInsertSeq<Map<int*, int*>>/2 45.5ns ± 1% 45.5ns ± 1% ~ (p=0.403 n=155+156) BM_MapInsertSeq<Map<int*, int*>>/3 51.1ns ± 1% 51.1ns ± 1% ~ (p=0.962 n=155+156) BM_MapInsertSeq<Map<int*, int*>>/4 57.5ns ± 4% 59.7ns ± 1% +3.86% (p=0.000 n=157+149) BM_MapInsertSeq<Map<int*, int*>>/8 80.3ns ± 4% 82.3ns ± 1% +2.52% (p=0.000 n=157+108) BM_MapInsertSeq<Map<int*, int*>>/16 127ns ± 4% 129ns ± 4% +1.57% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/32 229ns ± 4% 230ns ± 4% +0.49% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/64 695ns ± 3% 705ns ± 4% +1.37% (p=0.000 n=155+156) BM_MapInsertSeq<Map<int*, int*>>/256 3.64µs ± 7% 3.46µs ± 3% -4.93% (p=0.000 n=157+154) BM_MapInsertSeq<Map<int*, int*>>/4096 58.8µs ± 2% 60.8µs ± 2% +3.46% (p=0.000 n=157+148) BM_MapInsertSeq<Map<int*, int*>>/65536 1.14ms ± 2% 1.17ms ± 3% +3.23% (p=0.000 n=156+157) BM_MapInsertSeq<Map<int*, int*>>/1048576 41.4ms ±12% 42.3ms ±13% +2.13% (p=0.001 n=157+156) BM_MapInsertSeq<Map<int*, int*>>/16777216 1.10s ±20% 1.12s ±22% ~ (p=0.082 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/56 377ns ± 5% 372ns ± 7% -1.19% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/224 2.40µs ± 9% 2.23µs ± 4% -6.97% (p=0.000 n=157+155) BM_MapInsertSeq<Map<int*, int*>>/3584 39.6µs ± 3% 40.0µs ± 2% +1.01% (p=0.000 n=157+121) BM_MapInsertSeq<Map<int*, int*>>/57344 776µs ± 3% 795µs ± 3% +2.40% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/917504 31.0ms ±13% 31.4ms ±14% ~ (p=0.055 n=157+156) BM_MapInsertSeq<Map<int*, int*>>/14680064 776ms ±22% 783ms ±25% ~ (p=0.231 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1 41.2ns ± 1% 41.2ns ± 1% ~ (p=0.516 n=155+154) BM_MapInsertSeq<Map<int, llvm::StringRef>>/2 46.8ns ± 1% 46.8ns ± 1% ~ (p=0.324 n=155+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3 52.7ns ± 2% 52.7ns ± 2% ~ (p=0.122 n=155+154) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4 58.8ns ± 3% 58.7ns ± 1% ~ (p=0.142 n=156+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/8 82.7ns ± 4% 82.2ns ± 1% -0.66% (p=0.004 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16 131ns ± 4% 130ns ± 0% -0.85% (p=0.001 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/32 231ns ± 4% 229ns ± 0% -0.94% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/64 727ns ± 3% 744ns ± 3% +2.28% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/256 3.71µs ± 3% 3.79µs ± 3% +2.24% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096 64.4µs ± 2% 66.2µs ± 2% +2.75% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536 1.64ms ± 4% 1.68ms ± 3% +2.11% (p=0.000 n=149+145) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576 39.5ms ±11% 40.5ms ±13% +2.32% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216 1.14s ±16% 1.16s ±17% ~ (p=0.082 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/56 375ns ± 4% 371ns ± 0% -1.03% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/224 2.35µs ± 3% 2.40µs ± 4% +1.84% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584 42.4µs ± 2% 43.2µs ± 2% +1.76% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344 1.15ms ± 3% 1.16ms ± 3% +1.50% (p=0.000 n=148+144) BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504 27.4ms ± 8% 27.8ms ±10% +1.44% (p=0.002 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064 786ms ±19% 791ms ±20% ~ (p=0.239 n=157+157) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1 56.3ns ± 1% 56.4ns ± 1% ~ (p=0.363 n=151+149) BM_MapInsertSeq<Map<llvm::StringRef, int>>/2 69.7ns ± 0% 69.7ns ± 0% ~ (p=0.330 n=153+150) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3 83.2ns ± 0% 83.1ns ± 0% ~ (p=0.129 n=149+154) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4 96.5ns ± 0% 96.5ns ± 0% ~ (p=0.619 n=153+153) BM_MapInsertSeq<Map<llvm::StringRef, int>>/8 154ns ± 0% 154ns ± 0% ~ (p=0.769 n=147+149) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16 292ns ± 1% 292ns ± 1% ~ (p=0.511 n=146+146) BM_MapInsertSeq<Map<llvm::StringRef, int>>/32 542ns ± 4% 542ns ± 4% ~ (p=0.729 n=157+156) BM_MapInsertSeq<Map<llvm::StringRef, int>>/64 1.61µs ± 6% 1.62µs ± 4% +0.62% (p=0.003 n=157+156) BM_MapInsertSeq<Map<llvm::StringRef, int>>/256 8.60µs ± 5% 8.69µs ± 5% +1.02% (p=0.000 n=148+151) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096 195µs ± 3% 195µs ± 3% ~ (p=0.668 n=154+155) BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536 4.45ms ± 4% 4.45ms ± 4% ~ (p=0.698 n=148+141) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576 126ms ±11% 130ms ±10% +3.06% (p=0.000 n=157+142) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216 3.28s ±12% 3.43s ± 9% +4.72% (p=0.000 n=157+137) BM_MapInsertSeq<Map<llvm::StringRef, int>>/56 942ns ± 5% 939ns ± 5% ~ (p=0.284 n=156+157) BM_MapInsertSeq<Map<llvm::StringRef, int>>/224 5.82µs ± 5% 5.86µs ± 5% +0.63% (p=0.008 n=153+155) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584 122µs ± 4% 122µs ± 3% ~ (p=0.822 n=156+154) BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344 2.82ms ± 3% 2.82ms ± 3% ~ (p=0.765 n=150+144) BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504 79.4ms ± 7% 80.0ms ± 7% +0.83% (p=0.008 n=149+148) BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064 2.17s ±15% 2.25s ±14% +3.51% (p=0.000 n=157+146) ```

chandlerc

Sorry for being slow here -- I've been sick for over a week now...

Looking at this, I'm a bit worried about the regression on ARM CPUs... I think there is a real regression there.

For example, on my M1, I'm seeing:

----------------------------------------------------------------------------------------------------------
Benchmark                                              CPU   Iterations     CYCLES INSTRUCTIONS    KeyRate
----------------------------------------------------------------------------------------------------------
BM_MapInsertSeq<Map<int, int>>/4096               29878 ns        23383   95.6939k      460.51k 137.092M/s
BM_MapInsertSeq<Map<int, int>>/65536             675704 ns         1023   2.16291M     7.37218M 96.9892M/s
BM_MapInsertSeq<Map<int, int>>/1048576         11951256 ns           59    38.268M     114.816M 87.7377M/s
BM_MapInsertSeq<Map<int, int>>/16777216       442136773 ns            2   1.36822G     1.89526G 37.9458M/s

Vs. without this patch:

----------------------------------------------------------------------------------------------------------
Benchmark                                              CPU   Iterations     CYCLES INSTRUCTIONS    KeyRate
----------------------------------------------------------------------------------------------------------
BM_MapInsertSeq<Map<int, int>>/4096               29342 ns        23815   93.9735k     451.537k 139.593M/s
BM_MapInsertSeq<Map<int, int>>/65536             668646 ns         1035   2.13683M     7.19312M  98.013M/s
BM_MapInsertSeq<Map<int, int>>/1048576         12345606 ns           57   39.5332M     113.325M 84.9352M/s
BM_MapInsertSeq<Map<int, int>>/16777216       436957105 ns            2   1.35579G      1.8614G 38.3956M/s

I've zoomed in on the relevant columns. Note the higher instruction count without this patch consistently. This is particularly surprising with the <int, int> map because that one should be entirely deterministic -- there shouldn't be any noise or fluctuations here outside of things like context switches, and so the difference is very likely specific to this change.

And it doesn't seem to be that the latency or throughput is improved, as the cycle count for these benchmarks also seems to consistently regress with this change.

I've not yet had a chance to specifically look at the instruction trace of the grow routine before/after to try and spot why this doesn't work out as a win on ARM, but I think that's the next step here.

chandlerc · 2025-01-20T11:05:10Z

common/raw_hashtable.h

      // Toggle the newly added bit of the index to get to the other possible
      // target index.
-      if constexpr (MetadataGroup::FastByteClear) {
-        (new_index == old_hash_index ? high_g : low_g).ClearByte(byte_index);
+      if constexpr (MetadataGroup::FastByteSet) {
+        (new_index == old_hash_index ? low_g : high_g)
+            .OverwriteEmptyByte(byte_index, tag | MetadataGroup::PresentMask);
        new_index += byte_index;
      } else {
        new_index += byte_index;
-        new_metadata[new_index ^ old_size] = MetadataGroup::Empty;
+        new_metadata[new_index] = tag | MetadataGroup::PresentMask;
      }


I dug into the assembly on ARM and I think I see why this regresses.

The code for this operation is quite different before and after. I think I have isolated the correct instruction sequence (but my debug info is broken at the moment on Linux ARM, so having to guess a bit). These are extracted from running InsertSeq<Map<int, int>> up to 1M size under perf. The cycle sample counts are shown for the snippet -- these aren't directly comparable even though the # of iterations was fixed because the sampling could have randomly been different. But the relative weight from one instruction to the next should be.

Before:

1.15 │ and x8, x21, x8 │ and x9, x24, #0x38 │ lsl x9, x17, x9 │ cmp x8, x25 │ csel x8, x20, x26, eq 1.06 │ bic x8, x8, x9 0.49 │ csel x20, x8, x20, eq 25.51 │ csel x26, x26, x8, eq

The byte clear is done with the lsl instruction (moving 0xff stored in x16 to the byte position to clear), and bic instruction that does x8 ANDN x9 is essence. But cleverly, x8 is whichever of high_g or low_g this if selects, and the result is moved back with the next two csels. At least on the M1, this also has pretty excellent ILP.

After:

0.96 │ mov w10, #0x80 │ and x9, x27, x9 │ and x8, x8, #0x38 │ bfxil x10, x24, #0, #7 │ cmp x9, x25 │ csel x9, x20, x22, eq │ lsl x8, x10, x8 30.47 │ orr x8, x9, x8 3.12 │ csel x22, x22, x8, eq 2.95 │ csel x20, x8, x20, eq

First, we unfortunately need to materialize the 0x80 used to mark the tag as present here, inside the loop. And then the code sequence to form the present is longer: bfxil + lsl + orr instead of just lsl + bic to clear. Not sure if the samples landing on orr represent less ability to pipeline or not, would need to put this into llvm-mca to really figure that out, but I've not done the work to thread it into that tool. Even if not a specific ILP problem, it adds up to 4 instructions in the loop where the old code got away with 2 instructions.

Anyways, it does seem to be a non-trivial regression on ARM, although probably very specifically because of how the method is called. This is one of the annoying things about the metadata group operations -- their performance is so deeply dependent on the context in which they are used and how the compiler ends up lowering things.

Hope this detail helps though!

goldvitaly added 2 commits January 3, 2025 13:49

Fix comment about FastSetByte.

e00d459

github-actions bot added the toolchain label Jan 3, 2025

github-actions bot requested a review from zygoloid January 3, 2025 13:06

chandlerc self-requested a review January 3, 2025 23:50

chandlerc reviewed Jan 19, 2025

View reviewed changes

chandlerc reviewed Jan 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754

OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754

goldvitaly commented Jan 3, 2025

chandlerc left a comment

chandlerc Jan 20, 2025

OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754

Are you sure you want to change the base?

OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754

Conversation

goldvitaly commented Jan 3, 2025

chandlerc left a comment

Choose a reason for hiding this comment

chandlerc Jan 20, 2025

Choose a reason for hiding this comment