-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OverwriteEmptyByte instead of ClearByte in GrowToNextAllocSize. #4754
base: trunk
Are you sure you want to change the base?
Conversation
This is simplification that allows to remove `ClearDeleted` and intended to have similar performance. Benchmarks are close to noise. Leaning positive on X86 and leaning negative on ARM. 1) On Arm: `&= ~(uint64_t{0xff} << (byte_index * 8))` -> `|= (uint64_t{tag | 0x80} << (byte_index * 8))`. This operation is even reducing latency from 4 -> 3. But we have another shift operation to compute the tag. Theoretically it shouldn't be slower. 2) On x86: `new_metadata[new_index ^ old_size] = 0;` -> `new_metadata[new_index] = tag | MetadataGroup::PresentMask;`. Here we have the same latency, but we need an extra shift to compute the tag. Benchmarks on X86: ``` name old cpu/op new cpu/op delta BM_MapInsertSeq<Map<int, int>>/1 32.7ns ± 4% 32.8ns ± 3% ~ (p=0.354 n=52+52) BM_MapInsertSeq<Map<int, int>>/2 42.3ns ± 3% 40.4ns ± 4% -4.56% (p=0.000 n=54+55) BM_MapInsertSeq<Map<int, int>>/3 50.5ns ± 3% 51.1ns ± 3% +1.22% (p=0.000 n=53+56) BM_MapInsertSeq<Map<int, int>>/4 59.9ns ± 4% 60.8ns ± 4% +1.45% (p=0.000 n=54+57) BM_MapInsertSeq<Map<int, int>>/8 98.8ns ± 4% 98.3ns ± 3% -0.50% (p=0.044 n=56+56) BM_MapInsertSeq<Map<int, int>>/16 163ns ± 4% 158ns ± 4% -2.77% (p=0.000 n=55+51) BM_MapInsertSeq<Map<int, int>>/32 281ns ± 4% 257ns ± 3% -8.30% (p=0.000 n=55+55) BM_MapInsertSeq<Map<int, int>>/64 646ns ± 4% 624ns ± 3% -3.41% (p=0.000 n=55+55) BM_MapInsertSeq<Map<int, int>>/256 2.42µs ± 4% 2.38µs ± 3% -1.57% (p=0.000 n=53+56) BM_MapInsertSeq<Map<int, int>>/4096 38.4µs ± 3% 38.6µs ± 3% ~ (p=0.118 n=54+52) BM_MapInsertSeq<Map<int, int>>/65536 968µs ± 3% 972µs ± 3% ~ (p=0.333 n=56+57) BM_MapInsertSeq<Map<int, int>>/1048576 16.3ms ± 5% 16.3ms ± 6% ~ (p=0.269 n=56+55) BM_MapInsertSeq<Map<int, int>>/16777216 646ms ± 3% 646ms ± 4% ~ (p=0.811 n=57+55) BM_MapInsertSeq<Map<int, int>>/56 436ns ± 3% 423ns ± 3% -2.89% (p=0.000 n=56+57) BM_MapInsertSeq<Map<int, int>>/224 1.66µs ± 4% 1.63µs ± 4% -2.36% (p=0.000 n=55+57) BM_MapInsertSeq<Map<int, int>>/3584 25.1µs ± 4% 25.4µs ± 5% +1.03% (p=0.002 n=55+57) BM_MapInsertSeq<Map<int, int>>/57344 559µs ± 3% 567µs ± 4% +1.31% (p=0.000 n=56+57) BM_MapInsertSeq<Map<int, int>>/917504 10.4ms ± 4% 10.4ms ± 4% ~ (p=0.417 n=56+55) BM_MapInsertSeq<Map<int, int>>/14680064 420ms ± 3% 420ms ± 3% ~ (p=0.740 n=57+55) BM_MapInsertSeq<Map<int*, int*>>/1 33.8ns ± 3% 33.9ns ± 3% ~ (p=0.191 n=55+56) BM_MapInsertSeq<Map<int*, int*>>/2 37.0ns ± 3% 37.0ns ± 4% ~ (p=0.948 n=55+56) BM_MapInsertSeq<Map<int*, int*>>/3 41.5ns ± 4% 41.6ns ± 4% ~ (p=0.503 n=56+56) BM_MapInsertSeq<Map<int*, int*>>/4 46.1ns ± 4% 47.2ns ± 4% +2.31% (p=0.000 n=56+56) BM_MapInsertSeq<Map<int*, int*>>/8 63.6ns ± 4% 65.4ns ± 3% +2.89% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/16 133ns ± 4% 128ns ± 4% -3.74% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/32 237ns ± 3% 236ns ± 4% ~ (p=0.082 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/64 597ns ± 3% 631ns ± 3% +5.73% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/256 2.78µs ± 3% 2.89µs ± 4% +3.82% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/4096 51.9µs ± 3% 54.2µs ± 4% +4.46% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/65536 1.16ms ± 3% 1.18ms ± 4% +1.10% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/1048576 28.9ms ± 4% 29.0ms ± 4% ~ (p=0.549 n=57+54) BM_MapInsertSeq<Map<int*, int*>>/16777216 914ms ± 3% 912ms ± 3% ~ (p=0.147 n=57+54) BM_MapInsertSeq<Map<int*, int*>>/56 366ns ± 3% 407ns ± 3% +11.27% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int*, int*>>/224 1.86µs ± 4% 1.96µs ± 3% +5.18% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/3584 33.9µs ± 4% 35.9µs ± 3% +6.04% (p=0.000 n=57+52) BM_MapInsertSeq<Map<int*, int*>>/57344 763µs ± 3% 772µs ± 4% +1.25% (p=0.000 n=57+57) BM_MapInsertSeq<Map<int*, int*>>/917504 16.8ms ±11% 16.6ms ± 6% ~ (p=0.454 n=57+52) BM_MapInsertSeq<Map<int*, int*>>/14680064 610ms ± 2% 608ms ± 3% ~ (p=0.127 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1 34.6ns ± 4% 34.5ns ± 3% ~ (p=0.331 n=56+52) BM_MapInsertSeq<Map<int, llvm::StringRef>>/2 43.9ns ± 4% 37.8ns ± 3% -13.78% (p=0.000 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3 47.9ns ± 4% 48.9ns ± 3% +2.16% (p=0.000 n=56+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4 53.6ns ± 3% 54.0ns ± 3% +0.72% (p=0.012 n=56+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/8 84.6ns ± 4% 81.0ns ± 3% -4.21% (p=0.000 n=57+50) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16 142ns ± 3% 136ns ± 3% -4.22% (p=0.000 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/32 258ns ± 4% 245ns ± 3% -4.96% (p=0.000 n=57+55) BM_MapInsertSeq<Map<int, llvm::StringRef>>/64 699ns ± 4% 692ns ± 3% -1.02% (p=0.000 n=50+50) BM_MapInsertSeq<Map<int, llvm::StringRef>>/256 2.96µs ± 6% 2.97µs ± 4% ~ (p=0.098 n=57+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096 50.4µs ± 3% 50.7µs ± 2% +0.53% (p=0.030 n=54+53) BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536 1.38ms ± 2% 1.42ms ± 3% +2.76% (p=0.000 n=54+56) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576 37.8ms ± 4% 38.1ms ± 4% +0.78% (p=0.016 n=57+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216 1.19s ± 2% 1.20s ± 3% ~ (p=0.055 n=57+54) BM_MapInsertSeq<Map<int, llvm::StringRef>>/56 430ns ± 3% 413ns ± 3% -4.10% (p=0.000 n=54+57) BM_MapInsertSeq<Map<int, llvm::StringRef>>/224 1.94µs ± 8% 1.93µs ± 6% ~ (p=0.587 n=57+56) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584 32.1µs ± 4% 32.2µs ± 3% ~ (p=0.403 n=56+57) BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344 746µs ± 3% 760µs ± 3% +1.93% (p=0.000 n=57+56) BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504 22.0ms ± 6% 22.2ms ± 5% +0.90% (p=0.025 n=57+53) BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064 734ms ± 2% 736ms ± 3% ~ (p=0.107 n=57+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1 40.8ns ± 4% 40.8ns ± 4% ~ (p=0.685 n=54+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/2 49.5ns ± 4% 49.6ns ± 7% ~ (p=0.927 n=56+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3 58.1ns ± 4% 58.2ns ± 4% ~ (p=0.632 n=55+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4 66.7ns ± 4% 67.0ns ± 5% ~ (p=0.235 n=55+56) BM_MapInsertSeq<Map<llvm::StringRef, int>>/8 104ns ± 6% 105ns ± 6% ~ (p=0.171 n=55+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16 196ns ± 5% 197ns ± 6% ~ (p=0.189 n=54+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/32 361ns ± 8% 364ns ± 7% ~ (p=0.156 n=53+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/64 1.05µs ± 6% 1.04µs ± 6% ~ (p=0.069 n=56+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/256 5.32µs ± 5% 5.24µs ± 4% -1.46% (p=0.000 n=57+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096 148µs ± 4% 147µs ± 4% -0.65% (p=0.028 n=55+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536 3.17ms ± 3% 3.13ms ± 2% -1.49% (p=0.000 n=54+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576 99.1ms ± 3% 98.3ms ± 3% -0.89% (p=0.000 n=57+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216 2.40s ± 3% 2.40s ± 2% ~ (p=0.247 n=56+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/56 612ns ± 6% 612ns ± 8% ~ (p=0.968 n=52+56) BM_MapInsertSeq<Map<llvm::StringRef, int>>/224 3.56µs ± 7% 3.51µs ± 4% -1.40% (p=0.006 n=57+55) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584 87.3µs ± 5% 88.5µs ± 6% +1.30% (p=0.011 n=57+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344 1.95ms ± 3% 1.93ms ± 4% -0.77% (p=0.000 n=55+57) BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504 61.2ms ± 4% 61.0ms ± 3% ~ (p=0.059 n=57+54) BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064 1.50s ± 3% 1.50s ± 3% ~ (p=0.609 n=57+55) ``` Benchmarks on ARM: ``` name old cpu/op new cpu/op delta BM_MapInsertSeq<Map<int, int>>/1 39.6ns ± 1% 39.6ns ± 1% ~ (p=0.416 n=155+156) BM_MapInsertSeq<Map<int, int>>/2 44.6ns ± 1% 44.6ns ± 1% ~ (p=0.993 n=155+157) BM_MapInsertSeq<Map<int, int>>/3 50.1ns ± 3% 50.1ns ± 2% ~ (p=0.585 n=156+157) BM_MapInsertSeq<Map<int, int>>/4 55.7ns ± 3% 55.4ns ± 1% -0.61% (p=0.000 n=156+119) BM_MapInsertSeq<Map<int, int>>/8 78.0ns ± 5% 77.2ns ± 1% -1.01% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, int>>/16 123ns ± 5% 121ns ± 0% -1.32% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, int>>/32 216ns ± 6% 213ns ± 0% -1.51% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, int>>/64 644ns ± 4% 659ns ± 3% +2.19% (p=0.000 n=157+146) BM_MapInsertSeq<Map<int, int>>/256 3.22µs ± 4% 3.30µs ± 5% +2.56% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, int>>/4096 53.8µs ± 3% 55.5µs ± 2% +3.18% (p=0.000 n=157+144) BM_MapInsertSeq<Map<int, int>>/65536 1.29ms ± 3% 1.33ms ± 4% +2.97% (p=0.000 n=155+155) BM_MapInsertSeq<Map<int, int>>/1048576 30.2ms ± 8% 31.0ms ± 9% +2.58% (p=0.000 n=157+156) BM_MapInsertSeq<Map<int, int>>/16777216 902ms ±14% 924ms ±17% +2.47% (p=0.005 n=157+157) BM_MapInsertSeq<Map<int, int>>/56 350ns ± 6% 345ns ± 0% -1.64% (p=0.023 n=157+119) BM_MapInsertSeq<Map<int, int>>/224 2.09µs ± 5% 2.14µs ± 5% +2.01% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, int>>/3584 36.6µs ± 4% 37.3µs ± 3% +1.76% (p=0.000 n=157+142) BM_MapInsertSeq<Map<int, int>>/57344 903µs ± 5% 923µs ± 4% +2.23% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, int>>/917504 21.7ms ± 8% 22.0ms ± 8% +1.52% (p=0.000 n=157+153) BM_MapInsertSeq<Map<int, int>>/14680064 678ms ±18% 692ms ±18% +2.03% (p=0.037 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/1 40.7ns ± 1% 40.7ns ± 1% ~ (p=0.216 n=152+153) BM_MapInsertSeq<Map<int*, int*>>/2 45.5ns ± 1% 45.5ns ± 1% ~ (p=0.403 n=155+156) BM_MapInsertSeq<Map<int*, int*>>/3 51.1ns ± 1% 51.1ns ± 1% ~ (p=0.962 n=155+156) BM_MapInsertSeq<Map<int*, int*>>/4 57.5ns ± 4% 59.7ns ± 1% +3.86% (p=0.000 n=157+149) BM_MapInsertSeq<Map<int*, int*>>/8 80.3ns ± 4% 82.3ns ± 1% +2.52% (p=0.000 n=157+108) BM_MapInsertSeq<Map<int*, int*>>/16 127ns ± 4% 129ns ± 4% +1.57% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/32 229ns ± 4% 230ns ± 4% +0.49% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/64 695ns ± 3% 705ns ± 4% +1.37% (p=0.000 n=155+156) BM_MapInsertSeq<Map<int*, int*>>/256 3.64µs ± 7% 3.46µs ± 3% -4.93% (p=0.000 n=157+154) BM_MapInsertSeq<Map<int*, int*>>/4096 58.8µs ± 2% 60.8µs ± 2% +3.46% (p=0.000 n=157+148) BM_MapInsertSeq<Map<int*, int*>>/65536 1.14ms ± 2% 1.17ms ± 3% +3.23% (p=0.000 n=156+157) BM_MapInsertSeq<Map<int*, int*>>/1048576 41.4ms ±12% 42.3ms ±13% +2.13% (p=0.001 n=157+156) BM_MapInsertSeq<Map<int*, int*>>/16777216 1.10s ±20% 1.12s ±22% ~ (p=0.082 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/56 377ns ± 5% 372ns ± 7% -1.19% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/224 2.40µs ± 9% 2.23µs ± 4% -6.97% (p=0.000 n=157+155) BM_MapInsertSeq<Map<int*, int*>>/3584 39.6µs ± 3% 40.0µs ± 2% +1.01% (p=0.000 n=157+121) BM_MapInsertSeq<Map<int*, int*>>/57344 776µs ± 3% 795µs ± 3% +2.40% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int*, int*>>/917504 31.0ms ±13% 31.4ms ±14% ~ (p=0.055 n=157+156) BM_MapInsertSeq<Map<int*, int*>>/14680064 776ms ±22% 783ms ±25% ~ (p=0.231 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1 41.2ns ± 1% 41.2ns ± 1% ~ (p=0.516 n=155+154) BM_MapInsertSeq<Map<int, llvm::StringRef>>/2 46.8ns ± 1% 46.8ns ± 1% ~ (p=0.324 n=155+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3 52.7ns ± 2% 52.7ns ± 2% ~ (p=0.122 n=155+154) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4 58.8ns ± 3% 58.7ns ± 1% ~ (p=0.142 n=156+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/8 82.7ns ± 4% 82.2ns ± 1% -0.66% (p=0.004 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16 131ns ± 4% 130ns ± 0% -0.85% (p=0.001 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/32 231ns ± 4% 229ns ± 0% -0.94% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/64 727ns ± 3% 744ns ± 3% +2.28% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/256 3.71µs ± 3% 3.79µs ± 3% +2.24% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/4096 64.4µs ± 2% 66.2µs ± 2% +2.75% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/65536 1.64ms ± 4% 1.68ms ± 3% +2.11% (p=0.000 n=149+145) BM_MapInsertSeq<Map<int, llvm::StringRef>>/1048576 39.5ms ±11% 40.5ms ±13% +2.32% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/16777216 1.14s ±16% 1.16s ±17% ~ (p=0.082 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/56 375ns ± 4% 371ns ± 0% -1.03% (p=0.000 n=157+119) BM_MapInsertSeq<Map<int, llvm::StringRef>>/224 2.35µs ± 3% 2.40µs ± 4% +1.84% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/3584 42.4µs ± 2% 43.2µs ± 2% +1.76% (p=0.000 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/57344 1.15ms ± 3% 1.16ms ± 3% +1.50% (p=0.000 n=148+144) BM_MapInsertSeq<Map<int, llvm::StringRef>>/917504 27.4ms ± 8% 27.8ms ±10% +1.44% (p=0.002 n=157+157) BM_MapInsertSeq<Map<int, llvm::StringRef>>/14680064 786ms ±19% 791ms ±20% ~ (p=0.239 n=157+157) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1 56.3ns ± 1% 56.4ns ± 1% ~ (p=0.363 n=151+149) BM_MapInsertSeq<Map<llvm::StringRef, int>>/2 69.7ns ± 0% 69.7ns ± 0% ~ (p=0.330 n=153+150) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3 83.2ns ± 0% 83.1ns ± 0% ~ (p=0.129 n=149+154) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4 96.5ns ± 0% 96.5ns ± 0% ~ (p=0.619 n=153+153) BM_MapInsertSeq<Map<llvm::StringRef, int>>/8 154ns ± 0% 154ns ± 0% ~ (p=0.769 n=147+149) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16 292ns ± 1% 292ns ± 1% ~ (p=0.511 n=146+146) BM_MapInsertSeq<Map<llvm::StringRef, int>>/32 542ns ± 4% 542ns ± 4% ~ (p=0.729 n=157+156) BM_MapInsertSeq<Map<llvm::StringRef, int>>/64 1.61µs ± 6% 1.62µs ± 4% +0.62% (p=0.003 n=157+156) BM_MapInsertSeq<Map<llvm::StringRef, int>>/256 8.60µs ± 5% 8.69µs ± 5% +1.02% (p=0.000 n=148+151) BM_MapInsertSeq<Map<llvm::StringRef, int>>/4096 195µs ± 3% 195µs ± 3% ~ (p=0.668 n=154+155) BM_MapInsertSeq<Map<llvm::StringRef, int>>/65536 4.45ms ± 4% 4.45ms ± 4% ~ (p=0.698 n=148+141) BM_MapInsertSeq<Map<llvm::StringRef, int>>/1048576 126ms ±11% 130ms ±10% +3.06% (p=0.000 n=157+142) BM_MapInsertSeq<Map<llvm::StringRef, int>>/16777216 3.28s ±12% 3.43s ± 9% +4.72% (p=0.000 n=157+137) BM_MapInsertSeq<Map<llvm::StringRef, int>>/56 942ns ± 5% 939ns ± 5% ~ (p=0.284 n=156+157) BM_MapInsertSeq<Map<llvm::StringRef, int>>/224 5.82µs ± 5% 5.86µs ± 5% +0.63% (p=0.008 n=153+155) BM_MapInsertSeq<Map<llvm::StringRef, int>>/3584 122µs ± 4% 122µs ± 3% ~ (p=0.822 n=156+154) BM_MapInsertSeq<Map<llvm::StringRef, int>>/57344 2.82ms ± 3% 2.82ms ± 3% ~ (p=0.765 n=150+144) BM_MapInsertSeq<Map<llvm::StringRef, int>>/917504 79.4ms ± 7% 80.0ms ± 7% +0.83% (p=0.008 n=149+148) BM_MapInsertSeq<Map<llvm::StringRef, int>>/14680064 2.17s ±15% 2.25s ±14% +3.51% (p=0.000 n=157+146) ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being slow here -- I've been sick for over a week now...
Looking at this, I'm a bit worried about the regression on ARM CPUs... I think there is a real regression there.
For example, on my M1, I'm seeing:
----------------------------------------------------------------------------------------------------------
Benchmark CPU Iterations CYCLES INSTRUCTIONS KeyRate
----------------------------------------------------------------------------------------------------------
BM_MapInsertSeq<Map<int, int>>/4096 29878 ns 23383 95.6939k 460.51k 137.092M/s
BM_MapInsertSeq<Map<int, int>>/65536 675704 ns 1023 2.16291M 7.37218M 96.9892M/s
BM_MapInsertSeq<Map<int, int>>/1048576 11951256 ns 59 38.268M 114.816M 87.7377M/s
BM_MapInsertSeq<Map<int, int>>/16777216 442136773 ns 2 1.36822G 1.89526G 37.9458M/s
Vs. without this patch:
----------------------------------------------------------------------------------------------------------
Benchmark CPU Iterations CYCLES INSTRUCTIONS KeyRate
----------------------------------------------------------------------------------------------------------
BM_MapInsertSeq<Map<int, int>>/4096 29342 ns 23815 93.9735k 451.537k 139.593M/s
BM_MapInsertSeq<Map<int, int>>/65536 668646 ns 1035 2.13683M 7.19312M 98.013M/s
BM_MapInsertSeq<Map<int, int>>/1048576 12345606 ns 57 39.5332M 113.325M 84.9352M/s
BM_MapInsertSeq<Map<int, int>>/16777216 436957105 ns 2 1.35579G 1.8614G 38.3956M/s
I've zoomed in on the relevant columns. Note the higher instruction count without this patch consistently. This is particularly surprising with the <int, int>
map because that one should be entirely deterministic -- there shouldn't be any noise or fluctuations here outside of things like context switches, and so the difference is very likely specific to this change.
And it doesn't seem to be that the latency or throughput is improved, as the cycle count for these benchmarks also seems to consistently regress with this change.
I've not yet had a chance to specifically look at the instruction trace of the grow routine before/after to try and spot why this doesn't work out as a win on ARM, but I think that's the next step here.
// Toggle the newly added bit of the index to get to the other possible | ||
// target index. | ||
if constexpr (MetadataGroup::FastByteClear) { | ||
(new_index == old_hash_index ? high_g : low_g).ClearByte(byte_index); | ||
if constexpr (MetadataGroup::FastByteSet) { | ||
(new_index == old_hash_index ? low_g : high_g) | ||
.OverwriteEmptyByte(byte_index, tag | MetadataGroup::PresentMask); | ||
new_index += byte_index; | ||
} else { | ||
new_index += byte_index; | ||
new_metadata[new_index ^ old_size] = MetadataGroup::Empty; | ||
new_metadata[new_index] = tag | MetadataGroup::PresentMask; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dug into the assembly on ARM and I think I see why this regresses.
The code for this operation is quite different before and after. I think I have isolated the correct instruction sequence (but my debug info is broken at the moment on Linux ARM, so having to guess a bit). These are extracted from running InsertSeq<Map<int, int>>
up to 1M size under perf
. The cycle sample counts are shown for the snippet -- these aren't directly comparable even though the # of iterations was fixed because the sampling could have randomly been different. But the relative weight from one instruction to the next should be.
Before:
1.15 │ and x8, x21, x8
│ and x9, x24, #0x38
│ lsl x9, x17, x9
│ cmp x8, x25
│ csel x8, x20, x26, eq
1.06 │ bic x8, x8, x9
0.49 │ csel x20, x8, x20, eq
25.51 │ csel x26, x26, x8, eq
The byte clear is done with the lsl
instruction (moving 0xff
stored in x16
to the byte position to clear), and bic
instruction that does x8
ANDN x9
is essence. But cleverly, x8
is whichever of high_g
or low_g
this if
selects, and the result is moved back with the next two csel
s. At least on the M1, this also has pretty excellent ILP.
After:
0.96 │ mov w10, #0x80
│ and x9, x27, x9
│ and x8, x8, #0x38
│ bfxil x10, x24, #0, #7
│ cmp x9, x25
│ csel x9, x20, x22, eq
│ lsl x8, x10, x8
30.47 │ orr x8, x9, x8
3.12 │ csel x22, x22, x8, eq
2.95 │ csel x20, x8, x20, eq
First, we unfortunately need to materialize the 0x80
used to mark the tag as present here, inside the loop. And then the code sequence to form the present is longer: bfxil
+ lsl
+ orr
instead of just lsl
+ bic
to clear. Not sure if the samples landing on orr
represent less ability to pipeline or not, would need to put this into llvm-mca
to really figure that out, but I've not done the work to thread it into that tool. Even if not a specific ILP problem, it adds up to 4 instructions in the loop where the old code got away with 2 instructions.
Anyways, it does seem to be a non-trivial regression on ARM, although probably very specifically because of how the method is called. This is one of the annoying things about the metadata group operations -- their performance is so deeply dependent on the context in which they are used and how the compiler ends up lowering things.
Hope this detail helps though!
This is simplification that allows to remove
ClearDeleted
and intendedto have similar performance.
Benchmarks are close to noise. Leaning positive on X86 and leaning
negative on ARM.
On Arm:
&= ~(uint64_t{0xff} << (byte_index * 8))
->|= (uint64_t{tag | 0x80} << (byte_index * 8))
.This operation is even reducing latency from 4 -> 3.
But we have another shift operation to compute the tag. Theoretically it shouldn't be slower.
On x86:
new_metadata[new_index ^ old_size] = 0;
->new_metadata[new_index] = tag | MetadataGroup::PresentMask;
.Here we have the same latency, but we need an extra shift to compute the tag.
Benchmarks on X86:
Benchmarks on ARM: