Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

codegen/llvm: improve loadTruncate() to use a simple load when possible #18137

Closed
wants to merge 1 commit into from

Conversation

xxxbxxx
Copy link
Contributor

@xxxbxxx xxxbxxx commented Nov 26, 2023

forcing llvm to truncate the padding bits on load prevents some optimizations.

fixes 370662c
closes #17768

@xxxbxxx
Copy link
Contributor Author

xxxbxxx commented Nov 26, 2023

The idea of this change is to assume that if there is a local variable in it's own alloca,
then the padding bits are clean.
Now,

  • I don't know if it's true (I expect it is possible to write garbage to the bits with some intToPtr / ptrCast. but maybe it can happen when coping from an other variable? )
  • I don't know if it the proper way to implement this...

however, it does get the job done:

Benchmark 1 (81 runs): ./benchmark_baseline 1 1000/10000 281483566841860
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           249ms ± 13.0ms     225ms …  298ms          4 ( 5%)        0%
  peak_rss            880KB ± 7.17KB     819KB …  881KB          2 ( 2%)        0%
  cpu_cycles          836M  ± 6.80M      828M  …  859M           2 ( 2%)        0%
  instructions       2.06G  ± 67.9      2.06G  … 2.06G          10 (12%)        0%
  cache_references    759K  ±  142K      521K  … 1.21M           1 ( 1%)        0%
  cache_misses       17.2K  ± 9.76K     1.56K  … 53.8K           1 ( 1%)        0%
  branch_misses      5.77M  ± 16.6K     5.73M  … 5.80M           0 ( 0%)        0%
Benchmark 2 (70 runs): ./benchmark_master 1 1000/10000 281483566841860
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           286ms ± 7.91ms     273ms …  306ms          0 ( 0%)        💩+ 15.0% ±  1.4%
  peak_rss            909KB ±    0       909KB …  909KB          0 ( 0%)        💩+  3.4% ±  0.2%
  cpu_cycles         1.00G  ± 8.34M      987M  … 1.02G           0 ( 0%)        💩+ 19.9% ±  0.3%
  instructions       2.39G  ± 69.2      2.39G  … 2.39G           5 ( 7%)        💩+ 16.0% ±  0.0%
  cache_references    943K  ±  697K      539K  … 6.51M           3 ( 4%)        💩+ 24.3% ± 20.5%
  cache_misses       28.9K  ± 11.8K     6.86K  … 65.4K           1 ( 1%)        💩+ 68.4% ± 20.0%
  branch_misses      5.89M  ± 13.8K     5.87M  … 5.92M           0 ( 0%)        💩+  2.1% ±  0.1%
Benchmark 3 (79 runs): ./benchmark_patched 1 1000/10000 281483566841860
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           253ms ± 8.16ms     237ms …  285ms          6 ( 8%)          +  1.7% ±  1.4%
  peak_rss            909KB ±    0       909KB …  909KB          0 ( 0%)        💩+  3.4% ±  0.2%
  cpu_cycles          888M  ± 8.69M      874M  …  918M           1 ( 1%)        💩+  6.2% ±  0.3%
  instructions       2.04G  ± 59.2      2.04G  … 2.04G          14 (18%)        ⚡-  1.1% ±  0.0%
  cache_references    978K  ± 1.56M      469K  … 14.5M          10 (13%)          + 29.0% ± 44.9%
  cache_misses       25.8K  ± 16.7K     1.10K  … 70.9K           6 ( 8%)        💩+ 50.0% ± 24.6%
  branch_misses      5.80M  ± 16.7K     5.76M  … 5.84M          13 (16%)          +  0.5% ±  0.1%

forcing llvm to truncate the padding bits prevents some optimizations.

fixes 370662c
see ziglang#17768
@andrewrk
Copy link
Member

I don't know if it the proper way to implement this...

given this, combined with the fact that it's failing CI checks, needs a rebase, I'm going to close it. I don't know if it's the proper way to implement it either, unless I dive in and implement the whole thing myself, in which case I might as well close this and start over. sorry for your wasted efforts. feel free to reopen to start the discussion anew.

@andrewrk andrewrk closed this Aug 24, 2024
@xxxbxxx
Copy link
Contributor Author

xxxbxxx commented Aug 24, 2024

yeah I'm guessing it should be done by some generic "provenance" framework, maybe in relation to alias analysis / result location semantics..

for the record, to leave some breadcrumb for future implementer:

  • I've kept the commit up to date and active my local branch and the compiler works fine with it, never stumbled on a miscompilation since it's last year.
  • I've updated the branch xxxbxxx:load-truncate-optim with commit rebased on the current master branch (but I don't know if it's possible to update the closed pullrequest)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

20%+ regression in ReleaseFast performance since 0.11.0
2 participants