-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch CARBON_VLOG
to support a format string API.
#4283
Conversation
The goal is to replace our stream operator APIs with format string APIs that can be made to have much less impact on inlining and other optimizations of the performance critical path through the code. Several experiments show that the most compact representation we can arrange for is one that calls an uninlined function and passes a minimal number of arguments to it. It doesn't help to do any work to minimize the arguments such as building a lambda -- the cost of extra code to merge the arguments is likely to outweigh the benefit. Initial experiments showed that switching a hot but uninlined function to this new API enabled inlining and the subsequent performance improvement. This also adds a 'TemplateString` utility that allows using a string literal as a template parameter. This is useful to remove the format string itself from the arguments passed to the function by passing it as a template argument instead. Currently, support is left in place for both APIs because with `CARBON_VLOG` we can detect whether or not any message was provided expecting a format string. This should allow incrementally migrating code to this API. I've added some test coverage in this PR, but I'll separate out any switching of parts of the codebase over. The goal is to eventually replace all the usages and remove the streaming support entirely. This PR doesn't update `CARBON_CHECK` in the same way because it is substantially more complex to switch. I have a few experimental PRs looking at that and will discuss how best to approach this with the specific challenges check presents separately. But the goal is for all of the macro-based output APIs to move to format strings rather than streams.
Can look at #4284 to see what the result of this looks like. |
This switches `DCHECK` and `FATAL` as well. The goal is to reduce the code size impact of these assertions so that we can keep more of them enabled. Currently, the largest cost I see from `CHECK` is not the actual check or the cold code itself, but actually the failure to inline trivial functions due to the presence of the cold code. This means that our goal isn't to reduce apparent code size in the final binary but the LLVM IR cost assessed for these routines in the inliner, which closely correlates with code size but is a bit different. As discussed in carbon-language#4283, experimentation shows that a single function call with a minimal number of arguments is the lowest cost model for these. This is easily achieved with a format-string API that internally uses `llvm::formatv`. This PR is essentially the `CHECK` version of carbon-language#4283. However, the check macros are *substantially* harder to make work with both format strings and streaming because they also take a condition. Also, unexpectedly, I was very successful at devising a regular expression based automated rewrite from the streaming to the format string form with only low 10s of manual fixes. This includes compacting strings broken up across lines, etc. Given how well that went, I've prepared this PR which just directly switches to the format string API and migrate everything to use it. One nice side-effect is that the format string approach ends up greatly simplifying the implementation here as well.
This switches `DCHECK` and `FATAL` as well. The goal is to reduce the code size impact of these assertions so that we can keep more of them enabled. Currently, the largest cost I see from `CHECK` is not the actual check or the cold code itself, but actually the failure to inline trivial functions due to the presence of the cold code. This means that our goal isn't to reduce apparent code size in the final binary but the LLVM IR cost assessed for these routines in the inliner, which closely correlates with code size but is a bit different. As discussed in carbon-language#4283, experimentation shows that a single function call with a minimal number of arguments is the lowest cost model for these. This is easily achieved with a format-string API that internally uses `llvm::formatv`. This PR is essentially the `CHECK` version of carbon-language#4283. However, the check macros are *substantially* harder to make work with both format strings and streaming because they also take a condition. Also, unexpectedly, I was very successful at devising a regular expression based automated rewrite from the streaming to the format string form with only low 10s of manual fixes. This includes compacting strings broken up across lines, etc. Given how well that went, I've prepared this PR which just directly switches to the format string API and migrate everything to use it. One nice side-effect is that the format string approach ends up greatly simplifying the implementation here as well. This is ... *shockingly* effective. Parsing speeds up by more than 3% with just this change. And checking speeds up by **8%** with this change alone: ``` BM_CompileAPIFileDenseDecls<Phase::Parse>/256 86.3µs ± 1% 82.9µs ± 1% -3.94% (p=0.000 n=17+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/1024 431µs ± 1% 415µs ± 1% -3.76% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/4096 1.77ms ± 1% 1.71ms ± 1% -3.18% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/16384 7.44ms ± 1% 7.17ms ± 2% -3.56% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/65536 30.7ms ± 1% 29.7ms ± 1% -3.15% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/262144 131ms ± 1% 127ms ± 1% -2.81% (p=0.000 n=18+18) BM_CompileAPIFileDenseDecls<Phase::Check>/256 878µs ± 2% 800µs ± 1% -8.91% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/1024 1.88ms ± 2% 1.72ms ± 1% -8.56% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/4096 5.78ms ± 2% 5.28ms ± 1% -8.70% (p=0.000 n=20+18) BM_CompileAPIFileDenseDecls<Phase::Check>/16384 21.9ms ± 1% 20.1ms ± 1% -8.02% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Check>/65536 90.4ms ± 2% 83.1ms ± 1% -8.04% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/262144 381ms ± 2% 352ms ± 1% -7.79% (p=0.000 n=19+19) ```
common/template_string.h
Outdated
// Represent a compile-time string in a form suitable for non-type template | ||
// arguments. | ||
// | ||
// These arguments are required to be "structural", and so we copy the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until I looked up "structural" and found https://en.cppreference.com/w/cpp/language/template_parameters I was pretty confused by this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a link to the specific subsection of that article.
This switches `DCHECK` and `FATAL` as well. The goal is to reduce the code size impact of these assertions so that we can keep more of them enabled. Currently, the largest cost I see from `CHECK` is not the actual check or the cold code itself, but actually the failure to inline trivial functions due to the presence of the cold code. This means that our goal isn't to reduce apparent code size in the final binary but the LLVM IR cost assessed for these routines in the inliner, which closely correlates with code size but is a bit different. As discussed in carbon-language#4283, experimentation shows that a single function call with a minimal number of arguments is the lowest cost model for these. This is easily achieved with a format-string API that internally uses `llvm::formatv`. This PR is essentially the `CHECK` version of carbon-language#4283. However, the check macros are *substantially* harder to make work with both format strings and streaming because they also take a condition. Also, unexpectedly, I was very successful at devising a regular expression based automated rewrite from the streaming to the format string form with only low 10s of manual fixes. This includes compacting strings broken up across lines, etc. Given how well that went, I've prepared this PR which just directly switches to the format string API and migrate everything to use it. One nice side-effect is that the format string approach ends up greatly simplifying the implementation here as well. This is ... *shockingly* effective. Parsing speeds up by more than 3% with just this change. And checking speeds up by **8%** with this change alone: ``` BM_CompileAPIFileDenseDecls<Phase::Parse>/256 86.3µs ± 1% 82.9µs ± 1% -3.94% (p=0.000 n=17+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/1024 431µs ± 1% 415µs ± 1% -3.76% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/4096 1.77ms ± 1% 1.71ms ± 1% -3.18% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/16384 7.44ms ± 1% 7.17ms ± 2% -3.56% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/65536 30.7ms ± 1% 29.7ms ± 1% -3.15% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/262144 131ms ± 1% 127ms ± 1% -2.81% (p=0.000 n=18+18) BM_CompileAPIFileDenseDecls<Phase::Check>/256 878µs ± 2% 800µs ± 1% -8.91% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/1024 1.88ms ± 2% 1.72ms ± 1% -8.56% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/4096 5.78ms ± 2% 5.28ms ± 1% -8.70% (p=0.000 n=20+18) BM_CompileAPIFileDenseDecls<Phase::Check>/16384 21.9ms ± 1% 20.1ms ± 1% -8.02% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Check>/65536 90.4ms ± 2% 83.1ms ± 1% -8.04% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/262144 381ms ± 2% 352ms ± 1% -7.79% (p=0.000 n=19+19) ```
Co-authored-by: josh11b <15258583+josh11b@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, PTAL! I've also incorporated some of what I learned working on the check version.
common/template_string.h
Outdated
// Represent a compile-time string in a form suitable for non-type template | ||
// arguments. | ||
// | ||
// These arguments are required to be "structural", and so we copy the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a link to the specific subsection of that article.
common/template_string.h
Outdated
// Designed to support implicitly deduced construction from a string literal | ||
// template argument. This type will implicitly convert to an `llvm::StringRef` | ||
// for accessing the string contents, and also provides a dedicated `c_str()` | ||
// method to access the string as a C-string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I'm picking trivial nits, I would spell this as "C string" not "C-string".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Co-authored-by: Richard Smith <richard@metafoo.co.uk>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied all the fixes, including the marvelous compile-time testing for validity. Did some cleanup and commenting there to try and make sure its clear what we're doing.
Merging, but please flag anything I should follow up on!
common/template_string.h
Outdated
// Designed to support implicitly deduced construction from a string literal | ||
// template argument. This type will implicitly convert to an `llvm::StringRef` | ||
// for accessing the string contents, and also provides a dedicated `c_str()` | ||
// method to access the string as a C-string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This switches `DCHECK` and `FATAL` as well. The goal is to reduce the code size impact of these assertions so that we can keep more of them enabled. Currently, the largest cost I see from `CHECK` is not the actual check or the cold code itself, but actually the failure to inline trivial functions due to the presence of the cold code. This means that our goal isn't to reduce apparent code size in the final binary but the LLVM IR cost assessed for these routines in the inliner, which closely correlates with code size but is a bit different. As discussed in carbon-language#4283, experimentation shows that a single function call with a minimal number of arguments is the lowest cost model for these. This is easily achieved with a format-string API that internally uses `llvm::formatv`. This PR is essentially the `CHECK` version of carbon-language#4283. However, the check macros are *substantially* harder to make work with both format strings and streaming because they also take a condition. Also, unexpectedly, I was very successful at devising a regular expression based automated rewrite from the streaming to the format string form with only low 10s of manual fixes. This includes compacting strings broken up across lines, etc. Given how well that went, I've prepared this PR which just directly switches to the format string API and migrate everything to use it. One nice side-effect is that the format string approach ends up greatly simplifying the implementation here as well. This is ... *shockingly* effective. Parsing speeds up by more than 3% with just this change. And checking speeds up by **8%** with this change alone: ``` BM_CompileAPIFileDenseDecls<Phase::Parse>/256 86.3µs ± 1% 82.9µs ± 1% -3.94% (p=0.000 n=17+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/1024 431µs ± 1% 415µs ± 1% -3.76% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/4096 1.77ms ± 1% 1.71ms ± 1% -3.18% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/16384 7.44ms ± 1% 7.17ms ± 2% -3.56% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/65536 30.7ms ± 1% 29.7ms ± 1% -3.15% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/262144 131ms ± 1% 127ms ± 1% -2.81% (p=0.000 n=18+18) BM_CompileAPIFileDenseDecls<Phase::Check>/256 878µs ± 2% 800µs ± 1% -8.91% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/1024 1.88ms ± 2% 1.72ms ± 1% -8.56% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/4096 5.78ms ± 2% 5.28ms ± 1% -8.70% (p=0.000 n=20+18) BM_CompileAPIFileDenseDecls<Phase::Check>/16384 21.9ms ± 1% 20.1ms ± 1% -8.02% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Check>/65536 90.4ms ± 2% 83.1ms ± 1% -8.04% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/262144 381ms ± 2% 352ms ± 1% -7.79% (p=0.000 n=19+19) ```
This switches `DCHECK` and `FATAL` as well. The goal is to reduce the code size impact of these assertions so that we can keep more of them enabled. Currently, the largest cost I see from `CHECK` is not the actual check or the cold code itself, but actually the failure to inline trivial functions due to the presence of the cold code. This means that our goal isn't to reduce apparent code size in the final binary but the LLVM IR cost assessed for these routines in the inliner, which closely correlates with code size but is a bit different. As discussed in carbon-language#4283, experimentation shows that a single function call with a minimal number of arguments is the lowest cost model for these. This is easily achieved with a format-string API that internally uses `llvm::formatv`. This PR is essentially the `CHECK` version of carbon-language#4283. However, the check macros are *substantially* harder to make work with both format strings and streaming because they also take a condition. Also, unexpectedly, I was very successful at devising a regular expression based automated rewrite from the streaming to the format string form with only low 10s of manual fixes. This includes compacting strings broken up across lines, etc. Given how well that went, I've prepared this PR which just directly switches to the format string API and migrate everything to use it. One nice side-effect is that the format string approach ends up greatly simplifying the implementation here as well. This is ... *shockingly* effective. Parsing speeds up by more than 3% with just this change. And checking speeds up by **8%** with this change alone: ``` BM_CompileAPIFileDenseDecls<Phase::Parse>/256 86.3µs ± 1% 82.9µs ± 1% -3.94% (p=0.000 n=17+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/1024 431µs ± 1% 415µs ± 1% -3.76% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/4096 1.77ms ± 1% 1.71ms ± 1% -3.18% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/16384 7.44ms ± 1% 7.17ms ± 2% -3.56% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/65536 30.7ms ± 1% 29.7ms ± 1% -3.15% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/262144 131ms ± 1% 127ms ± 1% -2.81% (p=0.000 n=18+18) BM_CompileAPIFileDenseDecls<Phase::Check>/256 878µs ± 2% 800µs ± 1% -8.91% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/1024 1.88ms ± 2% 1.72ms ± 1% -8.56% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/4096 5.78ms ± 2% 5.28ms ± 1% -8.70% (p=0.000 n=20+18) BM_CompileAPIFileDenseDecls<Phase::Check>/16384 21.9ms ± 1% 20.1ms ± 1% -8.02% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Check>/65536 90.4ms ± 2% 83.1ms ± 1% -8.04% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/262144 381ms ± 2% 352ms ± 1% -7.79% (p=0.000 n=19+19) ```
This switches `DCHECK` and `FATAL` as well. The goal is to reduce the code size impact of these assertions so that we can keep more of them enabled. Currently, the largest cost I see from `CHECK` is not the actual check or the cold code itself, but actually the failure to inline trivial functions due to the presence of the cold code. This means that our goal isn't to reduce apparent code size in the final binary but the LLVM IR cost assessed for these routines in the inliner, which closely correlates with code size but is a bit different. As discussed in #4283, experimentation shows that a single function call with a minimal number of arguments is the lowest cost model for these. This is easily achieved with a format-string API that internally uses `llvm::formatv`. This PR is essentially the `CHECK` version of #4283. However, the check macros are substantially harder to make work with both format strings and streaming because they also take a condition. Also, unexpectedly, I was very successful at devising a regular expression based automated rewrite from the streaming to the format string form with only low 10s of manual fixes. This includes compacting strings broken up across lines, etc. Given how well that went, I've prepared this PR which just directly switches to the format string API and migrate everything to use it. One nice side-effect is that the format string approach ends up greatly simplifying the implementation here as well. This is ... *shockingly* effective. Parsing speeds up by more than 3% with just this change. And checking speeds up by **8%** with this change alone: ``` BM_CompileAPIFileDenseDecls<Phase::Parse>/256 86.3µs ± 1% 82.9µs ± 1% -3.94% (p=0.000 n=17+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/1024 431µs ± 1% 415µs ± 1% -3.76% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/4096 1.77ms ± 1% 1.71ms ± 1% -3.18% (p=0.000 n=18+19) BM_CompileAPIFileDenseDecls<Phase::Parse>/16384 7.44ms ± 1% 7.17ms ± 2% -3.56% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/65536 30.7ms ± 1% 29.7ms ± 1% -3.15% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Parse>/262144 131ms ± 1% 127ms ± 1% -2.81% (p=0.000 n=18+18) BM_CompileAPIFileDenseDecls<Phase::Check>/256 878µs ± 2% 800µs ± 1% -8.91% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/1024 1.88ms ± 2% 1.72ms ± 1% -8.56% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/4096 5.78ms ± 2% 5.28ms ± 1% -8.70% (p=0.000 n=20+18) BM_CompileAPIFileDenseDecls<Phase::Check>/16384 21.9ms ± 1% 20.1ms ± 1% -8.02% (p=0.000 n=18+20) BM_CompileAPIFileDenseDecls<Phase::Check>/65536 90.4ms ± 2% 83.1ms ± 1% -8.04% (p=0.000 n=19+20) BM_CompileAPIFileDenseDecls<Phase::Check>/262144 381ms ± 2% 352ms ± 1% -7.79% (p=0.000 n=19+19) ``` --------- Co-authored-by: Richard Smith <richard@metafoo.co.uk> Co-authored-by: josh11b <15258583+josh11b@users.noreply.github.com>
The goal is to replace our stream operator APIs with format string APIs that can be made to have much less impact on inlining and other optimizations of the performance critical path through the code.
Several experiments show that the most compact representation we can arrange for is one that calls an uninlined function and passes a minimal number of arguments to it. It doesn't help to do any work to minimize the arguments such as building a lambda -- the cost of extra code to merge the arguments is likely to outweigh the benefit.
Initial experiments showed that switching a hot but uninlined function to this new API enabled inlining and the subsequent performance improvement.
This also adds a 'TemplateString` utility that allows using a string literal as a template parameter. This is useful to remove the format string itself from the arguments passed to the function by passing it as a template argument instead.
Currently, support is left in place for both APIs because with
CARBON_VLOG
we can detect whether or not any message was provided expecting a format string. This should allow incrementally migrating code to this API. I've added some test coverage in this PR, but I'll separate out any switching of parts of the codebase over.The goal is to eventually replace all the usages and remove the streaming support entirely.
This PR doesn't update
CARBON_CHECK
in the same way because it is substantially more complex to switch. I have a few experimental PRs looking at that and will discuss how best to approach this with the specific challenges check presents separately. But the goal is for all of the macro-based output APIs to move to format strings rather than streams.