Cross Module Quickening Paper #696
Comments
This is really similar to my idea in #432, though they have obviously taken it much further. I'm intrigued.
My thoughts after reading this paper:
The results are impressive, and having written a lot of NumPy-using code in the past, I'm not surprised that the type stability is so exploitable. As with most optimizations of the Python side of NumPy, the benefit is greatest when the data is small and shrinks as the data gets larger. Still, it seems like a way for the Faster CPython work to have more of an impact on numeric computing.
There is of course a small penalty for the extra work in the non-specialized instructions needed to make this work, and it is paid even by those who never load an optimization-aware C extension. I would want to run this against some pure Python benchmarks, such as those in pyperformance, to better understand what that overhead is. Maybe there are clever ways to negate it when we know it won't be needed.
I agree with the authors that NumPy is an important and interesting use case. It's not immediately clear how well the approach generalizes, but within the stdlib it could perhaps improve the performance of decimal and fractions (though their work planning/type dispatch overhead may already be so much lower than NumPy's that it doesn't matter as much).
There is mention of using the inline cache to store the subscript of the expression:
which is syntactic sugar for:
Would the constant propagation work/plans we already have handle this case? (@Fidget-Spinner?) If so, at least for this somewhat common NumPy idiom, that optimization might be handled by the interpreter's optimizer itself in the near future. For reference, the code from the paper:
@fberlakovich, @sbrunthaler: Thanks for the great paper!
If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.
Not merged yet, but I'll open a PR soon.
That may reduce the need for the "extension-delimited superinstructions" (specializing a range of instructions with a single instruction) mentioned in the paper, and avoid that complexity.
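To make the slice discussion above concrete, here is a minimal sketch of how a constant slice subscript desugars. The exact expression from the paper is not reproduced in this thread, so `a[1:-1, 1:-1]` and the `KeyRecorder` class are purely hypothetical stand-ins, not the authors' code:

```python
class KeyRecorder:
    """Hypothetical stand-in for an array: returns the subscript key it receives."""

    def __getitem__(self, key):
        return key


a = KeyRecorder()

# Slice syntax inside a subscript...
sugar = a[1:-1, 1:-1]

# ...builds the same key as constructing the slice objects explicitly.
explicit = a[(slice(1, -1, None), slice(1, -1, None))]

assert sugar == explicit
```

Every component of the key here is a compile-time constant, which is what makes this kind of subscript a plausible candidate for constant evaluation by an optimizer rather than for a cache entry filled in at run time.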
EDIT: I think the base commit was selected incorrectly (it's an unusual case because it's on the 3.12 branch, not main). I'm going to rerun with the correct base and regenerate the results.
The authors' patch has a performance penalty below the noise threshold. This is for benchmarks that don't take advantage of the added CMQ APIs, of course.
I found this interesting paper on Cross-Module Quickening, by Felix Berlakovich and Stefan Brunthaler at ECOOP 2024. It makes use of CPython 3.12.0's specializing interpreter to quicken across C extensions. So far, it seems they introduce more specializations to achieve this. I'm still reading the paper to learn more.
https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2024.6
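For readers unfamiliar with quickening, the general idea can be sketched as a toy model: a generic instruction collects type feedback, rewrites itself to a specialized handler once the operand types look stable, and falls back when a guard fails. Everything below (`AddInstruction`, `WARMUP`, `int_add`) is a hypothetical illustration, not CPython's implementation or the paper's API:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

WARMUP = 2  # hypothetical threshold: specialize after this many type-stable runs


def generic_add(a: Any, b: Any) -> Any:
    """Slow path: full dynamic dispatch on every execution."""
    return a + b


def int_add(a: int, b: int) -> int:
    """Fast path standing in for a type-specific routine."""
    return int.__add__(a, b)


@dataclass
class AddInstruction:
    """One instruction slot with an inline cache for the observed operand types."""

    handler: Callable[[Any, Any], Any] = generic_add
    cached_types: Optional[tuple] = None
    stable_runs: int = 0

    def execute(self, a: Any, b: Any) -> Any:
        types = (type(a), type(b))
        if self.handler is not generic_add:
            if types == self.cached_types:
                # Guard holds: stay on the specialized path.
                return self.handler(a, b)
            # Guard failed: de-optimize back to the generic instruction.
            self.handler = generic_add
            self.cached_types = None
            self.stable_runs = 0
        # Generic path: record type feedback in the inline cache.
        if types == self.cached_types:
            self.stable_runs += 1
        else:
            self.cached_types = types
            self.stable_runs = 1
        if self.stable_runs >= WARMUP and types == (int, int):
            # Quicken: rewrite the instruction in place for later executions.
            self.handler = int_add
        return generic_add(a, b)


instr = AddInstruction()
instr.execute(1, 2)      # generic, first observation
instr.execute(3, 4)      # generic, types stable: instruction is quickened
instr.execute(5, 6)      # runs the specialized handler
instr.execute("a", "b")  # guard fails, falls back to the generic handler
```

In this toy model the specialized handler lives in the same module as the interpreter loop. As described above, the paper's contribution is extending this kind of specialization across the boundary into C extensions.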