rewrite starargs in some cases #669
base: master
Conversation
There were two issues. First, we would only apply our patchset once: if we ever reverted the patches (I'm not sure under what conditions that happens), we would never apply them again. I attempted to fix this by adding a special patch that adds a new file that CMake looks for; if the file doesn't exist, CMake runs the patches again. Second, we didn't rebuild libunwind after applying new patches. I'm not sure if there's a good general solution to this, but I was able to figure out how to force libunwind to rebuild if we need to rerun the patch command. It took some hacking, since CMake doesn't track dependencies on external projects, so we have to add some custom dependencies.
Fix some issues with the way we build libunwind
I think this lets us specify that certain functions should be put together at the end of the text segment. This is inspired by a similar feature of HHVM's build, though the goal for us for now is just to improve performance consistency rather than overall performance. Hopefully soon/eventually we can do profile-guided sorting like they do.
Add a section-ordering script
Conflicts:
	Makefile
	from_cpython/Include/dictobject.h
	from_cpython/Include/object.h
	from_cpython/Include/pyconfig.h.in
	src/CMakeLists.txt
getattr throws an exception if the attribute is not present. getattrFunc already throws the same exception (if there isn't a default value passed in).
for getattrFunc use getattrInternal instead of getattr
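The Python-level behavior this commit preserves can be illustrated with plain getattr (a minimal sketch of the semantics, not of Pyston's internal getattrFunc/getattrInternal):

```python
class C:
    pass

obj = C()

# getattr raises AttributeError when the attribute is not present...
try:
    getattr(obj, "missing")
    raised = False
except AttributeError:
    raised = True
assert raised

# ...unless a default value is passed in, in which case it is returned
# instead of raising (the "default value" case the commit message mentions).
assert getattr(obj, "missing", 42) == 42
```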
The motivating one was classLookup(), since this happened extremely frequently (once for every old-style instance lookup), but I decided to go through and get some others.
and have the compiler pick the best way to convert to StringRef.

I was running into some cases where we had StringRefs but would call boxString, which takes a std::string. So hopefully this change makes things cleaner and (slightly) faster.
Reduce string allocations
The notable places that are changed are PyEq, PyLt, and compare/compareInternal. The old codepaths are still in there (though thankfully now with reduced duplication with CPython), since for anything that defines a Python-level __lt__, it's better to call that with our rewriting support rather than calling it through slot_tp_richcompare. The control flow is kind of messy, since I don't think we know what the right long-term organization is for these kinds of things. But basically, it's:
- if we can't rewrite, just call the C slot
- if we can rewrite, and we think calling tp_richcompare is profitable (i.e. it's not slot_tp_richcompare), we call that and emit a call to it in the patchpoint
- otherwise, we try calling the Python attribute
Actual conversion of our attributes to tp_richcompare is in the next commit.
int, long, str, tuple, type

int and long are implemented using tp_compare in CPython, which is the old-style comparison method. I don't really understand its semantics, which rely on type coercion, and we don't have the methods it needs, so just implement them as tp_richcompare for now. I think this is still an overall compatibility improvement. str_richcompare is very odd: we have to do some weird things to convince the compiler to produce the best code it can.
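As a Python-level sketch of the dispatch that tp_richcompare backs (ordinary Python here, not Pyston's C++ slots): a type's __lt__ is tried first, and if it returns NotImplemented, the interpreter falls back to the right operand's reflected method.

```python
class A:
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        return self.v < other.v

assert A(1) < A(2)

# Reflected fallback: when the left operand's __lt__ returns
# NotImplemented, Python tries the right operand's __gt__.
class L:
    def __lt__(self, other):
        return NotImplemented

class R:
    def __gt__(self, other):
        return "reflected"

assert (L() < R()) == "reflected"
```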
virtualenv_test currently takes ~7m for me. We should probably move most of the stuff it does to the "extra" tests, but for now at least make `make check` not fail.
tp_richcompare
I don't think we regularly run make check anymore, and it's a pain to make sure the user has an up-to-date gcc (would need to come up with a good set of instructions for anyone on 12.04). So just remove it from make check for now.
This roughly halves stattimer overhead, from +80% to +40% (on a very small benchmark; more complex benchmarks probably have lower overhead)
I would have expected this to help a bunch but it seems to actually hurt
Main change is separating the representation of the timer stack (StatTimer) from the manipulation of it (ScopedStatTimer). This makes it easier to keep the different use cases (top level timers vs non-top-level) separate without having the individual functions check whether they think they are a top-level timer or not.
Optimize stattimers
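The representation/manipulation split described above can be sketched as follows. This is a minimal illustration with hypothetical Python stand-ins for Pyston's C++ StatTimer/ScopedStatTimer classes, not the real API:

```python
import time

class StatTimer:
    """Representation: a named timer that can sit on the timer stack."""
    stack = []

    def __init__(self, name):
        self.name = name
        self.total = 0.0

class ScopedStatTimer:
    """Manipulation: pushes/pops a StatTimer for one lexical scope,
    so individual functions never have to check whether they are the
    top-level timer."""
    def __init__(self, timer):
        self.timer = timer

    def __enter__(self):
        StatTimer.stack.append(self.timer)
        self.start = time.perf_counter()
        return self.timer

    def __exit__(self, *exc):
        self.timer.total += time.perf_counter() - self.start
        popped = StatTimer.stack.pop()
        assert popped is self.timer  # scopes must nest properly

t = StatTimer("demo")
with ScopedStatTimer(t):
    pass

assert t.total >= 0.0
assert StatTimer.stack == []
```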
…ySearch to power ::allocationFrom
add -a flag which outputs assembly of ICs
Conflicts: src/runtime/iterobject.cpp
…ues. and make code ready for the new JIT tier.
It deleted the passed ICSlotRewrite*, and there was no way for a caller to know this without looking at the source. Make the ownership explicit by using a std::unique_ptr.
…the new JIT tier IC sizes are guessed...
Conflicts: src/core/types.h
Preparations for the new JIT tier
I was interested in doing all of this because it gets
I also tried rewriting functions that pass keyword arguments; you can check out the commit here: tjhance@57df7d1. This last commit gives no noticeable perf impact (a -0.1% score!). I think the assembly can be optimized a good deal further, though.
BTW, the ics tests I added are kind of sketchy. It seems annoying to get the stats to measure exactly what I want to measure, so I end up tweaking numbers until (i) the test passes and (ii) I'm confident that the stats logically imply that every desired function call gets rewritten and run via IC. But it seems like small perturbations could shake the numbers up. Suggestions?
Re: stats, what kinds of perturbations are you worried about? If you just run the test enough times and then assert that the slowpath count is lower than the loop count, I think it should be ok. Does it work to do fewer warmups but more iterations post-stat-clearing? On some of the tests, the slowpath limit is roughly equal to the number of iterations of the loop, but if we make the iteration count higher, then it would be clear that we can't be hitting the slowpaths consistently.
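The suggestion above (make the loop count dominate the slowpath limit so small perturbations can't flip the test) can be sketched like this. Everything here is a hypothetical stand-in: `slowpath_count` plays the role of Pyston's real stat counters, and `call_with_ic` stands in for a call site with an inline cache:

```python
ITERATIONS = 10_000
WARMUP = 10

slowpath_count = 0

def call_with_ic(i):
    # Stand-in for an IC'd call site: the first few calls take the
    # slowpath; after the IC is "rewritten", calls hit the fast path.
    global slowpath_count
    if i < WARMUP:
        slowpath_count += 1

for i in range(ITERATIONS):
    call_with_ic(i)

# With enough iterations, the slowpath count is clearly dominated by the
# loop count, so the assertion is robust to extra hidden slowpath hits.
assert slowpath_count < ITERATIONS // 10
```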
I'm going to leave off the last commit for now since it's failing the build. I'll make the other changes you suggested.
Also re: re: stats, idk, but the stats seemed to change unpredictably as I moved the numbers around, and I figured there was something going on (more hidden function calls, or logic for choosing when to do rewriting) that I didn't understand.
force-pushed from 57df7d1 to 7a9a999
I addressed your comments... let's see if the build passes.
force-pushed from 7a9a999 to a2e932b
Ok, I took a fuller look. (BTW, next time let's separate the moving of the code from the changing of it; I had to move the code back to get a reasonable diff.)

I'm pretty uneasy about having two large and completely unrelated codepaths for handling the current case and handling the rewriter case. I think this format makes it very hard to verify that the rewritten case will do the same thing as the first time through, especially since the rewriter section breaks things into cases in a different way than the non-rewriter section does. I know it would be some work, but what do you think about moving this back to the form of doing the rewriting inline with the handling of the current case? For example, it seems very risky to have the new helper functions only get called from the rewriter path; I think the code would be more "obviously right" if it had the form "call this function, and have the rewriter emit a call to the same function with the same arguments". I know it seems like extra work, but seeing how often we have rewriter issues even when the rewrite is next to the normal path, my confidence is low in our ability to maintain and update this new form :/
Hm, sorry if it's hard to read the diff. Did you check out the individual commits? I tried to do the "move stuff around" work in its own commit.

Anyway, I don't really agree that the logic is clearer or easier to maintain when it's intertwined (that said, I'm biased :P). It seemed to make sense back when each rewriting step corresponded to some step in the slowpath logic. But as we rewrite more things, the casework explodes, steps don't really map one-to-one anymore, and keeping the logic intertwined was a huge headache. This way, at least, it's easy to say "this chunk of code [the slowpath] is what the logic is supposed to look like" or "this chunk of code is what the assembly looks like when the args and params have such-and-such pattern". Also, since the steps don't map one-to-one as well, it's harder to reuse the helper functions.

You could say the non-one-to-one-ness is a code smell, but I think that if we want to emit the simplest possible assembly for every case, we really do need to break it up into cases like this, because when varargs are involved it's inherently very dynamic, and when varargs aren't involved, you basically know in advance exactly where to copy what. And I think if we make even more optimizations, like if varargs is a list or tuple, then it will get ever more hairy and we'll appreciate that the cases are separate. So that's why I made the change to begin with, and I was mostly happy with it as I updated it to support keyword args (not in this PR); the biggest issue was that I had to save some information from the slowpath case in a vector to use later in rewriting the assembly. Let me know what you think.
Yeah, I appreciated the separate commit, but at this point there are changes on both sides of the move commit (I guess that would have been hard to anticipate).

Anyway, I'm not against having a bunch of different cases, but I guess I'm saying that instead of something like this:

    genericHandleCurrentCall();
    if (case1)
        rewriter->call(handleCase1);
    else if (case2)
        rewriter->call(handleCase2);
    // else no rewrite

we do something like this:

    if (case1) {
        handleCase1();
        rewriter->call(handleCase1);
    } else if (case2) {
        handleCase2();
        rewriter->call(handleCase2);
    } else {
        genericHandleCurrentCall();
    }

which should make it much easier to verify that the initial call and the rewritten version have the same behavior. Currently, it's really tough to verify that.

In lieu of the code being verifiable, I'd want us to have rock-solid testing of this. Something like: test all reasonable combinations of argspec + paramspec + passed starargs (and have those all hit the rewritten version).

Anyway, I totally agree with what you're trying to get at, and something had to change about this function since the generic handling was not amenable to rewriting at all. I think we can have it all though :)
force-pushed from 352fd89 to 6488a3e
Now we can rewrite function calls that pass starargs and possibly receive varargs (but no defaults or kwargs). The new test case has an exhaustive list of all such (small) cases.
Performance impact is negligible, hopefully it will be better when we can rewrite kwargs too.
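For reference, the two calling-convention shapes being rewritten here can be seen at the Python level (plain Python semantics, independent of the rewriter; `fixed` and `varargs` are illustrative names):

```python
def fixed(a, b, c):
    # Callee with only fixed params: starargs elements are unpacked
    # into individual parameter slots at the call site.
    return (a, b, c)

def varargs(a, *rest):
    # Callee that receives varargs: leftover positional args are
    # packed back into a tuple for *rest.
    return (a, rest)

args = (2, 3)
assert fixed(1, *args) == (1, 2, 3)
assert varargs(1, *args) == (1, (2, 3))
```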