Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite starargs in some cases #669

Open
wants to merge 2,461 commits into
base: master
Choose a base branch
from
Open

Conversation

tjhance
Copy link
Contributor

@tjhance tjhance commented Jul 5, 2015

now we can rewrite function calls that pass starargs and possible receive varargs (but no defaults or kwargs). The new test case has an exhaustive list of all such (small) cases.

Performance impact is negligible, hopefully it will be better when we can rewrite kwargs too.

       django_template.py             4.9s (8)             4.9s (6)  -0.7%
            pyxl_bench.py             4.1s (8)             4.1s (6)  +0.3%
 sqlalchemy_imperative.py             2.0s (8)             2.0s (6)  +0.0%
        django_migrate.py             1.9s (8)             1.9s (6)  -0.4%
      virtualenv_bench.py             8.2s (8)             8.2s (6)  -0.8%
                  geomean                 3.7s                 3.6s  -0.3%

kmod and others added 30 commits June 2, 2015 05:37
First, that we would only apply our patchset once.  If we ever revert the
patches (I'm not sure under what conditions that happens), we previously would
never apply them again.  Attempted to fix this by adding a special patch that
adds a new file that CMake looks for; if the file doesn't exist, cmake runs the
patches again.

Second, that we didn't rebuild libunwind if we apply new patches.
I'm not sure if there's a good general solution to this, but I was able
to figure out how to force libunwind to rebuild if we need to rerun the
patch command.  It took some hacking since CMake doesn't track dependencies
on external projects, so we have to add some custom dependencies.
Fix some issues with the way we build libunwind
I think this lets us specify that certain functions should be
put together at the end of the text segment.

This is inspired by a similar feature of HHVM's build, though the goal
for us for now is just to improve performance consistency rather
than overall performance.  Hopefully soon/eventually we can do
profile-guided sorting like they do.
Add a section-ordering script
Conflicts:
	Makefile
	from_cpython/Include/dictobject.h
	from_cpython/Include/object.h
	from_cpython/Include/pyconfig.h.in
	src/CMakeLists.txt
getattr throws an exception if the attribute is not present.  getattrFunc already throws the same exception (if there isn't a default value passed in).
for getattrFunc use getattrInternal instead of getattr
The motivating one was classLookup(), since this happened
extremely frequently (once for every old-style instance
lookup), but I decided to go through and get some others.
and have the compiler pick the best way to convert to StringRef

I was running into some cases where we had StringRefs but would call
boxString which takes an std::string.  So hopefully this change
makes things cleaner and (slightly) faster.
The notable places that are changed are PyEq, PyLt, and
compare/compareInternal.

The old codepaths are still in there (though thankfully
now with reduced duplication with cpython), since for anything
that defines a Python-level __lt__, it's better to call that
with our rewriting support, rather than calling it through
slot_tp_richcompare.

The control flow is kind of messy, since I don't think we
know what the right long-term organization is for these kinds of
things.  But basically, it's:
- if we can't rewrite, just call the C slot
- if we can rewrite, and we think calling tp_richcompare is profitable
  (ie it's not slot_tp_richcompare), we call that and emit a call to
  it in the patchpoint
- otherwise, we try calling the python attribute

Actual conversion of our attributes to tp_richcompare in the next commit.
int, long, str, tuple, type

int and long are implemented using tp_compare in CPython,
which is the old-style comparison method.  I don't really understand
its semantics which rely on type coercion, and we don't have the
methods it needs, so just implement it as tp_richcompare for now.
I think this is still an overall compatibility improvement.

str_richcompare is very odd where we have to do some weird things
to convince the compiler to produce the best code it can.
virtualenv_test currently takes ~7m for me.  we should probably move
most of the stuff it does to the "extra" tests, but for now at least
make `make check` not fail.
I don't think we regularly run make check anymore, and it's a pain
to make sure the user has an up-to-date gcc (would need to come up
with a good set of instructions for anyone on 12.04).  So just remove
it from make check for now.
This roughly halves stattimer overhead, from +80% to +40%
(on a very small benchmark; more complex benchmarks probably
have lower overhead)
I would have expected this to help a bunch but it seems to actually hurt
Main change is separating the representation of the timer stack (StatTimer) from
the manipulation of it (ScopedStatTimer).  This makes it easier to keep the different
use cases (top level timers vs non-top-level) separate without having the individual
functions check whether they think they are a top-level timer or not.
Chris Toshok and others added 12 commits June 30, 2015 20:37
add -a flag which outputs assembly of ICs
Conflicts:
	src/runtime/iterobject.cpp
…ues.

and make code ready for the new JIT tier.
It deleted the passed ICSlotRewrite* and there was no way for a caller to know this without looking at the source.
Make the ownership explicit by using a std::unique_ptr
Preperations for the new JIT tier
@tjhance
Copy link
Contributor Author

tjhance commented Jul 7, 2015

I

  • cleaned up rearrangeArguments a bit, and moved it into its own file (also in the process, I think I fixed a bug where it wasn't copying args even though it was modifying its contents)
  • made it work when takes_kwargs is true
  • made it not abort the rewrite when calling typeCallInner just because it had starargs (I guess this is okay?)

I was interested in doing all of this because it gets itertools.chain to re-write, and the sqlalchemy_imperative was taking the slowpath for that function over 400,000 times. It did indeed have a noticeable impact on that benchmark, although it wasn't very dramatic on the others.

                           fd0c77b623f6f63526:  e7f77b3d41892849d8:
       django_template.py            4.9s (14)            4.9s (14)  -0.2%
            pyxl_bench.py            4.0s (14)            4.0s (14)  -0.0%
sqlalchemy_imperative2.py             5.6s (2)             5.5s (4)  -1.9%
        django_migrate.py            1.9s (14)            1.9s (14)  -0.4%
      virtualenv_bench.py            8.2s (14)            8.2s (14)  +0.0%
                  geomean                 4.4s                 4.4s  -0.5%

@tjhance
Copy link
Contributor Author

tjhance commented Jul 9, 2015

I also tried re-writing functions that pass keyword arguments, you can check out the commit here: tjhance@57df7d1

This last commit gives no noticeable perf impact:

       django_template.py             4.9s (4)             4.9s (6)  +1.2%
            pyxl_bench.py             4.1s (4)             4.1s (6)  -0.8%
sqlalchemy_imperative2.py             5.6s (4)             5.5s (6)  -0.6%
        django_migrate.py             1.9s (4)             1.9s (6)  -0.4%
      virtualenv_bench.py             8.3s (4)             8.3s (6)  +0.1%
                  geomean                 4.5s                 4.5s  -0.1%

(-0.1%!!! score!!!)

I think the assembly can be optimized a good deal further, though.

@tjhance
Copy link
Contributor Author

tjhance commented Jul 9, 2015

BTW, the ics tests I added are kind of sketchy, it seems annoying to get the stats to measure exactly what i measure so I end up tweaking numbers until (i) it passes and (ii) I'm confident that the stats logically imply that every desired function call gets rewritten and ran via IC. But it seems like small perturbations could shake the numbers up.

Suggestions?

@kmod
Copy link
Collaborator

kmod commented Jul 9, 2015

re: stats, what kinds of perturbations are you worried about? I think if you just run the test enough times and then assert that the slowpath-count is lower than the loop count, I think it should be ok. Does it work to do fewer warmups but more iterations post-stat-clearing? On some of the tests, the slowpath limit is ~= to the number of iterations of the loop, but if we make the iterations higher, then it could be clear that we can't be hitting the slowpaths consistently.

@tjhance
Copy link
Contributor Author

tjhance commented Jul 10, 2015

I'm going to leave off the last commit for now since it's failing the build. I'll make the other changes you suggest.

@tjhance
Copy link
Contributor Author

tjhance commented Jul 10, 2015

also re: re: stats, idk, but the stats seem to change unpredictably and non-predictably as I moved the numbers, and I figured there was something going (more hidden function calls, or logic for choosing when to do rewriting) that I didn't understand

@tjhance tjhance force-pushed the rewrite-starargs branch from 57df7d1 to 7a9a999 Compare July 10, 2015 06:27
@tjhance
Copy link
Contributor Author

tjhance commented Jul 10, 2015

I addressed your comments... let's see if the build passes

@tjhance tjhance force-pushed the rewrite-starargs branch from 7a9a999 to a2e932b Compare July 11, 2015 03:22
@kmod
Copy link
Collaborator

kmod commented Jul 14, 2015

Ok I took a fuller look (btw, next time let's separate the moving of the code from the changing of it; I had to move the code back to get a reasonable diff). I'm pretty uneasy about having two large and completely unrelated codepaths for handling the current case and handling the rewriter case. I think this format makes it very hard to verify that the rewritten case will do the same thing as the first time through, especially since the rewriter section breaks things into cases in a different way than the non-rewriter section does.

I know it would be some work, but what do you think about moving this back to the form of doing the rewriting inline with the handling of the current case? For example, it seems very risky to have the new helper functions only get called from the rewriter path; I think the code would be more "obviously right" if it had the form "call this function, and have the rewriter emit a call to the same function with the same arguments". I know it seems like extra work, but seeing how often we have rewriter issues even when the rewrite is next to the normal path, my confidence is low in our ability to maintain + update this new form :/

@tjhance
Copy link
Contributor Author

tjhance commented Jul 14, 2015

Hm sorry if it's hard to read the diff. Did you check out the individual commits? I tried to do the 'move stuff around' work in its own commit.

Anyway, I don't really agree that the logic is clearer or easier to maintain when it's intertwined (that said, I'm biased :P). It seemed to make sense back each rewriting step corresponded to some step in the slowpath logic. But as we rewrite more things, the casework explodes, steps don't really map one-to-one anymore, and keeping the logic intertwined was a huge headache. This way, at least, it's easy to say "this chunk of code [the slowpath] is what the logic is supposed to look like" or "this chunk of code is what the assembly looks like when the args and params have such-and-such pattern". Also, since the steps don't map one-to-one as well, it's harder to reuse the helper functions.

You could say the non-one-to-one-ness is a code smell, but I think that if we want to emit the simplest possible assembly for every case, we really do need to break it up into cases like this, because when varargs are involved it's inherently very dynamic and when varargs aren't involved, you basically know in advance exactly where to copy what. And I think if we make even more optimizations, like if varargs is a list or tuple, then it will get every more hairy and we'll appreciate that the cases are separate.

So that's why I made the change to begin with, and I was mostly happy with it as I updated it to support keyword args (not in this PR) (the biggest issue was that I had to save some information from the slowpath case in a vector to use later in rewriting the assembly).

Let me know what you think.

@kmod
Copy link
Collaborator

kmod commented Jul 14, 2015

Yeah I appreciated the separate commit but at this point there are changes on both sides of the move commit (I guess that would have been hard to anticipate).

Anyway, I'm not against having a bunch of different cases, but I guess I'm saying that instead of something like this:

genericHandleCurrentCall();

if (case1)
    rewriter->call(handleCase1);
else if (case2)
    rewriter->call(handleCase2);
// else no rewrite

we do something like this

if (case1) {
    handleCase1();
    rewriter->call(handleCase1);
} else if (case2) {
    handleCase2();
    rewriter->call(handleCase2);
} else {
    genericHandleCurrentCall();
}

Which should make it much easier to verify that the initial call and the rewritten version have the same behavior. Currently, it's really tough to verify that handleCase1 has the same behavior as if we had run those cases through genericHandleCurrentCall, and plus we have the issue that most of the error-handling code in handleCase1 won't get exercised, since it's very rare that a callsite will work the first time but not subsequent times. (Makes me wish we had coverage info.)

In lieu of the code being verifiable, I'd want us to have rock-solid testing of this. Something like, test all reasonable combinations of argspec+paramspec+passed starargs (and having those all hit the rewritten version).

Anyway, I totally agree with what you're trying to get at, and something had to change about this function since the generic handling was not amenable to rewriting at all. I think we can have it all though :)

@kmod kmod self-assigned this Jul 14, 2015
@kmod kmod force-pushed the master branch 2 times, most recently from 352fd89 to 6488a3e Compare October 28, 2020 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants