rewrite starargs in some cases #669
base: master
Conversation
There were two issues. First, we would only apply our patchset once: if we ever reverted the patches (I'm not sure under what conditions that happens), we would never apply them again. I attempted to fix this by adding a special patch that adds a new file that CMake looks for; if the file doesn't exist, CMake runs the patches again. Second, we didn't rebuild libunwind after applying new patches. I'm not sure if there's a good general solution to this, but I was able to figure out how to force libunwind to rebuild if we need to rerun the patch command. It took some hacking, since CMake doesn't track dependencies on external projects, so we have to add some custom dependencies.
Fix some issues with the way we build libunwind
I think this lets us specify that certain functions should be put together at the end of the text segment. This is inspired by a similar feature of HHVM's build, though the goal for us for now is just to improve performance consistency rather than overall performance. Hopefully soon/eventually we can do profile-guided sorting like they do.
Add a section-ordering script
Conflicts:
	Makefile
	from_cpython/Include/dictobject.h
	from_cpython/Include/object.h
	from_cpython/Include/pyconfig.h.in
	src/CMakeLists.txt
getattr throws an exception if the attribute is not present. getattrFunc already throws the same exception (if there isn't a default value passed in).
for getattrFunc use getattrInternal instead of getattr
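The Python-level behavior this commit preserves can be illustrated with plain getattr (a minimal sketch of the semantics, not of Pyston's internal getattrFunc/getattrInternal):

```python
class C:
    pass

obj = C()

# getattr raises AttributeError when the attribute is not present...
try:
    getattr(obj, "missing")
    raised = False
except AttributeError:
    raised = True
assert raised

# ...unless a default value is passed in, in which case it is returned
# instead of raising (the "default value" case the commit message mentions).
assert getattr(obj, "missing", 42) == 42
```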
The motivating one was classLookup(), since this happened extremely frequently (once for every old-style instance lookup), but I decided to go through and get some others.
and have the compiler pick the best way to convert to StringRef.

I was running into some cases where we had StringRefs but would call boxString, which takes a std::string. So hopefully this change makes things cleaner and (slightly) faster.
Reduce string allocations
The notable places that are changed are PyEq, PyLt, and compare/compareInternal. The old codepaths are still in there (though thankfully now with reduced duplication with CPython), since for anything that defines a Python-level __lt__, it's better to call that with our rewriting support rather than calling it through slot_tp_richcompare. The control flow is kind of messy, since I don't think we know what the right long-term organization is for these kinds of things. But basically, it's:
- if we can't rewrite, just call the C slot
- if we can rewrite, and we think calling tp_richcompare is profitable (i.e. it's not slot_tp_richcompare), we call that and emit a call to it in the patchpoint
- otherwise, we try calling the Python attribute
Actual conversion of our attributes to tp_richcompare is in the next commit.
int, long, str, tuple, type

int and long are implemented using tp_compare in CPython, which is the old-style comparison method. I don't really understand its semantics, which rely on type coercion, and we don't have the methods it needs, so just implement them as tp_richcompare for now. I think this is still an overall compatibility improvement. str_richcompare is very odd: we have to do some weird things to convince the compiler to produce the best code it can.
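As a Python-level sketch of the dispatch that tp_richcompare backs (ordinary Python here, not Pyston's C++ slots): a type's __lt__ is tried first, and if it returns NotImplemented, the interpreter falls back to the right operand's reflected method.

```python
class A:
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        return self.v < other.v

assert A(1) < A(2)

# Reflected fallback: when the left operand's __lt__ returns
# NotImplemented, Python tries the right operand's __gt__.
class L:
    def __lt__(self, other):
        return NotImplemented

class R:
    def __gt__(self, other):
        return "reflected"

assert (L() < R()) == "reflected"
```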
virtualenv_test currently takes ~7m for me. We should probably move most of the stuff it does to the "extra" tests, but for now at least make `make check` not fail.
tp_richcompare
I don't think we regularly run make check anymore, and it's a pain to make sure the user has an up-to-date gcc (would need to come up with a good set of instructions for anyone on 12.04). So just remove it from make check for now.
This roughly halves stattimer overhead, from +80% to +40% (on a very small benchmark; more complex benchmarks probably have lower overhead)
I would have expected this to help a bunch but it seems to actually hurt
Main change is separating the representation of the timer stack (StatTimer) from the manipulation of it (ScopedStatTimer). This makes it easier to keep the different use cases (top level timers vs non-top-level) separate without having the individual functions check whether they think they are a top-level timer or not.
Optimize stattimers
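The representation/manipulation split described above can be sketched as follows. This is a minimal illustration with hypothetical Python stand-ins for Pyston's C++ StatTimer/ScopedStatTimer classes, not the real API:

```python
import time

class StatTimer:
    """Representation: a named timer that can sit on the timer stack."""
    stack = []

    def __init__(self, name):
        self.name = name
        self.total = 0.0

class ScopedStatTimer:
    """Manipulation: pushes/pops a StatTimer for one lexical scope,
    so individual functions never have to check whether they are the
    top-level timer."""
    def __init__(self, timer):
        self.timer = timer

    def __enter__(self):
        StatTimer.stack.append(self.timer)
        self.start = time.perf_counter()
        return self.timer

    def __exit__(self, *exc):
        self.timer.total += time.perf_counter() - self.start
        popped = StatTimer.stack.pop()
        assert popped is self.timer  # scopes must nest properly

t = StatTimer("demo")
with ScopedStatTimer(t):
    pass

assert t.total >= 0.0
assert StatTimer.stack == []
```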
…ySearch to power ::allocationFrom
add -a flag which outputs assembly of ICs
Conflicts: src/runtime/iterobject.cpp
…ues. and make code ready for the new JIT tier.
It deleted the passed ICSlotRewrite*, and there was no way for a caller to know this without looking at the source. Make the ownership explicit by using a std::unique_ptr.
…the new JIT tier IC sizes are guessed...
Conflicts: src/core/types.h
Preparations for the new JIT tier
I was interested in doing all of this because it gets
I also tried rewriting functions that pass keyword arguments; you can check out the commit here: tjhance@57df7d1. This last commit gives no noticeable perf impact (a -0.1% score!). I think the assembly can be optimized a good deal further, though.
BTW, the ics tests I added are kind of sketchy. It seems annoying to get the stats to measure exactly what I want to measure, so I end up tweaking numbers until (i) the test passes and (ii) I'm confident that the stats logically imply that every desired function call gets rewritten and run via IC. But it seems like small perturbations could shake the numbers up. Suggestions?
Re: stats, what kinds of perturbations are you worried about? If you just run the test enough times and then assert that the slowpath count is lower than the loop count, I think it should be ok. Does it work to do fewer warmups but more iterations post-stat-clearing? On some of the tests, the slowpath limit is roughly equal to the number of iterations of the loop, but if we make the iteration count higher, then it would be clear that we can't be hitting the slowpaths consistently.
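The suggestion above (make the loop count dominate the slowpath limit so small perturbations can't flip the test) can be sketched like this. Everything here is a hypothetical stand-in: `slowpath_count` plays the role of Pyston's real stat counters, and `call_with_ic` stands in for a call site with an inline cache:

```python
ITERATIONS = 10_000
WARMUP = 10

slowpath_count = 0

def call_with_ic(i):
    # Stand-in for an IC'd call site: the first few calls take the
    # slowpath; after the IC is "rewritten", calls hit the fast path.
    global slowpath_count
    if i < WARMUP:
        slowpath_count += 1

for i in range(ITERATIONS):
    call_with_ic(i)

# With enough iterations, the slowpath count is clearly dominated by the
# loop count, so the assertion is robust to extra hidden slowpath hits.
assert slowpath_count < ITERATIONS // 10
```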
I'm going to leave off the last commit for now since it's failing the build. I'll make the other changes you suggested.
Also re: re: stats, idk, but the stats seemed to change unpredictably as I moved the numbers around, and I figured there was something going on (more hidden function calls, or logic for choosing when to do rewriting) that I didn't understand.
force-pushed from 57df7d1 to 7a9a999
I addressed your comments... let's see if the build passes.
force-pushed from 7a9a999 to a2e932b
Ok, I took a fuller look. (BTW, next time let's separate the moving of the code from the changing of it; I had to move the code back to get a reasonable diff.)

I'm pretty uneasy about having two large and completely unrelated codepaths for handling the current case and handling the rewriter case. I think this format makes it very hard to verify that the rewritten case will do the same thing as the first time through, especially since the rewriter section breaks things into cases in a different way than the non-rewriter section does. I know it would be some work, but what do you think about moving this back to the form of doing the rewriting inline with the handling of the current case? For example, it seems very risky to have the new helper functions only get called from the rewriter path; I think the code would be more "obviously right" if it had the form "call this function, and have the rewriter emit a call to the same function with the same arguments". I know it seems like extra work, but seeing how often we have rewriter issues even when the rewrite is next to the normal path, my confidence is low in our ability to maintain and update this new form :/
Hm, sorry if it's hard to read the diff. Did you check out the individual commits? I tried to do the "move stuff around" work in its own commit.

Anyway, I don't really agree that the logic is clearer or easier to maintain when it's intertwined (that said, I'm biased :P). It seemed to make sense back when each rewriting step corresponded to some step in the slowpath logic. But as we rewrite more things, the casework explodes, steps don't really map one-to-one anymore, and keeping the logic intertwined was a huge headache. This way, at least, it's easy to say "this chunk of code [the slowpath] is what the logic is supposed to look like" or "this chunk of code is what the assembly looks like when the args and params have such-and-such pattern". Also, since the steps don't map one-to-one as well, it's harder to reuse the helper functions.

You could say the non-one-to-one-ness is a code smell, but I think that if we want to emit the simplest possible assembly for every case, we really do need to break it up into cases like this, because when varargs are involved it's inherently very dynamic, and when varargs aren't involved, you basically know in advance exactly where to copy what. And I think if we make even more optimizations, like if varargs is a list or tuple, then it will get ever more hairy and we'll appreciate that the cases are separate. So that's why I made the change to begin with, and I was mostly happy with it as I updated it to support keyword args (not in this PR); the biggest issue was that I had to save some information from the slowpath case in a vector to use later in rewriting the assembly. Let me know what you think.
Yeah, I appreciated the separate commit, but at this point there are changes on both sides of the move commit (I guess that would have been hard to anticipate).

Anyway, I'm not against having a bunch of different cases, but I guess I'm saying that instead of something like this:

    genericHandleCurrentCall();
    if (case1)
        rewriter->call(handleCase1);
    else if (case2)
        rewriter->call(handleCase2);
    // else no rewrite

we do something like this:

    if (case1) {
        handleCase1();
        rewriter->call(handleCase1);
    } else if (case2) {
        handleCase2();
        rewriter->call(handleCase2);
    } else {
        genericHandleCurrentCall();
    }

which should make it much easier to verify that the initial call and the rewritten version have the same behavior. Currently, it's really tough to verify that.

In lieu of the code being verifiable, I'd want us to have rock-solid testing of this. Something like: test all reasonable combinations of argspec + paramspec + passed starargs (and have those all hit the rewritten version).

Anyway, I totally agree with what you're trying to get at, and something had to change about this function since the generic handling was not amenable to rewriting at all. I think we can have it all though :)
force-pushed from 352fd89 to 6488a3e
Now we can rewrite function calls that pass starargs and possibly receive varargs (but no defaults or kwargs). The new test case has an exhaustive list of all such (small) cases.
Performance impact is negligible, hopefully it will be better when we can rewrite kwargs too.
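For reference, the two calling-convention shapes being rewritten here can be seen at the Python level (plain Python semantics, independent of the rewriter; `fixed` and `varargs` are illustrative names):

```python
def fixed(a, b, c):
    # Callee with only fixed params: starargs elements are unpacked
    # into individual parameter slots at the call site.
    return (a, b, c)

def varargs(a, *rest):
    # Callee that receives varargs: leftover positional args are
    # packed back into a tuple for *rest.
    return (a, rest)

args = (2, 3)
assert fixed(1, *args) == (1, 2, 3)
assert varargs(1, *args) == (1, (2, 3))
```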