Effects: partial CPS transform #1384
Conversation
It would be nice to be able to propagate information across units during separate compilation (related to #550).
Would global_flow.ml make it possible to address #594?
Force-pushed from 35129dc to d17a3e6
I have added a test.
Force-pushed from d2f9d2f to febb7c8
@pmwhite Do you want to give it a try?
I didn't read it in detail, but it looks good from here. The performance wiki page still needs to be updated (text and graphs). It would be nice to show timing improvements for non-benchmark programs.
Indeed, I would. Likely some time next week. |
Thanks for this work. The improvements are great. IIUC the benchmarks don't use effect handlers. I would also be interested in seeing the improvements in programs that use effect handlers. In the PLDI 21 paper on effect handlers, we studied the performance of effect handlers using two small benchmarks -- chameneos redux (Section 6.3.1) and generators (Section 6.3.2). The source code for the benchmarks is here: https://github.com/kayceesrk/code-snippets/tree/master/eff_bench. I would be interested in seeing the performance difference between
Force-pushed from dbdece8 to af22490
This makes a significant difference as well. Here is a quick measurement:
Thanks for the results. It is good to see partial CPS doing well here. The next question is harder to answer, because it may be ill-informed, but let me ask it anyway. On these benchmarks, how close to perfectly precise / optimal performance is the current partial CPS? As in, if you had a chance to only CPS those functions which are absolutely needed in these benchmarks, what would the performance be?
I think the code for
let chams = List.map ~f:(fun c -> ref c) colors in
...
let ns = List.map ~f:MVar.take fs
Thanks @vouillon for your answer. It helped me put the numbers in perspective. |
Do you expect the analysis to be more expensive when effects are off?
Just tried this patch out. I've run into an issue, I believe with the lexer, which is having trouble parsing column 57 of this line. I assume this has to do with the recent changes to the lexer/parser, and not with this particular PR, but it does block me from testing this PR itself. |
Should be fixed by #1395 |
Force-pushed from af22490 to 697de75
I've now run into the following error:
This refers, I believe, to this library; hopefully that reproduces easily enough. |
@vouillon, should we merge?
I still need to update the documentation... |
Force-pushed from 241ccb6 to 3542614
We start from a pretty good ordering (reverse postorder is optimal when there is no loop). Then we use a queue so that we process all other pending nodes before coming back to a node, resulting in fewer iterations.
This is useful when the graph changes dynamically.
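For context, here is a minimal sketch of this kind of worklist-based fixpoint computation, with hypothetical names (the actual js_of_ocaml solver is more involved): nodes are seeded in a good order such as reverse postorder, and a FIFO queue guarantees that every other pending node is processed before a node is revisited.

```ocaml
(* Sketch only, not the actual solver.  [update n] recomputes the value
   associated with node [n] and returns the successors whose value may now
   be stale, which are re-enqueued. *)
module ISet = Set.Make (Int)

let fixpoint ~initial_order ~update =
  let queue = Queue.create () in
  let pending = ref ISet.empty in
  let push n =
    if not (ISet.mem n !pending) then begin
      pending := ISet.add n !pending;
      Queue.add n queue
    end
  in
  (* Seed the queue, e.g. with nodes in reverse postorder. *)
  List.iter push initial_order;
  while not (Queue.is_empty queue) do
    let n = Queue.pop queue in
    pending := ISet.remove n !pending;
    List.iter push (update n)
  done
```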
We omit stack checks when jumping from one block to another within a function, except for backward edges. Stack checks are also omitted when calling function continuations. We have to check the stack depth in `caml_alloc_stack` for the test `evenodd.ml` to succeed: otherwise, popping all the fibers exhausts the JavaScript stack. We don't have this issue with the OCaml runtime, since it allocates one stack per fiber.
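As a hypothetical illustration of the backward-edge rule (assuming blocks are numbered in reverse postorder; this is not the compiler's actual code):

```ocaml
(* With blocks numbered in reverse postorder, an edge whose target does not
   come strictly after its source may close a loop; only such branches get
   a stack check, while forward edges skip it. *)
let branch_needs_stack_check ~src_index ~dst_index = dst_index <= src_index
```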
I think the issue only occurs when optimization of tail recursion is enabled
We analyse the call graph to avoid turning functions into CPS when we know that they don't involve effects. This relies on a global control flow analysis to find which function might be called where.
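A rough sketch of the idea, with hypothetical names (not the actual analysis): once the global control flow analysis has produced an over-approximation of the callers of each function, the functions that may perform an effect are marked, the mark is propagated to all their direct and indirect callers, and everything left unmarked stays in direct style.

```ocaml
(* [callers] maps a function to the functions that may call it, as computed
   by the global control flow analysis; [performs_effect] is the set of
   functions that may perform an effect directly. *)
module SSet = Set.Make (String)

let functions_needing_cps
    ~(callers : (string, string list) Hashtbl.t)
    ~(performs_effect : SSet.t) : SSet.t =
  let rec mark f marked =
    if SSet.mem f marked then marked
    else
      let marked = SSet.add f marked in
      let callers_of_f =
        Option.value ~default:[] (Hashtbl.find_opt callers f)
      in
      List.fold_left (fun acc g -> mark g acc) marked callers_of_f
  in
  SSet.fold mark performs_effect SSet.empty
```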
Force-pushed from 3542614 to cf53d33
Force-pushed from cf53d33 to 43b3650
Hi, I'm surprised to see in this benchmark that partial CPS is in some cases much slower. (The median is 0.65 of full CPS, but in some cases it's more than 4 times slower, meaning 25-30 times slower than no CPS.) Here are the tests for which partial is slower:
The slower tests seem to be at the end of the table. If the order corresponds to the order of execution, maybe something happened to the machine while running the tests... Another explanation could be that some of the control flow is exception-based and the full CPS version would
That was my hypothesis as well. We should try to reproduce this at some point. Unfortunately, compiling these benchmarks is a bit complicated when you are not at Jane Street...
We identify functions that don't involve effects by analyzing the call graph, and we keep them in direct style. This relies on a global control flow analysis to find which function might be called where.
The analysis is very effective on small / monomorphic programs. `hamming` is somewhat slower since it uses lazy values (we don't analyze mutable values). `nucleic` is faster since the global control flow analysis is used to avoid some slow function calls. This measurement was made before Apply functions: optimizations #1358 was merged; the gap is probably narrower now.
The analysis is less effective on large programs. Higher-order functions such as `List.iter` are turned into CPS, and then all functions that call such a function, directly or indirectly, need to be turned into CPS as well. There is also some horizontal contamination, where a function needs to be turned into CPS because it is used in a context which expects a CPS function, and this then impacts all the other places where it is called. Still, `ocamlc` is now only about 10% slower (it is about 60% slower with the released version of Js_of_ocaml), and CAMLboy is less than 25% slower (650 FPS instead of 800 FPS).
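To make the contamination described above concrete, here is a small, hypothetical OCaml example (illustrative only; the function names are made up):

```ocaml
(* [log] performs no effect, but it is passed to [List.iter]; since
   [List.iter] has to be compiled in CPS (other callers give it effectful
   arguments), [log] must follow the CPS calling convention as well, and
   its other, purely direct-style call sites are impacted in turn. *)
open Effect

type _ Effect.t += Yield : unit Effect.t

(* This use of effects is what forces [List.iter] into CPS. *)
let yield_all l = List.iter (fun _ -> perform Yield) l

(* [log] itself does not involve effects... *)
let log x = print_endline x

(* ...but here it flows into the CPS version of [List.iter]... *)
let log_all l = List.iter log l

(* ...so this direct call to [log] is affected too: that is the
   "horizontal contamination" mentioned above. *)
let greet () = log "hello"
```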
The size of the generated code is less than 20% larger, and only a few percent larger when compressed. For a large Web app, I get a 44% increase in generated code size (6% when compressed).
Compiling `ocamlc` is about 25% slower.