
Effects: partial CPS transform #1384

Merged 17 commits from effects-improvements into master on Feb 2, 2023
Conversation

vouillon
Member

We identify functions that don't involve effects by analyzing the call graph, and we keep them in direct style. This relies on a global control flow analysis to find which function might be called where.

The analysis is very effective on small / monomorphic programs.

  • hamming is somewhat slower since it uses lazy values (we don't analyze mutable values).
  • nucleic is faster since the global control flow analysis is used to avoid some slow function calls. This measurement was made before Apply functions: optimizations #1358 was merged. The gap is probably narrower now.
    [benchmark plot: time-effects]

The analysis is less effective on large programs. Higher-order functions such as List.iter are turned into CPS, and then every function that directly or indirectly calls such a function needs to be turned into CPS as well. There is also some horizontal contamination, where a function needs to be turned into CPS because it is used in a context that expects a CPS function, which then impacts all other places where it is called. Still, ocamlc is now only about 10% slower (it is about 60% slower with the released version of Js_of_ocaml).
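To make the contamination concrete, here is a schematic JavaScript sketch (hypothetical code, not the compiler's actual output): once a higher-order function such as `List.iter` is CPS-transformed, every function calling it must take a continuation too.

```javascript
// Direct-style summation: no continuation argument, ordinary returns.
function sumDirect(xs) {
  let s = 0;
  for (const x of xs) s += x;
  return s;
}

// A CPS version of iteration: it threads an explicit continuation `k`.
// Once `iterCps` is in CPS (because some caller passes an effectful
// callback), every function calling it must be CPS-transformed as well.
function iterCps(f, xs, k) {
  if (xs.length === 0) return k();
  return f(xs[0], () => iterCps(f, xs.slice(1), k));
}

// `sumCps` calls `iterCps`, so it must take a continuation itself:
// this is the contamination along the call graph.
function sumCps(xs, k) {
  let s = 0;
  return iterCps((x, k2) => { s += x; return k2(); }, xs, () => k(s));
}

sumCps([1, 2, 3], (s) => console.log(s)); // prints 6
```

The partial transform keeps `sumDirect`-style code untouched whenever the analysis can prove that no effect is reachable from it.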

CAMLboy is less than 25% slower (650 FPS instead of 800 FPS).

The size of the generated code is less than 20% larger, and a few percent larger when compressed. For a large Web app, I see a 44% increase in generated code size (6% when compressed).
[benchmark plots: size-effects, size-bzip2-effects]

Compiling ocamlc is about 25% slower.

@hhugo
Member

hhugo commented Jan 13, 2023

"Effects: use global flow analysis to detect more exact calls" doesn't affect any expect tests. I think we should add some.

@hhugo
Member

hhugo commented Jan 13, 2023

It would be nice to be able to propagate information across units during separate compilation.

(related to #550)

@hhugo
Member

hhugo commented Jan 13, 2023

Would global_flow.ml allow us to address #594?

@vouillon
Member Author

"Effects: use global flow analysis to detect more exact calls" doesn't affect any expect tests. I think we should add some.

I have added a test

@vouillon
Member Author

Would global_flow.ml allow us to address #594?

Yes, we can use it for that. My concern is that it might be a bit expensive.
Maybe we can have a variant where information does not flow through function arguments, which would still address #594 and should be faster.

(Resolved review threads on compiler/lib/effects.ml, compiler/lib/partial_cps_analysis.ml, and compiler/lib/global_flow.ml.)
@vouillon vouillon force-pushed the effects-improvements branch 2 times, most recently from d2f9d2f to febb7c8 on January 19, 2023
@vouillon
Member Author

@pmwhite Do you want to give it a try?

@hhugo
Member

hhugo commented Jan 20, 2023

I didn't read it in detail, but it looks good from here.

The performance wiki page still needs to be updated (text and graphs).

It would be nice to show timing improvements for non-benchmark programs.

@pmwhite
Contributor

pmwhite commented Jan 20, 2023

@pmwhite Do you want to give it a try?

Indeed, I would. Likely some time next week.

@kayceesrk

Thanks for this work. The improvements are great.

IIUC the benchmarks don't use effect handlers. I would also be interested in seeing the improvements in programs that use effect handlers. In the PLDI 21 paper on effect handlers, we studied the performance of effect handlers using two small benchmarks -- chameneos redux (Section 6.3.1) and generators (Section 6.3.2). The source code for the benchmarks is here: https://github.com/kayceesrk/code-snippets/tree/master/eff_bench. I would be interested in seeing the performance difference between --enable=effects on trunk and this PR.

@vouillon vouillon force-pushed the effects-improvements branch 2 times, most recently from dbdece8 to af22490 on January 22, 2023
@vouillon
Member Author

IIUC the benchmarks don't use effect handlers. I would also be interested in seeing the improvements in programs that use effect handlers. In the PLDI 21 paper on effect handlers, we studied the performance of effect handlers using two small benchmarks -- chameneos redux (Section 6.3.1) and generators (Section 6.3.2). The source code for the benchmarks is here: https://github.com/kayceesrk/code-snippets/tree/master/eff_bench. I would be interested in seeing the performance difference between --enable=effects on trunk and this PR.

This makes a significant difference as well. Here is a quick measurement:

|         | generator | chameneos (1000 iterations) |
|---------|-----------|-----------------------------|
| master  | 10.3s     | 2.3s                        |
| this PR | 6.6s      | 1.5s                        |

@kayceesrk

Thanks for the results. It is good to see partial CPS doing well here.

The next question is harder to answer, because it may be ill informed, but let me ask that anyway. On these benchmarks, how close to perfectly precise / optimal performance is the current partial CPS? As in, if you had a chance to only CPS those functions which are absolutely needed in these benchmarks, what would the performance be?

@vouillon
Member Author

The next question is harder to answer, because it may be ill informed, but let me ask that anyway. On these benchmarks, how close to perfectly precise / optimal performance is the current partial CPS? As in, if you had a chance to only CPS those functions which are absolutely needed in these benchmarks, what would the performance be?

I think the code for generator is optimal.

chameneos uses Printf, which is unnecessarily CPS-transformed. By commenting out all the printfs, it goes down to 1.3s, but I'm not sure how much of the difference is due to just using Printf and how much comes from the CPS transformation. Otherwise, List.map is called both with the function MVar.take, which needs to be in CPS, and with a function that could have been kept in direct style. But that does not make any performance difference.

```ocaml
  let chams = List.map ~f:(fun c -> ref c) colors in
...
  let ns = List.map ~f:MVar.take fs in
```
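As a hypothetical sketch (in JavaScript rather than the compiler's IR), the horizontal contamination looks like this: once `map` has a CPS interface, even a pure callback must adopt the CPS calling convention.

```javascript
// CPS map: callbacks receive an extra continuation argument.
function mapCps(f, xs, k) {
  if (xs.length === 0) return k([]);
  return f(xs[0], (y) => mapCps(f, xs.slice(1), (ys) => k([y, ...ys])));
}

// This pure function could have stayed in direct style...
const double = (x) => x * 2;
// ...but since it is passed where a CPS function is expected,
// it must be wrapped to match the CPS calling convention.
const doubleCps = (x, k) => k(double(x));

mapCps(doubleCps, [1, 2, 3], (ys) => console.log(ys)); // prints [ 2, 4, 6 ]
```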

@kayceesrk

kayceesrk commented Jan 24, 2023

Thanks @vouillon for your answer. It helped me put the numbers in perspective.

@hhugo
Member

hhugo commented Jan 25, 2023

Would global_flow.ml allow us to address #594?

Yes, we can use it for that. My concern is that it might be a bit expensive. Maybe we can have a variant where information does not flow through function arguments, which would still address #594 and should be faster.

Do you expect the analysis to be more expensive when effects are disabled?

@hhugo
Member

hhugo commented Jan 25, 2023

@pmwhite Do you want to give it a try?

Indeed, I would. Likely some time next week.

@pmwhite, any news?

@pmwhite
Contributor

pmwhite commented Jan 25, 2023

@pmwhite, any news?

Just tried this patch out. I've run into an issue, I believe with the lexer, which is having trouble parsing column 57 of this line. I assume this has to do with the recent changes to the lexer/parser, and not with this particular PR, but it does block me from testing this PR itself.

@hhugo
Member

hhugo commented Jan 25, 2023


Just tried this patch out. I've run into an issue, I believe with the lexer, which is having trouble parsing column 57 of this line. I assume this has to do with the recent changes to the lexer/parser, and not with this particular PR, but it does block me from testing this PR itself.

Should be fixed by #1395

@pmwhite
Contributor

pmwhite commented Jan 25, 2023

I've now run into the following error:

```
> js_of_ocaml.exe -I . -o ./delimited_kernel.cma.js --enable with-js-error --source-map-inline --pretty ./delimited_kernel.cma
Error: Some variables escaped (#1)
```

This refers, I believe, to this library; hopefully that reproduces easily enough.

@hhugo
Member

hhugo commented Jan 25, 2023

I've now run into the following error:

```
> js_of_ocaml.exe -I . -o ./delimited_kernel.cma.js --enable with-js-error --source-map-inline --pretty ./delimited_kernel.cma
Error: Some variables escaped (#1)
```

This refers, I believe, to this library; hopefully that reproduces easily enough.

`--debug shortvar` should print the generated file on stderr, with the problematic variable of the form `<var>`.

@hhugo
Member

hhugo commented Feb 1, 2023

@vouillon, should we merge ?

@vouillon
Member Author

vouillon commented Feb 1, 2023

@vouillon, should we merge ?

I still need to update the documentation...

  • We start from a pretty good ordering (reverse postorder is optimal when there is no loop). Then we use a queue so that we process all other nodes before coming back to a node, resulting in fewer iterations. This is useful when the graph changes dynamically.
  • We omit stack checks when jumping from one block to another within a function, except for backward edges. Stack checks are also omitted when calling function continuations.
  • We have to check the stack depth in `caml_alloc_stack` for the test `evenodd.ml` to succeed. Otherwise, popping all the fibers exhausts the JavaScript stack. We don't have this issue with the OCaml runtime since it allocates one stack per fiber. I think the issue only occurs when the optimization of tail recursion is enabled.
  • We analyse the call graph to avoid turning functions into CPS when we know that they don't involve effects. This relies on a global control flow analysis to find which function might be called where.
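The worklist strategy described in the first commit message above can be sketched as follows (hypothetical JavaScript, not the compiler's OCaml code): seed the queue in reverse postorder, and re-enqueue a node's successors only when its value changes, so the fixpoint converges in fewer iterations.

```javascript
// Compute reverse postorder of a graph given as an adjacency map.
function reversePostorder(succs, start) {
  const order = [];
  const seen = new Set();
  (function dfs(n) {
    if (seen.has(n)) return;
    seen.add(n);
    for (const s of succs[n] || []) dfs(s);
    order.push(n); // postorder position
  })(start);
  return order.reverse();
}

// Generic fixpoint solver: process nodes from a queue seeded in reverse
// postorder; when a node's value changes, re-enqueue its successors
// (unless already queued), so each node is revisited as late as
// possible and the analysis needs fewer iterations overall.
function fixpoint(succs, start, transfer, init) {
  const value = {};
  const inQueue = new Set();
  const queue = reversePostorder(succs, start);
  for (const n of queue) { value[n] = init; inQueue.add(n); }
  let iterations = 0;
  while (queue.length > 0) {
    const n = queue.shift();
    inQueue.delete(n);
    iterations++;
    const next = transfer(n, value);
    if (next !== value[n]) {
      value[n] = next;
      for (const s of succs[n] || []) {
        if (!inQueue.has(s)) { inQueue.add(s); queue.push(s); }
      }
    }
  }
  return { value, iterations };
}

// Toy analysis: distance from the start node on a chain a -> b -> c.
const succs = { a: ["b"], b: ["c"], c: [] };
const preds = { a: [], b: ["a"], c: ["b"] };
const transfer = (n, v) =>
  n === "a" ? 0 : Math.min(...preds[n].map((p) => v[p] + 1));
const { value, iterations } = fixpoint(succs, "a", transfer, Infinity);
console.log(value, iterations); // { a: 0, b: 1, c: 2 } after 3 iterations
```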
@hhugo hhugo merged commit 6e15a48 into master Feb 2, 2023
@hhugo hhugo deleted the effects-improvements branch February 2, 2023 21:50
@bnguyenvanyen

I just ran these benchmarks against three versions of the compiler: "no effects", "partial CPS" which is this PR with effects enabled, and "full CPS", which is a recent-ish commit on master with effects enabled. Here are the results:

Hi, I'm surprised to see in this benchmark that partial CPS is in some cases much slower. (The median is 0.65× the full-CPS time, but in some cases it's more than 4 times slower, meaning 25 to 30 times slower than no CPS.)
Is there an explanation? Or am I reading the benchmark wrong?

Here are the tests for which partial is slower:

Dynamic_cells: Focus up and down in 10 element map
Dynamic_cells: Page up and down in 101 element map
Dynamic_cells: Scroll 1-wide window from 0 to 9 and back in 1000 element map
Dynamic_cells: Apply 4 filters and clear with 101 element map using 10 window
Dynamic_cells: Apply 4 filters and clear with 10000 element map using 50 window
Dynamic_cells: Apply 4 filters and clear with 10000 element map using 100 window
Dynamic_cells: Invert ordering of 10 element map twice
Dynamic_cells: Invert ordering of 100 element map twice
Dynamic_cells: Invert ordering of 1000 element map twice
Dynamic_cells: Perform 10 sets of 1 items in a 10 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 5 items in a 10 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 1 items in a 11 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 5 items in a 11 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 1 items in a 100 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 5 items in a 100 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 1 items in a 1000 element map with 10-wide window
Dynamic_cells: Perform 10 sets of 5 items in a 1000 element map with 10-wide window

And a plot:
[benchmark plot: 20230228_jsoo_effects_bench_stripplot_slow]

@hhugo
Member

hhugo commented Feb 28, 2023

The slower tests seem to be at the end of the table. If the order corresponds to the order of execution, maybe something happened to the machine while running the tests...

Another explanation could be that some control flow is exception-based and the full CPS version would throw less.

@vouillon
Member Author

vouillon commented Mar 1, 2023

The slower tests seem to be at the end of the table. If the order corresponds to the order of execution, maybe something happened to the machine while running the tests...

That was my hypothesis as well. We should try to reproduce this at some point. Unfortunately, compiling these benchmarks is a bit complicated when you are not at Jane Street...
