Replies: 7 comments
-
@turion thanks a lot for this very detailed report. I know it's a lot to ask on top of what you've already done, but would you be able to put together a small benchmark that shows the difference in performance between the two implementations?
-
Yes, I managed to find a benchmark using `ListT` that reproduces the issue, I believe. It's much more visible there because many `MSF`s are launched. Reproduce with the library on branch
-
The branch may serve as a starting point for further optimizations, of which
-
Further bisecting of all the possible inlines suggests that inlining
-
My current strategy to identify the minimal set of necessary changes is:
-
Just a note: Issue #375 addressed the introduction of a benchmark. It will require discussion before making changes, but it may be worth seeing if those benchmarks, or how they are used, should be improved.
-
Trying the benchmarks briefly seems to show that these changes have a significant impact. I think it would be very helpful if we could plot benchmark results (with error bars) for several commits to compare them.
-
While investigating turion/rhine#227, I found that replacing the current definition of `arrM` in terms of `morphGS` with a direct definition gives a considerable speedup, and I'm trying here to explain where this speedup might come from. I'd recommend replacing the definition, because `arrM` is ubiquitous in every `dunai` program.

The implementations:
https://github.com/ivanperez-keera/dunai/blob/fc0559b0658adb868d474a5421d8d6dde5bb2ca8/dunai/src/Data/MonadicStreamFunction/Core.hs#LL122C1-L130C72
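For reference, the two variants look roughly like this. This is a paraphrase, not the exact library code (that is at the permalink above): the `MSF` newtype, `morphGS`, and the identity MSF are repeated as local copies so the sketch stands on its own, and the names `arrMDirect`, `arrMViaMorphGS`, and `idMSF` are mine.

```haskell
{-# LANGUAGE RankNTypes #-}

-- Local copy of the internal representation of monadic stream functions:
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- morphGS, up to naming: transform the transition function of an MSF.
morphGS :: Monad m2
        => (forall c. (a1 -> m1 (b1, c)) -> a2 -> m2 (b2, c))
        -> MSF m1 a1 b1
        -> MSF m2 a2 b2
morphGS morph msf = MSF $ \a2 -> do
  (b2, msf') <- morph (unMSF msf) a2
  return (b2, morphGS morph msf')

-- The identity MSF, standing in for Control.Category.id:
idMSF :: Monad m => MSF m a a
idMSF = MSF $ \a -> return (a, idMSF)

-- The direct definition, using the MSF constructor (the fast variant):
arrMDirect :: Monad m => (a -> m b) -> MSF m a b
arrMDirect f = MSF $ \a -> do
  b <- f a
  return (b, arrMDirect f)

-- The definition in terms of morphGS, which avoids the MSF constructor
-- (a paraphrase of the library's current definition):
arrMViaMorphGS :: Monad m => (a -> m b) -> MSF m a b
arrMViaMorphGS f =
  morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) idMSF
```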
The implementation in terms of `morphGS` has the advantage of not using the `MSF` constructor, but it is also much slower at runtime.

### Lengthy derivation of example program evaluation
#### Example program

Let's consider this simple example program, which creates a random string (using a fictitious function `rand :: IO String`) and prints it:
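For concreteness, a minimal program of this shape might look as follows. `rand` is the fictitious function from the text, stubbed out here so the sketch compiles; the exact program may differ from the one originally shown.

```haskell
import Control.Arrow ((>>>))
import Data.MonadicStreamFunction (arrM, reactimate)

-- The fictitious source of random strings assumed in the text:
rand :: IO String
rand = return "not actually random"

-- Repeatedly generate a string and print it:
main :: IO ()
main = reactimate (arrM (const rand) >>> arrM putStrLn)
```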
Let us expand the definitions of the functions to see how this is evaluated and executed:
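The definitions being unfolded are, up to naming, roughly the following; `compose` is a local stand-in for the `>>>` of the `Category` instance, and the code is a paraphrase of the library, not a verbatim copy.

```haskell
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- Run an MSF () () forever, discarding the outputs:
reactimate :: Monad m => MSF m () () -> m ()
reactimate msf = do
  (_, msf') <- unMSF msf ()
  reactimate msf'

-- Sequential composition, as in the Category instance for MSF; note that
-- unMSF is applied fully at both call sites:
compose :: Monad m => MSF m a b -> MSF m b c -> MSF m a c
compose msf1 msf2 = MSF $ \a -> do
  (b, msf1') <- unMSF msf1 a
  (c, msf2') <- unMSF msf2 b
  return (c, compose msf1' msf2')
```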
Now it depends on how `arrM` is defined. Let us choose the simple direct implementation first (the one that is not used in the library):

#### The direct implementation
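A compressed, informal sketch of this step, with the direct definition repeated so it stands on its own; the thunk-level reading is in the comments.

```haskell
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- The direct definition again:
arrMDirect :: Monad m => (a -> m b) -> MSF m a b
arrMDirect f = MSF $ \a -> do
  b <- f a
  return (b, arrMDirect f)

-- Stepping, say, the printing component of the example:
--
--   unMSF (arrMDirect putStrLn) s
--     = do putStrLn s
--          return ((), arrMDirect putStrLn)
--
-- The continuation handed back is arrMDirect putStrLn again, so (assuming it
-- is shared) every later step reuses the same closure: one closure per
-- component of the composition, allocated up front and never discarded.
```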
We've created two thunks, one for each component of the whole `MSF`, and executed both their bodies. From here on, the program will go on endlessly, and no new thunks are generated.

Maybe GHC does further optimizations (like inlining something or optimizing the `return`), I don't know. I didn't look at the Core because I don't understand it that well yet.

Now what happens if we define `arrM` in the current way?

#### The library implementation
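Again a compressed, informal sketch, with the `morphGS`-based definition repeated in paraphrased form so it stands on its own.

```haskell
{-# LANGUAGE RankNTypes #-}

newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

morphGS :: Monad m2
        => (forall c. (a1 -> m1 (b1, c)) -> a2 -> m2 (b2, c))
        -> MSF m1 a1 b1
        -> MSF m2 a2 b2
morphGS morph msf = MSF $ \a2 -> do
  (b2, msf') <- morph (unMSF msf) a2
  return (b2, morphGS morph msf')

idMSF :: Monad m => MSF m a a
idMSF = MSF $ \a -> return (a, idMSF)

-- arrM in terms of morphGS (paraphrased):
arrMViaMorphGS :: Monad m => (a -> m b) -> MSF m a b
arrMViaMorphGS f =
  morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) idMSF

-- Stepping, say, the printing component, where msf is the identity MSF:
--
--   unMSF (morphGS morph idMSF) s
--     = do (b, msf') <- morph (unMSF idMSF) s  -- builds the partial application unMSF idMSF
--          return (b, morphGS morph msf')      -- wraps msf' in a fresh morphGS node
--
-- Besides running the actual effect, each step allocates the partial
-- application (dead as soon as morph has consumed it) and a new morphGS
-- wrapper around the identity MSF it has to carry along.
```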
Note that we have now created a thunk for the partial application `unMSF msf`.

There are no references to the partial application `pap` anymore, so it needs to be garbage collected!

What's interesting here is that we've created two thunks for something where we only needed one thunk before. And worse, there was a third thunk from the partial application, which then needs to be garbage collected. So we might already suspect that (in the absence of clever optimizations), this version will use a constant factor more space and put load on the garbage collector.
For the additional thunks, the reason is not so much the definition in terms of helper functions, but `morphGS` being higher order in the `MSF`! It takes an `MSF` as input (in this case a trivial identity function), and that needs to be carried around as well now.

I also believe that it cannot be optimized away easily, because `morphGS` doesn't apply `unMSF` fully (like e.g. `>>>` does), but only partially, as `unMSF msf`. That way the optimizer cannot inline the definition of the `MSF`. But this is speculation; I don't really understand the optimizer in detail, and haven't looked at the Core.

Also, this partial application seems to trigger the garbage collector.
Anyways, let's continue the evaluation. Since the other `arrM` is defined in the same way, one might think that it would analogously produce a further thunk, but I believe that `id` is shared:

### Summary
I believe that with the current version we probably use more time and space, and we introduce garbage collection in the first place although there is no internal state. The red flags are probably either higher-order functions in `MSF` or partial application of `unMSF`. This is all theoretical, but it fits the observations in turion/rhine#227.