Change to narrow test integration interface #1160

Merged 1 commit into master on Dec 16, 2020

Conversation

@mbj (Owner) commented Dec 16, 2020

  • This reduces the payload that has to be communicated from the killforks
    to the main process.
  • Nice increase in synthetic benchmarks.
mbj merged commit 2af00e3 into master on Dec 16, 2020
mbj deleted the change/to-narrow-interface branch on December 16, 2020 at 15:07
@mbj (Owner, Author) commented Dec 16, 2020

@dgollahon, @pawelpacana On rspec payloads this relatively simple change has a dramatic effect:

On mbj/auom (rspec; the minitest integration had a narrower test selection, where it did not matter as much):

Previous:

Mutations/s:     23.61

Now:

Mutations/s:     44.76

@mbj (Owner, Author) commented Dec 16, 2020

Do you mind trying master on your projects, just to verify I'm not lying to myself?
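
For reference, one way to point Bundler at the unreleased master branch is a git-sourced Gemfile entry; the exact gem name and source layout here are assumptions, so adapt them to your project's setup:

```ruby
# Hypothetical Gemfile entry (not taken from this PR): pin mutant to the
# unreleased master branch via Bundler's git source, then run `bundle update mutant`.
gem 'mutant', git: 'https://github.com/mbj/mutant.git', branch: 'master'
```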

include Result, Anima.new(
  :passed,
  :runtime,
  :tests
@mbj (Owner, Author) commented on this diff:

That one was entirely unused. A historic artifact, but it transitively referenced lots of objects that were needlessly sent over the wire.
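
A minimal sketch of the effect (illustrative only, not mutant's actual classes): dropping an attribute that transitively holds heavy objects shrinks what the killfork has to marshal back to the parent.

```ruby
require 'anima'

# Wide value: the :tests attribute transitively drags whole test objects over the wire.
class WideResult
  include Anima.new(:passed, :runtime, :tests)
end

# Narrow value: only the scalars the main process actually consumes.
class NarrowResult
  include Anima.new(:passed, :runtime)
end

wide   = WideResult.new(passed: true, runtime: 0.01, tests: Array.new(1_000) { Object.new })
narrow = NarrowResult.new(passed: true, runtime: 0.01)

# The payload sent from killfork to parent is roughly what Marshal produces.
puts Marshal.dump(wide.to_h).bytesize   # large: includes 1_000 marshalled objects
puts Marshal.dump(narrow.to_h).bytesize # small: two scalars
```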

@dgollahon (Collaborator) commented:

I picked 5 different runs over various subjects in my main application codebase (rspec):

Subject   10.19      10.20      I/O
A         20.30/s    19.13/s    No
B          2.48/s     2.50/s    Yes
C         10.51/s     8.44/s    No
D         12.05/s    12.94/s    Yes
E         18.85/s    17.26/s    No

I ran all the 10.19 cases and then all the 10.20 cases so I didn't have to keep messing with Bundler. I then wondered whether my laptop had simply overheated and whether that, or some other variable, was why the second set was slower, so I re-ran C on 10.20 first and got 9.37/s the second time, then switched back to 10.19 and got 10.16/s. To me, 10.20 looks like a performance regression on my real-world app.

@dgollahon (Collaborator) commented:

It is much faster on auom on my machine as well: 10.19 is 42.19/s and 10.20 is 75.29/s.

I suspect we need better profiling tools to say much about what is going on here.

@mbj (Owner, Author) commented Dec 16, 2020

> I suspect we need better profiling tools to say much about what is going on here.

Actually, no, we do not for this case; it's easy to explain. We removed redundant work, and the examples you ran (A/B/C/D/E) are all within standard variance and bound by computation inside the killfork. Measurements on cold and hot starts (literally, as CPU throttling easily kicks in earlier for later runs) also distort later runs. Given that we clearly removed "static" work from the system, as soon as you hit a computation-intensive subject (which auom is NOT) you get lots of standard jitter, as in the non-auom cases.

I suspect multiple re-runs on .19 and .20 would show a small bias towards .20 on your examples, whereas auom produces a BIG signal regardless of the machine's base state.

This, plus the fact that we clearly removed static overhead, means .20 is not a regression but a step in the right direction.

@dgollahon (Collaborator) commented Dec 16, 2020

Since there is some variability, I wanted to add a couple more data points to confirm some of the rows were actually slower. Pretty surprised that subject A dropped so much in the re-test, but I double-checked the original output. 🤷‍♂️

19.37/s and 19.59/s on 10.19 and 16.77/s and 16.05/s on 10.20 for A
19.40/s and 20.25/s on 10.19 and 19.11/s and 19.14/s on 10.20 for E

(Note: I ran 10.20 first on most of these.)

@mbj (Owner, Author) commented Dec 16, 2020

@dgollahon Given that your examples have high variability (likely induced by GC), you need to run them many times until the results are statistically significant.

What matters to me is the auom case, which clearly shows that all samples are now way higher than all samples before. In your cases the samples for .19 and .20 still intersect, which points to more computation within the tests than in auom; hence you need far more iterations to show a bias there.
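
For illustration, a tiny sketch of comparing repeated samples rather than single runs; the sample values are the subject-A numbers reported above in this thread, everything else is just stock Ruby arithmetic.

```ruby
# Compare repeated mutations/s samples instead of single runs.
def mean(samples)
  samples.sum.fdiv(samples.size)
end

def stddev(samples)
  m = mean(samples)
  Math.sqrt(samples.sum { |x| (x - m)**2 } / (samples.size - 1))
end

samples = {
  '0.10.19' => [20.30, 19.37, 19.59],
  '0.10.20' => [19.13, 16.77, 16.05],
}

samples.each do |version, runs|
  puts format('%s: mean %.2f/s, sd %.2f', version, mean(runs), stddev(runs))
end
```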

@mbj (Owner, Author) commented Dec 16, 2020

@dgollahon Side note: There is a change in the pipeline that avoids inheriting the result objects from the main process into the killforks.

This reduces the chance of what I call GC amplification, which induces lots of variance on bigger projects.

GC amplification works like this:

  • The main process had some activity that filled the GC trigger to 99%.
  • Multiple forks happen, and the activity inside the forks pushes each fork over
    the GC threshold, whereas if the parent process had not accumulated that many
    objects, the killforks might have exited before a major GC.
  • The parent triggers GC, making sure that the next scheduled forks are cleaner,
    but scheduling also pauses, which reduces the chance that children trigger GC.

On bigger projects than auom this effect is massive: it triggers lots of CoW page expansions and hurts performance dramatically.

Now, as we clearly generate fewer objects for the parent to inherit (the result read back from the killforks is smaller), I guarantee that with a large enough sample of heavier code .20 provides a speed-up, though likely not as massive a one as on an intentionally small project like auom. But the subjects you chose may just have been shy of the 99.9% threshold on average; some may even get worse from this effect, as mutant now generates less garbage of its own to collect.

The real fix is to fork off a process that does not accumulate state. And I'd still bet that .20 is a noticeable performance increase if you had high test iteration counts on your subjects.
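
A rough way to observe the inherited-heap effect in plain Ruby (a sketch under simplifying assumptions, not mutant code): fork a child that does a fixed amount of allocation, and see how its GC activity changes once the parent has bloated its own heap.

```ruby
# Report how many GC runs happen inside a forked child doing a fixed workload.
def child_gc_runs(label)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    before = GC.stat(:count)
    200_000.times { Object.new }          # the killfork's own small workload
    writer.puts(GC.stat(:count) - before)
    writer.close
  end
  writer.close
  Process.wait(pid)
  puts "#{label}: #{reader.read.strip} GC runs inside child"
  reader.close
end

child_gc_runs('lean parent')

# Simulate a parent that accumulated state (e.g. inherited result objects);
# the exact numbers depend on heap state, which is the point about variance.
RETAINED = Array.new(2_000_000) { Object.new }
child_gc_runs('bloated parent')
```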

@dgollahon (Collaborator) commented:

5 back-to-back runs of subject C on 10.20:

9.09/s, 9.05/s, 9.54/s, 8.59/s, and 8.20/s (this run segfaulted; I think my computer slows down as it overheats, but I'm not 100% sure)

Switching back to 10.19 (the computer was still very hot from the 10.20 runs, so I suspect this is unfavorable to 10.19, but I am not sure):

10.52/s, 10.97/s, 11.81/s, 8.40/s, 8.40/s (again, this time there was a segfault)

I decided to do a 6th run on 10.19 and got 8.89/s, then updated back to 10.20 and got 9.38/s (segfault on this one too). Switching back to 10.19 again gave 11.45/s. I am not sure, but I suspect that the act of switching between versions gives my computer enough of a break to cool down a bit and perform better.

I am not saying this data is conclusive, but there are considerably more runs across all subjects where 10.19 outperforms 10.20 than the other way around, and by larger margins. That said, I am also getting occasional segfaults, so I'm not sure how much that taints the numbers. Either way, I'm not that convinced by:

> And I'd still bet that .20 is a noticeable performance increase if you had high test iteration counts on your subjects.

It may be better, but if so, not noticeably so under normal conditions for me. I think this data is mixed and sometimes negative enough that we should at least admit the possibility that it is actually slower on my project overall for some non-obvious reason. I can't collect any additional data right now because it is too slow and I have other things to do. I am not sure how many runs or under what conditions you would find data from my project convincing.

I still think it would be helpful to have better profiling tools so we can actually check hypotheses like GC thresholds, GC runs, memory usage, time on a given mutation, etc. instead of speculating. The mutations/s can't tell you why it was slower or faster that run.

Re: the GC commentary: should mutant try to do some GC tuning on your behalf, or have an option in addition to the Ruby options? Or maybe a section in the documentation on adjusting it? If GC has so much of an effect, I would guess that you'd get better runs by turning off GC, GC'ing before fork, using huge GC sizes, infrequent collections, etc.
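
To make that concrete, here is a sketch of the kind of tuning being suggested around a fork; run_killfork is a stand-in for the real per-mutation work, and none of this is an existing mutant option. Heap sizing and collection frequency can also be nudged externally via the standard RUBY_GC_* environment variables.

```ruby
# Collect and compact in the parent right before forking, and skip GC entirely
# in a short-lived child so it can exit before ever paying for a collection.
def run_killfork
  10_000.times { Object.new }   # placeholder for the real per-mutation work
end

GC.start(full_mark: true, immediate_sweep: true) # clean the parent heap pre-fork
GC.compact if GC.respond_to?(:compact)           # Ruby 2.7+, reduces CoW page dirtying

pid = fork do
  GC.disable   # the child is short-lived; trade memory for predictability
  run_killfork
end
Process.wait(pid)
```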

Did you get a difference in speed when running mutant on itself? That might also be an interesting test.

@mostlyobvious (Contributor) commented Dec 16, 2020

Results from a single run, highly dependent on shared machine load:

  • rails_event_store_active_record

    Mostly slow tests (database), no significant difference:

    • 0.10.19 was progress: 2871/2874 alive: 0 runtime: 1727.27s killtime: 2420.63s mutations/s: 1.66
    • 0.10.20 was progress: 2859/2874 alive: 0 runtime: 1684.60s killtime: 3008.85s mutations/s: 1.70
  • ruby_event_store

    Rather quick tests, lots of subjects, huge improvement:

    • 0.10.19 was progress: 11604/11605 alive: 0 runtime: 5022.55s killtime: 1432.42s mutations/s: 2.31
    • 0.10.20 was progress: 11598/11605 alive: 0 runtime: 1763.99s killtime: 1656.08s mutations/s: 6.57

    Additionally, when running on an i7-7700K it now never stops the fans, which is an improvement as well:
    progress: 11603/11605 alive: 0 runtime: 427.98s killtime: 2349.10s mutations/s: 27.11

@dgollahon (Collaborator) commented:

I guess I just don't have whatever preconditions are required to make this faster on my project 🤷‍♂️

It still seems slower in several cases for me but maybe that's just noise or some kind of particularly unfavorable situation.

@mbj (Owner, Author) commented Dec 16, 2020

@dgollahon Wait, I've got more in the pipeline for you.

@pawelpacana You are affected by huge amounts of degeneration for the large pass. Mutant was never optimized for these; working on it.

@dkubb (Collaborator) commented Dec 17, 2020

@dgollahon what about turning off the GC before running the benchmarks?

@dgollahon (Collaborator) commented:

@dkubb I have not tried that yet but that is a good idea.

As a note, I am currently inclined to agree with @mbj's analysis: I am probably seeing slower runs because of some minor GC threshold effect, or it is pure noise. I looked closer at the changes from 10.19 to 10.20 and it's hard to see why they would make things slower in general.

I may retry another day with different GC settings and see if I can get a stable result, but I do not have time to look into it further today.

@mbj (Owner, Author) commented Dec 17, 2020

Just to sum this up again: the root cause is mutant's process model. Fixing the process model removes a lot of behavior that is known to be very suboptimal.

The key is to keep the process that the killforks get forked off of minimal and size-wise stable, i.e. not having the killforks accumulate any state they are not interested in.
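
A conceptual sketch of that shape (load_test_environment, mutations, run_tests_against and report_upstream are hypothetical stubs, not mutant's API): the fork parent stays lean because it only relays a narrow payload and retains nothing between mutations.

```ruby
require 'ostruct'

# Hypothetical stand-ins for the real integration:
def load_test_environment; end

def mutations
  %i[mutation_a mutation_b]
end

def run_tests_against(_mutation)
  OpenStruct.new(passed: true, runtime: 0.01)
end

def report_upstream(result)
  puts result.inspect
end

# Fork-server shape: heavy setup happens exactly once, every killfork inherits
# only that stable state, and the relaying parent keeps no per-mutation results.
load_test_environment

mutations.each do |mutation|
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    result = run_tests_against(mutation)
    writer.write(Marshal.dump({ passed: result.passed, runtime: result.runtime }))
    writer.close
  end
  writer.close
  report_upstream(Marshal.load(reader.read)) # relay the narrow payload upward
  reader.close
  Process.wait(pid)
end
```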

I hope I can get the process model change in before having to switch to the new main process written in Haskell. We'll see.

@mbj (Owner, Author) commented Dec 23, 2020

@pawelpacana Regarding your

> Rather quick tests, lots of subjects, huge improvement:

With #1096 this case is now up to:

Mutations/s:     9.10

It's likely that your test suite execution is now the bottleneck.
