
Discrepancy between Chrome's performance measurements and benchmark tests #826

Closed
andyhasit opened this issue Nov 11, 2020 · 21 comments

@andyhasit
Contributor

I'm comparing my framework (not ready, not submitted) against Svelte and getting very different stories from Chrome's devtools (just the record button) compared to the benchmark tests.

For example, running "Add 10,000 rows" I get:

| Framework | DevTools | Benchmark |
| --- | --- | --- |
| Svelte | avg 8500 ms | 2,196 ms |
| Mine | avg 3700 ms | 2,226 ms |

I've run all of these dozens of times now.

I am not concerned about the difference between Svelte and my framework, or between Chrome+devtools and Chromium. What concerns me is how it is possible for devtools to clock Svelte as being twice as slow as my framework, when the benchmark clearly puts it ahead.

Surely devtools is not that unreliable?

Screenshot from local results table:

[screenshot]

Screenshot of my framework add 10,000 rows:

[screenshot]

Screenshot of Svelte add 10,000 rows:

[screenshot]

@leeoniya
Contributor

Have you tried "Disable JavaScript samples"?

[screenshot]

@andyhasit
Contributor Author

@leeoniya that does speed up the times by a bit:

[screenshot]

But I am still seeing Svelte as roughly twice as slow on Chrome.

It hadn't occurred to me to manually test in Chromium, so I did that, and here are roughly representative samples of what I'm getting (with samples disabled):

Svelte on Chromium:

[screenshot]

My framework on Chromium:

[screenshot]

As you can see there's less of a difference, but it still shows Svelte as slower, even excluding the "Loading" step, and I'm not sure what that's about.

Back in Chrome, I tried it with devtools shut and a stopwatch (so scientific, I know): mine comes in around the 3.40 range with Svelte in the 4.40 range, which despite the inaccuracy of the method is a real difference.

So either I, Chrome, Svelte or the benchmark is doing something silly (let's assume it's me).

@krausest
Owner

krausest commented Nov 13, 2020

The benchmark driver uses the chrome event log that chrome dev tools use too.
If you want to validate the benchmark driver results you should look at the timeline and measure between the click event and the (last) green paint event. I don't know what the summary reports in all cases but this can differ sometimes from what I measure:
[Screenshot from 2020-11-13 19-53-12]
For the screenshot above, the benchmark driver would report something close to the 1662 ms, not the 3109 ms from the summary.
Most of the time it looked like dev tools report slower results than the benchmark driver - at some point I started to accept that.
(The benchmark driver measured 1240.635, 1241.486, 1245.139, 1249.259, 1255.581, 1270.585, 1341.12, 1341.946, 1344.192, 1356.405; with the dev tools in the browser I get 1662, 1659, 1659, 1568, 1715.)
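
Roughly speaking (this is a simplified sketch, not the actual driver code), the measurement boils down to something like this, given the raw devtools.timeline trace events, whose `ts` and `dur` fields are in microseconds:

```js
// Simplified sketch (not the actual driver code): find the click dispatch
// and the last paint in the devtools.timeline trace events and take the
// span between them. ts/dur are microseconds in the Chrome trace format.
function measureCreateRows(traceEvents) {
  const click = traceEvents.find(
    (e) => e.name === 'EventDispatch' && e.args.data && e.args.data.type === 'click'
  );
  const paints = traceEvents.filter((e) => e.name === 'Paint');
  const lastPaint = paints[paints.length - 1];
  return (lastPaint.ts + lastPaint.dur - click.ts) / 1000; // duration in msecs
}
```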
Does that help explain the discrepancy?

@andyhasit
Contributor Author

andyhasit commented Nov 14, 2020

@krausest thanks for the response, and thank you for making such an awesome tool - I have found it really, really useful.

I accept that there will be a difference between the benchmark and devtools. But I would expect that percentage to be consistent between frameworks, e.g. 15% for all, whereas my observation (if correct) shows this is not the case.

If I add the times from my Chromium screenshots above and compare those to the benchmark, we see the percentages are different:

| Framework | DevTools | Benchmark | Diff1 | Diff2 |
| --- | --- | --- | --- | --- |
| Svelte | 2,536 ms | 2,196 ms | +21.7% | +15.5% |
| Mine | 2,275 ms | 2,226 ms | +2.2% | +2.2% |

Diff1 is if I add the Scripting + Rendering + Loading from the pie chart.
Diff2 is if I add the Scripting + Rendering only, excluding Loading.

I don't know what "Loading" refers to, or why it is 0ms for my framework but 137ms for Svelte, so I included the percentage without it. But even with "Loading" taken out, the percentage is very different: 2% vs 15%.

Unfortunately I am struggling to figure out how to measure the exact time between two events in devtools as you described.

But am I understanding correctly that the two points you are measuring between represent a subset, or a slice, of the total which I would get if I manually added up the coloured segments from the pie chart summary as I did above?

If so, this is a problem, because the latter method of measuring seems to me to be far more indicative of the time these operations really take, i.e. as the user would experience them in the browser. And if the benchmark is going to measure just a slice of that, then the slice should be the same percentage (within a pretty narrow margin) for each framework. Otherwise the benchmark results, and the comparisons drawn from them, would be invalid.

I measured that difference as 2.2% for one framework, and 21.7% for another, which is clearly very big.

But before I go and repeat that to improve the accuracy (because it really isn't very accurate) and compare it to the difference for other frameworks, I would be grateful if you could confirm whether my logic and assumptions are sound, because I'm not 100% sure on it.

@krausest
Owner

I really recommend not looking at the pie chart when comparing results between chrome dev tools and the benchmark driver.
In the timeline you see the click event for create 10,000 rows and the green paint events. The benchmark driver measures the duration from the start of the click event to the end of the PAINT event. (You have to zoom in to see that the paint event is the first green bar). If you select the range with the mouse the dev tools print the duration (1641.4 ms in the screenshot).
This measurement should represent the duration a user must wait to see the results. I've no idea why the pie chart can differ.
If you measure this way in the dev tools do you get the same discrepancy? If so please submit a draft PR and I'll take a look at it.
[Screenshot from 2020-11-14 17-50-14]

@andyhasit
Contributor Author

@krausest I just raised #829, and since I gzipped my framework in between measuring some of the above (and haven't kept track of when), my numbers are void. Please keep this issue open a while longer to let me re-run the tests without the gzip artefact.

@andyhasit
Contributor Author

OK, we're getting somewhere. I figured out how to measure from devtools with shift+mouse :-)

[screenshot]

I took the benchmark readings straight out of the results json files:

```
cat results/svelte-v3.28.0-keyed_07_create10k.json
{"framework":"svelte-v3.28.0-keyed","keyed":true,"benchmark":"07_create10k","type":"cpu","min":2117.137,"max":2473.851,"mean":2241.2691999999997,"median":2196.704,"geometricMean":2238.6954222898994,"standardDeviation":114.66626711635813,"values":[2181.184,2473.851,2212.224,2129.022,2350.627,2173.786,2345.91,2117.137,2252.424,2176.527]}
```

Hopefully the two means are self-explanatory, and the "bench says" is what I read from the Interactive Results table:

[screenshot]

I am not sure how that is calculated.

If we look at the geometric mean as I calculate it (Google Sheets GEOMEAN formula), we see the benchmark lists RedRunner as 4.85% slower than it really* is, and Svelte as a whopping 18.4% faster!

* assuming dev tools measurements match browser reality more closely than the benchmark measurement, which I think is the case, though my empirical evidence for that in previous comments is void due to the gzip issue.

Here's the spreadsheet if you want it.

I'll submit a PR with RedRunner and you can perhaps tell me if you see the same.

@krausest
Owner

krausest commented Nov 15, 2020

I can confirm there's something strange going on.

Manual testing from dev tools shows that redrunner is faster than svelte

| Redrunner | Svelte |
| --- | --- |
| 1365.6 | 1666.5 |
| 1398 | 1667.3 |
| 1375 | 1663 |
| 1400 | 1656.6 |
| 1357.7 | 1710.9 |
| avg 1379.26 | avg 1672.86 |

But the benchmark driver claims that svelte is faster

| Redrunner | Svelte |
| --- | --- |
| 1328.459 | 1258.302 |
| 1328.846 | 1260.534 |
| 1333.298 | 1266.905 |
| 1334.27 | 1268.995 |
| 1340.083 | 1275.031 |
| 1342.184 | 1284.834 |
| 1343.079 | 1302.363 |
| 1343.947 | 1342.075 |
| 1347.114 | 1353.73 |
| 1347.28 | 1368.456 |
| avg 1338.856 | avg 1298.1225 |

It's important to understand that the benchmark driver actually uses the chrome dev tools events (so it's not two measurements, just one without a GUI and with different options like traceCategories), but obviously there's a difference somewhere. I'll try to find out where the difference comes from.

@krausest
Owner

Looks like it's due to #549
The benchmark driver used to pass all lighthouse.traceCategories for the perfLoggingPrefs when initializing chrome. These options are "categories -*, disabled-by-default-lighthouse, loading, v8, v8.execute, blink.user_timing, blink.console, devtools.timeline, disabled-by-default-devtools.timeline, disabled-by-default-devtools.screenshot, disabled-by-default-devtools.timeline.stack"
This setting caused a massive slowdown for fidan and svelte. I changed the perfLoggingPrefs to "devtools.timeline,blink.user_timing", which is enough to extract the events we're interested in.

When I run svelte with the full lighthouse traceCategories listed above, svelte is reported to be much slower and much closer to the manual testing in the devtools:

Svelte with lighthouse traceCategories
1585.587
1665.17
1705.964
1711.289
1732.379
1741.692
1747.528
1747.97
1761.205
1763.695

So I suspect that chrome dev tools use trace categories similar to the lighthouse.traceCategories.
You can check that behaviour by adding a `const lighthouse = require('lighthouse');` in webdriverAccess.ts and changing line 250 to `"traceCategories": lighthouse.traceCategories.join(", ")`.
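
For reference, a sketch of what that change looks like when building the chrome options with selenium-webdriver (the exact shape of webdriverAccess.ts may differ):

```js
// Sketch of the driver setup; the exact code in webdriverAccess.ts may differ.
const chrome = require('selenium-webdriver/chrome');
const lighthouse = require('lighthouse');

const options = new chrome.Options().setPerfLoggingPrefs({
  enableNetwork: true,
  enablePage: true,
  // Reduced categories currently used by the benchmark driver:
  // traceCategories: 'devtools.timeline,blink.user_timing',
  // Full lighthouse categories, which reproduce the slower, devtools-like numbers:
  traceCategories: lighthouse.traceCategories.join(', '),
});
```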
I'll have to look into whether there's a way to find out which is more correct.

@krausest
Owner

krausest commented Nov 15, 2020

Seems like the traceCategory disabled-by-default-devtools.timeline.stack causes the difference (and I think disabled-by-default-devtools.timeline adds a few msecs too).
At the moment I believe that fewer traceCategories tend to cause less overhead and thus be closer to the truth.

@krausest
Owner

Some time ago I had an additional way to estimate the duration, by adding a startMeasure and stopMeasure call at the beginning and end of a method, with an implementation like this:

```js
let startTime;
let lastMeasure;
let startMeasure = function (name) {
  startTime = performance.now();
  lastMeasure = name;
};
let stopMeasure = function () {
  var last = lastMeasure;
  if (lastMeasure) {
    // The timeout callback only runs after the browser has finished the
    // pending work, so the logged duration includes rendering.
    window.setTimeout(function () {
      lastMeasure = null;
      var stop = performance.now();
      var duration = stop - startTime;
      console.log(last + ' took ' + duration);
    }, 0);
  }
};
```
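
A hypothetical usage sketch (buildRows is a made-up stand-in for whatever the framework's create-rows handler is):

```js
// Hypothetical usage; buildRows() stands in for the framework's
// "create 10,000 rows" handler.
startMeasure('create10k');
buildRows();
stopMeasure(); // logs the duration once the browser yields after rendering
```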

The idea is that the setTimeout function is executed after finishing rendering. This way to measure got beaten up badly by some people and I really don't think it's reliable, but currently I see no other option to decide whether the benchmark driver or chrome's timeline is correct.
At least for svelte and redrunner I see the desired effect in the timeline: the timeout fires after rendering (you have to zoom in a lot to find the timeout).
[Screenshot from 2020-11-15 16-22-52]

So I reload the page, hit create 10,000 rows, then open the console and look at what's printed there. After that I close the console again (because I'm a bit afraid that keeping it open might otherwise influence the measurement).
These are the results I'm getting:

| Redrunner | Svelte |
| --- | --- |
| 1323.795 | 1291.29 |
| 1362.28 | 1252.564 |
| 1288.62 | 1225.3 |
| 1373.964 | 1291.93 |
| 1334.23 | 1253.4 |
| 1318.44 | 1285.57 |
| 1310.84 | 1269.404 |
| avg 1330 | avg 1267 |

According to this metric svelte does indeed look a bit faster, and svelte is slower in the chrome dev tools timeline due to some profiling overhead that the benchmark driver does not have. Please feel free to challenge that statement.

(A note on the geometric mean you mentioned above: for each table entry we compute a factor, which is the duration in this cell divided by the duration for the fastest framework. The geometric mean is computed from those factors, not from individual samples.)

@andyhasit
Contributor Author

andyhasit commented Nov 15, 2020

Thanks for looking into this so thoroughly.

I am not in a position to challenge the statement on Svelte vs RedRunner, nor is that my goal. And unfortunately I am not in a great position to assist with or comment on Chrome or lighthouse configuration, or how to measure accurately.

Of course I'm less concerned about discrepancies between the benchmark and dev tools than I am about whether the benchmarks are representative of real-world performance. I think we need to find a way to calibrate benchmark-recorded times against actual browser times (so, without devtools profiling overhead) and check that the inevitable gap is consistent across frameworks. And if it isn't, then we'd need to look into that.

The only way I can think of to measure real time without devtools profiling overhead is to take a sufficiently long running operation (e.g. 10,000 rows at 16x slowdown) and measure it with an external timer of sorts. Just as an occasional one-off to calibrate.

I'll probably be called a caveman for suggesting a stopwatch, but you can at least get a feel for whether framework A is faster than B, and if the benchmark says the opposite, then you at least know there's a problem.

With regards to this:

(A note to the geometric mean you mentioned above. For each table entry we compute a factor which is the duration in this cell divided by the duration for the fastest framework. The geometric mean is computed from those factors, not from individual samples.)

I assumed that is how the factors in brackets (1.50), (1.67) are calculated, but surely that doesn't apply to the times in ms?

Here are the means vs what's in the table:

| Framework | Mean | Geometric Mean | Table |
| --- | --- | --- | --- |
| RedRunner | 2238.04 | 2237.48 | 2,185.50 |
| Svelte | 2241.27 | 2238.7 | 2,100.70 |

Surely the ms in the table should use the mean or geometric mean of the ten results, or something not far off? The geometric mean of the ten results shows a 1 ms difference between the two frameworks, yet the table places 85 ms between them... I don't see how it gets to that...

Thank you for your continued patience and answers :-)

@krausest
Owner

That startMeasure / stopMeasure was my caveman's tool. It allowed me to validate (somewhat) the benchmark results, but there were some voices that it might not always yield valid results.

Here's an example for computing the geometric average:
[Screenshot from 2020-11-15 19-20-44]
First you have to compute, for each benchmark, the factor by which the implementation is slower than the fastest (second row).
[Screenshot from 2020-11-15 19-23-15]
Then you multiply the factors together and raise the product to 1/#factors.
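
As a made-up example in code (the durations and framework names are invented, just to show the calculation):

```js
// Made-up durations in ms: one entry per benchmark, one column per framework.
const durations = {
  create10k: { vanillajs: 1200, svelte: 1260, redrunner: 1340 },
  swapRows:  { vanillajs: 20,   svelte: 30,   redrunner: 24 },
};
// Factor for a cell = its duration divided by the fastest duration in that row.
const factorsFor = (framework) =>
  Object.values(durations).map(
    (row) => row[framework] / Math.min(...Object.values(row))
  );
// Geometric mean of the factors: multiply them up, raise to 1/#factors.
const geoMean = (factors) =>
  Math.pow(factors.reduce((a, b) => a * b, 1), 1 / factors.length);
console.log(geoMean(factorsFor('svelte')));    // ≈ 1.25
console.log(geoMean(factorsFor('redrunner'))); // ≈ 1.16
```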

@andyhasit
Contributor Author

andyhasit commented Nov 15, 2020

Actually I made a slight mistake: I pulled the Svelte numbers from the json file for a previous version. I've copied the v0.3.29 results into the spreadsheet and it shows a -23.42% difference, not -18.4%.

[screenshot]

The means of those correct results are lower than RedRunner's, so at least the numbers in the table make sense now.

But based on that last reply, would it be fair to say that the table does not in fact display "Duration in milliseconds" as stated? As I understand it, only the fastest framework's times would display the accurate time in ms, and the rest are extrapolations.

@krausest
Owner

There are two more secrets to the results table: the table displays either the median or the mean of the observed samples, depending on what you select in display mode. Both display modes are measured in milliseconds.
Currently I'm using 12 samples for each CPU benchmark and I'm dropping the two slowest to get more stable results.

In each cell the value in parentheses is a factor showing how much slower this value is in comparison to the fastest.
The geometric mean averages those factors.
If you measure only one benchmark the geometric mean must always equal the factor in the cell above.
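
In code, deriving a single cell from the raw samples would look roughly like this (made-up numbers; the real implementation lives in webdriver-ts):

```js
// Made-up samples for one benchmark/framework cell (12 CPU samples).
const samples = [1258, 1260, 1266, 1268, 1275, 1284, 1302, 1342, 1353, 1368, 1401, 1450];
// Drop the two slowest samples for more stable results...
const kept = [...samples].sort((a, b) => a - b).slice(0, samples.length - 2);
// ...then display either the mean or the median, depending on the display mode.
const mean = kept.reduce((a, b) => a + b, 0) / kept.length;
const mid = kept.length / 2;
const median = kept.length % 2 ? kept[Math.floor(mid)] : (kept[mid - 1] + kept[mid]) / 2;
console.log(mean, median); // -> 1297.6 1279.5
```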

@andyhasit
Contributor Author

I understand the table a bit better now, thanks.

But just to recap, because I created quite a few tangents with side questions:

  1. We both observe that RedRunner's benchmark times are a bit slower than those measured in devtools, but Svelte's are a lot faster (on my machine it's +4.8% and -23.4% respectively).
  2. We have yet to establish the relationship between benchmark, devtools and real-life times.
  3. If the difference between benchmark and real life varies between frameworks then we have a problem; if it doesn't, then it's more about devtools and we close this.
  4. We need a way to check 3.

Where "devtools" means "devtools as run from the browser".

Is that your understanding too?

@krausest
Owner

Yes, mostly.
As for #1, I agree that svelte is about 23-24% slower in devtools (but I can't confirm that the benchmark driver always reports numbers 5% slower than the devtools).

I'd rather reduce #2-#4 to the question of whether create 10,000 rows in chrome for svelte takes 1.3 secs or 1.8 secs. I personally believe the answer startMeasure and stopMeasure give me on this question, but I tried something different:
I took a video with my smartphone, installed a video editor and extracted the relevant part - maybe you want to try this too and tell me what answer you see then...

@krausest
Owner

The start/stopMeasure hack also supports the claim that the dev tools are slowing down svelte.
It reports about 1264 when running with dev tools closed (and opened after the run to see the console output) and e.g. 1671 msecs when the performance recording is running in the dev tools.
I think we're observing a Heisenberg effect for the dev tools' profiler that somehow hits svelte particularly badly.

@andyhasit
Contributor Author

I am not really able to do the camera trick accurately with my phone (good idea though). But using the start/stop console log is very interesting.

Here is me adding 10K lines on redrunner, alternating between devtools open and devtools closed on Chrome (if it starts with 4, devtools were open):

[screenshot]

Here is me doing the same thing on Chromium:

[screenshot]

This doesn't help establish whether the benchmarks are representative of real speeds, but it does show that merely opening or closing Chrome's devtools can cause a massive slowdown (that's just the console, no performance recording or anything). Chromium doesn't seem to do this.

Maybe it's just my Chrome (can't think of any plugins that would affect it).

Quick question on a separate note: can we/should we gitignore webdriver-ts/results.json?

@krausest
Owner

webdriver-ts/results.json might be useful to extract historic results if (and only if) it's just me committing this file.

@krausest
Owner

@andyhasit Thanks for reporting this issue. I'm closing it now. In retrospect I think the implementation is fine, but it's always good to challenge its validity.
