
Discrepancy between Chrome's performance measurements and benchmark tests #826

Closed
andyhasit opened this issue Nov 11, 2020 · 21 comments

@andyhasit
Contributor

I'm comparing my framework (not ready, not submitted) against Svelte and getting very different stories from Chrome's devtools (just the record button) compared to the benchmark tests.

For example, running "Add 10,000 rows" I get:

| Framework | DevTools | Benchmark |
| --- | --- | --- |
| Svelte | avg 8500 ms | 2,196 ms |
| Mine | avg 3700 ms | 2,226 ms |

I've run all of these dozens of times now.

I am not concerned about the difference between Svelte and my framework, or between Chrome+devtools and Chromium. What concerns me is how it is possible for devtools to clock Svelte as being twice as slow as my framework, when the benchmark clearly puts it ahead.

Surely devtools is not that unreliable?

Screenshot from local results table:

[screenshot]

Screenshot of my framework add 10,000 rows:

[screenshot]

Screenshot of Svelte add 10,000 rows:

[screenshot]

@leeoniya
Contributor

Have you tried "Disable JavaScript samples"?

[screenshot]

@andyhasit
Contributor Author

@leeoniya that does speed up the times by a bit:

[screenshot]

But I am still seeing Svelte as roughly twice as slow on Chrome.

It hadn't occurred to me to manually test in Chromium, so I did that, and here are roughly representative samples of what I'm getting (with samples disabled):

Svelte on Chromium:

[screenshot]

My framework on Chromium:

[screenshot]

As you can see there's less of a difference, but it still shows Svelte as slower, even excluding the "Loading" step, and I'm not sure what that's about.

Back in Chrome, I tried it with devtools shut and a stopwatch (so scientific, I know): mine comes in around the 3.40 range with Svelte in the 4.40 range, which despite the inaccuracy of the method is a real difference.

So either I, Chrome, Svelte or the benchmark is doing something silly (let's assume it's me).

@krausest
Owner

krausest commented Nov 13, 2020

The benchmark driver uses the chrome event log that chrome dev tools use too.
If you want to validate the benchmark driver results you should look at the timeline and measure between the click event and the (last) green paint event. I don't know what the summary reports in all cases but this can differ sometimes from what I measure:
[Screenshot from 2020-11-13 19-53-12]
For the screenshot above, the benchmark driver would report something close to the 1662 ms, not the 3109 ms from the summary.
Most of the time it looked like dev tools report slower results than the benchmark driver - at some point I started to accept that.
(The benchmark driver measured 1240.635, 1241.486, 1245.139, 1249.259, 1255.581, 1270.585, 1341.12, 1341.946, 1344.192, 1356.405; with the dev tools in the browser I get 1662, 1659, 1659, 1568, 1715.)
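
Roughly speaking (this is a simplified sketch, not the actual driver code), the measurement boils down to something like this, given the raw devtools.timeline trace events, whose `ts` and `dur` fields are in microseconds:

```js
// Simplified sketch (not the actual driver code): find the click dispatch
// and the last paint in the devtools.timeline trace events and take the
// span between them. ts/dur are microseconds in the Chrome trace format.
function measureCreateRows(traceEvents) {
  const click = traceEvents.find(
    (e) => e.name === 'EventDispatch' && e.args.data && e.args.data.type === 'click'
  );
  const paints = traceEvents.filter((e) => e.name === 'Paint');
  const lastPaint = paints[paints.length - 1];
  return (lastPaint.ts + lastPaint.dur - click.ts) / 1000; // duration in msecs
}
```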
Does that help explain the discrepancy?

@andyhasit
Contributor Author

andyhasit commented Nov 14, 2020

@krausest thanks for the response, and thank you for making such an awesome tool - I have found it really, really useful.

I accept that there will be a difference between the benchmark and devtools. But I would expect that percentage to be consistent between frameworks, e.g. 15% for all, whereas my observation (if correct) shows this is not the case.

If I add the times from my Chromium screenshots above and compare those to the benchmark, we see the percentages are different:

| Framework | DevTools | Benchmark | Diff1 | Diff2 |
| --- | --- | --- | --- | --- |
| Svelte | 2,536 ms | 2,196 ms | +21.7% | +15.5% |
| Mine | 2,275 ms | 2,226 ms | +2.2% | +2.2% |

Diff1 is if I add the Scripting + Rendering + Loading from the pie chart.
Diff2 is if I add the Scripting + Rendering only, excluding Loading.

I don't know what "Loading" refers to, or why it is 0ms for my framework but 137ms for Svelte, so I included the percentage without it. But even with "Loading" taken out, the percentage is very different: 2% vs 15%.

Unfortunately I am struggling to figure out how to measure the exact time between two events in devtools as you described.

But am I understanding correctly that the two points you are measuring between represent a subset, or a slice, of the total which I would get if I manually added up the coloured segments from the pie chart summary as I did above?

If so, this is a problem, because the latter method of measuring seems to me to be far more indicative of the time these operations really take, i.e. as the user would experience them in the browser. And if the benchmark is going to measure just a slice of that, then the slice should be the same percentage (within a pretty narrow margin) for each framework. Otherwise the benchmark results, and the comparisons drawn from them, would be invalid.

I measured that difference as 2.2% for one framework, and 21.7% for another, which is clearly very big.

But before I go and repeat that to improve the accuracy (because it really isn't very accurate) and compare it to the difference for other frameworks, I would be grateful if you could confirm whether my logic and assumptions are sound, because I'm not 100% sure on it.

@krausest
Owner

I really recommend not looking at the pie chart when comparing results between chrome dev tools and the benchmark driver.
In the timeline you see the click event for create 10,000 rows and the green paint events. The benchmark driver measures the duration from the start of the click event to the end of the PAINT event. (You have to zoom in to see that the paint event is the first green bar). If you select the range with the mouse the dev tools print the duration (1641.4 ms in the screenshot).
This measurement should represent the duration a user must wait to see the results. I've no idea why the pie chart can differ.
If you measure this way in the dev tools do you get the same discrepancy? If so please submit a draft PR and I'll take a look at it.
[Screenshot from 2020-11-14 17-50-14]

@andyhasit
Contributor Author

@krausest I just raised #829, and since I gzipped my framework in between measuring some of the above (and haven't kept track of when), my numbers are void. Please keep this issue open a while longer to let me re-run the tests without the gzip artefact.

@andyhasit
Contributor Author

OK, we're getting somewhere. I figured out how to measure from devtools with shift+mouse :-)

[screenshot]

I took the benchmark readings straight out of the results json files:

```
cat results/svelte-v3.28.0-keyed_07_create10k.json
{"framework":"svelte-v3.28.0-keyed","keyed":true,"benchmark":"07_create10k","type":"cpu","min":2117.137,"max":2473.851,"mean":2241.2691999999997,"median":2196.704,"geometricMean":2238.6954222898994,"standardDeviation":114.66626711635813,"values":[2181.184,2473.851,2212.224,2129.022,2350.627,2173.786,2345.91,2117.137,2252.424,2176.527]}
```

Hopefully the two means are self-explanatory, and the "bench says" is what I read from the Interactive Results table:

[screenshot]

I am not sure how that is calculated.

If we look at the geometric mean as I calculate it (Google Sheets GEOMEAN formula), we see the benchmark lists RedRunner as 4.85% slower than it really* is, and Svelte as a whopping 18.4% faster!

* assuming dev tools measurements match browser reality more closely than the benchmark measurement, which I think is the case, though my empirical evidence for that in previous comments is void due to the gzip issue.

Here's the spreadsheet if you want it.

I'll submit a PR with RedRunner and you can perhaps tell me if you see the same.

@krausest
Owner

krausest commented Nov 15, 2020

I can confirm there's something strange going on.

Manual testing from dev tools shows that redrunner is faster than svelte

| Redrunner | Svelte |
| --- | --- |
| 1365.6 | 1666.5 |
| 1398 | 1667.3 |
| 1375 | 1663 |
| 1400 | 1656.6 |
| 1357.7 | 1710.9 |
| avg 1379.26 | avg 1672.86 |

But the benchmark driver claims that svelte is faster

| Redrunner | Svelte |
| --- | --- |
| 1328.459 | 1258.302 |
| 1328.846 | 1260.534 |
| 1333.298 | 1266.905 |
| 1334.27 | 1268.995 |
| 1340.083 | 1275.031 |
| 1342.184 | 1284.834 |
| 1343.079 | 1302.363 |
| 1343.947 | 1342.075 |
| 1347.114 | 1353.73 |
| 1347.28 | 1368.456 |
| avg 1338.856 | avg 1298.1225 |

It's important to understand that the benchmark driver actually uses the chrome dev tools events (so it's not two measurements, just one without a GUI and with different options like traceCategories), but obviously there's a difference somewhere. I'll try to find out where the difference comes from.

@krausest
Owner

Looks like it's due to #549
The benchmark driver used to pass all lighthouse.traceCategories for the perfLoggingPrefs when initializing chrome. These options are "categories -*, disabled-by-default-lighthouse, loading, v8, v8.execute, blink.user_timing, blink.console, devtools.timeline, disabled-by-default-devtools.timeline, disabled-by-default-devtools.screenshot, disabled-by-default-devtools.timeline.stack"
This setting caused a massive slowdown for fidan and svelte. I changed the perfLoggingPrefs to "devtools.timeline,blink.user_timing", which is enough to extract the events we're interested in.

When I run svelte with the full lighthouse traceCategories listed above, svelte is reported to be much slower and much closer to the manual testing in the devtools:

Svelte with lighthouse traceCategories
1585.587
1665.17
1705.964
1711.289
1732.379
1741.692
1747.528
1747.97
1761.205
1763.695

So I suspect that chrome dev tools use trace categories similar to the lighthouse.traceCategories.
You can check that behaviour by adding a `const lighthouse = require('lighthouse');` in webdriverAccess.ts and changing line 250 to `"traceCategories": lighthouse.traceCategories.join(", ")`.
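
For reference, a sketch of what that change looks like when building the chrome options with selenium-webdriver (the exact shape of webdriverAccess.ts may differ):

```js
// Sketch of the driver setup; the exact code in webdriverAccess.ts may differ.
const chrome = require('selenium-webdriver/chrome');
const lighthouse = require('lighthouse');

const options = new chrome.Options().setPerfLoggingPrefs({
  enableNetwork: true,
  enablePage: true,
  // Reduced categories currently used by the benchmark driver:
  // traceCategories: 'devtools.timeline,blink.user_timing',
  // Full lighthouse categories, which reproduce the slower, devtools-like numbers:
  traceCategories: lighthouse.traceCategories.join(', '),
});
```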
I'll have to look into whether there's a way to find out which is more correct.

@krausest
Owner

krausest commented Nov 15, 2020

Seems like the traceCategory disabled-by-default-devtools.timeline.stack causes the difference (and I think disabled-by-default-devtools.timeline adds a few msecs too).
At the moment I believe that fewer traceCategories tend to cause less overhead and thus be closer to the truth.

@krausest
Owner

Some time ago I had an additional way to estimate the duration, by adding a startMeasure and stopMeasure call at the beginning and end of a method, with an implementation like this:

```js
let startTime;
let lastMeasure;
let startMeasure = function (name) {
  startTime = performance.now();
  lastMeasure = name;
};
let stopMeasure = function () {
  var last = lastMeasure;
  if (lastMeasure) {
    // The timeout callback only runs after the browser has finished the
    // pending work, so the logged duration includes rendering.
    window.setTimeout(function () {
      lastMeasure = null;
      var stop = performance.now();
      var duration = stop - startTime;
      console.log(last + ' took ' + duration);
    }, 0);
  }
};
```
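
A hypothetical usage sketch (buildRows is a made-up stand-in for whatever the framework's create-rows handler is):

```js
// Hypothetical usage; buildRows() stands in for the framework's
// "create 10,000 rows" handler.
startMeasure('create10k');
buildRows();
stopMeasure(); // logs the duration once the browser yields after rendering
```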

The idea is that the setTimeout function is executed after finishing rendering. This way to measure got beaten up badly by some people and I really don't think it's reliable, but currently I see no other option to decide whether the benchmark driver or chrome's timeline is correct.
At least for svelte and redrunner I see the desired effect in the timeline: the timeout fires after rendering (you have to zoom in a lot to find the timeout).
[Screenshot from 2020-11-15 16-22-52]

So I reload the page, hit create 10,000 rows, then open the console and look at what's printed there. After that I close the console again (because I'm a bit afraid that keeping it open might otherwise influence the measurement).
These are the results I'm getting:

| Redrunner | Svelte |
| --- | --- |
| 1323.795 | 1291.29 |
| 1362.28 | 1252.564 |
| 1288.62 | 1225.3 |
| 1373.964 | 1291.93 |
| 1334.23 | 1253.4 |
| 1318.44 | 1285.57 |
| 1310.84 | 1269.404 |
| avg 1330 | avg 1267 |

According to this metric svelte does indeed look a bit faster, and svelte is slower in the chrome dev tools timeline due to some profiling overhead that the benchmark driver does not have. Please feel free to challenge that statement.

(A note on the geometric mean you mentioned above: for each table entry we compute a factor, which is the duration in this cell divided by the duration for the fastest framework. The geometric mean is computed from those factors, not from individual samples.)

@andyhasit
Contributor Author

andyhasit commented Nov 15, 2020

Thanks for looking into this so thoroughly.

I am not in a position to challenge the statement on Svelte vs RedRunner, nor is that my goal. And unfortunately I am not in a great position to assist with or comment on Chrome or lighthouse configuration, or how to measure accurately.

Of course I'm less concerned about discrepancies between the benchmark and dev tools than I am about whether the benchmarks are representative of real-world performance. I think we need to find a way to calibrate benchmark-recorded times against actual browser times (so, without devtools profiling overhead) and check that the inevitable gap is consistent across frameworks. And if it isn't, then we'd need to look into that.

The only way I can think of to measure real time without devtools profiling overhead is to take a sufficiently long running operation (e.g. 10,000 rows at 16x slowdown) and measure it with an external timer of sorts. Just as an occasional one-off to calibrate.

I'll probably be called a caveman for suggesting a stopwatch, but you can at least get a feel for whether framework A is faster than B, and if the benchmark says the opposite, then you at least know there's a problem.

With regards to this:

(A note to the geometric mean you mentioned above. For each table entry we compute a factor which is the duration in this cell divided by the duration for the fastest framework. The geometric mean is computed from those factors, not from individual samples.)

I assumed that is how the factors in brackets (1.50), (1.67) are calculated, but surely that doesn't apply to the times in ms?

Here are the means vs what's in the table:

| Framework | Mean | Geometric Mean | Table |
| --- | --- | --- | --- |
| RedRunner | 2238.04 | 2237.48 | 2,185.50 |
| Svelte | 2241.27 | 2238.7 | 2,100.70 |

Surely the ms in the table should use the mean or geometric mean of the ten results, or something not far off? The geometric mean of the ten results shows a 1 ms difference between the two frameworks, yet the table places 85 ms between them... I don't see how it gets to that...

Thank you for your continued patience and answers :-)

@krausest
Owner

That startMeasure / stopMeasure was my caveman's tool. It allowed me to validate (somewhat) the benchmark results, but there were some voices that it might not always yield valid results.

Here's an example for computing the geometric average:
[Screenshot from 2020-11-15 19-20-44]
First you have to compute, for each benchmark, the factor by which the implementation is slower than the fastest (second row).
[Screenshot from 2020-11-15 19-23-15]
Then you multiply the factors together and raise the product to 1/#factors.
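
As a made-up example in code (the durations and framework names are invented, just to show the calculation):

```js
// Made-up durations in ms: one entry per benchmark, one column per framework.
const durations = {
  create10k: { vanillajs: 1200, svelte: 1260, redrunner: 1340 },
  swapRows:  { vanillajs: 20,   svelte: 30,   redrunner: 24 },
};
// Factor for a cell = its duration divided by the fastest duration in that row.
const factorsFor = (framework) =>
  Object.values(durations).map(
    (row) => row[framework] / Math.min(...Object.values(row))
  );
// Geometric mean of the factors: multiply them up, raise to 1/#factors.
const geoMean = (factors) =>
  Math.pow(factors.reduce((a, b) => a * b, 1), 1 / factors.length);
console.log(geoMean(factorsFor('svelte')));    // ≈ 1.25
console.log(geoMean(factorsFor('redrunner'))); // ≈ 1.16
```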

@andyhasit
Contributor Author

andyhasit commented Nov 15, 2020

Actually I made a slight mistake: I pulled the Svelte numbers from the json file for a previous version. I've copied the v0.3.29 results into the spreadsheet and it shows a -23.42% difference, not -18.4%.

[screenshot]

The means of those correct results are lower than RedRunner's, so at least the numbers in the table make sense now.

But based on that last reply, would it be fair to say that the table does not in fact display "Duration in milliseconds" as stated? As I understand it, only the fastest framework's times would display the accurate time in ms, and the rest are extrapolations.

@krausest
Owner

There are two more secrets to the results table: the table displays either the median or the mean of the observed samples, depending on what you select in display mode. Both display modes are measured in milliseconds.
Currently I'm using 12 samples for each CPU benchmark and I'm dropping the two slowest to get more stable results.

In each cell the value in parentheses is a factor showing how much slower this value is in comparison to the fastest.
The geometric mean averages those factors.
If you measure only one benchmark the geometric mean must always equal the factor in the cell above.
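
In code, deriving a single cell from the raw samples would look roughly like this (made-up numbers; the real implementation lives in webdriver-ts):

```js
// Made-up samples for one benchmark/framework cell (12 CPU samples).
const samples = [1258, 1260, 1266, 1268, 1275, 1284, 1302, 1342, 1353, 1368, 1401, 1450];
// Drop the two slowest samples for more stable results...
const kept = [...samples].sort((a, b) => a - b).slice(0, samples.length - 2);
// ...then display either the mean or the median, depending on the display mode.
const mean = kept.reduce((a, b) => a + b, 0) / kept.length;
const mid = kept.length / 2;
const median = kept.length % 2 ? kept[Math.floor(mid)] : (kept[mid - 1] + kept[mid]) / 2;
console.log(mean, median); // -> 1297.6 1279.5
```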

@andyhasit
Contributor Author

I understand the table a bit better now, thanks.

But just to recap, because I created quite a few tangents with side questions:

  1. We both observe that RedRunner's benchmark times are a bit slower than those measured in devtools, but Svelte's are a lot faster (on my machine it's +4.8% and -23.4% respectively).
  2. We have yet to establish the relationship between benchmark, devtools and real-life times.
  3. If the difference between benchmark and real life varies between frameworks then we have a problem; if it doesn't, then it's more about devtools and we close this.
  4. We need a way to check 3.

Where "devtools" means "devtools as run from the browser".

Is that your understanding too?

@krausest
Owner

Yes, mostly.
As for #1, I agree that svelte is about 23-24% slower in devtools (but I can't confirm that the benchmark driver always reports numbers 5% slower than the devtools).

I'd rather reduce #2-#4 to the question of whether create 10,000 rows in chrome for svelte takes 1.3 secs or 1.8 secs. I personally believe the answer startMeasure and stopMeasure give me on this question, but I tried something different:
I took a video with my smartphone, installed a video editor and extracted the relevant part - maybe you want to try this too and tell me what answer you see then...

@krausest
Owner

The start/stopMeasure hack also supports the claim that the dev tools are slowing down svelte.
It reports about 1264 when running with dev tools closed (and opened after the run to see the console output) and e.g. 1671 msecs when the performance recording is running in the dev tools.
I think we're observing a Heisenberg effect for the dev tools' profiler that somehow hits svelte particularly badly.

@andyhasit
Contributor Author

I am not really able to do the camera trick accurately with my phone (good idea though). But using the start/stop console log is very interesting.

Here is me adding 10K lines on redrunner, alternating between devtools open and devtools closed on Chrome (if it starts with 4, devtools were open):

[screenshot]

Here is me doing the same thing on Chromium:

[screenshot]

This doesn't help establish whether the benchmarks are representative of real speeds, but it does show that merely opening or closing Chrome's devtools can cause a massive slowdown (that's just the console, no performance recording or anything). Chromium doesn't seem to do this.

Maybe it's just my Chrome (can't think of any plugins that would affect it).

Quick question on a separate note: can we/should we gitignore webdriver-ts/results.json?

@krausest
Owner

webdriver-ts/results.json might be useful to extract historic results if (and only if) it's just me committing this file.

@krausest
Owner

@andyhasit Thanks for reporting this issue. I'm closing it now. In retrospect I think the implementation is fine, but it's always good to challenge its validity.
