Update Mikado v0.7.46 #660

Merged (1 commit) Nov 13, 2019
Conversation

ts-thomas
Contributor

I also updated the other Mikado test to the same version as the new proxy version, otherwise it might be confusing. Between 0.6.5 and 0.7.4 most of the updates were related to reactive design and stores. The latest version also brings some minor improvements; one of them is that I finally found a way to get rid of the synced DOM indices. Since the proxy version exists as an independent test in this benchmark, I decided to slightly change this test implementation to maximize the difference between the two implementations. This test no longer uses a store (before it used the "loose" internal store). With these minor changes the comparison of "mikado" and "mikado-proxy" becomes "100% functional, 0% data-driven vs. 0% functional, 100% data-driven".

krausest merged commit 3327adb into krausest:master Nov 13, 2019
@krausest
Owner

Here is the update.
I'd love to convince you to keep only mikado-proxy. This benchmark should really be about data-driven frameworks and not about DOM manipulation frameworks - especially if a framework offers both ways (and even more so considering the non-existent performance difference).

@ts-thomas
Contributor Author

ts-thomas commented Nov 14, 2019

Of course, I will. I really would like to see the discussion behind this, although maybe it's better that I don't^^. But you are absolutely right, the difference in your benchmark is "non-existent" - yet that also tells me that something is not being measured correctly. I assume the timer isn't captured in the same event loop; that's why the differences are so minimal. It also makes the results synthetic and unrealistic. This benchmark is good at showing how similarly each library performs within a small time frame, let's say around 4 milliseconds. That is fine for less performant libraries, but super fast implementations can't show their strengths. That is absolutely acceptable, because it is what makes this benchmark so famous: it does not discriminate against the slower libraries, and slower libraries are often more popular. As a consequence, this benchmark can't be used as a measuring tool during development, especially when optimizing for the benchmark. Without my own test suite I wasn't able to optimize anything. But please do keep the non-discriminating concept. Keyed has nothing to do with data-driven, so "This benchmark should really care about data-driven frameworks" isn't fully comprehensible to me.

Please consider that my poor English skills might make things read differently than what I meant to say.

@ryansolid
Contributor

ryansolid commented Nov 14, 2019

I don't understand why you keep putting yourself out there with unnecessary exposure. Maybe it's a lost-in-translation thing. But now any idiot can claim they have a higher IQ than you. Anyway, with that out of the system, maybe there is some room to turn this conversation into something productive.

I think we are just getting to a threshold with a few of the tests where the variability of results on a given run at the top end makes more of a difference than the difference in performance. Since everyone is in the same ballpark for those tests, any variation seems considerably different. You can see the overhead of the proxies in a few key areas, but in general it doesn't have as much of an effect in spiking the swap rows or select test. The tests could be re-run tomorrow and any of a handful of libraries could be the fastest, or faster than Vanilla JS, depending on a good run on a couple of those tests. And you are correct that in those couple of tests there is some sort of framing happening with the measuring. Once we are around 16ms or so, all bets are off; even in the sub-40ms range I think the effect is noticeable. This is due to how the rendering is being measured, as you noticed - paint and all. The improvements in browser performance and the computing power of the benchmark machine have even eclipsed the artificial slowdown here. So we are hitting the limits of this, but we are also stepping into an area where minuscule differences in the sub-millisecond range of performance are never going to be noticeable at any realistic scale against the comparably, almost infinitely larger cost of the DOM operations. Those DOM operations are the only place to make substantial gains at this point, and the definition of the benchmark, one could argue reasonably so, restricts what can be done there.

But I definitely think that if you can turn your mind to how we can improve around those tests within the spirit of this benchmark, everyone can benefit. If you have really learned some new optimization techniques from your own benchmarking, suggest some improvements for VanillaJS. One of the second-tier objectives here is to keep Vanilla out in front of all the other libraries; if it isn't, we are doing something wrong. I've chalked this up to variability, but I could be wrong. If you want to show your cleverness, that would definitely be a way. I wrote Vanilla JS 1 when we were having trouble staying out in front of domc. The same could be true now with Mikado.

@ts-thomas
Contributor Author

ts-thomas commented Nov 14, 2019

First of all, sorry about that Ryan, you were not the one I meant.
As I remember, I already tried to give some suggestions, but I wasn't given any room for them - not by you, but by almost all others of the "data-driven clan". So I decided to delete my suggestions. By the way, I also started to make a vanilla implementation, but then my time was lost because I needed it to be angry instead. I did not see why I should work on something for which I only get negative feedback. The last thing, just as a reminder, was that I showed my pure data-driven concept and then the trashing started again. When things become too stupid, I lose my interest outright. I'm sure that I could give some nice input on this topic, both to the benchmarking side and to the templating development side.

One of the biggest curiosities of all is that this benchmark's implementation of Mikado has almost all optimizations disabled, simply because this benchmark cannot cover them (due to the reloading). You should seriously ask whether something in the benchmark environment might not be working properly. Anyway, I like the diversity of benchmarks, and it also shows a possible use case (not a common one, but possible). If we started dealing with freshly created data payloads during the runtime, Mikado would blow away everything.

Fortunately I'm not resentful. Please give me some time to think about it and also to recover my mood. At least I need some motivation to get into this "environment" again; hopefully it will come back.

BTW, the unit scale of an IQ may differ from country to country, but yes, that absolutely does not matter.

@ts-thomas
Contributor Author

@ryansolid I would like to make some suggestions for improvements to this benchmark suite. Where should I post them? Here?

@ryansolid
Contributor

ryansolid commented Nov 15, 2019

@ts-thomas Unless it's implementation-specific (like, say, VanillaJS or Mikado etc.), I think just make a dedicated issue. I had started one specifically about select row: #613. But if you have specific suggestions that seem manageable, I'd just make an issue for those - possibly even separate issues, so that if they are significant enough they can be handled independently rather than being grouped together when they are unrelated. That way each can be discussed/implemented on its own merits. I think this might be really important, because if any of the ideas are controversial you don't want the whole ship to sink.

I know @krausest is always looking at making things better as long as it fits in with what is already being done here. If the idea is sound, the biggest friction would be if we have to go through and update all the implementations; that presents a huge challenge. There have been discussions like this before that have been beneficial (like increasing the gap between rows in swap rows).

@ts-thomas
Contributor Author

Let me do some brainstorming, then we can pick the ones which have a realistic benefit. I will post them here; a closed PR is good enough as a backlog :)
It might take some time, since I will also take a deeper look into how the benchmark cycles work. Of course my focus is on changes which do not require updating any of the current test implementations.

@ts-thomas
Contributor Author

ts-thomas commented Nov 16, 2019

I focused just on the technical aspects at first.

The biggest issues from top down:

  1. A far too large mean deviation, which makes the results less meaningful, especially in the top ranks
  2. The capturing timer is not applied properly, which ends up in results that are not differentiated enough (in reality they are not that close to each other)
  3. Both issues above make this benchmark suite useless as a development tool, especially when optimizing for the benchmark, although that might not be a goal

Suggestions for improving these issues:

  • The time frame of the async hook is probably too large. There are 3 kinds of waits in webdriver. Have you already tried webdriver's FluentWait? You need to set the polling value as small as possible; every additional event loop adds artifacts to the result and increases the deviation a lot. As far as I could see, explicit waits are not recommended for benchmarking purposes (I assume explicit waits are actually used). This would probably yield the biggest improvement.
  • I could not find all the places where the benchmark results are calculated, but some notes on this topic. You often speak of "good runs"; I know that phenomenon, but it also depends on how you calculate the results. The "index" calculation I use in my own benchmark produces really stable results regardless of how often you run it, so I just want to show that this issue can be reduced by applying a different calculation. One important thing I found out a good while ago: do not pick median results, and certainly nothing above that (e.g. best runs), from a series of captured frames before applying the benchmark calculations. That effectively cuts out the garbage collector and adds a lot of artifacts and irregularities between runs. Due to the reloading it has much less impact here, of course, but I would definitely avoid it. There are some crazy libs around which try to statistically counteract the time lost to the garbage collector, which is really silly and one of the reasons why I don't use them. Instead, apply medians during the final calculation and collect the timers just as plain as they come. I actually use two kinds of scoring in my benchmark (see the sketch after this list):
    Score = Sum over tests(lib_ops / median_ops) / test_count * 1000
    Index = Sum over tests(lib_ops / max_ops) / test_count * 1000
    The latter might also fit the spirit of this benchmark perfectly; it gives very little deviation and, on top of that, greatly cuts off peaks. The first one is a bit of the opposite: it has noticeable deviation (but far less than here currently) and reflects peaks proportionally in the score. What's important is that the median is applied at the earliest here (and of course, do not use the average instead of the median in this case). The value "1000" is just a scoring resolution; in the first formula 1000 is the midfield, and in the second formula 1000 is the maximum possible result. That's why I keep both scores, because each of them shows a different aspect/scale.
  • Running chromedriver in incognito mode gains some performance; fullscreen (kiosk) mode also gives a noticeable boost, but the latter might be confusing. Headless is also possible, but on my machine it runs slightly slower in that mode for some reason.
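
A minimal sketch of the two scoring formulas above, assuming a hypothetical `results` map of operations-per-second values per library and test (all names here are illustrative, not taken from the benchmark code):

```ts
// Hypothetical per-library, per-test throughput values (ops/sec); higher is better.
type Results = Record<string, Record<string, number>>;

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Score: each library's ops relative to the per-test median, averaged over tests, scaled by 1000.
// Index: each library's ops relative to the per-test maximum, averaged over tests, scaled by 1000.
function scores(results: Results, tests: string[]): Record<string, { score: number; index: number }> {
  const out: Record<string, { score: number; index: number }> = {};
  for (const lib of Object.keys(results)) {
    let score = 0;
    let index = 0;
    for (const test of tests) {
      const ops = Object.values(results).map((r) => r[test]);
      score += results[lib][test] / median(ops);
      index += results[lib][test] / Math.max(...ops);
    }
    out[lib] = {
      score: (score / tests.length) * 1000,
      index: (index / tests.length) * 1000,
    };
  }
  return out;
}
```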

@krausest
Owner

krausest commented Nov 16, 2019

I'm not at home, so just a short comment regarding one incorrect statement. The measurement does not use the duration of webdriver calls (I don't think that would work). It parses Chrome's timeline events - please see https://www.stefankrause.net/wp/?p=218 for more information (and thus it is not async). An important principle is that it measures JS script duration plus rendering time.
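
For illustration only, a rough sketch of that general idea - summing the time from the dispatched event to the end of the last paint out of trace events. The event names and record shape here are assumptions for the example, not the benchmark's actual parsing code:

```ts
// Simplified trace-event shape as it appears in Chrome timeline logs (assumed fields).
interface TraceEvent {
  name: string;   // e.g. "EventDispatch", "Paint"
  ts: number;     // start timestamp in microseconds
  dur?: number;   // duration in microseconds
}

// Conceptual sketch of "JS script duration plus rendering time":
// measure from the click's event dispatch to the end of the last paint.
function measureDuration(events: TraceEvent[]): number {
  const click = events.find((e) => e.name === "EventDispatch");
  const paints = events.filter((e) => e.name === "Paint");
  if (!click || paints.length === 0) return NaN;
  const lastPaint = paints[paints.length - 1];
  const endUs = lastPaint.ts + (lastPaint.dur ?? 0);
  return (endUs - click.ts) / 1000; // milliseconds
}
```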

@ts-thomas
Contributor Author

It is the async nature which is really hard to measure. Let me take a look at this tomorrow.

@ts-thomas
Contributor Author

ts-thomas commented Nov 17, 2019

Let me check that we mean the same thing by "async". A synced test has everything within the same event loop:

  • timer start
  • process
  • timer end

When it is not in the same event loop, it is async. There are at least some async frameworks in your table, so it is impossible to always stay in the same event loop.
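
To make the distinction concrete, a small sketch (the render() call is a hypothetical placeholder for the library's work): a synced measurement keeps start, work, and end in one task, while an async one only stops the timer in a later task, e.g. after a frame.

```ts
// Hypothetical stand-in for the synchronous render work of the library under test.
const render = () => { /* apply DOM updates here */ };

// Synced: timer start, the work, and timer end all happen in the same event-loop task.
function measureSync(): number {
  const start = performance.now();
  render();
  return performance.now() - start;
}

// Async: the timer is stopped in a later task (here after the next frame),
// so browser scheduling and rendering overhead ends up in the result.
function measureAsync(): Promise<number> {
  return new Promise((resolve) => {
    const start = performance.now();
    render();
    requestAnimationFrame(() =>
      setTimeout(() => resolve(performance.now() - start), 0)
    );
  });
}
```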

The benchmark covers render time. But a few points come to mind:

  • the rendering time the browser needs to make components visible is not something controlled by the lib (of course there are different categories, like shadow DOM)
  • the pure rendering time (paint) should be equal in all libs of the same category
  • the browser render stack is something special; when you take timers from it you also add quite a lot of noise
  • timings from the browser render stack are not something that can be used for high-accuracy performance benchmarks
  • recalculation happens within the same event loop and already covers the most important parts of the benchmark

Let me show you a simple comparison. I added just a "console.time()" to test case 1 and benchmarked a series of 5000 loops within the same event loop:

  • domc: 7080.43408203125ms
  • mikado: 4065.2900390625ms

I think it is quite obvious that these timings differ by far more than the results in the framework table suggest.
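
For reference, a minimal sketch of this kind of same-event-loop measurement, with a hypothetical run() standing in for test case 1:

```ts
// Hypothetical stand-in for test case 1 ("create rows") of the library under test.
const run = () => { /* build and append the table rows here */ };

// Measure 5000 iterations inside a single event-loop task, as in the numbers above.
console.time("create rows x5000");
for (let i = 0; i < 5000; i++) {
  run();
}
console.timeEnd("create rows x5000");
```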

I really would like to help. If you are interested in improving this, please point me to the relevant lines of the code.

@ryansolid
Contributor

the pure rendering time (paint) should be equal in all libs of the same category

Should it? I honestly am not sure. I definitely think that we want to measure more than just the JS time. The JS time alone is almost worthless if it doesn't lead to faster full render time; it can be 1000x faster and it doesn't really matter, because in reality it is the full render experience we see. I've seen libraries do really well on the JS side and poorly on the rendering side. I don't know the exact details of how he measures, but @localvoid's UIBench lets you choose between just JS time and full render time. In that benchmark I've seen libraries invert standings on tests because of it - not just closing the gap, but actually being slower in the JS recording and then comparatively faster with render.

However, that being said, I get that if the browser's timers aren't depicting real render time either, they aren't particularly useful. I'd have to defer to someone else's expertise here. But conceptually, as a consumer I'd rather have a slightly flawed measurement that includes rendering in some form than one that gives a false impression of performance from JS time alone. As a library writer I like having the flexibility of knowing both scenarios, but ultimately I'd take the thing closest to what the end user sees (that is just a personal priority, not speaking for anyone else; it would be good to get a few opinions here). Even if I were hitting a browser frame rule, I feel that makes things more authentic. It can be argued this test has long moved past the point of authenticity, but you can definitely see how it started its life as the "realest" of synthetic benchmarks.

@ts-thomas
Contributor Author

Please keep in mind that there is a huge gap between "what you expect to get" and "what you get".

I write "should be equal" but of course it isn't. At least for me this is no reason to tend adding synthetic cosmetics. It is a rule that every manual adjust which is coming from a good intention has a high potential to lead into wrong results and false interpreting. That's why I exclude this part in my benchmark. I do not believe that these results are wrong just because the paint roundtrip isn't included. When adding this roundtrip it will just adding noise. Of course I did not cover async libs, which is actually not possible this way, but also I get a lot of advantages. But of course we need to differentiate it. There are some special libs which follows its own render strategy (I also made an own render stack a couple years ago). At the end "accuracy" and "compatibility" acts in contrast to each other. Of course I absolutely understood it that this kind of framework could not use the same technical base. That also isn't my intention.

Yes, it is important to seriously ask what the desired result of a benchmark is. I doubt that you have knowledge all the way down to each processing unit of Chrome, but maybe you do.
