-
The ongoing concern of measurement bias is highly relevant to the production deployment setting of measuring code runtime performance regressions. A thoughtful recurring benchmark deployment on continuous integration can identify meaningful runtime shifts using a mean-shift statistical analysis, whereas a less thoughtful deployment can unsustainably bother engineers on a daily basis. https://dl.acm.org/doi/10.1145/3358960.3375791
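To make the "thoughtful deployment" half concrete, here is a rough sketch of the kind of check a CI benchmark job could run. The thresholds, sample sizes, and the choice of Welch's t-test are my own illustrative assumptions, not something taken from the linked paper:

```python
# Sketch: flag a runtime regression only when the mean shift is statistically
# significant AND practically large, to avoid paging engineers on noise.
# Assumes two lists of wall-clock samples collected by the CI benchmark job.
from statistics import mean
from scipy.stats import ttest_ind

def regression_detected(baseline, current, alpha=0.01, min_effect=0.05):
    """Return True if `current` is significantly slower than `baseline`.

    alpha      -- significance level for Welch's t-test
    min_effect -- minimum relative slowdown (5%) worth reporting
    """
    slowdown = (mean(current) - mean(baseline)) / mean(baseline)
    if slowdown < min_effect:
        return False  # too small to matter, even if statistically real
    # Welch's t-test: does not assume equal variance across runs/machines.
    _, p_value = ttest_ind(current, baseline, equal_var=False)
    return p_value < alpha

# Example: 10 timings (seconds) before and after a suspect commit.
baseline = [1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.01, 1.02, 1.00, 0.99]
current  = [1.09, 1.11, 1.08, 1.12, 1.10, 1.07, 1.11, 1.09, 1.10, 1.08]
print(regression_detected(baseline, current))  # True: ~9% mean shift
```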
-
It seems to me that there are some optimizations that can only make the program faster (e.g., dead code elimination), where measurement error doesn't matter. In these cases, is empirical evaluation still necessary, and if so, do metrics resilient to measurement error (e.g., lines of IR code) suffice to demonstrate empirically that the method works? Or are there still environments where dead code would make the code consistently faster?
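As a concrete example of what I mean by a noise-free metric, here is a naive sketch that counts instructions in a textual LLVM IR file before and after a pass like DCE. The filenames and the line filtering are purely illustrative assumptions, not a robust IR parser:

```python
# Sketch: a measurement-error-free proxy metric -- count instructions in a
# textual LLVM IR (.ll) file before and after an optimization pass.
# The filtering is deliberately naive (labels, comments, declarations, and
# metadata are skipped heuristically); treat this as an illustration only.

def count_ir_instructions(path):
    count = 0
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not stripped or stripped.startswith((';', 'declare', 'define',
                                                    '}', 'target', 'attributes',
                                                    '!', 'source_filename')):
                continue
            if stripped.endswith(':'):   # basic-block label
                continue
            count += 1
    return count

before = count_ir_instructions('prog.ll')       # placeholder input file
after = count_ir_instructions('prog.opt.ll')    # placeholder optimized file
print(f'instructions: {before} -> {after} ({before - after} removed)')
```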
-
Is measurement bias as well known in computer science as it is in “medical and other sciences”? If not, why not?
-
I'm curious how things may have changed since the paper was published. Have some of the calls to action, such as more diversified benchmark suites, more transparency/cooperation from HW manufacturers, and more consideration of measurement bias from researchers, been effectively realized? There has definitely been more work in this area (for example, the more recent paper on Reliable Benchmarking), but it would be interesting to see a follow-up study examining whether the overall field has gotten better at handling measurement bias after more than a decade.
-
I am very interested in the idea that certain factors that can wildly affect performance are hardware specific, and that there is very little information about them from the manufacturers or from the research community. The paper mentioned the loop stream detector, but even in my own research I've encountered situations where it's really hard to get a model of how a specific piece of hardware works, which significantly impedes my ability to explain certain phenomena. This can also lead to measurement bias in ways that are really difficult to detect. This is a frustrating reality, and I wonder if there are ways that vendors could work with researchers to help bridge this gap, or perhaps there are financial incentives for vendors not to. What are some ways to conduct good systems research when facing this issue?
-
Question on the consequences of bad data: given your experience @sampsyo, how much do you think the reproducibility crisis applies to the field of compilers and programming language design? If it does, do you think things are trending for the better? Personally, it always seems challenging to follow good practices in data collection when doing compilers and design automation work, because the research largely depends on the integration of tightly coupled tools. The tool flows are often too brittle and error-prone to experiment with sources of measurement bias.
-
This paper makes me curious why numerical methods typically use "function evaluations" as their primary metric rather than runtime alone. I suppose this gets rid of many sources of measurement bias, which are typically system-related. However, function evaluations may not be a meaningful metric if your function/gradient evaluations are not the bottleneck (one could solve a quadratic Hamiltonian with one Newton step, but computing the Hessian is pricey). Why not also include a runtime estimate? There was a recent optimization algorithm for structural relaxation that gained a lot of popularity, known as FIRE. In a majority of the benchmarks mentioned in the paper it seems to consistently outperform conjugate gradients, sometimes beating quasi-Newton methods. There was then a follow-up paper that showed situations where FIRE converged slower (or didn't even reach convergence) when good old conjugate gradient did. It makes me wonder whether those benchmarks were hand-picked. Testing only a small handful of energy potential functions seems insufficient.
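For what it's worth, reporting both metrics side by side is cheap. Here is a rough sketch that counts function evaluations and wall-clock time together; SciPy's optimizers and the Rosenbrock function are stand-ins I chose for illustration, not anything from the papers above:

```python
# Sketch: report both function evaluations and wall-clock time for an
# optimizer, since either metric alone can mislead (cheap iterations vs.
# expensive evaluations). The Rosenbrock objective is just a placeholder.
import time
import numpy as np
from scipy.optimize import minimize

class CountingObjective:
    """Wrap an objective so every evaluation is counted."""
    def __init__(self, f):
        self.f = f
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.f(x)

def rosenbrock(x):
    return sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

for method in ('CG', 'BFGS', 'Nelder-Mead'):
    obj = CountingObjective(rosenbrock)
    x0 = np.zeros(5)
    start = time.perf_counter()
    result = minimize(obj, x0, method=method)
    elapsed = time.perf_counter() - start
    print(f'{method:12s} evals={obj.calls:5d}  time={elapsed:.4f}s  f={result.fun:.3e}')
```

Note that gradient-based methods here use numerical differentiation, so their extra objective calls are counted too, which is exactly the kind of hidden cost that a pure "iterations" metric would hide.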
-
The question that comes to my mind is: where is the boundary of the "system"? During the design phase of a compiler, we will inevitably rely on some abstract notions that are affected by so many things in reality. One can indefinitely add to the set of control variables, e.g., anything that affects the memory layout, and there always is a potential to unearth some bias. Considering the sheer scale and diversity of different applications running on the same hardware, when should a designer be satisfied with the results? This is also related to the difficulty of "verifying" that a complex component like a compiler or an OS works as expected.
-
I like how thoroughly this paper went into verifying its own results (it would be bad and ironic if the authors of a paper on measurement bias had an inadequate evaluation/analysis). It is concerning to think that something as simple and seemingly arbitrary as a researcher's environment could be the difference between publishing a paper with positive results and throwing away a project because of negative results, especially considering all the systems that report adding within 10% benefit to something. I do have a question about something I didn't think was explained well in the paper (or maybe I missed it): why does figure 1b have two sets of data points in it? One set seems fairly consistent at 600k cycles no matter the environment size; is that some sort of baseline? And there was no explanation for the one extremely slow run that took 1.6+ million cycles. I would have liked to know what happened on that run.
-
This paper is particularly interesting in that it reveals the undeniable fact that tweaking the experimental setup can greatly influence a systems experiment. In particular, it gives the example of varying memory layout and link order, and proposes two solutions: (1) create diverse benchmarks and (2) use causal analysis. I understand the purpose of this paper is to demonstrate that experimental bias exists rather than to introduce detailed ways to reduce it, but some questions arose as I read this part: when can we be satisfied that our experimental suite is diverse enough? In the paper, the authors give an example of using a set of setups that vary the memory layout and draw conclusions from the distribution of speedups. However, I believe there are numerous conditions affecting performance measurement besides memory layout. Even if the distribution is tight, that only shows the suite is diverse enough to factor out the bias introduced by memory layout.
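To make the memory-layout axis concrete, here is a rough sketch of randomizing just that one factor, the size of the UNIX environment, in the spirit of the paper's setup randomization. The `./bench` binary and the padding range are placeholders I made up for illustration:

```python
# Sketch: vary the size of the UNIX environment (which shifts the initial
# stack/memory layout) and report a distribution of runtimes instead of a
# single number.
import os
import random
import statistics
import subprocess
import time

def timed_run(cmd, extra_env_bytes):
    env = dict(os.environ)
    env['PADDING'] = 'x' * extra_env_bytes   # inflate the environment block
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

cmd = ['./bench']                            # placeholder benchmark binary
sizes = random.sample(range(0, 4096), 30)    # 30 random environment sizes
times = [timed_run(cmd, n) for n in sizes]
print(f'median={statistics.median(times):.3f}s  '
      f'min={min(times):.3f}s  max={max(times):.3f}s  '
      f'spread={(max(times) - min(times)) / min(times):.1%}')
```

Of course, as I said above, even a tight spread here only rules out bias along this one axis.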
-
I very much enjoyed how this paper forced me to consider the disconnect between various stages of the compilation and execution pipeline causing, as @sampsyo said, unintended "downstream" effects. I wonder if we could leverage JIT compilation to communicate information between traditionally isolated stages of the compiler pipeline to recognize and remedy performance degradation caused by additional optimizations. For example, recompilation allows you to recompile hot JIT'd code at runtime at a higher optimization level in hopes of improving performance. If profiling data gathered at runtime indicates that this recompilation was actually a net negative, we can swap the original unoptimized JIT'd code back in. This would allow us to examine the performance implications of various optimizations as opaque passes without having to consider how they actually transform the code or interact with possibly unspecified details of the hardware target, a challenge noted by the authors. The authors also discuss how link order can affect code and data layout, which can determine whether a hot loop fits entirely within an i-cache. I wonder if a combination of PGO (to determine the hot loops themselves) and LTO (to ensure these hot loops are relocated to favorable addresses at link time) could address this issue.
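Here is a toy sketch of the "revert if the recompilation was a net negative" idea, with plain Python functions standing in for JIT'd code versions and a noise margin guarding the decision. A real JIT would of course do this with compiled code and runtime counters; everything here is an assumption for illustration:

```python
# Sketch: keep an "optimized" version of a hot function only if it beats the
# baseline by more than a noise margin on representative inputs; otherwise
# swap the original back in.
import timeit

def hot_loop_baseline(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def hot_loop_optimized(n):
    # "Optimized" variant; may or may not be faster depending on n and the VM.
    return sum(i * i for i in range(n))

def pick_version(baseline, candidate, arg, noise_margin=0.05, repeats=5):
    t_base = min(timeit.repeat(lambda: baseline(arg), number=100, repeat=repeats))
    t_cand = min(timeit.repeat(lambda: candidate(arg), number=100, repeat=repeats))
    # Only swap in the candidate if it wins by more than the noise margin.
    return candidate if t_cand < t_base * (1 - noise_margin) else baseline

active = pick_version(hot_loop_baseline, hot_loop_optimized, 10_000)
print('using:', active.__name__)
```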
-
Reading the paper and also the discussion on "downstream" effects above makes me wonder how much the designs of software and hardware have driven each other into pockets of local minima in terms of performance. For example, some low-level heuristic at the software level may seem good, get popular, and then be used exclusively in testing the next hardware generation. Maybe there is a design choice at the hardware level that does really well only in conjunction with this heuristic and thus propagates into later generations. And later software development ends up taking this aspect of the hardware as a given, further solidifying our "path" into some architecture. However, this seems inevitable. In the context of setup randomization, there has to be a point where certain dimensions of setup customization are dropped; otherwise there are just too many possible setups, and the possibilities need to be maintained as an accurate sample of the "real world".
-
At first I was a bit shocked when I finished the paper, and then I remembered it was from ASPLOS 2009! In particular because of the consistent struggle papers still seem to have with establishing ill-founded results, as we've recently seen with some replication studies that arguably hold the original authors and the artifacts accountable. With such a consistent problem throughout the years, there still seems to be a disconnect between the showcasing of results (which this paper does a good job of highlighting) and the organization of artifacts and artifact bias, which I believe should go through an entire reproducibility checklist themselves. Generally, I'm curious if simulations have become more standard for evaluating these biases, as the researchers have full control and knowledge of the "hardware." I'm also curious whether these tools have become easier to access with open source standards for simulation. More technically, I'm wondering why the authors chose to discard the following bullet point:
It would definitely be interesting to see the trade-offs between security and optimization, as I view this design choice in the Linux kernel as non-trivial and impactful in the actual application of research. Are there more research papers that tackle this trade-off in compiler and systems design?
-
This paper really opened my eyes to a pitfall in computing research that I hadn't really considered before. In retrospect, it makes sense given the complexity of modern hardware systems: it's extremely difficult to factor in every possible contributor to measurement bias from the hardware. Moreover, it appears that the optimal link order and environment size for one machine is not necessarily optimal for another. It was striking to see just how significant an impact parameters such as link order had (e.g., sometimes O3 optimizations were slower than O2). One step the authors took which I really appreciated was exploring whether the measurement bias they were observing was specific to gcc (it wasn't). This further demonstrated that measurement bias must always be taken into consideration when researchers are trying to evaluate the true effect of an optimization. I'm curious whether any best practices for controlling measurement bias have been adopted since this paper was released. Do most papers nowadays take time to discuss environment setup? Is experimental setup randomization common practice now? Are causal analysis steps usually discussed when authors attribute performance improvements to a specific optimization they made?
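On the setup-randomization question, here is a rough sketch of what randomizing one factor, link order, could look like in practice. The object files, compiler invocation, and benchmark binary are all placeholders I assumed for illustration, not anything prescribed by the paper:

```python
# Sketch: relink the same object files in several random orders, time each
# resulting binary, and look at the spread before attributing a speedup to
# the optimization under test.
import random
import statistics
import subprocess
import time

objects = ['a.o', 'b.o', 'c.o', 'd.o']    # placeholder object files

def time_with_link_order(order, runs=5):
    subprocess.run(['cc', '-O2', *order, '-o', 'bench'], check=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(['./bench'], check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

results = []
for _ in range(10):
    order = objects[:]
    random.shuffle(order)
    results.append(time_with_link_order(order))

print(f'median runtime across link orders: {statistics.median(results):.3f}s, '
      f'spread {(max(results) - min(results)) / min(results):.1%}')
```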
-
I was looking up link order and found that link order can not only affect the performance of the resulting program but can even prevent a program from compiling at all! I thought it was interesting, so I'm sharing it here.
-
Reading this paper brought me back to my experience with machine learning research and the issues that inevitably result when trying to benchmark results in that field. One paper that immediately comes to mind is A Metric Learning Reality Check, which shows that inconsistencies in benchmarking across 10+ years of research in the field of metric learning have resulted in a perception that the field as a whole has progressed much more than it actually has. In addition, benchmarking machine learning models seems less error-prone than benchmarking system performance, since the metric for machine learning models is accuracy on a dataset, which suffers from far less variance than the number of CPU cycles needed to execute some particular code (and that's not even taking into account the variety of programs that could be run on a particular system, which seems far greater than the variety of datasets an ML model could be evaluated on). As such, after reading this paper I was disappointed but not surprised: even for the comparatively simpler task of benchmarking an ML model, these types of issues very much still arise.
-
This thread is for discussing the famous "Producing Wrong Data!" paper by Mytkowicz et al. I (@sampsyo) am the discussion leader and will try to answer all your questions!