-
The ongoing concern of measurement bias is highly relevant to the production deployment setting of measuring code runtime performance regressions. A thoughtful recurring benchmark deployment on continuous integration can identify meaningful runtime shifts using a mean-shift statistical analysis, whereas a less thoughtful deployment can unsustainably bother engineers on a daily basis. https://dl.acm.org/doi/10.1145/3358960.3375791
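To make the "thoughtful deployment" half concrete, here is a rough sketch of the kind of check a CI benchmark job could run. The thresholds, sample sizes, and the choice of Welch's t-test are my own illustrative assumptions, not something taken from the linked paper:

```python
# Sketch: flag a runtime regression only when the mean shift is statistically
# significant AND practically large, to avoid paging engineers on noise.
# Assumes two lists of wall-clock samples collected by the CI benchmark job.
from statistics import mean
from scipy.stats import ttest_ind

def regression_detected(baseline, current, alpha=0.01, min_effect=0.05):
    """Return True if `current` is significantly slower than `baseline`.

    alpha      -- significance level for Welch's t-test
    min_effect -- minimum relative slowdown (5%) worth reporting
    """
    slowdown = (mean(current) - mean(baseline)) / mean(baseline)
    if slowdown < min_effect:
        return False  # too small to matter, even if statistically real
    # Welch's t-test: does not assume equal variance across runs/machines.
    _, p_value = ttest_ind(current, baseline, equal_var=False)
    return p_value < alpha

# Example: 10 timings (seconds) before and after a suspect commit.
baseline = [1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.01, 1.02, 1.00, 0.99]
current  = [1.09, 1.11, 1.08, 1.12, 1.10, 1.07, 1.11, 1.09, 1.10, 1.08]
print(regression_detected(baseline, current))  # True: ~9% mean shift
```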
-
It seems to me that there are some optimizations that can only make the program faster (e.g., dead code elimination), where measurement error doesn't matter. In these cases, is empirical evaluation still necessary, and if so, do metrics resilient to measurement error (e.g., lines of IR code) suffice to demonstrate empirically that the method works? Or are there still environments where dead code would make the code consistently faster?
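As a concrete example of what I mean by a noise-free metric, here is a naive sketch that counts instructions in a textual LLVM IR file before and after a pass like DCE. The filenames and the line filtering are purely illustrative assumptions, not a robust IR parser:

```python
# Sketch: a measurement-error-free proxy metric -- count instructions in a
# textual LLVM IR (.ll) file before and after an optimization pass.
# The filtering is deliberately naive (labels, comments, declarations, and
# metadata are skipped heuristically); treat this as an illustration only.

def count_ir_instructions(path):
    count = 0
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if not stripped or stripped.startswith((';', 'declare', 'define',
                                                    '}', 'target', 'attributes',
                                                    '!', 'source_filename')):
                continue
            if stripped.endswith(':'):   # basic-block label
                continue
            count += 1
    return count

before = count_ir_instructions('prog.ll')       # placeholder input file
after = count_ir_instructions('prog.opt.ll')    # placeholder optimized file
print(f'instructions: {before} -> {after} ({before - after} removed)')
```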
-
Is measurement bias as well known in computer science as it is in “medical and other sciences”? If not, why not?
-
I'm curious how things may have changed since the paper was published. Have some of the calls to action, such as more diversified benchmark suites, more transparency/cooperation from HW manufacturers, and more consideration of measurement bias from researchers, been effectively realized? There has definitely been more work in this area (for example, the more recent paper on Reliable Benchmarking), but it would be interesting to see a follow-up study examining whether the overall field has gotten better at handling measurement bias after more than a decade.
-
I am very interested in the idea that certain factors that can wildly affect performance are hardware specific, and that there is very little information about them from the manufacturers or from the research community. The paper mentioned the loop stream detector, but even in my own research I've encountered situations where it's really hard to get a model of how a specific piece of hardware works, which significantly impedes my ability to explain certain phenomena. This can also lead to measurement bias in ways that are really difficult to detect. This is a frustrating reality, and I wonder if there are ways that vendors could work with researchers to help bridge this gap, or perhaps there are financial incentives for vendors not to. What are some ways to conduct good systems research when facing this issue?
-
Question on the consequences of bad data: given your experience @sampsyo, how much do you think the reproducibility crisis applies to the field of compilers and programming language design? If it does, do you think things are trending for the better? Personally, it always seems challenging to follow good practices in data collection when doing compilers and design automation work, because the research largely depends on the integration of tightly coupled tools. The tool flows are often too brittle and error-prone to experiment with sources of measurement bias.
-
This paper makes me curious why numerical methods typically use "function evaluations" as their primary metric rather than runtime alone. I suppose this gets rid of many sources of measurement bias, which are typically system-related. However, function evaluations may not be a meaningful metric if your function/gradient evaluations are not the bottleneck (one could solve a quadratic Hamiltonian with one Newton step, but computing the Hessian is pricey). Why not also include a runtime estimate? There was a recent optimization algorithm for structural relaxation that gained a lot of popularity, known as FIRE. In a majority of the benchmarks mentioned in the paper it seems to consistently outperform conjugate gradients, sometimes beating quasi-Newton methods. There was then a follow-up paper that showed situations where FIRE converged slower (or didn't even reach convergence) when good old conjugate gradient did. It makes me wonder whether those benchmarks were hand-picked. Testing only a small handful of energy potential functions seems insufficient.
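For what it's worth, reporting both metrics side by side is cheap. Here is a rough sketch that counts function evaluations and wall-clock time together; SciPy's optimizers and the Rosenbrock function are stand-ins I chose for illustration, not anything from the papers above:

```python
# Sketch: report both function evaluations and wall-clock time for an
# optimizer, since either metric alone can mislead (cheap iterations vs.
# expensive evaluations). The Rosenbrock objective is just a placeholder.
import time
import numpy as np
from scipy.optimize import minimize

class CountingObjective:
    """Wrap an objective so every evaluation is counted."""
    def __init__(self, f):
        self.f = f
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.f(x)

def rosenbrock(x):
    return sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

for method in ('CG', 'BFGS', 'Nelder-Mead'):
    obj = CountingObjective(rosenbrock)
    x0 = np.zeros(5)
    start = time.perf_counter()
    result = minimize(obj, x0, method=method)
    elapsed = time.perf_counter() - start
    print(f'{method:12s} evals={obj.calls:5d}  time={elapsed:.4f}s  f={result.fun:.3e}')
```

Note that gradient-based methods here use numerical differentiation, so their extra objective calls are counted too, which is exactly the kind of hidden cost that a pure "iterations" metric would hide.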
-
The question that comes to my mind is: where is the boundary of the "system"? During the design phase of a compiler, we will inevitably rely on some abstract notions that are affected by so many things in reality. One can indefinitely add to the set of control variables, e.g., anything that affects the memory layout, and there always is a potential to unearth some bias. Considering the sheer scale and diversity of different applications running on the same hardware, when should a designer be satisfied with the results? This is also related to the difficulty of "verifying" that a complex component like a compiler or an OS works as expected.
-
I like how thoroughly this paper went into verifying its own results (it would be bad and ironic if the authors of a paper on measurement bias had an inadequate evaluation/analysis). It is concerning to think that something as simple and seemingly arbitrary as a researcher's environment could be the difference between publishing a paper with positive results and throwing away a project because of negative results, especially considering all the systems that report adding within 10% benefit to something. I do have a question about something I didn't think was explained well in the paper (or maybe I missed it): why does figure 1b have two sets of data points in it? One set seems fairly consistent at 600k cycles no matter the environment size; is that some sort of baseline? And there was no explanation for the one extremely slow run that took 1.6+ million cycles. I would have liked to know what happened on that run.
-
This paper is particularly interesting in that it reveals the undeniable fact that tweaking the experimental setup can greatly influence a systems experiment. In particular, it gives the example of varying memory layout and link order, and proposes two solutions: (1) create diverse benchmarks and (2) use causal analysis. I understand the purpose of this paper is to demonstrate that experimental bias exists rather than to introduce detailed ways to reduce it, but some questions arose as I read this part: when can we be satisfied that our experimental suite is diverse enough? In the paper, the authors give an example of using a set of setups that vary the memory layout and draw conclusions from the distribution of speedups. However, I believe there are numerous conditions affecting performance measurement besides memory layout. Even if the distribution is tight, that only shows the suite is diverse enough to factor out the bias introduced by memory layout.
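To make the memory-layout axis concrete, here is a rough sketch of randomizing just that one factor, the size of the UNIX environment, in the spirit of the paper's setup randomization. The `./bench` binary and the padding range are placeholders I made up for illustration:

```python
# Sketch: vary the size of the UNIX environment (which shifts the initial
# stack/memory layout) and report a distribution of runtimes instead of a
# single number.
import os
import random
import statistics
import subprocess
import time

def timed_run(cmd, extra_env_bytes):
    env = dict(os.environ)
    env['PADDING'] = 'x' * extra_env_bytes   # inflate the environment block
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

cmd = ['./bench']                            # placeholder benchmark binary
sizes = random.sample(range(0, 4096), 30)    # 30 random environment sizes
times = [timed_run(cmd, n) for n in sizes]
print(f'median={statistics.median(times):.3f}s  '
      f'min={min(times):.3f}s  max={max(times):.3f}s  '
      f'spread={(max(times) - min(times)) / min(times):.1%}')
```

Of course, as I said above, even a tight spread here only rules out bias along this one axis.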
-
I very much enjoyed how this paper forced me to consider the disconnect between various stages of the compilation and execution pipeline causing, as @sampsyo said, unintended "downstream" effects. I wonder if we could leverage JIT compilation to communicate information between traditionally isolated stages of the compiler pipeline to recognize and remedy performance degradation caused by additional optimizations. For example, recompilation allows you to recompile hot JIT'd code at runtime at a higher optimization level in hopes of improving performance. If profiling data gathered at runtime indicates that this recompilation was actually a net negative, we can swap the original unoptimized JIT'd code back in. This would allow us to examine the performance implications of various optimizations as opaque passes without having to consider how they actually transform the code or interact with possibly unspecified details of the hardware target, a challenge noted by the authors. The authors also discuss how link order can affect code and data layout, which can determine whether a hot loop fits entirely within an i-cache. I wonder if a combination of PGO (to determine the hot loops themselves) and LTO (to ensure these hot loops are relocated to favorable addresses at link time) could address this issue.
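Here is a toy sketch of the "revert if the recompilation was a net negative" idea, with plain Python functions standing in for JIT'd code versions and a noise margin guarding the decision. A real JIT would of course do this with compiled code and runtime counters; everything here is an assumption for illustration:

```python
# Sketch: keep an "optimized" version of a hot function only if it beats the
# baseline by more than a noise margin on representative inputs; otherwise
# swap the original back in.
import timeit

def hot_loop_baseline(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def hot_loop_optimized(n):
    # "Optimized" variant; may or may not be faster depending on n and the VM.
    return sum(i * i for i in range(n))

def pick_version(baseline, candidate, arg, noise_margin=0.05, repeats=5):
    t_base = min(timeit.repeat(lambda: baseline(arg), number=100, repeat=repeats))
    t_cand = min(timeit.repeat(lambda: candidate(arg), number=100, repeat=repeats))
    # Only swap in the candidate if it wins by more than the noise margin.
    return candidate if t_cand < t_base * (1 - noise_margin) else baseline

active = pick_version(hot_loop_baseline, hot_loop_optimized, 10_000)
print('using:', active.__name__)
```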
-
Reading the paper and also the discussion on "downstream" effects above makes me wonder how much the designs of software and hardware have driven each other into pockets of local minima in terms of performance. For example, some low-level heuristic at the software level may seem good, get popular, and then be used exclusively in testing the next hardware generation. Maybe there is a design choice at the hardware level that does really well only in conjunction with this heuristic and thus propagates into later generations. And later software development ends up taking this aspect of the hardware as a given, further solidifying our "path" into some architecture. However, this seems inevitable. In the context of setup randomization, there has to be a point where certain dimensions of setup customization are dropped; otherwise there are just too many possible setups, and the possibilities need to be maintained as an accurate sample of the "real world".
-
At first I was a bit shocked when I finished the paper, and then I remembered it was from ASPLOS 2009! In particular because of the consistent struggle papers still seem to have with establishing ill-founded results, as we've recently seen with some replication studies that arguably hold the original authors and the artifacts accountable. With such a consistent problem throughout the years, there still seems to be a disconnect between the showcasing of results (which this paper does a good job of highlighting) and the organization of artifacts and artifact bias, which I believe should go through an entire reproducibility checklist themselves. Generally, I'm curious if simulations have become more standard for evaluating these biases, as the researchers have full control and knowledge of the "hardware." I'm also curious whether these tools have become easier to access with open source standards for simulation. More technically, I'm wondering why the authors chose to discard the following bullet point:
It would definitely be interesting to see the trade-offs between security and optimization, as I view this design choice in the Linux kernel as non-trivial and impactful in the actual application of research. Are there more research papers that tackle this trade-off in compiler and systems design?
-
This paper really opened my eyes to a pitfall in computing research that I hadn't really considered before. In retrospect, it makes sense given the complexity of modern hardware systems: it's extremely difficult to factor in every possible contributor to measurement bias from the hardware. Moreover, it appears that the optimal link order and environment size for one machine is not necessarily optimal for another. It was striking to see just how significant an impact parameters such as link order had (e.g., sometimes O3 optimizations were slower than O2). One step the authors took which I really appreciated was exploring whether the measurement bias they were observing was specific to gcc (it wasn't). This further demonstrated that measurement bias must always be taken into consideration when researchers are trying to evaluate the true effect of an optimization. I'm curious whether any best practices for controlling measurement bias have been adopted since this paper was released. Do most papers nowadays take time to discuss environment setup? Is experimental setup randomization common practice now? Are causal analysis steps usually discussed when authors attribute performance improvements to a specific optimization they made?
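On the setup-randomization question, here is a rough sketch of what randomizing one factor, link order, could look like in practice. The object files, compiler invocation, and benchmark binary are all placeholders I assumed for illustration, not anything prescribed by the paper:

```python
# Sketch: relink the same object files in several random orders, time each
# resulting binary, and look at the spread before attributing a speedup to
# the optimization under test.
import random
import statistics
import subprocess
import time

objects = ['a.o', 'b.o', 'c.o', 'd.o']    # placeholder object files

def time_with_link_order(order, runs=5):
    subprocess.run(['cc', '-O2', *order, '-o', 'bench'], check=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(['./bench'], check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

results = []
for _ in range(10):
    order = objects[:]
    random.shuffle(order)
    results.append(time_with_link_order(order))

print(f'median runtime across link orders: {statistics.median(results):.3f}s, '
      f'spread {(max(results) - min(results)) / min(results):.1%}')
```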
-
I was looking up link order and found that link order can not only affect the performance of the resulting program but can even prevent a program from compiling at all! I thought it was interesting, so I'm sharing it here.
-
Reading this paper brought me back to my experience with machine learning research and the issues that inevitably result when trying to benchmark results in that field. One paper that immediately comes to mind is A Metric Learning Reality Check, which shows that inconsistencies in benchmarking across 10+ years of research in the field of metric learning have resulted in a perception that the field as a whole has progressed much more than it actually has. In addition, benchmarking machine learning models seems less error-prone than benchmarking system performance, since the metric for machine learning models is accuracy on a dataset, which suffers from far less variance than the number of CPU cycles needed to execute some particular code (and that's not even taking into account the variety of programs that could be run on a particular system, which seems far greater than the variety of datasets an ML model could be evaluated on). As such, after reading this paper I was disappointed but not surprised: even for the comparatively simpler task of benchmarking an ML model, these types of issues very much still arise.
-
This thread is for discussing the famous "Producing Wrong Data!" paper by Mytkowicz et al. I (@sampsyo) am the discussion leader and will try to answer all your questions!