-
Notifications
You must be signed in to change notification settings - Fork 59
Precision
For a while now, we've been mulling over how to validate results in test-blas2. Ideally, we would be able to compare the results of a GPU-based SpM-dV calculation against a known-good CPU implementation. However, we're using floating point and comparing floating point values is hard.
This comes from two problems (which are related):
- Floating point addition is not associative. In SpM-dV, all of the matrix*vector results for a row must be summed. The order of this summation is different between CPU algorithms and our GPU algorithms (both in csrmv_general and csrmv_adaptive). As such, they accumulate different rounding errors during the summation process. This results in a divergence between the two values, meaning that we cannot (and should not) compare the results of these algorithms bit-for-bit.
- More problematic, floating point summation (especially when both positive and negative values are allowed) is notoriously error-prone. Catastrophic cancellation of values can completely destroy our ability to obtain useful answers. Even when this does not happen, however, the error rate can grow linearly with the number of additions. This means that a matrix with very long rows will have loose errors bounds in its results, making it difficult to automatically say whether two results are comparable.
The first problem means that we cannot directly compare (bit-for-bit) the output of a CPU-based SpM-dV and a GPU-based SpM-dV. The inability to directly compare the two floating point numbers is somewhat common knowledge, and we currently compare against a range of values.
Originally, we hard-coded this to say e.g. "if in single precision, the values must be within 1e-5 of each other". I recently changed this to instead say that "in single, the values must be within 0.1% of one another". We shouldn't be comparing against a static value (because a gigantic number, e.g. 1.0e126, cannot have another number that is within 1e-5 to it in single precision -- thus we're just doing a direct comparison that is doomed to fail). However, even the (extremely loose) error bounds I've included fail for many of the matrices we're operating on.
That is due to the second issue: floating point summation has linearly-increasing error bounds and suffers from catastrophic cancellation effects that can cause large values (which are eventually subtracted out) to destroy values that, in the end, are significant. This is true even in our CPU-based algorithm. This means that our CPU and GPU results may be different and they may both be wrong.
For example, the matrix dc2 from the UFL sparse matrix collection has a couple of extremely long rows (e.g. row 91, which has 47,193 non-zero values). A simple CPU-based SpMV algorithm (where the input vector entirely contains 1s) that is calculated in single precision says that the output for row 91 is roughly -1.921. csrmv_general calculates the answer as roughly -1.857, a noticeable percentage off. csrmv_adaptive returns roughly -1.013. Much worse!
However, if you perform this calculation in higher (double, quad) precision, the answer is roughly -1.005. Both the single-precision CPU and GPU results are wrong -- and the GPU-based results are closer to the correct answer. The different calculation orders between these algorithms have all resulted in different errors -- and it's not always the case that the GPU algorithms are closer.
This leads to the problems and questions I want to raise.
- I believe we should ensure that our CPU-based comparison point is computed accurately by calculating at a higher precision. For example, when comparing against single precision results from the GPU, we should perform CPU-based SpMV in double precision and cast the final answer to a float. I think that a multi- or arbitrary-precision library is overkill here, but this is probably a point for discussion. However, it's vital that our CPU baseline not be so erroneous that it makes comparisons impossible.
- How should we handle this floating point summation problem in the GPU kernels? Should we handle it there? If so, how much performance are we willing to lose (if any) in order to obtain more accurate results? Should these more accurate calculations always be enabled, or should there be an accuracy-mode and speed-mode? If so, which should be the default?
I believe this issue is larger than just the test case. It's also a question about how accurate our SpM-dV should be.
Optionally compute SpM-dV using compensated summation.
Compensated summation is a technique for reducing the cumulative error that occurs during floating point summation. Such errors occur both because of differences in magnitude between the summands (where part or all of the smaller one may be rounded out) as well as due to catastrophic cancellation, where subtracting a large value from a sum will cause previously marginal values that were rounded out to become significant (and now wrong).
It does this by, in part, replacing simple floating point summation with an error-free transformation, such as 2Sum or Fast2Sum. These equations essentially perform the equation A + B = (Sum, Error).
The Sum2 algorithm uses this transformation to greatly reduce the final error in a summation. As you continue to add values into sum, you continue to separately add the errors together. At the end of the summation, you add the final sum value and the final error value together. The end result is that your final answer is as accurate (0 ULP) as performing the summation in 2x precision and then casting down to your original precision. In other words, if all of your values to sum are floats, your answer should be exactly the same as if you'd performed the summation using doubles and then cast the final answer as a float.
This code uses a parallel version of this algorithm, PSum2. This compensated summation algorithm was added to both csrmv_general and csrmv_adaptive. test-blas2 and clsparse-bench were both modified to allow the command line option to use compensated summation.
When running test-blas2 with compensated summation enabled, all double-precision tests pass with 0 ULP difference compared to a CPU-side algorithm performing the same algorithm. The single precision tests sometimes differ compared to the CPU calculations performed in double precision. This is because, on my test platforms, the AMD OpenCL runtime does not allow single precision denormals. Instead, it rounds all denorms to zero, causing a large error that we can't compensate.
While this replaces all of the additions in SpM-dV with many more FLOPs, the overhead of compensated summation is, in most cases, modest. On a FirePro W9100, the average single-precision slowdown of csrmv_adaptive is 5%, and the average double-precision slowdown is 2%. On a Radeon R9 Fury X (with a DPFP rate of 1/16 the SPFP rate, but with much higher bandwidth than the FirePro W9100), the slowdown is 10% for single-precision and 28% for double-precision.
Note that the above graphic is not a comparison between the relative performance of the two cards. '1' in each bar is the performance in that category with traditional summation.
Note that we do not perform a fully compensated dot product, as described in the Yamanaka, Ogita, Rump, and Oishi paper. Performing an error-free transformation on the products in our SpM-dV is not extremely computationally intensive, but carrying around the extra error values from the multiplications is burdensome in csrmv_adaptive. We've seen that the majority of error in our calculations comes from summation, so we deemed that good enough. Nonetheless, this means that we can't fully guarantee a 0 ULP accuracy compared to 2x precision calculations, because the error due to multiplies may compound.