Performance Validation
When running Spatter, the destination buffer gets overwritten many times in a typical gather run. This is deliberate: we want to test only the speed of gather operations, without spending bandwidth on writing results back to memory. But if we keep writing over the destination buffer, how can we verify that the kernel actually ran as intended? For one, we can check that the final state of the buffer is accurate. But what about all the other gathers? We can take two approaches.
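To make the final-state check concrete, here is a minimal sketch in Python of a simplified gather model. It is not Spatter's actual kernel: the function names, the tiny buffer sizes, and the indexing rule `source[i * delta + pattern[j]]` are assumptions chosen for illustration. It shows that only the last gather survives in the destination buffer, and that its contents can be predicted without running the loop.

```python
def gather_model(source, pattern, delta, count):
    """Simplified gather run: every gather overwrites the same small buffer."""
    dest = [None] * len(pattern)
    for i in range(count):
        base = i * delta  # each gather starts delta elements further along
        for j, offset in enumerate(pattern):
            dest[j] = source[base + offset]
    return dest


def expected_final_state(source, pattern, delta, count):
    """Only the last gather is still visible in the destination buffer."""
    base = (count - 1) * delta
    return [source[base + offset] for offset in pattern]


# Hypothetical small configuration: 4 gathers of 8 elements, delta of 8.
pattern = list(range(8))
delta = 8
count = 4
source = [float(x) for x in range(delta * (count - 1) + max(pattern) + 1)]

final = gather_model(source, pattern, delta, count)
assert final == expected_final_state(source, pattern, delta, count)
print(final)
```

A check like this confirms the last gather only, which is why the earlier gathers need the separate checks described in the rest of this page.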
The quickest way to check kernel performance is to see whether a STREAM-like pattern comes close to the STREAM performance on that machine. A STREAM-like pattern would be something like ./spatter -pUNIFORM:8:1 -d8 -l$((2**24)). This runs Spatter with the indices [0,1,2,3,4,5,6,7] and a delta of 8, so every element of the input array is read. It is not quite the same as STREAM, because this Spatter run is only intended to produce gather instructions, which are reads, whereas STREAM both reads and writes. The length parameter -l generates 2^24 gathers, each of length 8, with an 8-byte data type, so you will read 1 GiB (2^30 bytes) from memory, which is hopefully large enough that the array does not fit in cache. If you believe the reported performance is too high, keep increasing the -l parameter until performance plateaus.
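The arithmetic behind the 1 GiB figure can be checked with a short Python sketch. The helper names are hypothetical, and the coverage check assumes the simplified indexing rule that gather i reads elements `i * delta + offset` for each offset in the pattern.

```python
def bytes_read(num_gathers, pattern_length, element_bytes=8):
    """Total bytes read: gathers x elements per gather x bytes per element."""
    return num_gathers * pattern_length * element_bytes


def elements_touched(pattern, delta, num_gathers):
    """Set of source indices read, under the simplified indexing assumption."""
    touched = set()
    for i in range(num_gathers):
        for offset in pattern:
            touched.add(i * delta + offset)
    return touched


# Footprint of -pUNIFORM:8:1 -d8 -l$((2**24)) with an 8-byte data type.
total = bytes_read(2**24, 8)
gib = total / 2**30
print(f"{total} bytes = {gib} GiB")

# Small-scale coverage check: with pattern [0..7] and a delta of 8, every
# element is read exactly once, so 16 gathers cover indices 0..127.
small = elements_touched(list(range(8)), 8, 16)
assert small == set(range(128))
```

Doubling the value passed to -l doubles the footprint, which is a quick way to estimate how far to raise it before the array clearly exceeds the last-level cache.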
Similarly, you can test the scatter kernel with ./spatter -kScatter -pUNIFORM:8:1 -d8 -l$((2**24)).
By default, Spatter will use all available threads, which matches STREAM. You may also wish to check that the single-threaded performance matches.
Example
On an Intel Xeon 6226 (Cascade Lake), when we run STREAM compiled with icc -O3 -qopenmp, we get roughly 132.0 GB/s for STREAM Copy. On the same machine, when we run Spatter compiled in the same way, ./spatter -pUNIFORM:8:1:8 -l$((2**24)) reports a bandwidth of 132.8 GB/s.
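One way to turn "close to STREAM" into a number is a relative-difference check, sketched below in Python using the two bandwidths from this example. The 5% tolerance is an assumption for illustration, not a threshold defined by Spatter; pick one that suits the run-to-run variance on your machine.

```python
def relative_difference(measured, reference):
    """Absolute difference as a fraction of the reference bandwidth."""
    return abs(measured - reference) / reference


stream_copy = 132.0      # GB/s, STREAM Copy from the example above
spatter_uniform = 132.8  # GB/s, Spatter UNIFORM run from the example above

TOLERANCE = 0.05  # assumed acceptance threshold, not a Spatter default

diff = relative_difference(spatter_uniform, stream_copy)
close_enough = diff <= TOLERANCE
print(f"relative difference: {diff:.2%}, within tolerance: {close_enough}")
```

For the numbers above the difference is about 0.6%, so the two results agree well under this assumed threshold.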
The first approach is to view the assembly for the kernel and check that it actually includes the kernel's loops. This can be a bit time-consuming for those of us who don't normally spend much time reading assembly, though.
-- WIP --
A quicker method is to do a simple run of Spatter. Calculate the expected memory traffic yourself, then run Spatter with perf mem and see whether the memory traffic is what you expected.
-- WIP --
Spatter's CUDA and OpenMP kernels support additional validation to ensure that data written to buffers is actually present. To enable validation, use both the VALIDATE_DATA=1 CMake flag and the --validate Spatter flag. Note that such validation may affect the bandwidth Spatter reports.
On GPUs, it is slightly more complex to get a fast STREAM-like pattern. To achieve high bandwidth on GPUs, set the -z flag to a large power of 2 such as 1024, so that each thread performs multiple gathers or scatters. This reduces the overhead of switching out thread blocks. For example, you should achieve reasonably good performance on new GPUs with ./spatter -pUNIFORM:8:1:NR -l$((2**27)) -z1024.