High performance and efficiency are core pillars of stumpless design, and as such a benchmarking framework is available to assist with maintaining them. This framework uses the Google Benchmark library to measure execution time and other efficiency characteristics.
Performance tests are named `performance-test-<item>` for the various pieces of the library. You can use the `bench` target to build and execute all performance tests at once, or the name of the executable prefixed with `run-` if you only want to run a single module. These targets write their results to the standard output as well as a JSON file in the `performance-output` directory of the build location, which you can use with the `compare.py` tool from the benchmark library. There is an example of using this tool in the walkthrough below. Of course, you can also execute the test executable directly if you want to set the parameters yourself. This is also demonstrated in the walkthrough.
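For example, the following commands (run from the build directory) use these targets to run either the full suite or a single module. The module name here is only an example; substitute whichever `performance-test-<item>` you care about.

```sh
# build and run every performance test, writing results to performance-output/
make bench

# build and run only the element performance tests
make run-performance-test-element
```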
Performance tests are NOT intended to be an absolute measurement of the performance of a function or the library as a whole. They are only useful for measuring the relative performance between two versions of code on the same machine in the same environment. This is why you will not see performance test results posted in any documentation. The results are only useful when compared to one another, typically during development of some change.
Benchmarks are run during Release CI builds, but should not be used as indicators of performance for this exact reason. They are only included in the CI process to make sure that they are not broken.
Walking through a benchmarking improvement change from beginning to end
demonstrates all of these principles and how they are used to implement an
actual improvement to the library. Let's analyze an improvement to the
performance of `stumpless_copy_element` to do this.
An early version of `stumpless_copy_element` iterated through all of the params in the element, adding them to the copy one by one. The code for this looked like the following snippet, which has been abbreviated to focus on the performance of the code:
```c
// first create a new element
copy = stumpless_new_element( stumpless_get_element_name( element ) );

// then handle all of the parameters
for( i = 0; i < element->param_count; i++ ) {
  // copy the parameter
  param_copy = stumpless_copy_param( element->params[i] );

  // and then add it
  stumpless_add_param( copy, param_copy );
}
```
While it is logical and easy to follow, this method is inefficient because `stumpless_add_param` reallocates the underlying array each time it is called. This means that the same piece of memory could be reallocated several times in a single call, increasing execution time and putting pressure on the memory manager. Let's change it to instead allocate the memory up front.
Before we make any changes to the code itself, let's implement a benchmark test to measure the performance of the code as is. Our test code looks like this:
```cpp
static void CopyElement( benchmark::State &state ) {
  struct stumpless_element *element;
  const struct stumpless_element *result;

  INIT_MEMORY_COUNTER( copy_element );
  stumpless_set_malloc( copy_element_memory_counter_malloc );
  stumpless_set_realloc( copy_element_memory_counter_realloc );
  stumpless_set_free( copy_element_memory_counter_free );

  element = stumpless_new_element( "copy-element-perf" );
  stumpless_add_new_param( element, "param-1", "value-1" );
  stumpless_add_new_param( element, "param-2", "value-2" );

  for( auto _ : state ) {
    result = stumpless_copy_element( element );
    if( !result ) {
      state.SkipWithError( "could not copy the element" );
    } else {
      stumpless_destroy_element_and_contents( result );
    }
  }

  stumpless_destroy_element_and_contents( element );

  state.counters["CallsToAlloc"] = ( double ) copy_element_memory_counter.malloc_count;
  state.counters["MemoryAllocated"] = ( double ) copy_element_memory_counter.alloc_total;
  state.counters["CallsToRealloc"] = ( double ) copy_element_memory_counter.realloc_count;
  state.counters["CallsToFree"] = ( double ) copy_element_memory_counter.free_count;
  state.counters["MemoryFreed"] = ( double ) copy_element_memory_counter.free_total;
}
```
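The `INIT_MEMORY_COUNTER` macro and the `copy_element_memory_counter_*` functions come from the performance test support code and are not shown above. As a rough sketch only, assuming the macro simply generates counting wrappers with the signatures that `stumpless_set_malloc`, `stumpless_set_realloc`, and `stumpless_set_free` expect, they might look something like this (the struct layout and wrapper bodies are assumptions, not the actual macro expansion):

```c
#include <stdlib.h>

// hypothetical counter struct mirroring the fields read by the benchmark above
struct memory_counter {
  size_t malloc_count;   // calls made to the malloc wrapper
  size_t alloc_total;    // total bytes requested through the malloc wrapper
  size_t realloc_count;  // calls made to the realloc wrapper
  size_t free_count;     // calls made to the free wrapper
  size_t free_total;     // total bytes released (requires per-pointer size
                         // bookkeeping, which is omitted from this sketch)
};

static struct memory_counter copy_element_memory_counter;

// these wrappers are registered with stumpless_set_malloc, stumpless_set_realloc,
// and stumpless_set_free in the benchmark body shown above
static void *
copy_element_memory_counter_malloc( size_t size ) {
  copy_element_memory_counter.malloc_count++;
  copy_element_memory_counter.alloc_total += size;
  return malloc( size );
}

static void *
copy_element_memory_counter_realloc( void *mem, size_t new_size ) {
  copy_element_memory_counter.realloc_count++;
  return realloc( mem, new_size );
}

static void
copy_element_memory_counter_free( void *mem ) {
  copy_element_memory_counter.free_count++;
  free( mem );
}
```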
We can run this specific test with the following command, which will build the test if necessary and then execute it.
```sh
make performance-test-element && ./performance-test-element --benchmark_filter=CopyElement
# sample output:
# 2020-07-27T14:40:33-04:00
# Running ./performance-test-element
# Run on (8 X 1498 MHz CPU s)
# Load Average: 0.52, 0.58, 0.59
# ----------------------------------------------------------------------
# Benchmark        Time       CPU   Iterations   UserCounters...
# ----------------------------------------------------------------------
# CopyElement    633 ns    628 ns      1120000   CallsToAlloc=8.96001M CallsToFree=10.08M CallsToRealloc=2.24M MemoryAllocated=181.44M MemoryFreed=181.44M
```
If you got an error about the library being built as DEBUG, make sure that you pass the `-DCMAKE_BUILD_TYPE=Release` argument to cmake when you are building stumpless.
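Reconfiguring an existing build directory for a Release build looks roughly like this; the source path is a placeholder for your own checkout.

```sh
# switch the build to Release mode and rebuild the test
cmake -DCMAKE_BUILD_TYPE=Release /path/to/stumpless
make performance-test-element
```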
Great! We have an idea of the speed of the library, as well as the number of calls that are made to various memory allocation routines. Next, let's make our fix to `stumpless_copy_element`.
```c
// we still create a new element the same way
copy = stumpless_new_element( stumpless_get_element_name( element ) );

// now we manually allocate the array just once
copy->params = alloc_mem( element->param_count * sizeof( param_copy ) );

for( i = 0; i < element->param_count; i++ ) {
  param_copy = stumpless_copy_param( element->params[i] );

  // and then populate it with each copy
  copy->params[i] = param_copy;
  copy->param_count++;
}
```
Now that we've made this change, let's rebuild our performance test and run it!
```sh
make performance-test-element && ./performance-test-element --benchmark_filter=CopyElement
# sample output:
# 2020-07-27T14:45:05-04:00
# Running ./performance-test-element
# Run on (8 X 1498 MHz CPU s)
# Load Average: 0.52, 0.58, 0.59
# ----------------------------------------------------------------------
# Benchmark        Time       CPU   Iterations   UserCounters...
# ----------------------------------------------------------------------
# CopyElement    542 ns    547 ns      1000000   CallsToAlloc=9.00001M CallsToFree=9.00001M CallsToRealloc=2 MemoryAllocated=162M MemoryFreed=162M
```
We immediately see that the number of calls to `realloc` dropped significantly, and is clearly no longer tied to the number of calls to `CopyElement`. The execution time is also lower, so we can declare success!
If you run a number of benchmarks at once and want to compare all of the results, manually comparing this output can get difficult. Google Benchmark provides a python script in its `tools` folder that makes this much easier. In a normal build tree this is in `benchmark/src/benchmark/tools/`, and it is exported by the `export-benchmark` build target if you are using `BENCHMARK_PATH` (see the development notes for details on this option).
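As a sketch only, assuming `BENCHMARK_PATH` is supplied to cmake as described in the development notes, making the tools available might look like this (the benchmark checkout path is a placeholder):

```sh
# point the build at an existing Google Benchmark checkout
cmake -DBENCHMARK_PATH=/path/to/benchmark /path/to/stumpless

# export the benchmark tooling, including compare.py
make export-benchmark
```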
Running the script is straightforward, as you simply need to export JSON output from each benchmark execution and then compare the results. If you want more detail, check out the full documentation in the Google Benchmark repository.
For this example, we'll assume that you've built stumpless twice: once based on the `latest` branch in the folder `build-latest`, and again based on a branch with your changes in the folder `build-element-copy`. The general flow is to build the test, run it once with each library version, and then compare the results.
```sh
# in folder build-element-copy
make performance-test-element

# run the test with our changes
./performance-test-element --benchmark_filter=CopyElement --benchmark_out=new.json --benchmark_out_format=json

# and then swap out the library and run it again
rm libstumpless.so.2.0.0
cp ../build-latest/libstumpless.so.2.0.0 ./
./performance-test-element --benchmark_filter=CopyElement --benchmark_out=old.json --benchmark_out_format=json

# compare results with the Google Benchmark tool
cd benchmark/src/benchmark/tools
python3 compare.py benchmarks ../../../../old.json ../../../../new.json
# sample output:
# Comparing old.json to new.json
# Benchmark        Time       CPU   Time Old   Time New   CPU Old   CPU New
# ----------------------------------------------------------------------------------------------------------
# CopyElement   -0.1791   -0.1747        663        545       663       547
```
This execution tells us that we have reduced the execution time of the function by just over 17 percent. Note that the numbers are slightly different from our previous executions, but that the general trend still holds true. This relative nature is exactly why benchmark test results are only relevant when executed on the same machine in the same environment, under the same load if at all possible.
You can also pass the compare script two performance test executables, if you have them, and bypass the json output steps. However, if you implemented a new benchmark for your change then the latest build tree may not have a test, and you can simply rely on the above method.
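If you do have both executables, that comparison might look roughly like the following; the paths are placeholders for the two build trees described above.

```sh
# compare two builds of the same performance test without intermediate json files
python3 compare.py benchmarks \
    /path/to/build-latest/performance-test-element \
    /path/to/build-element-copy/performance-test-element \
    --benchmark_filter=CopyElement
```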
This is a real example of an actual improvement made to stumpless, so if you want to see any of the tests or code in detail you can simply look at them in the source tree.