SummitTesting
Performance is a top priority for ArborX. Any pull request (PR) that may affect performance must therefore include benchmark results showing no regressions. Attempts to build a robust system that checks this automatically have not yet succeeded, so ArborX relies on developers running the benchmarks by hand and posting the results in the PR comments on GitHub.
Our main focus is the Summit supercomputer, managed by OLCF. Right now, our
performance benchmark consists of running `bvh_driver`, which runs several
Google Benchmark tests that build and query a BVH tree. Below, we document the
steps, with a focus on avoiding sloppy mistakes.
For our studies, we use the following configuration:
Package | Version |
---|---|
Host compiler | gcc/7.4.0 |
Device compiler | cuda/10.1.243 |
MPI | spectrum-mpi/10.3.1.2-20200121 |
Boost | 1.68.0 |
Benchmark | 1.4 |
The first step is to build Kokkos. As ArborX requires at least version 3.1.00, this is the version we choose. Kokkos is built using the following CMake script:
ARGS=(
-D CMAKE_BUILD_TYPE=RelWithDebInfo
-D CMAKE_INSTALL_PREFIX="$KOKKOS_INSTALL_DIR"
-D CMAKE_CXX_COMPILER="$KOKKOS_SOURCE_DIR/bin/nvcc_wrapper"
-D Kokkos_ENABLE_SERIAL=ON
-D Kokkos_ENABLE_OPENMP=ON
-D Kokkos_ENABLE_CUDA=ON
-D Kokkos_ENABLE_CUDA_LAMBDA=ON
-D Kokkos_ENABLE_DEPRECATED_CODE=OFF
-D Kokkos_ARCH_POWER9=ON
-D Kokkos_ARCH_VOLTA70=ON
)
cmake "${ARGS[@]}" "${KOKKOS_SOURCE_DIR}"
Next, ArborX is built:
ARGS=(
-D CMAKE_BUILD_TYPE=RelWithDebInfo
-D CMAKE_INSTALL_PREFIX="$ARBORX_INSTALL_DIR"
-D BUILD_SHARED_LIBS=ON
-D ARBORX_ENABLE_MPI=ON
-D MPI_EXECUTABLE=/sw/summit/xalt/1.1.3/bin/jsrun
-D MPI_EXEC_NUMPROC_FLAG="-n"
-D CMAKE_PREFIX_PATH="$KOKKOS_INSTALL_DIR;$BENCHMARK_INSTALL_DIR;$BOOST_INSTALL_DIR"
-D CMAKE_CXX_COMPILER="$KOKKOS_INSTALL_DIR/bin/nvcc_wrapper"
-D CMAKE_CXX_FLAGS="-lineinfo -DKOKKOS_ENABLE_PROFILING -Wall -Wextra -Werror"
-D CMAKE_CXX_EXTENSIONS=OFF # required by Kokkos
-D ARBORX_ENABLE_BENCHMARKS=ON
)
cmake "${ARGS[@]}" "${ARBORX_SOURCE_DIR}"
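Both configure scripts expand several shell variables, and an unset one silently becomes an empty path. A minimal pre-flight sketch, assuming the variable names used in the two scripts above (the helper name `check_build_env` is hypothetical and not part of ArborX):

```shell
#!/bin/bash
# Hypothetical helper: verify that every variable referenced by the
# Kokkos and ArborX configure scripts is set and non-empty.
check_build_env() {
  local var value missing=0
  for var in KOKKOS_SOURCE_DIR KOKKOS_INSTALL_DIR \
             ARBORX_SOURCE_DIR ARBORX_INSTALL_DIR \
             BENCHMARK_INSTALL_DIR BOOST_INSTALL_DIR; do
    # Portable indirect lookup: copy the value of the variable named by $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "error: $var is not set" >&2
      missing=1
    fi
  done
  # Non-zero status if at least one variable is missing.
  return "$missing"
}
```

Calling `check_build_env || exit 1` at the top of each configure script would stop before CMake runs with a half-empty command line.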
The benchmark driver is located in the `benchmarks/bvh_driver` directory.
Typically, we want to compare the performance of the `master` branch with the
changes proposed in a PR, say in a `feature` branch. As ArborX is a header-only
library, we can compile two executables, one for `master` and one for
`feature`, and launch them in the same Summit job.
To compile the `master` executable, proceed as follows:
git checkout $(git merge-base master feature)
cd build
./do-configure # configure CMake using the above script
cd benchmarks/bvh_driver # build only the benchmark
make
mv ArborX_BoundingVolumeHierarchy.exe ArborX_master_$(git rev-parse --short HEAD)
cd ../../.. # back to the source root (cd - would only return to build)
Then, repeat the process with the `feature` branch:
git checkout feature
cd build
./do-configure # configure CMake using the above script
cd benchmarks/bvh_driver # build only the benchmark
make
mv ArborX_BoundingVolumeHierarchy.exe ArborX_feature_$(git rev-parse --short HEAD)
cd ../../.. # back to the source root (cd - would only return to build)
Due to limitations in how the current driver is implemented, it can only run a
single configuration. We therefore add an additional step after checking out
`master` or `feature`, applying a patch that hard-codes the test configuration:
git am TEST.patch
At the moment, we have two versions of the patch: `TEST_SETUP_NO_SORT.patch`,
which runs the queries without sorting, and `TEST_SETUP_SORT.patch`, which also
sorts the queries. This limitation will be addressed in the short term. Both
files are provided at the bottom of this page.
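Since the `master` and `feature` recipes differ only in the git ref and the label, they can be folded into one function. The following is a sketch under assumptions, not part of ArborX: `build_variant`, `run` and `DRY_RUN` are hypothetical names, the function is run from the source root with `build` directly beneath it, and the patch file is an untracked file in that root. Unlike the manual recipe, it records the commit hash before applying the patch, so the executable name refers to the commit under test.

```shell
#!/bin/bash
# Hypothetical helper; run it from the ArborX source root.
# Usage: build_variant <label> <git-ref> [patch-file]
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_variant() {
  local label="$1" ref="$2" patch="${3:-}"
  local dir="build/benchmarks/bvh_driver" hash="HASH"

  run git checkout "$ref" || return 1
  # Record the hash before applying the patch, so that the name refers to
  # the commit under test rather than to the temporary patch commit.
  if [ "${DRY_RUN:-0}" != 1 ]; then
    hash="$(git rev-parse --short HEAD)" || return 1
  fi
  if [ -n "$patch" ]; then
    run git am "$patch" || return 1
  fi
  # The child shell keeps the caller's working directory unchanged.
  run sh -c 'cd build && ./do-configure && cd benchmarks/bvh_driver && make' || return 1
  run mv "$dir/ArborX_BoundingVolumeHierarchy.exe" "$dir/ArborX_${label}_${hash}"
}

# Example calls (not executed here):
#   build_variant master "$(git merge-base master feature)" TEST_SETUP_NO_SORT.patch
#   build_variant feature feature TEST_SETUP_NO_SORT.patch
```

Running with `DRY_RUN=1` first prints the command sequence, which makes it easy to confirm that both variants receive the same patch.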
At this point, one has two executables, `ArborX_master_<hash1>` and
`ArborX_feature_<hash2>`. The next task is to launch the job with the following script:
#!/bin/bash
### Begin BSUB Options
#BSUB -P PROJECT_ID
#BSUB -J perf_test
#BSUB -W 01:00
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"
### End BSUB Options and begin shell commands
module load gcc/7.4.0 cuda/10.1.243
labels=()
labels+=("master_<hash1>")
labels+=("feature_<hash2>")
# benchmark_filter="--benchmark_filter=radius"
jsrun_options="-n 1 -a 1 -c 42 -g 1 -r 1 -l CPU-CPU -d packed -b packed:42"
# Exit early if executables are not found
for label in "${labels[@]}"; do
[ -x "./ArborX_${label}" ] || exit 1
done
prefix="$(date +%Y%m%d%H%M)"
export OMP_NUM_THREADS=42
for label in "${labels[@]}"; do
cmd="jsrun $jsrun_options ./ArborX_${label} --benchmark_format=json --no-header --benchmark_repetitions=10 ${benchmark_filter} > ${prefix}_${label}.json"
echo "$cmd"
eval "$cmd"
done
On a successful run, there will be two `.json` files in the directory.
They can then be compared using `compare_bench.py`, a supplementary script
shipped with Google Benchmark.
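The comparison step can be wrapped in a small helper that first checks both result files. This is a sketch under assumptions: `compare_results` and the `COMPARE_BENCH` variable are hypothetical names, `compare_bench.py` is assumed to take the two result files as positional arguments, and the file names follow the `${prefix}_${label}.json` pattern of the job script.

```shell
#!/bin/bash
# Hypothetical helper: compare the two result files written by the job script.
# COMPARE_BENCH is an assumed variable holding the path to Google Benchmark's
# compare_bench.py; adjust it to your checkout.
# Usage: compare_results <prefix> <baseline-label> <new-label>
compare_results() {
  local prefix="$1" base_label="$2" new_label="$3"
  local base="${prefix}_${base_label}.json"
  local new="${prefix}_${new_label}.json"
  local f
  for f in "$base" "$new"; do
    # -s: the file exists and is not empty (a failed run may leave an empty file).
    if [ ! -s "$f" ]; then
      echo "error: missing or empty result file: $f" >&2
      return 1
    fi
  done
  # Baseline first, new results second.
  python "${COMPARE_BENCH:-compare_bench.py}" "$base" "$new"
}
```

For example, `compare_results 202001010000 "master_<hash1>" "feature_<hash2>"` would compare the two files produced by a job started at that (made-up) timestamp.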
TEST_SETUP_NO_SORT.patch (queries are not sorted):
diff --git a/benchmarks/bvh_driver/bvh_driver.cpp b/benchmarks/bvh_driver/bvh_driver.cpp
index dbbc93a..8794383 100644
--- a/benchmarks/bvh_driver/bvh_driver.cpp
+++ b/benchmarks/bvh_driver/bvh_driver.cpp
@@ -22,6 +22,10 @@
#include <cstdlib>
#include <random>
+#ifdef ARBORX_ENABLE_MPI
+#include <mpi.h>
+#endif
+
#include <benchmark/benchmark.h>
#include <point_clouds.hpp>
@@ -210,17 +214,30 @@ public:
#define REGISTER_BENCHMARK(TreeType) \
BENCHMARK_TEMPLATE(BM_construction, TreeType) \
- ->Args({n_values, source_point_cloud_type}) \
+ ->Args({(int)1e4, 0}) \
+ ->Args({(int)1e5, 0}) \
+ ->Args({(int)1e6, 0}) \
+ ->Args({(int)1e4, 1}) \
+ ->Args({(int)1e5, 1}) \
+ ->Args({(int)1e6, 1}) \
->UseManualTime() \
->Unit(benchmark::kMicrosecond); \
BENCHMARK_TEMPLATE(BM_knn_search, TreeType) \
- ->Args({n_values, n_queries, n_neighbors, sort_predicates_int, \
- source_point_cloud_type, target_point_cloud_type}) \
+ ->Args({(int)1e4, (int)1e4, 10, 0, 0, 2}) \
+ ->Args({(int)1e5, (int)1e5, 10, 0, 0, 2}) \
+ ->Args({(int)1e6, (int)1e6, 10, 0, 0, 2}) \
+ ->Args({(int)1e4, (int)1e4, 10, 0, 1, 3}) \
+ ->Args({(int)1e5, (int)1e5, 10, 0, 1, 3}) \
+ ->Args({(int)1e6, (int)1e6, 10, 0, 1, 3}) \
->UseManualTime() \
->Unit(benchmark::kMicrosecond); \
BENCHMARK_TEMPLATE(BM_radius_search, TreeType) \
- ->Args({n_values, n_queries, n_neighbors, sort_predicates_int, \
- buffer_size, source_point_cloud_type, target_point_cloud_type}) \
+ ->Args({(int)1e4, (int)1e4, 10, 0, 0, 0, 2}) \
+ ->Args({(int)1e5, (int)1e5, 10, 0, 0, 0, 2}) \
+ ->Args({(int)1e6, (int)1e6, 10, 0, 0, 0, 2}) \
+ ->Args({(int)1e4, (int)1e4, 10, 0, 0, 1, 3}) \
+ ->Args({(int)1e5, (int)1e5, 10, 0, 0, 1, 3}) \
+ ->Args({(int)1e6, (int)1e6, 10, 0, 0, 1, 3}) \
->UseManualTime() \
->Unit(benchmark::kMicrosecond);
@@ -270,6 +287,15 @@ public:
int main(int argc, char *argv[])
{
+#ifdef ARBORX_ENABLE_MPI
+ // Even though this benchmark does not actually use MPI, initializing the
+ // MPI execution environment is necessary on Summit and Ascent
+ int required = MPI_THREAD_SERIALIZED;
+ int provided;
+ MPI_Init_thread(&argc, &argv, required, &provided);
+ assert(provided >= required);
+#endif
+
KokkosScopeGuard guard(argc, argv);
namespace bpo = boost::program_options;
@@ -361,11 +387,11 @@ int main(int argc, char *argv[])
REGISTER_BENCHMARK(ArborX::BVH<Cuda>);
#endif
-#if defined(KOKKOS_ENABLE_SERIAL)
- REGISTER_BENCHMARK(BoostRTree);
-#endif
-
benchmark::RunSpecifiedBenchmarks();
+#ifdef ARBORX_ENABLE_MPI
+ MPI_Finalize();
+#endif
+
return EXIT_SUCCESS;
}
TEST_SETUP_SORT.patch (queries are sorted):
diff --git a/benchmarks/bvh_driver/bvh_driver.cpp b/benchmarks/bvh_driver/bvh_driver.cpp
index dbbc93a..8794383 100644
--- a/benchmarks/bvh_driver/bvh_driver.cpp
+++ b/benchmarks/bvh_driver/bvh_driver.cpp
@@ -22,6 +22,10 @@
#include <cstdlib>
#include <random>
+#ifdef ARBORX_ENABLE_MPI
+#include <mpi.h>
+#endif
+
#include <benchmark/benchmark.h>
#include <point_clouds.hpp>
@@ -210,17 +214,30 @@ public:
#define REGISTER_BENCHMARK(TreeType) \
BENCHMARK_TEMPLATE(BM_construction, TreeType) \
- ->Args({n_values, source_point_cloud_type}) \
+ ->Args({(int)1e4, 0}) \
+ ->Args({(int)1e5, 0}) \
+ ->Args({(int)1e6, 0}) \
+ ->Args({(int)1e4, 1}) \
+ ->Args({(int)1e5, 1}) \
+ ->Args({(int)1e6, 1}) \
->UseManualTime() \
->Unit(benchmark::kMicrosecond); \
BENCHMARK_TEMPLATE(BM_knn_search, TreeType) \
- ->Args({n_values, n_queries, n_neighbors, sort_predicates_int, \
- source_point_cloud_type, target_point_cloud_type}) \
+ ->Args({(int)1e4, (int)1e4, 10, 1, 0, 2}) \
+ ->Args({(int)1e5, (int)1e5, 10, 1, 0, 2}) \
+ ->Args({(int)1e6, (int)1e6, 10, 1, 0, 2}) \
+ ->Args({(int)1e4, (int)1e4, 10, 1, 1, 3}) \
+ ->Args({(int)1e5, (int)1e5, 10, 1, 1, 3}) \
+ ->Args({(int)1e6, (int)1e6, 10, 1, 1, 3}) \
->UseManualTime() \
->Unit(benchmark::kMicrosecond); \
BENCHMARK_TEMPLATE(BM_radius_search, TreeType) \
- ->Args({n_values, n_queries, n_neighbors, sort_predicates_int, \
- buffer_size, source_point_cloud_type, target_point_cloud_type}) \
+ ->Args({(int)1e4, (int)1e4, 10, 1, 0, 0, 2}) \
+ ->Args({(int)1e5, (int)1e5, 10, 1, 0, 0, 2}) \
+ ->Args({(int)1e6, (int)1e6, 10, 1, 0, 0, 2}) \
+ ->Args({(int)1e4, (int)1e4, 10, 1, 0, 1, 3}) \
+ ->Args({(int)1e5, (int)1e5, 10, 1, 0, 1, 3}) \
+ ->Args({(int)1e6, (int)1e6, 10, 1, 0, 1, 3}) \
->UseManualTime() \
->Unit(benchmark::kMicrosecond);
@@ -270,6 +287,15 @@ public:
int main(int argc, char *argv[])
{
+#ifdef ARBORX_ENABLE_MPI
+ // Even though this benchmark does not actually use MPI, initializing the
+ // MPI execution environment is necessary on Summit and Ascent
+ int required = MPI_THREAD_SERIALIZED;
+ int provided;
+ MPI_Init_thread(&argc, &argv, required, &provided);
+ assert(provided >= required);
+#endif
+
KokkosScopeGuard guard(argc, argv);
namespace bpo = boost::program_options;
@@ -361,11 +387,11 @@ int main(int argc, char *argv[])
REGISTER_BENCHMARK(ArborX::BVH<Cuda>);
#endif
-#if defined(KOKKOS_ENABLE_SERIAL)
- REGISTER_BENCHMARK(BoostRTree);
-#endif
-
benchmark::RunSpecifiedBenchmarks();
+#ifdef ARBORX_ENABLE_MPI
+ MPI_Finalize();
+#endif
+
return EXIT_SUCCESS;
}