
Space Time Stack

Vivek Kale edited this page Jul 16, 2024 · 12 revisions

Summary

This tool presents information on a Kokkos program's efficiency with respect to space (memory) and execution time for arbitrarily nested regions of the program.

You can see the current version of the Kokkos Tools Space Time Stack at:

https://github.com/kokkos/kokkos-tools/tree/develop/profiling/space-time-stack

Description

This is an essential library from the set of Kokkos Tools. It provides a fine-grained quantitative answer to two top-level questions about the efficiency of a parallel program:

(1) How efficiently does my parallel program finish its work on a parallel computer? (2) In which part is it least efficient?

It additionally helps provide a quantitative answer to the question of how the answers to (1) and (2) compare across two different platforms.

This tool aims to be a convenient all-in-one implementation, combining some of the features of other tools. As the beginning of the name implies, it prints information about both the runtime and the memory usage of a Kokkos program. As the end of the name implies (i.e., 'stack'), it presents this space and time information in a way that allows arbitrary nesting of regions as defined by Kokkos::Profiling::pushRegion and Kokkos::Profiling::popRegion.

Why Should I Care?

To assess and understand the efficiency of a computer program (a piece of work, or sequence of steps, i.e., an algorithm) running on a computer (a process), one needs to answer how long (time) and how much memory (space) the program needed to finish its work, and in which parts of the program's computation (regions) it took the longest and used the most memory.

The efficiency of a serial algorithm run on a Turing machine is assessed by its time complexity and space complexity. When that algorithm is implemented as a computer program and run on a real computer, one can expect that trends in its execution time and memory usage for a given problem size roughly follow the order of growth of its time and space complexity.

The same cannot be said for parallel algorithms and parallel programs run on a parallel computer, due to the complexities introduced by parallelism on a node of a supercomputer and the challenges of parallel programming. Having a tool to understand where time and space are spent is especially critical to ensuring parallel programs run as efficiently as possible.

Such a tool is even more important for a performance-portable parallel program, and specifically a Kokkos parallel program. One can obtain information similar to what Kokkos Tools Space-Time-Stack provides by using the vendor-specific profiling tools for native GPU programming, e.g., roctx for HIP or nvtx for CUDA. However, there are problems with doing this for a Kokkos parallel program. The first problem is name mangling of Kokkos functions in these profiles, which makes it hard for users to understand which function in their Kokkos program is actually being profiled. The second problem is that one cannot always make a proper apples-to-apples comparison of the time spent in a particular function across backends run on different platforms, as different profiling tools support different granularities of detail and show different information. Kokkos programmers want to compare the efficiency of their program, and of its parts, across two different platforms (e.g., where one platform has a node with one vendor's GPU and the other has a node with a different vendor's GPU).

Key Features

The fundamental questions on efficiency for any (parallel) computer program are:

(1) How much time (wall-clock time, measured in seconds) did my program take in each part of the program? (2) How much space (memory, measured in gigabytes) did my program take in each part of the program?

The connector tool provides this information per region of a Kokkos program, where each region can contain sub-regions, each of which can contain further sub-regions, and so on, i.e., nested regions.

One can control the level of detail reported when profiling through an environment variable representing a percentage of total application time, given as a double. By default, this environment variable is set to a tenth of a percent.

Usage

Compilation

There are two ways to build this tool for use in your Kokkos program. The first is via the make command using the provided Makefile. The second is via CMake using the provided CMakeLists.txt file. Building with CMake is the recommended way to build this tool library.

Using Makefile

Go to this tool's directory. Open the Makefile and change the compilers as needed. Then, run make. The compiler is hardcoded to mpicxx, and the code makes use of MPI. However, there is a USE_MPI macro; if its definition is removed from the code, MPI will not be used.

Using CMake

  1. Go to the top-level directory for Kokkos Tools. Create a build directory and then cd into it, e.g., mkdir mySpaceTimeStackBuild; cd mySpaceTimeStackBuild.

  2. Type cmake .. and wait until the command has completed execution.

  3. Once the command from step 2 has completed, type ccmake .. and do the following: a. set the option KOKKOS_ENABLE_SINGLELIB to ON; b. find the SINGLELIB_PROFILERS option and set it to use only the name of this Kokkos Tools connector; c. type c to configure.

  4. Type make and then make install.

Running

Once you have built the tool following the guidelines in the previous section, you can run a Kokkos program with it. You can use the tool in two ways. One way is by setting an environment variable and then running the Kokkos program. The other is by passing a command-line argument to the Kokkos program executable.

Using the Environment Variable

Type export KOKKOS_TOOLS_LIBS=kp_space_time_stack.so; myKokkosApp.exe.

Using the Command-line Argument

Type myKokkosApp.exe --kokkos-tools-libs='kp_space_time_stack.so'.

Sample Output

The following is example output from this tool for a two-rank MPI job that heavily leverages Kokkos:

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 2.07634 seconds
TOP-DOWN TIME TREE:
<percent of total time> <percent MPI imbalance> <number of calls> <name> [type]
================== 
|-> 41.2% 47.2% 2000 N9SPARTA_NS12UpdateKokkosE [reduce]
|-> 8.7% 2.3% 2000 N9SPARTA_NS16CollideVSSKokkosE [reduce]
|-> 4.1% 1.8% 6007 N9SPARTA_NS14ParticleKokkosE [for]
|-> 0.8% 5.0% 2000 N9SPARTA_NS10CommKokkosE [for]
|-> 0.3% 38.3% 2000 ZN9SPARTA_NS17FixEmitFaceKokkos12perform_taskEvEUliE_ [for]
|-> 0.3% 5.0% 4002 N9SPARTA_NS8ExclScanIN6Kokkos6OpenMPEEE [scan]
|-> 0.3% 13.2% 4000 N9SPARTA_NS15IrregularKokkosE [for]

BOTTOM-UP TIME TREE:
<percent of total time> <percent MPI imbalance> <number of calls> <name> [type]
=================== 
|-> 41.2% 47.2% 2000 N9SPARTA_NS12UpdateKokkosE [reduce]
|-> 8.7% 2.3% 2000 N9SPARTA_NS16CollideVSSKokkosE [reduce]
|-> 4.1% 1.8% 6007 N9SPARTA_NS14ParticleKokkosE [for]
|-> 0.8% 5.0% 2000 N9SPARTA_NS10CommKokkosE [for]
|-> 0.3% 38.3% 2000 ZN9SPARTA_NS17FixEmitFaceKokkos12perform_taskEvEUliE_ [for]
|-> 0.3% 5.0% 4002 N9SPARTA_NS8ExclScanIN6Kokkos6OpenMPEEE [scan]
|-> 0.3% 13.2% 4000 N9SPARTA_NS15IrregularKokkosE [for]

MPI RANK WITH MAX MEMORY: 0
MAX BYTES ALLOCATED: 2669008
HOST ALLOCATIONS:
================ 
  63.8% particle:particles
  19.4% grid:csurfs
  3.0% particle:plist
  2.7% grid:cells
  2.5% surf:tris
  2.5% particle:mlist
  2.2% particle:plist
  1.2% grid:cinfo
  0.5% surf:pts
  0.3% normal
  0.3% vstream
  0.2% comm:sbuf
  0.1% collide:vremax
  0.1% collide:remain
  0.1% thermal/grid:vector_grid
  0.1% thermal/grid:tally
  0.1% Irregular:buf
  0.1% npqdim

END KOKKOS PROFILING REPORT.

The first two major blocks are runtime reports. Because this application does not yet use Kokkos::Profiling::pushRegion, the "bottom-up" and "top-down" stacks are identical. The first column is the percentage of the total runtime consumed by the kernel (this runtime is computed as the sum over MPI ranks). The second column is the imbalance across MPI ranks, defined as the maximum time consumed by the kernel in any MPI rank divided by the average time consumed by the kernel over all MPI ranks. The third column is the number of calls that were made to the kernel. The fourth column is the name of the kernel. Since this application did not specify explicit names for its kernels, the names are typeid(functor).name(). The fifth column is the type of the stack frame, which can be a parallel_for, a parallel_reduce, a parallel_scan, or a region.

The third major block is a snapshot of which Kokkos allocations existed at the time of high water mark memory consumption, in the MPI rank which had the highest high water mark memory consumption. The first column is the percentage of the high water mark memory which is consumed by this allocation, and the second column is the name given to it (via the Kokkos::View constructor). Note that this tool only measures the memory consumed by Kokkos allocations in a computer program.

All data processing happens in memory, and the report is printed when Kokkos::finalize is called. The runtime overhead of profiling during application execution should be low. Likewise, we try to keep memory consumption low by using a prefix tree data structure to accumulate the information. Any MPI communication needed is done only in the final post-processing triggered by Kokkos::finalize, never during application execution.