rocRoller is a software library for generating AMDGPU kernels.
- AMD's rocRoller Assembly Kernel Generator
To build rocRoller using Docker:
git clone --recurse-submodules git@github.com:ROCm/rocRoller.git rocRoller
cd rocRoller
./docker/user-image/start_user_container
docker exec -ti -u ${USER} ${USER}_dev_clang bash
cd /data
mkdir -p build
cd build
cmake -DROCROLLER_ENABLE_TIMERS=ON -DCMAKE_BUILD_TYPE=Release ..
make -j
To build rocRoller natively on Ubuntu 22 (jammy):
# As root, once:
apt update
apt install -y libboost-container1.74-dev libopenblas-dev ninja-build
# As regular user:
git clone --recurse-submodules git@github.com:ROCm/rocRoller.git rocRoller
cd rocRoller
mkdir -p build
cd build
cmake -DROCROLLER_ENABLE_TIMERS=ON -DCMAKE_BUILD_TYPE=Release ..
make -j
To run the unit tests:
cd rocRoller/build
make test
The rocRoller repository includes several Docker files. We recommend using these for development work.
Instructions for building and launching docker are available.
The rocRoller repo can be cloned from the internal GitHub repo. The
tip of the master
branch contains the latest commits:
https://github.com/ROCm/rocRoller
git clone --recurse-submodules git@github.com:ROCm/rocRoller.git rocRoller
From inside the docker container launched previously, the library can
be built using these steps. The cloned directory should be available
from the /data
directory in the docker container. First create a
build
directory in the root of the cloned repo and enter that
directory:
mkdir -p build
cd build
CMake will use the default compiler specified in the environment variable CXX
. If you want to set it manually, configure using the cmake command:
CXX=<g++ or clang++ path> CC=<gcc or clang path> cmake .. <cmake options>
Then build:
make -j$(nproc)
- Ninja is an alternative to
make
and is generally faster at building the rocRoller library. Include-G Ninja
in yourcmake
command and then useninja
to build. -DSKIP_CPPCHECK=ON
Skips running every source file throughcppcheck
(skip is on by default). This speeds up builds.-DCMAKE_BUILD_TYPE=Debug
will build a debug version with debugging information for gdb. SpecifyRelease
instead for release builds.-DBUILD_SHARED_LIBS=OFF
will build the project as a static library,ON
(default) will build it as a shared library.-DYAML_BACKEND=YAML_CPP
(default) or-DYAML_BACKEND=LLVM
determines the backend used for YAML serialization.-DINTERNAL_MRISAS=<CUSTOM_MRISA_XML_FILES>
where<CUSTOM_MRISA_XML_FILES>
is a semi-colon delimitted list of paths to MRISA XML files. Allows the user to specify additional MRISA XML files to use during compilation.- To see what options your build directory was configured with, run
cmake -LAH
- It is highly recommended not to reconfigure a build directory with cmake. Doing so can lead to caching problems with the configuration. When in doubt, remove your build directory and start fresh.
-DROCROLLER_USE_PREGENERATED_ARCH_DEF=OFF
will not use the GPUArchitecture yaml file(s) checked into the repo, and will instead generate them from scratch using the MRISA XML files and GPUArchitecture_def file.
There are three ways to launch the test applications:
make test
This will run our unit tests using ctest
(one process per test).
Alternatively, from the build directory simply run:
./rocRollerTests
This command will run all of the tests all together. The tests will
run faster than using make test
, however, if there is an
segmentation fault in any of the tests it will stop.
Individual tests can be run at the command line from the build directory. For example:
./rocRollerTests --gtest_filter="*MemoryInstructionsExecuter.GPU_ExecuteFlatTest1Byte*"
Tests that require a GPU are prefixed with GPU_
. If you are running
on a machine without a supported GPU, you can use:
ctest -LE GPU
or
./bin/rocRollerTests --gtest_filter="-*GPU_*"
A full list of gtests available can be listed using the command:
./bin/rocRollerTests --gtest_list_tests
To prevent exceptions from being eaten by GoogleTest:
--gtest_catch_exceptions=0
This allows catch/throw to work in GDB.
To turn a test failure directly into a GDB breakpoint:
--gtest_break_on_failure
The GPUArchitecture yaml file(s) checked into the repo should be updated anytime there are changes to the underlying MRISA XML files or the GPUArchitecture_def file.
Updated yaml files can be copied from ./build/share/rocRoller/split_yamls/
after
building with -DROCROLLER_USE_PREGENERATED_ARCH_DEF=OFF
.
mkdir ./build
cd ./build
cmake -DROCROLLER_USE_PREGENERATED_ARCH_DEF=OFF ../
make -j GPUArchitecture_def
find ./share/rocRoller/split_yamls/ -type f -name "*.yaml" -exec realpath {} \; | xargs -I {} ../scripts/format_yaml.py -f {} {}
rm -rf ../GPUArchitectureGenerator/pregenerated/*.yaml
cp ./share/rocRoller/split_yamls/*.yaml ../GPUArchitectureGenerator/pregenerated/
To explore GEMM workloads and do performance testing, rocRoller provides a GEMM client. The GEMM client is built alongside the library by default.
To run the GEMM client from your build directory:
./bin/client/gemm --help
Foo.hpp
: Contains definitions for classes, concepts, etc. Functions should be declaration-only. This is meant to allow for easy reading of the interface.Foo_impl.hpp
: Contains definitions for short (inlinable) functions. Should be included at the bottom ofFoo.hpp
. This makes it easy to move definitions between here andFoo.cpp
, since they are not within the class definition.Foo_fwd.hpp
: Contains forward declarations for all types defined inFoo.hpp
, type aliases such asFooPtr
forstd::shared_ptr<Foo>
andstd::variant
. Should have zero or very few includes.Foo.cpp
: Contains definitions for longer functions.
Generally, we want as much as possible to be inlined since performance is important.
- Static and free functions should start with an uppercase character
- Instance functions should start with lowercase
- Private member variables should start with
m_
- Public member variables should not start with
m_
- If there is no invariant, public member variables are ok
- Naming convention should follow camelCase
- Macro names and CMake options should be
UPPER CASE
usingsnake_case
and have a prefixROCROLLER_
- Before submitting code as a PR for review all sources need to be formatted with
clang-format
, version 13.
Code should be writing in a clearly descriptive fashion. Good resource for expressing ideas in code.
While the rocRoller
library does have C++20 features, new C++20 specific features should not be introduced into the code without explicit design review.
In general, C++17 modern coding practices should be followed. This means for example type aliases should be done with using
instead of typedef
.
Memory should not be allocated with new
keyword syntax, instead use managed pointers via either std::make_unique
or std::make_shared
. If an array of elements need to be allocated then a std::vector
can be used.
Documentation of code is as important for maintainability as writing clear and concise code. Documentation is built for rocRoller
using Doxygen, which means contributors need to follow Doxygen style comments.
- Developers must add documentation to describe new data structures such as
Class
declarations and member functions. - Per line comments are important when the semantics of the code section isn't obvious.
- Incomplete features, or sections that need to implemented in future PRs should have a comment with a
TODO
heading.
Note: If using VSCode consider installing the Doxygen Documentation Generator extension. This extension helps with quickly implementing documentation sections in a Doxygen format.
Graphviz and Doxygen both need to be installed to build the html documentation.
The Doxygen documentation can be built via the make command from the main build directory:
make -j$(nproc) docs
This will generate an HTML website from the markdown readme files and Doxygen comments and source/object relationships. The index.html
file can be found in doc/html
from the root of your local repository.
- PRs submitting should have fewer than
500
lines of code change, unless of course those changes are merely lines moved or deleted. - If a new feature cannot be implemented in fewer than
500
lines of code then consider either refactoring, or splitting the feature into two or more PRs. - When opening a PR, details of what the feature does are essential.
- PR authors should provide a description on how to review the PR, such as, outlining key changes. This is especially important if the PR is complex, novel, or on the higher side of lines of code change.
- As mentioned in the coding style section, PR code needs to be formatted with
clang-format
version 13. This can be done manually. - rocRoller's dockers already have this version of
clang-format
and formatting can be automated using githooks. The command for installing the hook to the local repo:.githooks/install
- If a PR alters the compilation process in a way that causes the performance job to fail when building the master branch, adding the label
ci:no-build-master
will instead compare the PR against the latest build of the master branch.
Each new feature is required to have a test.
- Test sources are placed in the
test
folder. - CPP Files for Unit Tests should be included in the
rocRollerTests
executable in CMakeLists.txt.
Some tests require multiple threads for properly testing a desired or undesired behaviour (e.g. thread-safety) or to benefit from faster execution. Therfore, it is recommended to set OMP_NUM_THREADS
appropriately. A value between [NUM_PHYSICAL_CORES/2, NUM_PHYSICAL_CORES)
is recommended. Setting OMP_NUM_THREADS
to the number of available cores or higher can cause test to run slower due to oversubscription (e.g. increased contention).
Note a few conditions:
- If your test requires a context but does not actually need to run on a GPU, inherit from
GenericContextFixture
. - If your test needs to run on an actual GPU, inherit from
CurrentGPUContextFixture
. This:- provides functionality to assemble a kernel
- sets up
targetArchitecture()
to reflect a real GPU. - will include functionality to run a kernel in the future.
- Name your test starting with
GPU_
. This ensures that it is can be filtered out when running on a CPU-only node.
- If your test needs to run against different option values, use
Settings::set()
to run the test with those values (overwriting previous values in memory). Once a test is completed, the context fixture callsSettings::reset()
, which resets the settings to the env vars (if set) or default settings.
Catch2 is the preferred unit testing framework for new unit tests. GTest information for working with older unit tests can be found in the GoogleTest User's Guide.
Additionally, there are tests that run against guidepost kernels that have been generated by Tensile. See the README for more information.
- By default, logging messages of level
info
and above are sent to the console. - To disable console output, set the environment variable
ROCROLLER_LOG_CONSOLE=0
. - To change the logging level, set the environment variable
ROCROLLER_LOG_LEVEL
. For example, settingROCROLLER_LOG_LEVEL
toDebug
will cause debug messages to be emitted. - To log to a file, set the environment variable
ROCROLLER_LOG_FILE
to a file name.
- Setting the environment variable
ROCROLLER_SAVE_ASSEMBLY=1
will cause assembly code to be written to a text file in the current working directory as it is generated. The file name is based on the kernel name and has a.s
extension. To manually set the assembly file name to something else, setROCROLLER_ASSEMBLY_FILE
to the desired name. - Setting the environment variable
ROCROLLER_RANDOM_SEED
to an unsigned integer value will set the seed of theRandomGenerator
used by the unit tests. - Setting the environment variable
ROCROLLER_BREAK_ON_THROW=1
will cause exceptions thrown directly by library code to cause a segfault. This causes GDB to break at the original point of failure, instead of at the point where it is rethrown by the Generator class.- You can also set a breakpoint at the constructor of an exception class (e.g.
b std::bad_variant_access::bad_variant_access()
), and GDB will more reliably break there than oncatch throw
.
- You can also set a breakpoint at the constructor of an exception class (e.g.
- Setting
ROCROLLER_ARCHITECTURE_FILE
will overwrite the default GPU architecture file generated atsource/rocRoller/GPUArchitecture_def.msgpack
. Currently supported file formats are YAML and Msgpack. - STL exceptions will not be affected by this. Sometimes a better stack trace can be obtained by placing a breakpoint in the constructor of the particular exception that is being thrown.
To explicitly enable/disable some of the mentioned environment variables, ROCROLLER_DEBUG
can be set accordingly. ROCROLLER_DEBUG
is a bit field that aggregates options together. The following options are covered by ROCROLLER_DEBUG
:
Option | Respective Bit |
---|---|
ROCROLLER_LOG_CONSOLE |
0 (0x0001) |
ROCROLLER_SAVE_ASSEMBLY |
1 (0x0002) |
ROCROLLER_BREAK_ON_THROW |
2 (0x0004) |
Therefore, enabling ROCROLLER_LOG_CONSOLE
and ROCROLLER_BREAK_ON_THROW
and disabling ROCROLLER_SAVE_ASSEMBLY
would require ROCROLLER_DEBUG=0x0005
. Enabling/disabling all corresponding options requires ROCROLLER_DEBUG=0xFFFF
/ ROCROLLER_DEBUG=0X0000
, respectively.
AssertFatals
can be used in place of assert
. AssertFatal
is similar to assert
but offers more debugging functionality.
An example usage:
auto vr = std::make_shared<Register::Value>(m_context, Register::Type::Vector, DataType::Int32, 1);
auto ar = std::make_shared<Register::Value>(m_context, Register::Type::Scalar, DataType::Int32, 1);
// This call will NOT exit the program
AssertFatal(vr->registerCount() == ar->registerCount(), "Register counts are not equivalent");
// This call WILL exit the program and print relevant info
AssertFatal(vr->regType() == ar->regType(), ShowValue(Register::toString(vr->regType()),
ShowValue(Register::toString(ar->regType()), "Register types are not equivalent");
The first argument is a condition that is checked, ShowValue()
is a macro defined in rocRoller/Error_impl.hpp
where it prints out the variable name and value. The last argument is a custom message the developer can print to screen. If the condition is false then the program exits and the file name, line number, and corresponding arguments passed to AssertFatal are then printed.
Note: The condition check must be the first argument, but the other arguments can be in any order and are optional.
The use of Throw<FatalError>("message")
can also be used to catch incorrect code.
On Linux, we can use the "perf" tool to sample the callgraph and generate a flame graph.
Using FlameGraph: https://github.com/brendangregg/FlameGraph
CMake configure with -DROCROLLER_ENABLE_TIMERS=ON
and compile.
Then run with the steps outlined in scripts/flamegraph.py:
perf record -F 99 -g ./bin/rocRollerTests --gtest_filter="KernelGraph*03"
perf script > out.perf
~/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
~/FlameGraph/flamegraph.pl out.folded > kernel.svg
Additionally, when using the trace Dockerfile, you can invoke rrperf to profile RocRoller or Tensile guideposts with Omniperf. To see how this works, check rrperf's help documentation:
./scripts/rrperf profile --help
Invoke the rrperf tool to run performance tests. The autoperf command will test the performance of multiple commits and your current workspace. To see how this works, check rrperf's help documentation:
./scripts/rrperf autoperf --help
Setting the environment variable ROCROLLER_KERNEL_ANALYSIS=1
will enable the following kernel analysis features built into rocRoler.
- Register Liveness
- The file
${ROCROLLER_ASSEMBLY_FILE}.live
is created which reports register liveness. - Legend:
Symbol Meaning space "Dead" register :
"Live" register ^
This register is written to this instruction. v
This register is read from this instruction. x
This register is written to and read from this instruction. _
This register is allocated but dead.
- The file
To view a summary plot of the generated file, run:
python3 ./scripts/plot_liveness.py --help && \
python3 ./scripts/plot_liveness.py ${ROCROLLER_ASSEMBLY_FILE}.live
and visit http://127.0.0.1:8050/.
The kgraph script can be run on an assembly file or on log output to generate a .dot file or rendered .pdf of the internal graph. Run it with the --help
option to see invocation.
When using the diff functionality, the coloring is as follows:
- Red means the node will be removed in the next step
- Blue means the node was added since the last step
- Yellow means both, and the node is temporary for this step
Run the following on RocRoller output in a .log file to extract dots printed with logger->debug("CommandKernel::generateKernelGraph: {}", m_kernelGraph.toDOT());:
./kgraph.py gemm.log -o dots
Log files can be found in the output of RRPerf runs. Use --omit_diff to omit colouring difference when extracting log output.
Run the following to compare multiple dot files:
./dot_diff.py dots_0000.dot dots_0001.dot dots_0002.dot -o dots
The GEMM client can produce memory trace files, using --visualize=True
, which can be rendered using the show_matrix script.
The following commands can be used to visualize memory access patterns to png files:
$ cd ${build_dir}
$ ./bin/client/gemm --M=512 --N=768 --K=512 --mac_m=128 --mac_n=256 --mac_k=16 --alpha=2.0 --beta=0.5 --workgroup_size_x=256 --workgroup_size_y=1 --type_A=half --type_B=half --type_C=half --type_D=half --type_acc=float --num_warmup=2 --num_outer=10 --num_inner=1 --trans_A=N --trans_B=T --loadLDS_A=True --loadLDS_B=True --storeLDS_D=False --scheduler=Priority --visualize=True --match_memory_access=False
Visualizing to gemm.vis
Wrote workitem_A.dat
Wrote workitem_B.dat
Wrote LDS_A.dat
Wrote LDS_B.dat
Wrote workgroups_A.dat
Wrote workgroups_B.dat
Wrote loop_idx_A.dat
Wrote loop_idx_B.dat
Result: Correct
RNorm: 1.4083e-05
$ ../scripts/show_matrix *.dat
Wrote LDS_A.png
Wrote LDS_B.png
Wrote loop_idx_A.png
Wrote loop_idx_B.png
Wrote workgroups_A.png
Wrote workgroups_B.png
Wrote workitem_A.png
Wrote workitem_B.png