The perf_helper library is designed for performance measurement using perf.
Performance measurement is conducted by collecting values from PMU counters built into ARM processors.
It supports interval-based measurement by inserting functions into the source code to mark the start and end of measurement sections.
The number of PMU counters that can be measured simultaneously is limited by the CPU architecture.
Although mechanisms like multiplexing exist to increase the apparent number of simultaneously measurable counters, this library prioritizes measurement accuracy and does not incorporate such mechanisms.
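Without multiplexing, measuring more events than the hardware can count at once means running the program several times with a different event set each time. The following is a minimal sketch of splitting a long event list into fixed-size groups. The helper name split_events and the group size passed to it are illustrative assumptions, not part of perf_helper; the actual limit depends on the CPU.

```shell
#!/bin/sh
# split_events MAX "EV1,EV2,..."
# Hypothetical helper: prints one comma-separated group per line,
# each group holding at most MAX events.
split_events() {
    max=$1
    group=""
    count=0
    old_ifs=$IFS
    IFS=,
    for ev in $2; do
        # Emit the current group once it is full, then start a new one.
        if [ "$count" -eq "$max" ]; then
            echo "$group"
            group=""
            count=0
        fi
        group="${group:+$group,}$ev"
        count=$((count + 1))
    done
    IFS=$old_ifs
    # Emit the last, possibly partial, group.
    if [ -n "$group" ]; then
        echo "$group"
    fi
}

split_events 2 "CPU_CYCLES,INST_RETIRED,L1D_CACHE,L1D_CACHE_REFILL,L1D_CACHE_WB"
```

Each printed line can then serve as the event set for one separate run of the program.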
- PMU driver: armv8_pmuv3_0
- Compilers:
  - GCC (gcc, gfortran)
  - ARM Compiler (armclang, armflang)
  - Fujitsu Compiler (fccpx, frtpx)
To build the perf_helper library, run make with the COMPILER setting for your compiler:
- For GCC:
  make COMPILER=gcc
- For ARM Compiler:
  make COMPILER=acfl
- For Fujitsu Compiler:
  make COMPILER=fj
To measure performance within specific sections of your code:
- Use perf_initialize and perf_finalize outside parallel regions.
- Use perf_start_section and perf_stop_section within parallel regions.
- Specify the events to measure using the PERF_EVENTS environment variable.
Note: Section IDs range from 0 to 99, and nested sections are supported.
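As a minimal sketch of selecting events for a single run, the helper below sets PERF_EVENTS only for the command it launches. The name run_with_events is hypothetical, and ./a.out stands in for a binary linked with perf_helper. The event names are taken from the counter sets used elsewhere in this document.

```shell
#!/bin/sh
# run_with_events EVENTS CMD [ARGS...]
# Hypothetical helper: runs CMD with PERF_EVENTS set to EVENTS for that
# command only, so the calling shell's environment is left unchanged.
run_with_events() {
    events=$1
    shift
    env PERF_EVENTS="$events" "$@"
}

# Example (assumes ./a.out was linked with perf_helper):
#   run_with_events "CPU_CYCLES,INST_RETIRED" ./a.out > output.txt
```

Going through env keeps the setting scoped to one run, which is convenient when several event sets are measured one after another.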
/* Example: nested measurement sections in C */
#include "perf_helper.h"

void compute(int n, double x);

int main() {
    int n = 1000000;
    double x = 0.0;  /* initialized so compute() receives a defined value */

    perf_initialize();
#pragma omp parallel
    {
        perf_start_section(0);   /* outer section enclosing 1 and 2 */
        perf_start_section(1);
        compute(n, x);
        perf_stop_section(1);
        perf_start_section(2);
        compute(n, x);
        perf_stop_section(2);
        perf_stop_section(0);
    }
    perf_finalize();
    return 0;
}
! Example: nested measurement sections in Fortran
program main
  use perf_helper_mod
  implicit none
  integer, parameter :: n = 1000000
  double precision :: x
  integer :: i

  call perf_initialize()
  !$omp parallel
  call perf_start_section(0)
  do i = 1, 10
    ! Section 1
    call perf_start_section(1)
    call sample(n, x)
    call perf_stop_section(1)
    ! Section 2
    call perf_start_section(2)
    call sample(n, x)
    call perf_stop_section(2)
  end do
  call perf_stop_section(0)
  !$omp end parallel
  call perf_finalize()
end program main
#!/bin/sh
# Compile the C example and link it against perf_helper
gcc -fopenmp -c main.c -o main.o
gcc -fopenmp -c test.c -o test.o
gcc -fopenmp main.o test.o -lperf_helper
#!/bin/sh
# Compile the Fortran example and link it against perf_helper
gfortran -fopenmp -c main.f90 -o main.o
gfortran -fopenmp -c test.f90 -o test.o
gfortran -fopenmp main.o test.o -lperf_helper
This script runs the load module nine times, measuring one counter set per run. The counter sets are defined by the shell variables COUNTER1 through COUNTER9. Because the measurement results are written to standard output, redirect them to a file.
#!/bin/sh
# 1.Performance
COUNTER1="INST_SPEC,CPU_CYCLES,STALL_FRONTEND,STALL_BACKEND,FP_SCALE_OPS_SPEC,FP_FIXED_OPS_SPEC"
# 2.Instruction Mix
COUNTER2="CPU_CYCLES,BR_IMMED_SPEC,BR_INDIRECT_SPEC,LD_SPEC,ST_SPEC"
COUNTER3="DP_SPEC,VFP_SPEC,ASE_INST_SPEC,SVE_INST_SPEC"
# 3.Cache Effectiveness
COUNTER4="INST_RETIRED,L1D_CACHE,L1D_CACHE_REFILL,L1D_CACHE_WB"
COUNTER5="L2D_CACHE,L2D_CACHE_REFILL,L2D_CACHE_WB,LL_CACHE_RD,LL_CACHE_MISS_RD"
# 4.TLB Effectiveness
COUNTER6="INST_RETIRED,DTLB_WALK,ITLB_WALK,L1I_TLB_REFILL,L1D_TLB_REFILL"
COUNTER7="L1D_TLB,L1I_TLB,L2D_TLB_REFILL,L2D_TLB"
# Performance Debug with implementation defined counters
COUNTER8="0x0E1,0x0E2,0x15B,0x158,0x159,0x15A"
COUNTER9="0x15C,0x15D,0x15E,0x15F,0x160"
LD=${1-"./a.out"}
export OMP_NUM_THREADS=${2-"2"}
export THREAD_STACK_SIZE=8192
export OMP_PROC_BIND=spread
STA=0
END=`expr ${STA} + ${OMP_NUM_THREADS} - 1`
for i in `seq 1 9`;do
C=COUNTER${i}
export PERF_EVENTS=`eval echo '$'$C`
taskset -c ${STA}-${END} ${LD}
done
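The loop in this script relies on two small shell techniques: looking up a variable whose name is built at run time (COUNTER plus the loop index), and computing the core range passed to taskset -c. The sketch below isolates both so they can be tried without the library. The function names lookup_counter and core_range are hypothetical, and only two shortened counter sets are defined for illustration.

```shell
#!/bin/sh
COUNTER1="CPU_CYCLES,INST_RETIRED"
COUNTER2="L1D_CACHE,L1D_CACHE_REFILL"

# lookup_counter N: print the value of the variable COUNTER<N>.
# eval expands the variable name first, then the variable itself.
lookup_counter() {
    eval "printf '%s\n' \"\${COUNTER$1}\""
}

# core_range START NTHREADS: print the "first-last" core list,
# one core per thread, in the form accepted by taskset -c.
core_range() {
    echo "$1-$(( $1 + $2 - 1 ))"
}

for i in 1 2; do
    echo "run $i: events=$(lookup_counter "$i") cores=$(core_range 0 2)"
done
```

With a starting core of 0 and two threads, the range is 0-1, which matches the script's default of OMP_NUM_THREADS=2.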
The counter values written to output.txt can be analyzed with the following command.
Set the --cpu option to match the environment in which the measurements were taken.
python3 anal.py --cpu {fugaku|graviton3e|graviton4|gracecpu} output.txt