Performance Evaluation Library Using Linux perf for ARM Processors

Introduction

The perf_helper library measures performance through Linux perf by collecting values from the PMU counters built into ARM processors. It supports interval-based measurement: you insert function calls into the source code to mark the start and end of each measurement section. The number of PMU counters that can be measured simultaneously is limited by the CPU architecture. Mechanisms such as multiplexing can increase the apparent number of simultaneously measurable counters, but this library prioritizes measurement accuracy and does not use them.
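To make the accuracy trade-off concrete: when the Linux kernel multiplexes events, each event occupies a hardware counter for only part of the measured interval, and tools typically extrapolate the raw count from the time_enabled and time_running values that perf_event_open can report (read formats PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING). The C sketch below shows that extrapolation. The function name scale_multiplexed_count is hypothetical and is not part of perf_helper, which, as described above, does not use multiplexing.

```c
#include <stdint.h>

/* Estimate the full-interval count of a multiplexed event.
 *   raw          : value read from the counter
 *   time_enabled : nanoseconds the event was enabled
 *   time_running : nanoseconds the event actually occupied a counter
 * Returns raw unchanged when the event was never descheduled, and 0 when
 * it never ran (there is nothing to extrapolate from). */
uint64_t scale_multiplexed_count(uint64_t raw,
                                 uint64_t time_enabled,
                                 uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    if (time_running >= time_enabled)
        return raw;  /* exact: no multiplexing occurred */
    /* Linear extrapolation, rounded to nearest. The result is an
     * estimate: it assumes the event rate was uniform over the interval. */
    return (uint64_t)((long double)raw * time_enabled / time_running + 0.5L);
}
```

The scaled value is only as good as the uniform-rate assumption, which short or phase-heavy sections tend to violate; measuring fewer events per run sidesteps the estimate entirely.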

PMU driver: armv8_pmuv3_0
Compilers:
- GCC (gcc, gfortran)
- ARM Compiler (armclang, armflang)
- Fujitsu Compiler (fccpx, frtpx)

Build perf_helper library

To build the perf_helper library, run make with the COMPILER variable set for your toolchain.

For GCC: make COMPILER=gcc
For ARM Compiler: make COMPILER=acfl
For Fujitsu Compiler: make COMPILER=fj

Adding Section Measurements

To measure performance within specific sections of your code:

Use perf_initialize and perf_finalize outside parallel regions.
Use perf_start_section and perf_stop_section within parallel regions.
Specify the events to measure using the PERF_EVENTS environment variable.

Note: Section IDs range from 0 to 99, and nested sections are supported.
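The idea behind interval-based sections can be sketched as follows: record a counter value when a section starts, and add the difference to a per-section total when it stops. This is an illustrative model only, not the perf_helper implementation; the type section_table and the functions section_start and section_stop are hypothetical names, and the counter value is passed in as a plain argument so the sketch stays self-contained.

```c
#include <stdint.h>

#define MAX_SECTIONS 100  /* section IDs 0..99, as noted above */

/* Hypothetical per-thread bookkeeping for a single counter. */
typedef struct {
    uint64_t start[MAX_SECTIONS];  /* counter value at the last start */
    uint64_t total[MAX_SECTIONS];  /* accumulated count per section */
    int      open[MAX_SECTIONS];   /* 1 while the section is running */
} section_table;

/* Record the counter value at section entry.
 * Returns 0 on success, -1 if the ID is out of range or already open. */
int section_start(section_table *t, int id, uint64_t counter_now)
{
    if (id < 0 || id >= MAX_SECTIONS || t->open[id])
        return -1;
    t->start[id] = counter_now;
    t->open[id] = 1;
    return 0;
}

/* Add the elapsed count to the section total.
 * Returns 0 on success, -1 if the ID is out of range or not open. */
int section_stop(section_table *t, int id, uint64_t counter_now)
{
    if (id < 0 || id >= MAX_SECTIONS || !t->open[id])
        return -1;
    t->total[id] += counter_now - t->start[id];
    t->open[id] = 0;
    return 0;
}
```

Because each ID keeps its own start value, an outer section keeps counting while inner sections start and stop, which is one simple way nesting can work. Repeated start/stop pairs on the same ID, as in a loop, accumulate into the same total.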

Code Examples

C

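#include "perf_helper.h"

void compute(int n, double x);  /* workload, defined elsewhere */

int main(void) {
    int n = 1000000;
    double x = 0.0;  /* initialized before being passed by value */

    perf_initialize();            /* outside the parallel region */
    #pragma omp parallel
    {
        perf_start_section(0);    /* outer section enclosing 1 and 2 */
        perf_start_section(1);
        compute(n, x);
        perf_stop_section(1);
        perf_start_section(2);
        compute(n, x);
        perf_stop_section(2);
        perf_stop_section(0);
    }
    perf_finalize();              /* outside the parallel region */
    return 0;
}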

Fortran90

program main
  use perf_helper_mod
  implicit none
  integer, parameter :: n = 1000000
  double precision :: x
  integer :: i

  call perf_initialize()
  !$omp parallel
  call perf_start_section(0)
  do i = 1, 10
    ! Section 1
    call perf_start_section(1)
    call sample(n, x)
    call perf_stop_section(1)
    ! Section 2
    call perf_start_section(2)
    call sample(n, x)
    call perf_stop_section(2)
  end do
  call perf_stop_section(0)
  !$omp end parallel
  call perf_finalize()
end program main

Compilation

For C (gcc)

#!/bin/sh
gcc -fopenmp -c main.c -o main.o
gcc -fopenmp -c test.c -o test.o
gcc -fopenmp main.o test.o -lperf_helper

For Fortran90 (gfortran)

#!/bin/sh
gfortran -fopenmp -c main.f90 -o main.o
gfortran -fopenmp -c test.f90 -o test.o
gfortran -fopenmp main.o test.o -lperf_helper

Execution

Sample script:

This script executes the load module nine times, measuring one counter set per run. The counter sets are defined by the shell variables COUNTER1 through COUNTER9. Because the measurement results are written to standard output, redirecting them to a file is recommended.

#!/bin/sh
# 1.Performance
COUNTER1="INST_SPEC,CPU_CYCLES,STALL_FRONTEND,STALL_BACKEND,FP_SCALE_OPS_SPEC,FP_FIXED_OPS_SPEC"
# 2.Instruction Mix
COUNTER2="CPU_CYCLES,BR_IMMED_SPEC,BR_INDIRECT_SPEC,LD_SPEC,ST_SPEC"
COUNTER3="DP_SPEC,VFP_SPEC,ASE_INST_SPEC,SVE_INST_SPEC"
# 3.Cache Effectiveness
COUNTER4="INST_RETIRED,L1D_CACHE,L1D_CACHE_REFILL,L1D_CACHE_WB"
COUNTER5="L2D_CACHE,L2D_CACHE_REFILL,L2D_CACHE_WB,LL_CACHE_RD,LL_CACHE_MISS_RD"
# 4.TLB Effectiveness
COUNTER6="INST_RETIRED,DTLB_WALK,ITLB_WALK,L1I_TLB_REFILL,L1D_TLB_REFILL"
COUNTER7="L1D_TLB,L1I_TLB,L2D_TLB_REFILL,L2D_TLB"
# Performance Debug with implementation defined counters
COUNTER8="0x0E1,0x0E2,0x15B,0x158,0x159,0x15A"
COUNTER9="0x15C,0x15D,0x15E,0x15F,0x160"

LD=${1-"./a.out"}
export OMP_NUM_THREADS=${2-"2"}
export THREAD_STACK_SIZE=8192
export OMP_PROC_BIND=spread

STA=0
END=`expr ${STA} + ${OMP_NUM_THREADS} - 1`

for i in `seq 1 9`;do
  C=COUNTER${i}
  export PERF_EVENTS=`eval echo '$'$C`
  taskset -c ${STA}-${END} ${LD}
done
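
The counter sets above mix symbolic event names with raw hexadecimal event codes in one comma-separated list. How perf_helper parses that list is not shown here; the C sketch below is one hypothetical way to split such a string and tell the two forms apart. The names event_spec, parse_event_list, MAX_EVENTS, and MAX_NAME are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_EVENTS 16  /* hypothetical cap for this sketch */
#define MAX_NAME   64

typedef struct {
    int           is_raw;          /* 1 if given as a hex code such as 0x15B */
    unsigned long raw_code;        /* valid only when is_raw is 1 */
    char          name[MAX_NAME];  /* token text as written */
} event_spec;

/* Split a comma-separated list into event_spec entries.
 * Returns the number of events parsed, or -1 on an empty token,
 * an over-long token, or more than MAX_EVENTS entries. */
int parse_event_list(const char *list, event_spec out[MAX_EVENTS])
{
    int n = 0;
    const char *p = list;

    while (*p != '\0') {
        size_t len = strcspn(p, ",");  /* length of the next token */
        if (len == 0 || len >= MAX_NAME || n == MAX_EVENTS)
            return -1;
        memcpy(out[n].name, p, len);
        out[n].name[len] = '\0';

        /* A token is a raw code only if it starts with "0x" and the
         * whole token parses as hexadecimal. */
        char *end;
        unsigned long code = strtoul(out[n].name, &end, 16);
        out[n].is_raw = (strncmp(out[n].name, "0x", 2) == 0 && *end == '\0');
        out[n].raw_code = out[n].is_raw ? code : 0;
        n++;

        p += len;
        if (*p == ',') {
            p++;
            if (*p == '\0')
                return -1;  /* trailing comma leaves an empty token */
        }
    }
    return n;
}
```

Counting the parsed entries before starting a run is also a cheap way to check a list against the number of counters the target CPU provides.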

Analyze performance counter output

Once the measurement results have been redirected to a file such as output.txt, the counter values can be analyzed using the following command. Set the --cpu option to match the processor on which the measurements were taken.

python3 anal.py --cpu {fugaku|graviton3e|graviton4|gracecpu} output.txt
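
The metrics that anal.py reports are not listed here. As a rough illustration of what the counter sets in the sample script make possible, the C sketch below computes three commonly used ratios: instructions per cycle, a cache miss ratio (for example L1D_CACHE_REFILL over L1D_CACHE), and misses per thousand instructions. The function names are hypothetical and the formulas are generic conventions, not a description of the analysis script.

```c
/* Instructions per cycle; returns 0.0 when no cycles were counted. */
double ipc(unsigned long long instructions, unsigned long long cycles)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Fraction of accesses that caused a refill,
 * e.g. L1D_CACHE_REFILL / L1D_CACHE. */
double miss_ratio(unsigned long long refills, unsigned long long accesses)
{
    return accesses ? (double)refills / (double)accesses : 0.0;
}

/* Refills per 1000 retired instructions (MPKI),
 * e.g. L1D_CACHE_REFILL with INST_RETIRED from the same run. */
double mpki(unsigned long long refills, unsigned long long inst_retired)
{
    return inst_retired ? 1000.0 * (double)refills / (double)inst_retired
                        : 0.0;
}
```

Both operands of a ratio should come from the same run, which may be why the sample counter sets repeat CPU_CYCLES or INST_RETIRED alongside the events they normalize.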
