
Performance measurement

Vasily Philipov edited this page Sep 22, 2021 · 5 revisions

The performance measurement library runs performance tests (in the current thread) on the various UCX communication APIs. Its purpose is to let a developer optimize the code and immediately measure the effect.
The infrastructure provides both an API (libperf.h) and a command-line utility ucx_perftest.
The API is tested as part of the unit tests.
Location in the code tree: src/tools/perf

Features of the API:

  • uct_perf_test_run() is the function that runs the test (currently only the UCT API is supported).
  • No resource allocation is needed: just pass the test parameters to the API.
  • The function must run on 2 threads/processes/nodes; the caller passes RTE callbacks, which are used to bootstrap the connections.
  • Two testing modes: ping-pong and unidirectional stream (bi-directional stream is TBD).
  • Configurable message size and data layout (short/bcopy/zcopy).
  • Supports warmup cycles and unlimited iterations.
  • The UCT active-message stream is measured with simple flow control.
  • The test driver is written in C++ (with C linkage) to take advantage of templates.
  • Results are reported to a callback function at the specified intervals and are also returned from the API call.
    • Results include latency, message rate, and bandwidth, each as an iteration average and an overall average.

Features of ucx_perftest:

  • Has a predefined list of tests, which are valid combinations of operation and testing mode.
  • Can be run as a client-server application, as an MPI application, or using libRTE.
  • Supports CSV output and numeric formatting.
  • Supports "batch mode": write the list of tests to run to a text file (see the example in contrib/perf) and run them one after another. Every line is a list of arguments that the tool would normally read as command-line options. They are appended to any other command-line arguments that were passed.
    • "Cartesian" mode: if several batch files are specified, all possible combinations are executed.
  • Can be compiled with MPI support and launched with mpirun. To enable this, add --with-mpi to the UCX ./configure command line.
  • Supports loopback mode, in which the process communicates with itself, so passing a server hostname is not allowed.
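
To make the "Cartesian" batch mode more concrete, here is a minimal shell sketch that only enumerates the combinations two batch files would describe. The file names, their contents, and the cartesian helper are hypothetical and are not taken from contrib/perf. The line format (first word is a test name, the rest are options) follows the -b description in the help text. The sketch does not reproduce how ucx_perftest itself parses, names, or orders the combined tests.

```shell
#!/bin/sh
# Illustration only: enumerate the combinations that two batch files describe.
# File names and contents are hypothetical, not copied from contrib/perf.
workdir=$(mktemp -d)

# First word of each line is a test name, the rest are ucx_perftest options.
cat > "$workdir/tests.txt" <<'EOF'
lat -t tag_lat
bw -t tag_bw
EOF

cat > "$workdir/sizes.txt" <<'EOF'
8 -s 8
4096 -s 4096
EOF

# Pair every line of file $1 with every line of file $2.
cartesian() {
    while read -r name1 args1; do
        while read -r name2 args2; do
            printf '%s/%s: %s %s\n' "$name1" "$name2" "$args1" "$args2"
        done < "$2"
    done < "$1"
}

combos=$(cartesian "$workdir/tests.txt" "$workdir/sizes.txt")
printf '%s\n' "$combos"
rm -rf "$workdir"
```

Two files with two lines each yield four combinations, one per printed line. Adding a third file with three lines would yield twelve.
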
$ ucx_perftest  -h
  Note: test can be also launched as an MPI application

  Usage: lt-ucx_perftest [ server-hostname ] [ options ]

  Common options:
     -t <test>      test to run:
                        am_lat - active message latency
                       put_lat - put latency
                       add_lat - atomic add latency
                           get - get latency / bandwidth / message rate
                          fadd - atomic fetch-and-add latency / message rate
                          swap - atomic swap latency / message rate
                         cswap - atomic compare-and-swap latency / message rate
                         am_bw - active message bandwidth / message rate
                        put_bw - put bandwidth / message rate
                        add_mr - atomic add message rate
                       tag_lat - UCP tag match latency
                        tag_bw - UCP tag match bandwidth
                   tag_sync_lat - UCP tag sync match latency
                   ucp_put_lat - UCP put latency
                    ucp_put_bw - UCP put bandwidth
                       ucp_get - UCP get latency / bandwidth / message rate
                       ucp_add - UCP atomic add bandwidth / message rate
                      ucp_fadd - UCP atomic fetch-and-add latency / bandwidth / message rate
                      ucp_swap - UCP atomic swap latency / bandwidth / message rate
                     ucp_cswap - UCP atomic compare-and-swap latency / bandwidth / message rate
                     stream_bw - UCP stream bandwidth
                    stream_lat - UCP stream latency
     -s <size>      list of scatter-gather sizes for single message (8)
                    for example: "-s 16,48,8192,8192,14"
     -n <iters>     number of iterations to run (1000000)
     -w <iters>     number of warm-up iterations (10000)
     -c <cpu>       set affinity to this CPU (off)
     -O <count>     maximal number of uncompleted outstanding sends (1)
     -i <offset>    distance between consecutive scatter-gather entries (0)
     -l <loopback>  use loopback connection, in this case,
                    the process will communicate with itself,
                    so passing server hostname is not allowed
     -T <threads>   number of threads in the test (1), if >1 implies "-M multi" for UCP
     -B             register memory with NONBLOCK flag
     -b <file>      read and execute tests from a batch file: every line in the
                    file is a test to run, first word is test name, the rest of
                    the line is command-line arguments for the test.
     -p <port>      TCP port to use for data exchange (13337)
     -P <0|1>       disable/enable MPI mode (0)
     -m <mem type>  memory type of messages
                        host - system memory(default)
     -h             show this help message

  Output format:
     -N             use numeric formatting (thousands separator)
     -f             print only final numbers
     -v             print CSV-formatted output

  UCT only:
     -d <device>    device to use for testing
     -x <tl>        transport to use for testing
     -D <layout>    data layout for sender side:
                        short - short messages API (default, cannot be used for get)
                        bcopy - copy-out API (cannot be used for atomics)
                        zcopy - zero-copy API (cannot be used for atomics)
                        iov    - scatter-gather list (iovec)
     -W <count>     flow control window size, for active messages (127)
     -H <size>      active message header size (8)
     -A <mode>      asynchronous progress mode (thread)
                        thread - separate progress thread
                        signal - signal-based timer

  UCP only:
     -M <thread>    thread support level for progress engine (single)
                        single     - only the master thread can access
                        serialized - one thread can access at a time
                        multi      - multiple threads can access
     -D <layout>[,<layout>]
                    data layout for sender and receiver side (contig)
                        contig - Continuous datatype
                        iov    - Scatter-gather list
     -C             use wild-card tag for tag tests
     -U             force unexpected flow by using tag probe
     -r <mode>      receive mode for stream tests (recv)
                        recv       : Use ucp_stream_recv_nb
                        recv_data  : Use ucp_stream_recv_data_nb

Example - client/server mode

Start the server:

$ ucx_perftest -c 0
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API:          protocol layer                                                             |
| Test:         UCP tag match latency                                                      |
| Data layout:  (automatic)                                                                |
| Message size: 8                                                                          |
+------------------------------------------------------------------------------------------+

Connect the client:

$ ucx_perftest vegas08 -t tag_lat -c 0
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
        592840     0.843     0.843     0.843       9.05       9.05     1185680     1185680
       1000000     0.840     0.843     0.843       9.05       9.05     1185782     1185721
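
The bandwidth column can be cross-checked against the message rate column. The sketch below assumes that bandwidth is message rate times message size and that 1 MB means 2^20 bytes. Both assumptions are inferred from the sample numbers above, not taken from the tool's documentation, and bw_mbs is a hypothetical helper name.

```shell
#!/bin/sh
# Bandwidth in MB/s from message rate (msg/s) and message size (bytes).
# Assumption: 1 MB = 2^20 bytes, inferred from the sample output.
bw_mbs() {
    awk -v rate="$1" -v size="$2" \
        'BEGIN { printf "%.2f\n", rate * size / (1024 * 1024) }'
}

# First row of the table: 1185680 msg/s with the default 8-byte message.
bw_mbs 1185680 8
```

For the first row this prints 9.05, which matches the reported bandwidth.
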

Example - with MPI

$ salloc -N2 --ntasks-per-node=1 mpirun --bind-to core --display-map ucx_perftest -d mlx5_1:1 \
                                        -x rc_mlx5 -t put_lat
salloc: Granted job allocation 6991
salloc: Waiting for resource configuration
salloc: Nodes clx-orion-[001-002] are ready for job
 Data for JOB [62403,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: clx-orion-001   Num slots: 1    Max slots: 0    Num procs: 1
        Process OMPI jobid: [62403,1] App: 0 Process rank: 0

 Data for node: clx-orion-002   Num slots: 1    Max slots: 0    Num procs: 1
        Process OMPI jobid: [62403,1] App: 0 Process rank: 1

 =============================================================
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
        586527     0.845     0.852     0.852       4.47       4.47      586527      586527
       1000000     0.844     0.848     0.851       4.50       4.48      589339      587686
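
When the output is post-processed by a script, the last data row holds the final overall numbers. The sketch below pulls them out with awk. It assumes the table layout shown above (eight whitespace-separated columns, data rows starting with a plain integer) and that -N numeric formatting is not in use. final_overall is a hypothetical helper, and the sample rows are copied from the output above.

```shell
#!/bin/sh
# Keep the overall latency, bandwidth, and message rate of the last data row.
# Columns: 1 iterations, 2-4 latency (typical, average, overall),
#          5-6 bandwidth (average, overall), 7-8 message rate (average, overall).
final_overall() {
    awk '$1 ~ /^[0-9]+$/ { lat = $4; bw = $6; rate = $8 }
         END { printf "latency_usec=%s bandwidth_MBs=%s msg_rate=%s\n", lat, bw, rate }'
}

sample='| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
        586527     0.845     0.852     0.852       4.47       4.47      586527      586527
       1000000     0.844     0.848     0.851       4.50       4.48      589339      587686'

summary=$(printf '%s\n' "$sample" | final_overall)
printf '%s\n' "$summary"
```

In practice the same function could sit at the end of a pipeline, for example after the mpirun command, since header and job-map lines do not start with an integer and are skipped.
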