Kent Knox edited this page Aug 14, 2013 · 2 revisions

clFFT client program

The clFFT client program ships with the clFFT library package. It is more than a sample application demonstrating the use of the FFT library; for simple example code, visit the clFFT home page. The client can invoke a user-specified type of FFT transform and run an FFT impulse test on it, so it serves as a simple verification tool for a particular kind of transform. It can also measure performance. The client program supports the following features.

  • Ability to specify precision of transform
  • Ability to specify lengths and dimensions
  • Ability to select forward or backward transform
  • Ability to choose buffer layouts
  • Ability to input strides and distances
  • Ability to specify number of transforms
  • Ability to dump underlying OpenCL kernels
  • Ability to measure performance for a specified transform

The block below shows the help message printed by the client program, which lists all of its command line options. Use these options to supply parameters and control the type of FFT.

C:\clFFT\bin\staging\Debug>Client.exe -h
clFFT client command line options:
  -h [ --help ]               produces this help message
  -v [ --version ]            Print queryable version information from the
                              clFFT library
  -i [ --clInfo ]             Print queryable information of the OpenCL runtime
  -g [ --gpu ]                Force instantiation of an OpenCL GPU device
  -c [ --cpu ]                Force instantiation of an OpenCL CPU device
  -a [ --all ]                Force instantiation of all OpenCL devices
  -o [ --outPlace ]           Out of place FFT transform (default: in place)
  --double                    Double precision transform (default: single)
  --inv                       Backward transform (default: forward)
  -d [ --dumpKernels ]        FFT engine will dump generated OpenCL FFT kernels
                              to disk (default: dump off)
  -x [ --lenX ] arg (=1024)   Specify the length of the 1st dimension of a test
                              array
  -y [ --lenY ] arg (=1)      Specify the length of the 2nd dimension of a test
                              array
  -z [ --lenZ ] arg (=1)      Specify the length of the 3rd dimension of a test
                              array
  --isX arg (=1)              Specify the input stride of the 1st dimension of
                              a test array
  --isY arg (=0)              Specify the input stride of the 2nd dimension of
                              a test array
  --isZ arg (=0)              Specify the input stride of the 3rd dimension of
                              a test array
  --iD arg (=0)               input distance between subsequent sets of data
                              when batch size > 1
  --osX arg (=1)              Specify the output stride of the 1st dimension of
                              a test array
  --osY arg (=0)              Specify the output stride of the 2nd dimension of
                              a test array
  --osZ arg (=0)              Specify the output stride of the 3rd dimension of
                              a test array
  --oD arg (=0)               output distance between subsequent sets of data
                              when batch size > 1
  -b [ --batchSize ] arg (=1) If this value is greater than one, arrays will be
                              used
  -p [ --profile ] arg (=1)   Time and report the kernel speed of the FFT
                              (default: profiling off)
  --inLayout arg (=1)         Layout of input data:
                              1) interleaved
                              2) planar
                              3) hermitian interleaved
                              4) hermitian planar
                              5) real
  --outLayout arg (=1)        Layout of input data:
                              1) interleaved
                              2) planar
                              3) hermitian interleaved
                              4) hermitian planar
                              5) real
  --xFactor arg (=0)          set the size of X dimension if a large 1D dataset
                              needs to be broken down (default: library
                              automatically chooses factorization)
  --ldsComplex                LDS is complex (default: false)
  --ldsPadding                Data is padding in LDS (default: false)
  --ldsFraction arg (=0)      specify the LDS fraction (default: library
                              automatically chooses LDS fraction)
  --cacheSize arg (=0)        specify the cahce size (default: library
                              automatically chooses cache size)
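
When running many configurations, it can help to assemble the command line programmatically. The sketch below is only an illustration: the helper name `build_client_args` and its parameters are hypothetical, and it covers just a handful of the options listed in the help message above (`-x`, `-y`, `-z`, `--double`, `--inv`, `-b`, `-p`). Options left at their defaults are omitted.

```python
def build_client_args(len_x=1024, len_y=1, len_z=1, double=False,
                      inverse=False, batch=1, profile=None,
                      exe="Client.exe"):
    """Build an argument list for the client (hypothetical helper).

    Only flags that appear in the client's help message are emitted.
    """
    args = [exe, "-x", str(len_x)]
    if len_y != 1:
        args += ["-y", str(len_y)]
    if len_z != 1:
        args += ["-z", str(len_z)]
    if double:
        args.append("--double")   # double precision transform
    if inverse:
        args.append("--inv")      # backward transform
    if batch != 1:
        args += ["-b", str(batch)]
    if profile is not None:
        args += ["-p", str(profile)]
    return args


# Reproduces the profiling command line used later on this page.
print(" ".join(build_client_args(len_x=512, batch=100, profile=50)))
```

The resulting list can be passed to `subprocess.run` from a directory that contains the client executable.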

Some examples are shown below. The first example invokes a transform of length 16. All other values are left at their defaults.

C:\clFFT\bin\staging\Debug>Client.exe -x 16


                Client Test *****PASS*****
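
To illustrate the idea behind an impulse test, here is a minimal sketch. It assumes a unit impulse at index 0 and uses a naive O(n^2) DFT rather than clFFT; the function names are hypothetical and this is not the client's actual code. The forward transform of such an impulse should be flat, with every output bin equal to 1.

```python
import cmath


def naive_dft(x):
    """Forward DFT computed directly from the definition (O(n^2))."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def impulse_test(n, tol=1e-9):
    """Return True if the DFT of a unit impulse at index 0 is all ones."""
    impulse = [1.0 + 0j] + [0j] * (n - 1)
    spectrum = naive_dft(impulse)
    return all(abs(value - 1.0) <= tol for value in spectrum)


print("PASS" if impulse_test(16) else "FAIL")
```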

The next example shows a 2D double precision transform of size 50x100.

C:\clFFT\bin\staging\Debug>Client.exe -x 50 -y 100 --double


                Client Test *****PASS*****

The next example shows a 1D transform of length 1024 in which both the input and the output use planar buffer layouts. The stride is 2 for the input and 3 for the output.

C:\clFFT\bin\staging\Debug>Client.exe -x 1024 --inLayout 2 --outLayout 2 --isX 2 --osX 3


                Client Test *****PASS*****
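
As a rough illustration of what the strides in this example mean, the sketch below maps a logical element index to a buffer offset. It assumes the conventional strided addressing (offset = batch * distance + index * stride, in units of elements); the helper name `element_offset` is hypothetical. With a planar layout, the same offsets would apply to the real and imaginary buffers separately.

```python
def element_offset(index, stride, batch=0, distance=0):
    """Offset (in elements) of a 1D element under strided addressing."""
    return batch * distance + index * stride


length = 1024
in_offsets = [element_offset(i, 2) for i in range(length)]    # --isX 2
out_offsets = [element_offset(i, 3) for i in range(length)]   # --osX 3

print(in_offsets[:4])    # first four input offsets
print(out_offsets[:4])   # first four output offsets

# Smallest buffer sizes (in elements) that hold all 1024 strided values.
print(in_offsets[-1] + 1, out_offsets[-1] + 1)
```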

The next example shows a 2D real transform with Hermitian interleaved output. The size is 192x108.

C:\clFFT\bin\staging\Debug>Client.exe -x 192 -y 108 --inLayout 5 --outLayout 3


                Client Test *****PASS*****
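
Because the spectrum of real input is conjugate-symmetric, a Hermitian layout only needs to store about half of it. The sketch below computes the number of complex values under the assumption that the first (X) dimension is the one reduced to len/2 + 1 entries; the helper name `hermitian_shape` is hypothetical.

```python
def hermitian_shape(len_x, len_y=1, len_z=1):
    """Shape of the non-redundant half of a real-to-complex transform.

    Assumes the X dimension is the one shortened to len_x // 2 + 1.
    """
    return (len_x // 2 + 1, len_y, len_z)


shape = hermitian_shape(192, 108)
complex_values = shape[0] * shape[1] * shape[2]
real_inputs = 192 * 108

print(shape, complex_values, real_inputs)
```

For the 192x108 case this gives 97 x 108 complex values for 20,736 real inputs.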

The next example shows how to measure performance for a 1D transform of length 512 with the batch size set to 100. The profile parameter specifies the number of iterations to run; outlying timing samples are pruned before the results are reported. Since the GPU device becomes more efficient as the data size grows, set the batch and transform sizes as high as the device memory limits allow to see the maximum attainable performance.

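In this example, the Gflops figure is reported as 88. It is calculated from the elapsed time t using the standard FFT performance formula ( 5 n log2(n) / t ), where n is the transform length; the reported figure is consistent with counting all 100 transforms in the batch. The time in nanoseconds is also reported.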

C:\clFFT\bin\staging\Debug>Client.exe -x 512 -b 100 -p 50

========================StdDev ( 2 )========================
clFFT[ 0 ]: Pruning 1 samples out of 50

===========================clFFT============================
        Handle:                   1
        Kernel:    0000000003CA0710
     OutEvents:    0000000003C86E70
        Length:               (512)
         Batch:                 100
  Input Stride:                 (1)
 Output Stride:                 (1)
   Global Work:              (6400)
        Gflops:                        88.492
     Time (ns):                                       26,036
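
The reported figure can be checked by hand. The sketch below applies the 5 n log2(n) estimate, assuming that it is multiplied by the batch size and that n is the total number of points in one transform; the function name `fft_gflops` is hypothetical. Dividing a flop count by a time in nanoseconds gives Gflops directly.

```python
import math


def fft_gflops(lengths, batch, time_ns):
    """Estimate Gflops with the 5*n*log2(n) FFT operation count."""
    n = math.prod(lengths)                     # points per transform
    flops = 5.0 * n * math.log2(n) * batch     # estimated operations
    return flops / time_ns                     # flops per ns == Gflops


# Values taken from the profiling output above.
print(round(fft_gflops((512,), batch=100, time_ns=26_036), 2))  # 88.49
```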

The next example shows how to measure performance for a double precision 2D 512x512 transform. Here, performance is reported for 5 plan handles; the last one is the overall performance for the transform. A 2D transform involves 4 operations: a row transform followed by a transpose, then a column transform followed by a transpose. Each individual operation is timed and reported.

C:\clFFT\bin\staging\Debug>Client.exe -x 512 -y 512 --double -p 50

========================StdDev ( 2 )========================
clFFT[ 0 ]: Pruning 0 samples out of 50
clFFT[ 1 ]: Pruning 0 samples out of 50
clFFT[ 2 ]: Pruning 1 samples out of 50
clFFT[ 3 ]: Pruning 0 samples out of 50
clFFT[ 4 ]: Pruning 0 samples out of 50

===========================clFFT============================
        Handle:                   2
        Kernel:    0000000003BBC730
     OutEvents:    0000000004A37640
        Length:           (512,512)
  Input Stride:             (1,512)
 Output Stride:             (1,512)
   Global Work:             (32768)
        Gflops:                       125.589
     Time (ns):                                       93,928

        Handle:                   3
        Kernel:    0000000003BBC810
     OutEvents:    000000000640E410
        Length:           (512,512)
  Input Stride:             (1,512)
 Output Stride:             (1,512)
   Global Work:              (8704)
        Gflops:                        132.17
     Time (ns):                                      178,504

        Handle:                   4
        Kernel:    0000000003BBCA40
     OutEvents:    0000000003C7DE90
        Length:           (512,512)
  Input Stride:             (1,512)
 Output Stride:             (1,512)
   Global Work:             (32768)
        Gflops:                       127.419
     Time (ns):                                       92,580

        Handle:                   5
        Kernel:    0000000003BBC810
     OutEvents:    00000000049FFCB0
        Length:           (512,512)
  Input Stride:             (1,512)
 Output Stride:             (1,512)
   Global Work:              (8704)
        Gflops:                       132.637
     Time (ns):                                      177,875

        Handle:                   1
 Child Handles:           (2,3,4,5)
        Length:           (512,512)
  Input Stride:             (1,512)
 Output Stride:             (1,512)
        Gflops:                       43.4581
     Time (ns):                                      542,889
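
To see how the four child operations relate to the overall result, the sketch below sums the child times from the output above and recomputes the overall Gflops. It assumes the same 5 n log2(n) estimate with n = 512 * 512; the variable names are illustrative only.

```python
import math

# Times (ns) copied from the profiling output above, keyed by handle.
child_times_ns = {2: 93_928, 3: 178_504, 4: 92_580, 5: 177_875}
overall_time_ns = 542_889

n = 512 * 512
flops = 5.0 * n * math.log2(n)          # estimated operations, whole 2D FFT

total_child_ns = sum(child_times_ns.values())
overall_gflops = flops / overall_time_ns

for handle, t in child_times_ns.items():
    share = 100.0 * t / total_child_ns
    print(f"handle {handle}: {t} ns ({share:.1f}% of child time)")

print("sum of child times (ns):", total_child_ns)
print("overall Gflops:", round(overall_gflops, 2))
```

The child times add up to within a few nanoseconds of the overall time, and the recomputed overall figure agrees with the reported value of about 43.46 Gflops.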
