PCIe XDMA to AXI4-Stream with a 512-Bit H2C Bus. Demonstration for the Innova-2 using Vivado 2022.2. Stream multiplies Floating-Point numbers.
The AXI-Lite BAR has a 0x40000000 PCIe to AXI Translation offset.
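As a rough illustration of what the translation offset means for host software, the sketch below converts between an AXI address and the byte offset to use within the AXI-Lite BAR. It assumes the translation is a plain addition of 0x40000000 to the offset within the BAR. The helper names and the 0x40010000 peripheral address in the usage note are hypothetical and not taken from this design.

```c
#include <stdint.h>

/* PCIe to AXI Translation offset of the AXI-Lite BAR, as stated above. */
#define AXIL_TRANSLATION_OFFSET 0x40000000u

/* Hypothetical helper: AXI address of a peripheral -> byte offset within
 * the AXI-Lite BAR. ASSUMPTION: translation is a simple addition. */
static inline uint32_t axi_to_bar_offset(uint32_t axi_addr)
{
    return axi_addr - AXIL_TRANSLATION_OFFSET;
}

/* Hypothetical helper: byte offset within the BAR -> AXI address. */
static inline uint32_t bar_offset_to_axi(uint32_t bar_offset)
{
    return bar_offset + AXIL_TRANSLATION_OFFSET;
}
```

Under that assumption, a peripheral placed at AXI address 0x40010000 would be reached at offset 0x10000 within the BAR.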
Recreate the bitstream. Download xdma_stream_512bit.tcl and constraints.xdc. source the Tcl script in the Vivado 2022.2 Tcl Console, then run Generate Bitstream.
Load the bitstream into your Innova-2. It should work with every variant of the Innova-2. Refer to innova2_flex_xcku15p_notes for system setup.
pwd
cd DOWNLOAD_DIRECTORY
dir
source xdma_stream_512bit.tcl
Generate the bitstream:
Resources used for the design:
Confirm the xdma driver has loaded and the hardware is recognized and operating as expected.
sudo lspci -vnn -d 10ee: ; sudo lspci -vvnn -d 10ee: | grep Lnk
sudo lspci -vv -d 15b3:1974 | grep "Mellanox\|LnkSta"
dmesg | grep xdma will detail how the XDMA driver has loaded.
ls /dev/xdma* will show all character device files associated with the XDMA driver.
Compile and run the test program.
gcc -Wall stream_test.c -o stream_test -lm ; sudo ./stream_test
Every once in a while there will be a problem with communication: a portion of the resulting C2H floating-point array gets shifted by a few indices. I have run the core pwrite+pread loop millions of times and problems pop up early.
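One way to catch the shift described above is to compare the received C2H array against the expected results and, on a mismatch, check whether the tail of the array lines up again at a small offset. The following is a minimal sketch of that idea. The function names are hypothetical and this is not code from stream_test.c.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the first element where the received data differs
 * bit-for-bit from the expected data, or -1 if the arrays match. */
static long first_mismatch(const float *expected, const float *got, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(&expected[i], &got[i], sizeof(float)) != 0)
            return (long)i;
    }
    return -1;
}

/* From index `from` onward, look for a non-zero shift s with magnitude up to
 * max_shift such that got[i] == expected[i + s] wherever i + s is in range.
 * Returns s, or 0 if no such shift explains the data. */
static int find_shift(const float *expected, const float *got,
                      size_t n, size_t from, int max_shift)
{
    for (int mag = 1; mag <= max_shift; mag++) {
        for (int sign = -1; sign <= 1; sign += 2) {
            long s = (long)sign * mag;
            size_t compared = 0;
            int ok = 1;
            for (size_t i = from; i < n && ok; i++) {
                long j = (long)i + s;
                if (j < 0 || j >= (long)n)
                    continue;           /* no expected value to compare */
                compared++;
                if (memcmp(&got[i], &expected[j], sizeof(float)) != 0)
                    ok = 0;
            }
            if (ok && compared > 0)
                return (int)s;
        }
    }
    return 0;
}
```

Calling first_mismatch after each pread and then find_shift from the reported index gives both where the corruption starts and by how many indices the data moved.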
By using /dev/zero as the source of data and /dev/null as the sink with dd you can experiment with data throughput vs. count= and bs= (Block Size) values. The Channel 1 streams, xdma0_h2c_1 and xdma0_c2h_1, are connected directly to each other for loopback. This gives an estimate for the maximum possible throughput.
In one terminal:
sudo dd if=/dev/zero of=/dev/xdma0_h2c_1 count=32768 bs=16384
In a second terminal:
sudo dd if=/dev/xdma0_c2h_1 of=/dev/null count=32768 bs=16384
The reported H2C throughput will be lower as it includes the time it takes you to switch to the second window and start the second dd.
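When experimenting with count= and bs=, it helps to know how much data a run moves. The sketch below computes the total transfer size and a throughput figure from an elapsed time. The helper names are hypothetical, and the throughput uses decimal megabytes (1 MB = 1,000,000 bytes).

```c
#include <stdint.h>

/* Total bytes moved by one dd run: count blocks of bs bytes each. */
static uint64_t dd_total_bytes(uint64_t count, uint64_t bs)
{
    return count * bs;
}

/* Throughput in decimal megabytes per second for a given elapsed time. */
static double throughput_mb_per_s(uint64_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e6;
}
```

With count=32768 and bs=16384 as in the commands above, each dd run moves 536,870,912 bytes (512 MiB).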
The maximum width for the AXI Bus with a PCIe 3.0 x8 design is 256-Bit but a 512-Bit stream is required.
The goal is to re-clock and channel the data through the stream. Clocks and resets are carefully managed. tkeep and tlast signals are omitted from all blocks as they are not used.
To widen the 256-Bit AXI4-Stream bus to 512-Bit while maintaining the same bandwidth, the 250MHz axi_aclk clock is halved to 125MHz.
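The arithmetic behind halving the clock can be checked with a few lines of C. This is only a sketch: it assumes one transfer is accepted on every clock cycle, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Raw AXI4-Stream bandwidth in bits per second.
 * ASSUMPTION: one beat of width_bits is transferred every clock cycle. */
static uint64_t stream_bits_per_second(uint64_t width_bits, uint64_t clock_hz)
{
    return width_bits * clock_hz;
}
```

Both 256 bits at 250MHz and 512 bits at 125MHz come out to 64 Gbit/s, or 8 GB/s, so doubling the width exactly offsets halving the clock.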
Each clock needs an associated aresetn synchronized to it and controllable by a GPIO signal to allow resetting the stream.
Input (Host-to-Card H2C) and output (Card-to-Host C2H) FIFOs were added to increase throughput. The output C2H FIFO has the minimum depth of 16.
To match throughput the input H2C FIFO has a depth of 32 as its stream uses twice as many bits.
The 256-Bit XDMA Block H2C stream is widened to 512-Bit using a Data Width Converter.
The input (H2C) data stream is re-timed to 125MHz (half of the XDMA block's axi_aclk) which is used by the stream blocks.
The output (C2H) data stream is re-timed back to the 250MHz axi_aclk before going into the XDMA block.
The 512-Bit=64-Byte H2C data stream is split/broadcast into sixteen 32-Bit=4-Byte streams for the floating-point units.
The bits of each 32-Bit=4-Byte stream are appropriately selected from the 512-Bit stream.
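From the host's point of view, the bit selection can be pictured as follows. This sketch assumes lane i takes bits [32*i+31 : 32*i] of the 512-Bit word and that byte 0 of the host buffer lands on the lowest eight bits of the stream data. The actual slice assignments are set in the block design, and the helper names here are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Lowest and highest bit index of a 32-bit lane within the 512-bit word.
 * ASSUMPTION: lanes are packed in order, lane 0 in the lowest bits. */
static unsigned lane_lsb(unsigned lane) { return 32u * lane; }
static unsigned lane_msb(unsigned lane) { return 32u * lane + 31u; }

/* Read the float carried in one lane of a 64-byte H2C beat.
 * ASSUMPTION: buffer byte 0 corresponds to stream data bits [7:0]. */
static float h2c_lane(const uint8_t beat[64], unsigned lane)
{
    float v;
    memcpy(&v, beat + 4u * lane, sizeof v);  /* avoids unaligned access */
    return v;
}
```

Under these assumptions, the sixteenth lane (index 15) covers bits [511:480], and element i of a 16-float host array ends up in lane i.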
The floating-point unit results are combined into the 256-Bit output C2H stream.
I put Floating-Point blocks in the stream as an example of something useful. Each pair of 4-byte=32-bit single precision floating-point values in the 64-Byte=512-Bit Host-to-Card (H2C) stream gets multiplied to produce a floating-point value in the 32-Byte=256-Bit Card-to-Host (C2H) stream.
The floating-point blocks are set up to multiply their inputs.
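A software reference model of one beat is handy for checking results on the host. The sketch below assumes that adjacent values form the pairs, so output i is the product of inputs 2*i and 2*i+1. The pairing actually used by the design depends on how the bits are selected, and the function name is hypothetical.

```c
#include <stddef.h>

#define H2C_FLOATS_PER_BEAT 16  /* 64 bytes / 4 bytes per float */
#define C2H_FLOATS_PER_BEAT 8   /* 32 bytes / 4 bytes per float */

/* Reference model for one stream beat: 16 floats in, 8 products out.
 * ASSUMPTION: adjacent inputs are paired, c2h[i] = h2c[2i] * h2c[2i+1]. */
static void multiply_beat(const float h2c[H2C_FLOATS_PER_BEAT],
                          float c2h[C2H_FLOATS_PER_BEAT])
{
    for (size_t i = 0; i < C2H_FLOATS_PER_BEAT; i++)
        c2h[i] = h2c[2 * i] * h2c[2 * i + 1];
}
```

Feeding the values 1 through 16 into this model gives 2 for the first output and 240 for the last, which can be compared against what the card returns.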
Full DSP usage is allowed to maximize throughput.
The interface is set up as Blocking so that the AXI4-Stream interfaces include tready signals like the rest of the Stream blocks.