DFModel models workload performance on a distributed system of accelerators. It uses Gurobi for optimization given the input constraints set up by users. The following images show the flow chart of DFModel. DFModel goes through two optimization steps, from the inter-chip level to the intra-chip level: inter-chip optimization finds the best sharding strategy for a given dataflow graph, and intra-chip optimization determines the mapping of each kernel onto the accelerator and calculates the overall performance.
- Gurobi 10.0.3
- Python 3.11.5
- libprotoc 3.20.3
- pydot 1.4.2
- h5py 3.9.0
- Create a directory.
- Assuming the directory you created is `dir`, set up the input Protobuf text file at this path: `dir/user_input/setup.txt`.
- Run DFModel using this command: `./run.sh dir`.
The most important task for a user is setting up the Protobuf text file. The file contains information for the dataflow graph, system, cost, execution, and Gurobi.
Set up the dataflow graph, which describes the workload. It consists of kernels and connections. Kernels represent the computation kernels in a dataflow graph, such as a matrix multiplication (GEMM) or an element-wise operation (ReLU). Connections represent the connections between kernels, which correspond to intermediate buffers such as the activation between two MLP layers. The table below summarizes the fields in the `dataflow_graph` message.
Message | Field | Must Fill | Unit | If not filled | Description |
---|---|---|---|---|---|
`dataflow_graph` | `kernels` | yes | N/A | N/A | compute kernels in a graph |
`dataflow_graph` | `connections` | yes | N/A | N/A | buffers in a graph |
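For instance, a minimal graph with a GEMM feeding a ReLU could be written as the sketch below. The names, dimensions, and tensor sizes are illustrative placeholders (assuming a 2-byte datatype), and the fields mirror the generator output shown later in this document.

# hypothetical two-kernel graph: a GEMM followed by a ReLU (sizes assume word = 2 bytes)
kernels {
name: "GEMM_0"
id: 1
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 1024
K: 1024
N: 1024
input_tensor_size: 2097152.0
weight_tensor_size: 2097152.0
output_tensor_size: 2097152.0
tiling: N_TILING
}
}
kernels {
name: "ReLU_0"
id: 2
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 1024
N: 1024
input_tensor_size: 2097152.0
output_tensor_size: 2097152.0
tiling: N_TILING
}
}
# the buffer holding the GEMM output consumed by the ReLU
connections {
startIdx: 1
endIdx: 2
id: 1
}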
It is recommended that you use the generators in `generator/` to generate the dataflow graph for LLM/DLRM/HPL/FFT applications. The following tables summarize the input arguments to the generators.
For LLM workloads, use `generator/generate_LLM_fwd.py` or `generator/generate_LLM_fwdbwd.py`.
Workload | Input Argument | Must Fill | Description |
---|---|---|---|
LLM | `hidden` | yes | hidden dimension of LLM |
LLM | `seq` | yes | sequence length of LLM |
LLM | `num_head` | yes | number of heads of LLM |
LLM | `head_dim` | yes | head dimension of LLM |
LLM | `word` | yes | number of bytes in the datatype |
For DLRM workloads, use `generator/generate_DLRM_fwdbwd.py`.
Workload | Input Argument | Must Fill | Description |
---|---|---|---|
DLRM | `mlp_dim` | yes | dimension of MLP |
DLRM | `bottom_num_mlp` | yes | number of MLPs at the bottom |
DLRM | `top_num_mlp` | yes | number of MLPs at the top |
DLRM | `pooled_row` | yes | number of rows pooled from the embedding table |
DLRM | `row` | yes | number of rows in the embedding table |
DLRM | `num_table` | yes | number of embedding tables |
DLRM | `emb_dim` | yes | dimension of embeddings |
DLRM | `num_chip` | yes | number of chips in the system |
DLRM | `micro_batch_size` | yes | local batch size for MLPs |
DLRM | `word` | yes | number of bytes in the datatype |
For HPL workloads, use `generator/generate_HPL.py`.
Workload | Input Argument | Must Fill | Description |
---|---|---|---|
HPL | `N` | yes | side length of the square matrix for factorization |
HPL | `B` | yes | block size of the matrix |
HPL | `X` | yes | number of chips along the first dimension of the matrix |
HPL | `Y` | yes | number of chips along the second dimension of the matrix |
HPL | `word` | yes | number of bytes in the datatype |
For FFT workloads, use `generator/generate_FFT.py`.
Workload | Input Argument | Must Fill | Description |
---|---|---|---|
FFT | `length` | yes | length of FFT |
FFT | `num_chip` | yes | number of chips in the system |
FFT | `word` | yes | number of bytes in the datatype |
Set up the system. The system consists of compute, memory, and network components. The compute component is the accelerator in the system: you can set up the number of chips and the micro-architectural details of each chip. The memory component is the main memory attached to the accelerator: you can set up characteristics such as bandwidth and capacity. The network component is the hierarchical interconnection network that connects all the accelerators in the system: you can set up the degree, bandwidth, and topology of each dimension. The table below summarizes the fields in the `system` message. There are many topologies to choose from in `src/setup.proto`; here we show a 3D torus as one example.
Message | Field | Must Fill | Unit | If not filled | Description |
---|---|---|---|---|---|
`system` | `num_chip` | yes | N/A | N/A | number of chips in the system |
`system` | `accelerator` | yes | N/A | N/A | micro-architectural details of one accelerator |
`accelerator` | `core` | yes | N/A | N/A | number of cores in the accelerator |
`accelerator` | `systolic_width` | yes | N/A | N/A | width of one systolic core |
`accelerator` | `systolic_height` | yes | N/A | N/A | height of one systolic core |
`accelerator` | `sram_cap` | yes | Bytes | N/A | capacity of on-chip SRAM |
`accelerator` | `freq` | yes | GHz | N/A | frequency of the accelerator |
`memory` | `dram_bw` | yes | GB/s | N/A | DRAM bandwidth |
`memory` | `dram_cap` | yes | Bytes | N/A | DRAM capacity |
`topology_variant` | `r_r_r` | yes | N/A | N/A | `r_r_r` represents a 3D torus since all three dimensions of the torus are basic rings; we follow this notation to define the basic dimensions of a hierarchical topology: ring (`r`), switch (`sw`), fully-connected (`fc`); other topologies are shown in `setup.proto` |
`r_r_r` | `x` | no | N/A | you can leave `x`, `y`, `z` blank simultaneously and DFModel automatically searches for the best degree for each dimension | degree for the X dimension |
`r_r_r` | `y` | no | N/A | you can leave `x`, `y`, `z` blank simultaneously and DFModel automatically searches for the best degree for each dimension | degree for the Y dimension |
`r_r_r` | `z` | no | N/A | you can leave `x`, `y`, `z` blank simultaneously and DFModel automatically searches for the best degree for each dimension | degree for the Z dimension |
`r_r_r` | `link_bw_x` | yes | GB/s | N/A | link bandwidth for the interconnect on the X dimension |
`r_r_r` | `link_bw_y` | yes | GB/s | N/A | link bandwidth for the interconnect on the Y dimension |
`r_r_r` | `link_bw_z` | yes | GB/s | N/A | link bandwidth for the interconnect on the Z dimension |
`r_r_r` | `par_x` | no | N/A | only set it for LLM; no need to set it for DLRM/HPL/FFT | parallelization strategy on the X dimension: `TP` (tensor parallelism), `PP` (pipeline parallelism), or `DP` (data parallelism) |
`r_r_r` | `par_y` | no | N/A | only set it for LLM; no need to set it for DLRM/HPL/FFT | parallelization strategy on the Y dimension: `TP` (tensor parallelism), `PP` (pipeline parallelism), or `DP` (data parallelism) |
`r_r_r` | `par_z` | no | N/A | only set it for LLM; no need to set it for DLRM/HPL/FFT | parallelization strategy on the Z dimension: `TP` (tensor parallelism), `PP` (pipeline parallelism), or `DP` (data parallelism) |
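As an illustration, a hypothetical 512-chip system with an 8x8x8 3D torus (`r_r_r`) could be described as in the sketch below. The accelerator, memory, and bandwidth numbers are placeholders, and the nesting mirrors the 2D-torus example shown later in this document.

system {
num_chip: 512
accelerator {
core: 1280
systolic_width: 32
systolic_height: 6
sram_cap: 671088640.0
freq: 1.25
}
r_r_r {
# 8 x 8 x 8 = 512 chips; leave x/y/z blank to let DFModel search the degrees
x: 8
y: 8
z: 8
link_bw_x: 50.0
link_bw_y: 50.0
link_bw_z: 50.0
par_x: "TP"
par_y: "PP"
par_z: "DP"
}
memory {
dram_bw: 300.0
dram_cap: 1099511600000.0
}
}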
Set up the dollar price and power consumption of the system, which consists of compute, memory, and network parts. The compute part covers the price/power of the accelerator die, the memory part covers the price/power of DRAM, and the network part covers the price/power of the interconnect in the hierarchical topology. The table below summarizes the fields in the `cost` message.
Message | Field | Must Fill | Unit | If not filled | Description |
---|---|---|---|---|---|
`cost` | `link_unit_price` | yes | $/(GB/s) | N/A | dollar price per bandwidth for a link |
`cost` | `switch_unit_price` | yes | $/(GB/s*radix) | N/A | dollar price per bandwidth per radix for a switch |
`cost` | `dram_unit_price` | yes | $/(GB/s) | N/A | dollar price per bandwidth for DRAM |
`cost` | `accelerator_price` | yes | $ | N/A | dollar price of the accelerator die |
`cost` | `link_unit_power_x` | no | J/GB | depends on topology; if not filled, the value is 0 | energy consumption per byte accessed on a link in the X dimension |
`cost` | `link_unit_power_y` | no | J/GB | depends on topology; if not filled, the value is 0 | energy consumption per byte accessed on a link in the Y dimension |
`cost` | `link_unit_power_z` | no | J/GB | depends on topology; if not filled, the value is 0 | energy consumption per byte accessed on a link in the Z dimension |
`cost` | `switch_unit_power` | yes | J/(GB*radix) | N/A | energy consumption per bandwidth per radix for a switch |
`cost` | `dram_unit_power` | yes | J/GB | N/A | energy consumption per byte accessed from DRAM |
`cost` | `accelerator_power` | yes | W | N/A | power consumption of the accelerator die |
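For a 3D torus such as the `r_r_r` sketch above, the cost message could look like the following; all values are placeholders rather than measured prices or power numbers.

cost {
accelerator_price: 10000.0
accelerator_power: 400.0
dram_unit_price: 1.0
dram_unit_power: 0.16
link_unit_price: 2.0
# one link power entry per topology dimension
link_unit_power_x: 0.05
link_unit_power_y: 0.05
link_unit_power_z: 0.05
switch_unit_price: 24.0
switch_unit_power: 0.05
}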
Set up miscellaneous constraints. You can set the execution style for the workload, which is either kernel-by-kernel execution or dataflow execution. Dataflow execution means DFModel automatically searches for how to fuse kernels into different configurations. You can also specify other parameters such as the number of configurations and the number of bytes in the datatype. There are also workload-dependent constraints; for example, for LLM training workloads, you can specify the global batch size and local batch size. The following table shows the fields in the `execution` message.
Message | Field | Must Fill | Unit | If not filled | Description |
---|---|---|---|---|---|
`execution` | `execution_style` | yes | N/A | N/A | `DATAFLOW` or `KERNEL_BY_KERNEL`: dataflow automatically fuses kernels into different configurations, kernel-by-kernel puts every kernel into its own configuration |
`execution` | `num_config` | no | N/A | if not specified, `num_config` is equal to the number of kernels | you can set a number for dataflow; no need to set it for kernel-by-kernel |
`execution` | `perfect_overlap` | yes | N/A | N/A | if true, overlap the compute/memory/network latency for each configuration |
`execution` | `compute_util` | no | N/A | if not specified, DFModel automatically calculates the compute latency based on the accelerator's TFLOPS | if specified, DFModel assumes the TFLOPS utilization (compute only) to be the specified number, which can significantly reduce DFModel's runtime |
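As a contrast to the dataflow setup used in the GPT-3 example later, a kernel-by-kernel run could be configured with the minimal sketch below; `num_config` is omitted, so it defaults to the number of kernels.

execution {
# every kernel gets its own configuration
execution_style: KERNEL_BY_KERNEL
perfect_overlap: true
}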
Set up Gurobi-related constraints. You can control the runtime, the MIP gap, and the number of threads used. The following table shows the fields in the `gurobi` message.
Message | Field | Must Fill | Unit | If not filled | Description |
---|---|---|---|---|---|
`gurobi` | `thread` | no | N/A | if not specified, defaults to 20 threads | number of threads used by the Gurobi program |
`gurobi` | `gap` | no | N/A | if not specified, defaults to 0.01 | MIPGap for the Gurobi program |
`gurobi` | `time` | no | second | if not specified, defaults to infinite time (run until completion) | runtime limit for the Gurobi program |
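A minimal sketch, assuming you only want to cap the solver runtime and otherwise keep the defaults (20 threads, 0.01 MIPGap):

gurobi {
# one-hour limit; thread and gap fall back to their defaults
time: 3600
}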
We go through an example of training the GPT-3 1T model on 1024 SambaNova SN30 RDUs.
We first use the LLM generator in `generator/generate_LLM_fwdbwd.py` to generate the dataflow graph. You can use the command `python generator/generate_LLM_fwdbwd.py --hidden=25600 --num_head=160 --head_dim=160 --seq=2048 --word=2` to generate the dataflow graph for the GPT-3 1T model with a hidden dimension of 25600, 160 heads, a head dimension of 160, and a sequence length of 2048. The following generated code is the DFModel input for one layer of GPT-3 1T.
kernels {
name: "Add_Prev_Layer"
id: 1
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "LayerNorm_1"
id: 2
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "Q"
id: 3
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 160
M: 160
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "K"
id: 4
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 160
M: 160
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "V"
id: 5
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 160
M: 160
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "MHA_GEMM_1"
id: 6
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 160
M: 2048
K: 160
N: 2048
input_tensor_1_size: 104857600.0
input_tensor_2_size: 104857600.0
output_tensor_size: 1342177300.0
tiling: N_TILING
}
}
kernels {
name: "SOFTMAX"
id: 7
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 160
M: 2048
N: 2048
input_tensor_size: 1342177300.0
output_tensor_size: 1342177300.0
tiling: N_TILING
}
}
kernels {
name: "DropOut_1"
id: 8
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 160
M: 2048
N: 2048
input_tensor_size: 1342177300.0
output_tensor_size: 1342177300.0
tiling: N_TILING
}
}
kernels {
name: "MHA_GEMM_2"
id: 9
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 160
M: 160
K: 2048
N: 2048
input_tensor_1_size: 104857600.0
input_tensor_2_size: 1342177300.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "PROJ_GEMM"
id: 10
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 25600
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "DropOut_2"
id: 11
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "Add_1"
id: 12
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1_input2 {
outer: 1
M: 25600
N: 2048
input_tensor_1_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
input_tensor_2_size: 104857600.0
}
}
kernels {
name: "LayerNorm_2"
id: 13
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "FFN0"
id: 14
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 102400
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 5242880000.0
output_tensor_size: 419430400.0
tiling: N_TILING
}
}
kernels {
name: "GeLU"
id: 15
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 102400
N: 2048
input_tensor_size: 419430400.0
output_tensor_size: 419430400.0
tiling: N_TILING
}
}
kernels {
name: "FFN1"
id: 16
config: -1
fwd_bwd: FWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 25600
K: 102400
N: 2048
input_tensor_size: 419430400.0
weight_tensor_size: 5242880000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "DropOut_3"
id: 17
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "Add_2"
id: 18
config: -1
fwd_bwd: FWD
type: SIMD
elementwise_input1_input2 {
outer: 1
M: 25600
N: 2048
input_tensor_1_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
input_tensor_2_size: 104857600.0
}
}
kernels {
name: "Loss_bwd"
id: 19
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "DropOut_3_bwd"
id: 20
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "FFN1_bwd"
id: 21
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 102400
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 5242880000.0
output_tensor_size: 419430400.0
tiling: N_TILING
}
}
kernels {
name: "GeLU_bwd"
id: 22
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 1
M: 102400
N: 2048
input_tensor_size: 419430400.0
output_tensor_size: 419430400.0
tiling: N_TILING
}
}
kernels {
name: "FFN0_bwd"
id: 23
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 25600
K: 102400
N: 2048
input_tensor_size: 419430400.0
weight_tensor_size: 5242880000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "LayerNorm_2_bwd"
id: 24
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "DropOut_2_bwd"
id: 25
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 1
M: 25600
N: 2048
input_tensor_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "PROJ_GEMM_bwd"
id: 26
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 25600
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "MHA_GEMM_2_bwd1"
id: 27
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 160
M: 2048
K: 160
N: 2048
input_tensor_1_size: 104857600.0
input_tensor_2_size: 104857600.0
output_tensor_size: 1342177300.0
tiling: N_TILING
}
}
kernels {
name: "MHA_GEMM_2_bwd2"
id: 28
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 160
M: 160
K: 2048
N: 2048
input_tensor_1_size: 104857600.0
input_tensor_2_size: 1342177300.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "V_bwd"
id: 29
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_weight {
outer: 1
M: 25600
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "DropOut_1_bwd"
id: 30
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 160
M: 2048
N: 2048
input_tensor_size: 1342177300.0
output_tensor_size: 1342177300.0
tiling: N_TILING
}
}
kernels {
name: "SOFTMAX_bwd"
id: 31
config: -1
fwd_bwd: BWD
type: SIMD
elementwise_input1 {
outer: 160
M: 2048
N: 2048
input_tensor_size: 1342177300.0
output_tensor_size: 1342177300.0
tiling: N_TILING
}
}
kernels {
name: "MHA_GEMM_1_bwd1"
id: 32
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 160
M: 160
K: 2048
N: 2048
input_tensor_1_size: 1342177300.0
input_tensor_2_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "MHA_GEMM_1_bwd2"
id: 33
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 160
M: 160
K: 2048
N: 2048
input_tensor_1_size: 1342177300.0
input_tensor_2_size: 104857600.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "Q_bwd"
id: 34
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_weight {
outer: 160
M: 160
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "K_bwd"
id: 35
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_weight {
outer: 160
M: 160
K: 25600
N: 2048
input_tensor_size: 104857600.0
weight_tensor_size: 1310720000.0
output_tensor_size: 104857600.0
tiling: N_TILING
}
}
kernels {
name: "FFN1_bwd_weight_update"
id: 36
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 1
M: 102400
K: 2048
N: 25600
input_tensor_1_size: 104857600.0
input_tensor_2_size: 419430400.0
output_tensor_size: 5242880000.0
communication_type: ALL_REDUCE_PERIODIC
communication_size: 5242880000.0
tiling: K_TILING
}
}
kernels {
name: "FFN0_bwd_weight_update"
id: 37
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 1
M: 25600
K: 2048
N: 102400
input_tensor_1_size: 419430400.0
input_tensor_2_size: 104857600.0
output_tensor_size: 5242880000.0
communication_type: ALL_REDUCE_PERIODIC
communication_size: 5242880000.0
tiling: K_TILING
}
}
kernels {
name: "PROJ_GEMM_bwd_weight_update"
id: 38
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 1
M: 25600
K: 2048
N: 25600
input_tensor_1_size: 104857600.0
input_tensor_2_size: 104857600.0
output_tensor_size: 1310720000.0
communication_type: ALL_REDUCE_PERIODIC
communication_size: 1310720000.0
tiling: K_TILING
}
}
kernels {
name: "V_bwd_weight_update"
id: 39
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 1
M: 25600
K: 2048
N: 25600
input_tensor_1_size: 104857600.0
input_tensor_2_size: 104857600.0
output_tensor_size: 1310720000.0
communication_type: ALL_REDUCE_PERIODIC
communication_size: 1310720000.0
tiling: K_TILING
}
}
kernels {
name: "K_bwd_weight_update"
id: 40
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 1
M: 25600
K: 2048
N: 25600
input_tensor_1_size: 104857600.0
input_tensor_2_size: 104857600.0
output_tensor_size: 1310720000.0
communication_type: ALL_REDUCE_PERIODIC
communication_size: 1310720000.0
tiling: K_TILING
}
}
kernels {
name: "Q_bwd_weight_update"
id: 41
config: -1
fwd_bwd: BWD
type: SYSTOLIC
gemm_input1_input2 {
outer: 1
M: 25600
K: 2048
N: 25600
input_tensor_1_size: 104857600.0
input_tensor_2_size: 104857600.0
output_tensor_size: 1310720000.0
communication_type: ALL_REDUCE_PERIODIC
communication_size: 1310720000.0
tiling: K_TILING
}
}
connections {
startIdx: 1
endIdx: 2
id: 1
}
connections {
startIdx: 2
endIdx: 3
id: 2
}
connections {
startIdx: 2
endIdx: 4
id: 3
}
connections {
startIdx: 2
endIdx: 5
id: 4
}
connections {
startIdx: 3
endIdx: 6
id: 5
}
connections {
startIdx: 4
endIdx: 6
id: 6
}
connections {
startIdx: 6
endIdx: 7
id: 7
}
connections {
startIdx: 7
endIdx: 8
id: 8
}
connections {
startIdx: 5
endIdx: 9
id: 9
}
connections {
startIdx: 8
endIdx: 9
id: 10
}
connections {
startIdx: 9
endIdx: 10
id: 11
}
connections {
startIdx: 10
endIdx: 11
id: 12
}
connections {
startIdx: 11
endIdx: 12
id: 13
}
connections {
startIdx: 1
endIdx: 12
id: 14
}
connections {
startIdx: 12
endIdx: 13
id: 15
}
connections {
startIdx: 13
endIdx: 14
id: 16
}
connections {
startIdx: 14
endIdx: 15
id: 17
}
connections {
startIdx: 15
endIdx: 16
id: 18
}
connections {
startIdx: 16
endIdx: 17
id: 19
}
connections {
startIdx: 17
endIdx: 18
id: 20
}
connections {
startIdx: 12
endIdx: 18
id: 21
}
connections {
startIdx: 19
endIdx: 20
id: 22
}
connections {
startIdx: 20
endIdx: 21
id: 23
}
connections {
startIdx: 21
endIdx: 22
id: 24
}
connections {
startIdx: 22
endIdx: 23
id: 25
}
connections {
startIdx: 23
endIdx: 24
id: 26
}
connections {
startIdx: 24
endIdx: 25
id: 27
}
connections {
startIdx: 25
endIdx: 26
id: 28
}
connections {
startIdx: 26
endIdx: 27
id: 29
}
connections {
startIdx: 5
endIdx: 27
id: 30
}
connections {
startIdx: 26
endIdx: 28
id: 31
}
connections {
startIdx: 8
endIdx: 28
id: 32
}
connections {
startIdx: 27
endIdx: 30
id: 33
}
connections {
startIdx: 28
endIdx: 29
id: 34
}
connections {
startIdx: 30
endIdx: 31
id: 35
}
connections {
startIdx: 31
endIdx: 32
id: 36
}
connections {
startIdx: 31
endIdx: 33
id: 37
}
connections {
startIdx: 4
endIdx: 32
id: 38
}
connections {
startIdx: 3
endIdx: 33
id: 39
}
connections {
startIdx: 32
endIdx: 34
id: 40
}
connections {
startIdx: 33
endIdx: 35
id: 41
}
connections {
startIdx: 20
endIdx: 36
id: 42
}
connections {
startIdx: 15
endIdx: 36
id: 43
}
connections {
startIdx: 22
endIdx: 37
id: 44
}
connections {
startIdx: 13
endIdx: 37
id: 45
}
connections {
startIdx: 25
endIdx: 38
id: 46
}
connections {
startIdx: 9
endIdx: 38
id: 47
}
connections {
startIdx: 28
endIdx: 39
id: 48
}
connections {
startIdx: 2
endIdx: 39
id: 49
}
connections {
startIdx: 33
endIdx: 40
id: 50
}
connections {
startIdx: 2
endIdx: 40
id: 51
}
connections {
startIdx: 32
endIdx: 41
id: 52
}
connections {
startIdx: 2
endIdx: 41
id: 53
}
The following graph shows the kernels and buffers corresponding to the dataflow graph above.
The system is composed of 1024 SN30 RDUs. Each RDU has more than 600 MB of SRAM and more than 600 TFLOPS of compute power, and each chip is attached to 1 TB of DDR memory. The interconnection topology is an 8x128 2D torus with PCIe links connecting both dimensions of the hierarchical topology. The parallelization strategy is `TP` (tensor parallelism) on the dimension with degree 8 and `PP` (pipeline parallelism) on the dimension with degree 128. The following code shows the system setup for the described system.
system {
num_chip: 1024
accelerator {
core: 1280
systolic_width: 32
systolic_height: 6
sram_cap: 671088640.0
freq: 1.25
}
r_r {
x: 8
y: 128
link_bw_x: 50.0
link_bw_y: 50.0
par_x: "TP"
par_y: "PP"
}
memory {
dram_bw: 300.0
dram_cap: 1099511600000.0
}
}
The following image shows the system setup.
We set up the cost of the system, which includes both dollar price and power consumption. We need to set up the unit price/power for the interconnection links, switches, and DRAM, as well as the cost of the accelerator silicon. The following code shows an example.
cost {
accelerator_price: 16522.25023
accelerator_power: 444.7061955
dram_unit_price: 1.0
dram_unit_power: 0.16248
link_unit_price: 2.0
link_unit_power_x: 0.052
link_unit_power_y: 0.052
switch_unit_price: 24.0
switch_unit_power: 0.052
}
Since the RDU is a dataflow architecture, we set the execution style to `DATAFLOW`, which tells DFModel to automatically fuse kernels, and we set the overlap flag to true, which tells DFModel to overlap compute/memory/network latencies. We also set workload-dependent parameters such as the global batch size, which is 3072 according to the workload specification. The following code shows the setup.
execution {
execution_style: DATAFLOW
num_config: 6
perfect_overlap: true
compute_util: 0.9
}
We set up Gurobi-related variables including the number of threads, MIPGap, and runtime. The following code shows an example.
gurobi {
thread: 144
gap: 0.001
time: 1200
}
The output dataflow graph shows the mapping, and the output text log shows various results like the following.
System Cost 17757184.0
System Power 511325.01135492325
Workload FLOP 3.85072240461978e+19
System FLOPS Utilization 0.7748055154407518
Optimizer Runtime (s) 262.675754070282
There are a total of six configurations, and the following output graph shows six clusters of kernels.