Hardware Language
To optimize and generate code for a specific system, e.g. a subset of nodes from a cluster or a workstation, knowledge about the system's architecture is necessary. A source-to-source compiler does not always run on the machine where the generated code executes. Furthermore, information about the network is not necessarily available to the operating system. Thus, information about the machines and the network (the system) must be provided in another way.
The hardware language (HL) is a JSON file containing a set of parameters describing the properties of the system. An example of this file can be found in the benchmark folder under cluster.
The network is the outermost part of the HL. The parameters for the network are the following:
The topology of the network is a string briefly describing the network topology of the cluster.
The connectivity-bandwidth is a matrix M containing the bandwidth between all nodes within the network. Let n_0,...,n_k be the nodes, where the indices follow the order in which the nodes are defined in the network. The matrix elements M_j_i (from j to i) and M_i_j (from i to j) are the bandwidths in ./s between the nodes n_i and n_j with 0 <= i,j <= k. The rows describe the sending nodes and the columns the receiving nodes. The values on the diagonal are ignored. An example of the connectivity-bandwidth for a system with 4 nodes can be seen below.
"connectivity-bandwidth": [
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"]
]
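The row/column convention can be illustrated with a minimal Python sketch. It assumes the entries are numeric strings, as in the example above; the helper names `parse_matrix` and `bandwidth` are hypothetical and not part of the HL or its tooling.

```python
import json

def parse_matrix(raw):
    """Convert a matrix of numeric strings to floats and check that it is square."""
    matrix = [[float(value) for value in row] for row in raw]
    size = len(matrix)
    for row in matrix:
        if len(row) != size:
            raise ValueError("connectivity matrix must be square")
    return matrix

def bandwidth(matrix, sender, receiver):
    """Return the entry for sender -> receiver (row = sender, column = receiver)."""
    if sender == receiver:
        raise ValueError("diagonal entries are ignored")
    return matrix[sender][receiver]

# Hypothetical network fragment holding only the bandwidth matrix.
network = json.loads("""
{
  "connectivity-bandwidth": [
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"]
  ]
}
""")

M = parse_matrix(network["connectivity-bandwidth"])
print(bandwidth(M, 0, 2))  # prints 200.0 (node 0 sends, node 2 receives)
```

Note that the values are stored as strings in the file, so a consumer has to convert them before doing arithmetic.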
The connectivity-latency is a matrix M containing the latency between all nodes within the network. Let n_0,...,n_k be the nodes, where the indices follow the order in which the nodes are defined in the network. The matrix elements M_j_i (from j to i) and M_i_j (from i to j) are the latencies in µs between the nodes n_i and n_j with 0 <= i,j <= k. The rows describe the sending nodes and the columns the receiving nodes. The values on the diagonal are ignored. An example of the connectivity-latency for a system with 4 nodes can be seen below.
"connectivity-latency": [
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"]
]
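One way such latency and bandwidth values could be combined is a simple latency-plus-bandwidth cost model. The sketch below is an assumption for illustration only: the source does not state how the compiler uses these matrices, and the function name `transfer_time_us` is hypothetical. The message size must be given in the same unit as the numerator of the bandwidth values.

```python
def transfer_time_us(latency_us, bandwidth_per_s, message_size):
    """Estimate a transfer time in microseconds as latency + size / bandwidth."""
    if bandwidth_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_us + (message_size / bandwidth_per_s) * 1_000_000

raw_latency = [
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
]
raw_bandwidth = [
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
    ["0", "100", "200", "300"],
]

# The HL stores numbers as strings, so convert before computing.
latency = [[float(v) for v in row] for row in raw_latency]
bandwidth = [[float(v) for v in row] for row in raw_bandwidth]

sender, receiver = 0, 1  # row = sender, column = receiver
estimate = transfer_time_us(
    latency[sender][receiver], bandwidth[sender][receiver], message_size=50
)
print(estimate)  # prints 500100.0
```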
The network contains a list of nodes defined by the keyword "nodes". The order of the nodes in this file defines the ranking of the nodes, e.g. the first node defined gets MPI rank 0, the second rank 1, etc.
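As a rough illustration of this ordering, the following Python sketch lists one address per node in definition order, which is the usual one-host-per-line layout of an MPI machinefile. It relies on the "address" parameter of each node; the function name `machinefile_lines` and the second host name are hypothetical, and how ranks are actually placed also depends on the MPI launcher and its slot settings.

```python
def machinefile_lines(network):
    """Return one address per node, in the order the nodes are defined."""
    return [node["address"] for node in network["nodes"]]

network = {
    "nodes": [
        {"identifier": "Node1", "address": "login18-1.hpc.itc.rwth-aachen.de"},
        {"identifier": "Node2", "address": "node2.example.org"},  # hypothetical host
    ]
}

lines = machinefile_lines(network)
for rank, address in enumerate(lines):
    # The position in the list corresponds to the rank described above.
    print(rank, address)
```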
The node describes a single machine within the system. The parameters are defined as follows:
The identifier, keyword "identifier", is a unique string for each node. The identifier is used to name the node.
"identifier": "Node1"
The address, keyword "address", is a unique string for each node. The address defines where the specified node is located. The address must be understood by an MPI machinefile.
"address": "192.165.0.1.125"
or
"address": "login18-1.hpc.itc.rwth-aachen.de"
The first two parameters, identifier and address, must always be defined for each node individually. For all other parameters a template, keyword "template", can be defined. The template is a path to a predefined instantiation of the node. The path to the template file is relative to the file referencing the template. If a template file is provided within a folder "SampleFolder" within the same directory, you only need to specify the "SampleFolder" as your path. This is useful when defining a hardware language file with multiple instances of the same machine. If a template is defined, all following parameters must be specified within the template.
"template": "./PathToMyTemplate/SampleNode.json"
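The relative-path rule can be sketched in Python as follows. This is a minimal sketch under assumptions: the template file is taken to be a JSON object holding the remaining parameters, and merging it with the per-node keys is a guess at the intended behaviour, not the documented implementation. The function name `resolve_template` and the file name `cluster.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def resolve_template(entry, referencing_file):
    """Merge the parameters of a template file into the entry that references it."""
    if "template" not in entry:
        return dict(entry)
    # The template path is relative to the file that references it.
    template_path = Path(referencing_file).parent / entry["template"]
    resolved = json.loads(template_path.read_text())
    # Per-instance keys (identifier, address) come from the referencing entry.
    resolved.update({k: v for k, v in entry.items() if k != "template"})
    return resolved

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "PathToMyTemplate").mkdir()
    template_file = root / "PathToMyTemplate" / "SampleNode.json"
    template_file.write_text(json.dumps({"type": "numa"}))

    node = {
        "identifier": "Node1",
        "address": "login18-1.hpc.itc.rwth-aachen.de",
        "template": "./PathToMyTemplate/SampleNode.json",
    }
    # cluster.json stands for the HL file that contains the node entry.
    resolved = resolve_template(node, root / "cluster.json")
    print(resolved["identifier"], resolved["type"])  # prints: Node1 numa
```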
The type, keyword "type", of a machine defines some general properties of the machine, e.g. whether it is a NUMA node.
"type": "numa"
The connectivity-bandwidth is a matrix M containing the bandwidth between all devices within the node. Let n_0,...,n_k be the devices, where the indices follow the order in which the devices are defined in the node. The matrix elements M_j_i (from j to i) and M_i_j (from i to j) are the bandwidths in ./s between the devices n_i and n_j with 0 <= i,j <= k. The rows describe the sending devices and the columns the receiving devices. The values on the diagonal are ignored. An example of the connectivity-bandwidth for a system with 4 devices can be seen below.
"connectivity-bandwidth": [
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"]
]
The connectivity-latency is a matrix M containing the latency between all devices within the node. Let n_0,...,n_k be the devices, where the indices follow the order in which the devices are defined in the node. The matrix elements M_j_i (from j to i) and M_i_j (from i to j) are the latencies in µs between the devices n_i and n_j with 0 <= i,j <= k. The rows describe the sending devices and the columns the receiving devices. The values on the diagonal are ignored. An example of the connectivity-latency for a system with 4 devices can be seen below.
"connectivity-latency": [
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"],
["0", "100", "200", "300"]
]
The devices, keyword "devices", specify the devices within a node. They are defined as a JSON Array that contains the definitions of the individual devices. The devices, especially GPUs, must be in the same order as the CUDA runtime defines them; otherwise the execution could target the wrong device.
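To show why the order matters, the Python sketch below derives a device ordinal for each GPU from its position among the GPUs in the "devices" array. This is an assumption for illustration: the source only states that the order must match the CUDA runtime, not how the ordinal is computed, and both the type string "GPU" and the function name `cuda_ordinals` are hypothetical.

```python
def cuda_ordinals(devices):
    """Map each GPU identifier to its position among the GPUs, in definition order."""
    ordinals = {}
    for device in devices:
        if device["type"] == "GPU":
            ordinals[device["identifier"]] = len(ordinals)
    return ordinals

devices = [
    {"identifier": "CPU1", "type": "CPU"},
    {"identifier": "GPU1", "type": "GPU"},
    {"identifier": "GPU2", "type": "GPU"},
]

ordinals = cuda_ordinals(devices)
print(ordinals)  # prints {'GPU1': 0, 'GPU2': 1}
```

Swapping the two GPU entries in the file would swap their ordinals, which is how a mismatch with the runtime's enumeration could send work to the wrong device.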
The device describes a single compute device, e.g. a CPU or a GPU. The parameters are defined as follows:
The identifier, keyword "identifier", is a string for each device. It is unique within each node. The identifier is used to name the device.
"identifier": "CPU1"
The first parameter, identifier, must always be defined for each device individually. For all other parameters a template, keyword "template", can be defined. The template is a path to a predefined instantiation of a device. The path to the template file is relative to the file referencing the template. If a template file is provided within a folder "SampleFolder" within the same directory, you only need to specify the "SampleFolder" as your path. This is useful when defining a hardware language file with multiple instances of the same device. If a template is defined, all following parameters must be specified within the template.
"template": "./PathToMyTemplate/SampleCPU.json"
The type, keyword "type", of a device defines what kind of device is specified, e.g. a CPU, a GPU or an FPGA.
"type": "CPU"
The latency, keyword "latency", specifies the latency to the main memory connected to the device in µs.
"latency": "500"
The bandwidth, keyword "bandwidth", specifies the bandwidth to the main memory connected to the device in ./s.
"bandwidth": "200000"
The maximum bandwidth, keyword "max-bandwidth", specifies the highest possible bandwidth to the main memory connected to the device in ./s. This bandwidth considers the utilization of multiple memory controllers.
"max-bandwidth": "200000"
The memory size, keyword "size", specifies the size of the main memory connected to the device in bytes.
"size": "128000"
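The four main-memory parameters of a device can be sanity-checked with a short Python sketch. Treating all four keys as required and requiring that "bandwidth" does not exceed "max-bandwidth" are assumptions derived from the descriptions above; the function name `check_memory_parameters` is hypothetical.

```python
MEMORY_KEYS = ("latency", "bandwidth", "max-bandwidth", "size")

def check_memory_parameters(device):
    """Return a list of problems found in a device's main-memory parameters."""
    problems = [f"missing {key}" for key in MEMORY_KEYS if key not in device]
    # Values are strings in the HL, so convert before comparing.
    if not problems and float(device["bandwidth"]) > float(device["max-bandwidth"]):
        problems.append("bandwidth exceeds max-bandwidth")
    return problems

device = {
    "identifier": "CPU1",
    "type": "CPU",
    "latency": "500",
    "bandwidth": "200000",
    "max-bandwidth": "200000",
    "size": "128000",
}

print(check_memory_parameters(device))  # prints []
```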
The cache groups, keyword "cache-group", are defined by a JSON Array containing cache groups. The Array is defined in ascending order: the first group is mapped to the first set of cores on the device.
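The mapping of cache groups to cores can be sketched in Python as consecutive core ranges. The sketch uses the "cores" parameter of each cache group (the number of cores in the group); zero-based, contiguous core numbering and the function name `core_ranges` are assumptions made for illustration.

```python
def core_ranges(cache_groups):
    """Assign consecutive core index ranges to cache groups in definition order."""
    ranges = {}
    first = 0
    for group in cache_groups:
        count = int(group["cores"])  # values are strings in the HL
        ranges[group["identifier"]] = range(first, first + count)
        first += count
    return ranges

cache_groups = [
    {"identifier": "1", "cores": "24"},
    {"identifier": "2", "cores": "24"},
]

ranges = core_ranges(cache_groups)
for identifier, cores in ranges.items():
    print(identifier, cores.start, cores.stop - 1)
# prints:
# 1 0 23
# 2 24 47
```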
Cache Group
The cache group describes a single team of threads sharing the highest level cache. The parameters are defined as follows:
Identifier
The identifier, keyword "identifier", is a string for each cache group. It is unique within each device. The identifier is used to name the cache group.
"identifier": "1"
Templates
The first parameter, identifier, must always be defined for each cache group individually. For all other parameters a template, keyword "template", can be defined. The template is a path to a predefined instantiation of a cache group. The path to the template file is relative to the file referencing the template. If a template file is provided within a folder "SampleFolder" within the same directory, you only need to specify the "SampleFolder" as your path. This is useful when defining a hardware language file with multiple instances of the same cache group. If a template is defined, all following parameters must be specified within the template.
"template": "./PathToMyTemplate/SampleCG.json"
Cores
The cores, keyword "cores", specify the number of cores that are available in the cache group.
"cores": "24"
Frequency
The frequency, keyword "frequency", specifies the clock frequency of the cache group in MHz.
"frequency": "2400"
Arithmetic-Units
The arithmetic-units, keyword "arithmetic-units", specifies the number of arithmetic units per core.
"arithmetic-units": "4"
Warp-Size
The warp-size, keyword "warp-size", specifies the number of threads per warp for GPUs.
"warp-size": "32"
Vectorization
The vectorization, keyword "vectorization", specifies how the cache group can vectorize the execution (the size of the SIMD register), e.g., avx512.
"vectorization": "avx512"
Hyper-Threading
The hyper-threads, keyword "hyper-threads", specify the number of possible hyperthreads per core.
"hyper-threads": "2"
Caches
The caches, keyword "caches", are defined by a JSON Array containing caches. The Array is defined in ascending order: the closest cache is defined first.
Cache
The cache specifies an individual cache level on the device.
- Latency
The latency, keyword "latency", specifies the latency to this cache in µs.
"latency": "500"
- Bandwidth
The bandwidth, keyword "bandwidth", specifies the bandwidth to this cache in ./s.
"bandwidth": "200000"
- Memory Size
The memory size, keyword "size", specifies the size of this cache in bytes.
"size": "128000"
- Cache sharing
The cache sharing, keyword "sharing", specifies the number of cores sharing this cache.
"sharing": "2"
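Putting the cache group parameters together, the Python sketch below parses one cache group and derives two quantities from it: the number of hardware threads and the number of instances of each cache level. The nesting follows the keywords described above, but the concrete object and the derived formulas (cores times hyper-threads, cores divided by sharing) are assumptions for illustration, not taken from the HL tooling.

```python
import json

# Hypothetical cache group assembled from the example values above.
cache_group = json.loads("""
{
  "identifier": "1",
  "cores": "24",
  "frequency": "2400",
  "arithmetic-units": "4",
  "vectorization": "avx512",
  "hyper-threads": "2",
  "caches": [
    {"latency": "500", "bandwidth": "200000", "size": "128000", "sharing": "2"}
  ]
}
""")

# All numbers are stored as strings, so convert them first.
cores = int(cache_group["cores"])
hardware_threads = cores * int(cache_group["hyper-threads"])
print(hardware_threads)  # prints 48

# The closest cache comes first, so enumerate levels starting at 1.
instances_per_level = {}
for level, cache in enumerate(cache_group["caches"], start=1):
    instances_per_level[level] = cores // int(cache["sharing"])
print(instances_per_level)  # prints {1: 12}
```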