This document is a HOWTO for enabling Kubernetes support for MPI over InfiniBand in JARVICE jobs. Note that the referenced components are managed by their respective third-party maintainers.
Assumptions:
- working Kubernetes cluster with a functional primary CNI
- required drivers already installed for the InfiniBand host adapter
- an existing IPoIB interface is up and active on the hosts
- this has only been tested with Mellanox InfiniBand cards and Mellanox OFED
- Multus project will be utilized to add the secondary IPoIB interface
- ipoib-cni from Mellanox
- Whereabouts for IPAM
- Install Multus by applying multus-daemonset-thick-plugin.yml.
kubectl apply -f multus-daemonset-thick-plugin.yml
- Deploy the ipoib-cni.
kubectl apply -f ipoib-cni-daemonset.yaml
- Install the Whereabouts IPAM plugin.
git clone https://github.com/k8snetworkplumbingwg/whereabouts && cd whereabouts
kubectl apply \
-f doc/crds/daemonset-install.yaml \
-f doc/crds/whereabouts.cni.cncf.io_ippools.yaml \
-f doc/crds/whereabouts.cni.cncf.io_overlappingrangeipreservations.yaml
- Define the network for IPoIB with a NetworkAttachmentDefinition. Modify the IP range and master interface as required.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: jarvice-ipoib
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "ipoib",
      "name": "jarvice-ipoib",
      "master": "ib0",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.0.0/24"
      }
    }'
- Validate that containers can launch and attach this secondary network by running the following pod:
apiVersion: v1
kind: Pod
metadata:
  name: ipoib-test
  annotations:
    k8s.v1.cni.cncf.io/networks: jarvice-ipoib
spec:
  restartPolicy: OnFailure
  containers:
  - image: us-docker.pkg.dev/jarvice/images/ubuntu-desktop:bionic
    name: ipoib-test
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        jarvice.com/rdma: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000
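Once the test pod is running, the attachment can also be checked from outside the pod. The following is a minimal sketch rather than part of the upstream procedure. It assumes that Multus gives the first secondary interface its default name, net1, and that the pod is named ipoib-test as in the manifest above. The helper names pod_interfaces and pod_has_interface are hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that a pod received the secondary IPoIB interface.
# Assumption: Multus uses its default name (net1) for the first extra interface.

# Print the network interfaces visible inside the pod, one per line.
pod_interfaces() {
    kubectl exec "$1" -- ls /sys/class/net
}

# Succeed only if the pod ($1) has the given interface ($2, default: net1).
pod_has_interface() {
    pod_interfaces "$1" | grep -qxF "${2:-net1}"
}

# Example usage, after applying the pod manifest:
#   kubectl wait --for=condition=Ready pod/ipoib-test --timeout=120s
#   pod_has_interface ipoib-test && echo "IPoIB interface attached"
```

If the check fails, the pod events shown by kubectl describe pod ipoib-test are usually the first place to look for CNI or IPAM errors.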
When using rootless applications, which is the default with appdef V2, init executions do not possess sufficient privileges to set the memlock rlimit to unlimited, which causes InfiniBand usage to fail.
Two possibilities are available to bypass this issue:
- The cluster administrator forces unlimited memlock rlimits at the docker / containerd level.
  - For docker, append --default-ulimit memlock=-1:-1 to the ExecStart=/usr/bin/dockerd line in the docker service file, then reload systemd (systemctl daemon-reload) and restart the docker service.
  - For containerd, edit the service file and add the line LimitMEMLOCK=infinity under the [Service] section, then reload systemd (systemctl daemon-reload) and restart the containerd service.
- Or allow apps to run init as root before dropping to an unprivileged user. To do so, set the JARVICE_APP_ALLOW_ROOT_INIT value to true in override.yaml. This will unlock root usage for appdef V2 apps, which will set the memlock rlimit to unlimited before MPI execution.
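Whichever option is chosen, it may help to confirm the resulting limit from inside a container before running MPI. The sketch below is an illustration under stated assumptions, not a JARVICE tool. The function names memlock_ok and check_memlock are hypothetical, and the check relies only on the shell built-in ulimit -l, which prints either a size or the word unlimited.

```shell
#!/bin/sh
# Sketch: check whether the locked-memory (memlock) limit is unlimited.

# Succeed if the value, as printed by `ulimit -l`, means "no limit".
memlock_ok() {
    case "$1" in
        unlimited) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the memlock soft limit of the current shell.
check_memlock() {
    limit="$(ulimit -l)"
    if memlock_ok "$limit"; then
        echo "memlock: unlimited"
    else
        echo "memlock: limited to ${limit}; InfiniBand usage may fail" >&2
        return 1
    fi
}

# Example usage from outside the pod (assumes a pod named ipoib-test):
#   kubectl exec ipoib-test -- sh -c 'ulimit -l'
```

A result other than unlimited suggests that neither the runtime-level setting nor the root init override has taken effect for that container.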