This is a collection of Ansible playbooks that simplify the deployment and management of NVIDIA BlueField-2 DPUs.
- DPU DevOps Kit
- Quick Start
- Playbook Descriptions
  - `bmc-install.yml`
  - `doca-setup.yml`
  - `dpdk-setup.yml`
  - `poc-dhcp-server.yml`
  - `poc-doca-ar-container.yml`
  - `poc-doca-devel-container.yml`
  - `poc-doca-ips-container.yml`
  - `poc-doca-telemetry-container.yml`
  - `poc-doca-url-filter-container.yml`
  - `poc-doca-firefly.yml`
  - `poc-doca-dpu-morpheus.yml`
  - `poc-doca-hbn.yml`
  - `poc-embedded-mode.yml`
  - `poc-grafana.yml`
  - `poc-host-restricted-disable.yml`
  - `poc-host-restricted-enable.yml`
  - `poc-ktls.yml`
  - `poc-link-type-ethernet.yml`
  - `poc-link-type-infiniband.yml`
  - `poc-reinstall-bfb.yml`
  - `poc-reset-ovs.yml`
  - `poc-separated-mode.yml`
  - `poc-sshkeys.yml`
  - `poc-test-inventory.yml`
- Using the DevOps Kit
- Troubleshooting the DPU-PoC-Kit
- Clone this repo to a host with Ansible 2.12.x or later, or follow the Automation Container instructions below.
- Edit the `hosts` file:
  - Set `ansible_user` and `ansible_password` to the username and password on the x86 and DPU endpoints.
  - Set `x86 ansible_host=` to the IP of the x86 server.
  - Set `dpu_oob ansible_host=` to the IP of the DPU OOB interface.
- Run `ansible-playbook doca-setup.yml`.
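Taken together, those Quick Start edits leave the `hosts` inventory looking something like this sketch (group and variable names follow the examples later in this README; all values are placeholders):

```
[x86_hosts]
x86 ansible_host=<your x86 IP address> ansible_user=<x86 username> ansible_password=<x86 password>

[dpus]
dpu_oob ansible_host=<your DPU OOB IP address>

[dpu:vars]
ansible_user=<DPU username>
ansible_password=<DPU password>
```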
The DevOps Kit installs the following operating systems on the DPU:
- Ubuntu 20.04 (DOCA 1.1.x - 1.5.x)
The DevOps Kit has been tested on the following x86 / host platforms:
- Ubuntu 20.04 (DOCA 1.1.x - 1.5.x)
- Ubuntu 22.04 (DOCA 1.5.x)
- Debian 10.8 (DOCA 1.5.x)
- Red Hat Enterprise Linux 8.2 (DOCA 1.2.x)
- Rocky Linux 8.6 (DOCA 1.5)
- CentOS 7.9.x (DOCA 1.2.x)
One great piece of feedback we have received is that installing Ansible on various operating systems can lead to version mismatches that are difficult for new Ansible users to troubleshoot and debug. To help resolve this issue, the next few steps outline how to install an "automation" container which has all of the needed dependencies for you to successfully launch the DevOps Kit.
Docker (Linux, Mac, Windows)
- Follow this link to the install instructions for your platform.
- Pull the container from Docker Hub with the following command:

  ```
  sudo docker pull ipspace/automation:ubuntu
  ```
- Run the container with the following command:

  ```
  sudo docker run -it -d ipspace/automation:ubuntu
  ```
- Next, log into the container with the following command:

  ```
  sudo docker exec -it $(sudo docker ps | grep -i automation | awk -F" " '{print $1}') bash
  ```

  You will see the prompt change to something similar to the following:

  ```
  root@032f1ada86f4:/ansible#
  ```
- Clone the DPU DevOps Kit with the following command:

  ```
  git clone https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit/
  ```

  You will see the following output:

  ```
  root@032f1ada86f4:/ansible# git clone https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit/
  Cloning into 'dpu-poc-kit'...
  warning: redirecting to https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit.git/
  remote: Enumerating objects: 614, done.
  remote: Counting objects: 100% (193/193), done.
  remote: Compressing objects: 100% (133/133), done.
  remote: Total 614 (delta 93), reused 75 (delta 34), pack-reused 421
  Receiving objects: 100% (614/614), 1.95 MiB | 17.25 MiB/s, done.
  Resolving deltas: 100% (248/248), done.
  ```
- Change directories into the DevOps Kit:

  ```
  cd dpu-poc-kit
  ```
- Use vim or nano to edit the `hosts` file in this directory.

  Change the following settings for the x86:

  ```
  ansible_user=<your x86 username>
  ansible_password=<your x86 user password>
  ansible_sudo_pass=<your x86 sudo password>
  x86 ansible_host=<your x86 IP address>
  ```

  Change the following settings for the DPU:

  ```
  dpu_oob ansible_host=
  ```

  Under the `[dpu:vars]` heading, uncomment and change the following:

  ```
  ansible_user=ubuntu
  ansible_password=ubuntu
  ```
- Test Ansible connectivity with the following command:

  ```
  ansible all -m ping --become
  ```
  This should produce an output similar to the following:

  ```
  x86 | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python3.8"
      },
      "changed": false,
      "ping": "pong"
  }
  dpu_oob | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python"
      },
      "changed": false,
      "ping": "pong"
  }
  ```
- Run the appropriate playbook as outlined in the rest of this README file.
Lima (Open Source Docker replacement for Mac)
- This is a nice overview of Lima with install instructions.
- Start Lima with the following command:

  ```
  limactl start
  ```

  You will see output similar to the following:

  ```
  INFO[0000] Using the existing instance "default"
  INFO[0000] Attempting to download the nerdctl archive from "https://github.com/containerd/nerdctl/releases/download/v0.18.0/nerdctl-full-0.18.0-linux-amd64.tar.gz" digest="sha256:62573b9e3bca6794502ad04ae77a2b12ec80aeaa21e8b9bbc5562f3e6348eb66"
  INFO[0000] Using cache "/Users/mcourtney/Library/Caches/lima/download/by-url-sha256/542daec4b5f8499b1c78026d4e3a57cbe708359346592395c9a20c38571fc756/data"
  INFO[0002] [hostagent] Starting QEMU (hint: to watch the boot progress, see "/Users/mcourtney/.lima/default/serial.log")
  INFO[0002] SSH Local Port: 60022
  ```
- Download the container with the following command:

  ```
  lima nerdctl pull ipspace/automation:ubuntu
  ```
- Run and log into the container with the following command:

  ```
  lima nerdctl run -it ipspace/automation:ubuntu
  ```

  You will see the prompt change to something similar to the following:

  ```
  root@032f1ada86f4:/ansible#
  ```
- Clone the DPU DevOps Kit with the following command:

  ```
  git clone https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit/
  ```

  You will see the following output:

  ```
  root@032f1ada86f4:/ansible# git clone https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit/
  Cloning into 'dpu-poc-kit'...
  warning: redirecting to https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit.git/
  remote: Enumerating objects: 614, done.
  remote: Counting objects: 100% (193/193), done.
  remote: Compressing objects: 100% (133/133), done.
  remote: Total 614 (delta 93), reused 75 (delta 34), pack-reused 421
  Receiving objects: 100% (614/614), 1.95 MiB | 17.25 MiB/s, done.
  Resolving deltas: 100% (248/248), done.
  ```
- Change directories into the DevOps Kit:

  ```
  cd dpu-poc-kit
  ```
- Use vim or nano to edit the `hosts` file in this directory.

  Change the following settings for the x86:

  ```
  ansible_user=<your x86 username>
  ansible_password=<your x86 user password>
  ansible_sudo_pass=<your x86 sudo password>
  x86 ansible_host=<your x86 IP address>
  ```

  Change the following settings for the DPU:

  ```
  dpu_oob ansible_host=
  ```

  Under the `[dpu:vars]` heading, uncomment and change the following:

  ```
  ansible_user=ubuntu
  ansible_password=ubuntu
  ```
- Test the inventory with the following playbook:

  ```
  ansible-playbook poc-test-inventory.yml
  ```
  This should produce an output similar to the following:

  ```
  x86 | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python3.8"
      },
      "changed": false,
      "ping": "pong"
  }
  dpu_oob | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python"
      },
      "changed": false,
      "ping": "pong"
  }
  ```
- Run the appropriate playbook as outlined in the rest of this README file.
Other examples using tools such as Podman are welcome.
Some of the BlueField-2 cards have "unsigned" or development images that are available to internal NVIDIA resources. Currently, the only way to check whether a card is signed or unsigned/development is to run the following command from the DPU:

```
sudo mlxbf-bootctl
```

A signed card will report:

```
lifecycle state: GA Secured
```

An unsigned/development card will report:

```
lifecycle state: GA Non-Secured
```

or

```
lifecycle state: Secured (development)
```
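If you want to script that check, a small helper along these lines works; note that `classify_lifecycle` is a hypothetical function written for this README, not part of mlxbf-bootctl or the kit:

```shell
# Hypothetical helper: classify a BlueField-2 card from the
# "lifecycle state" line that `sudo mlxbf-bootctl` prints.
classify_lifecycle() {
    case "$1" in
        *"GA Non-Secured"*)        echo "unsigned / development" ;;
        *"Secured (development)"*) echo "unsigned / development" ;;
        *"GA Secured"*)            echo "signed" ;;
        *)                         echo "unknown" ;;
    esac
}

# Example usage on the DPU (assumes mlxbf-bootctl is installed):
# classify_lifecycle "$(sudo mlxbf-bootctl | grep -i 'lifecycle state')"
```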
There are two ways to configure the DOCA image via the DevOps Kit. First, these variables specify the install source:

- `doca_bfb` - the file name of the DOCA image
- `bfb_download_url` - the full download URL, including the file name

Here is an example of what the command line would look like for a "signed" image install of DOCA 1.4:

```
ansible-playbook doca-setup.yml -e "doca_bfb='DOCA_1.4.0_BSP_3.9.2_Ubuntu_20.04-4.signed.bfb' bfb_download_url='http://www.mellanox.com/downloads/BlueField/BFBs/Ubuntu20.04/DOCA_1.4.0_BSP_3.9.2_Ubuntu_20.04-4.signed.bfb'"
```

Second, you can add the following to the x86 inventory item:

```
board=dev
```

The URL and file name for the development BFB are defined in `group_vars > all > main.yml`.
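Conceptually, the relevant entries in `group_vars > all > main.yml` look like this sketch (placeholder values only; the real development file name and URL are defined in the kit and not reproduced here):

```
doca_bfb: "<development BFB filename>"
bfb_download_url: "<development URL>/<development BFB filename>"
```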
This collection of playbooks provides the following utilities in the form of Ansible roles. More details on each role can be found in its individual README located within the `roles/` directory. A set of pre-defined playbooks is provided in this root directory.
This playbook installs the DOCA image over the BMC rshim on compatible devices. It runs the same plays as `doca-setup.yml`, except the first task pushes the BFB over the BMC rshim.
This is the playbook to get an x86 host and DPU fully ready to run DOCA applications. It:

- Installs DOCA software on both x86 and DPU
- Installs the DPU BFB image (if `bfb_install` is set to `true`)
- Sets the x86 rshim IP address
- Updates the DPU firmware
- Properly configures networking on the DPU and x86 host
- Installs packages to improve the user experience on both x86 and DPU
- Reboots the x86 host
This playbook supports two optional arguments that can be passed as Ansible extra-vars:

- `x86_reboot` - set `x86_reboot=false` to skip the server reboot at the end of the playbook. Default is `true`.
- `bfb_install` - set `bfb_install=true` to have the playbook install the DPU BFB image.

These arguments are passed with the `-e` flag:

```
ansible-playbook doca-setup.yml -e x86_reboot=false -e bfb_install=true
```
Sets up the environment for DPDK and checks whether the hardware can be initialized via testpmd. The DPDK libraries and testpmd need to be installed separately or via doca-setup.
Configures an Ubuntu server to be a DHCP server with ISC-DHCP. This is a basic configuration designed to help with POC/lab environments.
Deploys 3x VFs on the host and all of the NGC containers on the DPU. Will prompt for NGC credentials and org if NGC is not installed and configured.
Deploys the Application Recognition container from NGC. Will prompt for NGC credentials and org if NGC is not installed and configured.
DOCA Application Recognition Container README
More info here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/doca/containers/doca_application_recognition
Deploys the DOCA development container from NGC. Will prompt for NGC credentials and org if NGC is not installed and configured.
DOCA Development Container README
More info here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/doca/containers/doca
Deploys the IPS container from NGC. Will prompt for NGC credentials and org if NGC is not installed and configured.
More info here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/doca/containers/doca_ips
Deploys the DOCA Telemetry container from NGC. Will prompt for NGC credentials and org if NGC is not installed and configured.
DOCA Telemetry Container README
More info here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/doca/containers/doca_telemetry
Deploys the URL Filter container from NGC. Will prompt for NGC credentials and org if NGC is not installed and configured.
DOCA URL Filter Container README
More info here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/doca/containers/doca_url_filter
Enables embedded mode on the DPU.
Deploys the DOCA Firefly PTP container. This changes your DPU to separated host mode and is not currently compatible with other embedded-mode DPU use cases and functions.
Deploys the DOCA flow inspector and DOCA telemetry service, pointing to a Morpheus AI Engine using fluentd.
Deploys the DOCA HBN (Host Based Networking) Service on the DPU. DOCA HBN provides classic top of rack (ToR) routing/switching/network overlay capabilities on the DPU for the host.
Configures and deploys the Grafana Cloud monitoring agent onto the DPU.
Disables restricted mode on the DPU.
Disable Host Restricted Mode README
Enables restricted mode on the DPU.
Enable Host Restricted Mode README
Builds and installs OpenSSL and the associated kTLS-enabled applications for demonstrating kTLS offload.
Enables the Ethernet link type for the DPU.
Enables the InfiniBand link type for the DPU.
Disables NIC mode / ConnectX mode on the DPU.
Disables NIC mode / ConnectX mode on the DPU.
Installs a fresh BFB image, networking, and utility software on the DPU.
Deletes, re-adds, and resets Open vSwitch (OVS) on the DPU. This is an easier step than reinstalling the BFB.
Enables separated mode on the DPU.
Installs SSH keys on the DPU and x86 host for passwordless authentication.
Create SSH Keys README
Install SSH Keys README
This will test connectivity to all of the elements in the inventory file.
Identify a host that will run Ansible. This can be an external host or the x86 host with the DPU. Run the following steps on that selected server.
- Install sshpass:

  ```
  sudo apt-get install sshpass
  ```
- Run the following command on the Ansible server to download this repo:

  ```
  git clone https://gitlab.com/nvidia/networking/bluefield/dpu-poc-kit
  ```
- Change directories into the dpu-poc-kit directory:

  ```
  cd dpu-poc-kit
  ```
- Create a Python virtual environment:

  ```
  python3 -m venv venv
  source venv/bin/activate
  ```

  Note: if you need to re-activate the virtual environment later, use the following command:

  ```
  source venv/bin/activate
  ```
- Install Ansible:

  ```
  pip3 install --upgrade pip
  pip3 install setuptools-rust
  python3 -m pip install ansible paramiko
  ```
- Update the usernames, passwords, and IP addresses in the `hosts` file. (This is also called the Ansible inventory file.)
  - `ansible_user` is the username used for SSH.
  - `ansible_password` is the password used for SSH.
  - `ansible_sudo_pass` is the sudo password for the SSH user.
  - `x86 ansible_host` is the IP address used to access the x86 host that has a DPU installed. If you are running Ansible from the x86 host itself, use `127.0.0.1`.
  - `dpu_oob ansible_host` is the IP address of the DPU out-of-band Ethernet interface.
- (Optional) If you intend to deploy DOCA service containers, you can provide your NGC API key and org in the inventory file. If you do not provide them in the inventory, you will be prompted for input in the middle of the playbook.

  ```
  ngc_api_key=""
  ngc_org=""
  ```
- (Optional) For the DPUs in your inventory, define a variable `parent_host` that refers to the inventory name of the server where the DPU is installed. This is optional, but if defined properly it will allow `mlxconfig` changes to take effect WITHOUT THE NEED FOR A COLD REBOOT/POWER CYCLE.

  ```
  [x86_hosts]
  x86 ansible_host=10.150.170.174 # host is named x86 in the inventory
  <snip>
  [dpus]
  dpu_oob ansible_host=10.150.106.174 parent_host=x86 # so my parent_host variable refers to that name, 'x86'
  ```
- (Optional) If you wish to install DOCA components, download the DOCA software packages:
  - Download the DOCA file for x86 from https://developer.nvidia.com/networking/secure/doca-sdk/doca_1.11/doca_111_b19/ubuntu2004/doca-host-repo-ubuntu2004_1.1.1-0.0.1.1.1.024.5.4.2.4.1.3_amd64.deb and place the file in `roles/install_server_doca/files`.
  - Download the DOCA file for the DPU from https://developer.nvidia.com/networking/secure/doca-sdk/doca_1.11/doca_111_b19/doca-repo-aarch64-ubuntu2004-local_1.1.1-1.5.4.2.4.1.3.bf.3.7.1.11866_arm64.deb and place the file in `roles/install_dpu_doca/files`.
- Verify that Ansible is working properly:

  ```
  ansible x86 -m ping --become
  ```

  Note: if this step fails, look at the Troubleshooting section at the end of this page for help.
- Install DOCA by running the `doca-setup.yml` playbook:

  ```
  ansible-playbook doca-setup.yml
  ```
Reset the BlueField back to the factory defaults after running one of the BlueField use cases. This playbook is the minimum needed to accomplish that without going through the plays for the various components of the full PoC:

```
ansible-playbook poc-reinstall-bfb.yml
```
This was tested using an Ubuntu 20.04 server as the DHCP server.
- Edit the `hosts` file:
  - Uncomment `dhcpserver ansible_host` and set the IP address of the DHCP server.
  - Uncomment `oob_mac` and set the MAC address of the out-of-band Ethernet interface of the DPU.
  - (Optional) If you have a BMC interface that will receive a DHCP IP, uncomment `bmc_mac` and `bmc_ip` and set the IP and MAC address of the BMC interface.
- Edit the `group_vars/all/main.yml` file:
  - Set `dhcp_network` to the subnet the DHCP server will assign IPs from.
- Run the following command to build the DHCP server:

  ```
  ansible-playbook poc-dhcp-server.yml
  ```
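For reference, the ISC DHCP configuration this produces conceptually looks like the sketch below. This is an illustration only: the subnet, range, MAC, and IP values are placeholders, and the playbook's actual template may differ.

```
# /etc/dhcp/dhcpd.conf (illustrative sketch, placeholder values)
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;
}

# Fixed address for the DPU OOB interface, keyed on the oob_mac value
host dpu_oob {
  hardware ethernet aa:bb:cc:dd:ee:ff;
  fixed-address 10.0.0.50;
}
```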
- Confirm basic Ansible connectivity. For this to work, the IP addresses, `ansible_user`, and `ansible_password` values must be correct. Note: if the DPU has not been provisioned, failure is expected. Run the command:

  ```
  ansible all -m ping
  ```

  This should produce an output similar to the following:

  ```
  x86 | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python3.8"
      },
      "changed": false,
      "ping": "pong"
  }
  dpu_oob | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python"
      },
      "changed": false,
      "ping": "pong"
  }
  ```
- Confirm sudo access. For this to work, the `ansible_sudo_pass` value must be correct. Note: if the DPU has not been provisioned, failure is expected.

  ```
  ansible all -m ping --become
  ```

  This should produce an output similar to the following:

  ```
  x86 | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python3.8"
      },
      "changed": false,
      "ping": "pong"
  }
  dpu_oob | SUCCESS => {
      "ansible_facts": {
          "discovered_interpreter_python": "/usr/bin/python"
      },
      "changed": false,
      "ping": "pong"
  }
  ```
If this fails or just hangs, you may need to enable passwordless sudo. Use `sudo visudo` and change the line

```
%sudo ALL=(ALL:ALL) ALL
```

to

```
%sudo ALL=(ALL:ALL) NOPASSWD: ALL
```
- Confirm gathering facts. This confirms that Ansible can connect to the DPU and read information from the DPU and x86 nodes. Note: if the DPU has not been provisioned, failure is expected.

  ```
  ansible all -m setup
  ```

  The output should be a few pages long and similar to the following:

  ```
  dpu_oob | SUCCESS => {
      "ansible_facts": {
          "ansible_all_ipv4_addresses": [
              "192.168.100.2",
              "10.10.150.202"
          ],
          "ansible_all_ipv6_addresses": [
              "fe80::21a:caff:feff:ff01",
              "fe80::bace:f6ff:febc:7c92"
          ],
          "ansible_apparmor": {
              "status": "enabled"
          },
  ...
  ```