How to Perform MPI Benchmarking on Azure with Rocky Linux
Introduction
MPI (Message Passing Interface) is a backbone of High Performance Computing, and MPI benchmarking is heavily utilized in industries such as the automobile industry. The objective of the benchmark is to pass a message between two nodes as quickly as possible: the faster the message travels between the nodes, the more parallel operations you will be able to perform on your selected hardware.
This article will cover how to set up MPI benchmarking on a Rocky Linux 8.10 image on two nodes in Azure.
Prerequisites
- Two individual VMs or a Virtual Machine Scale Set (VMSS); a VMSS is the recommended option and is what is used in this article.
- The size of the VMs / VMSS needs to support Infiniband and RDMA. Example sizes that support Infiniband and RDMA are HBv4, H16r, and so on (the testing performed for this article was done on a Virtual Machine Scale Set using two HB176-24rs_v4 instances). You may need to increase the quota in your Azure subscription in order to access sizes such as the HBv4; see the quota check after this list.
- A Rocky Linux VHD to apply to each node (for this documentation, this was tested with Rocky Linux 8.10). The image requires the cld_next kernel, as well as the waagent, OFED, Infiniband, and RDMA drivers. In addition, Open MPI needs to be installed for the MPI benchmarking to be performed.
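If you are unsure whether your subscription already has quota for an RDMA-capable size in your target region, checks along these lines can help (the Standard_HB size filter is only an example):
# List HB-series sizes available in the target region
az vm list-skus --location <LOCATION> --size Standard_HB --all --output table
# Review current vCPU usage against the subscription limits
az vm list-usage --location <LOCATION> --output table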
Azure setup
All steps are performed using the Azure CLI.
Step 1: create a resource group
az group create --name <RESOURCE-GROUP_NAME> --location <LOCATION>
Step 2: generate a storage account
az storage account create \
--name <NAME> \
--resource-group <RESOURCE-GROUP_NAME> \
--location <LOCATION> \
--sku Standard_LRS
Step 3: provision a container
az storage container create \
--account-name <NAME> \
--name <NAME>
Step 4: retrieve the storage account key and assign it to the STORAGE_KEY variable
STORAGE_KEY=$(az storage account keys list \
--resource-group <RESOURCE-GROUP_NAME> \
--account-name <NAME> \
--query '[0].value' -o tsv)
Step 5: display the key for verification (optional)
echo $STORAGE_KEY
Step 6: upload the Rocky Linux VHD image from your local filesystem
az storage blob upload \
--account-name <NAME> \
--account-key $STORAGE_KEY \
--container-name <NAME> \
--type page \
--file <VHD_IMAGE> \
--name <VHD_IMAGE>
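To confirm the upload completed, you can compare the blob's reported size against the local VHD, for example:
# Print the uploaded blob's size in bytes; it should match the local VHD file
az storage blob show \
--account-name <NAME> \
--account-key $STORAGE_KEY \
--container-name <NAME> \
--name <VHD_IMAGE> \
--query "properties.contentLength" \
--output tsv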
Step 7: set the VHD URI to the VHD_URI variable
VHD_URI="https://<NAME>.blob.core.windows.net/<NAME>/<VHD_IMAGE>"
Step 8: create a managed image
az image create \
--resource-group <RESOURCE-GROUP_NAME> \
--name <IMAGE_NAME> \
--os-type Linux \
--source $VHD_URI \
--location <LOCATION>
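Optionally, verify that the managed image finished provisioning before moving on:
az image show \
--resource-group <RESOURCE-GROUP_NAME> \
--name <IMAGE_NAME> \
--query provisioningState \
--output tsv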
Step 9: generate an ssh key pair for accessing the nodes
# Generate an ssh key pair if you don't have one already on your local machine
ssh-keygen -t rsa -b 4096 -f ~/.ssh/<SSH_KEY> -N ""
# Check the public key for verification
cat ~/.ssh/<SSH_KEY>.pub
Step 10: create a proximity placement group
az ppg create \
--resource-group <RESOURCE-GROUP_NAME> \
--name <PROXIMITY_PLACEMENT_GROUP_NAME> \
--location <LOCATION> \
--type Standard
Step 11: create a virtual machine scale set
az vmss create \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--image <IMAGE_NAME> \
--instance-count 2 \
--vm-sku <SIZE> \
--admin-username azureuser \
--ssh-key-values ~/.ssh/<SSH_KEY>.pub \
--upgrade-policy-mode automatic \
--lb <LOAD_BALANCER_NAME> \
--backend-pool-name <BACKEND_POOL_NAME> \
--location <LOCATION> \
--public-ip-per-vm \
--orchestration-mode Uniform \
--ppg <PROXIMITY_PLACEMENT_GROUP_NAME>
Step 12: assign public ip addresses for each node
az network public-ip create \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NODE_1_PUBLIC_IP_NAME> \
--allocation-method Static \
--sku Standard \
--location <LOCATION>
az network public-ip create \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NODE_2_PUBLIC_IP_NAME> \
--allocation-method Static \
--sku Standard \
--location <LOCATION>
Step 13: remove any existing nat rules
az network lb inbound-nat-rule delete \
--resource-group <RESOURCE-GROUP_NAME> \
--lb-name <LOAD_BALANCER_NAME> \
--name NatRule
Step 14: create a nat pool
az network lb inbound-nat-pool create \
--resource-group <RESOURCE-GROUP_NAME> \
--lb-name <LOAD_BALANCER_NAME> \
--name <NAT_POOL_NAME> \
--protocol Tcp \
--frontend-port-range-start 50000 \
--frontend-port-range-end 50119 \
--backend-port 22 \
--frontend-ip-name <LOAD_BALANCER_FRONTEND_NAME>
Step 15: add the nat pool to the vmss
az vmss update \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--add virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].ipConfigurations[0].loadBalancerInboundNatPools \
'{
"id": "/subscriptions/2247734c-d128-46cd-a462-ee14c4302d9b/resourceGroups/<RESOURCE-GROUP_NAME>/providers/Microsoft.Network/loadBalancers/<LOAD_BALANCER_NAME>/inboundNatPools/<NAT_POOL_NAME>"
}'
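The scale set was created with an automatic upgrade policy, so this change should roll out on its own; if it does not appear on the instances, you can push the updated model to all of them explicitly:
az vmss update-instances \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--instance-ids "*"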
Step 16: check the vmss configuration
az vmss show \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--query "virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].ipConfigurations[0].loadBalancerInboundNatPools" \
--output json
Step 17: check that the nat rules were created
az network lb inbound-nat-rule list \
--resource-group <RESOURCE-GROUP_NAME> \
--lb-name <LOAD_BALANCER_NAME> \
--output table
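These rules map frontend ports (starting at 50000) on the load balancer's public IP to port 22 on the instances, so the nodes can also be reached through the load balancer instead of the per-VM public IPs. The instance-to-port mapping below is an assumption (instance 0 on port 50000); confirm it against the rule list above, and substitute the load balancer's frontend address for <LOAD_BALANCER_PUBLIC_IP>.
# Hypothetical example: reach instance 0 through the load balancer frontend
ssh -i ~/.ssh/<SSH_KEY> -p 50000 azureuser@<LOAD_BALANCER_PUBLIC_IP>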
Step 18: verify each node's nics
az vmss nic list \
--resource-group <RESOURCE-GROUP_NAME> \
--vmss-name <NAME> \
--query "[].{Name:name, PrivateIP:ipConfigurations[0].privateIpAddress}" \
--output table
Step 19: get detailed information on the nics' ip configuration
az vmss nic list \
--resource-group <RESOURCE-GROUP_NAME> \
--vmss-name <NAME> \
--query "[].ipConfigurations[0]" \
--output json
Step 20: get the vmss instance information
az vmss list-instances \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--output table
Step 21: display the public ip of each node
az vmss list-instance-public-ips \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--output table
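If you prefer to capture each node's address in a shell variable for the ssh step that follows, a query along these lines can be used (it assumes index 0 and 1 correspond to node 1 and node 2):
NODE_1_IP=$(az vmss list-instance-public-ips \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--query "[0].ipAddress" --output tsv)
NODE_2_IP=$(az vmss list-instance-public-ips \
--resource-group <RESOURCE-GROUP_NAME> \
--name <NAME> \
--query "[1].ipAddress" --output tsv)
echo "$NODE_1_IP $NODE_2_IP"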
Step 22: ssh into each node
ssh -i ~/.ssh/<SSH_KEY> azureuser@<NODE_IP>
Perform the steps below on each node:
Step 23: generate an ssh key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/inter_instance_key -N ""
Step 24: share the public keys between both nodes
# On node 1, show the public key
cat ~/.ssh/inter_instance_key.pub
# Copy the public key
# SSH into node 2 and add node 1's public key to node 2's authorized_keys file:
echo "PASTE_NODE_1_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
# On node 2, display the public key
cat ~/.ssh/inter_instance_key.pub
# Copy the public key
# SSH into node 1
# Add node 2's public key to node 1's authorized_keys file:
echo "PASTE_NODE_2_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
Step 25: create the hostlist.txt file for both nodes
# Get hostnames of both nodes
# On each node, create the hostlist.txt file
# First, get the hostnames of each node (run the below command on each node to identify the hostname)
hostname
# Create hostlist.txt with both hostnames
cat > ~/hostlist.txt << EOF
<NODE_1_HOSTNAME>
<NODE_2_HOSTNAME>
EOF
Step 26: generate the ssh config file on both nodes
# Create ~/.ssh/config file on each instance
cat > ~/.ssh/config << 'EOF'
Host <START_OF_HOSTNAME_PATTERN>*
IdentityFile /home/azureuser/.ssh/inter_instance_key
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
EOF
Step 27: set 600 permissions for the ssh config file on both nodes, so that only the owner has read / write permissions
chmod 600 ~/.ssh/config
Step 28: set 600 permissions for the private ssh key
chmod 600 ~/.ssh/inter_instance_key
Step 29: test ssh connectivity between the nodes
# From node 1 to node 2
ssh -i ~/.ssh/inter_instance_key <NODE_2_HOSTNAME>
# From node 2 to node 1
ssh -i ~/.ssh/inter_instance_key <NODE_1_HOSTNAME>
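As a further check, you can loop over hostlist.txt from either node and confirm that every host answers without a password prompt:
# Should print each hostname without prompting for a password
for host in $(cat ~/hostlist.txt); do
    ssh -i ~/.ssh/inter_instance_key "$host" hostname
done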
Step 30: rdma setup for the waagent on both nodes
sudo vi /etc/waagent.conf
# Ensure the following are enabled and uncommented
OS.EnableRDMA=y
OS.CheckRdmaDriver=y
# Save and verify the changes
grep -i rdma /etc/waagent.conf | grep -v "^#"
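If you would rather make this change non-interactively than edit the file in vi, sed commands along these lines should have the same effect (the patterns assume both settings are already present in waagent.conf, commented out or not):
# Enable the RDMA settings in place, uncommenting them if needed
sudo sed -E -i 's/^#? *OS\.EnableRDMA=.*/OS.EnableRDMA=y/' /etc/waagent.conf
sudo sed -E -i 's/^#? *OS\.CheckRdmaDriver=.*/OS.CheckRdmaDriver=y/' /etc/waagent.conf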
Step 31: restart the waagent service
sudo systemctl restart waagent
sudo systemctl daemon-reload
Step 32: load the ib_ipoib module
sudo modprobe ib_ipoib
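Note that modprobe only loads the module for the current boot. One way to load it automatically after a reboot is a modules-load.d entry, for example:
# Load ib_ipoib automatically at boot via systemd's modules-load.d mechanism
echo ib_ipoib | sudo tee /etc/modules-load.d/ib_ipoib.conf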
Step 33: configure the infiniband network
# On node 1, set an ip address (in the below example, it is 172.20.0.10)
sudo ip addr add 172.20.0.10/24 dev ib0
# Bring the link up
sudo ip link set ib0 up
# Verify the infiniband interface is up
ibdev2netdev
# On node 2, set an ip address
sudo ip addr add 172.20.0.11/24 dev ib0
# Bring the link up
sudo ip link set ib0 up
# Verify the interface is up
ibdev2netdev
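The ip addr assignments above do not persist across reboots. On Rocky Linux 8, one option for making them persistent is NetworkManager; a sketch for node 1 is below (use 172.20.0.11/24 on node 2, and adjust the connection name as needed):
# Create a persistent NetworkManager profile for the ib0 interface
sudo nmcli connection add type infiniband ifname ib0 con-name ib0 \
ipv4.method manual ipv4.addresses 172.20.0.10/24
sudo nmcli connection up ib0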
Step 34: test infiniband connectivity
# From node 1 to node 2 with the above example ips
ping -c 3 172.20.0.11
# From node 2 to node 1
ping -c 3 172.20.0.10
Step 35: run the mpi benchmarking tests
Confirm that messages can be passed between the two nodes successfully with the pingpong and allreduce benchmarks.
# pingpong benchmark
/usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -n 2 -N 1 -hostfile ~/hostlist.txt -x UCX_NET_DEVICES=mlx5_0:1 /usr/mpi/gcc/openmpi-4.1.7rc1/tests/imb/IMB-MPI1 -msglog 27:28 pingpong
# allreduce benchmark
/usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -n 72 -N 36 -hostfile ~/hostlist.txt -x UCX_NET_DEVICES=mlx5_0:1 /usr/mpi/gcc/openmpi-4.1.7rc1/tests/imb/IMB-MPI1 -npmin 72 allreduce
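The rank counts in the allreduce example are tied to the hardware used for this article. If your instances have a different core count, a sketch like the following derives the values from nproc instead (it assumes both nodes are identical):
# Derive rank counts from the local core count
CORES_PER_NODE=$(nproc)
TOTAL_RANKS=$((CORES_PER_NODE * 2))
/usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -n $TOTAL_RANKS -N $CORES_PER_NODE -hostfile ~/hostlist.txt -x UCX_NET_DEVICES=mlx5_0:1 /usr/mpi/gcc/openmpi-4.1.7rc1/tests/imb/IMB-MPI1 -npmin $TOTAL_RANKS allreduce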
Notes
If ssh between the nodes fails
- Verify the hostnames are correct in the hostlist.txt file.
- Check that the public keys are properly added to the authorized_keys file.
- Ensure firewall rules allow internal communication.
- Verify the ssh server service is running: sudo systemctl status sshd
If Infiniband connectivity fails
- Check if the ib0 interface is visible: ip addr show ib0
- Verify RDMA modules are loaded: lsmod | grep rdma
- Check the Infiniband status: ibstat
- Verify connectivity with ibping
If the MPI tests fail
- Ensure all nodes are reachable via SSH without password prompts (see the batch-mode check after this list).
- Verify the OpenMPI installation: which mpirun
- Check that UCX_NET_DEVICES matches your hardware: ucx_info -d
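For the first point, a quick way to confirm passwordless access from a non-interactive context (which is what mpirun needs) is a batch-mode test, for example:
# Fails immediately instead of prompting if key-based auth is not working
ssh -o BatchMode=yes -i ~/.ssh/inter_instance_key <NODE_2_HOSTNAME> true && echo "passwordless ssh OK"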
How to quickly clean up a test setup and delete all resources
# Delete the entire resource group and all its resources
az group delete --name <RESOURCE-GROUP_NAME> --yes --no-wait
# Check the deletion status
az group show --name <RESOURCE-GROUP_NAME>