
How to Perform MPI Benchmarking on Azure with Rocky Linux

Introduction

MPI (Message Passing Interface) benchmarking is heavily utilized in the automobile industry and is a backbone of High Performance Computing. The objective of the benchmark is to pass a message between two nodes as quickly as possible: the faster the message travels between the nodes, the more parallel operations you can perform on your chosen hardware.

This article will cover how to set up MPI benchmarking on a Rocky Linux 8.10 image on two nodes in Azure.

Prerequisites

  • Two individual VMs or a Virtual Machine Scale Set (VMSS); a VMSS is the recommended option and is what is used in this article.

  • The VM / VMSS size must support InfiniBand and RDMA. Example sizes that support both are the HBv4 and H16r series (the testing performed for this article was on a Virtual Machine Scale Set with two HB176-24rs_v4 instances). You may need to increase the quota on your Azure subscription in order to access sizes such as the HBv4.

  • A Rocky Linux VHD to apply to each node (for this documentation, this was tested with Rocky Linux 8.10). The image requires the cld_next kernel, as well as the waagent, OFED, InfiniBand, and RDMA drivers. In addition, Open MPI needs to be installed for the MPI benchmarking to be performed.
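
Once a node built from such an image is up, the presence of the required tooling can be sanity-checked with a short script. This is a hedged sketch: the command names below (mpirun, ibstat, waagent) are the typical ones, and your image may expose them under different names or paths.

```shell
# Sketch: report whether a command is on PATH (names below are typical, not guaranteed)
check() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "found: $1"
  else
    echo "MISSING: $1"
  fi
}

check mpirun    # Open MPI launcher
check ibstat    # InfiniBand diagnostics
check waagent   # Azure Linux agent
```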

Azure setup

All steps are performed using the Azure CLI.

Step 1: create a resource group

az group create --name <RESOURCE-GROUP_NAME> --location <LOCATION> 

Step 2: generate a storage account

az storage account create \
    --name <NAME> \
    --resource-group <RESOURCE-GROUP_NAME> \
    --location <LOCATION> \
    --sku Standard_LRS

Step 3: provision a container

az storage container create \
    --account-name <NAME> \
    --name <NAME>

Step 4: retrieve the storage account key and assign it to the STORAGE_KEY variable

STORAGE_KEY=$(az storage account keys list \
    --resource-group <RESOURCE-GROUP_NAME> \
    --account-name <NAME> \
    --query '[0].value' -o tsv)
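
A wrong resource group or account name makes this query return an empty string rather than an error, so it is worth guarding the variable before later commands depend on it. A minimal sketch (the helper function is our own, not an Azure CLI feature):

```shell
# Sketch: warn and fail if a variable that later commands depend on came back empty
require_nonempty() {
  # usage: require_nonempty "$VALUE" NAME
  if [ -z "$1" ]; then
    echo "error: $2 is empty; check the resource group and account name" >&2
    return 1
  fi
  echo "$2 is set (${#1} characters)"
}

# In real use, right after the az command above:
# require_nonempty "$STORAGE_KEY" STORAGE_KEY || exit 1
```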

Step 5: display the key for verification (optional)

echo $STORAGE_KEY

Step 6: upload the Rocky Linux VHD image from your local filesystem

az storage blob upload \
    --account-name <NAME> \
    --account-key $STORAGE_KEY \
    --container-name <NAME> \
    --type page \
    --file <VHD_IMAGE> \
    --name <VHD_IMAGE>

Step 7: set the VHD URI to the VHD_URI variable

VHD_URI="https://<NAME>.blob.core.windows.net/<NAME>/<VHD_IMAGE>"
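
A typo in the account, container, or blob name only surfaces later, at image-creation time, so a quick shape check on the URI can save a round trip. A sketch; the pattern only validates the general https://<account>.blob.core.windows.net/<container>/<blob> layout used above, not that the blob exists:

```shell
# Sketch: succeed only when the URI matches the blob URL layout used above
vhd_uri_ok() {
  case "$1" in
    https://*.blob.core.windows.net/*/*) return 0 ;;
    *) return 1 ;;
  esac
}

# vhd_uri_ok "$VHD_URI" || echo "VHD_URI does not look like a blob URL" >&2
```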

Step 8: create a managed image

az image create \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <IMAGE_NAME> \
    --os-type Linux \
    --source $VHD_URI \
    --location <LOCATION>

Step 9: generate an ssh key pair for accessing the nodes

# Generate an ssh key pair if you don't have one already on your local machine
ssh-keygen -t rsa -b 4096 -f ~/.ssh/<SSH_KEY> -N ""

# Check the public key for verification
cat ~/.ssh/<SSH_KEY>.pub

Step 10: create a proximity placement group

az ppg create \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <PROXIMITY_PLACEMENT_GROUP_NAME> \
    --location <LOCATION> \
    --type Standard

Step 11: create a virtual machine scale set

az vmss create \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NAME> \
    --image <IMAGE_NAME> \
    --instance-count 2 \
    --vm-sku <SIZE> \
    --admin-username azureuser \
    --ssh-key-values ~/.ssh/<SSH_KEY>.pub \
    --upgrade-policy-mode automatic \
    --lb <LOAD_BALANCER_NAME> \
    --backend-pool-name <BACKEND_POOL_NAME> \
    --location <LOCATION> \
    --public-ip-per-vm \
    --orchestration-mode Uniform \
    --ppg <PROXIMITY_PLACEMENT_GROUP_NAME>

Step 13: assign public ip addresses for each node

az network public-ip create \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NODE_1_PUBLIC_IP_NAME> \
    --allocation-method Static \
    --sku Standard \
    --location <LOCATION>

az network public-ip create \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NODE_2_PUBLIC_IP_NAME> \
    --allocation-method Static \
    --sku Standard \
    --location <LOCATION>

Step 14: remove any existing nat rules

az network lb inbound-nat-rule delete \
    --resource-group <RESOURCE-GROUP_NAME> \
    --lb-name <LOAD_BALANCER_NAME> \
    --name NatRule

Step 15: create a nat pool

az network lb inbound-nat-pool create \
    --resource-group <RESOURCE-GROUP_NAME> \
    --lb-name <LOAD_BALANCER_NAME> \
    --name <NAT_POOL_NAME> \
    --protocol Tcp \
    --frontend-port-range-start 50000 \
    --frontend-port-range-end 50119 \
    --backend-port 22 \
    --frontend-ip-name <LOAD_BALANCER_FRONTEND_NAME>
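
With this pool, the load balancer maps a range of frontend ports onto port 22 of the individual instances. In practice each VMSS instance ID n typically ends up reachable on frontend port 50000 + n, although the exact allocation is assigned by Azure, so verify it against the NAT rule listing in a later step. A small helper sketch under that assumption:

```shell
# Sketch: assumed mapping of VMSS instance ID to frontend SSH port for the pool above
nat_ssh_port() {
  echo $((50000 + $1))
}

# e.g. ssh to instance 0 through the load balancer's public IP:
# ssh -p "$(nat_ssh_port 0)" -i ~/.ssh/<SSH_KEY> azureuser@<LOAD_BALANCER_PUBLIC_IP>
```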

Step 16: add the nat pool to the vmss

az vmss update \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NAME> \
    --add virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].ipConfigurations[0].loadBalancerInboundNatPools \
    '{
      "id": "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE-GROUP_NAME>/providers/Microsoft.Network/loadBalancers/<LOAD_BALANCER_NAME>/inboundNatPools/<NAT_POOL_NAME>"
    }'

Step 17: check the vmss configuration

az vmss show \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NAME> \
    --query "virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].ipConfigurations[0].loadBalancerInboundNatPools" \
    --output json

Step 18: check that the nat rules were created

az network lb inbound-nat-rule list \
    --resource-group <RESOURCE-GROUP_NAME> \
    --lb-name <LOAD_BALANCER_NAME> \
    --output table

Step 19: verify each node's nics

az vmss nic list \
    --resource-group <RESOURCE-GROUP_NAME> \
    --vmss-name <NAME> \
    --query "[].{Name:name, PrivateIP:ipConfigurations[0].privateIpAddress}" \
    --output table

Step 20: get detailed information on the nics' ip configuration

az vmss nic list \
    --resource-group <RESOURCE-GROUP_NAME> \
    --vmss-name <NAME> \
    --query "[].ipConfigurations[0]" \
    --output json

Step 21: get the vmss instance information

az vmss list-instances \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NAME> \
    --output table

Step 22: display the public ip of each node

az vmss list-instance-public-ips \
    --resource-group <RESOURCE-GROUP_NAME> \
    --name <NAME> \
    --output table

Step 23: ssh into each node

ssh -i ~/.ssh/<SSH_KEY> azureuser@<NODE_IP>

Perform the following steps on each node:

Step 24: generate an ssh key pair

ssh-keygen -t rsa -b 4096 -f ~/.ssh/inter_instance_key -N ""

Step 25: share the public keys between both nodes

# On node 1, show the public key
cat ~/.ssh/inter_instance_key.pub

# Copy the public key

# SSH into node 2 and add node 1's public key to node 2's authorized_keys file:
echo "PASTE_NODE_1_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys

# On node 2, display the public key
cat ~/.ssh/inter_instance_key.pub

# Copy the public key

# SSH into node 1

# Add node 2's public key to node 1's authorized_keys file:
echo "PASTE_NODE_2_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys

Step 26: create the hostlist.txt file for both nodes

# Get hostnames of both nodes

# On each node, create the hostlist.txt file

# First, get the hostnames of each node (run the below command on each node to identify the hostname)
hostname

# Create hostlist.txt with both hostnames
cat > ~/hostlist.txt << EOF
<NODE_1_HOSTNAME>
<NODE_2_HOSTNAME>
EOF
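
mpirun can fail in confusing ways if hostlist.txt contains a typo or a stray blank line, so it is worth confirming the file has exactly the entries you expect. A sketch (the helper function is our own):

```shell
# Sketch: count non-blank lines in a hostfile
hostfile_count() {
  grep -c '[^[:space:]]' "$1"
}

# Expect exactly 2 for this two-node setup:
# [ "$(hostfile_count ~/hostlist.txt)" -eq 2 ] || echo "hostlist.txt should list two hostnames" >&2
```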

Step 27: generate the ssh config file on both nodes

# Create ~/.ssh/config file on each instance
cat > ~/.ssh/config << 'EOF'
Host <START_OF_HOSTNAME_PATTERN>*
  IdentityFile /home/azureuser/.ssh/inter_instance_key
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
EOF

Step 28: set 600 permissions on the ssh config file and the private key on both nodes, so that only the owner has read / write access

chmod 600 ~/.ssh/config
chmod 600 ~/.ssh/inter_instance_key

Step 29: test ssh connectivity between the nodes

# From node 1 to node 2
ssh -i ~/.ssh/inter_instance_key <NODE_2_HOSTNAME>

# From node 2 to node 1
ssh -i ~/.ssh/inter_instance_key <NODE_1_HOSTNAME>
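
The two manual checks above can be generalized into a loop over the hostfile, which scales if you later grow the scale set. A sketch; the function name is ours, and BatchMode makes ssh fail immediately instead of prompting for a password:

```shell
# Sketch: run a probe command against every non-blank host in a hostfile
for_each_host() {
  hostfile=$1; shift
  while read -r host; do
    [ -n "$host" ] || continue
    if "$@" "$host"; then echo "ok: $host"; else echo "FAILED: $host"; fi
  done < "$hostfile"
}

# Real use with the inter-instance key generated earlier:
# probe() { ssh -o BatchMode=yes -i ~/.ssh/inter_instance_key "$1" true; }
# for_each_host ~/hostlist.txt probe
```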

Step 30: set up rdma for the waagent on both nodes

sudo vi /etc/waagent.conf

# Ensure the following are enabled and uncommented
OS.EnableRDMA=y
OS.CheckRdmaDriver=y

# Save and verify the changes
grep -i rdma /etc/waagent.conf | grep -v "^#"
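
That grep check can be wrapped into a reusable yes/no test, which is handy when validating several nodes. A sketch (the function name is ours):

```shell
# Sketch: succeed only if both RDMA settings are present and uncommented
rdma_enabled() {
  grep -q '^OS.EnableRDMA=y' "$1" && grep -q '^OS.CheckRdmaDriver=y' "$1"
}

# rdma_enabled /etc/waagent.conf && echo "RDMA enabled" || echo "RDMA not fully enabled"
```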

Step 31: restart the waagent service

sudo systemctl daemon-reload
sudo systemctl restart waagent

Step 32: load the ib_ipoib module

sudo modprobe ib_ipoib

Step 33: configure the infiniband network

# On node 1, set an ip address (in the below example, it is 172.20.0.10)
sudo ip addr add 172.20.0.10/24 dev ib0

# Bring the link up
sudo ip link set ib0 up

# Verify the infiniband interface is up
ibdev2netdev

# On node 2, set an ip address
sudo ip addr add 172.20.0.11/24 dev ib0

# Bring the link up
sudo ip link set ib0 up

# Verify the interface is up
ibdev2netdev

Step 34: test infiniband connectivity

# From node 1 to node 2 with the above example ips
ping -c 3 172.20.0.11

# From node 2 to node 1
ping -c 3 172.20.0.10

Step 35: run the mpi benchmarking tests

Confirm that messages can be passed between the two nodes successfully with the pingpong and allreduce benchmarks.

# pingpong benchmark
/usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -n 2 -N 1 -hostfile ~/hostlist.txt -x UCX_NET_DEVICES=mlx5_0:1 /usr/mpi/gcc/openmpi-4.1.7rc1/tests/imb/IMB-MPI1 -msglog 27:28 pingpong

# allreduce benchmark
/usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -n 72 -N 36 -hostfile ~/hostlist.txt -x UCX_NET_DEVICES=mlx5_0:1 /usr/mpi/gcc/openmpi-4.1.7rc1/tests/imb/IMB-MPI1 -npmin 72 allreduce
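
The flags in the allreduce run follow a simple rule: the total rank count passed to -n should equal the ranks per node (-N) times the number of hosts in the hostfile. A quick sketch of that arithmetic (36 ranks per node is just the value used above; pick what suits your VM size):

```shell
# Sketch: mpirun rank arithmetic for the allreduce invocation above
nodes=2            # lines in hostlist.txt
ranks_per_node=36  # value passed to -N
total_ranks=$((nodes * ranks_per_node))
echo "$total_ranks"  # value to pass to -n
```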

Notes

If ssh between the nodes fails

  1. Verify hostnames are correct in the hostlist.txt file.

  2. Check that public keys are properly added to authorized_keys file.

  3. Ensure firewall rules allow internal communication.

  4. Verify the ssh server service is running: sudo systemctl status sshd

If InfiniBand connectivity fails

  1. Check if the ib0 interface is visible: ip addr show ib0

  2. Verify RDMA modules are loaded: lsmod | grep rdma

  3. Check the InfiniBand status: ibstat

  4. Verify connectivity with ibping

If the MPI tests fail

  1. Ensure all nodes are reachable via SSH without password prompts.

  2. Verify OpenMPI installation: which mpirun

  3. Check that UCX_NET_DEVICES matches your hardware: ucx_info -d

How to quickly clean up a test setup and delete all resources

# Delete the entire resource group and all its resources
az group delete --name <RESOURCE-GROUP_NAME> --yes --no-wait

# Check the deletion status (this command returns an error once the group is fully deleted)
az group show --name <RESOURCE-GROUP_NAME>

References & related articles

Open MPI