Multi-GPU (Nvidia) P2P capabilities and debugging tips

Morgan
Feb 20, 2024

Hello everyone, it’s been a while! This time, I’m delving into a hardware debugging post-mortem.

I recently (and optimistically) bought this ML PC build, assembled it and tested it. I'm no expert in building ML systems, so I expected a few bugs and surprises here and there. True to expectations, a significant issue emerged!

If you find an error, if you know more interesting ways to troubleshoot your system or have any feedback, reach out! I would be happy to make this post longer and more thorough.

Edit: Thanks to tinygrad, there is a way to enable P2P capabilities on Nvidia 40-series consumer graphics cards, check it out: https://morgangiraud.medium.com/multi-gpu-tinygrad-patch-4904a75f8e16

TL;DR

  • I assembled a dual-GPU system with two Nvidia GeForce RTX 4070 Ti SUPER 16GB cards
  • After installing drivers and performing standard GPU tests, I encountered a “hanging” issue during benchmarking
  • Discovered that PyTorch was unable to transfer data between GPUs using the .to function.
  • I then replicated the issue in TensorFlow, indicating the problem was not framework-specific.
  • Noticed that NCCL ‘all_reduce’ calls were hanging and that disabling P2P resolved this issue.
  • So I dived deeper. Using the nvBandwidth utility, I found out that P2P transfers were illusory: data wasn't actually being transferred.
  • Despite Nvidia driver version 545.* reporting P2P support, my research ended with a clear answer: P2P is actually not supported on the RTX 40 series.
  • In other words, the current 545.* driver's capability reporting is bugged.
  • Fortunately, the fix was simple: update the driver to the beta version 550.*! It resolved the incorrect P2P capability reporting.

It took a week to unravel this (better communication from Nvidia on P2P features would have been helpful). I had to navigate hardware debugging without knowing my GPUs lacked P2P support. A deeper dive into the GPU specs beforehand might have saved me the mystery tour too — live and learn, or in this case, debug and pain 🥹

Anyway, I also learned a lot. So here are some helpers for those who want to install a multi-GPU setup and troubleshoot it.

Installing and testing your setup

In this section, I will guide you through the steps to install and verify your multi-GPU machine and make sure everything is working fine.

Start with a positive mindset and systematically test your configuration!😌

  • Install your Nvidia graphics cards with care.
  • Opt for the latest Ubuntu LTS version — stick to LTS for stability unless you’re highly experienced.
  • Ensure you update Ubuntu and all other drivers (other than the Nvidia ones) before proceeding.

Refer to the following gist for a detailed guide on installing Nvidia drivers, CUDA, and cuDNN. It also includes detailed steps to thoroughly test your installation and validate its functionality. This is how I would have approached the problem with more experience.
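As a quick summary of what that validation looks like, here is a minimal sketch (the cuda-samples paths are assumptions based on my clone of https://github.com/NVIDIA/cuda-samples and may differ between releases):

# Sanity-check the driver and toolkit first
nvidia-smi           # both GPUs should show up with the expected driver version
nvcc --version       # the CUDA toolkit should be visible on your PATH

# Then build and run the P2P-related samples from the cuda-samples repo
cd cuda-samples/Samples/1_Utilities/deviceQuery && make && ./deviceQuery
cd ../../5_Domain_Specific/p2pBandwidthLatencyTest && make && ./p2pBandwidthLatencyTest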

If, like me, you run into a problem (errors from nvBandwidth, NCCL hanging, or bugs with inter-GPU communication in PyTorch/TensorFlow), let's see together how to troubleshoot it.
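For reference, here is roughly how the symptom showed up on my machine; a minimal sketch assuming PyTorch is installed and both GPUs are visible (the script name in the last line is just a placeholder):

# Cross-GPU copy check: on my broken setup, this call hung forever
python -c "import torch; x = torch.ones(1000, device='cuda:0'); print(x.to('cuda:1'))"

# Ask PyTorch/CUDA directly whether peer access is reported between GPU 0 and GPU 1
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"

# If NCCL collectives hang, disabling P2P is a quick way to confirm the culprit
NCCL_P2P_DISABLE=1 python your_training_script.py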

⚠️ Even if your setup is functioning correctly and your GPUs are communicating via PCIe (PHB, PXB, PIX), you might still want to check the ACS/IOMMU troubleshooting step to ensure optimal performance.

Troubleshooting

The deviceQuery binary does not work

That's not OK: reinstall everything from scratch. Make sure to completely purge your system before reinstalling everything.
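On Ubuntu, the purge step usually looks something like this (a sketch assuming you installed the driver and toolkit through apt; adapt it to however you installed them):

# Remove every Nvidia/CUDA/cuDNN package before reinstalling
sudo apt-get remove --purge '^nvidia-.*' '^cuda-.*' '^libcudnn.*'
sudo apt-get autoremove --purge
sudo reboot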

The deviceQuery or p2pBandwidthLatencyTest binaries tell me I do not have P2P capabilities, is that OK?

If you are using consumer-grade GPUs (30- and 40-series), P2P is not supported on those, so that's expected. Sorry.


Now, if you are using professional-grade GPUs, there is a high chance that this is not normal. Let's explore!

Understanding how my GPUs communicate

This is called the topology, and you can query nvidia-smi to see it:

nvidia-smi topo -mp

# Output for my build
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-23            0               N/A
GPU1    PHB      X      0-23            0               N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge

If your cards are connected through PCIe, you can get a better sense of how everything is wired by using the following command:

lspci -tv

# Output for my build
-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 14d8
           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 14da
           +-01.1-[01]--+-00.0  NVIDIA Corporation Device 2705
           |            \-00.1  NVIDIA Corporation Device 22bb
           +-01.2-[02]----00.0  Micron/Crucial Technology Device 5419
           +-01.3-[03]--+-00.0  NVIDIA Corporation Device 2705
           |            \-00.1  NVIDIA Corporation Device 22bb
           ... (more down there, but irrelevant)

My cards are using PCIe to communicate: check ACS and IOMMU

ACS (Access Control Services) is a security feature that forces P2P PCIe transactions to go up through the PCIe root complex for checks (used for security), while IOMMU forces PCIe traffic to be routed through the CPU root ports (used for device isolation and virtualization).

These two features defeat the purpose of P2P (which is direct communication without any checks whatsoever) and add overhead to your communication.

Unless you need those security measures or virtualization, you can disable them safely and enjoy more performance. And if P2P should work with your setup but doesn't, they are at the top of the potential culprit list.

If you followed the gist above, you have installed the nvidia-gds package, which includes a very handy tool to check whether IOMMU and ACS are enabled on your system.

# This should be the default path to your CUDA installation
# If it's not the right path, change it accordingly.
/usr/local/cuda/gds/tools/gdscheck -p

# If IOMMU is enabled, you can see how your devices are currently mapped
ls -l /sys/kernel/iommu_groups/*/devices/
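You can also inspect the ACS control bits device by device with lspci; a quick sketch (the bus ID in the second command is the GPU bridge from my lspci -tv output above, use yours):

# A '+' on the ACSCtl line (SrcValid+, ReqRedir+, ...) means ACS is active on that bridge
sudo lspci -vvv | grep -i "ACSCtl"

# Focus on a single bridge by its bus ID
sudo lspci -vvv -s 00:01.1 | grep -i "ACSCtl"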

To disable ACS on all your PCIe connected hardware (not just GPUs), you can run the following script:

#!/bin/bash
#
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto. Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.
#

# must be root to access extended PCI config space
if [ "$EUID" -ne 0 ]; then
  echo "ERROR: $0 must be run as root"
  exit 1
fi

for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do

  # skip if it doesn't support ACS
  setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
  if [ $? -ne 0 ]; then
    #echo "${BDF} does not support ACS, skipping"
    continue
  fi

  logger "Disabling ACS on `lspci -s ${BDF}`"
  setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
  if [ $? -ne 0 ]; then
    logger "Error disabling ACS on ${BDF}"
    continue
  fi
  NEW_VAL=`setpci -v -s ${BDF} ECAP_ACS+0x6.w | awk '{print $NF}'`
  if [ "${NEW_VAL}" != "0000" ]; then
    logger "Failed to disable ACS on ${BDF}"
    continue
  fi
done
exit 0
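If you save it, for example, as disable_acs.sh (any name works), you can run it and immediately verify the result. Note that setpci changes do not persist across reboots, so you may need to re-run the script after each boot if your BIOS has no ACS toggle:

sudo bash disable_acs.sh
sudo lspci -vvv | grep -i "ACSCtl"   # every bit should now read '-' (SrcValid-, ReqRedir-, ...)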

Then, to disable IOMMU, you must find the option in your BIOS settings and disable it.
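If your BIOS does not expose such an option (or you prefer doing it from the OS), IOMMU can usually also be turned off with a kernel parameter; here is a sketch for an AMD platform like mine (Intel platforms use intel_iommu=off instead):

# /etc/default/grub -- edit this line by hand, then run: sudo update-grub && sudo reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"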

After rebooting, don’t forget to check that they are indeed disabled before recompiling the different P2P binaries.
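A minimal sketch of that post-reboot check, reusing the tools from above:

# gdscheck should now report IOMMU as disabled
/usr/local/cuda/gds/tools/gdscheck -p | grep -i iommu

# The kernel log also states the IOMMU status at boot (AMD-Vi is the AMD IOMMU driver)
sudo dmesg | grep -i -e iommu -e AMD-Vi

# And no PCIe bridge should report active ACS bits anymore
sudo lspci -vvv | grep -i "ACSCtl"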


P2P should be working but still isn't

In this case, your best bet is to experiment with different driver versions available for your machine (previous and next version), which may include venturing into beta drivers.

Go back to the gist and follow it again. Don't forget to clean up all the built binaries (usually using make clean from the root folder) and rebuild them, just to be extra sure that nothing is using previous libraries and files.
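On Ubuntu, switching driver branches is usually just a matter of installing a different metapackage; a sketch (the exact package names are assumptions, check what ubuntu-drivers actually proposes for your hardware):

# List the driver branches Ubuntu proposes for your GPUs
ubuntu-drivers devices

# Example: move from the 545 branch to the 550 branch, then reboot
sudo apt-get install nvidia-driver-550
sudo reboot

# Rebuild the test binaries from scratch afterwards
cd cuda-samples && make clean && make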

Good luck!🍀

References

You can find my journey on those 3 threads:
