Thibault VINCENT 9d782c572a
feat: fabricmanager support Blackwell baseboards (DGX/HGX B100/B200/B300)
Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

> On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation.
> NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks.

Not running NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches will remain uninitialized, and applications will fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error. The CUDA initialization process can only begin after the GPUs complete their registration with the NVLink fabric.
GPU fabric registration status can be verified with the command `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is still registering, typically because FabricManager is not running or is missing the NVLSM dependency. A `Completed` state is shown once the GPU is successfully registered with the NVLink fabric.
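The registration check above can also be scripted. The Go sketch below parses the `Fabric` section of `nvidia-smi -q` output; the sample text and function name are illustrative, not captured from real hardware:

```go
package main

import (
	"fmt"
	"strings"
)

// parseFabricState extracts the "State" value from the Fabric section of
// `nvidia-smi -q` output. It returns "Unknown" when no such section exists.
func parseFabricState(out string) string {
	inFabric := false
	for _, line := range strings.Split(out, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "Fabric") {
			inFabric = true
			continue
		}
		if inFabric && strings.HasPrefix(trimmed, "State") {
			parts := strings.SplitN(trimmed, ":", 2)
			if len(parts) == 2 {
				return strings.TrimSpace(parts[1])
			}
		}
	}
	return "Unknown"
}

func main() {
	// Hypothetical excerpt of `nvidia-smi -q -i 0` output.
	sample := `    Fabric
        State                             : Completed
        Status                            : Success`
	fmt.Println(parseFabricState(sample)) // prints "Completed"
}
```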

The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform.
A key step in determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port among the InfiniBand devices. LPFs are identified through the Vital Product Data (VPD), which includes a vendor-specific field called `SMDL` with the non-zero value `SW_MNG`. The first matching device is selected, and its port GUID is extracted and passed to NVLSM and FabricManager.
Both services therefore share a configuration value that results from a common initialization process. Additionally, they communicate with each other over a Unix socket.
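The selection logic can be sketched as follows. The `ibDevice` type is ours, the raw-blob scan for the `SW_MNG` marker is a simplification of the upstream `ibstat`-based detection, and all device data below is fabricated:

```go
package main

import (
	"bytes"
	"fmt"
)

// ibDevice models the minimal information needed here: the device name, its
// raw PCI Vital Product Data blob, and its port GUID. Field names are ours,
// not from any NVIDIA or rdma-core API.
type ibDevice struct {
	Name string
	VPD  []byte
	GUID string
}

// selectLPF returns the first device whose VPD carries the vendor-specific
// SMDL field set to SW_MNG, marking an NVSwitch Limited Physical Function.
func selectLPF(devs []ibDevice) (ibDevice, bool) {
	for _, d := range devs {
		if bytes.Contains(d.VPD, []byte("SW_MNG")) {
			return d, true
		}
	}
	return ibDevice{}, false
}

func main() {
	// Fabricated device inventory, for illustration only.
	devs := []ibDevice{
		{Name: "mlx5_0", VPD: []byte("plain HCA"), GUID: "0x0002c90300a1b2c3"},
		{Name: "ibp0s1", VPD: []byte("..SMDL..SW_MNG.."), GUID: "0x0896fd0300f1e2d3"},
	}
	if lpf, ok := selectLPF(devs); ok {
		// The GUID would be handed to both NVLSM and FabricManager.
		fmt.Printf("NVLink 5.0+ fabric: LPF %s, port GUID %s\n", lpf.Name, lpf.GUID)
	}
}
```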

This patch introduces the following changes:
 * Adds NVLSM to the nvidia-fabricmanager extension.
 * Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`:
   * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as in the upstream script.
   * Starts FabricManager, and starts NVLSM only when needed.
   * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes.
 * Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions.

Signed-off-by: Thibault VINCENT <thibault.vincent@enix.fr>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2025-10-21 14:33:33 +05:30


```
go 1.24.5

use (
	./internal/grype-scan
	./examples/hello-world-service/src
	./nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper
	./nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper
	./nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper
)
```