mirror of
https://github.com/siderolabs/extensions.git
synced 2026-05-05 20:26:10 +02:00
Quoting: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf > On NVIDIA DGX-B200, HGX-B200, and HGX-B100 systems and later, the FabricManager package needs an additional NVLSM dependency for proper operation. > NVLink Subnet manager (NVLSM) originated from the InfiniBand networking and contains additional logic to program NVSwitches and NVLinks. Not running the NVLSM on Blackwell and newer fabrics with NVLink 5.0+ will result in FabricManager failing to start with error `NV_WARN_NOTHING_TO_DO`. NVSwitches will remain uninitialized and applications will fail with the `CUDA_ERROR_SYSTEM_NOT_READY` or `cudaErrorSystemNotReady` error. The CUDA initialization process can only begin after the GPUs complete their registration process with the NVLink fabric. A GPU fabric registration status can be verified with the command: `nvidia-smi -q -i 0 | grep -i -A 2 Fabric`. An `In Progress` state indicates that the GPU is being registered and FabricManager is likely not running or missing the NVLSM dependency. A `Completed` state is shown when the GPU is successfully registered with the NVLink fabric. The FabricManager package includes the script `nvidia-fabricmanager-start.sh`, which is used to selectively start FabricManager and NVLSM processes depending on the underlying platform. A key aspect of determining whether an NVLink 5.0+ fabric is present is to look for a Limited Physical Function (LPF) port in InfiniBand devices. To differentiate LPFs, the Vital Product Data (VPD) information includes a vendor-specific field called `SMDL`, with a non-zero value defined as `SW_MNG`. The first device is then selected, and it's port GUID is extracted and passed to NVLSM and FabricManager. So both services share a configuration key that result from a common initialization process. Additionally, they communicate with each other over a Unix socket. This patch introduces the following changes: * Adds NVLSM to the nvidia-fabricmanager extension. * Introduces a new `nvidia-fabricmanager-wrapper` program to replicate the initialization process from `nvidia-fabricmanager-start.sh`: * Detects NVLink 5.0+ fabrics and extracts an NVSwitch LPF port GUID. This is done by calling libibumad directly with CGO instead of parsing the output of the `ibstat` command as in the upstream script. * Starts FabricManager, and NVLSM only when needed. * Keeps both process lifecycles synchronized and ensures the Talos container will restart if either process crashes. * Refactors the nvidia-fabricmanager container to be self-contained, as this service does not share files with other nvidia-gpu extensions. Signed-off-by: Thibault VINCENT <thibault.vincent@enix.fr> Signed-off-by: Noel Georgi <git@frezbo.dev>
10 lines
281 B
Plaintext
10 lines
281 B
Plaintext
go 1.24.5
|
|
|
|
use (
|
|
./internal/grype-scan
|
|
./examples/hello-world-service/src
|
|
./nvidia-gpu/nvidia-container-toolkit/nvidia-container-runtime-wrapper
|
|
./nvidia-gpu/nvidia-container-toolkit/nvidia-persistenced-wrapper
|
|
./nvidia-gpu/nvidia-fabricmanager/nvidia-fabricmanager-wrapper
|
|
)
|