NVSDM API Reference

The user guide for the NVSDM library.

1. Introduction

NVSwitch Device Monitoring (NVSDM) is a library for monitoring NVSwitch devices on NVIDIA Blackwell systems. The NVSDM API provides a wide range of telemetry, including device health, port counters, and PCIe statistics.

The NVSDM package also contains the experimental nvsdm_cli utility, which provides a convenient way to use the NVSDM library from the command line.

Note

nvsdm_cli is an experimental tool and may change or be removed without notice.

Note

NVSDM does not currently support Ethernet devices.

1.1. Change log of NVSDM library

This chapter lists the API changes and bug fixes introduced in each release of the library.

1.1.1. Changes between NVSDM v1.2.0 and v1.3.0

  • Added new ConnectX counter ID NVSDM_CONNECTX_TELEM_CTR_PCIE_LINK_INBOUND_BYTES

  • Added new ConnectX counter ID NVSDM_CONNECTX_TELEM_CTR_PCIE_LINK_OUTBOUND_BYTES

  • Updated Doxygen documentation to fix warnings and improve grouping support.

1.1.2. Changes between NVSDM v1.1.0 and v1.2.0

  • Added a new API to retrieve the “local” port number: nvsdmPortGetLocalNum

  • Modified nvsdmDeviceGetFirmwareVersion to also retrieve firmware versions for ConnectX HCAs in addition to switches

  • Added support for four “extended” (i.e., 64-bit) PMA counters:

    • NVSDM_PORT_TELEM_CTR_EXT_XMIT_DATA

    • NVSDM_PORT_TELEM_CTR_EXT_RCV_DATA

    • NVSDM_PORT_TELEM_CTR_EXT_XMIT_PKTS

    • NVSDM_PORT_TELEM_CTR_EXT_RCV_PKTS
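
To illustrate why 64-bit counters are useful, the following sketch computes the difference between two samples of a fixed-width counter and estimates how quickly a counter of a given width runs through its range. It is not part of NVSDM: the helper names are hypothetical, the counter is assumed to wrap around rather than saturate, and the rate is an arbitrary illustrative figure, since the unit of each count is not stated here.

```python
def counter_delta(prev: int, curr: int, width_bits: int = 64) -> int:
    """Return curr - prev for an unsigned counter that wraps at 2**width_bits.

    Assumption: the counter wraps around. A saturating counter would need
    different handling (for example, treating the maximum value as invalid).
    """
    return (curr - prev) % (1 << width_bits)


def seconds_to_exhaust(width_bits: int, counts_per_second: float) -> float:
    """Time for a counter starting at zero to run through its full range."""
    if counts_per_second <= 0:
        raise ValueError("counts_per_second must be positive")
    return (1 << width_bits) / counts_per_second


# Hypothetical rate of 10**10 counts per second, chosen only for illustration.
rate = 1e10
short_s = seconds_to_exhaust(32, rate)  # well under one second
long_s = seconds_to_exhaust(64, rate)   # decades
```

At high link rates a 32-bit counter can run through its range between two polls, so the difference between samples becomes ambiguous; a 64-bit counter avoids this for any realistic polling interval.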

1.1.3. Changes between NVSDM v1.0 and v1.1.0

  • Added nvsdmSetLogFile to specify a log file.

  • Added nvsdmDeviceGetFirmwareVersion to retrieve the firmware version for a given switch.

  • Added nvsdmDeviceGetTelemetryValues to retrieve telemetry from a device.

  • Added a new telemetry type: NVSDM_TELEM_TYPE_CONNECTX for ConnectX device telemetry.
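
Because functions were added across releases, an application may want to check which ones the installed library exports before relying on them. The following is a minimal sketch of such a check using Python's ctypes module. The helper names are hypothetical, the shared object name libnvsdm.so.1 is taken from the installation section, and the code only tests whether symbols are exported; it does not call any NVSDM function.

```python
import ctypes

# Release in which each function was introduced, taken from the change log.
INTRODUCED_IN = {
    "nvsdmSetLogFile": (1, 1, 0),
    "nvsdmDeviceGetFirmwareVersion": (1, 1, 0),
    "nvsdmDeviceGetTelemetryValues": (1, 1, 0),
    "nvsdmPortGetLocalNum": (1, 2, 0),
}


def exported_symbols(library_name="libnvsdm.so.1"):
    """Return the names in INTRODUCED_IN that the library exports.

    Returns None if the library cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(library_name)
    except OSError:
        return None
    # hasattr() performs a symbol lookup and is False for missing symbols.
    return {name for name in INTRODUCED_IN if hasattr(lib, name)}


def newest_release_implied(symbols):
    """Highest 'introduced in' release among the given symbols, or None."""
    versions = [INTRODUCED_IN[s] for s in symbols if s in INTRODUCED_IN]
    return max(versions, default=None)


if __name__ == "__main__":
    found = exported_symbols()
    if found is None:
        print("libnvsdm.so.1 could not be loaded")
    else:
        print("library is at least release", newest_release_implied(found))
```

The result is only a lower bound on the installed release: the presence of nvsdmPortGetLocalNum implies v1.2.0 or later, but says nothing about changes that did not add a function.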

1.2. Known issues in the current version of NVSDM library

This is a list of known NVSDM issues in the current release:

  • The following ConnectX inbound and outbound byte counters are calculated over a very short period of time, rather than over the lifetime of the NVSDM library as intended:

    • ConnectX counter ID NVSDM_CONNECTX_TELEM_CTR_PCIE_LINK_INBOUND_BYTES

    • ConnectX counter ID NVSDM_CONNECTX_TELEM_CTR_PCIE_LINK_OUTBOUND_BYTES

2. Getting Started

2.1. System Requirements

NVSDM is supported on Linux x86_64 platforms.

The following software dependencies are required to run NVSDM:

The following software dependencies are optional and are used to query additional ConnectX telemetry:

  • libdoca-sdk-telemetry-dev

  • fwctl-dkms

Note: These optional dependencies are provided by the NVIDIA DOCA repository.

Note: NVSDM does not depend on any other libraries or headers from the CUDA toolkit.

2.2. Installation

NVSDM can be installed from the CUDA Toolkit Installer. Once you have added the CUDA package repository to your system, you can install NVSDM as follows:

  • For Debian and Ubuntu based OS distributions:

    sudo apt-get install -y libnvsdm-<driver-branch>
    
  • For Red Hat Enterprise Linux 8/9 based OS distributions:

    sudo dnf install libnvsdm-<driver-branch>
    

The libnvsdm installer package installs the following components:

  • /usr/lib/x86_64-linux-gnu/include/nvsdm.h

  • /usr/lib/x86_64-linux-gnu-bin/nvsdm_cli

  • /usr/lib/x86_64-linux-gnu/libnvsdm.so.1

  • /usr/lib/x86_64-linux-gnu/libnvsdm.so

  • /usr/share/doc/libnvsdm-<version>/README

  • /usr/share/doc/libnvsdm-<version>/third-party-notices.txt
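
As a quick check after installation, the sketch below reports which of the listed files are absent. It is only an illustration: the helper name is hypothetical, the paths are copied verbatim from the component list, and the entries containing <version> are left out because their final names depend on the installed release.

```python
from pathlib import Path

# Fixed paths copied from the component list; other distributions or
# packaging layouts may place the files elsewhere.
EXPECTED_FILES = [
    "/usr/lib/x86_64-linux-gnu/include/nvsdm.h",
    "/usr/lib/x86_64-linux-gnu-bin/nvsdm_cli",
    "/usr/lib/x86_64-linux-gnu/libnvsdm.so.1",
    "/usr/lib/x86_64-linux-gnu/libnvsdm.so",
]


def missing_files(paths=EXPECTED_FILES):
    """Return the paths that do not exist on this system, in input order."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    absent = missing_files()
    if absent:
        for path in absent:
            print(f"missing: {path}")
    else:
        print("all listed NVSDM files are present")
```

An empty result only confirms that the files exist; it does not verify that the library loads or that the driver branch matches.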

2.3. Using the NVSDM API

Please see the API reference for details on how to use NVSDM.