Version: 3.1 (Latest)

User Guide:

  • Overview
    • Terminology
    • Focus Areas
      • Provide robust, online health and diagnostics
      • Enable job-level statistics and continuous GPU telemetry
      • Manage GPUs as collections of related resources
      • Configure NVSwitches
      • Define and enforce GPU configuration state
      • Automate GPU management policies
    • Target Users
  • Getting Started
    • Supported Platforms
      • Supported Linux Distributions
    • Installation
      • Pre-Requisites
      • Remove Older Installations
      • Installation
        • Ubuntu LTS and Debian
        • RHEL / CentOS / Rocky Linux
        • SUSE SLES / OpenSUSE
      • Post-Install
    • Basic Components
      • DCGM shared library
      • NVIDIA Host Engine
      • DCGM CLI Tool
      • Python Bindings
      • Software Development Kit
    • Modes of Operation
      • Embedded Mode
      • Standalone Mode
    • Static Library
  • Feature Overview
    • Groups
    • Configuration
    • Policy
      • Notifications
      • Actions
    • Job Statistics
    • Health and Diagnostics
      • Background Health Checks
      • Active Health Checks
    • Topology
    • NVLink Counters
    • Field Groups
    • Link Status
    • Profiling Metrics
      • Metrics
      • Multiplexing of Profiling Counters
      • Profiling Sampling Rate
      • CUDA Test Generator (dcgmproftester)
      • Metrics on Multi-Instance GPU
        • Example 1
        • Understanding Metrics
        • Platform Support
  • DCGM Diagnostics
    • Overview
      • DCGM Diagnostic Goals
      • Beyond the Scope of the DCGM Diagnostics
      • Run Levels and Tests
    • Overview of Plugins
      • Deployment Plugin
        • Preconditions
        • Configuration Parameters
        • Stat Outputs
        • Failure
      • PCIe - GPU Bandwidth Plugin
        • Preconditions
        • Sub tests
      • Memtest Diagnostic
        • Overview
        • Test Descriptions
        • Supported Parameters
        • Sample Commands
      • Pulse Test Diagnostic
        • Overview
        • Test Description
        • Sample Commands
        • Failure Conditions
      • End User Diagnostics (EUD)
        • Supported Products
        • Included Tests
        • Getting Started with EUD
        • Running the EUD
  • DCGM Modularity
    • Module List
    • Disabling Modules
  • Error Injection
    • Overview
      • Error Injection Workflow
      • Field Identifiers
      • Examples with dcgmi
        • Thermal Violation
        • PCIe Replay Errors
        • ECC Errors
      • API Examples

API Reference:

  • Modules
    • Administrative
      • Init and Shutdown
      • Auxiliary information about DCGM engine
    • System
      • Discovery
      • Grouping
      • Field Grouping
      • Status Handling
    • Configuration
      • Setup and Management
      • Manual Invocation
    • Field APIs
    • Process Statistics
    • Job Statistics
    • Health Monitor
    • Policies
      • Setup and Management
      • Manual Invocation
    • Topology
    • Metadata
    • Topology
    • Modules
    • Profiling
    • Enums and Macros
    • Structure Definitions
    • Field Types
    • Field Scope
    • Field Entity
    • Field Identifiers
    • DCGMAPI_Admin_ExecCtrl
  • Data Structures

Release Notes:

  • DCGM Release Notes
    • 3.1.6
      • Improvements
      • Fixed Issues
    • 3.1.3
      • New Features
        • Major API changes and Deprecations
      • Fixed Issues
      • Known Issues
NVIDIA DCGM Documentation


© Copyright 2018-2023, NVIDIA Corporation. Last updated on 2023-02-01.
