..
    Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    See LICENSE for license information.

Setup
=====

Precision debug tools for the Transformer Engine use the `Nvidia-DL-Framework-Inspect `_ package from NVIDIA.
Please refer to the Nvidia-DL-Framework-Inspect `documentation `_ for more details.
Below, we outline the steps for debug initialization.

initialize()
------------

Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.

**Parameters**

- **config_file** (*str*, default=""): Path to the configuration YAML file containing features to enable and layer names. To run without a configuration file, pass ``""``.
- **feature_dirs** (*List[str] | str*): List of directories containing features to load and register. To use TE features, pass ``[/path/to/transformerengine/transformer_engine/debug/features]``.
- **logger** (*Union[BaseLogger, None]*, default=None): Logger for logging tensor statistics. Should adhere to ``BaseLogger`` from the `Nvidia-DL-Framework-Inspect `_ package.
- **log_dir** (*str*, default="."): Directory path to hold ``debug_logs`` and ``debug_statistics_logs``.
- **tb_writer** (*TensorBoardWriter*, default=None): TensorBoard writer for logging.
- **default_logging_enabled** (*bool*, default=False): Enable default logging to a file.

.. code-block:: python

    import nvdlfw_inspect.api as debug_api

    debug_api.initialize(
        config_file="./config.yaml",
        feature_dirs=["/path/to/transformer_engine/debug/features"],
        log_dir="./log_dir")

set_tensor_reduction_group()
----------------------------

Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.
If the tensor reduction group is not specified, statistics are reduced across all nodes in the run.

**Parameters**

- **group** (torch.distributed.ProcessGroup): The process group across which tensors will be reduced to get stats.

.. code-block:: python

    import nvdlfw_inspect.api as debug_api

    # initialization
    # (...)

    pipeline_parallel_group = initialize_pipeline_parallel_group()

    debug_api.set_tensor_reduction_group(pipeline_parallel_group)

    # training
    # (...)
    # activation/gradient tensor statistics are reduced along pipeline_parallel_group

set_weight_tensor_tp_group_reduce()
-----------------------------------

By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.

This function is provided by ``transformer_engine.debug``, not by ``debug_api``.

**Parameters**

- **enabled** (*bool*, default=True): A boolean flag to enable or disable the reduction of weight tensor statistics within the tensor parallel group.

.. code-block:: python

    import nvdlfw_inspect.api as debug_api
    from transformer_engine.debug import set_weight_tensor_tp_group_reduce

    # initialization
    # (...)

    set_weight_tensor_tp_group_reduce(False)

    # training
    # (...)
    # weight tensor statistics are not reduced
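
To make the effect of a reduction group more concrete, the following is a conceptual sketch in plain Python. It does not use Nvidia-DL-Framework-Inspect or ``torch.distributed``, and it is not how the library computes statistics internally. The names ``LocalStats`` and ``reduce_stats`` and the sample numbers are hypothetical, introduced only for this illustration. The sketch assumes each rank holds partial statistics for its own shard and shows how the combined result depends on which ranks take part in the reduction.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Dict, Iterable


    @dataclass(frozen=True)
    class LocalStats:
        """Partial statistics of one tensor shard on one rank (hypothetical container)."""
        minimum: float
        maximum: float
        total: float
        numel: int


    def reduce_stats(per_rank: Dict[int, LocalStats], group: Iterable[int]) -> Dict[str, float]:
        """Combine the partial statistics of the ranks listed in ``group``."""
        members = [per_rank[rank] for rank in group]
        numel = sum(s.numel for s in members)
        return {
            "min": min(s.minimum for s in members),
            "max": max(s.maximum for s in members),
            "mean": sum(s.total for s in members) / numel,
        }


    # Four simulated ranks, each holding statistics of its own shard.
    per_rank = {
        0: LocalStats(minimum=-1.0, maximum=2.0, total=6.0, numel=4),
        1: LocalStats(minimum=-3.0, maximum=1.0, total=0.0, numel=4),
        2: LocalStats(minimum=0.0, maximum=5.0, total=8.0, numel=4),
        3: LocalStats(minimum=-2.0, maximum=0.5, total=-4.0, numel=4),
    }

    # No group specified: every rank in the run contributes.
    all_ranks = reduce_stats(per_rank, group=per_rank.keys())

    # Group restricted to ranks 0 and 1: only those ranks contribute.
    subgroup = reduce_stats(per_rank, group=[0, 1])

    print(all_ranks)
    print(subgroup)

In this toy setup the maximum and the mean differ between the two reductions, which is why the choice of group matters when comparing logged statistics across runs with different parallel layouts.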