Setup
Precision debug tools for the Transformer Engine use the Nvidia-DL-Framework-Inspect package from NVIDIA. Please refer to the Nvidia-DL-Framework-Inspect documentation for more details. Below, we outline the steps for debug initialization.
initialize()
Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.
Parameters
config_file (str, default=""): Path to the configuration YAML file containing features to enable and layer names. If one wants to run without the configuration file, pass "".
feature_dirs (List[str] | str): List of directories containing features to load and register. One needs to pass [/path/to/transformerengine/transformer_engine/debug/features] to use TE features.
logger (Union[BaseLogger, None], default=None): Logger for logging tensor statistics. Should adhere to BaseLogger from the Nvidia-DL-Framework-Inspect package.
log_dir (str, default="."): Directory path to hold debug_logs and debug_statistics_logs.
tb_writer (TensorBoardWriter, default=None): TensorBoard writer for logging.
default_logging_enabled (bool, default=False): Enable default logging to the file.
import nvdlfw_inspect.api as debug_api
debug_api.initialize(
    config_file="./config.yaml",
    feature_dirs=["/path/to/transformer_engine/debug/features"],
    log_dir="./log_dir",
)
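The feature directory passed to initialize() sits inside the transformer_engine package, so it can be located at runtime instead of being hard-coded. The following is a minimal sketch, assuming Transformer Engine is installed as a regular Python package that ships its debug/features directory. The names find_te_feature_dir and init_debug are hypothetical helpers introduced for illustration; they are not part of either library.

```python
import importlib.util
import os


def find_te_feature_dir():
    """Return <transformer_engine package dir>/debug/features, or None if TE is not installed."""
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec("transformer_engine")
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = list(spec.submodule_search_locations)[0]
    # Assumption: the installed package keeps the same layout as the path shown above.
    return os.path.join(package_dir, "debug", "features")


def init_debug(config_file="", log_dir="."):
    """Call initialize() once on this rank, using only the parameters documented above."""
    # Imported lazily so that find_te_feature_dir() works without the package.
    import nvdlfw_inspect.api as debug_api

    feature_dir = find_te_feature_dir()
    if feature_dir is None:
        raise RuntimeError("transformer_engine is not installed")
    debug_api.initialize(
        config_file=config_file,
        feature_dirs=[feature_dir],
        log_dir=log_dir,
    )
```

With this sketch, each rank would call init_debug("./config.yaml", "./log_dir") once before training starts.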
set_tensor_reduction_group()
Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the reduction group section for more details.
If the tensor reduction group is not specified, then statistics are reduced across all nodes in the run.
Parameters
group (torch.distributed.ProcessGroup): The process group across which tensors will be reduced to get stats.
import nvdlfw_inspect.api as debug_api
# initialization
# (...)
pipeline_parallel_group = initialize_pipeline_parallel_group()
debug_api.set_tensor_reduction_group(pipeline_parallel_group)
# training
# (...)
# activation/gradient tensor statistics are reduced along pipeline_parallel_group
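The example above leaves initialize_pipeline_parallel_group() undefined. One way a reduction group could be built with plain PyTorch is sketched below, assuming torch.distributed is already initialized and that consecutive ranks belong to the same group. That layout is an assumption made for illustration; training frameworks usually provide their own group construction. The names contiguous_rank_groups and build_reduction_group are hypothetical helpers.

```python
def contiguous_rank_groups(world_size, group_size):
    """Split ranks 0..world_size-1 into consecutive groups of group_size ranks."""
    if group_size <= 0 or world_size % group_size != 0:
        raise ValueError("world_size must be a multiple of a positive group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def build_reduction_group(group_size):
    """Create every group on every rank and return the one containing this rank."""
    # Imported lazily: only needed in a real distributed run.
    import torch.distributed as dist

    my_group = None
    for ranks in contiguous_rank_groups(dist.get_world_size(), group_size):
        # new_group() must be called by all processes, including non-members.
        group = dist.new_group(ranks=ranks)
        if dist.get_rank() in ranks:
            my_group = group
    return my_group
```

The group returned by build_reduction_group could then be passed to debug_api.set_tensor_reduction_group() in place of pipeline_parallel_group.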
set_weight_tensor_tp_group_reduce()
By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see the reduction group section.
This method is not provided by debug_api, but by transformer_engine.debug.
Parameters
enabled (bool, default=True): A boolean flag to enable or disable the reduction of weight tensor statistics within the tensor parallel group.
import nvdlfw_inspect.api as debug_api
from transformer_engine.debug import set_weight_tensor_tp_group_reduce
# initialization
# (...)
set_weight_tensor_tp_group_reduce(False)
# training
# (...)
# weight tensor statistics are not reduced