multi_tensor.h
Functions handling multi-tensor kernels.
Functions
-
void nvte_multi_tensor_l2norm_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, NVTETensor output, NVTETensor output_per_tensor, NVTETensor ret, NVTETensor ret_per_tensor, int per_tensor, int max_chunks_per_tensor, const int device_id, cudaStream_t stream)
Computes L2 norm for a list of tensors.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [in] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
output – [in] Scratch space whose required size grows with the number of inputs.
output_per_tensor – [in] Fixed-size auxiliary scratch space.
ret – [out] L2 norm of all inputs.
ret_per_tensor – [out] L2 norm for each tensor.
per_tensor – [in] Whether to calculate per-tensor norms or only the cumulative norm.
max_chunks_per_tensor – [in] Maximum number of chunks in any input tensor.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
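The relationship between chunk_size and max_chunks_per_tensor is not spelled out above; a plausible host-side helper (our sketch, not part of the API) computes the value to pass from the element counts of the input tensors:

```c
#include <stddef.h>

/* Ceiling division: number of chunk_size-element chunks covering n elements. */
static size_t chunks_for(size_t n, size_t chunk_size) {
    return (n + chunk_size - 1) / chunk_size;
}

/* Largest chunk count over all input tensors; this is the value to pass as
 * max_chunks_per_tensor. num_elems[i] holds the element count of tensor i. */
static size_t max_chunks_per_tensor(const size_t *num_elems, size_t num_tensors,
                                    size_t chunk_size) {
    size_t max_chunks = 0;
    for (size_t i = 0; i < num_tensors; ++i) {
        size_t c = chunks_for(num_elems[i], chunk_size);
        if (c > max_chunks) max_chunks = c;
    }
    return max_chunks;
}
```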
-
void nvte_multi_tensor_unscale_l2norm_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, NVTETensor output, NVTETensor output_per_tensor, NVTETensor ret, NVTETensor ret_per_tensor, NVTETensor inv_scale, int per_tensor, int max_chunks_per_tensor, const int device_id, cudaStream_t stream)
Computes L2 norm for a list of tensors after unscaling.
Unscaling is only done for computing the L2 norm. The tensors themselves are not updated.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [in] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
output – [in] Scratch space whose required size grows with the number of inputs.
output_per_tensor – [in] Fixed-size auxiliary scratch space.
ret – [out] L2 norm of all inputs.
ret_per_tensor – [out] L2 norm for each tensor.
inv_scale – [in] Scalar for the unscaling operation.
per_tensor – [in] Whether to calculate per-tensor norms or only the cumulative norm.
max_chunks_per_tensor – [in] Maximum number of chunks in any input tensor.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
-
void nvte_multi_tensor_adam_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, const float lr, const float beta1, const float beta2, const float epsilon, const int step, const int mode, const int bias_correction, const float weight_decay, const int device_id, cudaStream_t stream)
Compute and apply the gradient update to parameters for the Adam optimizer.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
lr – [in] Learning rate.
beta1 – [in] Coefficient for first moment of gradient.
beta2 – [in] Coefficient for second moment of gradient.
epsilon – [in] Term added to the denominator for numerical stability.
step – [in] Iteration counter.
mode – [in] Whether to use AdamW (decoupled weight decay) instead of Adam (L2 penalty added to the gradient).
bias_correction – [in] Whether to apply correction factor for moment estimates.
weight_decay – [in] L2 penalty for weight decay.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
-
void nvte_multi_tensor_adam_param_remainder_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, const float lr, const float beta1, const float beta2, const float epsilon, const int step, const int mode, const int bias_correction, const float weight_decay, const int device_id, cudaStream_t stream)
Compute and apply the gradient update to parameters for the Adam optimizer where the master parameters store only the remainder bits.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
lr – [in] Learning rate.
beta1 – [in] Coefficient for first moment of gradient.
beta2 – [in] Coefficient for second moment of gradient.
epsilon – [in] Term added to the denominator for numerical stability.
step – [in] Iteration counter.
mode – [in] Whether to use AdamW (decoupled weight decay) instead of Adam (L2 penalty added to the gradient).
bias_correction – [in] Whether to apply correction factor for moment estimates.
weight_decay – [in] L2 penalty for weight decay.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
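The "remainder bits" scheme is not defined above; a common interpretation (our assumption — the kernel's actual layout and rounding may differ) is that a BF16 parameter holds the top 16 bits of the FP32 master value, so the optimizer only needs to store the low 16 bits separately to reconstruct full precision:

```c
#include <stdint.h>
#include <string.h>

/* Split an FP32 master value into a truncated-BF16 parameter (top 16 bits)
 * and the remainder (bottom 16 bits). Lossless by construction here. */
static void split_fp32(float master, uint16_t *bf16_bits, uint16_t *remainder) {
    uint32_t u;
    memcpy(&u, &master, sizeof u);
    *bf16_bits = (uint16_t)(u >> 16);     /* what the BF16 parameter holds */
    *remainder = (uint16_t)(u & 0xFFFF);  /* extra bits kept by the optimizer */
}

/* Reconstruct the FP32 master value from the two 16-bit halves. */
static float join_fp32(uint16_t bf16_bits, uint16_t remainder) {
    uint32_t u = ((uint32_t)bf16_bits << 16) | remainder;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```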
-
void nvte_multi_tensor_adam_fp8_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, const float lr, const float beta1, const float beta2, const float epsilon, const int step, const int mode, const int bias_correction, const float weight_decay, const NVTEDType fp8_dtype, const int device_id, cudaStream_t stream)
Compute and apply the gradient update to parameters for the Adam optimizer when model parameters are in Float8 precision.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
lr – [in] Learning rate.
beta1 – [in] Coefficient for first moment of gradient.
beta2 – [in] Coefficient for second moment of gradient.
epsilon – [in] Term added to the denominator for numerical stability.
step – [in] Iteration counter.
mode – [in] Whether to use AdamW (decoupled weight decay) instead of Adam (L2 penalty added to the gradient).
bias_correction – [in] Whether to apply correction factor for moment estimates.
weight_decay – [in] L2 penalty for weight decay.
fp8_dtype – [in] FP8 data type for model parameters.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
-
void nvte_multi_tensor_adam_capturable_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, NVTETensor lr, const float beta1, const float beta2, const float epsilon, NVTETensor step, const int mode, const int bias_correction, const float weight_decay, NVTETensor inv_scale, const int device_id, cudaStream_t stream)
Compute and apply the gradient update to parameters for the Adam optimizer with CUDA graph support and LR scheduling.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
lr – [in] Learning rate.
beta1 – [in] Coefficient for first moment of gradient.
beta2 – [in] Coefficient for second moment of gradient.
epsilon – [in] Term added to the denominator for numerical stability.
step – [in] Iteration counter.
mode – [in] Whether to use AdamW (decoupled weight decay) instead of Adam (L2 penalty added to the gradient).
bias_correction – [in] Whether to apply correction factor for moment estimates.
weight_decay – [in] L2 penalty for weight decay.
inv_scale – [in] Scalar for the unscaling operation.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
-
void nvte_multi_tensor_adam_capturable_master_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, NVTETensor lr, const float beta1, const float beta2, const float epsilon, NVTETensor step, const int mode, const int bias_correction, const float weight_decay, NVTETensor inv_scale, const int device_id, cudaStream_t stream)
Compute and apply the gradient update to parameters for the Adam optimizer with CUDA graph support, LR scheduling, and FP32 master weights.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
lr – [in] Learning rate.
beta1 – [in] Coefficient for first moment of gradient.
beta2 – [in] Coefficient for second moment of gradient.
epsilon – [in] Term added to the denominator for numerical stability.
step – [in] Iteration counter.
mode – [in] Whether to use AdamW (decoupled weight decay) instead of Adam (L2 penalty added to the gradient).
bias_correction – [in] Whether to apply correction factor for moment estimates.
weight_decay – [in] L2 penalty for weight decay.
inv_scale – [in] Scalar for the unscaling operation.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
-
void nvte_multi_tensor_sgd_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, float wd, float momentum, float dampening, float lr, int nesterov, int first_run, int wd_after_momentum, float scale, const int device_id, cudaStream_t stream)
Compute and apply the gradient update to parameters for the SGD optimizer.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
wd – [in] Weight decay (L2 penalty).
momentum – [in] Momentum factor.
dampening – [in] Dampening factor.
lr – [in] Learning rate.
nesterov – [in] Whether to enable Nesterov momentum.
first_run – [in] Whether momentum buffers have been initialized.
wd_after_momentum – [in] Whether to apply weight decay after the momentum update.
scale – [in] Scalar for the scaling operation.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
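The parameters above map onto the usual SGD-with-momentum recipe. A host-side sketch (our reference code, not the kernel) of one per-element step, showing where scale, first_run, and wd_after_momentum enter:

```c
/* One SGD step for a single parameter element. scale is applied to the
 * incoming gradient first; wd_after_momentum chooses whether weight decay is
 * folded in before or after the momentum update; on the first run the
 * momentum buffer is seeded with the gradient. Returns the updated param. */
static float sgd_step(float p, float g, float *momentum_buf,
                      float wd, float momentum, float dampening, float lr,
                      int nesterov, int first_run, int wd_after_momentum,
                      float scale) {
    g *= scale;
    if (!wd_after_momentum) g += wd * p;
    if (first_run)
        *momentum_buf = g;                     /* initialize momentum buffer */
    else
        *momentum_buf = momentum * (*momentum_buf) + (1.0f - dampening) * g;
    float update = nesterov ? g + momentum * (*momentum_buf) : *momentum_buf;
    if (wd_after_momentum) update += wd * p;
    return p - lr * update;
}
```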
-
void nvte_multi_tensor_scale_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, float scale, const int device_id, cudaStream_t stream)
Check overflow and scale a list of tensors.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
scale – [in] Scalar for the scaling operation.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.
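A host-side sketch of the behaviour this entry describes (our reference code; the kernel reports overflow through its flag tensor, modeled here as a plain int):

```c
#include <math.h>
#include <stddef.h>

/* Multiply every element by scale in place, raising overflow_flag if a
 * non-finite value is produced or encountered. */
static void scale_tensor(float *x, size_t n, float scale, int *overflow_flag) {
    for (size_t i = 0; i < n; ++i) {
        x[i] *= scale;
        if (!isfinite(x[i])) *overflow_flag = 1;
    }
}
```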
-
void nvte_multi_tensor_compute_scale_and_scale_inv_cuda(int chunk_size, NVTETensor noop_flag, NVTETensor **tensor_lists, const size_t num_tensor_lists, const size_t num_tensors_per_list, float max_fp8, int force_pow_2_scales, float epsilon, const int device_id, cudaStream_t stream)
Compute the scaling factor and its reciprocal for a list of tensors.
Warning
This API is experimental and subject to change.
Warning
Argument device_id is deprecated and will be removed in a future release.
- Parameters:
chunk_size – [in] Number of tensor elements processed by a CUDA block.
noop_flag – [in] If this single-element tensor holds a non-zero value, the kernel exits immediately.
tensor_lists – [inout] 2D array of input tensors.
num_tensor_lists – [in] Size (dim0) of tensor_lists.
num_tensors_per_list – [in] Size (dim1) of tensor_lists.
max_fp8 – [in] Maximum representable value of the underlying FP8 format.
force_pow_2_scales – [in] Whether to force scaling factors to be powers of 2.
epsilon – [in] Term added to the denominator for numerical stability.
device_id – [in] [DEPRECATED] CUDA device ID for this operation.
stream – [in] CUDA stream used for this operation.