> Use the bfloat16 datatype (8 mantissa bits) for internal computations

Referring to https://github.com/mgouicem/oneDNN/tree/mgouicem/rfcs/implicit_downconvert/rfcs/20210301-computation-datatype, oneDNN only takes this as a hint, which means either the BF16 or the FP32 data type may be picked for internal computations.

> When the precision is high, the CUDA/CUDNN backend will be allowed to use TF32 as the internal computation data type. When the precision is medium, the MKLDNN backend will be allowed to use BF16 as the internal computation data type.

Referring to https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch-set-float32-matmul-precision, shouldn't it apply to all the backends?

For the 2 design options in "Frontend API" and "Inductor linear packable", do we have a preferred option now? If so, we can discuss our preference for the implementation.
> When the precision is high, the CUDA/CUDNN backend will be allowed to use TF32 as the internal computation data type. When the precision is medium, the MKLDNN backend will be allowed to use BF16 as the internal computation data type.
>
> Referring to https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch-set-float32-matmul-precision, shouldn't it apply to all the backends?

Yes, I changed it to all backends instead of MKLDNN or CUDA.
- For the feature title:

  > Use the bfloat16 datatype (8 mantissa bits) for internal computations

  Referring to https://github.com/mgouicem/oneDNN/tree/mgouicem/rfcs/implicit_downconvert/rfcs/20210301-computation-datatype, oneDNN only takes this as a hint, which means either the BF16 or the FP32 data type may be picked for internal computations.

- Here is the IPEX BF32 RFC: [RFC] IPEX integration of oneDNN implicit reduced precision arithmetic feature (intel-innersource/frameworks.ai.pytorch.ipex-cpu#381)

Thanks, changed.
Please add notes on how CUDA can support the new frontend APIs, since these are general APIs that can be applied to all backends.

Thanks for the advice, added.
# RFC: Extend set fp32 precision API to support Convolution and RNN
## Overview

This RFC proposes the addition of a user-controlled frontend API to configure the internal precision of `float32` operations in convolutional (CONV) and recurrent neural networks (RNN) within PyTorch. Currently, PyTorch offers `torch.set_float32_matmul_precision` to configure the internal precision of `float32` matrix multiplication. This RFC suggests extending this functionality to convolution and recurrent neural network operations by providing `torch.set_float32_conv_precision` and `torch.set_float32_rnn_precision`. The proposed APIs will mimic the behavior of `torch.set_float32_matmul_precision`.

## Frontend Changes
Frontend changes involve introducing new APIs:

- `torch.set_float32_conv_precision`, `torch.get_float32_conv_precision`
- `torch.set_float32_rnn_precision`, `torch.get_float32_rnn_precision`

These APIs will function similarly to `torch.set_float32_matmul_precision` and `torch.get_float32_matmul_precision`. Users can set the precision to `highest`, `high`, or `medium`, each with corresponding backend behavior:

- `highest`: use the highest available precision, avoiding lower-precision computation.
- `high`: allow backends to use TensorFloat32 (TF32) or to treat each `float32` number as the sum of two `bfloat16` numbers.
- `medium`: allow backends to use BFloat16 (BF16).
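A minimal usage sketch of the proposed frontend surface, assuming the conv/rnn variants land with the same signature as the existing matmul API (only the matmul API exists in PyTorch today; the conv/rnn calls below are this RFC's proposal):

```python
import torch

# Existing API: controls the internal precision of float32 matmuls.
torch.set_float32_matmul_precision("high")

# Proposed in this RFC (not yet in PyTorch): the same three-level knob
# for convolution and RNN operators.
torch.set_float32_conv_precision("high")    # TF32 / two-bfloat16 sums allowed
torch.set_float32_rnn_precision("medium")   # BF16 allowed

print(torch.get_float32_conv_precision())   # "high"
print(torch.get_float32_rnn_precision())    # "medium"
```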
## Backend Changes

Global flags `float32_conv/rnn_precision` will be introduced at this location in the PyTorch repository. These flags can be accessed and modified by the frontend APIs `torch.get/set_float32_conv/rnn_precision`. Backend-related operators will read these flags to control their internal computation data types. For example:

- Check `float32_conv_precision` in the cuDNN Conv kernel and `float32_rnn_precision` in the cuDNN RNN kernel. If not set to `highest`, the internal computation data type will be TF32.
- Check `float32_conv_precision` in the oneDNN Conv kernel and `float32_rnn_precision` in the oneDNN RNN kernel. If set to `medium`, the internal data type will be BF16.
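As an illustration of the rules above (a sketch only; the real checks would live in the ATen kernels, and the helper name and string values here are hypothetical):

```python
def pick_fp32_internal_dtype(precision: str, backend: str) -> str:
    """Map the proposed float32_conv/rnn_precision value to the internal
    computation data type, following the rules sketched in this RFC."""
    if backend == "cudnn":
        # cuDNN Conv/RNN kernels: anything below "highest" permits TF32.
        return "fp32" if precision == "highest" else "tf32"
    if backend == "onednn":
        # oneDNN Conv/RNN kernels: only "medium" permits the implicit
        # BF16 down-conversion; "high" stays in FP32 here.
        return "bf16" if precision == "medium" else "fp32"
    return "fp32"

assert pick_fp32_internal_dtype("high", "cudnn") == "tf32"
assert pick_fp32_internal_dtype("high", "onednn") == "fp32"
assert pick_fp32_internal_dtype("medium", "onednn") == "bf16"
```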
## Flag Overrides

The existing cuDNN backend-specific flag `torch.backends.cudnn.allow_tf32` will interact with the proposed backend-agnostic flags `torch.set_float32_conv/rnn_precision`. These flags will override each other (we follow the same behavior as between `torch.backends.cuda.matmul.allow_tf32` and `float32_matmul_precision`):

- Setting `torch.backends.cudnn.allow_tf32` will set `float32_rnn/conv_precision` to `high` (TF32 enabled) or `highest` (TF32 disabled).
- Setting `float32_rnn/conv_precision` to `high` or `medium` will enable `torch.backends.cudnn.allow_tf32`, while setting either of them to `highest` will disable it.
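A sketch of the intended override behavior, assuming the conv/rnn APIs are adopted as proposed (the asserts below express the RFC's intended semantics, not current PyTorch behavior):

```python
import torch

# Flipping the backend-specific flag updates the backend-agnostic ones...
torch.backends.cudnn.allow_tf32 = True
assert torch.get_float32_conv_precision() == "high"   # TF32 enabled
assert torch.get_float32_rnn_precision() == "high"

# ...and setting a backend-agnostic flag to "highest" disables it in turn.
torch.set_float32_conv_precision("highest")
assert torch.backends.cudnn.allow_tf32 is False       # TF32 disabled
```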
## Additional cuDNN Flag

We discussed above how the existing cuDNN flag `torch.backends.cudnn.allow_tf32` interacts with `torch.set_float32_conv/rnn_precision`. However, we believe it is cleaner to use separate flags in cuDNN. We suggest deprecating `torch.backends.cudnn.allow_tf32` in favor of `torch.backends.cudnn.conv.allow_tf32` and `torch.backends.cudnn.rnn.allow_tf32`. Then the cuDNN backend-specific flags and the backend-agnostic flags have a one-to-one correspondence, as `torch.backends.cuda.matmul.allow_tf32` has with `torch.float32_matmul_precision`.
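Under this proposal, the correspondence would be as follows (the conv and rnn rows are the new flags suggested above):

| Backend-specific flag | Backend-agnostic flag |
| --- | --- |
| `torch.backends.cuda.matmul.allow_tf32` | `torch.float32_matmul_precision` |
| `torch.backends.cudnn.conv.allow_tf32` | `torch.float32_conv_precision` |
| `torch.backends.cudnn.rnn.allow_tf32` | `torch.float32_rnn_precision` |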
## Motivation
Lower-precision computation from different backends can significantly improve performance for deep learning workloads with minimal impact on accuracy; examples include TF32 from CUDA/cuDNN and the implicit reduced-precision arithmetic feature from oneDNN. By providing a user-controlled frontend API, users can easily configure the internal computation data type of convolutional and recurrent neural networks without knowing the details of the different backends. This allows them to leverage the performance benefits of lower precision while keeping the precision loss acceptable. The proposed flags also offer advantages compared to Autocast.

## Pitch
Introduce `float32_conv/rnn_precision` and enable users to control the internal computation data type of convolutional and recurrent neural networks by configuring the value of `float32_conv/rnn_precision`.