export_onnx
Utility to export a quantized torch model to quantized ONNX.
Functions

| Function | Description |
| --- | --- |
| `export_fp4` | Export the quantized model to FP4 ONNX. |
| `export_fp8` | Export the quantized model to FP8 ONNX. |
| `export_fp8_mha` | Export the quantized fMHA to FP8 ONNX. |
| `export_int8` | Export the quantized model to INT8 ONNX. |
- export_fp4(g, inputs, block_size, amax, num_bits, trt_high_precision_dtype, onnx_quantizer_type)
Export the quantized model to FP4 ONNX.
- Parameters:
g (GraphContext) –
inputs (Value) –
block_size (int) –
amax (Value) –
num_bits (Tuple[int, int]) –
trt_high_precision_dtype (str) –
onnx_quantizer_type (str) –
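`export_fp4` itself emits ONNX nodes into a `GraphContext`, so it cannot run outside a `torch.onnx` export. The block-wise FP4 numerics it encodes can be sketched in plain NumPy. The helper name `fake_quant_fp4` is hypothetical, and the sketch assumes the FP4 E2M1 format (max magnitude 6) with one `amax` per block, i.e. a per-block scale of `amax / 6`:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x, block_size, amax):
    """Hypothetical sketch of block-wise FP4 quantize-dequantize (QDQ).

    `x` is split into blocks of `block_size`; each block is divided by its
    scale (amax / 6), snapped to the nearest representable FP4 magnitude,
    then rescaled back.
    """
    scale = amax / 6.0                       # per-block scale, shape (n_blocks,)
    blocks = x.reshape(-1, block_size)
    scaled = blocks / scale[:, None]
    sign = np.sign(scaled)
    mag = np.abs(scaled)
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(mag[..., None] - FP4_E2M1_VALUES).argmin(axis=-1)
    snapped = FP4_E2M1_VALUES[idx] * sign
    return (snapped * scale[:, None]).reshape(x.shape)

x = np.array([0.1, 0.26, -0.9, 0.5, 2.5, -5.0, 6.0, 1.1], dtype=np.float32)
amax = np.array([1.0, 6.0])  # one amax per block of 4
y = fake_quant_fp4(x, block_size=4, amax=amax)
```

Values larger than the block's `amax` saturate to it (here `-0.9` in a block with `amax=1.0` becomes `-1.0`), which is why per-block `amax` calibration matters for FP4.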
- export_fp8(g, inputs, amax, trt_high_precision_dtype)
Export the quantized model to FP8 ONNX.
- Parameters:
g (GraphContext) –
inputs (Value) –
amax (float) –
trt_high_precision_dtype (str) –
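The FP8 path emits a `QuantizeLinear`/`DequantizeLinear` pair whose scale is derived from `amax`. A minimal NumPy sketch of the numeric effect, assuming the FP8 E4M3 format (max finite value 448) and modeling only the dynamic-range clipping that `amax` controls, not E4M3 mantissa rounding (`fake_quant_fp8` is a hypothetical helper, not part of this module):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def fake_quant_fp8(x, amax):
    """Hypothetical sketch of the FP8 QDQ pair: scale by amax / 448,
    clamp to the E4M3 range, and rescale."""
    scale = amax / E4M3_MAX
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX) * scale

y = fake_quant_fp8(np.array([1000.0, -3.0, 0.5]), amax=448.0)
```

With `amax=448.0` the scale is 1.0, so in-range values pass through unchanged while out-of-range values saturate at ±448.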
- export_fp8_mha(g, query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None, q_quantized_scale=1.0, k_quantized_scale=1.0, v_quantized_scale=1.0, high_precision_flag='Half', disable_fp8_mha=True)
Export the quantized fMHA to FP8 ONNX.
FP8 ONNX graph:

     Q        K        V
     |        |        |
     \        /        |
    QDQ      QDQ       |
     \        /        |
    Cast    Cast       |
      \      /         |
       BMM1            |
        \              |
       Cast           QDQ
          \            |
        SoftMax        |
           |           |
          QDQ          |
           \           |
         Cast         Cast
            \          /
              BMM2
               |
              Cast
- Parameters:
g (GraphContext) –
query (Value) –
key (Value) –
value (Value) –
attn_mask (Value | None) –
dropout_p (float) –
is_causal (bool) –
scale (Value | None) –
q_quantized_scale (float) –
k_quantized_scale (float) –
v_quantized_scale (float) –
high_precision_flag (str) –
disable_fp8_mha (bool) –
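The dataflow of the FP8 ONNX graph above can be followed in plain NumPy. This is a hypothetical reference (`fp8_mha_reference` is not part of this module): the QDQ nodes are modeled as range clipping only, the softmax-side `amax` of 1.0 is an assumption (softmax outputs lie in [0, 1]), and the Cast nodes (to `high_precision_flag`) have no visible effect in a float64 sketch:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fake_quant(x, amax):
    # Stand-in for one FP8 QDQ node pair: clip to the representable range.
    s = amax / E4M3_MAX
    return np.clip(x / s, -E4M3_MAX, E4M3_MAX) * s

def fp8_mha_reference(q, k, v, q_amax=1.0, k_amax=1.0, v_amax=1.0, scale=None):
    """Hypothetical sketch of the graph above: QDQ on Q/K/V, BMM1,
    QDQ, SoftMax, QDQ, then BMM2 against the quantized V branch."""
    if scale is None:
        scale = 1.0 / np.sqrt(q.shape[-1])
    q, k, v = fake_quant(q, q_amax), fake_quant(k, k_amax), fake_quant(v, v_amax)
    s = fake_quant(q @ k.transpose(0, 2, 1) * scale, 1.0)   # BMM1 + QDQ
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    p = fake_quant(e / e.sum(axis=-1, keepdims=True), 1.0)  # SoftMax + QDQ
    return p @ v                                            # BMM2

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 3, 4))
k = rng.standard_normal((1, 3, 4))
v = rng.standard_normal((1, 3, 4))
out = fp8_mha_reference(q, k, v)
```

Note that `attn_mask`, `dropout_p`, and `is_causal` are omitted from the sketch; they modify the BMM1 output before the softmax in the real export.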
- export_int8(g, inputs, amax, num_bits, unsigned, narrow_range, trt_high_precision_dtype)
Export the quantized model to INT8 ONNX.
- Parameters:
g (GraphContext) –
inputs (Value) –
amax (Tensor) –
num_bits (int) –
unsigned (bool) –
narrow_range (bool) –
trt_high_precision_dtype (str) –
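The `num_bits`, `unsigned`, and `narrow_range` parameters determine the integer bounds of the emitted `QuantizeLinear`/`DequantizeLinear` pair. A minimal NumPy sketch of those numerics (`fake_quant_int8` is a hypothetical helper, not this function, which emits ONNX nodes rather than computing values):

```python
import numpy as np

def fake_quant_int8(x, amax, num_bits=8, unsigned=False, narrow_range=True):
    """Hypothetical sketch of INT8 QDQ numerics:
    scale = amax / qmax, then round, clamp, and rescale."""
    if unsigned:
        qmin, qmax = 0, 2**num_bits - 1                 # e.g. [0, 255]
    else:
        qmax = 2 ** (num_bits - 1) - 1                  # e.g. 127
        qmin = -qmax if narrow_range else -qmax - 1     # [-127, 127] or [-128, 127]
    scale = amax / qmax
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

y = fake_quant_int8(np.array([0.5, -2.0, 1.0]), amax=1.0)
```

With `amax=1.0` and `narrow_range=True`, values round to the 127-step grid and anything beyond ±1.0 saturates, so `-2.0` comes back as `-1.0`.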