Add Triton version as a fused_moe_triton config search key to avoid performance regressions across different Triton versions (#5955)

Author: Xiaoyu Zhang
Date: 2025-06-07 17:43:50 +08:00
Committed by: GitHub
Parent: d5c097a2f9
Commit: 2a413829f4

158 changed files with 11 additions and 1 deletion


@@ -3,6 +3,9 @@ For different settings of
- E (number of experts)
- N (intermediate size)
- device_name (torch.cuda.get_device_name())
- dtype: The data type used by the fused MoE kernel for computation. Supported types include fp8_w8a8, int8_w8a8, int8_w8a16, int4_w4a16, etc. This determines the precision and quantization scheme for both weights and activations.
- block_shape: The block quantization shape introduced with the DeepSeek V3/R1 models. This parameter defines the granularity of block-wise quantization, typically specified as `[block_n, block_k]`, where `block_n` and `block_k` are the block dimensions. For example, DeepSeek V3 commonly uses a `[128, 128]` block shape for efficient block-wise FP8 quantization.
the JSON file contains a mapping from M (batch size) to the chosen configuration.
The example configurations provided are for the Mixtral model with TP2 on H100. Two illustrative sketches follow.
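
To make the mapping concrete, here is a hypothetical sketch of what one such config JSON could contain. The tuning field names mirror the usual Triton matmul launch parameters; the specific values are made up for illustration and are not taken from any shipped config file.

```python
# Hypothetical contents of a fused_moe_triton config JSON file:
# a mapping from M (batch size, serialized as a string key) to the
# Triton kernel launch parameters chosen for that batch size.
example_config = {
    "1": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3,
    },
    "64": {
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8,
        "num_warps": 8,
        "num_stages": 4,
    },
}
```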
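The core idea of this commit is that the Triton version joins the search keys listed above, so a config tuned under one Triton release is not silently reused under another, where the same kernel parameters may run slower. Below is a minimal sketch of how such a lookup key could be assembled, assuming a vLLM-style filename scheme; `get_moe_config_file_name` is a hypothetical helper, and the real SGLang function name and path format may differ.

```python
from typing import List, Optional

import torch
import triton


def get_moe_config_file_name(
    E: int,
    N: int,
    dtype: Optional[str],
    block_shape: Optional[List[int]],
) -> str:
    """Sketch of a config search key that includes the Triton version.

    NOTE: hypothetical helper for illustration only; the actual SGLang
    implementation may build the key differently.
    """
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    parts = [f"E={E}", f"N={N}", f"device_name={device_name}"]
    if dtype is not None:
        parts.append(f"dtype={dtype}")
    if block_shape is not None:
        parts.append(f"block_shape={block_shape}")
    # The change this commit introduces: key the config on the Triton
    # version as well, e.g. a "triton_3_1_0/" prefix when
    # triton.__version__ == "3.1.0".
    version_dir = "triton_" + triton.__version__.replace(".", "_")
    return f"{version_dir}/" + ",".join(parts) + ".json"
```

Calling this with, say, `E=8`, `N=7168`, `dtype="fp8_w8a8"`, and `block_shape=[128, 128]` on an H100 would yield something like `triton_3_1_0/E=8,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json`; again, the exact naming is an assumption for illustration.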
