## Tuning SGLang Infer System with AMD GPUs
This AppNote describes SGLang performance tuning techniques, the code harness, and the steps to run it on systems with AMD Instinct GPUs.
Harness code, examples, and steps are provided in detail so that results are easy to reproduce and the tuning can be adapted to specific workloads.
Four primary runtime areas are covered:

## 1. Triton Kernels
To maximize Triton kernel efficiency, several strategies can be employed:

### Key Tuning Parameters:
- **num_stages**: Adjusts the number of pipeline stages to optimize kernel efficiency based on the specific type of operations (e.g., General Matrix Multiplication - GEMM).
- **waves_per_eu**: Controls the usage of Vector General Purpose Registers (VGPR) to enhance occupancy, thereby improving latency or throughput.
- **BLOCK_M, BLOCK_N, BLOCK_K**: Tunable tile sizes that assist in balancing memory transfer and computational efficiency.
- **matrix_instr_nonkdim**: Selects the size of the Matrix Fused Multiply-Add (MFMA) instruction used, which can be tuned for specific kernel types such as Flash Attention.
- **OPTIMIZE_EPILOGUE**: An environment variable that can be set to `1` to enhance performance by eliminating the `convert_layout` operation in the kernel's epilogue.
```python
import triton

# `key` lists the kernel argument names whose values trigger re-tuning when they change.
@triton.autotune(configs=[
        triton.Config({'waves_per_eu': 1}, num_warps=4, num_stages=1),
        triton.Config({'waves_per_eu': 1}, num_warps=8, num_stages=1),
        triton.Config({'waves_per_eu': 1}, num_warps=16, num_stages=1),
        triton.Config({'waves_per_eu': 2}, num_warps=4, num_stages=1),
        triton.Config({'waves_per_eu': 2}, num_warps=8, num_stages=1),
        triton.Config({'waves_per_eu': 2}, num_warps=16, num_stages=1),
        triton.Config({'waves_per_eu': 4}, num_warps=4, num_stages=1),
        triton.Config({'waves_per_eu': 4}, num_warps=8, num_stages=1),
        triton.Config({'waves_per_eu': 4}, num_warps=16, num_stages=1),
    ], key=['BLOCK_N', 'NUM_TOKEN_BLKS'], use_cuda_graph=True)
@triton.jit
def _triton_kernel_function():
    ...
```
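
The nine entries above are the Cartesian product of three `waves_per_eu` values and three `num_warps` values. A minimal sketch of generating such a grid programmatically is shown below. The helper name `build_config_grid` is hypothetical, and it returns plain dictionaries so that it runs without Triton installed; each entry could then be passed to `triton.Config` as in the decorator above.

```python
from itertools import product


def build_config_grid(waves_per_eu=(1, 2, 4), num_warps=(4, 8, 16), num_stages=1):
    """Return one dict per (waves_per_eu, num_warps) pair (hypothetical helper)."""
    return [
        {"kwargs": {"waves_per_eu": w}, "num_warps": n, "num_stages": num_stages}
        for w, n in product(waves_per_eu, num_warps)
    ]


grid = build_config_grid()
print(len(grid))  # 9 configurations

# With Triton available, the grid maps onto triton.Config like this:
# configs = [
#     triton.Config(c["kwargs"], num_warps=c["num_warps"], num_stages=c["num_stages"])
#     for c in grid
# ]
```

Generating the list this way makes it easier to widen or narrow the sweep; keep in mind that every extra value multiplies the tuning time.
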
## 2. Torch Tunable Operations
**TunableOp** is a PyTorch feature that benchmarks the alternative implementations available for an operation (for example, a GEMM that can be served by more than one BLAS library) and selects the fastest one for each input shape. It is particularly useful for improving GEMM-heavy workloads without changing model code.

### Key Environment Variables:
1. **PYTORCH_TUNABLEOP_ENABLED**:
   - Default: `0`
   - Set to `1` to enable TunableOp.

2. **PYTORCH_TUNABLEOP_TUNING**:
   - Default: `1`
   - When left at `1` and `PYTORCH_TUNABLEOP_ENABLED` is `1`, an operation with no tuned entry is tuned on first use and the result is recorded.
   - Set to `0` to disable tuning and use only the entries that were already recorded.

3. **PYTORCH_TUNABLEOP_VERBOSE**:
   - Default: `0`
   - Set to `1` to enable verbose output for TunableOp.

### Usage Example:
To enable TunableOp and tuning, and optionally enable verbose mode, run one of the following commands in your terminal:

```bash
# Tuning
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 your_script.sh

# Inference with the tuned ops
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_script.sh

# Inference with the tuned ops and verbose logging
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_VERBOSE=1 your_script.sh
```
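
The two phases above (tune once, then run inference with the recorded results) can also be driven from a small Python launcher. The sketch below only builds the environment for each phase. The helper name `tunableop_env` is hypothetical, and `your_script.sh` is the same placeholder used above.

```python
import os


def tunableop_env(tuning, verbose=False, base=None):
    """Return a copy of the environment with the TunableOp variables set for one phase."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    env["PYTORCH_TUNABLEOP_TUNING"] = "1" if tuning else "0"
    env["PYTORCH_TUNABLEOP_VERBOSE"] = "1" if verbose else "0"
    return env


tuning_env = tunableop_env(tuning=True)
inference_env = tunableop_env(tuning=False, verbose=True)

# To launch a phase (placeholder script name):
# import subprocess
# subprocess.run(["bash", "your_script.sh"], env=tuning_env, check=True)
```

Passing the environment explicitly to the child process keeps the settings out of the parent shell, so a tuning run cannot accidentally leak into a later inference run.
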
## 3. Torch Compilation


The following are suggestions for optimizing matrix multiplication (GEMM) and convolution (conv) operations in PyTorch using Inductor, a part of the PyTorch compilation framework. The goal is to leverage Triton to achieve better performance.

To tune the Triton kernels that Inductor generates for GEMM and convolution ops, use `torch.compile` with the `max-autotune` mode. This mode benchmarks a predefined list of Triton configurations and selects the fastest one for each shape.

### Key Configurations:
1. **Max Autotune**:
   - Set `torch._inductor.config.max_autotune = True` or `TORCHINDUCTOR_MAX_AUTOTUNE=1`.

2. **Fine-Grained Control**:
   - Enable GEMM tuning: `torch._inductor.config.max_autotune_gemm = True`.
   - Enable tuning for pointwise/reduction ops: `torch._inductor.config.max_autotune_pointwise = True`.

3. **Backend Selection**:
   - Use `torch._inductor.config.max_autotune_gemm_backends` to limit the GEMM backends to `TRITON`, which may give better performance.

4. **Freezing for Inference**:
   - Use `torch._inductor.config.freezing=True` to enable constant folding optimizations.

5. **Debugging**:
   - Set `TORCH_COMPILE_DEBUG=1` to extract Triton kernels generated by Inductor.

### Example Code Block:
```bash
# GEMM tuning
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 your_script.sh

# GEMM tuning with the backend restricted to TRITON
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON your_script.sh

# Inference with freezing enabled, which can give a large improvement on AMD GPUs
TORCHINDUCTOR_FREEZING=1 your_script.sh
```
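
The same settings can be applied from Python before calling `torch.compile`. The sketch below is a minimal illustration that uses the configuration attributes listed above. The helper name `apply_inductor_tuning` is hypothetical, and `coordinate_descent_tuning` is assumed to be the attribute that corresponds to `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING`. The helper accepts any object with these attributes so that it can be exercised without PyTorch installed.

```python
from types import SimpleNamespace


def apply_inductor_tuning(cfg, triton_only=False, freezing=False):
    """Set the Inductor tuning options described above on a config object."""
    cfg.max_autotune = True
    cfg.coordinate_descent_tuning = True  # assumed attribute name, see lead-in
    if triton_only:
        cfg.max_autotune_gemm_backends = "TRITON"
    if freezing:
        cfg.freezing = True
    return cfg


# Stand-in object so the sketch runs without PyTorch.
demo_cfg = apply_inductor_tuning(SimpleNamespace(), triton_only=True, freezing=True)
print(vars(demo_cfg))

# With PyTorch installed:
# import torch
# import torch._inductor.config as inductor_config
# apply_inductor_tuning(inductor_config, triton_only=True, freezing=True)
# compiled_model = torch.compile(model)  # `model` is your own nn.Module
```
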
## 4. Fused MoE Kernel
To maximize MoE kernel efficiency, use the script below to find the best launch configuration.

### Key parameters:
- **--model**: The MoE model type to tune. It automatically determines `d_model`, `model_intermediate_size`, and `num_layers`.
- **--tp-size**: The tensor-parallel size of the full model run being simulated, so that the per-GPU dimension sizes are set correctly.
- **--batch**: The M dimension of the MoE kernel. For the prefill MoE kernel the value is `batch * input_len`; for the decode MoE kernel the value is `batch`.
- **--dtype**: The computation data type.

```bash
# Tuning
# Example: consider the following benchmark run:
#   python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quant fp
# It uses batch size 32, input length 1024, and output length 8.
# From the MoE kernel's point of view, the prefill "--batch" is 32*1024 = 32768
# and the decode "--batch" is 32*1 = 32 (only one output token is generated per request in each decode step).
# Tune the decode MoE kernel with:
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
# Tune the prefill MoE kernel with:
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
```
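
The relationship between the benchmark arguments and the `--batch` value can be written out as a small calculation. This is a sketch of the arithmetic described above; the function name `moe_batch_sizes` is hypothetical.

```python
def moe_batch_sizes(batch_size, input_len):
    """Return (prefill_M, decode_M) for the MoE kernel.

    Prefill processes every input token of every request at once;
    decode processes one new token per request per step.
    """
    prefill_m = batch_size * input_len
    decode_m = batch_size
    return prefill_m, decode_m


prefill_m, decode_m = moe_batch_sizes(batch_size=32, input_len=1024)
print(prefill_m, decode_m)  # 32768 32
```
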

## Reference

For more detailed information on tuning SGLang performance with AMD GPUs, please refer to the following link:

[ROCm Documentation: Triton Kernel Performance Optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#triton-kernel-performance-optimization)