Revert "fix some typos" (#6244)
3rdparty/amd/profiling/PROFILING.md
@@ -356,7 +356,7 @@ client.sh
 # Start profiling via API
 curl http://localhost:30000/start_profile -H "Content-Type: application/json"

-# Benchmark serving using SGLang with a random dataset and tokenizer
+# Benchmark serving using sglang with random dataset and tokenizer
 # Define the log file with a timestamp
 TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 LOGFILE="sglang_client_log_$TIMESTAMP.json"
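For context, a minimal client.sh-style sketch that ties the profiling and benchmarking steps from this hunk together, assuming an SGLang server is already running on localhost:30000; the `/stop_profile` endpoint and the `bench_serving` flags are assumptions, not lines taken from this file:

```bash
#!/bin/bash
# Define the log file with a timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOGFILE="sglang_client_log_$TIMESTAMP.json"

# Start profiling via API
curl http://localhost:30000/start_profile -H "Content-Type: application/json"

# Benchmark serving using sglang with random dataset and tokenizer
# (flags below are an assumed invocation, adjust to the actual client.sh)
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
    --num-prompts 64 --output-file "$LOGFILE"

# Stop profiling once the benchmark finishes (assumed endpoint)
curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
```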
3rdparty/amd/tuning/TUNING.md
@@ -93,21 +93,21 @@ TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDU
 #Inference with large improvement on AMD GPU
 TORCHINDUCTOR_FREEZING=1 your_script.sh
 ```
-## 4. Fused MoE kernel
-To maximize MoE kernel efficiency, need to use below scripts to find out the best launch configuration
+## 4. Fused MOE kernel
+To maximize moe kernel efficiency, need to use below scripts to find out the best launch configuration

 ### Key parameters:
-- **--model**: what MoE model type to do tuning, it will automatically decide the size of d_model, model_intermediate_size, num_layers
+- **--model**: what moe model type to do tuning, it will automatically decide the size of d_model, model_intermediate_size, num_layers
 - **--tp-size**: simulate the whole model run configuration to set the dimension size using tp correctly
-- **--batch**: M dimension size of MoE kernel, for prefill MoE kernel the value is batch*input_len, for decode MoE kernel the value is batch
+- **--batch**: M dimension size of moe kernel, for prefill moe kernel the value is batch*input_len, for decode moe kernel the value is batch
 - **--dtype**: computation type

 ```bash
 #Tuning
-#for example, we have one case like this "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8" to run, it defined batch-size 32 input length 1024 and output length 8, from "--batch" in MoE view point, the prefill batch is 32*1024 = 32768, the decode batch is 32*1(only one output token generated in each run).
-#so we can tune decode MoE use below command
+#for example, we have one case like this "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8" to run, it defined batch-size 32 input length 1024 and output length 8, from "--batch" in moe view point, the prefill batch is 32*1024 = 32768, the decode batch is 32*1(only one output token generated in each run).
+#so we can tune decode moe use below command
 python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
-# and use this command to tune prefill MoE
+# and use this command to tune prefill moe
 python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
 ```
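The batch arithmetic in this hunk decides which M sizes to tune; a short sketch, using the grok1 example from the diff (batch-size 32, input length 1024), that derives both values and runs the two tuning passes with the referenced benchmark_moe_rocm.py script:

```bash
# Derive the --batch values for benchmark_moe_rocm.py from the serving
# configuration used in the example above.
BATCH_SIZE=32
INPUT_LEN=1024

PREFILL_BATCH=$((BATCH_SIZE * INPUT_LEN))   # 32 * 1024 = 32768 tokens per prefill step
DECODE_BATCH=$BATCH_SIZE                    # one output token per sequence per decode step

# Tune the decode-shape MoE kernel (M = 32)
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "$DECODE_BATCH"

# Tune the prefill-shape MoE kernel (M = 32768)
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "$PREFILL_BATCH"
```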