[Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Needs to merge after PR #1322
According to the benchmark results below, this PR brings an approximately 1%
gain in total token throughput.
#### Before Improvement
Profiling
<img width="1147" alt="Screenshot 2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>
Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
--quantization ascend \
--served-model-name auto \
--trust-remote-code \
--distributed-executor-backend=mp \
--port 8006 \
-tp=16 \
--max-num-seqs 24 \
--max-model-len 32768 \
--max-num-batched-tokens 8192 \
--block-size 128 \
--no-enable-prefix-caching \
--additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
--gpu-memory-utilization 0.96
# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
--random-input-len 4096 \
--random-output-len 1536 \
--num-prompts 200 \
--ignore-eos \
--model auto \
--tokenizer /DeepSeek-R1-W8A8 \
--port 8006 \
--request-rate 1 \
--max-concurrency 24 \
--save-result \
--skip-initial-test \
--metric-percentiles "50,90,99"
```
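The `--additional-config` value above is a JSON string that must parse cleanly for the server to pick up the torchair graph settings. As a quick sanity check (a minimal sketch, using the exact string from the launch command), you can verify the flag this PR targets is enabled:

```python
import json

# The exact --additional-config value from the server launch command above.
additional_config = (
    '{"torchair_graph_config":{"enable_multistream_mla": true,'
    '"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},'
    '"ascend_scheduler_config":{"enabled":true},'
    '"expert_tensor_parallel_size":16}'
)

config = json.loads(additional_config)
# enable_multistream_mla is the switch exercised by this PR.
assert config["torchair_graph_config"]["enable_multistream_mla"] is True
print(config["torchair_graph_config"]["graph_batch_sizes"])  # [24]
```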
```
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 958.59
Total input tokens: 819200
Total generated tokens: 307200
Request throughput (req/s): 0.2086
Output token throughput (tok/s): 320.47
Total Token throughput (tok/s): 1175.05
---------------Time to First Token----------------
Mean TTFT (ms): 942.70
Median TTFT (ms): 713.87
P50 TTFT (ms): 713.87
P90 TTFT (ms): 1363.88
P99 TTFT (ms): 2008.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.96
Median TPOT (ms): 69.49
P50 TPOT (ms): 69.49
P90 TPOT (ms): 70.42
P99 TPOT (ms): 70.72
---------------Inter-token Latency----------------
Mean ITL (ms): 68.96
Median ITL (ms): 59.88
P50 ITL (ms): 59.88
P90 ITL (ms): 61.59
P99 ITL (ms): 68.82
==================================================
```
#### After Improvement
Profiling
<img width="1200" alt="Screenshot 2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>
Evaluation
```
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 948.08
Total input tokens: 819200
Total generated tokens: 307200
Request throughput (req/s): 0.2110
Output token throughput (tok/s): 324.02
Total Token throughput (tok/s): 1188.08
---------------Time to First Token----------------
Mean TTFT (ms): 1019.25
Median TTFT (ms): 714.63
P50 TTFT (ms): 714.63
P90 TTFT (ms): 1367.31
P99 TTFT (ms): 2661.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.14
Median TPOT (ms): 68.68
P50 TPOT (ms): 68.68
P90 TPOT (ms): 69.33
P99 TPOT (ms): 70.30
---------------Inter-token Latency----------------
Mean ITL (ms): 68.14
Median ITL (ms): 59.04
P50 ITL (ms): 59.04
P90 ITL (ms): 60.93
P99 ITL (ms): 66.89
==================================================
```
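The "approximately 1%" claim can be checked directly from the two benchmark tables above (a quick calculation, using only the reported numbers):

```python
# Total token throughput and mean TPOT, taken from the tables above.
before_tok_s, after_tok_s = 1175.05, 1188.08
before_tpot_ms, after_tpot_ms = 68.96, 68.14

throughput_gain = (after_tok_s / before_tok_s - 1) * 100
tpot_reduction = (before_tpot_ms - after_tpot_ms) / before_tpot_ms * 100

print(f"Total token throughput gain: {throughput_gain:.2f}%")  # ~1.11%
print(f"Mean TPOT reduction:         {tpot_reduction:.2f}%")   # ~1.19%
```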
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main: 65393ee064
Signed-off-by: ApsarasX <apsarax@outlook.com>