[MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204)
### What this PR does / why we need it?
Currently, we pad the last dimension of Q/K/V to 128 before flash attention (in
`AscendMMEncoderAttention`) to get better performance on Ascend NPU.
However, the Q/K/V padding is executed serially, which adds launch overhead for
`aclnnConstantPadNd` (three separate launches).
Since the three pads are mutually independent, we now stack Q/K/V first
and pad them in a single kernel launch. With this optimization, **TTFT**
is reduced by **3.15%** and **peak throughput** is increased by
**4.20%**.
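The idea can be sketched in a few lines of PyTorch. The shapes below (`b * s = 16`, `head = 8`, `head_dim = 80`, `MAX_PAD_SIZE = 128`) are illustrative placeholders, not values taken from the model; the sketch just checks that stacking, padding once, and unbinding produces the same tensors as three separate pads:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: [b * s, head, head_dim] padded up to MAX_PAD_SIZE.
MAX_PAD_SIZE = 128
q, k, v = (torch.randn(16, 8, 80) for _ in range(3))
pad_len = MAX_PAD_SIZE - q.shape[-1]

# Before: three independent pad launches.
q_ref = F.pad(q, (0, pad_len), mode="constant", value=0)
k_ref = F.pad(k, (0, pad_len), mode="constant", value=0)
v_ref = F.pad(v, (0, pad_len), mode="constant", value=0)

# After: stack once, pad once, unbind back into three tensors.
qkv = torch.stack([q, k, v], dim=0)                       # [3, b*s, head, head_dim]
qkv = F.pad(qkv, (0, pad_len), mode="constant", value=0)  # [3, b*s, head, MAX_PAD_SIZE]
q2, k2, v2 = qkv.unbind(dim=0)

assert torch.equal(q_ref, q2)
assert torch.equal(k_ref, k2)
assert torch.equal(v_ref, v2)
```

On CPU this is a pure refactor; the win on Ascend NPU comes from replacing three `aclnnConstantPadNd` launches with one.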
---
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Launch the server:
```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```
Run benchmark:
```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 1000 \
--no-stream
```
Before this PR:
```
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 122.33
Total input tokens: 66638
Total generated tokens: 122845
Request throughput (req/s): 8.17
Output token throughput (tok/s): 1004.18
Peak output token throughput (tok/s): 3073.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 1548.90
---------------Time to First Token----------------
Mean TTFT (ms): 51757.16
Median TTFT (ms): 44853.42
P99 TTFT (ms): 110700.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 226.06
Median TPOT (ms): 206.85
P99 TPOT (ms): 935.31
---------------Inter-token Latency----------------
Mean ITL (ms): 208.82
Median ITL (ms): 96.37
P99 ITL (ms): 2183.13
==================================================
```
After this PR:
```
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 121.47
Total input tokens: 66638
Total generated tokens: 122860
Request throughput (req/s): 8.23
Output token throughput (tok/s): 1011.47
Peak output token throughput (tok/s): 3202.00
Peak concurrent requests: 1000.00
Total token throughput (tok/s): 1560.08
---------------Time to First Token----------------
Mean TTFT (ms): 50125.08
Median TTFT (ms): 46270.85
P99 TTFT (ms): 108107.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 227.11
Median TPOT (ms): 205.13
P99 TPOT (ms): 816.08
---------------Inter-token Latency----------------
Mean ITL (ms): 204.60
Median ITL (ms): 92.66
P99 ITL (ms): 2219.02
==================================================
```
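The headline percentages come from the mean TTFT and peak output token throughput rows in the two tables above; a quick sanity check:

```python
# Mean TTFT (ms) and peak output token throughput (tok/s),
# copied from the benchmark results above.
ttft_before, ttft_after = 51757.16, 50125.08
peak_before, peak_after = 3073.00, 3202.00

ttft_reduction = (ttft_before - ttft_after) / ttft_before * 100
peak_gain = (peak_after - peak_before) / peak_before * 100

print(f"TTFT reduced by {ttft_reduction:.2f}%")   # 3.15%
print(f"Peak throughput up {peak_gain:.2f}%")     # 4.20%
```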
- vLLM version: v0.14.0
- vLLM main: d68209402d
Signed-off-by: shen-shanshan <467638484@qq.com>
```diff
@@ -106,10 +106,12 @@ class AscendMMEncoderAttention(MMEncoderAttention):
         if enable_pad:
             origin_shape = q.shape[-1]
             pad_len = MAX_PAD_SIZE - origin_shape
-            # q, k, v: [b * s, head, head_dim] -> [b * s, head, MAX_PAD_SIZE]
-            q = F.pad(q, (0, pad_len), mode="constant", value=0)
-            k = F.pad(k, (0, pad_len), mode="constant", value=0)
-            v = F.pad(v, (0, pad_len), mode="constant", value=0)
+            # Merge qkv to reduce the overhead of launching npu pad operation.
+            # [3, b*s, head, head_dim]
+            qkv = torch.stack([q, k, v], dim=0)
+            # qkv: [3, b * s, head, head_dim] -> [3, b * s, head, MAX_PAD_SIZE]
+            qkv = F.pad(qkv, (0, pad_len), mode="constant", value=0)
+            q, k, v = qkv.unbind(dim=0)

         context_layer = torch.empty_like(q)
```