xc-llm-ascend

Shanshan Shen 6b7d9b76f1 [MM][Perf] Pre-compute seq_lens and put it on CPU before ViT vision blocks for better performance (#7104)

### What this PR does / why we need it?
**Background:**

PR https://github.com/vllm-project/vllm-ascend/pull/6448 introduced a
`seq_lens` CPU cache mechanism that considerably improved performance
for VL models but could lead to accuracy issues, so we reverted it.

**Proposed Change:**

PR https://github.com/vllm-project/vllm/pull/36605 added support in vLLM
for custom processing logic for OOT (out-of-tree) MMEncoder kernels.
With it, we can pre-compute `seq_lens` (rather than `cu_seqlens`) once
and place it on the CPU before the ViT vision blocks run, which avoids
redundant computation.
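
The idea can be sketched in plain Python. This is a minimal illustration, not the PR's actual implementation: lists stand in for tensors, and the names `seq_lens_from_cu_seqlens` and `run_vision_blocks` are hypothetical. It assumes `cu_seqlens` holds cumulative offsets that start at 0, so the per-sequence lengths are the differences between neighbouring entries.

```python
from typing import Callable, List, Sequence

# A "block" here is any callable taking hidden states and host-side lengths.
Block = Callable[[List[float], List[int]], List[float]]


def seq_lens_from_cu_seqlens(cu_seqlens: Sequence[int]) -> List[int]:
    """Turn cumulative offsets [0, a, a+b, ...] into lengths [a, b, ...]."""
    return [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]


def run_vision_blocks(
    hidden: List[float],
    blocks: Sequence[Block],
    cu_seqlens: Sequence[int],
) -> List[float]:
    """Compute the lengths once, then hand the same object to every block."""
    # Hoisted out of the loop: no block has to re-derive the lengths.
    seq_lens_cpu = seq_lens_from_cu_seqlens(cu_seqlens)
    for block in blocks:
        hidden = block(hidden, seq_lens_cpu)
    return hidden
```

Here `seq_lens_from_cu_seqlens([0, 4, 10, 16])` yields `[4, 6, 6]`, and every block receives that same list rather than recomputing it.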

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

#### ✅ Functional Test

Run Qwen2.5-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```

Output:

```text
"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" having a slightly bolder and more prominent appearance compared to \"Qwen.\" The overall design is simple and professional."
```

> [!NOTE]
> Since PR https://github.com/vllm-project/vllm/pull/36605 only modified
> `Qwen3-VL` modeling files, this PR has no effect on the `Qwen2.5-VL` model.

---
Run Qwen3-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"image": 1}'
```

Output:

```text
"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with TONG."
```

---
#### ✅ Benchmark

Launch the server:

```
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Run benchmark:

```
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  78.58     
Total input tokens:                      33418     
Total generated tokens:                  61431     
Request throughput (req/s):              6.36      
Output token throughput (tok/s):         781.78    
Peak output token throughput (tok/s):    2475.00   
Peak concurrent requests:                383.00    
Total token throughput (tok/s):          1207.07   
---------------Time to First Token----------------
Mean TTFT (ms):                          7116.24   
Median TTFT (ms):                        4295.84   
P99 TTFT (ms):                           18370.87  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          245.78    
Median TPOT (ms):                        264.03    
P99 TPOT (ms):                           334.38    
---------------Inter-token Latency----------------
Mean ITL (ms):                           246.99    
Median ITL (ms):                         117.71    
P99 ITL (ms):                            1327.55   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  77.44     
Total input tokens:                      33418     
Total generated tokens:                  61522     
Request throughput (req/s):              6.46      
Output token throughput (tok/s):         794.40    
Peak output token throughput (tok/s):    2691.00   
Peak concurrent requests:                369.00    
Total token throughput (tok/s):          1225.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          6888.64   
Median TTFT (ms):                        4128.82   
P99 TTFT (ms):                           17487.94  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          240.14    
Median TPOT (ms):                        259.18    
P99 TPOT (ms):                           313.15    
---------------Inter-token Latency----------------
Mean ITL (ms):                           241.84    
Median ITL (ms):                         121.08    
P99 ITL (ms):                            1470.33   
==================================================
```

**Performance Metrics:**

| Metric | Before this PR | After this PR | Comparison |
| :----- | :------------- | :------------ | :--------- |
| **Throughput** | | | |
| Request throughput (req/s) | 6.36 | 6.46 | +1.57% ↑ |
| Output token throughput (tok/s) | 781.78 | 794.40 | +1.61% ↑ |
| Total token throughput (tok/s) | 1,207.07 | 1,225.91 | +1.56% ↑ |
| Peak output token throughput (tok/s) | 2,475 | 2,691 | +8.73% ↑ |
| **Latency** | | | |
| Benchmark duration (s) | 78.58 | 77.44 | -1.45% ↓ |
| Mean TTFT (ms) | 7,116.24 | 6,888.64 | -3.20% ↓ |
| Median TTFT (ms) | 4,295.84 | 4,128.82 | -3.89% ↓ |
| P99 TTFT (ms) | 18,370.87 | 17,487.94 | -4.81% ↓ |
| Mean TPOT (ms) | 245.78 | 240.14 | -2.29% ↓ |
| Median TPOT (ms) | 264.03 | 259.18 | -1.84% ↓ |
| P99 TPOT (ms) | 334.38 | 313.15 | -6.35% ↓ |
| Mean ITL (ms) | 246.99 | 241.84 | -2.09% ↓ |
| Median ITL (ms) | 117.71 | 121.08 | +2.86% ↑ |
| P99 ITL (ms) | 1,327.55 | 1,470.33 | +10.76% ↑ |
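
The Comparison column is consistent with a plain relative change, `(after - before) / before * 100`, rounded to two decimals. The helper below is a small sketch for re-deriving those figures from the raw benchmark output; the function names are illustrative and not part of vLLM.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def format_change(before: float, after: float) -> str:
    """Render the change in the table's style, e.g. '-6.35% ↓'."""
    change = pct_change(before, after)
    if change > 0:
        arrow = "↑"
    elif change < 0:
        arrow = "↓"
    else:
        arrow = "="
    return f"{change:+.2f}% {arrow}"


# Values taken from the benchmark runs above.
print(format_change(334.38, 313.15))    # P99 TPOT
print(format_change(1327.55, 1470.33))  # P99 ITL
```

Note that the arrow only shows the direction of the change: for latency metrics a downward arrow is an improvement, while for throughput metrics an upward arrow is.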

**🤖 AI Summary:**

- The most notable improvement is in P99 TPOT, which dropped **6.35%**
(334.38ms → 313.15ms), indicating reduced tail latency for per-token
generation under heavy load.
- TTFT improved across the board: mean dropped **3.20%** (7,116ms →
6,889ms), median **3.89%** (4,296ms → 4,129ms), and P99 **4.81%**
(18,371ms → 17,488ms).
- TPOT also improved consistently, with mean down **2.29%** (245.78ms →
240.14ms) and median down **1.84%** (264.03ms → 259.18ms), showing a
modest but steady reduction in per-token generation time.
- Throughput saw a slight uplift of roughly **+1.6%** across request,
output token, and total token throughput. Peak output token throughput
rose **+8.73%** (2,475 → 2,691 tok/s), suggesting better burst
handling capacity.
- P99 ITL increased **+10.76%** (1,328ms → 1,470ms), the largest
regression in the run. Median ITL also ticked up **+2.86%** (117.71ms →
121.08ms). These increases may reflect scheduling variability under
peak concurrency and could be within run-to-run noise, but are worth
monitoring.
- Overall, the PR improves throughput and most latency metrics. The
exception is inter-token latency, where the median and P99 regressed,
possibly a transient effect given that mean ITL still improved by
**2.09%**.

---
- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

2026-03-23 15:24:26 +08:00

fused_moe

[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024)

2026-03-20 23:23:57 +08:00

triton

[Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)

2026-03-22 23:09:23 +08:00

__init__.py

[OPS]add split_qkv_tp_rmsnorm_rope ops (#7376)

2026-03-19 17:19:18 +08:00

activation.py

[Attention] add gpt-oss support (#5901)