Support using the FIA op in prefill cache mode (#3696)

### What this PR does / why we need it?
Support using the FIA op in prefill cache mode for full graph.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: 17c540a993

origin
============ Serving Benchmark Result ============
Successful requests:                     30
Maximum request concurrency:             256
Request rate configured (RPS):           0.70
Benchmark duration (s):                  131.63
Total input tokens:                      61363
Total generated tokens:                  61440
Request throughput (req/s):              0.23
Output token throughput (tok/s):         466.77
Peak output token throughput (tok/s):    750.00
Peak concurrent requests:                30.00
Total Token throughput (tok/s):          932.95
---------------Time to First Token----------------
Mean TTFT (ms):                          125.17
Median TTFT (ms):                        121.51
P50 TTFT (ms):                           121.51
P90 TTFT (ms):                           140.91
P99 TTFT (ms):                           182.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.85
Median TPOT (ms):                        43.84
P50 TPOT (ms):                           43.84
P90 TPOT (ms):                           44.28
P99 TPOT (ms):                           44.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.85
Median ITL (ms):                         42.63
P50 ITL (ms):                            42.63
P90 ITL (ms):                            48.74
P99 ITL (ms):                            59.62
==================================================

after
============ Serving Benchmark Result ============
Successful requests:                     30
Maximum request concurrency:             256
Request rate configured (RPS):           0.70
Benchmark duration (s):                  130.10
Total input tokens:                      61363
Total generated tokens:                  61440
Request throughput (req/s):              0.23
Output token throughput (tok/s):         472.26
Peak output token throughput (tok/s):    750.00
Peak concurrent requests:                30.00
Total Token throughput (tok/s):          943.94
---------------Time to First Token----------------
Mean TTFT (ms):                          123.69
Median TTFT (ms):                        122.51
P50 TTFT (ms):                           122.51
P90 TTFT (ms):                           143.69
P99 TTFT (ms):                           165.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.07
Median TPOT (ms):                        43.13
P50 TPOT (ms):                           43.13
P90 TPOT (ms):                           43.50
P99 TPOT (ms):                           43.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.07
Median ITL (ms):                         41.81
P50 ITL (ms):                            41.81
P90 ITL (ms):                            48.11
P99 ITL (ms):                            62.13
==================================================
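The two runs above can be compared metric by metric. The following is a minimal sketch and not part of the patch: the `relative_change` helper and the `origin`/`after` dictionaries are names introduced here for illustration, and the numbers are copied from the benchmark output above. With only 30 requests per run, small differences may be within run-to-run noise.

```python
# Illustrative comparison of the two benchmark runs; not part of the patch.
# Values are copied from the "origin" and "after" results above.

def relative_change(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

origin = {
    "output_tok_per_s": 466.77,
    "mean_ttft_ms": 125.17,
    "p99_ttft_ms": 182.36,
    "mean_tpot_ms": 43.85,
    "p99_itl_ms": 59.62,
}
after = {
    "output_tok_per_s": 472.26,
    "mean_ttft_ms": 123.69,
    "p99_ttft_ms": 165.00,
    "mean_tpot_ms": 43.07,
    "p99_itl_ms": 62.13,
}

# Percentage change per metric (positive means the value went up).
changes = {name: relative_change(origin[name], after[name]) for name in origin}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

On these numbers, output throughput rises by roughly 1.2% and mean TPOT falls by roughly 1.8%, while P99 ITL rises by roughly 4.2%.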

Signed-off-by: shiyuan680 <917935075@qq.com>

Author: shiyuan680, committed by GitHub on 2025-10-27 19:41:07 +08:00

parent 3e5ae49160

commit 00aa0bf33e

3 changed files with 45 additions and 14 deletions

									
vllm_ascend/worker/model_runner_v1.py (8 lines changed)

@@ -962,8 +962,12 @@ class NPUModelRunner(LoRAModelRunnerMixin):
                 max_seq_len, self.dtype, self.device)
         # Prefill with cache hit.
         elif attn_state == AscendAttentionState.PrefillCacheHit:
-            return self.attn_mask_builder.get_attn_mask(
-                128, self.dtype, self.device)
+            if torch.version.cann.startswith("8.3"):
+                return self.attn_mask_builder.get_attn_mask(
+                    2048, self.dtype, self.device)
+            else:
+                return self.attn_mask_builder.get_attn_mask(
+                    128, self.dtype, self.device)
         # Decode-only situation.
         else:
             return None
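
The version gate in this hunk can be read as a small selection function. The sketch below is illustrative only: `prefill_cache_hit_mask_size` is a hypothetical helper, not a function in vllm-ascend, and it takes the CANN version as a plain string so that it runs without `torch` or an NPU. The mask sizes 2048 and 128 and the `"8.3"` prefix check come from the diff; the version strings passed in at the bottom are example inputs.

```python
# Hypothetical helper mirroring the branch in the diff; not vllm-ascend code.

def prefill_cache_hit_mask_size(cann_version: str) -> int:
    """Pick the attention-mask size used for prefill with a cache hit.

    The diff requests a 2048-sized mask when the CANN version string
    starts with "8.3" and keeps the previous 128-sized mask otherwise.
    """
    if cann_version.startswith("8.3"):
        return 2048
    return 128

# Example inputs (assumed version strings, for illustration only).
print(prefill_cache_hit_mask_size("8.3.RC1"))  # 2048
print(prefill_cache_hit_mask_size("8.2.RC1"))  # 128
```

Because the condition is a plain string-prefix check, any version string beginning with `8.3` takes the 2048 branch; that is a property of the condition in the diff, not something added by this sketch.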
