[perf][dsv3.2][async_scheduling] improve dsv3.2 performance by eliminating HD synchronization (#4805)

### What this PR does / why we need it?
This PR eliminates the implicit host-device (HD) synchronization in the sfa
backend, as well as in `_build_dummy_attn_metadata` and `dummy_run` in
`mtp_proposer`, significantly improving dsv3.2 performance in low-latency
scenarios.
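For illustration, a minimal sketch of the general pattern (tensor names below are illustrative, not the exact code in this PR): slice-assigning host data straight into a device tensor forces an implicit HD sync, while staging it in a pinned CPU buffer and copying with `non_blocking=True` lets the host keep scheduling work.

```python
import torch

# Illustrative sketch only; assumes a CUDA-like device (on Ascend NPUs the
# device string differs). Variable names are hypothetical.
device = torch.device("cuda")
num_reqs = 3
cu_num_tokens = [4, 9, 16]  # cumulative token counts computed on the host

# Before: slice-assigning host data into a device tensor blocks the host
# until the copy completes (an implicit host-device synchronization).
query_start_loc = torch.zeros(num_reqs + 1, dtype=torch.int64, device=device)
query_start_loc[1:num_reqs + 1] = torch.tensor(cu_num_tokens)

# After: fill a CPU-side buffer, pin it, and issue an asynchronous copy;
# the host returns immediately and the transfer overlaps with other work.
query_start_loc_cpu = torch.zeros(num_reqs + 1, dtype=torch.int64)
query_start_loc_cpu[1:num_reqs + 1] = torch.tensor(cu_num_tokens)
query_start_loc = query_start_loc_cpu.pin_memory().to(device, non_blocking=True)
```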
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Performance improvements were observed in E2E serving (prefill: DP4TP8EP32,
decode: DP8TP4EP32) with `num_speculative_tokens=3`.

DSV3.2-W8A8-EXP:
TPOT: 41.67ms -> 23.36ms
ITL: 85.93ms -> 55.96ms

DSV3.2-W8A8 (released in December):
TPOT: 18.11ms
ITL: 56.13ms

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: linfeng-yuan <1102311262@qq.com>
linfeng-yuan committed (via GitHub) on 2025-12-10 22:31:47 +08:00
parent dd622aa6a6 · commit 490ddf536f
3 changed files with 16 additions and 7 deletions

@@ -2923,9 +2923,10 @@ class NPUModelRunner(LoRAModelRunnerMixin, ECConnectorModelRunnerMixin):
         cu_num_tokens, arange = self._get_cumsum_and_arange(
             num_scheduled_tokens)
-        self.query_start_loc[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
+        self.query_start_loc_cpu[1:num_reqs +
+                                 1] = torch.Tensor(cu_num_tokens)
         self.query_start_loc = self.query_start_loc_cpu.pin_memory().to(
             self.device, non_blocking=True)
         self.query_lens = torch.from_numpy(num_scheduled_tokens)
         self.attn_mask = self.attn_mask_builder.get_splitfuse_attn_mask()
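A rough way to see the host-side effect of this pattern is the micro-benchmark below (illustrative, not part of the PR; it assumes a CUDA device and measures host wall-clock time only): a non-blocking copy from pinned memory returns almost immediately, while a blocking copy stalls the host for the full transfer.

```python
import time
import torch

device = torch.device("cuda")  # illustrative; the PR itself targets Ascend NPUs
src = torch.randn(1 << 24)     # ~64 MB of float32 data staged on the host

t0 = time.perf_counter()
_ = src.to(device)             # blocking: host waits for the HD transfer
blocking = time.perf_counter() - t0

pinned = src.pin_memory()
t0 = time.perf_counter()
dst = pinned.to(device, non_blocking=True)  # returns before the copy finishes
nonblocking = time.perf_counter() - t0
torch.cuda.synchronize()       # synchronize only when the result is needed

print(f"host-side cost: blocking {blocking * 1e3:.2f} ms, "
      f"non-blocking {nonblocking * 1e3:.3f} ms")
```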