[perf][dsv3.2][async_scheduling] improve dsv3.2 performance by eliminating HD synchronization (#4805)

### What this PR does / why we need it?
This PR eliminates the implicit host-device (HD) synchronization in the sfa
backend, as well as in `_build_dummy_attn_metadata` and `dummy_run` in
`mtp_proposer`, significantly improving dsv3.2 performance in low-latency
scenarios.
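For illustration, a minimal sketch of the general pattern (tensor names below are illustrative, not the exact code in this PR): slice-assigning host data straight into a device tensor forces an implicit HD sync, while staging it in a pinned CPU buffer and copying with `non_blocking=True` lets the host keep scheduling work.

```python
import torch

# Illustrative sketch only; assumes a CUDA-like device (on Ascend NPUs the
# device string differs). Variable names are hypothetical.
device = torch.device("cuda")
num_reqs = 3
cu_num_tokens = [4, 9, 16]  # cumulative token counts computed on the host

# Before: slice-assigning host data into a device tensor blocks the host
# until the copy completes (an implicit host-device synchronization).
query_start_loc = torch.zeros(num_reqs + 1, dtype=torch.int64, device=device)
query_start_loc[1:num_reqs + 1] = torch.tensor(cu_num_tokens)

# After: fill a CPU-side buffer, pin it, and issue an asynchronous copy;
# the host returns immediately and the transfer overlaps with other work.
query_start_loc_cpu = torch.zeros(num_reqs + 1, dtype=torch.int64)
query_start_loc_cpu[1:num_reqs + 1] = torch.tensor(cu_num_tokens)
query_start_loc = query_start_loc_cpu.pin_memory().to(device, non_blocking=True)
```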
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Performance improvements were observed in E2E serving (prefill: DP4TP8EP32,
decode: DP8TP4EP32) with `num_speculative_tokens=3`.

DSV3.2-W8A8-EXP:
TPOT: 41.67ms -> 23.36ms
ITL: 85.93ms -> 55.96ms

DSV3.2-W8A8 (released in December):
TPOT: 18.11ms
ITL: 56.13ms

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: linfeng-yuan <1102311262@qq.com>
linfeng-yuan committed (via GitHub) on 2025-12-10 22:31:47 +08:00
parent dd622aa6a6 · commit 490ddf536f
3 changed files with 16 additions and 7 deletions

@@ -2923,9 +2923,10 @@ class NPUModelRunner(LoRAModelRunnerMixin, ECConnectorModelRunnerMixin):
         cu_num_tokens, arange = self._get_cumsum_and_arange(
             num_scheduled_tokens)
-        self.query_start_loc[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
+        self.query_start_loc_cpu[1:num_reqs +
+                                 1] = torch.Tensor(cu_num_tokens)
         self.query_start_loc = self.query_start_loc_cpu.pin_memory().to(
             self.device, non_blocking=True)
         self.query_lens = torch.from_numpy(num_scheduled_tokens)
         self.attn_mask = self.attn_mask_builder.get_splitfuse_attn_mask()
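A rough way to see the host-side effect of this pattern is the micro-benchmark below (illustrative, not part of the PR; it assumes a CUDA device and measures host wall-clock time only): a non-blocking copy from pinned memory returns almost immediately, while a blocking copy stalls the host for the full transfer.

```python
import time
import torch

device = torch.device("cuda")  # illustrative; the PR itself targets Ascend NPUs
src = torch.randn(1 << 24)     # ~64 MB of float32 data staged on the host

t0 = time.perf_counter()
_ = src.to(device)             # blocking: host waits for the HD transfer
blocking = time.perf_counter() - t0

pinned = src.pin_memory()
t0 = time.perf_counter()
dst = pinned.to(device, non_blocking=True)  # returns before the copy finishes
nonblocking = time.perf_counter() - t0
torch.cuda.synchronize()       # synchronize only when the result is needed

print(f"host-side cost: blocking {blocking * 1e3:.2f} ms, "
      f"non-blocking {nonblocking * 1e3:.3f} ms")
```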