[Perf] Remove D2H operations to improve performance (#4063)

### What this PR does / why we need it?
Replace the masked in-place assignment with a device-side `torch.where` so that the selection stays on-device. This lets subsequent device ops be enqueued earlier and removes an implicit D2H sync, reducing latency by several hundred μs on Ascend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0
- vLLM main: 83f478bb19

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
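
The equivalence of the two forms can be checked with a minimal sketch like the one below. It is an illustration only, not the code path in `rejection_sampler.py`: the helper names `masked_assign` and `where_assign`, the `PLACEHOLDER` value of `-1`, and the sample tensors are assumptions made for the example, and it runs on CPU, where no D2H sync is involved. The likely reason the masked form synchronizes on a device is that boolean-mask indexing produces a result whose size depends on how many mask entries are true, which the host has to learn before it can continue; `torch.where` always produces a fixed-shape result.

```python
import torch

PLACEHOLDER = -1  # hypothetical fill value for slots that receive no token


def masked_assign(output_token_ids, draft_token_ids, target_argmax, bonus_token_ids):
    """Old form: boolean-mask indexing on both sides of the assignment."""
    out = output_token_ids.clone()
    accept_req_mask = draft_token_ids == target_argmax
    out[:, 0] = target_argmax
    bonus = bonus_token_ids.squeeze(1)
    # The number of selected rows is data-dependent.
    out[accept_req_mask, 1] = bonus[accept_req_mask]
    return out


def where_assign(output_token_ids, draft_token_ids, target_argmax, bonus_token_ids):
    """New form: elementwise select with a fixed output shape."""
    out = output_token_ids.clone()
    accept_req_mask = draft_token_ids == target_argmax
    out[:, 0] = target_argmax
    bonus = bonus_token_ids.squeeze(1)
    # Take the bonus token where the draft was accepted, else keep column 1.
    out[:, 1] = torch.where(accept_req_mask, bonus, out[:, 1])
    return out


# Four requests; the drafts for requests 0 and 2 match the target argmax.
draft = torch.tensor([5, 7, 9, 2])
target = torch.tensor([5, 8, 9, 3])
bonus = torch.tensor([[11], [12], [13], [14]])
output = torch.full((4, 2), PLACEHOLDER, dtype=torch.int64)

expected = masked_assign(output, draft, target, bonus)
actual = where_assign(output, draft, target, bonus)
print(torch.equal(expected, actual))
```

Both helpers write the bonus token only into rows whose draft token was accepted and leave column 1 of the rejected rows at its previous value, so the change is intended to be behavior-preserving.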
2025-11-12 09:08:55 +08:00
parent e38fe92f40
commit 638dbcdb32
1 changed files with 2 additions and 1 deletions
--- a/vllm_ascend/sample/rejection_sampler.py
+++ b/vllm_ascend/sample/rejection_sampler.py
@@ -320,7 +320,8 @@ def rejection_greedy_sample_spec_len_1_pytorch(
    accept_req_mask = draft_token_ids == target_argmax
    output_token_ids[:, 0] = target_argmax
    bonus_token_ids = bonus_token_ids.squeeze(1)
-    output_token_ids[accept_req_mask, 1] = bonus_token_ids[accept_req_mask]
+    output_token_ids[:, 1] = torch.where(accept_req_mask, bonus_token_ids,
+                                         output_token_ids[:, 1])


 def rejection_greedy_sample_pytorch(