From 638dbcdb32754ce76a51c20b0ff39fb9de272506 Mon Sep 17 00:00:00 2001
From: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Date: Wed, 12 Nov 2025 09:08:55 +0800
Subject: [PATCH] [Perf] Remove D2H operations to improve performance (#4063)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What this PR does / why we need it?
Replace masked in-place assignment with a device-side torch.where so
selection stays on-device, allowing subsequent device ops to be enqueued
earlier and removing an implicit D2H sync, reducing latency by several
hundred μs on Ascend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0
- vLLM main:
  https://github.com/vllm-project/vllm/commit/83f478bb19489b41e9d208b47b4bb5a95ac171ac

Signed-off-by: Yizhou Liu
---
 vllm_ascend/sample/rejection_sampler.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/vllm_ascend/sample/rejection_sampler.py b/vllm_ascend/sample/rejection_sampler.py
index 991e07fb..0bf8b6bf 100644
--- a/vllm_ascend/sample/rejection_sampler.py
+++ b/vllm_ascend/sample/rejection_sampler.py
@@ -320,7 +320,8 @@ def rejection_greedy_sample_spec_len_1_pytorch(
     accept_req_mask = draft_token_ids == target_argmax
     output_token_ids[:, 0] = target_argmax
     bonus_token_ids = bonus_token_ids.squeeze(1)
-    output_token_ids[accept_req_mask, 1] = bonus_token_ids[accept_req_mask]
+    output_token_ids[:, 1] = torch.where(accept_req_mask, bonus_token_ids,
+                                         output_token_ids[:, 1])
 
 
 def rejection_greedy_sample_pytorch(