ZongYuan Zhan
d8e15dae6c
Optimize some rejectsampler functions to make npu op launch non-blocking (#4587)
### What this PR does / why we need it?
- Vectorize the loops (without changing the output) in several rejection
sampler functions, including `expand_pytorch`, `sample_recovered_tokens_pytorch`,
`rejection_random_sample_pytorch`, and `sample_recovered_tokens`.
- Remove the `torch_npu` operators in these functions that force a synchronous
launch, so NPU op launches stay non-blocking and sampling + MTP
postprocessing run faster.
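
As an illustration of the kind of loop vectorization described above, here is a minimal sketch of an "expand" step that maps one value per request onto every token of that request. It is not the PR's code: the function names `expand_loop` and `expand_vectorized`, the argument layout, and the use of NumPy instead of NPU tensors are all assumptions made so the example runs anywhere.

```python
import numpy as np


def expand_loop(x, cu_num_tokens):
    """Reference version: one slice assignment per request (hypothetical helper)."""
    out = np.empty(int(cu_num_tokens[-1]), dtype=x.dtype)
    start = 0
    for i, end in enumerate(cu_num_tokens):
        out[start:end] = x[i]  # copy request i's value to all of its tokens
        start = end
    return out


def expand_vectorized(x, cu_num_tokens, num_tokens):
    """Same result without a Python loop (hypothetical helper).

    cu_num_tokens holds cumulative token counts per request, so the request
    that owns token position p is the number of boundaries that are <= p.
    """
    positions = np.arange(num_tokens)
    req_idx = np.searchsorted(cu_num_tokens, positions, side="right")
    return x[req_idx]


x = np.array([10, 20, 30])          # one value per request
cu_num_tokens = np.array([2, 5, 6])  # requests own 2, 3 and 1 tokens

assert np.array_equal(
    expand_loop(x, cu_num_tokens),
    expand_vectorized(x, cu_num_tokens, 6),
)
print(expand_vectorized(x, cu_num_tokens, 6))  # [10 10 20 20 20 30]
```

In this sketch `num_tokens` is passed in as a plain integer rather than read from `cu_num_tokens[-1]`. With device tensors, reading a scalar back to the host generally forces a synchronization, so keeping the size on the host is one way a vectorized version can avoid blocking the launch queue.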
### Does this PR introduce _any_ user-facing change?
- No
### How was this patch tested?
- We tested this change with the following serve and bench commands:
```
# ===== serve =====
vllm serve $LOCAL_CKPT_DIR \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-start-rank $((2*VC_TASK_INDEX)) \
--data-parallel-rpc-port 13387 \
--tensor-parallel-size 8 \
--seed 1024 \
--enable-expert-parallel \
--served-model-name $NAME \
--max-model-len 4096 \
--max-num-seqs 16 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
$headless \
--speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true}}'
# ===== bench =====
vllm bench serve --model $LOCAL_CKPT_DIR --served-model-name DeepseekV3ForCausalLM \
--dataset-name spec_bench --spec-bench-output-len 2048 \
--dataset-path question.jsonl \
--top-p 1.0 --temperature 0.8 \
--ignore-eos \
--num-prompts 64 --trust-remote-code --base-url "http://0.0.0.0:8000" --request-rate 64
```
- In this case, the rejection sampler optimization reduces TPOT from 84.94ms
to 64.61ms, a reduction of about 24%.
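
For reference, the relative gain can be recomputed from the two TPOT figures quoted above. This is only a small arithmetic check; the variable names are illustrative.

```python
tpot_before_ms = 84.94  # TPOT before the change, from the benchmark above
tpot_after_ms = 64.61   # TPOT after the change

# Fraction of per-token latency removed, and the equivalent speedup factor.
reduction = (tpot_before_ms - tpot_after_ms) / tpot_before_ms
speedup = tpot_before_ms / tpot_after_ms

print(f"TPOT reduction: {reduction:.1%}")      # TPOT reduction: 23.9%
print(f"Per-token speedup: {speedup:.2f}x")    # Per-token speedup: 1.31x
```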
## before
<img width="1068" height="830" alt="image"
src="https://github.com/user-attachments/assets/278ac878-b49d-4588-b87c-316ca4d537f5"
/>
## after
<img width="781" height="756" alt="image"
src="https://github.com/user-attachments/assets/0c6d37ad-ed77-40b3-a1be-4933c468365c"
/>
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: ZongYuan Zhan <zhanzy178@gmail.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
2025-12-29 14:10:39 +08:00