xc-llm-ascend

Files

FuNanyang 8bb028424b Fixed the performance degradation issue in post-processing in speculative decoding scenarios. (#4849)

### What this PR does / why we need it?
When speculative decoding is enabled and temperature > 0, bonus_logits
and target_logits are sampled separately:
1. bonus_logits are sampled inside the main sampler, which invokes the
fused torch_npu.npu_top_k_top_p operator.
2. target_logits are sampled inside the rejection sampler, which uses a
less-optimized implementation composed of smaller operators.
As a result, the cumsum operation in top-p sampling for target_logits
is especially time-consuming and degrades post-processing performance.
<img width="1029" height="623" alt="image"
src="https://github.com/user-attachments/assets/1969f561-6aa5-41b3-9a87-1f64d4321cbf"
/>

This PR applies the fused operator to the sampling of target_logits as
well, which reduces that overhead.
<img width="1039" height="572" alt="image"
src="https://github.com/user-attachments/assets/1e6563da-3418-405d-b657-7bbe10dd0924"
/>
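
For readers unfamiliar with where the cumsum comes from, the following is a minimal, framework-free sketch of the sort-and-cumsum approach to top-p filtering. It is not the code from rejection_sampler.py and does not call torch_npu; the function name `top_p_filter` and the plain-list representation are assumptions made for illustration only. A fused operator performs the equivalent masking in a single kernel instead of separate sort, softmax, and cumsum steps.

```python
import math


def top_p_filter(logits, top_p):
    """Reference (unfused) top-p filter: masks tokens outside the nucleus with -inf.

    Hypothetical helper for illustration; not part of vLLM or torch_npu.
    """
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort token indices by ascending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i])

    # Running cumulative sum over the sorted probabilities. On real
    # vocabularies this is the step that becomes costly when it runs
    # as a separate small operator.
    filtered = list(logits)
    cumulative = 0.0
    last_rank = len(order) - 1
    for rank, idx in enumerate(order):
        cumulative += probs[idx]
        # Mask tokens whose cumulative mass stays within 1 - top_p,
        # but always keep the single most likely token.
        if cumulative <= 1.0 - top_p and rank < last_rank:
            filtered[idx] = -math.inf
    return filtered
```

With `top_p=1.0` nothing is masked, and smaller values progressively remove the low-probability tail before sampling.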

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: funanyang <985619145@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>

2025-12-10 20:32:44 +08:00

logits_processor

[performance] Enhance performance after enabling min_p (#4529)

2025-12-02 20:35:51 +08:00

__init__.py

Spec decode support for V1 Engine (#874)

2025-05-23 14:25:46 +08:00

rejection_sampler.py

Fixed the performance degradation issue in post-processing in speculative decoding scenarios. (#4849)

2025-12-10 20:32:44 +08:00

sampler.py

[refact] unified soc_version code (#4359)

2025-11-26 14:28:55 +08:00