xc-llm-ascend

Files

whx a7f91079b8 [BugFix][Triton] Fix ub overflow bug of sample_recover_tokens_kernel (#4673)

### What this PR does / why we need it?
The original `sample_recover_tokens_kernel` of the rejection sampler didn't tile
the vocab size dim, which causes a UB overflow for models
with a large vocab size such as DeepSeek. This PR adds tiling along the vocab size
dim to avoid this problem.

Note that for now we just use an empirical `SUB_BLOCK_SIZE` of `4*1024`,
chosen for functionality. If this kernel becomes a performance
bottleneck in the future, we can use Triton autotune to optimize it. In addition,
we have to disable multibuffer for this kernel due to some accuracy
issues.
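
To illustrate the tiling idea, here is a minimal pure-Python sketch, not the actual Triton kernel. It assumes the per-token work reduces to an argmax over the vocab dim; `tiled_argmax` and its arguments are hypothetical names used only for this example. Instead of holding the whole vocab row at once, it scans `SUB_BLOCK_SIZE` elements at a time and carries a running best value and index across tiles.

```python
SUB_BLOCK_SIZE = 4 * 1024  # empirical tile size quoted in this PR description


def tiled_argmax(scores, sub_block_size=SUB_BLOCK_SIZE):
    """Return the index of the largest element of `scores`.

    Hypothetical helper: only one tile of at most `sub_block_size`
    elements is examined at a time, mimicking a kernel that cannot
    fit the full vocab dim in on-chip memory.
    """
    best_val = float("-inf")
    best_idx = -1  # stays -1 if `scores` is empty
    for start in range(0, len(scores), sub_block_size):
        # Slicing clamps at the end, so the last tile may be shorter.
        tile = scores[start:start + sub_block_size]
        local_idx = max(range(len(tile)), key=tile.__getitem__)
        # Strict '>' keeps the earliest index when values tie across tiles.
        if tile[local_idx] > best_val:
            best_val = tile[local_idx]
            best_idx = start + local_idx
    return best_idx
```

The result should match a single-pass argmax for any tile size, which is what makes the tile size a pure tuning knob that something like autotune could choose later.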

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>

2025-12-05 15:16:19 +08:00

logits_processor

[performance] Enhance performance after enabling min_p (#4529)

2025-12-02 20:35:51 +08:00

__init__.py

Spec decode support for V1 Engine (#874)

2025-05-23 14:25:46 +08:00

rejection_sampler.py

[BugFix][Triton] Fix ub overflow bug of sample_recover_tokens_kernel (#4673)

2025-12-05 15:16:19 +08:00

sampler.py

[refact] unified soc_version code (#4359)

2025-11-26 14:28:55 +08:00