[v0.18.0][kernel] Recompilation optimization triggered by triton function parameter optimization (#7647)

### What this PR does / why we need it?

Some parameters of the Triton kernels are unnecessarily declared as `constexpr`. Triton compiles a separate kernel for each distinct value of a `constexpr` argument, so every time one of these parameters changes, the kernel is recompiled, which significantly degrades model performance. This PR corrects how these parameters are declared. For example, `vec_len` is added to the `do_not_specialize` list of `rejection_greedy_sample_triton` and `expand_kernel`.

- vLLM version: v0.17.0
- vLLM main: 8b6325758c

Signed-off-by: HarpSealCC <844291270@qq.com>
Signed-off-by: l30072083 <liuchengzhuo1@h-partners.com>
Co-authored-by: l30072083 <liuchengzhuo1@h-partners.com>
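The following toy model shows why `constexpr` parameters and default integer specialization cause recompilation, and why `do_not_specialize` avoids it. It is a pure-Python sketch built on these assumptions: every distinct `constexpr` value produces a separate binary, and by default Triton specializes integer arguments on whether they equal 1 or are divisible by 16. `ToyJITCache` and `specialization_key` are hypothetical names and are not Triton APIs. The real cache key also includes argument dtypes, pointer alignment, and compile options, and triton-ascend may behave differently.

```python
from typing import Any, Dict, Iterable, Tuple


def specialization_key(
    args: Dict[str, Any],
    constexprs: Iterable[str] = (),
    do_not_specialize: Iterable[str] = (),
) -> Tuple:
    """Toy model of a Triton-style JIT cache key (an assumption, not Triton's code)."""
    constexprs, skip = set(constexprs), set(do_not_specialize)
    key = []
    for name in sorted(args):
        value = args[name]
        if name in constexprs:
            # Every distinct constexpr value is baked into the compiled binary.
            key.append((name, "constexpr", value))
        elif isinstance(value, int) and name not in skip:
            # Assumed default integer specialization: ==1, divisible by 16, or neither.
            spec = "one" if value == 1 else "div16" if value % 16 == 0 else "none"
            key.append((name, "int", spec))
        else:
            # Unspecialized runtime argument: only its type enters the key.
            key.append((name, type(value).__name__))
    return tuple(key)


class ToyJITCache:
    """Hypothetical stand-in for a JIT kernel cache that counts compilations."""

    def __init__(self, constexprs=(), do_not_specialize=()):
        self.constexprs = tuple(constexprs)
        self.do_not_specialize = tuple(do_not_specialize)
        self.compiled: Dict[Tuple, str] = {}
        self.compile_count = 0

    def launch(self, **args) -> str:
        key = specialization_key(args, self.constexprs, self.do_not_specialize)
        if key not in self.compiled:
            # A cache miss stands in for an expensive kernel (re)compilation.
            self.compile_count += 1
            self.compiled[key] = f"binary#{self.compile_count}"
        return self.compiled[key]


# Launch the same kernel with a varying vec_len, as happens across batches.
lengths = [7, 9, 16, 32, 33]
as_constexpr = ToyJITCache(constexprs=["vec_len"])
specialized = ToyJITCache()
unspecialized = ToyJITCache(do_not_specialize=["vec_len"])
for n in lengths:
    for cache in (as_constexpr, specialized, unspecialized):
        cache.launch(vec_len=n, max_spec_len=4)

print(as_constexpr.compile_count, specialized.compile_count, unspecialized.compile_count)
# prints: 5 2 1
```

In this model, a `constexpr` `vec_len` recompiles for every new value. Default specialization still recompiles whenever the "divisible by 16" bucket changes. Listing `vec_len` in `do_not_specialize` keeps a single binary, at the cost of any optimizations that specialization would have enabled.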
2026-03-26 19:10:45 +08:00
parent d781902ce9
commit d6661c09b6
5 changed files with 20 additions and 34 deletions
--- a/vllm_ascend/ops/triton/reject_sample.py
+++ b/vllm_ascend/ops/triton/reject_sample.py
@@ -82,7 +82,7 @@ def bonus_renew(
    tl.store(output_token_ids_ptr + position * (max_spec_len + 1) + num_tokens1, bonus_token_id)


-@triton.jit(do_not_specialize=["max_spec_len"])
+@triton.jit(do_not_specialize=["vec_len", "max_spec_len"])
 def rejection_greedy_sample_triton(
    output_token_ids_ptr,  # [batch_size, max_spec_len + 1]
    cu_num_draft_tokens_ptr,  # [batch_size]
@@ -196,7 +196,7 @@ def rejection_random_sample_kernel(
                )


-@triton.jit(do_not_specialize=["replace_from", "replace_to"])
+@triton.jit(do_not_specialize=["replace_from", "replace_to", "vec_len"])
 def expand_kernel(
    output_ptr,  # [num_tokens]
    input_ptr,  # [batch_size]