### What this PR does / why we need it?
This PR introduces optimized Triton implementations of
rejection_greedy_sample_kernel and expand_kernel that outperform the
existing Triton implementations. The new kernels maintain functional
accuracy while improving performance across a range of batch sizes and
MTP configurations.
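For reviewers checking functional accuracy, the following is a minimal pure-Python sketch of the greedy rejection semantics an optimized kernel would be expected to preserve. It is not the Triton code from this PR. The function name `greedy_rejection_reference`, the list-based inputs, and the `PLACEHOLDER` value of -1 are assumptions made for illustration.

```python
# Hypothetical reference for greedy rejection sampling; names and the
# placeholder value are illustrative assumptions, not taken from this PR.
PLACEHOLDER = -1  # assumed fill value for unused output slots


def greedy_rejection_reference(draft_tokens, target_argmax, bonus_tokens):
    """Return one output row per request.

    draft_tokens:  per-request lists of draft token ids
    target_argmax: per-request lists of the target model's argmax token
                   at each draft position (same lengths as draft_tokens)
    bonus_tokens:  one bonus token id per request
    """
    max_spec_len = max((len(d) for d in draft_tokens), default=0)
    output = []
    for drafts, targets, bonus in zip(draft_tokens, target_argmax, bonus_tokens):
        row = [PLACEHOLDER] * (max_spec_len + 1)
        rejected = False
        for pos, (draft, target) in enumerate(zip(drafts, targets)):
            # The target's token is always emitted at this position.
            row[pos] = target
            if draft != target:
                # First mismatch: stop accepting further draft tokens.
                rejected = True
                break
        if not rejected:
            # Every draft token matched, so the bonus token is appended.
            row[len(drafts)] = bonus
        output.append(row)
    return output


if __name__ == "__main__":
    print(greedy_rejection_reference([[1, 2, 3], [4, 5]],
                                     [[1, 9, 3], [4, 5]],
                                     [7, 8]))
```

Comparing a kernel's output against a reference like this over several batch sizes and speculative lengths is one way to confirm that an optimization does not change results.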
### Does this PR introduce _any_ user-facing change?
Yes, this PR modifies rejection_sampler.py to use the optimized Triton
kernels:
- rejection_greedy_sample_kernel is replaced by two implementations,
rejection_greedy_sample_spec_len_1_triton and
rejection_greedy_sample_triton
- expand_kernel receives a performance-optimized Triton version

These changes improve performance while maintaining backward
compatibility.
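As a point of comparison for the expand step, here is a hedged pure-Python sketch of what an expand operation of this kind typically computes: each per-request value is repeated once per token of that request, with one sentinel value substituted along the way. The name `expand_reference` and the parameters `cu_num_tokens`, `replace_from`, and `replace_to` are assumptions for illustration and are not confirmed by this PR.

```python
# Hypothetical reference for an expand step; the signature is an
# illustrative assumption, not the kernel interface from this PR.
def expand_reference(values, cu_num_tokens, replace_from, replace_to):
    """Expand per-request values into per-token values.

    values:        one value per request
    cu_num_tokens: cumulative token counts per request (inclusive)
    replace_from:  value to substitute before expanding
    replace_to:    value written in place of replace_from
    """
    output = []
    start = 0
    for value, end in zip(values, cu_num_tokens):
        if value == replace_from:
            value = replace_to
        # Repeat the value once for each token owned by this request.
        output.extend([value] * (end - start))
        start = end
    return output


if __name__ == "__main__":
    print(expand_reference([5, 0, 7], [2, 5, 6], 0, 1))
```

A list-based reference keeps the check independent of any device or Triton runtime, which makes it usable in a plain unit test.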
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: yuxingcyx <yuxingchen.math@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>