### What this PR does / why we need it?

The original `sample_recover_tokens_kernel` of the rejection sampler did not tile the vocab-size dimension, which causes a UB overflow for models with a large vocab size such as DeepSeek. This PR adds tiling along the vocab-size dimension to avoid the problem.

Note that we currently use an empirical `SUB_BLOCK_SIZE` of `4*1024`, chosen for functionality only. If this kernel becomes a performance bottleneck in the future, we can use Triton autotune to optimize it. In addition, we have to disable multibuffer for this kernel due to some accuracy issues.

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
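
For reviewers unfamiliar with the pattern, the vocab-dimension tiling described above can be sketched in plain Python. This is only an illustration, not the Triton kernel from this PR: the function name `tiled_argmax` and the assumption that the kernel reduces to an argmax over per-token scores are hypothetical. Only `SUB_BLOCK_SIZE = 4*1024` is taken from the description. The idea is that each step loads at most one tile and carries a running best value and index, so the working set stays bounded no matter how large the vocab is.

```python
# Illustrative sketch only: plain Python standing in for a tiled reduction.
# `tiled_argmax` is a hypothetical helper, not part of the actual kernel.
SUB_BLOCK_SIZE = 4 * 1024  # empirical tile size quoted in the PR description


def tiled_argmax(scores, sub_block_size=SUB_BLOCK_SIZE):
    """Return (index, value) of the largest score, scanning one tile at a time."""
    best_val = float("-inf")
    best_idx = -1
    for start in range(0, len(scores), sub_block_size):
        # Only this slice is "resident" at once; the last tile may be shorter.
        tile = scores[start:start + sub_block_size]
        # Local argmax inside the tile (first occurrence wins on ties).
        local_idx = max(range(len(tile)), key=tile.__getitem__)
        local_val = tile[local_idx]
        # Strict comparison keeps the earliest global index on ties.
        if local_val > best_val:
            best_val = local_val
            best_idx = start + local_idx
    return best_idx, best_val
```

Because the running maximum is merged tile by tile, the result does not depend on the tile size, which is why the tile size can later be tuned for performance without changing behavior.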