### What this PR does / why we need it?
Replace multiple PyTorch operations with a fused Triton kernel to
determine token indices for sampling during speculative decoding. This
reduces kernel launch overhead and memory traffic, improving overall
performance on Ascend hardware.
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>