### What this PR does / why we need it?
The Triton operator does not bounds-check the global position inside its
loop, which causes a memory overflow (out-of-bounds writes) when running
with multiple concurrent requests and 1-step MTP.
Solution: add a `global_pos < vec_len` check and strictly bound every
memory access by it, so that no out-of-bounds write can occur.
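
For illustration only, the effect of the check can be emulated in plain Python. This is a minimal sketch, not the kernel code from this PR: `copy_in_blocks`, `block_size`, and the buffer sizes are hypothetical names and values chosen for the example, and only `global_pos` and `vec_len` come from the description above. In a real Triton kernel the same condition would typically be passed as the `mask` argument of `tl.load` / `tl.store`.

```python
def copy_in_blocks(src, dst, vec_len, block_size, check_bounds):
    """Emulate one kernel program that walks a vector in fixed-size blocks."""
    num_blocks = (vec_len + block_size - 1) // block_size  # ceiling division
    for block in range(num_blocks):
        for lane in range(block_size):
            global_pos = block * block_size + lane
            # The guard: lanes past the logical end of the vector do nothing.
            if check_bounds and global_pos >= vec_len:
                continue
            dst[global_pos] = src[global_pos]


# Hypothetical sizes: 10 is deliberately not a multiple of 8.
VEC_LEN, BLOCK_SIZE, PAD = 10, 8, 6
# The padding stands in for memory that belongs to something else.
src = list(range(VEC_LEN)) + [99] * PAD

guarded = [0] * VEC_LEN + [-1] * PAD
copy_in_blocks(src, guarded, VEC_LEN, BLOCK_SIZE, check_bounds=True)

unguarded = [0] * VEC_LEN + [-1] * PAD
copy_in_blocks(src, unguarded, VEC_LEN, BLOCK_SIZE, check_bounds=False)

print(guarded[VEC_LEN:])    # padding left untouched
print(unguarded[VEC_LEN:])  # padding overwritten by the last block
```

Without the guard, the last block runs past `vec_len` whenever the length is not a multiple of the block size, which is the kind of write this change rules out.
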
backport: #7459
Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>