### What this PR does / why we need it?
This PR introduces a new fused Triton kernel,
`split_qkv_tp_rmsnorm_rope` for Minimax-m2.5.
The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and Neox-style RoPE.
### Does this PR introduce _any_ user-facing change?
Does not.
### How was this patch tested?
```python
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16
基线
| data | TTFT(ms) | TPOT(ms) | TPS |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1 | 267.55 | 25.5 | 38.85 |
| 4k/1k@bs4 | 542.4 | 26.51 | 148.06 |
测试线
| data | TTFT(ms) | TPOT(ms) | TPS |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1 | 234.64 | 20.96 | 47.24 |
| 4k/1k@bs4 | 508.36 | 22.16 | 176.69 |
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>