xc-llm-ascend

Author SHA1 Message Date


SparrowMu

6fbd0049df

[v0.18.0] Apply Eagle3 to MiniMax-M2.5 (#7619 ) (#7714 )

### What this PR does / why we need it?
Apply Eagle3 to MiniMax-M2.5 to improve model performance. This change
will be discarded once the Eagle3 weights for MiniMax-M2.5 are released
and the code change is accepted by the official repo:
https://github.com/vllm-project/vllm/pull/37512/changes
backport: #7619

- vLLM version: v0.18.0
- vLLM main: `ed359c497a`

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>

2026-03-27 18:33:29 +08:00

ichaoren

9d1452c74d

[OPS]add split_qkv_tp_rmsnorm_rope ops (#7376 )

### What this PR does / why we need it?
This PR introduces a new fused Triton kernel,
`split_qkv_tp_rmsnorm_rope`, for MiniMax-M2.5.

The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and Neox-style RoPE.
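
The two stages described above can be illustrated with a small NumPy reference. This is only a sketch under stated assumptions, not the Triton implementation from this PR: the function names, the `eps` value, the full-width rotary dimension, and the use of a plain Python `sum` to stand in for the TP all-reduce are all hypothetical choices made for illustration.

```python
import numpy as np


def split_qkv_and_local_sq_sum(qkv, q_size, kv_size):
    """Stage 1: split one rank's fused QKV; return per-token sums of squares of Q and K."""
    q, k, v = np.split(qkv, [q_size, q_size + kv_size], axis=-1)
    q_sq = np.sum(q * q, axis=-1, keepdims=True)
    k_sq = np.sum(k * k, axis=-1, keepdims=True)
    return q, k, v, q_sq, k_sq


def neox_rope(x, cos, sin, head_dim):
    """Neox-style RoPE: pair element i with element i + head_dim // 2 in each head."""
    heads = x.reshape(x.shape[0], -1, head_dim)
    half = head_dim // 2
    x1, x2 = heads[..., :half], heads[..., half:]
    c, s = cos[:, None, :], sin[:, None, :]
    rotated = np.concatenate([x1 * c - x2 * s, x2 * c + x1 * s], axis=-1)
    return rotated.reshape(x.shape)


def apply_global_rmsnorm_rope(x, global_sq_sum, full_dim, weight, cos, sin,
                              head_dim, eps=1e-6):
    """Stage 2: normalize with the variance over the full (unsharded) width, then rotate."""
    inv_rms = 1.0 / np.sqrt(global_sq_sum / full_dim + eps)
    return neox_rope(x * inv_rms * weight, cos, sin, head_dim)


def rope_tables(tokens, head_dim, base=10000.0):
    """Build cos/sin tables of shape [tokens, head_dim // 2] (assumed rotary base)."""
    positions = np.arange(tokens)[:, None]
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = positions * inv_freq
    return np.cos(angles), np.sin(angles)


def run_sharded(shards, q_weights, k_weights, q_size, kv_size, head_dim):
    """Simulate all TP ranks in one process; shards[r] is rank r's fused QKV."""
    tp = len(shards)
    cos, sin = rope_tables(shards[0].shape[0], head_dim)
    stage1 = [split_qkv_and_local_sq_sum(s, q_size, kv_size) for s in shards]
    # Summing the per-rank partial sums stands in for the TP all-reduce.
    q_sq = sum(r[3] for r in stage1)
    k_sq = sum(r[4] for r in stage1)
    outputs = []
    for (q, k, v, _, _), qw, kw in zip(stage1, q_weights, k_weights):
        q_out = apply_global_rmsnorm_rope(q, q_sq, tp * q_size, qw, cos, sin, head_dim)
        k_out = apply_global_rmsnorm_rope(k, k_sq, tp * kv_size, kw, cos, sin, head_dim)
        outputs.append((q_out, k_out, v))  # V passes through unchanged
    return outputs
```

Concatenating the per-rank Q (or K) outputs should match an RMSNorm over the unsharded tensor followed by RoPE, which is the property the variance all-reduce is there to preserve.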

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```bash
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16

Baseline

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 267.55   | 25.5     | 38.85  |
| 4k/1k@bs4  | 542.4    | 26.51    | 148.06 |

Test branch

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 234.64   | 20.96    | 47.24  |
| 4k/1k@bs4  | 508.36   | 22.16    | 176.69 |
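
To read the two tables side by side, the relative change per metric can be computed directly from the reported numbers. The snippet below is a small helper sketch; the dictionary layout and function name are illustrative choices, and the values are copied from the tables above.

```python
METRICS = ("TTFT(ms)", "TPOT(ms)", "TPS")

# Values copied from the two tables above (A3, TP16).
baseline = {
    "4k/1k@bs1": (267.55, 25.5, 38.85),
    "4k/1k@bs4": (542.4, 26.51, 148.06),
}
test_branch = {
    "4k/1k@bs1": (234.64, 20.96, 47.24),
    "4k/1k@bs4": (508.36, 22.16, 176.69),
}


def relative_change_percent(before, after):
    """Signed change of `after` relative to `before`, in percent."""
    return (after - before) / before * 100.0


changes = {
    case: {
        name: relative_change_percent(b, a)
        for name, b, a in zip(METRICS, baseline[case], test_branch[case])
    }
    for case in baseline
}

for case, per_metric in changes.items():
    summary = ", ".join(f"{name}: {value:+.1f}%" for name, value in per_metric.items())
    print(f"{case}: {summary}")
```

Negative values for TTFT and TPOT mean lower latency; a positive value for TPS means higher throughput.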


- vLLM version: v0.17.0
- vLLM main: `4034c3d32e`

Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>

2026-03-19 17:19:18 +08:00

SparrowMu

54668e73c5

[Model] Support Minimax-m2.5 on NPU (#7105 )

### What this PR does / why we need it?

Initial version to support MiniMax-M2.5 on vllm-ascend.
This commit converts the original fp8 weights to quantized bf16 to
support MiniMax-M2.5 on NPU.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main: `4034c3d32e`

### Test Report
Self-tested precision summary; for reference, the official AIME2025
precision score is 86.3.
<img width="426" height="84" alt="image"
src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a"
/>

---------

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>

2026-03-11 00:12:02 +08:00

3 Commits