Commit Graph

16 Commits

Author SHA1 Message Date
whx
4d6444d5fd [Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (#6505)
### What this PR does / why we need it?
Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is
added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has
not been tested on bf16 scenario. Results show that this is not
currently supported.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-02-03 18:26:51 +08:00
lidenghui1110
79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366)
### What this PR does / why we need it?
As #2947 describe, we need to transpose kv cache layout after GQA kv
transfer when prefill and decode tensor parallel size are heterogeneous,
in the previous implementation, we use `npu_paged_cache_load ` +
`tranpose` + `_npu_reshape_and_cache` to do this work.

But obviously, it is not an efficient plan, the ops above need to be
called for each layer, which introduces 3 * layer_num kernel launch, and
6 * layer_num data movement between L1 Cache and HBM for one request on
decode node. Usually, decode node uses graph mode, so these op kernels
will be called between decode forward launched by an async thread in
mooncacke connector, this kernels maybe last for several decode forward
and TTFT will increase by 3~4 decode forward time.

In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` to do this with only once kernel launch
and move data between L1 Cache and HBM only once.

After using this fused op, the time cost in transpose kv cacke layout
can be decreased to 0.24ms from 7ms in UT on 910C, and in PD
disaggregation scenario, TTFT can decrease about 90 ~ 110 ms in
qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |

### Does this PR introduce _any_ user-facing change?
Use fused op by default, incase the op has bug in any scenario, provide
fallback choice using env to disable it.

**DISABLE fused op by add following env**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-02-03 14:10:01 +08:00
linfeng-yuan
e25ee65729 [Misc][Test] add e2e test for apply_top_k_top_p_custom kernel (#6348)
### What this PR does / why we need it?
Add e2e test case for apply_top_k_top_p_custom kernel and eliminate
chinese comments.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-01-28 17:25:57 +08:00
yjmyl
e90b14140b [feature] add_rms_norm support bias (#5790)
### What this PR does / why we need it?
This PR is to replace addRmsNorm and Add With addRmsNormBias. This way
can lead to a more effecient result.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Full Test Pass

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com>
Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>
2026-01-23 21:09:54 +08:00
shiyuan680
cea48c2a34 model runner v2 support triton of penalty (#5854)
### What this PR does / why we need it?
Optimized operator performance and add ut test
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test in qwen2.5 7b vl, ops time approved 90%
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

this pr is for
# https://github.com/vllm-project/vllm-ascend/issues/5208

Signed-off-by: shiyuan680 <917935075@qq.com>
2026-01-20 12:26:05 +00:00
shiyuan680
7af3b880c1 support triton of mrope (#5664)
### What this PR does / why we need it?
this pr support use triton mrope like cuda_forward, which performance is
equal to ascendc ops
this triton ops should use cann 8.5.0
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test in qwen3-vl-235b acc textvqa
native 81.82
npu triton 81.58
cuda triton 81.52
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: shiyuan680 <917935075@qq.com>
2026-01-13 09:13:51 +08:00
Li Wang
595d3484c4 [Nightly] Move ops to the correct path (#5642)
### What this PR does / why we need it?
Move ops to the correct path where they belong

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-09 09:23:36 +08:00
Aoxuan Chen
8763953f56 [Feature] add the magicmtp speculative decoding acceleration algorithm (#5542)
### What this PR does / why we need it?

1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. Added Triton and PyTorch implementations, and added E2E test cases.

### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: chenaoxuan <cax1165@163.com>
2026-01-08 09:15:55 +08:00
ZCG12345
3be8e33fe9 [Kernel] Add moe_gating_top_k operator support for Ascend NPU (#5579)
### What this PR does / why we need it?

1.replace moe_gating_top_k from torch_npu with custom op
2.enable the  renorm function of moe_gating_top_k in softmax scenerio

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need test

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: ZCG12345 <2097562023@qq.com>
2026-01-07 21:42:31 +08:00
wangxiyuan
6f7a81cd9f [CI] cleanup single/multi-card test (#5623)
1. speed up e2e light test.
2. create `2-cards` and `4-cards` folder in multicard
3. move ops to nightly
4. run test in Alphabetical Order

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-07 14:13:34 +08:00
Li Wang
c5e2f48510 [CI] mv ops to correct path (#5615)
### What this PR does / why we need it?
mv ops to correct path
:`tests/e2e/nightly/single_node/ops/singlecard_ops/triton`

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-05 23:17:07 +08:00
Yizhou
755caeb06e [Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356)
### What this PR does / why we need it?
Replace multiple PyTorch operations with a fused Triton kernel to
determine token indices for sampling during speculative decoding. This
reduces kernel launch overhead and memory traffic, improving overall
performance on Ascend hardware.

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-05 16:51:29 +08:00
Trunrain
91bf524364 [BugFix][kernel] fix matmul_allreduce_add_rmsnorm_kernel (#5335)
### What this PR does / why we need it?
fix matmul_allreduce_add_rmsnorm_kernel, add hccl Init, SetCcTiling
interface
test case use multicard-4 
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
pytest -sv tests/e2e/nightly/ops/test_matmul_allreduce_add_rmsnorm.py
multicard-4 pass

https://github.com/vllm-project/vllm-ascend/actions/runs/20502630658/job/58914474652?pr=5335



- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: tongrunze <t00574058@china.huawei.com>
Co-authored-by: tongrunze <t00574058@china.huawei.com>
2026-01-05 15:19:54 +08:00
Jade Zheng
7d5242faca [Refactor] Formatting output types related to FuseMoE (#5481)
Currently in the Fused MoE module, functions of classes like
MoECommMethod and MoETokenDispatcher output data in dictionary or tuple
format, which hampers code maintainability, readability, and
extensibility. This PR introduces dataclasses for these key output types
to address these issues.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-31 14:24:37 +08:00
Jade Zheng
38570cfeb6 [Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072)
By converting the KV cache from ND to NZ format when the decode node
receives it, this PR ensures that the KV NZ feature works correctly
during the decoding phase in disagg-prefill scenario.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
2025-12-31 14:24:04 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00