Commit Graph

1139 Commits

wangxiyuan
29d2fe653d cleanup ascend config (#5296)
1. refresh additional config doc
2. move kv config logic to platform.
3. improve `dump_config` init logic and rename it to `dump_config_path`. This change is user-facing: `dump_config` is changed from a dict to a string (see the sketch below).
4. correct `enable_async_exponential` type
5. remove useless `chunked_prefill_for_mla`
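
A minimal usage sketch of the renamed option, assuming it is still passed through `additional_config` like other Ascend options; the model name and path below are placeholders, not taken from this commit:

```python
from vllm import LLM

# Hypothetical example: dump_config_path now takes a string path, not a dict.
llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    additional_config={"dump_config_path": "/tmp/ascend_dump"},  # placeholder path
)
```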

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-26 14:07:37 +08:00
ZT-AIA
adaa89a7a5 Update vllm pin to 12.25 (#5342)
### What this PR does / why we need it?
- Fix vLLM breakage introduced in the PR:
1. [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
2025-12-26 14:05:40 +08:00
Li Wang
c2f776b846 [Nightly] Initial logging for nightly multi-node testing (#5362)
### What this PR does / why we need it?
Currently, our multi-node logs only show the master node's logs (via the
Kubernetes API), which is insufficient for effective problem
localization if other nodes experience issues. Therefore, this pull
request adds the ability to upload logs for other nodes.

Next plan: Output structured directory logs, including logs from each
node and the polog.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-26 11:39:07 +08:00
Icey
9b2a7d8866 [BugFix][Fusion] Patch compile backend to make fusion available (#5308)
Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252
is causing operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-26 09:18:16 +08:00
Qi Mao
7372225bcb [FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322)
Description:

This PR updates the implementation of the Triton operator for deployment
on NPU devices, focusing on optimizing grid size and memory handling
based on NPU limitations.

Design Plan:

Grid Calculation: The grid size is now dynamically calculated by batch
and dim to ensure that the number of programs executed does not exceed
the NPU's vector core capacity. This ensures optimal parallelism without
overloading the hardware.

Data Block Handling: Due to the limited on-chip memory (UB) on Ascend
NPUs, this implementation splits large data into smaller chunks of 32k
or less per block. The kernel performs a for-loop to process the data in
these smaller chunks, minimizing memory usage and avoiding potential
overflows.

Changes Compared to GPU Implementation:

Grid and Block Sizing:

For GPU, the grid and block size were determined based on available
thread counts and memory size. In contrast, the NPU version dynamically
adjusts these parameters using B_TILE and BLOCK_N to optimize for NPU’s
architecture.

Memory Chunking:

The original GPU implementation did not require chunking due to the
higher available memory and processing capacity. For the NPU, data is
divided into smaller chunks (32k or smaller) to comply with memory
constraints on the device. The kernel has been modified to handle this
chunking mechanism inside a loop.

Optimized Thread Usage:

The NPU implementation takes into account the hardware-specific thread
limit (24 threads per vector core), ensuring that the number of active
programs is aligned with the NPU's vector core count, avoiding
over-subscription that would lead to serial processing.

This PR ensures that the operator functions efficiently on Ascend NPU,
considering hardware limitations while maintaining the same
functionality and input parameters as the GPU implementation.
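
A simplified, hypothetical sketch of the launch and chunking pattern described above (not the actual `_causal_conv1d_update_kernel`; `B_TILE`, `BLOCK_N`, and the element-wise body are illustrative placeholders):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def chunked_rows_kernel(x_ptr, out_ptr, n_rows, n_cols,
                        B_TILE: tl.constexpr, BLOCK_N: tl.constexpr):
    # Each program handles B_TILE rows of the flattened (batch * dim) axis,
    # so the total program count stays within the vector-core budget.
    pid = tl.program_id(0)
    for b in range(B_TILE):
        row = pid * B_TILE + b
        # Walk the row in BLOCK_N-sized chunks to respect the limited
        # on-chip (UB) memory instead of loading the whole row at once.
        for start in range(0, n_cols, BLOCK_N):
            offs = start + tl.arange(0, BLOCK_N)
            mask = (offs < n_cols) & (row < n_rows)
            x = tl.load(x_ptr + row * n_cols + offs, mask=mask)
            tl.store(out_ptr + row * n_cols + offs, x * 2.0, mask=mask)  # placeholder math


def launch(x: torch.Tensor, vector_cores: int = 24, block_n: int = 4096) -> torch.Tensor:
    n_rows, n_cols = x.shape  # rows = batch * dim, already flattened
    out = torch.empty_like(x)
    # Grid derived from batch and dim, capped by the number of vector cores.
    b_tile = triton.cdiv(n_rows, vector_cores)
    grid = (triton.cdiv(n_rows, b_tile),)
    chunked_rows_kernel[grid](x, out, n_rows, n_cols, B_TILE=b_tile, BLOCK_N=block_n)
    return out
```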


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
2025-12-26 09:12:30 +08:00
Magnus
59f11dd1cb [Bugfix] fix xlite decode-only e2e test (#5354)
### What this PR does / why we need it?
Fix the xlite decode-only e2e test. The xlite decode-only mode uses aclgraph's prefill and is affected by aclgraph, so the test length was shortened.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: changdawei1 <changdawei3@huawei.com>
2025-12-25 16:30:17 +08:00
Aoxuan Chen
8caad0510d fix e2e rejection-sampler error (#5341)
### What this PR does / why we need it?
Fixed the CI error in
vllm-ascend/tests/e2e/nightly/ops/triton/test_rejection_sampler.py:
test_rejection_sampler_block_verify_triton_kernel failed with "duplicate
parametrization of 'vocab_size'".

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: chenaoxuan <cax1165@163.com>
2025-12-25 11:39:38 +08:00
wangxiyuan
2ae0bad96d Remove VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE (#5272)
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with
`VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which is totally useless. This PR
removes it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-25 11:09:56 +08:00
Wang Kunpeng
13cd6362c6 [bugfix] fix Error 'ValueError: Duplicate layer name' (#5280)
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-25 10:43:24 +08:00
dsxsteven
30778f371b [BugFix] Fix num_pcp_pads Assignment Issues (#5273)
### What this PR does / why we need it?
The variable `self.num_pcp_pads` was incorrectly truncated during
assignment, causing errors in certain scenarios such as PD
disaggregation. This issue has now been resolved.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-25 10:38:09 +08:00
wjunLu
fca2f948c1 [E2E Refactor] Enable skipped e2e case (#5287)
### What this PR does / why we need it?

The test case `tests/e2e/multicard/test_data_parallel.py` was skipped
due to errors encountered during the migration from Ascend A2 to A3;
the details are as follows:
```
(EngineCore_DP0 pid=17833) RuntimeError: npu_moe_distribute_dispatch_v2:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:161 NPU function error: call aclnnMoeDistributeDispatchV3 failed, error code is 561002
(EngineCore_DP0 pid=17833) [ERROR] 2025-12-23-07:36:19 (PID:17833, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore_DP0 pid=17833) EZ9999: Inner Error!
(EngineCore_DP0 pid=17833) EZ9999[PID: 17833] 2025-12-23-07:36:19.237.396 (EZ9999):  HCCL_BUFFSIZE is too SMALL, maxBs = 512, h = 2048, epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine = 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) + (maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 609MB, HCCL_BUFFSIZE=200MB.[FUNC:MoeDistributeDispatchA3TilingFuncImpl][FILE:moe_distribute_dispatch_v2_tiling.cc][LINE:941]
(EngineCore_DP0 pid=17833)         TraceBack (most recent call last):
(EngineCore_DP0 pid=17833)        MoeDistributeDispatchV2 do tiling failed, ret is -1.
(EngineCore_DP0 pid=17833)        Check NnopbaseExecutorDoTiling(executor) failed
(EngineCore_DP0 pid=17833)        Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed
(EngineCore_DP0 pid=17833)        Check NnopbaseExecutorMatchCache(executor) failed
(EngineCore_DP0 pid=17833)        Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed
```

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
After the fix, I ran `pytest -sv --durations=0
tests/e2e/multicard/test_data_parallel.py`, and the result looks good:
```
========================================================================================= warnings summary =========================================================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================== slowest durations =========================================================================================
112.69s call     tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-vllm-ascend/Qwen3-30B-A3B-W8A8]
88.11s call     tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-30B-A3B]
70.06s call     tests/e2e/multicard/test_data_parallel.py::test_qwen_inference_dp2[32-Qwen/Qwen3-0.6B]

(6 durations < 0.005s hidden.  Use -vv to show these durations.)
============================================================================ 3 passed, 2 warnings in 270.88s (0:04:30) ============================================================================
```
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2025-12-25 09:18:05 +08:00
Magnus
a9fccbeb30 [CI] add xlite e2e test (#5305)
### What this PR does / why we need it?
add xlite e2e test

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: DaweiChang <405739598@qq.com>
2025-12-25 09:17:06 +08:00
Aoxuan Chen
6d25372baa Add MagicMTP(block verify) and Triton optimization (#4443)
### What this PR does / why we need it?
1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. The rejection sampling logic in rejection_sampler.py was restructured
using Triton-Ascend, enabling it to operate under high concurrency, thus
resolving CPU and NPU operator bottlenecks and enhancing throughput.

### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: chenaoxuan <cax1165@163.com>
2025-12-25 09:00:25 +08:00
Ascendyh
a90482803d [Kernel] add l2norm triton kernel (#4595)
### What this PR does / why we need it?
This pull request introduces an L2 normalization kernel implemented in
Triton, specifically optimized for Ascend NPUs.
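
For context, a minimal row-wise L2 normalization kernel in Triton might look like the sketch below (illustrative only, not the kernel added by this PR):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def l2norm_kernel(x_ptr, y_ptr, n_cols, eps, BLOCK_N: tl.constexpr):
    # One program per row: y = x / sqrt(sum(x^2) + eps)
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_N)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=0.0).to(tl.float32)
    inv_norm = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) + eps)
    y = (x * inv_norm).to(y_ptr.dtype.element_ty)
    tl.store(y_ptr + row * n_cols + offs, y, mask=mask)


def l2norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    l2norm_kernel[(n_rows,)](x, y, n_cols, eps, BLOCK_N=triton.next_power_of_2(n_cols))
    return y
```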
### Does this PR introduce _any_ user-facing change?
No, this PR does not introduce any user-facing changes.
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: Ascendyh <hw7osiris@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-25 06:06:18 +08:00
Mengqing Cao
e54630e01c Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) (#5317)
### What this PR does / why we need it?
Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138), as
it causes a DeepSeek V3.2 hang error.


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-24 22:24:17 +08:00
wangxiyuan
fb3d6ca08c Cleanup uesless env (#5270)
`VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's
remove it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-24 22:07:59 +08:00
TmacAaron
5018f2d8fd [quantization] Add w8a16 quantization support (#4541)
### What this PR does / why we need it?
related to https://github.com/vllm-project/vllm-ascend/issues/4267

### Does this PR introduce _any_ user-facing change?
w8a16 quantization is now supported

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

### Test
tested using [aisbench](https://gitee.com/aisbench/benchmark/) with tp2
#### Precision
|  | ceval | mmlu | gsm8k |
|---|---|---|---|
| bf16 | 90.46 | 89.17 | 96.21 |
| w8a16 | 89.51 | 89.29 | 95.98 |

#### Performance
|  | input_len | output_len | concurrency | TTFT (ms) | TPOT (ms) | TPS (Total) (tokens/s) |
|---|---|---|---|---|---|---|
| bf16 | 2048 | 2048 | 10 | 1911.7136 | 77.988 | 253.9866 |
| w8a16 | 2048 | 2048 | 10 | 2128.6334 | 67.1633 | 293.9117 |
| bf16 | 3500 | 1024 | 10 | 3076.2509 | 84.3525 | 506.949 |
| w8a16 | 3500 | 1024 | 10 | 2685.2031 | 73.015 | 585.4717 |

---------

Signed-off-by: yyt <yangyit139@gmail.com>
Signed-off-by: TmacAaron <yangyit139@gmail.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2025-12-24 19:49:32 +08:00
linfeng-yuan
515267de22 [perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler (#4154)
### What this PR does / why we need it?
1. Use the optimized apply_top_k_top_p for the NPU platform in the rejection sampler (avoiding scatter of elements, which reduces TPOT by ~26 ms with bs=24 per DP); a generic sketch of this filtering follows the list.
2. <del>Avoid D2H synchronization before calling npu_top_k_top_p introduced by parameter validation, which improves inference speed with `async_scheduling` enabled;</del> To eliminate the D2H synchronization introduced by parameter validation before calling `npu_top_k_top_p`, we directly drop this fused operator, since the performance improvement is not significant compared to async_scheduling and it may bring potential accuracy problems.
3. Refactor the implementation of AscendTopKTopPSampler to align with that of vLLM.
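
A generic sketch of sort-based top-k/top-p filtering that restores the original order with a gather instead of a scatter (scalar `k` and `p` for brevity; the real sampler takes per-request tensors, and this is not the exact NPU implementation):

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(probs, dim=-1)
    # top-p: drop tokens once the cumulative mass before them exceeds p
    # (this always keeps at least the highest-probability token)
    top_p_mask = cumprobs - probs > p
    # top-k: drop everything after the first k sorted positions
    top_k_mask = torch.arange(logits.shape[-1], device=logits.device) >= k
    sorted_logits = sorted_logits.masked_fill(top_p_mask | top_k_mask, float("-inf"))
    # undo the sort with a gather on the inverse permutation (no scatter needed)
    inv_idx = torch.argsort(sorted_idx, dim=-1)
    return torch.gather(sorted_logits, -1, inv_idx)
```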

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E serving test with combinations of `k=500` and `p=0.95` with
async_scheduling in single node and wide-EP scenarios.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2025-12-24 19:10:33 +08:00
zhangyiming
74a1de50a9 [E2E] Optimize e2e test. (#5091)
### What this PR does / why we need it?
[E2E] Optimize e2e test.
- Remove the test_basic_camem testcase.
- Change Qwen2.5-0.5B-Instruct-W8A8 to Qwen3-0.6B-W8A8

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-24 10:41:55 +08:00
zhangyiming
bd4fb871c6 [CI] Add skipped testcases. (#5254)
### What this PR does / why we need it?
Some E2E test cases are not in our CI workflow; this PR adds them back.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-24 10:41:32 +08:00
wujinyuan1
7ff1db4b84 [Refactor]5/N Extract common code of mla_v1.py & extract mla_cp (#5097)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The functions related to Cp differ significantly from those of normal
MLA-Attention, but the coupling is quite severe.

Steps:
1) Extract the common code of AscendMLAMetadataBuilder.build into 4 functions:
build_prefill_metadata, build_decode_metadata, build_cp_metadata, build_chunked_metadata

todo:
1) refactor function _compute_prefill_context;
2) refactor functions _mla_preprocess and _mla_decode_preprocess;
3) extract public data and processing functions from the attention_cp.py
and mla_cp.py files into the common_cp file.

- vLLM version: 0.13.0rc3
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-24 10:25:19 +08:00
Nengjun Ma
3b59f20a28 update to vllm 12-19 (#5223)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
Fix vLLM breakage:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](https://github.com/vllm-project/vllm/pull/29558)
Fix solution: add the now-required `all2all_backend` parameter. The only
impact of this parameter on the original `set_splitting_ops_for_v1`
implementation is that graph mode is disabled in `vllm` if
`deepep_high_throughput` is enabled; it has no effect on the
`vllm-ascend` logic.

2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](https://github.com/vllm-project/vllm/pull/30684)
Fix solution: the GPU does not need to convert qkv to 3D because its
flash_attention operator accepts both 3D and 4D layouts (b s h d and
s b (h d)), but the NPU's flash_attention_unpad operator only supports
3D (s b (h d)). Therefore, we need to introduce the reshape_qkv_to_3d
operation (sketched after this list).

4. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following
issue with the upgraded vLLM code:
https://github.com/vllm-project/vllm-ascend/issues/5297
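
A hypothetical illustration of the reshape in fix 2 (names and layout conventions assumed, not the actual vllm-ascend helper):

```python
import torch

def reshape_qkv_to_3d(x: torch.Tensor) -> torch.Tensor:
    """Assumed layout change: 4-D (b, s, h, d) -> 3-D (s, b, h * d)."""
    b, s, h, d = x.shape
    return x.permute(1, 0, 2, 3).reshape(s, b, h * d)
```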

### How was this patch tested?


Co-authored-by: zxwang <1476209578@qq.com>

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
2025-12-23 23:52:11 +08:00
weichen
ffe51eedd6 [Refactor][MoE] Reuse vLLM's all_reduce logic (#5189)
### What this PR does / why we need it?
Move all_reduce logic to AscendFusedMoE.forward, reuse vLLM's logic.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-23 18:53:48 +08:00
zhangxinyuehfad
8ae7fca947 [CI] refect e2e ci test (#5246)
### What this PR does / why we need it?
Refactor the e2e CI tests:
1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager parameter and rename the test case
2. tests/e2e/singlecard/pooling/test_scoring.py: rename test cases
3. tests/e2e/singlecard/pooling/test_classification.py: rename the test case
4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter, change the model to vllm-ascend/Qwen2.5-0.6B-W8A8, and rename the test case
5. tests/e2e/multicard/test_shared_expert_dp.py: rename test cases
6. tests/e2e/singlecard/test_sampler.py: rename test cases
7. tests/e2e/singlecard/test_aclgraph_accuracy.py: rename test cases
8. tests/e2e/multicard/test_offline_inference_distributed.py: rename test cases and remove the eager parameter
9. tests/e2e/multicard/long_sequence/test_accuracy.py: rename test cases and remove the eager parameter
10. tests/e2e/multicard/long_sequence/test_basic.py: rename test cases and remove the eager parameter
11. tests/e2e/multicard/test_expert_parallel.py: remove the eager parameter
12. tests/e2e/multicard/test_full_graph_mode.py: remove the eager parameter
13. tests/e2e/multicard/test_ilama_lora_tp2.py: remove the eager parameter
14. tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py: remove the eager parameter
15. tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py: remove the eager parameter
16. tests/e2e/singlecard/test_aclgraph_accuracy.py: remove the eager parameter
17. tests/e2e/singlecard/test_camem.py: remove the eager parameter
18. tests/e2e/singlecard/test_ilama_lora.py: remove the eager parameter
19. tests/e2e/singlecard/test_multistream_overlap_shared_expert.py: remove the eager parameter
20. tests/e2e/singlecard/test_vlm.py: remove the eager parameter
21. tests/e2e/singlecard/test_xli: remove the eager parameter

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-23 18:42:35 +08:00
Li Wang
5d1f6daef6 [CI] Mock spawn for vlm tests (#5279)
### What this PR does / why we need it?
Using `spawn` in continuous testing scenarios
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-23 18:35:06 +08:00
SILONG ZENG
fa0c212bfa [test]Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in nightly tests. (#5195)
### What this PR does / why we need it?
Corrected the Qwen3-Omni-30B-A3B-Instruct accuracy test configuration in
nightly tests.
link: https://github.com/vllm-project/vllm-ascend/pull/4911

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2025-12-23 14:17:27 +08:00
SILONG ZENG
29a93daa82 [CI]refactor: standardize test case naming convention (#5243)
### What this PR does / why we need it?
- Standardize test case naming in `vllm-ascend/tests/e2e/multicard/` to
follow the `<model>_<feature>_<distributed>` convention.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2025-12-23 14:13:42 +08:00
meihanc
592cfb6a6f [CI] Add Triton Ascend in CI (#4921)
Add triton-ascend to UT and e2e tests

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2025-12-23 12:47:35 +08:00
LI SHENGYONG
2e010e12dd [EPLB][CI] Add dynamic EPLB CI for qwen3-moe (#5179)
### What this PR does / why we need it?
Add dynamic EPLB CI for qwen3-moe-30B-W8A8

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-23 11:31:00 +08:00
Mengqing Cao
449f8f65a7 [KV-Sharing] Support KV-Sharing feature in CLA models (#4138)
### What this PR does / why we need it?
Support the KV-Sharing feature in CLA (cross-layer attention) models,
which share KV cache across some layers.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-23 10:48:31 +08:00
Li Wang
9a79cbaecb [ModelRunner] Add hunyuan-vl basic support (#5151)
### What this PR does / why we need it?
This patch adds handling of `XDRotaryEmbedding` in the model runner to
support `hunyuan-vl`.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with added/exist tests

Closes: https://github.com/vllm-project/vllm-ascend/issues/4992

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-23 10:46:54 +08:00
Wang Kunpeng
c3a8d13ca7 [refactor] Remove unnecessary attributes from set_ascend_forward_context (#5204)
### What this PR does / why we need it?
Remove unnecessary attributes from set_ascend_forward_context:
1. prefetch_stream
2. weight_prefetch_method
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-23 08:49:52 +08:00
weijinqian0
95e8a52156 [Refactor] move the metadata from attention_v1 to util(ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629

1. Remove the pcp-related code from attention_v1.
2. Establish the inheritance relationship of CommonAttentionMetadata.

TODO
1. extract common_cp
2. move cp metadata to common_cp.
3. remove commonAttentionMetadata for aclgraph.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-23 00:10:52 +08:00
ApsarasX
3d9954eff0 [Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205)
### What this PR does / why we need it?
In code files such as `mooncake_connector.py`,
`vllm_config.model_config.hf_config` is used to get the LLM configs.
This approach works for LLMs, but not for multi-modal models. For
multi-modal models, `vllm_config.model_config.hf_text_config` must be
used instead to get the LLM configs.
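
A minimal sketch of the accessor difference (assuming the usual `VllmConfig` object):

```python
def get_llm_config(vllm_config):
    # hf_text_config resolves to hf_config.text_config for multi-modal models
    # and falls back to hf_config itself for plain LLMs, so it is the safe
    # accessor when reading fields such as num_hidden_layers or hidden_size.
    return vllm_config.model_config.hf_text_config
```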

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-12-22 20:21:45 +08:00
jiangyunfan1
3ba920a65b [TEST]Update mm param --mm-processor-cache-gb (#5242)
### What this PR does / why we need it?
This PR updates the multimodal parameter `--mm-processor-cache-gb`; we
need it to run the test case.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running the test

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-12-22 18:54:03 +08:00
zzzzwwjj
052e472453 [bugfix] fix w8a8dynamic fused_moe trans nz (#5199)
### What this PR does / why we need it?
Currently, `torch_npu.npu_grouped_matmul_swiglu_quant` only supports
weights in NZ format, so we need to forcibly convert w13_weight and
w2_weight to NZ.
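
A hedged sketch of the kind of forced conversion described here, based on common torch_npu usage (the format constant and helper name are assumptions, not the exact code in this PR):

```python
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumed torch_npu id for the fractal NZ format

def force_weight_to_nz(weight):
    # Cast the weight tensor to NZ so npu_grouped_matmul_swiglu_quant accepts it.
    return torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
```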

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-22 17:45:34 +08:00
zhangsicheng5
78aa7f2693 [feature] support pcp + mtp in full graph (#4572)
1. support pcp + mtp in full graph
2. pcp/dcp related mtp bugfix
3. support pcp + mtpx

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-22 16:13:39 +08:00
Qiu
ea6206bb18 [bugfix][ACLGraph][MTP] deletes cudagraph_batch_sizes in MtpProposer (#5183)
### What this PR does / why we need it?
This PR deletes `cudagraph_batch_sizes` in `MtpProposer` and reuses the
one in `NPUModelRunner`.

During our deployment of DeepSeek-V3.2 with MTP across machines (2P2D)
and AISBench stress testing, an error occurred (see below). After
investigation, we found that
`compilation_config.cudagraph_capture_sizes` is modified by
`adjust_cudagraph_sizes_for_spec_decode` in `NPUModelRunner`. This
modification only updates `cudagraph_batch_sizes` in `NPUModelRunner`
but is not synchronized to `MtpProposer`. After discussion (CC @yiz-liu),
we believe it is unnecessary to maintain `cudagraph_batch_sizes` in
`MtpProposer`; it should directly use the variable from
`NPUModelRunner`.
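
A rough sketch of the reuse pattern (class and attribute names simplified, not the actual code):

```python
class MtpProposer:
    def __init__(self, runner):
        self.runner = runner  # the owning NPUModelRunner

    @property
    def cudagraph_batch_sizes(self):
        # Read the runner's (possibly adjusted) list instead of keeping a copy,
        # so adjust_cudagraph_sizes_for_spec_decode is reflected here as well.
        return self.runner.cudagraph_batch_sizes
```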

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2025-12-22 14:08:27 +08:00
Feng Liu
e117b3d693 [Perf] vectorize PCP/DCP loops in mla_v1.py (#5003)
### What this PR does / why we need it?
- Replace nested PCP/DCP Python loops with fully vectorized tensor
operations

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2025-12-22 11:06:30 +08:00
Feng Liu
49838d4bec [Perf] vectorize PCP/DCP loops in attention_cp.py (#4944)
### What this PR does / why we need it?
- Add explicit .contiguous() after permute/view to ensure mem-friendly
layout
- Replace nested PCP/DCP Python loops with fully vectorized tensor
operations
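
A generic before/after illustration of the two changes above (not the actual attention_cp.py code):

```python
import torch

def select_tokens_loop(kv: torch.Tensor, token_idx: torch.Tensor) -> torch.Tensor:
    # Before: nested Python loops over (cp_rank, token) pairs.
    out = []
    for r in range(kv.shape[0]):
        for t in token_idx[r].tolist():
            out.append(kv[r, t])
    return torch.stack(out)

def select_tokens_vectorized(kv: torch.Tensor, token_idx: torch.Tensor) -> torch.Tensor:
    # After: one advanced-indexing gather over flattened (rank, token) pairs,
    # made contiguous so the downstream kernel sees a memory-friendly layout.
    ranks = torch.arange(kv.shape[0], device=kv.device).repeat_interleave(token_idx.shape[1])
    return kv[ranks, token_idx.reshape(-1)].contiguous()
```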

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2025-12-22 11:06:19 +08:00
YuhanBai
5d02eed16f [Performance] Add async exponential while model executing (#4501)
### What this PR does / why we need it?
Add a control to enable overlapping the exponential-distribution operator
with model execution (default is OFF, because this feature might not
perform well on MoE models, e.g., Qwen3-30B).
Enabling async exponential overlapping provides a performance improvement.
Also, overlapping the exponential operator with model execution can hide
the performance drop introduced by the AICPU version of the exponential
operator.

**UPDATE** (12/12):
The overlap now uses the same stream introduced in PR #4908.
We moved `do_async_exponential` from `model_runner_v1.py` to `sampler.py`.
Async exponential is now enabled via `additional_config`: add
`"enable_async_exponential": 1` to `additional_config`.
We now **ONLY** support the default exponential/AI-CPU exponential; the old
`"enable_async_exponential": 2` option has been removed to keep consistency.

### Does this PR introduce _any_ user-facing change?
**YES**, added a new `additional_config` option: `"enable_async_exponential": 1`.
When `enable_async_exponential` is set to 1, we enable async exponential
and overlap it with model execution.
When `enable_async_exponential` is set to 0 (the default), async
exponential is disabled, but the exponential operator still runs on a
different stream, using the stream introduced in #4908.
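
A minimal usage sketch (the model name is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder
    additional_config={"enable_async_exponential": 1},  # 0 (default) disables the overlap
)
```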

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
Signed-off-by: YuhanBai yuhan.bai0830@gmail.com
2025-12-20 21:23:21 +08:00
weiguihua2
21745221a3 [lint]clean code (#5218)
### What this PR does / why we need it?
Fix the lint error introduced by
https://github.com/vllm-project/vllm-ascend/pull/5141

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-20 18:24:04 +08:00
weiguihua2
74aa968a9f [e2e] add pcp e2e (#5141)
### What this PR does / why we need it?
add pcp accuracy e2e test case

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-20 16:56:46 +08:00
Mengqing Cao
5d59bf8ca0 [CI] unblock CI on suffix spec decoding (#4813)
### What this PR does / why we need it?
unblock CI on suffix spec decoding

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-20 14:54:49 +08:00
wangxiyuan
758d81dcb1 Drop 0.12.0 support (#5146)
We decided to release v0.13.0 soon. So no need to support 0.12.0 now.
Let's drop it.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-20 09:38:53 +08:00
Li Wang
243ab7d720 [CI] Use offline mode for nightly test (#5187)
### What this PR does / why we need it?
For single-node tests, the lack of a retry mechanism for accessing
ModelScope sometimes resulted in an HTTP 400 error. I recommend using a
local offline cache instead.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 21:21:42 +08:00
XiaoxinWang
0cc3fc357f [pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818)
### What this PR does / why we need it?
Add the fused_sigmoid_gating_delta_rule_update op for qwen3_next, which
fuses fused_gdn_gating + fused_recurrent_gated_delta_rule.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-12-19 16:34:11 +08:00
wangqiankun13
118b0ed346 [Feature] Add token mask for DispatchGmmCombineDecode operator (#5171)
### What this PR does / why we need it?
In this PR, DispatchGmmCombineDecode adds an optional input
x_active_mask; only tokens masked True will be dispatched and handled.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
2025-12-19 16:31:48 +08:00
zzzzwwjj
cc23067f1e [refactor] refactor weight trans nz and transpose (#4878)
### What this PR does / why we need it?

Now `VLLM_ASCEND_ENABLE_NZ` has three options:
0: disable NZ;
1: enable NZ only for the quant case;
2: enable NZ whenever possible.

`VLLM_ASCEND_ENABLE_NZ` is 1 by default.

All cases are shown in the table below:

|  | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose case | only support transpose case | linear: only support not transpose case<br>gmm: only support transpose case | same to fp16/bf16 |

Some exceptional cases:
1. The MLAPO op needs to do some additional processing on the weights,
including the NZ conversion. If the MLAPO op is used, some weights will
be forcibly transformed to NZ;
2. MLA/SFA's weight `W_UV` is used by the op
`torch.ops._C_ascend.batch_matmul_transpose`, which can't support NZ
currently.
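
For example, to pick one of the policies listed above (set before the engine is created):

```python
import os

os.environ["VLLM_ASCEND_ENABLE_NZ"] = "2"  # enable NZ whenever possible
# "1" (default): NZ only for quantized weights; "0": disable NZ entirely
```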

### Does this PR introduce _any_ user-facing change?
Now fp16/bf16 weights are not converted to NZ by default.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-19 14:27:24 +08:00
LookAround0301
76e58d66be support basic long_seq feature st (#5140)
### What this PR does / why we need it?
support basic long_seq feature st 

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-12-19 10:50:01 +08:00