Commit Graph

634 Commits

Author SHA1 Message Date
realliujiaxu
f69a83b7ba [Feat] Flash comm allgher ep (#3334)
Support flash comm v1(Sequence Parallelism) for Allgather EP.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
2025-10-15 19:36:32 +08:00
Mengqing Cao
8abe517870 [Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432)
### What this PR does / why we need it?
Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch.

The final goal is to remove all the patches and align the code arch to
vllm, thus we need to do the following work in next prs.
TODO:
- [x] remove patch on attention spec
- [ ] refactor the kvcache creation logic

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
1. CI passed with existing test.
2. Test pass with deepseek-v3.2-exp


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-15 17:48:58 +08:00
linfeng-yuan
099255e933 [bugfix] fix pipeline parallel for mla & sfa attention backend (#3459)
### What this PR does / why we need it?
Fix pipeline parallel break for mla & sfa attention backend caused by a
magic number in metadata builder. The error report:
`AttributeError: 'PPMissingLayer' object has no attribute 'self_attn'`

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR was tested with "mp" backend (PP2TP8 on an A3 node) as well as
"ray" backend (PP2TP8 on two A2 nodes).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-15 17:13:27 +08:00
offline893
5a3082cd15 [EPLB]Record expert map without dynamic eplb. (#3409)
What this PR does / why we need it?
1.Record expert map without dynamic eplb.
2.Add export PYTHONOPTIMIZE=1  when using dynamic eplb.
3.change eplb doc

Does this PR introduce any user-facing change?
How was this patch tested?
Qwen3_moe in A3.

- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-15 14:21:15 +08:00
weichen
4f937f561d [MoE] [Refactor] Remove manual memory cleanup (#3365)
### What this PR does / why we need it?
1. Replace manual memory cleanup with passing parameter.
2. FusedMoEPrepareAndFinalizeWithMC2 inherits All2All avoid duplicated
code.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-15 12:36:24 +08:00
LeeWenquan
4e720936d8 Fix warning msg print (#3421)
### What this PR does / why we need it?
Avoid printing some warning msg as below :
UserWarning: To copy construct from a tensor, it is recommended to use
sourceTensor.clone().detach ...

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-10-15 11:30:30 +08:00
Chen Chen
16cb3cc45d adapt the mla_v1 with the mla_preprocess kernel (#3397)
### What this PR does / why we need it?

This pull request integrates a new `mla_preprocess` kernel to create an
optimized path for MLA (Multi-Layer Attention) decode operations on
Ascend hardware, controlled by an environment flag. The changes include
new utility functions for weight transformation, a method to prepare
weights for the fused kernel, and logic to route decode-only batches to
this new path. My review identified a critical bug in the `transdata`
utility function where padding dimensions are swapped, which will lead
to incorrect tensor shapes and kernel failures. Additionally, I've
pointed out a high-severity maintainability issue in the
trans_rope_weight function, which modifies its input in-place, and I
have provided a pure-function alternative.

### Does this PR introduce _any_ user-facing change?

No user-facing changes by default. User can enable the `mla_preprocess`
kernel in model by enable the env-var `VLLM_ASCEND_ENABLE_MLAPO`.

### How was this patch tested?

Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>
2025-10-15 10:34:25 +08:00
CaranLic
15b2e5c995 Remove unused row_idx in token_dispatcher (#3442)
### What this PR does / why we need it?
The `row_idx` parameter is no longer used since
PR[#2689](https://github.com/vllm-project/vllm-ascend/pull/2689), so
remove it across multiple files to remove unnecessary calculations and
parameter passing.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
accuracy test passed for Qwen3 235B and DeepSeek V3 671B after this PR.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-10-15 09:08:31 +08:00
zouyida2052
3642b64afc bugfix for mtp with multistream_moe (#3419)
### What this PR does / why we need it?
when infer deepseek mtp layer with multistream_moe, we should pass a
boolean to evaluate this feature and fix bugs when we are in mtp layer

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-15 08:59:58 +08:00
zxr2333
c2c1db78a7 [Bugfix] fix ZeroDivisionError when prefill_tp_size > num_kv_head and fix tp_resharding README (#3437)
### What this PR does / why we need it?
Fix ZeroDivisionError when prefill_tp_size > num_kv_head, in this
situation, num_head_replica can be 0 and used to divide another value,
this PR restricts the minimum value of a to be 1. And this PR fix
tp_resharding README.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-10-15 08:45:44 +08:00
xuyexiong
02c26dcfc7 [Feat] Supports Aclgraph for bge-m3 (#3171)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
to start an online server with bs 10, each batch's seq length=8192, we
set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For bs10, each batch's seq length=8192, QPS is improved from 85 to 104,
which is a 22% improvement, lots of host bound is reduced.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
2025-10-14 23:07:45 +08:00
fan2956
434059e417 [BugFix] Fix multimodal model support fullgraph error (#3425)
### What this PR does / why we need it?
Because the update_attn_params function requires passing the num_tokens
parameter, and num_tokens is obtained via postions.shape[0]. However,
the multimodal model uses mrope (Multidimensional Rotary Position
Embedding), which results in the postions having a shape of 2.
Consequently, postions.shape[0] retrieves an incorrect value.We resolve
this issue by replacing positions.shape[0] with maybe_padded_num_tokens.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: fan2956 <zhoufan53@huawei.com>
2025-10-14 21:51:09 +08:00
Mengqing Cao
223cc34085 [KVCache] Refactor KVCache as page_size_bytes is ineffective (#3438)
### What this PR does / why we need it?
Refactor KVCache as page_size_bytes is ineffective.

1. Currently the `AttentionSpec` is patched, but the `page_size_bytes`
is still using that in vLLM in runtime, thus the patch is not working
actually. Thus this pr removes the patch on `AttentionSpec`, and will do
the final fix in vLLM.
2. Use `MLAAttentionSpec` instead of `FullAttentionSpec` to reduce
`page_size_bytes` of spec, so that num_blocks in spec could double

### How was this patch tested?
Test pass with Qwen3-Next and DeepSeek-V3.2-Exp

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-14 21:28:41 +08:00
linfeng-yuan
c55d99d13e [bugfix][torchair] fix missing weight nz cast for w13_weight in torchair_w8a8_dynamic.py (#3446)
### What this PR does / why we need it?
Fix the issue of missing NZ conversion for quantized weights in GMM
after moe_dispatch operator in torchair scenario, which does not involve
aclgraph & single scenarios.

### How was this patch tested?
vllm serving passed with lower latency (~5ms TPOT with bs_per_rank=28 &
ep_size=32)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-14 21:11:05 +08:00
yuzhup
78777237a9 [2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203)
### What this PR does / why we need it?

- Refacotr and integrate a unified `WeightPrefetchMethod`
- Integrate `gate_up_proj.weight` in quantized Attention modules
- Prefetching these weights ahead of matmul-like operators imporves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Add a new config in `--additional-config` for configuration:
```json
{
    "weight_prefetch_config": {
        "enabled": True,
        "prefetch_ratio": {
            "moe": {
                "gate_up": 0.8
            },
        },
    },
}
```
This feature is enabled by default, and can be disabled through this
configuration

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>
2025-10-14 20:16:33 +08:00
anon189Ty
07e39620ea [Feat] Unquantized Linear to nz and control all nz-cast (#3356)
### What this PR does / why we need it?
Currently, when executing to the Linear layer of models in vLLM-Ascend,
the weights format is ND in unquantized case and skipped ascend case.
This PR supplements the execution logic for Linear layer. We use a new
global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and
CANN version is 8.3, the weights of the Linear layer will be converted
to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also
use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as
w8a8-quantized case.

### Does this PR introduce _any_ user-facing change?
Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ
format, you should set VLLM_ASCEND_ENABLE_NZ=1.

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-14 17:39:26 +08:00
elilzhu
5c45c227dc [BugFix] fix qwen2.5vl quant bug (#3426)
### What this PR does / why we need it?
This PR fixes issues:
1. Resolve the issue of qwen2.5-VL quantization service startup failure:
AttributeError, 'Parameter' object has no attribute 'weight_loader'.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
- ci & e2e
- vLLM version: v0.11.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: elilzhu <2435754260@qq.com>
2025-10-14 17:31:26 +08:00
whx
ee25a517d1 [BugFix] Fix the port conflict bug of running external dp with disaggregated-prefill. (#3416)
This PR fixes the port conflict bug of running external dp in
disaggregated-prefill scenario.

- vLLM version: v0.11.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-14 16:37:10 +08:00
XiaoxinWang
9eb62935b8 fix pagedattention to support fullgraph. (#3436)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-14 16:10:09 +08:00
Yizhou
4536123341 [Fix] Fix mc2_tokens_capacity-related issues (#3411)
### What this PR does / why we need it?
Replaces the hardcoded `mc2_tokens_capacity` with the max graph capture
size for a more accurate allocation.

This change ensures the capacity is correctly sized relative to the
graph capture configuration, removing a magic number and making the
setup more robust.

This PR fixes two issues:

1. <del>MC2 op restrictions differ between SoCs.</del> @Angazenn This
requires an overhaul, hence removed from this PR, please commit another
PR.
2. The hardcoded value `512` allocates too much buffer for large models.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Tested in daily checks.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-14 10:56:12 +08:00
Slightwind
4f6d60eb06 [Feature] Add W4A4 Flat Quantization support (#3427)
Introduce W4A4 Flat Quantization for better model compression and
inference efficiency on Ascend devices.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-10-13 23:20:16 +08:00
weijinqian0
6972df5951 [Feature] optimize sp & qwen3 next support sp. (#3225)
This PR will accomplish the following tasks: 
**optimize SP**
In the old version implementation, the first layer was all_reduce, which
used rms to split chunks. We changed it to perform reduce_scatter on the
embedding side, replace one all_reduce operation and one chunk with one
reduce_scatter operation.
**Support qwen3 next**
Since Qwen3 Next includes a linear attention module, the prefix name of
this module cannot take effect directly.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-10-13 23:02:12 +08:00
realliujiaxu
31682961af [Feat] enable hierarchical communication for mc2 ops on A2 (#3015)
Currently, when in A2, setting the environment variables
`HCCL_INTRA_PCIE_ENABLE=1` and `HCCL_INTRA_ROCE_ENABLE=0` can reduce
cross-machine communication traffic and significantly improve
communication performance.

For more details, please refer to
[document](https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_moe_distribute_dispatch_v2.md)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-13 16:13:17 +08:00
lidenghui1110
0563106477 [Feature] mooncake connector support GQA transport (#2947)
### What this PR does / why we need it?
The previous implementation of the Mooncake connector only supported
scenarios where the Tensor Parallel sizes for the Prefill and Decode
phases were the same for MLA and GQA/MHA.

For heterogeneous TP scenarios, a single rank on a decode node needs to
pull the KV cache from multiple ranks on the prefill nodes and then
merge them (only support prefill TP >= decode TP now). During this
merge, a transpose operation is required because the layouts of the KV
caches are different. To minimize transpose overhead, we use the
npu_paged_cache_load operation to extract the blocks corresponding to
the request from the KV cache. After performing the transpose, we use
_npu_reshape_and_cache to write the blocks back to their original
positions.

This process is illustrated in the diagram below.

b means block_size, this diagram illustrates transpose kv cache layout
for one block. In the implementation, we transpose kv cache by layer for
one request.

<img width="1464" height="916" alt="image"
src="https://github.com/user-attachments/assets/09d96a98-e41c-4733-9535-05544163081a"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested

- vLLM version: v0.11.0
---------

Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: chenxiao <cx02308786@antgroup.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
2025-10-13 15:48:37 +08:00
dsxsteven
847d12a389 [BugFix]Fix moe load problems in torchair when using dynamic eplb (#3381)
### What this PR does / why we need it?

When using dynamic eplb, moe load is not imported. We fix this problem
by modifying the return value of hidden states in torchair.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
DeepseekV3 in A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
2025-10-13 11:38:57 +08:00
Mercykid-bash
ecb1713dfc Bugfix: Expose the user policy type interface (#3336)
This PR primarily focuses on two key changes:
1. Adjusts internal interface calls to optimize the interaction logic
between related modules.
2. Exposes an interface that allows users to select the EPLB algorithm,
enabling more flexible configuration based on specific usage scenarios.

These changes aim to enhance the usability of the system while ensuring
the stability of internal operations. Relevant unit tests have been
updated to cover the modified logic.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
2025-10-11 16:28:57 +08:00
linfeng-yuan
e4acb2dfc7 [feat] support customized and separated hccl_buffer_size for process group initialization (#3073)
### What this PR does / why we need it?
Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to perform mc2
operators (dispatch and combine) while running moe models with large
`ep_size` and `batch_size`. This environmental variable not only affects
allocated VRAM for mc2 group, but also increases VRAM allocation for dp,
tp & ep groups, leading to significant kvcache and free_memory drops.
This PR supports to automatically calculate and set `hccl_buffer_size`
for each process group **(except mc2 group)** separately when users set
`HCCL_BUFFSIZE` for mc2 group. This can significantly reduce wasted
buffer_size set for dp, tp & ep groups.

Note that current mc2 operators can only perform communication space
partitioning based on `HCCL_BUFFSIZE` configuration. Once they support
`hccl_buffer_size` configuration with `pg_options` while initializing
process group, we'll caculate the required buffer size and users would
avoid set `HCCL_BUFFSIZE` themselves.

### Does this PR introduce _any_ user-facing change?
No. 

### How was this patch tested?
We performed E2E serving with deepseek_r1 initializing DP/TP/EP/MC2
process group and observed significant kv_cache and free_memory
increase!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-11 15:55:22 +08:00
offline893
82b6c846ca [BugFix]Fix eplb problems when using dynamic eplb. (#3364)
### What this PR does / why we need it?
When using dynamic eplb,it will be blocking by nz tensor.We fix these
prolems by clone src tensor and recv tensor.

### Does this PR introduce any user-facing change?

### How was this patch tested?
Qwen3_moe in A3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-11 14:04:02 +08:00
wangxiaoteng888
ca05f7d632 [Bugfix] TP size larger than KV cache head causes accuracy issues (#3366)
### What this PR does / why we need it?
Resolve the issue where, in the case of unequal TP (Tensor Parallelism),
the TP size is larger than the number of model attention kvcache heads,
causing the KV cache to generate duplicates, which leads to transmission
errors in the original code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-10-11 11:22:23 +08:00
无脸男
ace300a549 [Bugfix] Fix the abnormal NPU memory usage in full graph mode. (#3331)
### What this PR does / why we need it?

In the full graph mode, since paged attention operators updates are
required, the parameters of this operators needs to be retained.
However, the tensor such as query、key cache、value cache, does not need
to be persistently saved, and we can manually release this space by
`weak_ref_tensor` to save the memory.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: WithHades <244036962@qq.com>
2025-10-11 10:20:10 +08:00
Ruri
866f5e7283 [Bugfix] Fix weight prefetching AssertionError in W8A8 MTP scene (#3361)
### What this PR does / why we need it?

- Fix `AssertionError` of `weight_prefetch_method` in W8A8 MTP scene
- Remove hard-code key
(https://github.com/vllm-project/vllm-ascend/pull/3146#discussion_r2416644010)

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?
`weight_prefetch_method is None` (tested on DeepSeek-R1-w8a8mix_MTP)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-10-11 09:24:02 +08:00
Peipei
8c1a4dedf3 [Bugfix]modify the enable range of _merge_multimodal_embeddings patch (#3360)
### What this PR does / why we need it?
Modify the enable range of _merge_multimodal_embeddings patch. The
current patch is only enabled for offline inference on the platform. For
online serviceization, due to the addition of the worker sub-process, it
is not enabled within the sub-process.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: booker123456 <945658361@qq.com>
2025-10-11 08:37:07 +08:00
Angazenn
27e0f2c035 [Perf]Add YaRN custom op (#3355)
### What this PR does / why we need it?
YaRN scaling is used to improve long seq accuracy for models like Qwen3.
In vLLM, YaRN scaling refers to `YaRNScalingRotaryEmbedding` class which
inherits from original `RotaryEmbedding`. Although
`YaRNScalingRotaryEmbedding` does not rewrite the `forward` function of
`RotaryEmbedding` , using YaRN on npu still run into the native
implementation of foward in `RotaryEmbedding`, rather than forward_oot
in vLLM-Ascend. Thus I register another custom op here to enable the oot
implementation for YaRN in vLLM-Ascend, similar to #3151 .

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-10-11 08:36:20 +08:00
zouyida2052
ee0a95e47f bugfix for mtp when running torchair in a2 (#3354)
### What this PR does / why we need it?
when ops torchair_fused_experts_with_mc2 is called, we need pass a tp
group, but now it only pass when quantized scenario, we need also pass
it when unquantized.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-10 23:07:24 +08:00
lilinsiman
90e00deaa9 [Bugfix] Optimized exception throwing when stream captures exception (#3322)
### What this PR does / why we need it?
Optimized exception throwing when stream captures exception, resolved
possible misleading.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-10 17:09:28 +08:00
panchao-hub
1756efa5fd [Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125)
### What this PR does / why we need it?
Adds support for capturing the Multi-Layer Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.

Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.

Known issues:
- Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're
working on a fix
- We are preparing to remove update_mla_attn_params with
auto_dispatch_capture

### Does this PR introduce _any_ user-facing change?
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-10 16:31:20 +08:00
wangxiyuan
ba19dd3183 Revert PTA upgrade PR (#3352)
we notice that torch npu 0919 doesn't work. This PR revert related
change which rely on 0919 version.
Revert PR: #3295  #3205  #3102 

Related: #3353

- vLLM version: v0.11.0
2025-10-10 14:09:53 +08:00
MengLong Chen
6ae75933da [Feat] Load balance of tokens across experts in dummy_run (#3184)
### What this PR does / why we need it?
Due to the special input data during the dummy run, the majority of
tokens are distributed on DP0TP0, which results in insufficient
available KV cache on DP0TP0.
This PR changes the `topk_ids` of the dummy_run input from all zeros to
random values.
This is a naive implementation for experts load balance so as to avoid
accumulating too much tokens on a single rank.

### How was this patch tested?
model: DeepSeek-v3-w8a8
```bash
vllm serve DeepSeek-v3-w8a8 \
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 2 \
    --tensor-parallel-size 8 \
    --quantization ascend \
    --seed 1024 \
    --enforce-eager \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --disable-log-stats \
    --max-num-seqs 18 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --additional-config \
    '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}' 
```

The Available memory: **2728672256** -> **6771544064**
KV Cache size: **38144** -> **95232** tokens

After enabling load balance


- vLLM version: v0.11.0

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-10-10 09:00:07 +08:00
XiaoxinWang
579b7e5f21 add pagedattention to support FULL_DECODE_ONLY. (#3102)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-10 08:50:33 +08:00
offline893
1c2c72af8d [bugfix]change log2phy map to npu (#3339)
### What this PR does / why we need it?
Resolved the issue of EPLB failure caused by changes in the log2phy map
due to device type modifications when using MTP rotation position
encoding.

### Does this PR introduce any user-facing change?

### How was this patch tested?
https://github.com/vllm-project/vllm/commit/releases/v0.11.0


- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-10 08:47:55 +08:00
fems14
55e23fabec 【bugfix】fix connector register failed (#3335)
### What this PR does / why we need it?
Register the connector in the plugin
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-09 21:09:54 +08:00
Ruri
ff37575936 [1/N][Feat] Add weight prefetch feature for Attention layers (#3146)
### What this PR does / why we need it?

- Refacotr and integrate a unified `WeightPrefetchMethod`
- Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention
modules
- Prefetching these weights ahead of matmul-like operators imporves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Add a new config in `--additional-config` for configuration:
```json
{
    "weight_prefetch_config": {
        "enabled": false,
        "prefetch_ratio": {
            "attn": {
                "qkv": 1.0,
                "o": 1.0,
            },
        },
    },
}
```
This feature is enabled by default, and can be disabled through this
configuration

### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Co-authored-by: yuzhup <15705211260@163.com>
2025-10-09 20:38:39 +08:00
huangdong2022
23db56a340 [Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with norm bias (#3205)
### What this PR does / why we need it?
1. qwen3 moe uses add_rms_norm_quant op instead of 'add_rms_norm op and
quant op' during quantization scene.
2. torch_npu.add_rms_norm_quant op fixed accuracy while model weights is
quantized by anti_method m4, m4 quantization is asymmetric outlier
suppression method, it will generate none-zero norm bias,
add_rms_norm_quant op updated to add this parameter to calculate.

### Does this PR introduce _any_ user-facing change?
please use a torch_npu version >= torch_npu-2.7.1.dev20250919

### How was this patch tested?
1. no special parameters to set, no new envs to set.
2. use qwen3 moe quantization model to test ,such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8,
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Signed-off-by: h30027576 <huangdong51@huawei.com>
2025-10-09 20:18:10 +08:00
zouyida2052
81aff9c555 bugfix for mtp (#3300)
### What this PR does / why we need it?
when mtp>1, we need refresh cos ans sin in each step.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.11.0

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-09 19:22:46 +08:00
Wang Yixuan
30c5d947c3 [bugfix]fix multistream moe in torchair (#3164)
### What this PR does / why we need it?

the multistream moe in tochari only validate in decode, but can't be
applied to chunked prefill, So add some judgments to isolate the
scenario

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-09 19:00:32 +08:00
weichen
94dd832815 [MoE] [Refactor] Combine common_fused_moe and fused_moe (#3176)
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:

1. Enable/Disable EP
3. Aclgraph & eager
4. SP


- vLLM version: v0.11.0

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-10-09 14:12:46 +08:00
Li Wang
a36e3da78e [Misc] Drop 0102 related lines (#3323)
### What this PR does / why we need it?
Since https://github.com/vllm-project/vllm-ascend/pull/3284 merged,
should discard some extra code that was previously done for version
compatibility

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-09 14:10:57 +08:00
wangxiyuan
1c5b302f0d [Misc] Clean up useless patch (#3320)
### What this PR does / why we need it?
1. clean up v0.10.2 support in ut and e2e test
2. remove v0.11.0 period job, we're at v0.11.0 now.
3. remove uesless patch for deepseek v3.2. They have been done in vLLM
already.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 14:07:26 +08:00
wangxiyuan
a43e2f61e1 [CI] Update vLLM to v0.11.0 (#3315)
### What this PR does / why we need it?
There are 3 step to upgrade vllm-ascend to newest vllm. We'll create 3
PR

- [x] Upgrade vllm to v0.11.0 to make CI happy first .
- [ ] Move deepseek v3.2 to vllm way
- [ ] Then we'll add a new PR to add vllm main support.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:41:19 +08:00
wangxiyuan
f12f76d7ba Drop 0.10.2 (#3284)
Drop v0.10.2 support, we support vLLM 0.11.0rc3 now.
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:28:38 +08:00