590 Commits

Author SHA1 Message Date
weijinqian0
b4bf01ead1 [Refactor] Remove redundant attention operator branches. (#4531)
[Refactor] Remove redundant attention operator branches.

Reason:

We replace other attention ops with fused_infer_attention_score expect
decode_only state.
clean code and remove 310P support.

https://github.com/vllm-project/vllm-ascend/pull/4455


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-02 09:13:26 +08:00
MengLong Chen
143e1f46d0 [Feat] shared expert dp for deepseek_mtp (#3811)
### What this PR does / why we need it?
Support shared expert DP for deepseek_mtp feature. 
`shared_expert_dp` requires `SP==True`, with corresponding parameter
restrictions.
Previously, due to the coupling between `shared_expert_dp` and torchair,
and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of
deepseek_mtp was temporarily removed.
Currently, by performing the `reduce_scatter` on the input of
deepssek_mtp in `mtp_proposer.py`, we ensure that it matches the
dimensions of `input_embedding`, and then perform the `all_gather` on
the output of mtp.

### How was this patch tested?
baseline:
<img width="1184" height="692" alt="image"
src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3"
/>

enable shared_expert_dp and multistream_overlap_shared_expert:
<img width="1167" height="687" alt="image"
src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc"
/>

TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Signed-off-by: zengran <zengran2@huawei.com>
Co-authored-by: zengran <zengran2@huawei.com>
2025-12-01 20:44:11 +08:00
zzhxxx
2b82320b66 [Bugfix] Fix bug with establishing the flashcomm2 and pp communication domains. (#4458)
### What this PR does / why we need it?
The previous implementation of the flashcomm2 communication domain did
not consider pp(pipeline parallel), which caused problems when enabling
pp and flashcomm2. This PR fixes this issue.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-01 15:56:22 +08:00
欧派果奶我还要
bc67696a02 [EPLB][Ops] Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB (#4216)
### What this PR does / why we need it?
Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list into
dynamic EPLB to support list-type parameters
This PR also modify the logic of loading model in dynamic-eplb scenario.
The operator is based on this pr:
https://github.com/vllm-project/vllm-ascend/pull/3804

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

```
vllm serve /home/weight/DeepSeek-V3.1_w8a8mix_mtp \
    --max_num_seqs 8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --served-model-name ds_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --no-enable-prefix-caching \
    --port 8999 \
    --quantization "ascend" \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --compilation_config '{"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
    --additional-config='{"dynamic_eplb":true, "num_iterations_eplb_update":100, "num_wait_worker_iterations":100}'
 
```
input&output: 2k 2k
This PR:
<img width="1318" height="695" alt="fusion"
src="https://github.com/user-attachments/assets/f8657813-0c02-42f4-8396-d99e730f48cd"
/>

Baseline:
<img width="1323" height="690" alt="baseline"
src="https://github.com/user-attachments/assets/e1323a78-af26-4523-820c-e20e5642a38e"
/>


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <845473182@qq.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
2025-11-30 22:52:05 +08:00
Mengqing Cao
517fd9272d Revert "drop ascend scheduler" (#4580)
Reverts vllm-project/vllm-ascend#4498
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
2025-11-29 22:20:48 +08:00
wangxiyuan
1874265074 Move mla to ops module (#4575)
Move mla custom op to correct module
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:36:55 +08:00
Shanshan Shen
2a19215e5f [MM][Model] Remove Qwen2-VL modeling files (#4534)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-29 18:07:01 +08:00
wangxiyuan
f10acddb78 drop ascend scheduler (#4498)
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.

Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.

- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 16:18:34 +08:00
Ting FU
9af34755ff [Bugfix] Fix model run _npu_flash_attention hang issue (#4410)
Fix model run _npu_flash_attention in _forward_prefill_no_cache hang
issue, it was caused by wrong attention mask dtype.
### How was this patch tested?
Yes, tesed on Qwen2.5-VL and Qwen2.5-Omni

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-29 09:20:22 +08:00
fems14
5447a039b9 [Feature][main]reconstruction kvpool connector to ascend connector (#4438)
### What this PR does / why we need it?
1.In short, we renamed the existing MooncakeStoreConnector to
AscendStoreConnector and extracted the storage engine interaction logic
into a new Backend class.
Associated RFC:https://github.com/vllm-project/vllm-ascend/issues/4329
2.Fixed the issue where the number of input parameters for the connector
was incorrect, introduced in vllm 0.11.2
### Does this PR introduce _any_ user-facing change?
change MooncakeStoreConnector to AscendStoreConnector
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-11-28 18:08:37 +08:00
Shanshan Shen
e52ebf8674 [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349)
### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at https://github.com/vllm-project/vllm/pull/29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for https://github.com/vllm-project/vllm/pull/28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark


- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-28 14:23:00 +08:00
LHXuuu
bdc66972db [Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weight.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
4. Support CompressedTensorsW8A8Dynamic weight.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
5. Modify the override_quantization_method in AscendQuantConfig.

Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
2025-11-28 14:09:39 +08:00
Zhu Yi Lin
755b635844 [TEST] Add eagle proposer ut (#4447)
### What this PR does / why we need it?
Add eagle proposer ut

- vLLM version: v0.11.2

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-27 21:59:31 +08:00
Nengjun Ma
89a1a65300 [bugfix] fix ray start failed: local_world_size cannot little than visible device count error (#4457)
### What this PR does / why we need it?
Fix the ray start failed bug: local_world_size cannot little than
visible device count error
detail see issue #4456.

This fix code is copied from vllm fixing modify, PR:
[#28873](https://github.com/vllm-project/vllm/pull/28873)


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-27 21:18:32 +08:00
zhangxinyuehfad
84d7f5a10d [UT] Fix ut test (#4472)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-26 21:37:47 +08:00
zzzzwwjj
136ea9ff56 [refact] unified soc_version code (#4359)
### What this PR does / why we need it?

Currently, there are two paths to judge the chip type in code,
`get_ascend_soc_version` use `get_soc_version` api in torch_npu, and
`is_310p` `use _build_info.__soc_version__`, which generate when
install. We need to unify the two paths.

We need to unify these codes based on the following points:

1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.

Based on the above points, we have made the following changes:

1. When user set env `SOC_VERSION`, use it; when not set, query
soc_version by `npu-smi`;
2. generate device_type based on soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In running state, use `__device_type__` to judge code branch.

### Does this PR introduce _any_ user-facing change?

When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default,
we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in
the list `soc_to_device` in `setup.py`.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-26 14:28:55 +08:00
wangxiyuan
bc69d7cfe1 upgrade to vllm 0.11.2 (#4400)
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
https://github.com/vllm-project/vllm/pull/26866
2. get_mrope_input_positions is broken by
https://github.com/vllm-project/vllm/pull/28399
3. graph mode is broken by
https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
https://github.com/vllm-project/vllm/pull/27583
5. `get_attn_backend_cls` and attention backend is broken are broken by
https://github.com/vllm-project/vllm/pull/28534
6. spec decode is broken by
https://github.com/vllm-project/vllm/pull/28771
7. sp feature is broken by
https://github.com/vllm-project/vllm/pull/27126
8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922
9. lora is broken by https://github.com/vllm-project/vllm/pull/21068
10. execute_model is broken by
https://github.com/vllm-project/vllm/pull/26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
https://github.com/vllm-project/vllm/pull/28159
12. kv cahe is broken by https://github.com/vllm-project/vllm/pull/27753
13. dp is broken by https://github.com/vllm-project/vllm/pull/25110

 
What's broken and changed by ourself:
1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend
scheudler later.
4. qwen3-next is broken by
https://github.com/vllm-project/vllm/pull/28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache+ ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: shen-shanshan <467638484@qq.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-11-26 11:48:58 +08:00
Zhu Yi Lin
1b137d6b1b [TEST] Delete Comment (#4427)
### What this PR does / why we need it?
Delete useless comments.
### Does this PR introduce _any_ user-facing change?
No

- vLLM main:
2918c1b49c

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-25 21:39:04 +08:00
weijinqian0
ae068a3342 [Refactor] remove moe type of multicast. (#4224)
The main purposes of this PR are as follows: 
1. Remove the multicast-related code; 

Reason:
1. In the scenario like a2 Dual-System Back-to-Back Networking,the
performance is worse than all_gather. Before the modification, in e2e
test, it was 3 tps; after the modification, it is 10 tps.
2. At the same time, we usually enable the SP feature,it is consistent
with the current logic.
3. The advantage of broadcast communication lies in the fact that it
does not suffer from uneven DP load and does not require the prefill ACL
graph to be enabled. But we support prefill Acl graph recently.

So we think there is no need to maintain the multicast as one choice in
moe communication.

Performance benefits are as follows:
When not enable_flashcomm1, TTFT remains relatively stable at around
43000ms, which is approximately 15000ms faster than before the
modification.

When enable_flashcomm1, there is no diffenence, TTFT remains relatively
stable at around 29000ms.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-11-24 17:32:37 +08:00
wangxiyuan
a1f142b7ad Drop 0.11.0 support (#4377)
There is a lot hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-24 17:08:20 +08:00
wangxiaochao
3deeea14a0 [bugfix] bugfix for PD disaggregate (#4319)
This PR is used to fix mooncake_connector in pcp/dcp case. When
executing function update_done_task_count, it is necessary to ensure
that both pcp/dcp and TP ranks have finished transferring KV cache.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2025-11-21 18:08:56 +08:00
CodeCat
e332e27ec3 [Test] Add ut test for torchair (#4287)
### What this PR does / why we need it?
The current community lacks unit tests (UT) for files such as
torchair_worker, mtp_proposer, and model_runner. Therefore, UT coverage
for these files needs to be added.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
2025-11-21 16:33:34 +08:00
LI SHENGYONG
019c7ded91 eplb redundant expert bugfix (#4291)
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure iinit_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:24:35 +08:00
InSec
5a4e8cdeba [Feat][BugFix]Support the Qwen3-Next-80B-A3B-Instruct quantization model&Fix the NZ issue (#4245)
### What this PR does / why we need it?
Support the Qwen3-Next-80B-A3B-Instruct quantization model and Fix the
NZ issue. Triton kernel doesn't support data format nz, thus we skip
converting weight to nz on layer `conv1d`

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: IncSec <1790766300@qq.com>
2025-11-21 10:42:56 +08:00
Zhu Yi Lin
d96d5fa971 [Test] quick fix mla ut (#4318)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-20 23:06:12 +08:00
anon189Ty
5c9f4a40c6 [Feat] Support MTP to running in full graph mode (#3892)
### What this PR does / why we need it?
Currently, the MTP model still runs in eager in full graph mode. This PR
adapts the MTP with the full graph capture and execution. When the graph
mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to
improve the performance.

The change in both disable_padded_drafter_batch is True and False case
include:

1. Add _mtp_graph_params in acl_graph.py to isolate the data of main
model and the data of MTP.
2. Padding some metadata in mla_v1.py when in fullgraph mode.
3. Fixed the essential data address that will be used in model.forward.
4. Adapted according to the aclgraph capture framwork:
    1). Rebuild MTP model with ACLGraphWrapper.
    2). Add common attn metadata when start capture in MTP dummy_run.
    3). Add common attn metadata update in MTP.
    4). Addapted data update when num_speculative_tokens > 1.
5. Add a patch of MTP to adapt vllm v0.11.0.

Existing Issues:
1. When disable_padded_drafter_batch=True and running in FullGraph mode,
the data of the first-round requests in MTP is abnormal. We need to
identify the cause subsequently.
2. When disable_padded_drafter_batch=False and running in FullGraph
mode, the acceptance rate of the second and third tokens will decrease
(For example, if we set the num_speculative_tokens=3, the acceptance
rate of first token is 90%, the second is only 50% lower than 60%, the
third is only 20% lower than 30%). The reason is that the data processed
after the model runs does not match. This is a problem from another PR.
It works fine in eager and PIECEWISE mode, but has problem in FullGraph
mode. Once we have a solution, we will submit a bugfix.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-11-20 20:34:54 +08:00
Zhu Yi Lin
15c1eb025c [CI] Add mla ut (#4280)
### What this PR does / why we need it?
add mla_v1.py and mla.py ut
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`pytest tests/ut/attention/test_mla_v1.py`
`pytest tests/ut/models/test_mla.py`

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-20 20:29:09 +08:00
shaopeng-666
3653f33878 avoid mrope fusion op when running qwen2.5-vl on a+x machine (#4270)
### What this PR does / why we need it?
avoid mrope fusion op when running qwen2.5-vl on a+x machine
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Test text VQA accuracy on G8600 with aisbench

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-11-19 22:31:14 +08:00
欧派果奶我还要
c848da0687 [Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223)
### What this PR does / why we need it?
fix nightly multi-node EPLB tests by adjusting
vllm_ascend\eplb\core\eplb_utils.py dynamic_eplb gate checking
### Does this PR introduce _any_ user-facing change?
no

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 21:31:58 +08:00
wangxiyuan
97daf7f78c [misc] clean up get_metadata_cls (#4276)
Follow up #4087 to remove get_metadata_cls 
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-19 17:18:19 +08:00
wangxiyuan
2938bd5ad2 remove get_metadata_cls (#4087)
remove get_metadata_cls. It's only used for V0 engine and has been removed from vLLM already.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-19 14:58:17 +08:00
1092626063
9328f377b4 [refactor]support gatingtopk operator generalization (#2958)
### What this PR does / why we need it?

Past:
npu_moe_gating_top_k can only support 'group_count=256' pattern

Now:
1、npu_moe_gating_top_k support all size of group_count
2、the functionality of `torch_npu.npu_moe_gating_top_k_softmax` are
included in `torch_npu.npu_moe_gating_top_k`

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8, TPS improve 6%
2. Qwen3, the same as before

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-11-19 10:38:56 +08:00
wangxiaochao
0d04ad8c8f [feature] Mooncake_connector support pcp/dcp (#4183)
add feature for Mooncake_connector supporting pcp/dcp

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2025-11-18 10:17:48 +08:00
Angazenn
10a046ddce [main][misc]change default capture size for Qwen3-MoE when using full dp (#4199)
### What this PR does / why we need it?
Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4 ,8
,16 ,24 ,... , max_capture_size]`. However, this is not always the best
choice on different situations. This PR aims to change the default
setting when running Qwen3-MoE on full dp (`dp_size > 1` && `tp_size ==
1`) setting, which is usually applied in Large-Scale EP.
old :
`[1, 2, 4 ,8 ,16 ,24 ,... , max_capture_size]`
new:
`[1, 2, 5 ,10 ,15, 16 ,24 ,... , max_capture_size]`
This is mainly because the performance of `_npu_paged_attention` op
degrades dramatically on old settings. We hope to provide better
performance if users do not set specific `cudagraph_capture_size`.
### Does this PR introduce _any_ user-facing change?
The default `cudagraph_capture_size` is modified in above cases.
However, if `cudagraph_capture_size` has already set by users, this PR
won't have any influence on this.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-18 08:41:45 +08:00
XiaoxinWang
e38ef2c434 support FULL graph mode for GQA (#3970)
### What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which
enables full graph execution during the decode. This PR extends support
to allow full graph execution in both the prefill and decode, referred
to as FULL graph mode.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-17 10:50:35 +08:00
LookAround0301
5ec96fd46c [long_seq_Feat] support chunk prefill (#4158)
### What this PR does / why we need it?
1、qwen GQA attention_v1 optim
2、DeepSeek MLA refactor, all gather q -> all gather kv 
3、modelrunner refactor for chunk prefill, we remove some code not use

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
2025-11-14 08:43:37 +08:00
CodeCat
49818dbbed [Test]Add ut test qwen3_moe and sfa (#4121)
### What this PR does / why we need it?
Currently, the UT tests lack coverage for the Qwen3_moe network and
torchair_sfa. Therefore, supplementary tests are being added.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by CI

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
2025-11-13 16:09:22 +08:00
realliujiaxu
5093192769 [Bugfix] fix mtp profile run error where main model and mtp model use different quantization (#4102)
### What this PR does / why we need it?
In PR https://github.com/vllm-project/vllm-ascend/pull/3420, we
initially placed the quantization type (quant_type) in the MoECommMethod
class. However, since MoECommMethod follows a singleton pattern, it
couldn't accommodate scenarios where different layers in the model might
use different quantization approaches (e.g., MTP modules using
floating-point computation while the main model employs quantized
computation).
In this PR, we've moved the quantization type to the AscendFusedMoe
class and pass it as a parameter to MoECommMethod.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
export HCCL_BUFFSIZE=1024
export VLLM_VERSION=0.11.0

vllm serve /home/data/DeepSeek-R1_w8a8/ \
 --data-parallel-size 2 \
 --tensor-parallel-size 8 \
 --enable-expert-parallel \
 --served-model-name dsv3 \
 --max-model-len 32768 \
 --max-num-batched-tokens 4096 \
 --max-num-seqs 16 \
 --quantization ascend \
 --trust-remote-code \
 --gpu-memory-utilization 0.9 \
 --speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}'
```


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-13 11:02:31 +08:00
22dimensions
c272747d13 Upgrade to 0.11.1 newest vllm commit (#3982)
### What this PR does / why we need it?
adapt vllm-ascend main branch with vllm releases/v0.11.1

fix `forward context not set` in test_vlm.py caused by:
https://github.com/vllm-project/vllm/pull/23207

fix import `cdiv round` failed caused by:
https://github.com/vllm-project/vllm/pull/27188

fix import `init_cached_hf_modules` failed caused by:
https://github.com/vllm-project/vllm/pull/27567

adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused
by: https://github.com/vllm-project/vllm/pull/27654
- remove unused code in sigmoid_gating.py
- `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`,
`fused_recurrent_gated_delta_rule_fwd`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI 


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-11-12 23:01:19 +08:00
zhangsicheng5
a123f355e9 [feature] support pcp + mtp (in pd co-locate scenario) (#4098)
1. support pcp + mtp in pd co-locate scenario
2. llmdatadist connector pcp related bugfix and cleancode

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-11-12 17:22:21 +08:00
zzhxxx
46a41b26d3 oproj TP support acl graph (#4073)
### What this PR does / why we need it?
Reference #2167 and orpoj TP supports ACL graph.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-11-11 19:39:06 +08:00
wangxiyuan
f811a24bf0 Remove VLLM_USE_V1 (#4086)
Drop VLLM_USE_V1 usage.  This env has been removed from vLLM already.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 15:43:39 +08:00
Apocalypse
71866d5311 [feature] chunkprefill support pcp&dcp (#3801)
### What this PR does / why we need it?
ChunkPrefill now can support Long Sequence Feature Pcp&Dcp

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI tests passed with self-test


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <3834144971@qq.com>
2025-11-11 09:18:02 +08:00
Icey
e04a87f4be [BugFix] Fixes Qwen3-Next enable nz accuracy problem (#4058)
### What this PR does / why we need it?
- Fixes Qwen3-Next enable nz accuracy problem

### Does this PR introduce _any_ user-facing change?
N/A


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-11-10 20:54:57 +08:00
rjg-lyh
a1558b99c2 [Core] Restore scheduling logic under default configuration (#3967)
### What this PR does / why we need it?
This PR reverts the changes introduced in PR #2894 Initially, due to
performance issues with the older version of the chunked prefill ops,
the default behavior was to use the Ascend scheduler to disable the
chunked prefill feature. However, with the improvements in the
performance of the new chunked prefill ops, this interception strategy
has been removed. This change also aligns with the community's default
configuration behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-11-10 17:48:56 +08:00
Levi
0a62e671fb [Feat] flashcomm_v2 optim solution (#3232)
### What this PR does / why we need it?
Supports generalized FlashComm2 optimization, which reduces
communication overhead, decreases RmsNorm computation, and saves one
AllGather step by replacing Allreduce operations in the Attention module
with pre-AlltoAll and post-AllGather operations (used in combination
with FlashComm1). This feature is enabled during the Prefill phase and
is recommended to be used together with FlashComm1, delivering broad
performance improvements, especially in long sequence scenarios with
large tensor parallelism (TP) configurations. Benchmark tests show that
under TP16DP1 configuration, it can improve the prefill performance of
the DeepSeek model by 8% on top of FlashComm1.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zzhxx <2783294813@qq.com>
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: zzhxx <2783294813@qq.com>
2025-11-10 11:01:45 +08:00
hucong
48094148f8 [BugFix] Improve the performance of prefixcache features (#4022)
### What this PR does / why we need it?
The code bug caused an empty bubble. When the npu_paged_cache_load
operator was called, it forcibly transferred seq_len2 to the device,
which triggered synchronization and interrupted the CPU operator's
launch stream.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-11-08 18:45:31 +08:00
zxr2333
1d81a289d0 [P/D][BugFix]Fix proxy format processing errors & Layerwise connector performance optimization (#4043)
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-11-08 18:44:06 +08:00
wangx700
24d6314718 [Bugfix] fix sleepmode level2 e2e test (#4019)
### What this PR does / why we need it?

enable sleepmode level2 e2e test and add the check logic to ensure the
nz is not enabled.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

use e2e tests


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangx700 <wangxin700@huawei.com>
2025-11-08 14:11:55 +08:00
drslark
23b785fdfb [Feat] Adapted mtp function to Qwen3-next (#3918)
### What this PR does / why we need it?

Adapts mtp function to Qwen3-next.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-07 16:39:03 +08:00