Commit Graph

1063 Commits

Author SHA1 Message Date
zhaozx-cn
923cdaeba3 fix ascend fused moe spelling error (#2863)
### What this PR does / why we need it?
fix ascend fused moe spelling error

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

0ae43dbf8c

- vLLM version: main
- vLLM main:
fcc0a3130a

Signed-off-by: zhaozixin <zhaozixin1@huawei.com>
Co-authored-by: zhaozixin <zhaozixin1@huawei.com>
2025-09-11 14:35:46 +08:00
zhaozx-cn
b9a0a75c78 fix qwen torchair attention PrefillCacheHit (#2787)
### What this PR does / why we need it?
Fix qwen torchair attention PrefillCacheHit 
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vLLM version: v0.10.1.1
vLLM main:
e599e2c65e

- vLLM version: main
- vLLM main:
0b9a612fa3

Signed-off-by: zhaozixin <zhaozixin1@huawei.com>
Co-authored-by: zhaozixin <zhaozixin1@huawei.com>
2025-09-11 14:26:59 +08:00
anon189Ty
7b2ecc1e9a [Feat] Unquantized linear nz support (#2619)
### What this PR does / why we need it?
Currently, when executing to the Linear layer of the model in
vLLM-Ascend, the weights input format is ND in unquantized case and
skipped ascend case, which is slower than FRACTAL_NZ.
This PR supplements the execution logic for Linear layer. When
VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and CANN version is 8.3, the weights
of the Linear layer will be converted to FRACTAL_NZ, in both unquantized
case and skipped ascend case.

- vLLM version: main
- vLLM main:
267c80d31f

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-09-11 11:40:00 +08:00
liziyu
5691104249 LLMdatadist connector adapt the distributed KV aggregation (#2718)
### What this PR does / why we need it?
LLMdatadist connector adapt the distributed KV aggregation for the main
branch. Change the P node from returning "finish sending" only when TP0
responds to returning "finish sending" as soon as each NPU receives it.
The D node will send a finish receive signal to the corresponding tp
rank of the P node.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
gsm8k test
2*A3 1P 1D
P: dp2 tp8 D:dp 4 tp4
P: dp2 tp8 D:dp 2 tp8


- vLLM version: main
- vLLM main:
cc99baf14d

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-11 11:37:41 +08:00
Mengqing Cao
c2fdd4b8bc [CI/UT] Fix UTs on register customop and warm up model (#2862)
### What this PR does / why we need it?
Fix UTs on register customop and warm up model

### How was this patch tested?
CI passed with existing test.

Co-authored-by: Icey <1790571317@qq.com>

- vLLM version: main
- vLLM main:
cc99baf14d

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-11 11:30:16 +08:00
lilinsiman
b7df04de9b debug_aclgraph_sizes_capture (#2827)
### What this PR does / why we need it?
1. Solved the problem that in the Qwen3 Moe model case, opening DP would
use an extra stream, causing ACLgraph sizes capture error
2. After experimentation, it was found that in many cases, some
operators would occupy more streams than expected. Therefore, the buffer
area for streams in ACLgraph was not large enough. After discussion,
extra 120 streams were added as buffer.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: main
- vLLM main:
0ae43dbf8c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-09-10 22:50:48 +08:00
zhangxinyuehfad
e75b568011 [CI] Update pre_commit runner (#2850)
### What this PR does / why we need it?

Update pre_commit runner

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: main
- vLLM main:
0ae43dbf8c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-10 20:23:25 +08:00
Jiawei Li
b7ee3fdad3 [Code clean] Remove the unnecessary code (#2815)
**Background:**

A dynamic library named vllm_ascend_C.so will be generated when
compiling vLLM-Ascend, so when import vllm_ascend.vllm_ascend_C in
python, the interperter will search vllm_ascend_C.so and try to find a
external symbol named PyInit_vllm_ascend_C which is provided in
csrc/camem_allocator.cpp.

**Conclusion:**

The PyInit__C is redundent.

- vLLM version: v0.10.1.1
- vLLM main:
717fc00e98

Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
2025-09-10 17:19:39 +08:00
huangxialu
88d7af62be [main] adjust the position of warm_up_atb (#2823)
### What this PR does / why we need it?
Adjust the position of warm_up_atb.

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
CI passed with existing test.

- vLLM version: main
- vLLM main:
b23fb78623

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-09-10 14:06:38 +08:00
Li Wang
22b425765a [Bugfix] Fix broken CI (#2825)
### What this PR does / why we need it?
1. Initial support disable tp for integrating with
[vllm-commit](https://github.com/vllm-project/vllm/pull/23024)
2. [vllm@commit](https://github.com/vllm-project/vllm/pull/23673) now
use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add
the integration

- vLLM version: main
- vLLM main:
e40827280b

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-10 13:29:29 +08:00
Icey
aa4d2a91ed Refactor AscendMultiHeadLatentAttention (#2826)
### What this PR does / why we need it?
Register AscendMultiHeadLatentAttention as CustomOP, following vllm changes

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: main
- vLLM main:
b23fb78623

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-10 11:26:11 +08:00
CaranLic
168ad600b5 [main] add pd transfer for ascend scheduler (#2753)
### What this PR does / why we need it?
For offline scenarios, adjust the scheduling process to prioritize the
prefill phase of all requests, then process the decode phase of all
requests.

### How was this patch tested?

```
max_num_seqs=24,
additional_config={
    "ascend_scheduler_config":{
        "enabled": True,
        "enable_pd_transfer": True,
        "decode_max_num_seqs": 24,
        "enable_chunked_prefill": False
    }
},
```
| input | output | num prompts | max_num_seqs | dp | tp | scheduler |
tps |
| ------ | ------ | ---------- | ---------------- | ---- | ---- |
---------------- | --------------- |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | v1 | 234.06 |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | pd transfer | 239.59(+2.4%) |
| dapo-math-17K| 2K | 384 | 24 | 4 | 1 | v1 | 222.85 |
| dapo-math-17K| 2K | 384 | 24 | 4 | 1 | pd transfer | 225.81(+1.3%) |


- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-09-10 08:46:39 +08:00
Mengqing Cao
edf1f600ad [CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840)
### What this PR does / why we need it?
Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1

### Does this PR introduce _any_ user-facing change?
branch main of vllm-ascend will not be compatible with vllm v0.10.1 and
v0.10.1.1

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-10 08:43:10 +08:00
sherie
93e28e6862 add weight transpose check. (#2756)
### What this PR does / why we need it?
In reinforcement learning scenarios, weight updates are required, but
the current inference applies a transpose operation to the weights,
altering their shape. This causes a shape mismatch with the training
weights, triggering an error during weight updates.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-09 20:33:43 +08:00
yiz-liu
e13c4ddb42 [Fix] Fix SharedFusedMoE (#2817)
### What this PR does / why we need it?
Really strange that `register_oot` doesn't work with `SharedFusedMoE`,
so we have to add this patch, for now.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
This PR won't have any effect in DeepSeek since we currently still stick
with the old `CustomDeepseekV2`.

- vLLM version: v0.10.1.1
- vLLM main:
0cdd213641

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-09 18:19:56 +08:00
rjg-lyh
7a205dbaa8 [main] Optimize rope in Qwen Models (#2571)
### What this PR does / why we need it?
Optimize rope by caching sin and cos at the first layer in Qwen Models.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.1.1
- vLLM main:
562663a044

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn>
Co-authored-by: ZYang6263 <zy626375@gmail.com>
2025-09-09 14:28:14 +08:00
wangxiyuan
5bcb4c1528 [CI] Reduce CI time (#2801)
1. Only run light e2e test before the PR is `ready` to reduce CI time.
2. Run full test once the PR is labled `ready` and `ready for test`
3. Run lint job on self host CPU container to avoid waiting much.


- vLLM version: v0.10.1.1
- vLLM main:
6910b56da2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-09 10:52:14 +08:00
rjg-lyh
1bbb20ea13 [main] flashcomm_v1 optim in Qwen Dense Models (#2802)
### What this PR does / why we need it?
Flashcomm_v1 optim in Qwen Dense Models.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
5e537f45b4

Co-authored-by: 1024daniel <xxltju324@gmail.com>
2025-09-08 22:52:24 +08:00
zzzzwwjj
4df8df5b94 [bugfix] fix deepseek rope sincoscache re-generation (#2744)
### What this PR does / why we need it?
The current implementation will result in duplicate generation of
`sin_cos_cache` in rope when `kv_seqlen` > 4k, because the
initialization length of the `sin_cos_cache` is only 4k.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
After this PR merged, sin_cos_cache will not increase in forward func,
so `test_native_rope_deepseek_forward_cache_handling` is not necessary.

- vLLM version: v0.10.1.1
- vLLM main:
60f0843ef8

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-09-08 22:03:34 +08:00
wangxiyuan
7d6d9449a8 [Misc] Move lora patch file into lora module (#2797)
Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:42:12 +08:00
wangxiyuan
85d989a3b9 [Misc] Remove pangu model file (#2798)
vllm-ascend won't contain model file anymore. Now pangu model file has
been moved to torchair module. The origin one can be removed.

Note: After this PR, pangu only works with torchair mode then.
- vLLM version: v0.10.1.1
- vLLM main:
8c892b1831

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:30:37 +08:00
weichen
a041d4f328 [main] [refactor] refactor common_fused_moe.py (#2706)
### What this PR does / why we need it?
1. Move prepare/finalize operation from moe_comm_method to
/ops/moe/fused_moe_prepare_and_finalize
2. Adapt to token_dispatcher in moe_comm_method
3. Move
moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize
to /ops/moe
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-08 20:09:50 +08:00
machenglong2025
1a82b16355 Remove unused code in fused_moe.py (#2805)
### What this PR does / why we need it?
line 408 already declared mc2_mask ,  remove duplicated unused code

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
60f0843ef8

Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
2025-09-08 20:05:19 +08:00
22dimensions
d51694a77b [2/N][Refactor][Quantization] clean quantization patch (#2785)
### What this PR does / why we need it?
quantization patch is unused code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested by CI

- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-08 17:31:53 +08:00
dependabot[bot]
cd88f89267 Bump actions/github-script from 7 to 8 (#2803)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 8.

- vLLM version: v0.10.1.1
- vLLM main:
8c892b1831

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-08 14:53:26 +08:00
realliujiaxu
d3c3538ddc [Bugfix]fix bug when graph_size is not divisible by tp_size (#2719)
### What this PR does / why we need it?
fix https://github.com/vllm-project/vllm-ascend/issues/2702
- A2: skip graph_size update that makes it to tp_size because
dispatch/combine op support different batch size across EP ranks
- A3: add `max_num_reqs = max(new_graph_batch_sizes)` to fix graph_size
and max_num_reqs mismatch

### Does this PR introduce _any_ user-facing change?
Nope
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-08 14:52:33 +08:00
TaoYu Chen
dd087effcc Refector prepare_inputs in model_runner_v1.py (#2750)
### What this PR does / why we need it?
Refector prepare_inputs in model_runner_v1.py for more easy read.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
PASS CI

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
2025-09-08 10:45:23 +08:00
yiz-liu
c735bb0941 [Fix] Ensure metadata sync across DP ranks in eager mode (#2766)
### What this PR does / why we need it?
Removes the condition that skips metadata synchronization when
`enforce_eager` is enabled.

This change is necessary to correctly sync the `with_prefill` and
`enable_dbo` flags across all data parallel ranks, which is not required
in the base implementation. Forcing the sync operation prevents
potential inconsistencies, albeit with a minor performance impact.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Add a E2E online test case?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-08 09:55:16 +08:00
sherie
2693196ef8 add gatherep select. (#2740)
### What this PR does / why we need it?
add gatherep select.

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-08 09:15:50 +08:00
Marco Barletta
6666e5265d Added support for KV connector v1 (#2039)
### What this PR does / why we need it?
- This PR adds the support for the KV connector interface in the V1
architecture, in the same way as vllm. Vllm-ascend currently lacks of
this support, required to support also layerwise management of KV
caches.

- The connector interface allows using external tools and integrate them
with vllm

### Notes:
We are aware of Issue #684 , however that issue does not modify the
attention classes as necessary to perform layerwise management of KV
caches required for connectors like LMCache.

The implementation of this PR ported the necessary code from the vanilla
vllm. The KV connector API is the same as vanilla vllm, supporting the
standard KV connector API.

EDIT: this PR was re-implementing part of the changes merged one hour
before this PR was made on the file model_runner_v1.py. I solved the
conflicts by removing any modification to the model_runner_v1 file,
which now are largely already merged in main. Now this PR is left for
the modifications to the attention_v1 file.

### Does this PR introduce _any_ user-facing change?
The PR does not modify current APIs, but it extends the behavior of
current worker runner and attention classes to save and load KV caches.
In absence of connectors, the behavior should stay untouched.

### How was this patch tested?
- No unit test implemented yet for the worker.

- Tested together with LMCache using
https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py
with the following models:
1 Deepseek-R1-Distill-Qwen-1.5B
2 Qwen3-30B-A3B
3 Deepseek-v2-lite
4 Llama-3.1-8B
LMCache used in both layerwise and non-layerwise mode.

- Performed LMEval on LMCache integrated with vllm-ascend.

Results without LMCache on Qwen3-8B:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8400|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8355|± |0.0102|
 
Results with LMCache Layerwise:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8385|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8332|± |0.0103|


- vLLM version: v0.10.1.1
- vLLM main:
50fede6634

---------

Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
2025-09-08 09:04:22 +08:00
Li Wang
2967e5e22a [Benchmark] Correctly kill vllm process in performance benchamrk (#2782)
### What this PR does / why we need it?
vLLM now names the process with VLLM prefix after
https://github.com/vllm-project/vllm/pull/21445, we should kill the
correct process name after one iteration benchmark to avoid OOM issue
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-07 10:36:34 +08:00
yupeng
a746f8274f [DOC] Qwen3 PD disaggregation user guide (#2751)
### What this PR does / why we need it?
The PR is for the document of the prefiller&decoder disaggregation
deloyment guide.

The scenario of the guide is:
- Use 3 nodes totally and 2 NPUs on each node
- Qwen3-30B-A3B
- 1P2D
- Expert Parallel

The deployment can be used to verify PD Disggregation / Expert Parallel
features with a slightly less resources.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.


- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-09-07 10:35:37 +08:00
yeyifan
b2f77d3aa8 [fix] prefill unsupport sliding window attention (#2758)
### What this PR does / why we need it?
fix prefill attention bug,not support sliding window.
npu_fused_infer_attention_score head_dim only equal 128, not support
other number.
### Does this PR introduce _any_ user-facing change?
remove prefill phase npu_fused_infer_attention_score
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-09-07 10:34:38 +08:00
Yikun Jiang
752e272a55 Add note for Ascend HDK version (#2765)
### What this PR does / why we need it?
Add note for Ascend HDK version

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-07 10:33:41 +08:00
lidenghui1110
5a7181569c [feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167)
### What this PR does / why we need it?
This PR introduces Oproj matrix tensor model parallel to achieve
decreasing of memory consumption. It only support graph mode in pure DP
scenario.

In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with
oproj_tensor_parallel_size = 8, we have 1 ms TPOT increasing, saved 5.8
GB NPU memory per RANK. We got best performance when
oproj_tensor_parallel_size=4 without TPOT increasing.

performance data:
<img width="1442" height="442" alt="image"
src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d"
/>

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.
| Name | Effect | Required | Type | Constraints |
| :---------------------------- |
:--------------------------------------- | :------- | :--- |
:----------------- |
| oproj_tensor_parallel_size | Split the o_proj matrix along the row
dimension (head num * head dim) into oproj_tensor_parallel_size pieces.
| No | int | default value is None, once this value is set, the feature
will be enabled, head num * head dim must be divisible by this value. |

example

`--additional_config={"oproj_tensor_parallel_size": 8}`

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
eddaafc1c7

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zzh <zzh_201018@outlook.com>
2025-09-07 10:31:32 +08:00
Yikun Jiang
a58b43b72c Remove git .extraheader and fecth all commtis in /vllm-workspace/vllm-ascend (#2746)
### What this PR does / why we need it?
Remove git .extraheader and fecth all commtis in
/vllm-workspace/vllm-ascend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/2735
- vLLM version: v0.10.1.1
- vLLM main:
51d5e9be7d

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-05 09:45:11 +08:00
henryxuxu0716
51a2aec115 Delete redundant codes related to communication (#2717)
### What this PR does / why we need it?
Delete redundant codes related to communication

### Does this PR introduce _any_ user-facing change?
not involve

### How was this patch tested?
not involve

- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
2025-09-05 09:39:39 +08:00
1092626063
5b3646ab21 [FEATURE][MTP] Support MTP > 1 (#2708)
### What this PR does / why we need it?
[RFC:Support MTP > 1 for
DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745)

- [x] dp1 tp16
- [x] dp4 tp4
- [x] dp2 tp 8
- [x] torchair graph

- vLLM version: v0.10.1.1
- vLLM main:
c9f7081f9c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-09-05 09:11:22 +08:00
yiz-liu
83eb40a51c [Fix][MoE] Refine MoE communication strategy (#2734)
### What this PR does / why we need it?
Refactors the Mixture-of-Experts (MoE) communication method selection
logic. The choice between all-gather, all-to-all, and mc2 is now
determined by expert parallel configuration, SoC version (A2/A3), and
token count for better performance.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Added.


- vLLM version: v0.10.1.1
- vLLM main:
eafa8dcde6

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-05 09:04:04 +08:00
liziyu
4c90fa79ca [Misc] Remove useless PD check in deepseek (#2739)
### What this PR does / why we need it?
Remove useless PD check in deepseek


- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-04 22:22:19 +08:00
vllm-ascend-ci
3a2a7d88db [Doc] Update accuracy reports for v0.10.1rc1 (#2755)
The accuracy results running on NPU Altlas A2 have changed, updating
reports for: All models (Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct,
Qwen3-8B-Base, DeepSeek-V2-Lite)

  - [Workflow run][1]
  
[1]:
https://github.com/vllm-project/vllm-ascend/actions/runs/17459225764
- vLLM version: v0.10.1.1
- vLLM main:
2b30afa442

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
2025-09-04 22:17:17 +08:00
sherie
f86596a66c allgather use fusedop. (#2689)
### What this PR does / why we need it?
Use 'npu_moe_init_routing_v2' &'npu_moe_token_unpermute' repalce
'npu_moe_init_routing' &‘npu_moe_compute_expert_tokens’&
'npu_moe_finalize_routing' to optimize performance
### Does this PR introduce _any_ user-facing change?
| branch| tps| TTFT |TPOT |
| --- | --- | --- |--- |
|main  |733.98  | 280.05 |34.30 |
|main+fusedop  | 740.33 | 273.34 |33.99 |
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-04 11:56:29 +08:00
无脸男
7d47d8f4f6 [Fix] fix resources limit error when apply speculative decoding and aclgraph (#2472)
### What this PR does / why we need it?
When both speculative decoding and aclgraph are applied, and
cudagraph_capture_sizes uses the default value, it will report that the
stream resources are insufficient.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
9c99e4871f

Signed-off-by: withHades <244036962@qq.com>
2025-09-04 11:50:43 +08:00
无脸男
0c0789be74 [Feat] allow using aclgraph in ray backend (#2589)
### What this PR does / why we need it?

Allow using aclgraph in ray backend, for tp + pp + aclgraph in multi
machine

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
4ba0c587ba

Signed-off-by: withHades <244036962@qq.com>
2025-09-04 11:45:56 +08:00
Ruri
aff5189c87 [main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers (#2275)
### What this PR does / why we need it?

Fuse `GroupedMatmul`, `Swiglu` and `DynamicQuant` into one fusion
operation `GroupedMatmulSwigluQuant`.

1. extract common functions in `w4a8_dynamic.py` and `w8a8_dynamic.py`
2. if in supported occasion, use fusion operation
`npu_grouped_matmul_swiglu_quant`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Tested on W8A8 quantized Qwen3-235B-A22B model with `bs=16`

1. `tp=8`, `dp=1`, `moe_tp=8`, `moe_ep=1`, TPOP increased 21.54%, Output
Token Throughput increased 27.35%
<img width="3443" height="211" alt="image"
src="https://github.com/user-attachments/assets/a1a9c14d-2310-41be-9a03-36125dabae6e"
/>

3. `tp=8`, `dp=1`, `moe_tp=1`, `moe_ep=8`, TPOP increased 17.38%, Output
Token Throughput increased 6.86%
<img width="3443" height="211" alt="image"
src="https://github.com/user-attachments/assets/1ce92e92-720d-40c0-8b4d-c493e5cb10a6"
/>


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-09-04 11:37:32 +08:00
22dimensions
37f5a29cd4 [1/N][Refactor][Quantization] remove redundant quantizer class (#2680)
### What this PR does / why we need it?

AscendQuantizer/LLMQuantizer class is used to select quant method based
on quant config and some other arguments,
but it is more simple and clean replacing these classes with map. So i
remove them.

### Does this PR introduce _any_ user-facing change?
No 

### How was this patch tested?

ut and e2e test


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-04 11:35:14 +08:00
Icey
d4370ebc42 [Refactor] Refactor Spec Decode (#2668)
### What this PR does / why we need it?
Refactor spec decode

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-04 11:34:47 +08:00
Mengqing Cao
7e16b4a7cd [ReleaseNote] Add Release Note for v0.10.1rc1 (#2635)
Add Release Note for v0.10.1rc1

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-04 11:26:47 +08:00
Angazenn
e7409e95ee [1/N][Draft][Refactor]torchair pangu_moe modeling refactor (#2437)
### What this PR does / why we need it?

1. Similar to #2384 , this PR add a torchair-specific modeling for
pangu.
2. Fixes a bug introduced by routed_scaling_factor in #2675 .
3. remove eager test case for pangu since there has already been a
torchair test case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
2025-09-04 10:39:21 +08:00
whx
a58013440a [BugFix][MLA] Fix attn_mask bug for ring mla (#2704)
This PR fix a bug related to attention mask used in ring mla. Current
ring mla has supported compressed mask, so we can directly use a 512 *
512 attention mask.

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-04 10:22:46 +08:00