Commit Graph

244 Commits

Author SHA1 Message Date
LeeWenquan
c8d1df3a3f [Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465)
### What this PR does / why we need it?
In order to support fused kernels, multi-stream, communication
optimization etc, it's better to aggregate all opreations in Attention
layer togather. This PR tries to refactor mla_v1 by moving all MLA
preprocessing ops into mla_v1 attention impl.
Note that new mla_v1 doesn't take torchair into consideration. So this
PR can only be merged after torchair related mla_v1 is isolated into a
new file.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

### Features Test

<img width="506" height="141" alt="image"
src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466"
/>

### Performance After Refact
<img width="648" height="486" alt="image"
src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb"
/>

### Performance Before Refact
<img width="618" height="494" alt="image"
src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1"
/>


- vLLM version: v0.10.1.1
- vLLM main:
e03940762b

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-08-28 10:35:57 +08:00
weichen
320edde2df [main] [refactor] refactor fused_moe.py to enable token_dispatchers (#2570)
### What this PR does / why we need it?
Enable token_dispatcher to replace fused_experts_with_xxx in eager mode
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut


- vLLM version: v0.10.1.1
- vLLM main:
704432af3c

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: sherie <963372609@qq.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: shiyuan680 <72335504+shiyuan680@users.noreply.github.com>
2025-08-28 10:13:35 +08:00
Wang Yixuan
a955e5d404 [4/N][refactor]delete torchair from quantization (#2535)
### What this PR does / why we need it?
After moved torchair related quantization section into
torchair_quantization, split the torchair from the origin quantization

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-28 09:10:03 +08:00
Icey
c578f817ca [CustomOp] Register VocabParallelEmbedding instead of overwrite forward (#2515)
### What this PR does / why we need it?
Register VocabParallelEmbedding instead of overwrite forward

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
644d57d531

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-28 08:57:34 +08:00
rjg-lyh
2bfbf9b9b3 [main][bugfix] Fix bugs and refactor cached mask generation logic (#2442)
### What this PR does / why we need it?
This PR fix bugs and refactor cached mask generation logic. Now just
pre-construct and use the cached mask on cpu instead of device on npu.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
9b5f64238f

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-08-27 12:07:29 +08:00
huangxialu
6881c19458 [main] convert the format of gmm to nz (#2474)
### What this PR does / why we need it?
convert the format of gmm to nz

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut: test_fused_ops.py and e2e: test_fused_moe.py

**performance**:
(qwen3 30B, 2k->20k)

base:
Total Token throughput (tok/s):          719.93

gmm nz:
Total Token throughput (tok/s):          728.52


- vLLM version: v0.10.1.1
- vLLM main:
bfc1edc9f5

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-27 11:25:02 +08:00
wangxiyuan
c0e12143a3 [CI] Fix UT failure (#2563)
UT is broken by vLLM commit
https://github.com/vllm-project/vllm/pull/23664

This PR mock the related config to recover the CI

- vLLM version: v0.10.1.1
- vLLM main:
6dab89b8ec

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-27 11:24:35 +08:00
Wang Yixuan
20a7bc4b71 [3/N][refactor] refactoer quantization (#2504)
### What this PR does / why we need it?
Move torchair related qunatization section into torchair dir to make the
code clear. Next step we'll remove all torchair related code outside of
torchair quantization.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
959783fb99

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-27 10:45:50 +08:00
Mengqing Cao
a9e78a3299 [Aclgraph] Update compilation config in check_and_update_config (#2540)
### What this PR does / why we need it?
This pr updates compilation config in `check_and_update_config`, we use
`compilation_config.level` to update `compilation_config.cudagraph_mode`
to ensure the config is correct.

Add `compilation_config.cudagraph_num_of_warmups = 1` when V1 is
enabled, cause this is also used in torchair graph mode. and this fixes
https://github.com/vllm-project/vllm-ascend/issues/2523

fix the bug that the `aclgraphmode` always be `NONE` while running
forward in aclgraph mode

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.1.1
- vLLM main:
f58675bfb3

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-27 09:30:25 +08:00
wangxiyuan
f22077daa6 [Embedding] Recover embedding function (#2483)
Fix broken embedding function. It's broken by
http://github.com/vllm-project/vllm/pull/23162

- vLLM version: v0.10.1.1
- vLLM main:
efc88cf64a

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-27 09:22:01 +08:00
s30076806
6a4ec186e7 [Qwen-moe] Remove the minor operation arange (#2373)
### What this PR does / why we need it?
Integrate the arange operator to reduce the time spent and improve
performance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
56dcf4e7e9

---------

Signed-off-by: s30076806 <songjiayang2@h-partners.com>
2025-08-27 09:13:31 +08:00
zhanghw0354
8151a9d5a4 [Test]Add unit test for worker_v1.py (#2547)
### What this PR does / why we need it?
According to issue
https://github.com/vllm-project/vllm-ascend/issues/1298 ,this pull
request adds unit test code for platform.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
b5d34af328

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-08-26 22:00:49 +08:00
yiz-liu
a6bb502e70 [2/N][Feat] Add MC2 communication method for MoE layers (#2469)
### What this PR does / why we need it?
This method replaces the previous all-gather approach for small numbers
of tokens.

The key changes include:
- A new `AscendFusedMoE` layer that handles token splitting, local
computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2
method and the existing all-gather method based on the number of input
tokens.
- Sharding the MoE communication mask across tensor-parallel ranks.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test case fixed.


- vLLM version: v0.10.1.1
- vLLM main:
b00e69f8ca

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-26 19:05:23 +08:00
lilinsiman
cfe77e83ae [Bugfix]Support Qwen3-MOE on aclgraph mode in sizes capture and add new ut (#2511)
[Bugfix]Support Qwen3-MOE on aclgraph mode in sizes capture and add new
ut

What this PR does / why we need it?
This PR solves the problem of sizes capture and stream error caused by
using ACLgraph on the Qwen3-30B MOE model.
Add new ut.

Does this PR introduce any user-facing change?
no

How was this patch tested?
ut

- vLLM version: v0.10.1.1
- vLLM main:
6fad29b11b

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-08-26 12:39:21 +08:00
zhanghw0354
b3fdd78a6b [Main][Refactor]Change ASCEND_QUATIZATION_METHOD to ASCEND_QUANTIZATION_METHOD (#2517)
### What this PR does / why we need it?
The constant ASCEND_QUATIZATION_METHOD in vllm_ascend/utils.py is
misspelled and should be corrected to ASCEND_QUANTIZATION_METHOD.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
c9abb10489

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-08-26 09:06:16 +08:00
wangxiyuan
7e494e94a9 [CI] Fix broken ci (#2530)
vLLM commit https://github.com/vllm-project/vllm/pull/22711 changed the
encode cache entries logic, this PR adapt the same change for vllm
ascend to make CI happy.

Co-Authored-By: zhoux77899 <zhouxiang100@huawei.com>

- vLLM version: v0.10.1.1
- vLLM main:
0ff902f3b4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-26 07:42:24 +08:00
wangxiyuan
de7649492d [Refactor] cleanup converting_weight_acl_format_format (#2482)
move maybe_converting_weight_acl_format_format to torchair module, it's
only used with 310p+torchair

- vLLM version: v0.10.1.1
- vLLM main:
49ab23b3cc

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-25 19:48:55 +08:00
Wang Yixuan
0f81e032f0 [1/N][refactor] torchair fused_moe refactor (#2438)
### What this PR does / why we need it?
Move torchair related fused_moe section into torchair_fused_moe to make
the code clear. Next step we'll remove all torchair related code outside
of torchair_fused_moe .

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.0
vLLM main:
08d5f7113a

- vLLM version: v0.10.1.1
- vLLM main:
170e8ea9ea

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-25 15:46:10 +08:00
Shanshan Shen
0767d51dd5 [Structured Output][CI] Add test for outlines backend for structured output in CI (#2283)
### What this PR does / why we need it?
Add test for `outlines` backend for structured output in CI.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Tests have all passed with:

```bash
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
```

- vLLM version: v0.10.0
- vLLM main:
53415653ff

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-25 09:59:13 +08:00
Icey
891b2bfe71 Accuracy report formatting (#2279)
### What this PR does / why we need it?
Accuracy report formatting

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
53415653ff

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-25 09:39:30 +08:00
Icey
f796e6280b [CustomOp] Register RotaryEmbedding instead of overwrite forward (#2385)
### What this PR does / why we need it?
Register RotaryEmbedding instead of overwrite forward

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.0
- vLLM main:
808d2e9aa0

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-08-25 09:32:35 +08:00
weichen
950c4b219a [main] refactor alltoallv in fused_moe (#2487)
### What this PR does / why we need it?
Refactor all2all-related fused_experts (both quantized/unquantized) into
TokenDispatcherWithAll2AllV, including dispatch & combine calculation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E & UT
- vLLM version: v0.10.0
- vLLM main:
65197a5fb3

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-08-23 20:38:17 +08:00
linfeng-yuan
4af5b80606 [Scheduler] validate max_num_batched_tokens and max_model_len in AscendSchedulerConfig (#2434)
### What this PR does / why we need it?
Add configuration check logic for ascend scheduler: if chunked_prefill
is disabled, max_num_batched_tokens couldn't be less than max_model_len,
following vLLM;

### Does this PR introduce _any_ user-facing change?
users cannot set max_num_batched_tokens smaller than max_model_len with
ascend scheduler
### How was this patch tested?
CI and vllm serving passed

- vLLM version: v0.10.0
- vLLM main:
f77a0802b7

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-08-23 19:39:44 +08:00
ZhaoJiangJiang
3629bc4431 feat: add mtp ut and fix some bugs (#2453)
### What this PR does / why we need it?
Fix mtp mode ut

### Does this PR introduce _any_ user-facing change?
Nothing

### How was this patch tested?
This can be tested in the same way as a unit test.


- vLLM version: v0.10.0
- vLLM main:
53415653ff

Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com>
Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>
2025-08-22 17:09:08 +08:00
Mengqing Cao
60ac4fb576 [QuickFix] Skip failed ut to recover CI quickly (#2484)
### What this PR does / why we need it?
Skip failed ut to recover CI quickly
related ut:
- `test_embed_models_correctness`: revert me when pooler is adapted with
the latest vllm main
- `test_check_and_update_config_enforce_eager_mode`: revert me when the
occasional failed is fixed

- vLLM version: v0.10.0
- vLLM main:
8896eb72eb

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-22 14:14:51 +08:00
Mengqing Cao
b0403f8d8a [CI] fix ci (#2464)
### What this PR does / why we need it?
1. use action/checkout@v5 instead of v4
2. remove dbo test case because there is issue with it and will be
refactored later
3. make vllm-ascend compatible with vllm v0.10.1.1 and add CI for it
4. fix sampler api changes introduced by
https://github.com/vllm-project/vllm/pull/22387
6. fix qwen3 moe config changes intruoduced by
https://github.com/vllm-project/vllm/pull/20562
7. fix kvcache block changes introduced by
https://github.com/vllm-project/vllm/pull/23262

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
0c6e40bbaa

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-22 07:30:48 +08:00
linfeng-yuan
0ca3f48c90 [2/N][refactor] torchair deepseek mla backend refactor (#2459)
### What this PR does / why we need it?
This PR move current unified mla backend to torchair folder and remove
torchair-related code in attention/mla_v1.py (1.3k -> 0.9k).

 
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Running eager mode with mla backend, and torchair mode with code before
[2445](https://github.com/vllm-project/vllm-ascend/pull/2445)


- vLLM version: v0.10.0
- vLLM main:
f571ff8eb6

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-08-21 14:02:30 +08:00
sherie
3fb80ee356 add mlp tp optimze (#2120)
### What this PR does / why we need it?
For dense models, by not applying tensor parallelism (TP) to the
attention module and applying TP to the MLP module, the allreduce
operations in the attention module can be eliminated, thereby reducing
computational overhead. However, this approach increases memory usage,
so the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMZE is used to
control this optimization.

- vLLM main:
b17109beea

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-08-21 09:22:07 +08:00
Wang Kunpeng
c40d4171bc [main][quantization] Adapt to the new format of ds w4a8 weight (#2392)
### What this PR does / why we need it?

The deepseek w4a8 weights we supported before were in mindie-format
format. It uses int8 to represent int4, so the weight size is similar to
w8a8, and we need to do a few extra steps to make vllm-ascend load it
normally.

Now we can directly use the new weight format, which uses two int4 packs
to save the weight, the weight size is reduced, and there is no need to
do many extra operations to directly use it on vllm-ascend, but we are
also compatible with the weights of the previous mindie format.

The weight changes in the new version: 
1. The weight is packed (2 int4 pack to int8)
2. The bias required in the apply method is directly generated by
modelslim

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`

#### 1.How to get weights using Modelslim

##### Installation steps

we can use the branch br_release_MindStudio_8.1.RC2_TR5_20260624
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624
https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### Generate w4a8 weights

cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md Execute the
[pre-check](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapter
Reference command:python3 quant_deepseek_w4a8.py --model_path {Original
weight path} --save_path {Generate weight path}

##### Adapt to vllm-ascend

Modification in `config.json`:`"model_type":deepseekv2` is changed to
`"model_type":deepseek_v3`;

#### 2.How to run w4a8

##### a.How to run eager mode

export VLLM_ASCEND_MLA_PA=1

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6
--enforce-eager
eg: python -m vllm.entrypoints.openai.api_server
--model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120 --max-num-seqs 128 --enforce-eager

##### b.How to run graph mode

export HCCL_BUFFSIZE=1024

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
eg: python -m vllm.entrypoints.openai.api_server
--model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
103f1ec8d3

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-08-20 20:25:18 +08:00
wangxiyuan
eccfb715f6 [CI] Fix UT (#2452)
Make UT CI happy 

- vLLM version: v0.10.0
- vLLM main:
d983769c41

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-08-20 16:26:07 +08:00
sherie
3f867ee708 refactor allgather/mc2-related fused_experts (#2369)
### What this PR does / why we need it?
refactor allgather/mc2-related fused_experts

- vLLM version: v0.10.0
- vLLM main:
de7b67a023

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-08-20 14:20:46 +08:00
Nicholas Tao
7bec1a9b9c qwen3_moe/qwen25 support torchair graph (#2403)
### What this PR does / why we need it?
Added support for the TorchAir graph mode in qwen3_moe and qwen2.5
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
llm = LLM(
    model=model,
    tensor_parallel_size=GPUs_per_dp_rank,
    enforce_eager=False,
    enable_expert_parallel=True,
    max_model_len=4096,
    max_num_seqs=16,
    trust_remote_code=trust_remote_code,
    gpu_memory_utilization=0.4,
    additional_config={
             "torchair_graph_config": {
                 "enabled": True,
                 "use_cached_graph": False,
                 "graph_batch_sizes_init": False,
                 "graph_batch_sizes": [16]
             },
             "ascend_scheduler_config": {
                 "enabled": True,
                 "chunked_prefill_enabled":True,
             },
             "refresh": True,
    },
)
```

- vLLM version: v0.10.0
- vLLM main:
b87cb97a53

Signed-off-by: taoyuxiang <oui.nicholas.tao@gmail.com>
2025-08-20 11:23:50 +08:00
Mengqing Cao
1327f9be1c Fix some ci issue and refactor modelrunner (#2445)
### What this PR does / why we need it?
Fix some ci issue and refactor modelrunner

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
4d9c61993a

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-20 09:01:04 +08:00
Shanshan Shen
83e0f41408 [3/N][Refactor] Move torchair_attention to torchair dir (#2017)
### What this PR does / why we need it?

1. Move `torchair_attention` to `torchair` dir.
2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend`
to reduce duplicate methods.
3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce
duplicate properties.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
0933f9d518

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-19 10:25:22 +08:00
xleoken
2a763b8326 [Bug] Fix bug in test_chunked.py (#1992)
### What this PR does / why we need it?

1. Remove the return statement, it will always skip following logic.

2. Update `deepseek` to `Qwen2.5-Instruct` for OOM in github e2e test
env.

3. Fix the comparison logic

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Local Test.


- vLLM version: v0.10.0
- vLLM main:
0933f9d518

Signed-off-by: xleoken <xleoken@163.com>
2025-08-19 10:23:47 +08:00
linfeng-yuan
3fc31ee1cb [1/N][refactor] torchair deepseek modeling refactor (#2384)
### What this PR does / why we need it?

Move torchair related model arch into torchair moduel to make the code
clear. Next step we'll remove all torchair related code outside of
torchair moduel.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.10.0
- vLLM main:
08d5f7113a

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-08-18 15:00:37 +08:00
Chao Lei
03ca2b26ca [P/D] Mooncake Connector for v1 distributed (#1568)
### What this PR does / why we need it?
This PR adopt Mooncake TransferEngine for kv cache register and
pull_blocks style disaggregate prefill implementation.

### Does this PR introduce any user-facing change?
No

### Dependencies
1. Cann Dependencies
Using Mooncake TransferEngine with Ascend Transport requires CANN
version 8.2.RC1 or higher.(see detail
Mooncake[#502](https://github.com/kvcache-ai/Mooncake/pull/502))

2. vllm-ascend
This PR depends on changes introduced by #950 (modifications to
`model_runner_v1`) and #1361 (updates to `schedule`), both of which have
been merged into the `v0.9.1-dev` branch and are expected to land in
`main` shortly.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
1c859a1387

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
2025-08-18 14:30:07 +08:00
Mengqing Cao
61866b8ac6 [Quickfix] update CachedRequestState as NewRequestData changed (#2367)
### What this PR does / why we need it?
1. update `CachedRequestState` as `NewRequestData` changed in
https://github.com/vllm-project/vllm/pull/22570
2. drop maintenance of vllm v0.10.0 in the branch main

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
92ff41abea

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-15 07:35:27 +08:00
Icey
c721ae6042 [CustomOp] Register RMSNorm instead of overwrite forward_oot (#2284)
### What this PR does / why we need it?
Use function CustomOp.register_oot to achieve the customop registery
```
from vllm.model_executor.custom_op import CustomOp
CustomOp.register_oot(_decorated_op_cls=AscendRMSNorm, name="RMSNorm")
```

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-14 17:18:30 +08:00
shiyuan680
e14f2ef669 refactor select_experts of moe module (#2150)
### What this PR does / why we need it?
this pr refactor select_experts of moe module
i merge implementations of quantitative and non-quantitative method in a
new class
use such as vllm like ExpertsSelector.select_experts
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in qwen3-moe and all ut.

- vLLM version: v0.10.0
- vLLM main:
e18859298d

Signed-off-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>
2025-08-14 11:50:53 +08:00
Shanshan Shen
103654ccd6 [Misc] Remove redundant imported envs, using envs_ascend instead (#2193)
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main:
71683ca6f6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:33:39 +08:00
yiz-liu
992271b027 [1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (#2125)
### What this PR does / why we need it?
This PR refactors the MoE (Mixture of Experts) communication logic by
introducing a strategy pattern. It defines an abstract base class,
`MoECommMethod`, which encapsulates different communication strategies
for MoE layers. By decoupling the MoE implementation from any single
communication method, this change makes it simpler to add, replace, or
optimize communication strategies in the future.

Plan / Roadmap

1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL
Graph handling to cover all scenarios (this PR).
2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize
performance in specific scenarios.
3. Enable W8A8 / Int8 models to use `unified_fused_experts`.

Other notes

* Data-parallel (DP) communication currently does not work with vLLM's
dispatch/combine mechanisms; an alternative approach is required to
resolve this incompatibility.

- vLLM version: v0.10.0
- vLLM main:
f7ad6a1eb3

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-12 21:10:20 +08:00
Wang Kunpeng
dc585f148a [main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198)
### What this PR does / why we need it?
1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to
pure DP for shared experts, enabling more efficient execution.
2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by
using ReduceScatter, made possible by pure DP sharding.
3.AllGather Postponed: Delayed to after QKV down projection to reduce
synchronization impact during prefill.

### How was this patch tested?
Adding ut case in `tests/ut/attention/test_mla_v1.py`

#### How to run

use parameter `--additional_config='{"enable_shared_expert_dp": true}'`

##### a.How to run eager mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true}'

##### b.How to run graph mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2025-08-12 14:12:12 +08:00
zhenghaojiang
eb43a475f4 [Feat] chunkprefill mla support torchair graph (#1772)
chunkprefill mla only support eager mode now,we want to optimaze it by
support torchair graph, the idea is simple, when all the request is
running in decode, use torchair graph to deal with it, else when
chunkprefill or prefill only, use the eager mode

- vLLM version: v0.10.0
- vLLM main:
ebf7605b0d

Signed-off-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>
2025-08-11 19:58:59 +08:00
whx
29aaba5f84 [Perf][MTP] Optimize reject sampler in greedy situation. (#2137)
This PR port optimization in PR #2002 to main and makes it cleaner.

- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-08-11 17:37:49 +08:00
Pleaplusone
c0f0b70813 [core] Support capture custom ops into aclgraph (#2113)
### What this PR does / why we need it?
Thanks to the PR https://github.com/vllm-project/vllm-ascend/pull/426
make vllm-ascend support the aclgraph inference to reduce the host
overhead. However, the capability of aclgraph strongly relies on the
functionality provided by `torch.compile`, which is the key feature
supported in torch 2.x . Therefore, capture custom op into aclgraph is
only possible when it can be recognize and captured by `torch.compile`.

In this PR, we register the meta implementation of current custom ops to
enable the fx graph capture. And by doing that, insert those custom ops
into aclgraph become a natural thing to the ascend runtime.

### Does this PR introduce _any_ user-facing change?
No user face change.

### How was this patch tested?
Tested in unittest, we will integrate the `rotary_embedding` op into a
small custom model and use `torch.compile` and aclgraph to capture and
replay it to verify its functionality.

- vLLM version: v0.10.0
- vLLM main:
1b99028069

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-11 15:59:42 +08:00
wangxiyuan
9260910c8d [CI] Fix broken CI (#2302)
1. disable test_eagle_ccorrectness test, we'll reopen it once oom error
fixed.
2. drop transformers version limit for main, since vLLM rely on
>=4.55.0, see:
65552b476b
3. fix kv_connector_output bug, see:
796bae07c5

- vLLM version: v0.10.0
- vLLM main:
d1af8b7be9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-11 11:22:32 +08:00
yangqinghao-cmss
ee6f79c44a Add ut for test_communicator.py (#2293)
### What this PR does / why we need it?

Add ut for test_communicator.py 

- vLLM version: v0.10.0
- vLLM main:
e5ebeeba53

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-09 08:26:04 +08:00
Icey
0bd5ff5299 Fix accuracy test config and add DeepSeek-V2-Lite test (#2261)
### What this PR does / why we need it?
This PR fix accuracy test related to
https://github.com/vllm-project/vllm-ascend/pull/2073, users can now
perform accuracy tests on multiple models simultaneously and generate
different report files by running:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/models/configs/accuracy.txt
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
<img width="1648" height="511" alt="image"
src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd"
/>


- vLLM version: v0.10.0
- vLLM main:
766bc8162c

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-08 11:09:16 +08:00
Mengqing Cao
ad1083761f [CI][Quickfix] Fix AscendFusedMoE init error (#2268)
### What this PR does / why we need it?
Fix AscendFusedMoE init error. Use `super().__init__()` instead of
`super(FusedMoE, self).__init__()` to ensure the member variables in
base class could be called by the children class

### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new existing test.


- vLLM version: v0.10.0
- vLLM main:
766bc8162c

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-08 10:20:23 +08:00