Commit Graph

366 Commits

Author SHA1 Message Date
XiaoxinWang
579b7e5f21 add pagedattention to support FULL_DECODE_ONLY. (#3102)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-10 08:50:33 +08:00
offline893
1c2c72af8d [bugfix]change log2phy map to npu (#3339)
### What this PR does / why we need it?
Resolved the issue of EPLB failure caused by changes in the log2phy map
due to device type modifications when using MTP rotation position
encoding.

### Does this PR introduce any user-facing change?

### How was this patch tested?
https://github.com/vllm-project/vllm/commit/releases/v0.11.0


- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-10 08:47:55 +08:00
Ruri
ff37575936 [1/N][Feat] Add weight prefetch feature for Attention layers (#3146)
### What this PR does / why we need it?

- Refacotr and integrate a unified `WeightPrefetchMethod`
- Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention
modules
- Prefetching these weights ahead of matmul-like operators imporves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Add a new config in `--additional-config` for configuration:
```json
{
    "weight_prefetch_config": {
        "enabled": false,
        "prefetch_ratio": {
            "attn": {
                "qkv": 1.0,
                "o": 1.0,
            },
        },
    },
}
```
This feature is enabled by default, and can be disabled through this
configuration

### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Co-authored-by: yuzhup <15705211260@163.com>
2025-10-09 20:38:39 +08:00
huangdong2022
23db56a340 [Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with norm bias (#3205)
### What this PR does / why we need it?
1. qwen3 moe uses add_rms_norm_quant op instead of 'add_rms_norm op and
quant op' during quantization scene.
2. torch_npu.add_rms_norm_quant op fixed accuracy while model weights is
quantized by anti_method m4, m4 quantization is asymmetric outlier
suppression method, it will generate none-zero norm bias,
add_rms_norm_quant op updated to add this parameter to calculate.

### Does this PR introduce _any_ user-facing change?
please use a torch_npu version >= torch_npu-2.7.1.dev20250919

### How was this patch tested?
1. no special parameters to set, no new envs to set.
2. use qwen3 moe quantization model to test ,such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8,
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Signed-off-by: h30027576 <huangdong51@huawei.com>
2025-10-09 20:18:10 +08:00
Wang Yixuan
30c5d947c3 [bugfix]fix multistream moe in torchair (#3164)
### What this PR does / why we need it?

the multistream moe in tochari only validate in decode, but can't be
applied to chunked prefill, So add some judgments to isolate the
scenario

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-09 19:00:32 +08:00
weichen
94dd832815 [MoE] [Refactor] Combine common_fused_moe and fused_moe (#3176)
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:

1. Enable/Disable EP
3. Aclgraph & eager
4. SP


- vLLM version: v0.11.0

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-10-09 14:12:46 +08:00
wangxiyuan
1c5b302f0d [Misc] Clean up useless patch (#3320)
### What this PR does / why we need it?
1. clean up v0.10.2 support in ut and e2e test
2. remove v0.11.0 period job, we're at v0.11.0 now.
3. remove uesless patch for deepseek v3.2. They have been done in vLLM
already.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 14:07:26 +08:00
weijinqian0
474fa737c8 [bugfix] Fix moe bug: allgather error. (#3279)
It will crash when deepseek model executed in A2.


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-30 18:45:09 +08:00
wangxiyuan
4abdcdba4e upgrade pta to 0919 (#3295)
### What this PR does / why we need it?
Upgrade torch-npu to the newest POC version
### Does this PR introduce _any_ user-facing change?
yes, user need upgrade the pta version as well.
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-30 17:14:23 +08:00
Chao Lei
a486ff8c11 KVCache Transfer via Layer-wise Strategy in Disaggregation (#2602)
### What this PR does / why we need it?
See RFC: https://github.com/vllm-project/vllm-ascend/issues/2470 This PR
add a new kv connector for layer-wised kv transfer

### Does this PR introduce _any_ user-facing change?
yes, a new kv connector is added. User can use layer wised feature now.
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Signed-off-by: CaveNightingale <2859066733@qq.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: hanxinlong <50882499@qq.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: CaveNightingale <2859066733@qq.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: hanxinlong <50882499@qq.com>
2025-09-30 15:10:29 +08:00
Mengqing Cao
f8c93d8d24 [Aclgraph][DP] Fix dp dummy run not in aclgraph error (#3208)
### What this PR does / why we need it?
When running DP in a non-equilibrium scenario, which means there is some
dp groups executing `dummy_run`, we need to make sure it running the
same mode as other dp, thus improving then performance in dp scenario

### How was this patch tested?
Tested by adding log in `_dummy_run`

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-30 11:14:51 +08:00
wangxiyuan
c73dd8fecb [CI] Fix CI by addressing max_split_size_mb config (#3258)
### What this PR does / why we need it?
Fix CI by addressing max_split_size_mb config

### Does this PR introduce _any_ user-facing change?
No, test onyl

### How was this patch tested?
Full CI passed, espcially eagle one


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-29 14:05:12 +08:00
wangxiyuan
15b8aff582 [CI] Add max_split_size_mb for e2e test to avoid oom (#3252)
### What this PR does / why we need it?
we add a patch for model weight loader to avoid using vLLM weight loader
v2, since v2 will lead unknown issue for torchair. While this patch make
some unknown memory usage problem. To quick fix the problem, let's
expend the `max_split_size_mb` to a larger value to avoid weight load
oom issue.

Further solution is to remove the patch and address weight loader v2
from vLLM.

Closes: https://github.com/vllm-project/vllm-ascend/issues/3251

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-29 09:13:08 +08:00
Mengqing Cao
4ff422c730 [CI][Bugfix] Quickfix for DPMetaData (#3234)
### What this PR does / why we need it?
Fix `dpmetadata` and `Qwen3MoeSparseMoeBlock` break introduced by
26a7a33b88 (diff-c1550d0a38469d039370567d8981969530cbfffc7302cd1778e7c2c8a9322dea)

NOTE: we maintain a different sp in vllm-ascend with vllm, thus we can
just use `cu_tokens_across_sp(1)` as `cu_tokens_across_dp_cpu`

close https://github.com/vllm-project/vllm-ascend/issues/3236,
https://github.com/vllm-project/vllm-ascend/issues/3239
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-28 21:11:22 +08:00
lilinsiman
1705501ae2 [BugFix] Fix ACLgraph bug in Qwen3_32b_int8 case (#3204)
### What this PR does / why we need it?
1. Solved the issue where sizes capture failed for the Qwen3-32b-int8
model when aclgraph, dp1, and tp4 were enabled.
2. Added the exception thrown when sizes capture fails and provided a
solution
3. Add this common problem to the FAQ doc
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-09-28 17:44:04 +08:00
Wang Kunpeng
859e861d92 [main][quantization] Support deepseek w4a8 per-channel quantization (#3011)
### What this PR does / why we need it?
1.Support deepseek w4a8 per-channel quantization
2.The eager mode supports converting weights to the NZ format
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim

##### Installation steps

git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### Generate w4a8 per-channel weights

cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-09-27 21:01:16 +08:00
XiaoxinWang
8406aafaff Add e2e test related to weight updates in RL scenarios. (#2954)
### What this PR does / why we need it?
Add e2e test related to weight updates in RL scenarios.

Due to CI issues, the newly added Python test files cannot locate the
correct path. As a temporary solution, use absolute paths to add test
cases.

- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>
2025-09-26 11:07:10 +08:00
florenceCH
14497b748d Remove qwen3 moe MC2 cumsum & cast (#3126)
What this PR does / why we need it?
The Qwen3 moe MC2 graph currently has two redundant computational
operator implementations. After npu_moe_distribute_dispatch_v2, the
cumsum and cast operations have been added. By using
expert_token_nums_type=0 and not converting weight_scale to float32,
these two operators can be eliminated, thereby improving inference
performance.

Does this PR introduce any user-facing change?
No

How was this patch tested?
No need

vLLM version: v0.10.2
vLLM main:
f225ea7dd9

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: florenceCH <gaoxiang120@huawei.com>
Co-authored-by: florenceCH <gaoxiang120@huawei.com>
2025-09-26 08:51:30 +08:00
wangxiyuan
0794f64a18 Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist (#3194)
…to avoid unintentional copy ops blocking across different NPU streams,
improving disagg TTIT/TTFT (#2788)"



### What this PR does / why we need it?
This reverts commit 6995a7bc5b. We'll add
it back once the issue is fixed.

related issue: https://github.com/vllm-project/vllm-ascend/issues/3195

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
2025-09-26 06:17:36 +08:00
leo-pony
72f64c10b7 [bugFix] Correct the vllm interface e2e test Base container image name (#3179)
### What this PR does / why we need it?
Correct the vllm interface e2e test Base container image name

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
Tests in vllm ci pipeline
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-09-25 16:03:09 +08:00
Icey
2a9d02e080 [Bugfix] eagle and eagle3 spec decode failures and enable e2e test (#2979)
### What this PR does / why we need it?
- Fix the bug https://github.com/vllm-project/vllm-ascend/issues/2978
- Enable e2e test,
- Adapt to scenarios where Speculative tokens are greater than 2,
- Fix the bug that causes Eagle3 inference failures under high
concurrency and improve the acceptance rate of draft models, by
https://github.com/vllm-project/vllm-ascend/pull/2794

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with new added/existing test.

Co-authored-by: hukongyi
[hukongyi@cmbchina.com](mailto:hukongyi@cmbchina.com)
Co-authored-by: guanyuzhu
[zhuguanyu@huawei.com](mailto:zhuguanyu@huawei.com)
Co-authored-by: liumail680
[liumail680@163.com](mailto:liumail680@163.com)


- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-25 14:39:12 +08:00
wangxiyuan
ac1c2cd9ac [CI] Upgrade vllm version - 0925 (#3167)
Upgrade vLLM to newest commit.

1. Remove the useless func get_state_cls, it has been removed from vLLM
already.
e6750d0b18
2. Fix ut broken by
6160ba4151


- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-25 14:20:10 +08:00
mfyCn-1204
33c118c80e [core]vllm-ascend support msMonitor tool (#3123)
### What this PR does / why we need it?
vllm-ascend support [msMonitor
](https://gitcode.com/Ascend/mstt/tree/master/msmonitor)tool to collect
performance of vllm-ascend

### Does this PR introduce _any_ user-facing change?
1.add env MSMONITOR_USE_DAEMON;
2.user cann enable msMonitor tool by setting MSMONITOR_USE_DAEMON=1
before run vllm-ascend model;
3.MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both set

### How was this patch tested?
1.run vllm-ascend model while not set MSMONITOR_USE_DAEMON=1 or set
MSMONITOR_USE_DAEMON=0, model will run successfully;
2.run vllm-ascend model while set MSMONITOR_USE_DAEMON=1, run msMonitor
tool to collect profile data;
3.run vllm-ascend model while set MSMONITOR_USE_DAEMON=1 and
VLLM_TORCH_PROFILER_DIR, will raise error

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: mei-feiyao <1332490378@qq.com>
2025-09-25 14:15:02 +08:00
wangxiyuan
a055183821 [CI] Upgrade vLLM version (#3139)
Upgrade vLLM version to the newest commit.
- Fix the break change introduced by
969b4da3a6
- Add a patch to quick fix torhcair
de94289a98
- fix the ut error introduced by
de94289a98

Close: https://github.com/vllm-project/vllm-ascend/issues/3138


- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-25 07:36:51 +08:00
leo-pony
360a736dfa Add OOT platform E2E test case to be run in the vllm buildkite pipeline (#3154)
### What this PR does / why we need it?
Add OOT platform E2E test case to be run in the vllm buildkite pipeline.
Note: added test case is not run in vllm-ascend CI.

### Does this PR introduce _any_ user-facing change?
NA

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-09-24 17:55:58 +08:00
Clorist33
302494c1fe [EPLB] ut for EPLB (#3035)
## UT for EPLB

Co-authored-by Skywalker-EP 173723846@qq.com
Co-authored-by offline 0806@qq.com
Co-authored-by dsxsteven@sina.com

## UT Description

### 1. Module Description
- Module: EPLB

### 2. Covered Source Files
- vllm_ascend/eplb/adaptor/abstract_adaptor.py
- vllm_ascend/eplb/core/eplb_device_transfer_loader.py
- vllm_ascend/eplb/core/eplb_utils.py
- vllm_ascend/eplb/core/policy/policy_abstract.py
- vllm_ascend/eplb/core/policy/policy_dynamic_ep.py
- vllm_ascend/eplb/core/policy/policy_dynamic_ep_v2.py
- vllm_ascend/eplb/core/policy/policy_factory.py

### 3. Testing Method
- Framework: pytest
- Test Data: mock data
- Test Type: unit test

### 4. Coverage
- Statement Coverage: 90%


- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Signed-off-by: tanqingshan <50050625@china.huawei.com>
Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: tanqingshan (A) <t50050625@china.huawei.com>
Co-authored-by: tanqingshan <50050625@china.huawei.com>
Co-authored-by: daishixun <dsxsteven@sina.com>
Co-authored-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>
2025-09-24 17:14:38 +08:00
Csrayz
80524f5711 [CORE] concurrent partial prefills (#2372)
# What this PR does / why we need it?

When processing a mix of large and small requests, the TTFT of responses
is significantly reduc\ed. Please refer to
https://github.com/vllm-project/vllm/pull/10235, which achieves the same
effect by simply limiting the number of prompt fills for long requests.
This solution can be applied to both AscendScheduler (V0) and vLLM
Scheduler (V1). Tests show that TTFT can be significantly improved when
handling such mixed requests. However, This capability is currently
missing when Ascend Scheduler is enabled.

This benchmark used the Qwen3-8B model, with a context length of 128K,
running on a single card.

Regarding dataset selection, the sharegpt_clean dataset is used, with
its content concatenated and cropped. Small requests with token=50 and
medium requests with token=10240 were constructed (there were also large
requests with token=102400, but these were ignored because when using
the Prefill First scheduling strategy, max_num_batched_tokens will not
be set to such a large value). When loading vLLM, set
max_num_batched_tokens=22000. This length can accommodate two
medium-sized requests and some short requests, reflecting an extreme
scenario where the budget is almost entirely occupied by longer
requests.

Next, we mix 990 small requests and 100 medium requests into one type of
load scenario (hereinafter referred to as 10%), and similarly generate
load scenarios with 5% medium requests and 1% load scenarios.

Performance tests were conducted separately for enabling vLLMScheduler,
AscendScheduler, and AscendScheduler (long prompt concurrency set to 1).

- vLLM version: v0.10.2
- vLLM main:
1dfea5f4a9

---------

Signed-off-by: Csrayz <jover@cmbchina.com>
2025-09-24 17:12:55 +08:00
baxingpiaochong
eb205d9f35 [P/D][BugFix]Mooncake timeout release bug fix (#2899)
### What this PR does / why we need it?
In the P node timeout release mechanism during PD separation, the req_id
that requires timeout release is transmitted from the scheduler to the
worker. If the KV cache between PDs is transferred too quickly, the P
node's req_id may be released twice. The first release is when the D
node notifies the P node that the KV cache has been pulled, and the
second release is when the scheduler transmits the timeout release to
the worker.

To address this bug, an intermediate component is introduced to manage
the release of req_ids.

Pull kv and forward2 may occur one after the other in timing. The
previous timeout defaulted to forward2 being before pull_kv.


### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2025-09-24 11:22:46 +08:00
Song Zhixin
6995a7bc5b [Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)
### What this PR does / why we need it?
When we copy the sampled valid token ids from device to host, avoid
using tolist which would trigger a CUDA wise stream sync if the source
is on device. We change it to use non-blocking copy followed by an
explicit CUDA event sync.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Bring up vLLM server
```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-l
og-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000
```
## Before:

![76218085a0cde9b2a73214e35fb7fc08](https://github.com/user-attachments/assets/38cbd02d-d380-47f8-a111-4bd859102eb1)
## After

![6c2111136673332244d3ce11060f4048](https://github.com/user-attachments/assets/957f9bf1-ec50-4f49-9318-f4876b3e3691)

As shown in the figure, the TTFT decreased


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-09-24 11:21:58 +08:00
linfeng-yuan
d01fd1d1c3 [misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074)
### What this PR does / why we need it?
This miscellaneous​ contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatches after o_proj caused by a work-aroud
change in NPUModelRunner.
3) avoid unnecessary decline of kv_cache memory (default: 64MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All` since the padding
logic of `mc2_mask` is incompatible with input hidden_states when
`shared_expert_dp` enabled.

Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as
`v0.9.1-dev` branch. The remaining problem of kv_cache tokens decline
compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
 
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
E2E vllm serving about deepseek_mtp with torchair graph mode and
`enable_shared_expert_dp` with eager mode. Large ep deployments are also
tested with this PR.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-23 14:52:42 +08:00
lidenghui1110
0f3939e5a9 [Feature]cpu offload connector (#1659)
This PR implements cpu offload connector to enable NPU kv cache offload
to host DRAM.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
Co-authored-by: AlvisGong <gwly0401@163.com>
2025-09-23 14:25:05 +08:00
wyu0-0
d2399ab97b Fix VLLM_ASCEND_LLMDD_RPC_PORT renaming (#3108)
### What this PR does / why we need it?
This PR implements the renaming of the environment variable
VLLM_LLMDD_RPC_PORT to VLLM_ASCEND_LLMDD_RPC_PORT, as proposed and
tracked in
[#2450](https://github.com/vllm-project/vllm-ascend/pull/2450). The
renaming is intended to align the variable naming convention with other
Ascend-specific environment variables in the vllm-ascend codebase,
enhancing consistency and clarity for developers and users working with
Ascend-based deployments.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: wyu0-0 <woshilynn@163.com>
2025-09-23 10:33:04 +08:00
Li Wang
02f89d166f [CI] Update vllm version to 20250922(5aeb925) (#3091)
### What this PR does / why we need it?
This pr bump vllm commit hash to
5aeb925452
fix issues:  
1. https://github.com/vllm-project/vllm/pull/25345 has remove v0
metadata
2. https://github.com/vllm-project/vllm/pull/25332
3. https://github.com/vllm-project/vllm/pull/25334
4. https://github.com/vllm-project/vllm/pull/23558, note that this vllm
commit update the model register logic, which will check all the model
registered have the `vllm.model_executor.models` path , which breaks our
custom registration of the deepseek_v3 model (it doesn't exist in the
vllm model path). so I move deepseek_v3 model registy to deepseek_v2 to
solve temporary

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-22 22:18:13 +08:00
weichen
37a0715eda [Refactor] Adjustments to moe_comm_method selection process (#3001)
### What this PR does / why we need it?
Fix issues mentioned in
https://github.com/vllm-project/vllm-ascend/pull/2791 and some minor
refactoring.
1. Use Enum instead of string.
2. Avoid setting a new property to forward_context in
AscendFusedMoE.forward().
3. Enabling TokenDispatcherWithMoge.
4. Remove redundant code.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
2. Aclgraph & eager


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-22 19:12:58 +08:00
Yizhou
338231acaf [Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Captureing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 17:14:28 +08:00
zhangxinyuehfad
c90a6d3658 [Test] Update the format of the accuracy report (#3081)
### What this PR does / why we need it?
Update the format of the accuracy report

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-22 14:10:03 +08:00
Yikun Jiang
b8b68b3dfe [CI] Upgrade vLLM to 20250920 (c60e613) and address config break (#3067)
### What this PR does / why we need it?
Bump main to
c60e6137f0

- Updated imports in `vllm.config` to
`vllm.config.model`(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 09:49:17 +08:00
Li Wang
12bcbd02bb [CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907)
### What this PR does / why we need it?
1. This pr bump vllm commit to
6d8246aaff
2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548
abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable
3. fix metadata_builder changes introduced by
https://github.com/vllm-project/vllm/pull/23693
4. fix `structured_outputs_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22772
5. fix `moe_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-20 17:37:57 +08:00
zhangxinyuehfad
e26fe1caf1 [TEST] Speed up DS V2 accuracy test and turn up accuracy baseline (#3047)
### What this PR does / why we need it?
1. update expected accuracy for DeepSeek-V2-Lite
2. add batch size 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Accuracy CI passed

- vLLM version: v0.10.2
- vLLM main:
838d7116ba

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-20 00:40:33 +08:00
zhangxinyuehfad
a22b532d38 [Fixbug] Fix shape not match when sliding_window and dynamic batch_size (#2830)
### What this PR does / why we need it?
Fix shape not match when test LLM-Research/Phi-4-mini-instruct accuarcy 

### Does this PR introduce _any_ user-facing change?

Users can't set dynamic batch_size or use lm_eval test accuracy when
using models(sliding_window)

### How was this patch tested?
accuarcy of LLM-Research/Phi-4-mini-instruct is ok :
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```


- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-19 22:35:14 +08:00
zhanghw0354
cf549b976d [Test]Add unit test for compilation/acl_graph.py (#3039)
### What this PR does / why we need it?
According to issue [#1298
](https://github.com/vllm-project/vllm-ascend/issues/1298) ,this pull
request adds unit test code for compilation/acl_graph.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
f2718d2948

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-09-19 21:31:17 +08:00
22dimensions
0942d9aaab [3/N][Refactor][Quantization]remove packed_modules_mapping from models (#3021)
### What this PR does / why we need it?

Some custom models in vllm-ascend define packed_modules_mapping, which
prevent keeping same model class with vllm community. So move these
custom packed_modules_mapping to quant utils.py. After this pr, some
custom models can be removed.

### Does this PR introduce _any_ user-facing change?

tested by CI

### How was this patch tested?

tested by CI

- vLLM version: v0.10.2
- vLLM main:
5089fd749c

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-19 20:50:14 +08:00
Yikun Jiang
4ba56716f9 Increase doctest timeout to 300s and time print (#3041)
### What this PR does / why we need it?
Increase doctest timeout to 300s and time print, according to time print
in https://github.com/vllm-project/vllm-ascend/pull/3045 , most of time
consumed in `Graph capturing`, so I think it's fine to increase doctest
timeout

This PR also add time log for each task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`
- CI passed

- vLLM version: v0.10.2
- vLLM main:
a684c0124c

Closes: https://github.com/vllm-project/vllm-ascend/issues/3045

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-19 20:26:00 +08:00
Song Zhixin
833cd1b698 [BugFix] Async scheduling and PP compatibility with DP (#2796)
### What this PR does / why we need it?
based on the https://github.com/vllm-project/vllm/pull/23770,
fix Async scheduling and PP compatibility with DP, also fixes issue with
finished requests not being processed in async scheduling and PP cases,
and possible worker race conditions.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
544fe76b95

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-09-19 11:29:50 +08:00
whx
0a526768f5 [Feature] Support moe multi-stream for aclgraph. (#2946)
This PR puts the calculation of shared experts into a separate stream,
overlaping with routing experts.

- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-19 11:06:45 +08:00
zhangxinyuehfad
0c04bf1e36 [Fixbug] Fix accuracy for DeepSeek-V2-Lite (#3016)
### What this PR does / why we need it?

Fix accuracy for DeepSeek-V2-Lite

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
66072b36db

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-18 23:58:23 +08:00
Icey
acb46f303f Fix VocabParallelEmbedding UT (#2722)
### What this PR does / why we need it?
Fix VocabParallelEmbedding UT

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
f592b3174b

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-18 19:54:01 +08:00
Li Wang
01592515b8 [Bugfix] Fix sleep mode level 2 (#1376)
### What this PR does / why we need it?
For sleep mode level 2, we discarded model both weights and kv_cache,
but the problems is: When we discard weights, we also discard some
tensors representing the model state which we called
`model.named_buffers()`, such as: `running_mean / running_var` in
BatchNorm、rope cos-sin cache ... when we update weights, but forgot to
update buffers as well, this will lead to some unknown issue
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
5963b98b46

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-18 19:51:52 +08:00
realliujiaxu
af2a886814 refactor linear (#2867)
### What this PR does / why we need it?
The current linear.py has the following issues:

- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.

- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.

To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the classes selection logic into
common functions. For future additions of custom communication groups or
forward methods, it will only be necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection
logic.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
dd39baf717

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: weijinqian0 <weijinqian@huawei.com>
2025-09-18 14:09:19 +08:00
panchao-hub
a7f8ed38ed [Bugfix]:replace npu_incre_flash_attention with npu_fused_infer_atten… (#2901)
### What this PR does / why we need it?
[Bugfix]:replace npu_incre_flash_attention with
npu_fused_infer_attention_score in order to be able to tiling update

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2025-09-18 14:06:08 +08:00