Commit Graph

599 Commits

Author SHA1 Message Date
panchao-hub
1756efa5fd [Feat][Graph] Support FULL_DECODE_ONLY mode for MLA models (#3125)
### What this PR does / why we need it?
Adds support for capturing the Multi-head Latent Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.

Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.

Known issues:
- Currently, MTP is not supported in FULL_DECODE_ONLY mode; we're
working on a fix
- We are preparing to remove update_mla_attn_params with
auto_dispatch_capture

### Does this PR introduce _any_ user-facing change?
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
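
A minimal sketch of how this setting might be supplied on the command line, assuming vLLM's `--compilation-config` flag accepts a JSON string. The model name `my-mla-model` and the helper `build_serve_command` are placeholders introduced for illustration only; they are not part of the PR.

```python
import json
import shlex

# Setting quoted from the PR description.
compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}


def build_serve_command(model: str, config: dict) -> str:
    """Return a `vllm serve` command line that carries the config as JSON."""
    args = ["vllm", "serve", model, "--compilation-config", json.dumps(config)]
    # shlex.join quotes the JSON so that a shell passes it as one argument.
    return shlex.join(args)


# "my-mla-model" is a placeholder, not a real model identifier.
command = build_serve_command("my-mla-model", compilation_config)
print(command)
```

In offline use, the same dictionary would presumably be passed as the `compilation_config` keyword shown above.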
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-10 16:31:20 +08:00
wangxiyuan
ba19dd3183 Revert PTA upgrade PR (#3352)
We noticed that torch_npu 0919 doesn't work. This PR reverts the related
changes that rely on the 0919 version.
Reverted PRs: #3295, #3205, #3102

Related: #3353

- vLLM version: v0.11.0
2025-10-10 14:09:53 +08:00
MengLong Chen
6ae75933da [Feat] Load balance of tokens across experts in dummy_run (#3184)
### What this PR does / why we need it?
Because of the special input data used during the dummy run, the majority of
tokens are distributed to DP0TP0, which leaves insufficient available KV
cache on DP0TP0.
This PR changes the `topk_ids` of the dummy_run input from all zeros to
random values.
This is a naive implementation of expert load balancing that avoids
accumulating too many tokens on a single rank.
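
The effect described above can be illustrated with a small pure-Python simulation. This is only a sketch: `make_topk_ids` and `tokens_per_expert` are hypothetical helpers, and the token, top-k, and expert counts are arbitrary values chosen for the example, not the ones used by vllm-ascend.

```python
import random
from collections import Counter


def make_topk_ids(num_tokens, top_k, num_experts, randomize, seed=0):
    """Build dummy routing ids: all zeros, or distinct random experts per token."""
    if not randomize:
        return [[0] * top_k for _ in range(num_tokens)]
    rng = random.Random(seed)
    return [rng.sample(range(num_experts), top_k) for _ in range(num_tokens)]


def tokens_per_expert(topk_ids):
    """Count how many (token, slot) assignments each expert receives."""
    return Counter(expert for row in topk_ids for expert in row)


# Arbitrary illustrative sizes: 256 tokens, top-8 routing, 64 experts.
zeros = tokens_per_expert(make_topk_ids(256, 8, 64, randomize=False))
rand = tokens_per_expert(make_topk_ids(256, 8, 64, randomize=True))

print("all-zero ids: busiest expert receives", max(zeros.values()), "assignments")
print("random ids:   busiest expert receives", max(rand.values()), "assignments")
```

With all-zero ids every assignment lands on expert 0, whereas random ids spread the same total across many experts.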

### How was this patch tested?
model: DeepSeek-v3-w8a8
```bash
vllm serve DeepSeek-v3-w8a8 \
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 2 \
    --tensor-parallel-size 8 \
    --quantization ascend \
    --seed 1024 \
    --enforce-eager \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --disable-log-stats \
    --max-num-seqs 18 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --additional-config \
    '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}' 
```

After enabling load balance:

Available memory: **2728672256** -> **6771544064**
KV cache size: **38144** -> **95232** tokens


- vLLM version: v0.11.0

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-10-10 09:00:07 +08:00
XiaoxinWang
579b7e5f21 add pagedattention to support FULL_DECODE_ONLY. (#3102)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-10 08:50:33 +08:00
offline893
1c2c72af8d [bugfix]change log2phy map to npu (#3339)
### What this PR does / why we need it?
Resolved the issue of EPLB failure caused by changes in the log2phy map
due to device type modifications when using MTP rotation position
encoding.

### Does this PR introduce any user-facing change?

### How was this patch tested?
https://github.com/vllm-project/vllm/commit/releases/v0.11.0


- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-10 08:47:55 +08:00
fems14
55e23fabec [bugfix] Fix connector registration failure (#3335)
### What this PR does / why we need it?
Register the connector in the plugin
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-09 21:09:54 +08:00
Ruri
ff37575936 [1/N][Feat] Add weight prefetch feature for Attention layers (#3146)
### What this PR does / why we need it?

- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention
modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Add a new config in `--additional-config` for configuration:
```json
{
    "weight_prefetch_config": {
        "enabled": false,
        "prefetch_ratio": {
            "attn": {
                "qkv": 1.0,
                "o": 1.0
            }
        }
    }
}
```
This feature is enabled by default, and can be disabled through this
configuration
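
As a hedged sketch, the configuration could be built in Python and serialized for `--additional-config`, which takes a JSON string (as in the `vllm serve` invocations elsewhere in this log). The variable names are illustrative. The snippet also checks that Python's strict JSON parser rejects trailing commas, which matters when writing the value by hand.

```python
import json

# Same structure as the configuration described above.
additional_config = {
    "weight_prefetch_config": {
        "enabled": False,
        "prefetch_ratio": {"attn": {"qkv": 1.0, "o": 1.0}},
    }
}

# Serialize once and pass the resulting string to --additional-config.
additional_config_arg = json.dumps(additional_config)
print("--additional-config", repr(additional_config_arg))

# Strict JSON does not allow a trailing comma before a closing brace.
try:
    json.loads('{"qkv": 1.0, "o": 1.0,}')
    trailing_comma_accepted = True
except json.JSONDecodeError:
    trailing_comma_accepted = False
```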

### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Co-authored-by: yuzhup <15705211260@163.com>
2025-10-09 20:38:39 +08:00
huangdong2022
23db56a340 [Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with norm bias (#3205)
### What this PR does / why we need it?
1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate
add_rms_norm and quant ops in quantization scenarios.
2. The torch_npu.add_rms_norm_quant op fixes accuracy when model weights
are quantized with anti_method m4. m4 quantization is an asymmetric outlier
suppression method that generates a non-zero norm bias, so the
add_rms_norm_quant op is updated to take this parameter into account.

### Does this PR introduce _any_ user-facing change?
please use a torch_npu version >= torch_npu-2.7.1.dev20250919

### How was this patch tested?
1. No special parameters or new environment variables need to be set.
2. Tested with Qwen3 MoE quantized models, such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Signed-off-by: h30027576 <huangdong51@huawei.com>
2025-10-09 20:18:10 +08:00
zouyida2052
81aff9c555 bugfix for mtp (#3300)
### What this PR does / why we need it?
When mtp > 1, we need to refresh cos and sin at each step.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.11.0

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-09 19:22:46 +08:00
Wang Yixuan
30c5d947c3 [bugfix]fix multistream moe in torchair (#3164)
### What this PR does / why we need it?

Multistream MoE in torchair is only validated for decode and can't be
applied to chunked prefill, so this PR adds checks to isolate that
scenario.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-09 19:00:32 +08:00
weichen
94dd832815 [MoE] [Refactor] Combine common_fused_moe and fused_moe (#3176)
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:

1. Enable/Disable EP
2. Aclgraph & eager
3. SP


- vLLM version: v0.11.0

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-10-09 14:12:46 +08:00
Li Wang
a36e3da78e [Misc] Drop 0102 related lines (#3323)
### What this PR does / why we need it?
Since https://github.com/vllm-project/vllm-ascend/pull/3284 has been merged,
this PR discards extra code that was previously added for version
compatibility.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-09 14:10:57 +08:00
wangxiyuan
1c5b302f0d [Misc] Clean up useless patch (#3320)
### What this PR does / why we need it?
1. Clean up v0.10.2 support in UT and e2e tests.
2. Remove the v0.11.0 periodic job; we're at v0.11.0 now.
3. Remove useless patches for DeepSeek V3.2. They have already been done in
vLLM.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 14:07:26 +08:00
wangxiyuan
a43e2f61e1 [CI] Update vLLM to v0.11.0 (#3315)
### What this PR does / why we need it?
There are 3 steps to upgrade vllm-ascend to the newest vLLM. We'll create 3
PRs:

- [x] Upgrade vLLM to v0.11.0 to make CI happy first.
- [ ] Move DeepSeek V3.2 to the vLLM way.
- [ ] Then add a new PR to add vLLM main support.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:41:19 +08:00
wangxiyuan
f12f76d7ba Drop 0.10.2 (#3284)
Drop v0.10.2 support, we support vLLM 0.11.0rc3 now.
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:28:38 +08:00
weijinqian0
474fa737c8 [bugfix] Fix moe bug: allgather error. (#3279)
DeepSeek models crash when executed on A2.


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-30 18:45:09 +08:00
Chao Lei
a486ff8c11 KVCache Transfer via Layer-wise Strategy in Disaggregation (#2602)
### What this PR does / why we need it?
See RFC: https://github.com/vllm-project/vllm-ascend/issues/2470. This PR
adds a new KV connector for layer-wise KV transfer.

### Does this PR introduce _any_ user-facing change?
Yes, a new KV connector is added. Users can now use the layer-wise feature.
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Signed-off-by: CaveNightingale <2859066733@qq.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: hanxinlong <50882499@qq.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: CaveNightingale <2859066733@qq.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: hanxinlong <50882499@qq.com>
2025-09-30 15:10:29 +08:00
Mengqing Cao
f8c93d8d24 [Aclgraph][DP] Fix dp dummy run not in aclgraph error (#3208)
### What this PR does / why we need it?
When running DP in an unbalanced scenario, in which some DP groups execute
`dummy_run`, we need to make sure they run in the same mode as the other DP
groups, which improves performance in the DP scenario.

### How was this patch tested?
Tested by adding log in `_dummy_run`

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-30 11:14:51 +08:00
Angazenn
ddf4d53ca3 [bugfix] Fix bugs in _dummy_run and re-initialize kv-cache. (#3262)
### What this PR does / why we need it?
Currently we run an extra profile_run with `num_tokens ==
self.mc2_tokens_capacity`. However, when `max_num_batched_tokens
< self.mc2_tokens_capacity`, this triggers an assertion error, because
`_dummy_run` requires num_tokens to be smaller than
`max_num_batched_tokens`. This PR skips the extra `profile_run` if
`self.max_num_tokens <= self.mc2_tokens_capacity` to avoid this
bug.

This PR also fixes a bug in which `kernel_block_sizes` never equals
`[self.cache_config.block_size]`. `kernel_block_sizes` is of type
List[List[int]], so the condition should be `kernel_block_sizes !=
[[self.cache_config.block_size]]`. This also helps to resolve an issue
where cpu_offload_gb cannot be enabled.
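
The list-nesting bug in the second paragraph can be reproduced in isolation. In this sketch, `block_size` is a stand-in for `self.cache_config.block_size`, and the value 128 is arbitrary.

```python
from typing import List

block_size = 128  # stand-in for self.cache_config.block_size (arbitrary value)
kernel_block_sizes: List[List[int]] = [[block_size]]

# Buggy check: a nested list never equals a flat list, so "!=" is always True.
buggy_mismatch = kernel_block_sizes != [block_size]

# Fixed check: compare against a list with the same nesting depth.
fixed_mismatch = kernel_block_sizes != [[block_size]]

print(buggy_mismatch, fixed_mismatch)
```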

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: Angazenn <supperccell@163.com>
2025-09-30 10:54:14 +08:00
wangxiyuan
00ba071022 [Doc] Release note for v0.11.0rc0 (#3224)
### What this PR does / why we need it?
Add release note for v0.11.0rc0

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-30 03:26:18 +08:00
wangxiyuan
81bd6e4c99 Add DeepSeek V3.2 support (#3270)
### What this PR does / why we need it?

This PR adds initial DeepSeek V3.2 support with [vLLM
v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0)
(not released yet). We will complete the vLLM adaptation as soon as
possible. This feature will be ready within the next 1-2 days.

Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 .

### Does this PR introduce _any_ user-facing change?
Yes!

### How was this patch tested?
CI passed; the DeepSeek doc will be run soon.


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-30 03:25:58 +08:00
Icey
83092d9b8b [BugFix] Fix Qwen3-Next because of vllm #24982 (#3221)
- Fixes Qwen3-Next because of vllm #24982

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-29 15:27:30 +08:00
LeeWenquan
69cc99d004 Add restriction conditions to the ApplyTopPTopK operator (#3254)
### What this PR does / why we need it?
Add restriction conditions to the ApplyTopPTopK operator: 1 <= K <= 1024
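
A minimal sketch of such a range check, using only the bounds quoted above. The function name `top_k_in_supported_range` is hypothetical, and what happens when K falls outside the range is not stated in the PR description, so it is left out here.

```python
TOP_K_MIN, TOP_K_MAX = 1, 1024  # bounds quoted in the PR description


def top_k_in_supported_range(k: int) -> bool:
    """Hypothetical guard: True when k satisfies 1 <= k <= 1024."""
    return TOP_K_MIN <= k <= TOP_K_MAX


for k in (0, 1, 40, 1024, 1025):
    print(k, top_k_in_supported_range(k))
```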
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-29 14:04:58 +08:00
无脸男
373f84a193 [Bugfix] Fix the error "cur batch_size is invalid" during profile_run in the torchair scenario (#3243)
### What this PR does / why we need it?
Fix the error "cur batch_size is invalid" during profile_run in the
torchair scenario.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: WithHades <244036962@qq.com>
2025-09-29 11:51:07 +08:00
weijinqian0
8870966031 [bugfix] Fix warning bug: model config is None. (#3238)
Clean up the erroneous warning log: model config is None.

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-29 09:44:49 +08:00
Mengqing Cao
050d202bb9 [Quickfix] Fix dp+ep+tp error when sp chunked the hidden_states (#3246)
### What this PR does / why we need it?
Fix dp+ep+tp inplace copy error when sp chunked the `hidden_states`.


### How was this patch tested?
Tested locally with the following script:
```bash
python examples/offline_data_parallel.py \
        --model="Qwen/Qwen3-30B-A3B" \
        --dp-size=2 \
        --tp-size=2 \
        --enable-expert-parallel
```

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-29 09:12:49 +08:00
whx
14d4ed5f0c [BugFix] Fix aclgraph accu problem in A2. (#3163)
This PR fixes an accuracy problem of aclgraph on A2. The problem was
introduced by PR #2980, which exposed the `all_reduce` of shared_experts
to torch dynamo. This PR moves all of that code into forward_impl
to shield it from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
17b4c6685c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-28 21:31:55 +08:00
socrahow
c3fee66806 [Model] Optimizing gemma3 model's GemmaRMSNorm function (#3151)
### What this PR does / why we need it?
Before optimizing, the rmsnorm time in one decoding step is 531.5us. After
optimizing, the rmsnorm time in one decoding step is 105us.
I closed the previous
PR (https://github.com/vllm-project/vllm-ascend/pull/2456) by mistake and
have resubmitted it now.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: socrahow <suzihao4@h-partners.com>
2025-09-28 21:19:10 +08:00
Icey
dd56e9306b [3/N][Refactor][Qwen3-Next] Refactor model structure and fix bug by vllm #25400 (#3142)
### What this PR does / why we need it?
Refactor the model structure in qwen3_next.py to reduce the number of lines of code.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-28 21:14:36 +08:00
Mengqing Cao
4ff422c730 [CI][Bugfix] Quickfix for DPMetaData (#3234)
### What this PR does / why we need it?
Fix `dpmetadata` and `Qwen3MoeSparseMoeBlock` break introduced by
26a7a33b88 (diff-c1550d0a38469d039370567d8981969530cbfffc7302cd1778e7c2c8a9322dea)

NOTE: vllm-ascend maintains a different SP implementation from vLLM, so we
can just use `cu_tokens_across_sp(1)` as `cu_tokens_across_dp_cpu`

close https://github.com/vllm-project/vllm-ascend/issues/3236,
https://github.com/vllm-project/vllm-ascend/issues/3239
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-28 21:11:22 +08:00
fan2956
f2d8493221 [BugFix] Fix ascend scheduler assert error (#3191)
### What this PR does / why we need it?
Running a multimodal model with the ascend scheduler may cause an assertion error:
`assert (request.num_tokens - request.num_computed_tokens) == 1`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
17b4c6685c

---------

Signed-off-by: fan2956 <zhoufan53@huawei.com>
2025-09-28 18:22:08 +08:00
Icey
68c5401ad6 [Eagle] Fix attn_mask index out of range in high concurrency situations (#3187)
### What this PR does / why we need it?
- Fixes a bug where multiple calls (possibly more than 100) to eagle3-qwen3-8b often incur an "attn_mask index out of range" error

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --served-model-name Eagle3 --port 8000  --model Qwen/Qwen3-8B   --seed 42     -tp 1  --speculative_config '{"model": "Tengyunw/qwen3_8b_eagle3", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
```

Co-authored-by: liuruijin17
[ricklrj@outlook.com](mailto:ricklrj@outlook.com)
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

Signed-off-by: Icey <1790571317@qq.com>
2025-09-28 18:09:26 +08:00
lilinsiman
1705501ae2 [BugFix] Fix ACLgraph bug in Qwen3_32b_int8 case (#3204)
### What this PR does / why we need it?
1. Solved the issue where sizes capture failed for the Qwen3-32b-int8
model when aclgraph, dp1, and tp4 were enabled.
2. Added the exception thrown when sizes capture fails and provided a
solution
3. Add this common problem to the FAQ doc
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-09-28 17:44:04 +08:00
Zetong Li
a86ece5e39 [Bugfix][LoRA] Fix forward error and shape mismatch when using LoRA (#3153)
### What this PR does / why we need it?
Relying on #3044, this PR aims to further fix:
1. The forward error occurred when `LogitsProcessorWithLoRA` calls
`AscendLogitsProcessor.forward`. Since `LogitsProcessorWithLoRA`
bypasses the MRO to call it, `super().forward(...)` in
`AscendLogitsProcessor.forward` will raise an error. This PR fixes it by
directly invoking `LogitsProcessor.forward(self, ...)`;
2. The shape mismatch in `add_lora_logits` in punica_npu.py. The
`lora_a_stacked` and `lora_b_stacked` are organized as [num_loras, 1,
lora_rank, hidden_size] and [num_loras, 1, vocab_size, lora_rank] shapes
respectively, but they are misunderstood in #1583---the last two
dimensions were assumed in reverse order, which causes errors in
`bgmv_shrink` and `bgmv_expand`. This PR fixes it by reverting it to the
previous version to align with the implementation in punica_cpu.py in
vllm.
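
The first fix can be illustrated with a small self-contained sketch. The classes below are simplified stand-ins, not the real vLLM or vllm-ascend classes: `LoRAWrapper` plays the role of a wrapper that calls another class's `forward` with itself as `self`, which is the situation in which zero-argument `super()` fails.

```python
class LogitsProcessor:
    def forward(self, x):
        return x + 1


class AscendLikeProcessor(LogitsProcessor):
    def forward(self, x):
        # Zero-argument super() requires `self` to be an instance of this class.
        return super().forward(x) * 2


class AscendLikeProcessorFixed(LogitsProcessor):
    def forward(self, x):
        # Calling the base method directly places no constraint on type(self).
        return LogitsProcessor.forward(self, x) * 2


class LoRAWrapper:
    """Not a subclass of the processors; calls their forward() unbound."""

    def call(self, processor_cls, x):
        return processor_cls.forward(self, x)


wrapper = LoRAWrapper()
try:
    wrapper.call(AscendLikeProcessor, 1)
    super_error = None
except TypeError as exc:  # raised by super() because wrapper is not a subtype
    super_error = exc

fixed_result = wrapper.call(AscendLikeProcessorFixed, 1)
print(type(super_error).__name__, fixed_result)
```

The explicit `LogitsProcessor.forward(self, ...)` call trades the flexibility of cooperative inheritance for independence from the type of `self`.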

### Dependencies
This PR depends on changes introduced by #3044 (LoRA support for
`AscendQKVParallelLinear` and `AscendMergedQKVParallelLinear` layers).

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
The LoRA-related tests, e.g., test_ilama_lora.py and
test_ilama_lora_tp2.py, use ilama-3.2-1B, and this model is regarded as
`TransformersForCausalLM`, where `embedding_modules` attribute lacks
`lm_head`. However, `LlamaForCausalLM` and most other models include
both `embed_tokens` and `lm_head` in `embedding_modules`. This attribute
contributes to `supported_lora_modules` when using LoRA in vllm.
Therefore, without `lm_head` in `embedding_modules`, current tests using
ilama-3.2-1B are unable to find the above errors since
`LogitsProcessorWithLoRA` replacing `lm_head` is skipped. Simply using
Meta-Llama-3.1-8B-Instruct can reproduce the above errors and check
whether these fixes can work. What's more, it's necessary to add more
comprehensive tests for LoRA.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: Zetong Li <slippersss@126.com>
2025-09-28 17:30:50 +08:00
Peipei
3d21ed9ee8 [Bugfix]Fix quant_config input parameter bug in qwenvl series (#3220)
### What this PR does / why we need it?
Fix quant_config input parameter bug in qwenvl series. Currently,
non-instantiated variables should be passed.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: booker123456 <945658361@qq.com>
2025-09-28 14:08:24 +08:00
Wang Kunpeng
859e861d92 [main][quantization] Support deepseek w4a8 per-channel quantization (#3011)
### What this PR does / why we need it?
1. Support DeepSeek w4a8 per-channel quantization
2. The eager mode supports converting weights to the NZ format
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim

##### Installation steps

git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### Generate w4a8 per-channel weights

cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-09-27 21:01:16 +08:00
wangxiyuan
e9359bd8fa [CI] Pin vLLM to releases/v0.11.0 (#3211)
### What this PR does / why we need it?
- Pin vLLM commit to releases/v0.11.0 branch.
- Fix the breaking change introduced by vLLM commit
d4d9899860

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
17b4c6685c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-27 10:41:48 +08:00
yupeng
9caf6fbaf5 [Bugfix][LoRA] Fix LoRA bug after supporting Qwen3-Next (#3044)
### What this PR does / why we need it?
LoRA e2e test uses ilama-3.2-1B model. It uses transformers.py model
files. Its self-attention layer names end with "\*.attn", not
"\*.self_attn".

The attention layer names of some other models also end with "\*.attn", such
as baichuan.py and bert.py.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.2
- vLLM main:
17b4c6685c

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-09-26 11:12:45 +08:00
realliujiaxu
d8a9cb8458 [Bugfix] fix bug when tp=1 (#3193)
### What this PR does / why we need it?
Addresses a bug in DenseOptimRowParallelOp that occurs when tensor
parallelism is not used
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
2025-09-26 10:55:32 +08:00
zouyida2052
b72e3327a6 bugfix for mtp>1 (#3174)
### What this PR does / why we need it?
fix bugs when mtp>1, and reorder input batch when mtp is not accepted.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by ci

- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-09-26 09:04:16 +08:00
无脸男
69509bcdd6 [bugfix] fix oom in aclgraph (#3158)
### What this PR does / why we need it?
fix oom in aclgraph.

1. In the current token dispatch implementation, tensors are mounted on
class instances to facilitate parameter passing between different
methods. This approach prevents automatic recycling of these tensors. In
some cases, it may lead to out-of-memory error. To address this issue,
we manually set these tensors to None to release corresponding memory.

2. The `profile_run` method is designed to accurately estimate the
maximum NPU memory usage during vLLM inference. However, in certain
scenarios, MoE models perform inference via MC2, which includes
communication and consumes additional NPU memory. This leads to
inaccurate estimation by the profile run. We address this by actively
triggering MC2 during the profile run for initialization.
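
The first point can be sketched with plain Python objects. This is an illustration only: `Buffer` stands in for a large tensor, `TokenDispatcher` and its methods are hypothetical names, and the immediate release relies on CPython reference counting.

```python
import gc
import weakref


class Buffer:
    """Stand-in for a large tensor; plain objects support weak references."""

    def __init__(self, nbytes: int):
        self.nbytes = nbytes


class TokenDispatcher:
    """Hypothetical dispatcher that stashes intermediates on the instance."""

    def dispatch(self):
        # The attribute keeps the buffer alive for as long as `self` lives.
        self.hidden_states = Buffer(1 << 20)
        return weakref.ref(self.hidden_states)

    def combine(self):
        # Dropping the reference manually lets the buffer be reclaimed.
        self.hidden_states = None


dispatcher = TokenDispatcher()
ref = dispatcher.dispatch()
alive_before = ref() is not None

dispatcher.combine()
gc.collect()
alive_after = ref() is not None
print(alive_before, alive_after)
```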

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

Signed-off-by: WithHades <244036962@qq.com>
2025-09-26 08:57:47 +08:00
Ronald
621aa7d270 fix error async_scheduler can't be enabled (#3127)
### What this PR does / why we need it?
PR #2894 makes ascend_scheduler_config.enabled always `True` for
non-MLA models. When `ascend_scheduler_config.enabled=True`, it
always initializes `AscendScheduler`, which is a subclass of `Scheduler`.
But when we enable async_scheduling, we need to initialize
`AsyncScheduler` in vLLM, so async_scheduling can't be
enabled.

### Does this PR introduce _any_ user-facing change?
not-related

### How was this patch tested?
When the user sets `async_scheduling`, it means the user doesn't want to use
`AscendScheduler`, so we shouldn't set `ascend_scheduler_config.enabled
= True`.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-09-26 08:51:54 +08:00
florenceCH
14497b748d Remove qwen3 moe MC2 cumsum & cast (#3126)
### What this PR does / why we need it?
The Qwen3 MoE MC2 graph currently contains two redundant operators:
cumsum and cast operations are added after
npu_moe_distribute_dispatch_v2. By using
expert_token_nums_type=0 and not converting weight_scale to float32,
these two operators can be eliminated, thereby improving inference
performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: florenceCH <gaoxiang120@huawei.com>
Co-authored-by: florenceCH <gaoxiang120@huawei.com>
2025-09-26 08:51:30 +08:00
wangxiyuan
2930e4a6bd [CI] Upgrade vllm to newest commit (#3182)
### What this PR does / why we need it?
Upgrade vLLM to newest commit

- Fix the problem that aclgraph doesn't work, caused by
24fab45d96
- Fix PoolerOutput import error, caused by
755ed7b05b
- Fix the aclgraph weight load error to stay consistent with the torchair fix.
4492e3a554

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All tests should pass


- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-26 06:18:15 +08:00
wangxiyuan
0794f64a18 Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist (#3194)
…to avoid unintentional copy ops blocking across different NPU streams,
improving disagg TTIT/TTFT (#2788)"



### What this PR does / why we need it?
This reverts commit 6995a7bc5b. We'll add
it back once the issue is fixed.

related issue: https://github.com/vllm-project/vllm-ascend/issues/3195

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
2025-09-26 06:17:36 +08:00
Peipei
31dda3f557 [Model]Add support for qwen3_vl and qwen3_vl_moe (#3103)
### What this PR does / why we need it?
This PR is for the adaptation and optimization of qwen3_vl and
qwen3_vl_moe on the Ascend platform.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: booker123456 <945658361@qq.com>
2025-09-25 18:50:12 +08:00
MengLong Chen
07f4710216 [BugFix] Fix dummy_run memory explosion in eager mode (#3132)
### What this PR does / why we need it?

It is a quick bugfix for the memory explosion issue that requires
further refactoring.
The dummy_run in eager mode may lead to OOM and the reason is that
`hidden_states` were not released in time.
The PR temporarily resolves the issue by manually clearing the cache,
and further refactoring will be conducted subsequently.

Before the modification, the dummy_run's memory showed an accumulation
issue.
<img width="1796" height="207" alt="image"
src="https://github.com/user-attachments/assets/05e2b04c-2f99-4085-9eda-c78b7d9a57b0"
/>

After modification, it can be observed that the memory is released
promptly.
And it was verified that the model responded normally after a single
data input.


- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-09-25 16:09:44 +08:00
Icey
2a9d02e080 [Bugfix] eagle and eagle3 spec decode failures and enable e2e test (#2979)
### What this PR does / why we need it?
- Fix the bug https://github.com/vllm-project/vllm-ascend/issues/2978
- Enable e2e test,
- Adapt to scenarios where Speculative tokens are greater than 2,
- Fix the bug that causes Eagle3 inference failures under high
concurrency and improve the acceptance rate of draft models, by
https://github.com/vllm-project/vllm-ascend/pull/2794

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with new added/existing test.

Co-authored-by: hukongyi
[hukongyi@cmbchina.com](mailto:hukongyi@cmbchina.com)
Co-authored-by: guanyuzhu
[zhuguanyu@huawei.com](mailto:zhuguanyu@huawei.com)
Co-authored-by: liumail680
[liumail680@163.com](mailto:liumail680@163.com)


- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-25 14:39:12 +08:00
wangxiyuan
ac1c2cd9ac [CI] Upgrade vllm version - 0925 (#3167)
Upgrade vLLM to newest commit.

1. Remove the useless func get_state_cls, it has been removed from vLLM
already.
e6750d0b18
2. Fix ut broken by
6160ba4151


- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-25 14:20:10 +08:00
mfyCn-1204
33c118c80e [core]vllm-ascend support msMonitor tool (#3123)
### What this PR does / why we need it?
vllm-ascend now supports the
[msMonitor](https://gitcode.com/Ascend/mstt/tree/master/msmonitor) tool
for collecting performance data of vllm-ascend.

### Does this PR introduce _any_ user-facing change?
1. Adds the env variable MSMONITOR_USE_DAEMON.
2. Users can enable the msMonitor tool by setting MSMONITOR_USE_DAEMON=1
before running a vllm-ascend model.
3. MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set.
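
The three rules above can be sketched as a small validation function. The function name `check_profiling_env` and the choice of `ValueError` are assumptions made for illustration; the PR description only states that the conflicting combination raises an error.

```python
from typing import Mapping


def check_profiling_env(env: Mapping[str, str]) -> bool:
    """Return True if msMonitor should be enabled; reject the conflicting combination."""
    use_daemon = env.get("MSMONITOR_USE_DAEMON", "0") == "1"
    profiler_dir = env.get("VLLM_TORCH_PROFILER_DIR")
    if use_daemon and profiler_dir:
        # Exception type is an assumption; the source only says an error is raised.
        raise ValueError(
            "MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set"
        )
    return use_daemon


print(check_profiling_env({}))
print(check_profiling_env({"MSMONITOR_USE_DAEMON": "1"}))
```

In a real process the mapping would typically be `os.environ`; a plain dictionary is used here so the sketch stays self-contained.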

### How was this patch tested?
1. Run a vllm-ascend model without setting MSMONITOR_USE_DAEMON=1, or with
MSMONITOR_USE_DAEMON=0: the model runs successfully.
2. Run a vllm-ascend model with MSMONITOR_USE_DAEMON=1, then run the msMonitor
tool to collect profile data.
3. Run a vllm-ascend model with both MSMONITOR_USE_DAEMON=1 and
VLLM_TORCH_PROFILER_DIR set: an error is raised.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: mei-feiyao <1332490378@qq.com>
2025-09-25 14:15:02 +08:00