1665 Commits

Author SHA1 Message Date
Mengqing Cao
58db21f56a [DP] Fix dp padding logic in dummyrun (#4705)
### What this PR does / why we need it?
Fix the DP padding logic in dummy_run. After
https://github.com/vllm-project/vllm/pull/28579, `num_tokens` is
padded in `CudagraphDispatcher`, so we also need to apply the same
padding in `dummy_run`.
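
For reference, a minimal sketch of the kind of padding `dummy_run` now has to mirror (function and variable names are illustrative, not the actual vllm-ascend code):
```python
import bisect

def pad_to_capture_size(num_tokens: int, capture_sizes: list[int]) -> int:
    """Round num_tokens up to the smallest captured graph size that fits.

    Mirrors the padding applied by CudagraphDispatcher so that dummy_run
    warms up the same padded shape; if no capture size fits, keep the raw size.
    """
    sizes = sorted(capture_sizes)
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else num_tokens

# With --compilation-config '{"cudagraph_capture_sizes":[96], ...}':
print(pad_to_capture_size(48, [96]))  # -> 96
```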

### How was this patch tested?
Tested locally with the following scripts:
```bash
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \
         --model wemaster/deepseek_mtp_main_random_bf16 \
         --trust-remote-code \
         --data-parallel-size 4 \
         --tensor-parallel-size 1 \
         --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
         --enable-expert-parallel
```
```bash
vllm bench serve --model wemaster/deepseek_mtp_main_random_bf16 --endpoint /v1/completions --dataset-name random --random-input 512 --random-output 100 --num-prompts 48 --request-rate 1 --ready-check-timeout-sec 0
```

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-08 20:32:35 +08:00
shaopeng-666
9766cf9128 fix qwen3vl mrope op (#4484)
### What this PR does / why we need it?
The Qwen2.5-VL mrope precision problem will be solved once this PR is
merged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test on G8600 with textVQA dataset

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 19:19:17 +08:00
zengzengran
f0876b5d88 [Bugfix] Fix Dcp dimension mismatch when enable Mlapo (#4687)
### What this PR does / why we need it?
After enabling Mlapo together with DCP, a dimension mismatch occurs
during the subsequent forward pass, because Mlapo has its own
mla_preprocess logic and does not perform the additional all_gather
operations on the DCP group.

### Does this PR introduce _any_ user-facing change?

N/A


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zengran <zengran2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 17:19:58 +08:00
dsxsteven
96ea0e078f [EPLB] Add log Info for moe_load Imbalance Ratio (#4482)
### What this PR does / why we need it?
Add log info for the MoE load imbalance ratio.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-08 14:28:13 +08:00
ZYang6263
a433f3280a [Op] DeepSeekV3.2 support bmm_transpose operator (#4631)
### What this PR does / why we need it?
Support the bmm_transpose operator for DeepSeekV3.2.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: ZYang6263 <50876451+ZYang6263@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 14:03:38 +08:00
wangxiyuan
0b65ac6c4b remove useless patch (#4699)
patch_config is useless now. Let's remove it.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-08 11:02:42 +08:00
zzhxxx
866347a621 Deepseek Mtp model uses the lm_head and embedding from the main model (#2790)
### What this PR does / why we need it?
In the Deepseek technical report, it is mentioned that the embedding and
lmhead layers of the MTP layer are shared with the main model, but the
current implementation independently loads the complete embedding and
lmhead. In the Deepseek-R1 model, their weight sizes are 129280*7168 in
fp16 format, which is 1.72G.

This PR fixes the MTP layer to use the lmhead and embedding of the main
model, saving 3.45G of GPU memory in the pure DP scenario.

The current process will first create temporary spaces for the embedding
and lmhead in the mtp layer, then I will call torch.equal to determine
if the two matrices are the same. If they are the same, they will be
reused, and the previous tensor will be released.
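
A minimal sketch of the reuse check described above (names are illustrative; the actual hook lives in the MTP weight-loading path):
```python
import torch
from torch import nn

def maybe_reuse(main_weight: nn.Parameter, mtp_weight: nn.Parameter) -> nn.Parameter:
    """Reuse the main model's tensor when the MTP copy is identical.

    Dropping the reference to the temporary copy lets the duplicated
    embedding / lm_head storage (~1.72 GB each in fp16) be freed.
    """
    if torch.equal(main_weight, mtp_weight):
        return main_weight   # caller rebinds the MTP module to this tensor
    return mtp_weight        # weights differ, keep the independent copy
```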

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 10:33:29 +08:00
Ronald
916a9a1913 fix synchronize error of exceeds_max_model_len d2h copy (#4708)
### What this PR does / why we need it?
There is a D2H copy that blocks CPU operations in the MTP propose method,
which causes a host-bound issue. This PR refactors it to use a CPU tensor
instead.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
vllm main f5d3d93c40417c296c20dc301100e55708a17f3f

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 09:07:59 +08:00
LuLina
2be0fe2691 [Feat] Add Euler xlite graph wrapper support (#4526)
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6) 
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph


### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' (enabled for decode only)
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' (enabled for prefill and decode)

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 08:27:46 +08:00
Yizhou
8fdb689a32 [BugFix] Refactor ACL graph size adjustment for speculative decoding (#4640)
### What this PR does / why we need it?
Move the logic for adjusting ACL graph capture sizes for speculative
decoding from the generic utility module into a dedicated method within
the compilation configuration.

This change improves code organization and encapsulation by making the
compilation configuration responsible for managing its own state. The
model runner now triggers this adjustment directly, providing the
necessary context.
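
A rough sketch of the encapsulation described above; the method name and the scaling rule below are hypothetical, not the actual vllm-ascend code:
```python
from dataclasses import dataclass, field

@dataclass
class CompilationConfig:
    cudagraph_capture_sizes: list[int] = field(default_factory=lambda: [8, 16, 32])

    def adjust_capture_sizes_for_spec_decode(self, num_speculative_tokens: int) -> None:
        # The config manages its own state: each decode step now handles
        # 1 + num_speculative_tokens tokens per request, so scale the sizes.
        factor = 1 + num_speculative_tokens
        self.cudagraph_capture_sizes = [s * factor for s in self.cudagraph_capture_sizes]

# The model runner supplies the context and triggers the adjustment:
cfg = CompilationConfig()
cfg.adjust_capture_sizes_for_spec_decode(num_speculative_tokens=3)
print(cfg.cudagraph_capture_sizes)  # [32, 64, 128]
```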

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-07 17:32:45 +08:00
liziyu
688b1332da [P/D] check kv extra config and del hccl backend (#4547)
### What this PR does / why we need it?
Check the KV extra config and delete the hccl backend.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-07 15:19:42 +08:00
ZYang6263
b91a5f0968 Support DeepSeekV3.2 with MLAPO operator (#4753)
### What this PR does / why we need it?
This PR adds support for the MLAPO operator in DSV3.2. The operator
provides an optimized implementation that avoids redundant q_down
recomputation.
The operator implementation and optimizations were introduced in PR
[#4707](https://github.com/vllm-project/vllm-ascend/pull/4707).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-07 12:40:24 +08:00
AlvisGong
a5163c8c36 [Feat]enable sfa cp for dsv3.2 (#4702)
### What this PR does / why we need it?
RFC: https://github.com/vllm-project/vllm/issues/30055

### How was this patch tested?
1. Enable flashcomm1: export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
2. Enable sfa-cp: --additional-config '{ "enable_sfa_cp": true }'
A combined launch example is shown below.
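
Combined into one launch, the two steps above might look like this (model name and parallel size are illustrative):
```bash
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve deepseek-ai/DeepSeek-V3.2-Exp \
    --tensor-parallel-size 8 \
    --additional-config '{ "enable_sfa_cp": true }'
```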

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: AlvisGong <gwly0401@163.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-06 19:46:41 +08:00
zhaomingyu13
cb42564942 [BugFix] Fix eagle3 accuracy problem when enforce_eager=True (#4521)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=1,
        speculative_config={
            "method": "eagle3",
            "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
            "num_speculative_tokens": 3,
        },
        enforce_eager=True,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-06 17:31:26 +08:00
Ronald
3480094d7c support async mtp (#4511)
### What this PR does / why we need it?
This PR supports async_scheduling for MTP, following vLLM PR
https://github.com/vllm-project/vllm/pull/24799, and fixes some
synchronization problems in vllm-ascend.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-06 17:15:57 +08:00
Zhu Yi Lin
f067623afd [Bugfix] fix mtp and eagle aclgraph bug (#4710)
### What this PR does / why we need it?
fix mtp and eagle aclgraph bug

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: GDzhu01 <809721801@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-06 11:22:57 +08:00
zzzzwwjj
8378f56f53 rm vanilla attn (#4558)
### What this PR does / why we need it?

Remove unused vanilla attn code.

### Does this PR introduce _any_ user-facing change?

NA


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-06 10:53:55 +08:00
weijinqian0
a78f49ea57 [Refactor] 1/N Refactor attention_v1 & extract attention_cp (#4628)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The CP-related functions differ significantly from those of normal
attention, yet the two are tightly coupled.

Steps to isolate PCP and DCP:
(1) Forward class extraction (100%)
(2) Metadata coupling processing
(3) Builder processing

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-06 09:33:28 +08:00
wangxiaoteng888
41fbc5ebc9 [P/D][main] Clean connector history information (#4650)
### What this PR does / why we need it?
Clean connector history information when the node restarts.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-05 16:22:23 +08:00
欧派果奶我还要
a336543977 [Bugfix] fix quant_apply_mlp w1_scale type error & fix getting num_local_expert (#4632)
### What this PR does / why we need it?
Fix bugs introduced by
bc67696a02
1. Fix the error in getting num_local_expert in vllm_adaptor.
2. Fix the w1_scale type error in
moe_mlp.quant_apply_mlp.npu_dequant_swiglu_quant in the w4a8 quantized
scenario.

- vLLM version: v0.12.0

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-05 16:04:24 +08:00
whx
a7f91079b8 [BugFix][Triton] Fix ub overflow bug of sample_recover_tokens_kernel (#4673)
### What this PR does / why we need it?
The original `sample_recover_tokens_kernel` of the reject sampler didn't
tile the vocab-size dim, which causes a UB overflow problem for models
with a big vocab size like DeepSeek. This PR adds tiling to the vocab-size
dim to avoid this problem.

Note that currently we just use an empirical `SUB_BLOCK_SIZE` of `4*1024`
for functionality. If this kernel becomes a performance bottleneck in the
future, we can use Triton autotune to optimize it. What's more, we have to
disable multibuffer for this kernel due to some accuracy issues.
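
Not the actual kernel, but a minimal Triton sketch of the tiling idea: a loop walks the vocab dim in `SUB_BLOCK_SIZE` chunks so each tile fits in UB (the kernel name and row-max reduction are illustrative):
```python
import triton
import triton.language as tl

@triton.jit
def rowwise_max_tiled_kernel(logits_ptr, out_ptr, vocab_size,
                             SUB_BLOCK_SIZE: tl.constexpr):
    # One program per row; walk the vocab dim in SUB_BLOCK_SIZE tiles so the
    # working set never exceeds the unified buffer (UB).
    row = tl.program_id(0)
    acc = tl.full((SUB_BLOCK_SIZE,), float("-inf"), dtype=tl.float32)
    for start in range(0, vocab_size, SUB_BLOCK_SIZE):
        offs = start + tl.arange(0, SUB_BLOCK_SIZE)
        mask = offs < vocab_size
        vals = tl.load(logits_ptr + row * vocab_size + offs,
                       mask=mask, other=float("-inf")).to(tl.float32)
        acc = tl.maximum(acc, vals)
    tl.store(out_ptr + row, tl.max(acc, axis=0))

# launch: one program per row, e.g.
# rowwise_max_tiled_kernel[(num_rows,)](logits, out, vocab_size, SUB_BLOCK_SIZE=4 * 1024)
```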

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-05 15:16:19 +08:00
LookAround0301
b32ef53b3b [long_seq] remove long_seq env (#4660)
### What this PR does / why we need it?
remove env VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL 

- vLLM version: v0.12.0

---------

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: ZhangMingWei716 <2894054457@qq.com>
Co-authored-by: ZhangMingWei716 <2894054457@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-05 10:31:49 +08:00
wangxiyuan
ea54388e19 Drop ascend scheduler (#4623)
It's safe to drop the Ascend scheduler now. The related tests and docs
have already been removed.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-05 09:03:45 +08:00
Chen Chen
ad0607f900 add dispatch_gmm_combine kernel (#3532)
### What this PR does / why we need it?

This PR introduces the Ascend implementation of the
`dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime,
together with follow‑up fixes to ensure the kernel builds and runs
correctly in CI.

- Add full host and device implementation of the `dispatch_ffn_combine`
kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE
routing helpers, and kernel utilities for quantized FFN dispatch.
- Integrate the new kernel with the PyTorch binding
(csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend
runtime (vllm_ascend/ascend_forward_context.py,
vllm_ascend/worker/model_runner_v1.py).
- Extend fused MoE communication and token dispatch support in
`vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new
dispatch path.
- Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py
to support the new FFN dispatch flow.
- Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake
configuration, and include/namespace usage in the new kernel files.
- Add an end‑to‑end nightly test
`tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper
utilities in `vllm_ascend/utils.py` to validate the new kernel.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

---------

Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-04 23:00:59 +08:00
Shanshan Shen
fb15fec662 [MM][Patch] Remove patch for cos/sin cache (#4672)
### What this PR does / why we need it?
Remove patch for https://github.com/vllm-project/vllm/pull/28798.

- vLLM version: v0.12.0

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-04 22:30:06 +08:00
1092626063
b3e1377a92 [Fix] ops gatingtopk fix nightly ci error (#4340)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm-ascend/pull/2958 generalized the
gatingtopk operator but caused a nightly CI error.
This PR adds a logits check for the gatingtopk op and fixes the nightly
CI.

- vLLM version: v0.12.0

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-04 20:09:21 +08:00
Icey
178ca1607e Adopt inductor fusion and define quantization fusion pass (#4168)
### What this PR does / why we need it?
The main goal of this PR is to alleviate the high maintenance burden from
model duplication when doing model optimization. Some of our optimized
models diverge a little from vLLM's modeling, yet they require rewriting
several parts of the original one, which brings extra maintenance burden
to vllm-ascend. To solve that, we propose to leverage `torch.compile` and
the `inductor pattern matcher` to automatically fuse the patterns we want
to merge. For more details, refer to the RFC
https://github.com/vllm-project/vllm-ascend/issues/4239

This PR fuses `AddRMSNorm` and the `Quant` operator, which can improve the
inference speed of models using `w8a8` quantization.
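
For intuition, the eager-mode pattern being matched roughly corresponds to the following plain-torch sketch (the real pass matches this pattern in the FX graph and replaces it with a single fused NPU kernel; the function and parameter names below are illustrative):
```python
import torch

def add_rmsnorm_then_quant(x, residual, weight, scale, eps=1e-6):
    """The two-op pattern: residual-add RMSNorm followed by static int8 quant."""
    h = x + residual                                             # Add
    h = h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)   # RMSNorm
    h = h * weight
    q = torch.clamp(torch.round(h / scale), -128, 127).to(torch.int8)  # Quant
    return q, h  # quantized activation for the w8a8 linear, plus the new residual
```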

### Does this PR introduce _any_ user-facing change?
Yes, add new additional_config

### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The president of the United States is Mr.",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="/root/.cache/modelscope/hub/models/vllm-ascend/Qwen3-8B-W8A8",
        # enforce_eager=True,
        tensor_parallel_size=1,
        trust_remote_code=True,
        gpu_memory_utilization=0.7,
        quantization="ascend",
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

```text
Prompt: 'The president of the United States is Mr.', Generated text: ' Trump. The president of the United States is Mr. Biden. Which of the following statements is correct? \n\nA. Mr. Trump is Mr. Biden.  \nB. Mr. Trump is not Mr. Biden.  \nC. The president of the United States is not Mr. Trump.  \nD. The president of the United States is not Mr. Biden.\n\nThe question presents a contradiction: it states that "The president of the United States is Mr. Trump" and "The president of'
```


- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-04 10:29:48 +08:00
wangxiyuan
3f4c0ea0a0 upgrade vLLM to 0.12.0 tag (#4647)
Upgrade vLLM to v0.12.0 tag

- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-03 23:43:05 +08:00
amy-why-3459
26e8e58cea [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#4176)
### What this PR does / why we need it?
Support Encoder separation for Encode-Prefill-Decode Disaggregation

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
2025-12-03 20:48:45 +08:00
wangxiyuan
6ece6660ec fix custom ops env set error (#4675)
Move the custom ops registration to the correct place to make CI happy.

- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-03 19:27:38 +08:00
XiaoxinWang
15dc01f050 [Fix] Fix FIA query and query_start_loc shape mismatch error (#4518)
### What this PR does / why we need it?
The FIA operator requires that **query.shape[0]** match
**actual_seq_len[-1]**. In graph mode and multi-DP scenarios, the query is
padded to the size of **num_input_token**, which leads to validation
errors during tiling in the operator. However, since the padding is
applied at the end of the query, it does not affect the actual execution
result of the operator, and the precision remains unaffected.

Our modification pads both **actual_seq_len** and **actual_seq_len_kv** to
resolve the validation issue in the operator.
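
One way to picture the fix (illustrative only; the real change pads both `actual_seq_len` and `actual_seq_len_kv` in the attention metadata, and the exact way the padding is folded in may differ):
```python
import torch

def pad_actual_seq_len(actual_seq_len: torch.Tensor, padded_num_tokens: int) -> torch.Tensor:
    """Append a trailing cumulative entry so actual_seq_len[-1] equals the padded query length."""
    if int(actual_seq_len[-1]) < padded_num_tokens:
        pad = torch.tensor([padded_num_tokens], dtype=actual_seq_len.dtype)
        actual_seq_len = torch.cat([actual_seq_len, pad])
    return actual_seq_len

# query padded from 7 real tokens up to 16 -> the cumulative lengths gain a trailing 16
print(pad_actual_seq_len(torch.tensor([3, 7]), 16))  # tensor([ 3,  7, 16])
```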
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-12-03 17:33:31 +08:00
ZYang6263
7271f0d536 [Feat] MTP support DeepSeekV3.2 (#4465)
### What this PR does / why we need it?
Currently, MTP does not support the DeepSeekV3.2 model. In this PR, we
have enabled this feature.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-12-03 14:24:33 +08:00
LeeWenquan
38bd95229f [Model] Add qwen3Next support in Main (#4596)
### What this PR does / why we need it?
Add Qwen3Next support in main

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-12-03 14:17:37 +08:00
Song Mingyang
18b90b501d [kernel] add AscendC op: lightning_indexer and sparse_flash_attention (#4625)
### What this PR does / why we need it?
Provide high-performance AscendC operators lightning_indexer and
sparse_flash_attention to boost the execution performance of the
DeepSeek v3.2 model. Meanwhile, adapt the two AscendC operators to the
vllm-ascend framework.

### Does this PR introduce _any_ user-facing change?
No (only underlying operator optimizations, with no user-facing changes)

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MingYang119 <songmingyang@huawei.com>
2025-12-03 09:53:10 +08:00
wangxiyuan
7f2673ea2d upgrade vLLM to main (#4608)
1. fix https://github.com/vllm-project/vllm/pull/28542
The model structure modifications we are involved in are:
     - Qwen2.5-VL (some patches still exist)
     - Qwen2-VL
     - Qwen2
     - DeepSeek series
     - Qwen-moe series
2. fix https://github.com/vllm-project/vllm/pull/29121
   the output token type has changed from numpy to `list[list[int]]`
3. fix https://github.com/vllm-project/vllm/pull/29262
    the `xformers` backend for multimodal has been deprecated
4. fix https://github.com/vllm-project/vllm/pull/29342

5. fix https://github.com/vllm-project/vllm/pull/28579
6. fix https://github.com/vllm-project/vllm/pull/28718
7. fix https://github.com/vllm-project/vllm/issues/28665
8. fix https://github.com/vllm-project/vllm/pull/26847
vLLM introduced the `optimization-level`; some default configs have been
changed, and the `--enforce-eager` param has been deprecated
9. fix http://github.com/vllm-project/vllm/pull/29223: the sampler now
returns a tuple.
10. fix https://github.com/vllm-project/vllm/pull/29471: we'll remove the
related patch to avoid this kind of error.

Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2025-12-02 22:10:52 +08:00
Chenxi Qian
4588cdac02 [Bugfix] fix custom op GmmSwigluQuantWeightNzTensorList (#4593)
### What this PR does / why we need it?

1. Fixes the environment path used to locate custom op shared libraries.
2. Uses empty tensor initialization for op outputs instead of
zero-initialization for better efficiency.



- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
2025-12-02 22:02:04 +08:00
FuNanyang
1b5513aa91 [performance] Enhance performance after enabling min_p (#4529)
### What this PR does / why we need it?
When the min_p post-processing parameter is enabled, the original vLLM
implementation introduces the aclnInIndexPutImpl operator, which performs
poorly on NPU.
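
As an illustration of the kind of rewrite that avoids the index-put path (a generic min_p filter in plain torch, not the actual patch):
```python
import torch

def apply_min_p(logits: torch.Tensor, min_p: torch.Tensor) -> torch.Tensor:
    """Filter tokens whose probability falls below min_p * max_prob.

    Uses a dense torch.where mask instead of boolean index assignment,
    which avoids the scatter-style operator that is slow on NPU.
    Shapes: logits [batch, vocab], min_p [batch].
    """
    probs = logits.softmax(dim=-1)
    threshold = probs.amax(dim=-1, keepdim=True) * min_p.unsqueeze(-1)
    return torch.where(probs >= threshold, logits,
                       torch.full_like(logits, float("-inf")))
```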


### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Profiling was collected after enabling min_p; the performance has been
greatly improved.


- vLLM version: v0.11.2

---------

Signed-off-by: funanyang <985619145@qq.com>
2025-12-02 20:35:51 +08:00
wangxiyuan
874097a1de clean up model module (#4611)
The model module is useless now. Let's remove it entirely.

- vLLM version: v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-02 17:35:47 +08:00
whx
96b2cdf6d8 [Ops][Triton] Add a triton kernel supporting partial rope. (#4413)
### What this PR does / why we need it?
This PR adds a Triton rope kernel which supports scenarios where
`rope_dim != head_dim`. This saves the split op before rope and the concat
op after rope. Profiling shows improvement.
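
For context, the eager-mode path the fused kernel replaces looks roughly like this (a self-contained torch sketch with illustrative shapes, not the Triton kernel itself):
```python
import torch

def naive_partial_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rope_dim: int) -> torch.Tensor:
    """Baseline split -> rope -> concat when rope_dim != head_dim.

    q: [num_tokens, num_heads, head_dim]; cos/sin: [num_tokens, 1, rope_dim // 2].
    The fused kernel removes the split before rope and the concat after it.
    """
    q_rope, q_pass = q.split([rope_dim, q.shape[-1] - rope_dim], dim=-1)
    x1, x2 = q_rope.chunk(2, dim=-1)
    q_rope = torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    return torch.cat([q_rope, q_pass], dim=-1)
```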

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
I will add related UTs after CI is integrated with Triton.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-02 17:10:19 +08:00
wangxiyuan
6360eb1dea Revert "[Bugfix] Fix Qwen2.5-Omni-7B accuracy test (#4556)" (#4619)
This reverts commit 71e9b379c8. It breaks the vllm-ascend/Qwen3-30B-A3B-W8A8 test.
2025-12-02 13:15:47 +08:00
offline893
2fa3945112 [Bugfix]Fix eplb enable when using mtp float weights. (#4571)
### What this PR does / why we need it?
Fix EPLB enablement when using MTP float weights. This workaround will be
removed once EPLB supports MTP with float weights.

### How was this patch tested?
Deepseek-V3 + MTP + EPLB in A3.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Signed-off-by: offline893 <158537145+offline893@users.noreply.github.com>
Co-authored-by: offline0806 <3337230449@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-02 09:20:49 +08:00
zhangxinyuehfad
71e9b379c8 [Bugfix] Fix Qwen2.5-Omni-7B accuracy test (#4556)
### What this PR does / why we need it?
Fix the Qwen2.5-Omni-7B accuracy test.
Issue: https://github.com/vllm-project/vllm-ascend/issues/4480
Depends on: https://github.com/vllm-project/vllm-ascend/pull/4534

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-02 09:20:05 +08:00
weijinqian0
b4bf01ead1 [Refactor] Remove redundant attention operator branches. (#4531)
[Refactor] Remove redundant attention operator branches.

Reason:

We replace the other attention ops with fused_infer_attention_score
except in the decode_only state, clean up the code, and remove 310P
support.

https://github.com/vllm-project/vllm-ascend/pull/4455


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-02 09:13:26 +08:00
Shanshan Shen
6b9a997076 [MM][Model] Remove Qwen3-VL modeling files (#4577)
### What this PR does / why we need it?
Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-12-02 07:33:17 +08:00
Wang Kunpeng
a9c4b8604a [main][bugfix] bugfix for qwen3 moe quantization (#4599)
### What this PR does / why we need it?
Fix the issue where the qwen3 moe service cannot be started after
upgrading the vllm version.

Error info:
AttributeError: 'AscendFusedMoE' object has no attribute
'use_dp_chunking'

### Does this PR introduce _any_ user-facing change?
no


- vLLM version: v0.11.2

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-01 23:48:57 +08:00
Slightwind
12ca99c94e [Bugfix] Remove ModelSlim-"M4 Quantization". (#4589)
The M4 quantization method in ModelSlim adds bias to model weights that
originally do not have a linear bias. PR #4235 supported PD-MIX
quantization and M4 quantization, adding bias to `w8a8.py` and
`w8a8_dynamic.py`, and implementing adaptations in `ops/linear.py` to
prevent it from being reset to `None` by
`self.register_parameter("bias", None)`. However, this modification
introduced an issue where the bias was still being reset to `None` in
certain scenarios, causing errors during service startup. Therefore,
support for M4 quantization is temporarily being reverted in this PR.
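
A minimal reproduction of the reset behaviour described above (standard PyTorch semantics, not vllm-ascend code):
```python
import torch
from torch import nn

class TinyLinear(nn.Module):
    def __init__(self):
        super().__init__()
        # bias injected for the M4-quantized checkpoint
        self.bias = nn.Parameter(torch.zeros(4))
        # a later register_parameter("bias", None) silently drops it again
        self.register_parameter("bias", None)

print(TinyLinear().bias)  # None -> the quantization bias is lost
```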
___
- vLLM version: v0.11.2

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-01 23:45:02 +08:00
shaopeng-666
8e7f5cff6d fix qwenvl pd smoke test error (#4597)
### What this PR does / why we need it?
Fix A3 QwenVL PD smoke test error
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Send a curl request to the proxy port; it responds correctly.

- vLLM version: v0.11.2

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-12-01 22:24:59 +08:00
MengLong Chen
143e1f46d0 [Feat] shared expert dp for deepseek_mtp (#3811)
### What this PR does / why we need it?
Support shared expert DP for deepseek_mtp feature. 
`shared_expert_dp` requires `SP==True`, with corresponding parameter
restrictions.
Previously, due to the coupling between `shared_expert_dp` and torchair,
and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of
deepseek_mtp was temporarily removed.
Currently, by performing the `reduce_scatter` on the input of
deepseek_mtp in `mtp_proposer.py`, we ensure that it matches the
dimensions of `input_embedding`, and then perform the `all_gather` on
the output of mtp.
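
A rough sketch of that sequence-parallel pattern (assumes an initialized process group and an evenly divisible token count; names are illustrative):
```python
import torch
import torch.distributed as dist

def mtp_forward_sp(hidden: torch.Tensor, mtp_layer) -> torch.Tensor:
    """reduce_scatter the MTP input to the per-rank shard, all_gather the output back."""
    world = dist.get_world_size()
    shard = hidden.new_empty(hidden.shape[0] // world, *hidden.shape[1:])
    dist.reduce_scatter_tensor(shard, hidden)    # match input_embedding's sharded dim
    out_shard = mtp_layer(shard)
    out = torch.empty_like(hidden)
    dist.all_gather_into_tensor(out, out_shard)  # restore the full token dimension
    return out
```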

### How was this patch tested?
Baseline vs. enabling shared_expert_dp and
multistream_overlap_shared_expert (benchmark screenshots are attached in
the PR):

TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Signed-off-by: zengran <zengran2@huawei.com>
Co-authored-by: zengran <zengran2@huawei.com>
2025-12-01 20:44:11 +08:00
zzhxxx
203b4e6777 [Bug_fix] fix torchair o_proj forward parameter (#4166)
### What this PR does / why we need it?
In `torchair_mla.py`, the `self.oproj` call includes an additional
parameter `is_force_scatter`, while `AscendRowParallelLinear` in
`linear.py` does not accept this parameter.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-12-01 19:57:01 +08:00
Slightwind
aa56a0f4b7 [Bugfix] PCP adaptation for VLLM v0.11.2 modifications (#4604)
To adapt to the vLLM v0.11.2 image, the method for obtaining PCP size
and DCP size has been modified.
___
- vLLM version: v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-01 19:20:32 +08:00