444 Commits

Author SHA1 Message Date
linfeng-yuan
68d8d20ca2 [misc] move mxfp_compat into device to decouple from quantization init chain (#6918)
### What this PR does / why we need it?
`mxfp_compat` only provides dtype/symbol compatibility helpers for
different `torch_npu` versions, but it was placed under
`vllm_ascend.quantization`. Importing it from device/ops paths could
trigger `quantization/__init__.py` and pull in heavy quantization method
dependencies, increasing startup coupling and causing import-cycle risk
(especially on 310P paths).

### Does this PR introduce _any_ user-facing change?
No functional behavior change intended.

### How was this patch tested?
CI passed.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 18:17:01 +08:00
whx
16c879cdf7 [Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518)
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-02 17:54:25 +08:00
Eric-dot
3c66a970f2 add mxfp8 moe quantization (#6670)
### What this PR does / why we need it?
support mxfp8 quantization (Qwen MOE )
Using adaptor to make the hardware-specific behavior clearer and more
maintainable
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
Co-authored-by: fangrongcan <f00876277@china.huawei.com>
Co-authored-by: wangyao-i <iwangyao@outlook.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 11:04:06 +08:00
realliujiaxu
5e24b26a54 [Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883)
### What this PR does / why we need it?

PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/  \
    --port 8001 \
    --served-model-name auto \
    --max-model-len 1024 \
    --enforce-eager \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --additional-config '{"enable_shared_expert_dp": true}'
```

#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Hello. I have a question. Who are you?"}
  ],
  "max_tokens": 10,
  "temperature": 0.0,
  "ignore_eos_token": true
}'
```

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-01 20:22:50 +08:00
Bai Yongbin
9d09488b4a [Feat] support basic pcp&dcp for qwen3next (#6091)
### What this PR does / why we need it?
This PR implements Context Parallelism (CP) support for the Qwen3-Next
model, including PCP (Parallel Context Parallelism) and DCP
(Dynamic/Data Context Parallelism).

- vLLM version: v0.15.0
- vLLM main:
f176443446

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Co-authored-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-02-28 21:44:08 +08:00
wangxiyuan
3d563292f3 clean 0.15.0 support (#6852)
Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-28 09:20:57 +08:00
Canlin Guo
e4458b2d2b [Main2Main] Upgrade vLLM to 0226 (#6813)
### What this PR does / why we need it?

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-02-27 16:05:21 +08:00
realliujiaxu
5def28dcd3 [Feat]support sequence parallelism by pass for VL models (#5632) 2026-02-27 08:27:41 +08:00
Li-Yongwen
2870f7c8ad [Feat] Support routing replay (#6696)
### What this PR does / why we need it?

[Feat] Support routing replay
same as https://github.com/vllm-project/vllm-ascend/pull/6666
resubmit  because of DOC failure

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 10:22:47 +08:00
Shanshan Shen
3a4292e5b7 [MM][Perf] Use seq_lens CPU cache to avoid frequent d2h copy for better performance (#6448)
### What this PR does / why we need it?

Currently, the performance of multi-modal encoding (i.e.,
`AscendMMEncoderAttention` forward) is considerably bounded by the heavy
host pre-process operations.

We can see from the profiling results below, before the real computation
of Attention, there are long free time in the device, which will lead to
extremely low NPU utilization.

<img width="2264" height="1398" alt="iShot_2026-01-23_16 26 39"
src="https://github.com/user-attachments/assets/37f21d06-e526-4f28-82fe-005746cf13bd"
/>

---
**To opitimize this, this PR has proposed four changes:**

1. Use `seq_lens` CPU cache to avoid frequent d2h copy. Before this PR,
`AscendMMEncoderAttention` will copy the `cu_seqlens` from NPU to CPU in
every forward, since the op `_npu_flash_attention_unpad()` requires CPU
`cu_seqlens` (otherwise it will crash). Thus, we use
`seq_lens_cpu_cache` to cache this tensor, since it's shared between all
layers, but may change in different forward step. When the current
`layer_index` is `0`, we update the cache, otherwise we directly use the
cache to avoid frequent `diff` and `copy` operations, which are costful.
2. Pre-compute the scale value to avoid calculating it in every forward.
3. Move the judgment of `enable_pad` from forward to the `__init__`
method.
4. Revert https://github.com/vllm-project/vllm-ascend/pull/6204.

**Performance after these optimizations:**

- **TTFT** has been reduced by **7.43%** ⬇️.
- **Throughput** has been increased by **1.23%** ⬆️.

---
> [!NOTE]
> This PR requires https://github.com/vllm-project/vllm/pull/33674 be
merged.

---
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Launch the server:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--no-async-scheduling
```

Run benchmark:

```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 500 \
--request-rate 10 \
--burstiness 5 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  82.23     
Total input tokens:                      33418     
Total generated tokens:                  61543     
Request throughput (req/s):              6.08      
Output token throughput (tok/s):         748.45    
Peak output token throughput (tok/s):    3203.00   
Peak concurrent requests:                402.00    
Total token throughput (tok/s):          1154.86   
---------------Time to First Token----------------
Mean TTFT (ms):                          10275.37  
Median TTFT (ms):                        6297.88   
P99 TTFT (ms):                           22918.26  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          263.02    
Median TPOT (ms):                        277.61    
P99 TPOT (ms):                           483.56    
---------------Inter-token Latency----------------
Mean ITL (ms):                           257.31    
Median ITL (ms):                         94.83     
P99 ITL (ms):                            1773.90   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  81.20     
Total input tokens:                      33418     
Total generated tokens:                  61509     
Request throughput (req/s):              6.16      
Output token throughput (tok/s):         757.54    
Peak output token throughput (tok/s):    2562.00   
Peak concurrent requests:                395.00    
Total token throughput (tok/s):          1169.11   
---------------Time to First Token----------------
Mean TTFT (ms):                          9511.91   
Median TTFT (ms):                        5479.78   
P99 TTFT (ms):                           21427.21  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          261.12    
Median TPOT (ms):                        276.03    
P99 TPOT (ms):                           446.99    
---------------Inter-token Latency----------------
Mean ITL (ms):                           254.04    
Median ITL (ms):                         97.71     
P99 ITL (ms):                            1516.67   
==================================================
```

- vLLM version: v0.15.0
- vLLM main:
dc917cceb8

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-02-26 08:49:36 +08:00
Icey
ee59429015 upgrade main to 0212 (#6712)
### What this PR does / why we need it?
Fixes `transformers_utils/processors/__init__` import error, due to
https://github.com/vllm-project/vllm/pull/33247
Fixes Fused MoE break introduced by `MoERunner abstraction,` due to
https://github.com/vllm-project/vllm/pull/32344

> delete AscendMoERunnere when
https://github.com/vllm-project/vllm/pull/35178 is merged

Fixes `Make Qwen3VL compatible with Transformers v5`, due to
https://github.com/vllm-project/vllm/pull/34262

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-25 09:17:29 +08:00
xulei
1e77077788 [Bugfix][DispatchFFNCombine] resolve vec error caused by unaligned UB access (#6707)
### What this PR does / why we need it?
1. Fix a vec error caused by unaligned UB accesss in the
DispatchFFNCombine;
2. Fix expert_token_nums tensor defined on host instead of NPU in
moe_comm_method.py
3. Fix multi-core copy issue of expert_token_nums in dispatchffnCombine
op (single aiv copy is sufficient)

### Does this PR introduce _any_ user-facing change?

No, this PR does not introduce any user-facing changes. The fix only
addresses internal memory access logic and does not modify any public
APIs, interfaces, or user-visible behaviors.

### How was this patch tested?

`export VLLM_ASCEND_ENABLE_FUSED_MC2=1`

vLLM version: v0.15.0

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
2026-02-14 10:32:50 +08:00
jiahao.quan
7221045777 [Attention] add gpt-oss support (#5901)
### What this PR does / why we need it?
Please refer to the following link for the historical conversation
https://github.com/vllm-project/vllm-ascend/pull/4467. We have made
updates in light of the comments from the prior PR review. Given the
refactoring of the attention_v1 component, we have carried out necessary
adjustments to fit the newly revised code.

### Does this PR introduce _any_ user-facing change?

1. Modified the code in the Attention section to adapt to the SWA and
Sink features required by gpt-oss.
2. Modified the code in the MoE section to add support for bias and
swigluoai.

### How was this patch tested?
Please refer to the
https://github.com/vllm-project/vllm-ascend/pull/4467 for performance
tests, on the basis of which the accuracy tests from AIME2024 have been
newly added.

![img_v3_02tu_501e88e3-2217-4565-8edf-b9acf4f43f2g](https://github.com/user-attachments/assets/024f8283-18ab-4d4d-ab12-27917b5d7d06)


- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: mikequan0425 <mikequan0425@foxmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: pu-zhe <zpuaa@outlook.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leon_tao <taoyao2@huawei.com>
Co-authored-by: nurxat <738457498@qq.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: mikequan <199741451@qq.com>
Co-authored-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Co-authored-by: pu-zhe <zpuaa@outlook.com>
Co-authored-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Cao Yi <slightwindsec@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: SILONG ZENG <2609716663@qq.com>
2026-02-12 10:55:34 +08:00
Angazenn
c0c2eb614e [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450)
### What this PR does / why we need it?

This PR extends original `rope_triton_forward` and
`split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as
inputs. This fully aligns to vLLM RoPE api interface. Compared with
earlier implementation for RoPE, the benefits are:

1. avoiding pre-computation of `cos` `sin` before model execution, which
helps to remove redundant codes.
2. allowing eagle3 draft model to have different rope parameters with
main model (see #6612 ). This help to recover accept rate && accuracy in
that case.

In addition, this kernel change only introduces very small performance
degradation. Those `index_select` or `chunk` operations are now changed
into simple memory access in triton kernel (For example,
https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).

**Highlights**

- **RoPE Cache Unification**: Replaced separate _sin and _cos global
tensors with a unified cos_sin_cache and explicit positions tensor for
Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated Triton kernels
(split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the
cos_sin_cache and positions for more efficient and integrated RoPE
calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a
new custom operation, allowing its use in fused compilation passes and
providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified the rope_forward_oot
function to accept the new cos_sin_cache and positions arguments,
enabling a more flexible and integrated RoPE application within the
system.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Additional test on Qwen3-235b accuracy:

| Aime2024 | GSM8K | Livecodebench |
| -------- | -------- | -------- |
| 83.33 | 96.26 | 70.23 |

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-02-11 21:20:53 +08:00
LI SHENGYONG
7cf285a77a [MOE Refactor] Remove QuantType in prepare_finalize.py (#6534)
### What this PR does / why we need it?
To prevent confusion between different QuantType classes, we remove**
QuantType in prepare_finalize.py

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:59:58 +08:00
Nengjun Ma
66b60c9440 [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (#6629)
### What this PR does / why we need it?
1. [Refact] Refact MLA/SFA weight prefetch to consist with moe weight
prefetch
2. Remove duplicated o_proj weight prefetch in forward for MLA/SFA

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?

1) Performance result:
Perf test data:
*) MLA:

| | 1st test | 2nd test | Output Token Throughput(Avg) | Performance
improvement percentage |
| --- | --- | --- | --- | --- |
| o_proj duplicate prefetch | 11.9669 token/s | 12.0287 token/s |
11.9978 |
| o_proj no duplicate prefetch | 12.5594 token/s | 12.6216 token/s |
12.5905 | 4.94%| |

single layer performace improve: 5%~8%

*) SFA:

| | 1st test | 2nd test | Output Token Throughput(Avg) | Performance
improvement percentage |
| --- | --- | --- | --- | --- |
| o_proj duplicate prefetch | 13.0523 token/s | 13.1084 token/s |
13.08035 | |
| o_proj no duplicate prefetch | 13.9844 token/s | 14.1678 token/s |
14.0761 | 7.6% |

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-10 14:14:37 +08:00
meihanc
5b8e47cb68 [bugfix]Fix no attribute 'data' when MLAPO is enable (#6601)
### What this PR does / why we need it?

This PR fixes an `AttributeError: 'Parameter' object has no attribute
'data'` that occurs when MLAPO is enabled with vLLM v0.15.0.

The error is caused by a monkey-patch on
`MLAAttention.process_weights_after_loading` which is incompatible with
changes in vLLM v0.15.0. This is likely related to PyTorch's deprecation
of the `.data` attribute on `torch.nn.Parameter` objects.

This change makes the monkey-patch conditional, so it is not applied for
vLLM v0.15.0 and newer versions, resolving the crash.


- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-02-10 09:04:32 +08:00
GoCHug
80e5812b39 [BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (#6581)
### What this PR does / why we need it?
Issue: If a model such as Ling-1T adopts partial rotary position
embedding (partial RoPE), but config.json uses the rotary_dim parameter
instead of partial_rotary_factor, it will trigger a RuntimeError: The
expanded size of the tensor (128) must match the existing size (64) at
non-singleton dimension 3.
<img width="1681" height="472" alt="image"
src="https://github.com/user-attachments/assets/ba03d7df-ecba-4d6f-9ec1-4dc55f59799e"
/>

This PR addresses an issue where models using partial rotary position
embedding (partial RoPE) with the `rotary_dim` parameter in
`config.json` (instead of `partial_rotary_factor`) would encounter a
`RuntimeError`.

This change adds support for the `rotary_dim` parameter in
`vllm_ascend/ops/rotary_embedding.py` to correctly calculate the
`rope_dim`, resolving the tensor size mismatch error.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The patch was tested successfully with the Ling-1T model, which
previously triggered the error.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: GoCHug <93277779+GoCHug@users.noreply.github.com>
2026-02-09 17:17:52 +08:00
wangxiyuan
9c6d031797 [MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618)
Clean up uesless env `USE_OPTIMIZED_MODEL`

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-09 15:38:58 +08:00
wangxiyuan
6c49f95da2 [Ops][Refactor] Remove custom rotary_embedding operator (#6523)
### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-07 09:24:05 +08:00
SILONG ZENG
06aa6036f6 [Lint]Style: Convert vllm-ascend/ to ruff format(new Batch #8) (#6604)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| vllm_ascend/ops/\_\_init\_\_.py |
| vllm_ascend/ops/activation.py |
| vllm_ascend/ops/flashcomm2_oshard_manager.py |
| vllm_ascend/ops/layernorm.py |
| vllm_ascend/ops/mla.py |
| vllm_ascend/ops/mm_encoder_attention.py |
| vllm_ascend/ops/register_custom_ops.py |
| vllm_ascend/ops/vocab_parallel_embedding.py |
| vllm_ascend/ops/weight_prefetch.py |
| vllm_ascend/spec_decode/\_\_init\_\_.py |
| vllm_ascend/spec_decode/eagle_proposer.py |
| vllm_ascend/spec_decode/interface.py |
| vllm_ascend/spec_decode/mtp_proposer.py |
| vllm_ascend/spec_decode/ngram_proposer.py |
| vllm_ascend/spec_decode/suffix_proposer.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-07 09:16:07 +08:00
wangxiyuan
06c0aed124 [CI] Fix broken CI (#6599)
Revert
4fb3d5e1b2
it breaks E2E Test

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
2026-02-06 17:23:58 +08:00
SILONG ZENG
19b5d44ea8 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10) (#6173)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|`vllm_ascend/ops/layer_shard_linear.py`|
|`vllm_ascend/ops/linear.py`|
|`vllm_ascend/ops/linear_op.py`|
|`vllm_ascend/worker/worker.py`|
| ` vllm_ascend/patch/worker/patch_bert.py` |
| ` vllm_ascend/patch/worker/patch_deepseek.py` |
| ` vllm_ascend/patch/worker/patch_distributed.py` |
| ` vllm_ascend/patch/worker/patch_module.py` |
| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` |
| ` vllm_ascend/patch/worker/patch_qwen3_next.py` |
| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` |
| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` |
| ` vllm_ascend/patch/worker/patch_rope.py` |
| ` vllm_ascend/patch/worker/patch_triton.py` |
| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` |
| ` vllm_ascend/patch/worker/patch_v2_egale.py` |
|` vllm_ascend/worker/npu_input_batch.py`|
|` vllm_ascend/worker/v2/aclgraph_utils.py`|
|` vllm_ascend/worker/v2/attn_utils.py`|
|` vllm_ascend/worker/v2/model_runner.py`|
|` vllm_ascend/worker/v2/sample/gumbel.py`|
|` vllm_ascend/worker/v2/sample/penalties.py`|
|` vllm_ascend/worker/v2/sample/sampler.py`|
|` vllm_ascend/worker/v2/spec_decode/__init__.py`|
|` vllm_ascend/worker/v2/spec_decode/eagle.py`|
|` vllm_ascend/worker/v2/states.py`|
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-06 15:35:06 +08:00
SILONG ZENG
65b7f716e6 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #11) (#6176)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/ops/fused_moe/comm_utils.py` |
| `vllm_ascend/ops/fused_moe/experts_selector.py` |
| `vllm_ascend/ops/fused_moe/fused_moe.py` |
| `vllm_ascend/ops/fused_moe/moe_comm_method.py` |
| `vllm_ascend/ops/fused_moe/moe_mlp.py` |
| `vllm_ascend/ops/fused_moe/prepare_finalize.py` |
| `vllm_ascend/ops/fused_moe/token_dispatcher.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-02-06 15:28:49 +08:00
SILONG ZENG
4fb3d5e1b2 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #8) (#6129)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| vllm_ascend/ops/\_\_init\_\_.py |
| vllm_ascend/ops/activation.py |
| vllm_ascend/ops/flashcomm2_oshard_manager.py |
| vllm_ascend/ops/layernorm.py |
| vllm_ascend/ops/mla.py |
| vllm_ascend/ops/mm_encoder_attention.py |
| vllm_ascend/ops/register_custom_ops.py |
| vllm_ascend/ops/vocab_parallel_embedding.py |
| vllm_ascend/ops/weight_prefetch.py |
| vllm_ascend/spec_decode/\_\_init\_\_.py |
| vllm_ascend/spec_decode/eagle_proposer.py |
| vllm_ascend/spec_decode/interface.py |
| vllm_ascend/spec_decode/mtp_proposer.py |
| vllm_ascend/spec_decode/ngram_proposer.py |
| vllm_ascend/spec_decode/suffix_proposer.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-02-06 15:25:08 +08:00
pu-zhe
85e33941e8 [Feat.]: 310p support MOE models (#6530)
### What this PR does / why we need it?
This pull request integrates comprehensive support for Mixture of
Experts (MoE) models on the Ascend 310P device within the vllm-ascend
framework. It achieves this by introducing specialized modules for
expert selection, fused MoE layers, and optimized all-gather
communication. The changes also refine existing NPU operations, making
them more consistent and efficient for 310P, ultimately enhancing the
performance and compatibility of MoE models on this hardware.

Highlights
310P MoE Support: Introduces dedicated implementations for Mixture of
Experts (MoE) models on Ascend 310P devices, including new modules for
expert selection, fused MoE layers, and communication.
All-Gather Communication: Enforces the use of ALLGATHER communication
for MoE operations on 310P, optimizing data transfer and leveraging
NPU-specific token dispatching.
Simplified NPU Operations: Removes conditional type casting for
npu_swiglu and enables custom rotary embedding kernels unconditionally,
suggesting improved native support for 310P.
New MoE Classes Registered: Registers AscendFusedMoE310 and
AscendSharedFusedMoE310 to integrate 310P-specific MoE layers into the
system's custom operation registry.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
offline test and server test, with qwen3-30b-a3b,tp/ep 4 on 310p

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-06 10:30:56 +08:00
meihanc
922e5c163b [main2main] upgrade vllm main 0202 (#6560)
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required
positional argument: 'is_sequence_parallel'` due to
https://github.com/vllm-project/vllm/pull/32567
2. Fix ` TypeError: '>' not supported between instances of 'MagicMock'
and 'int'` due to https://github.com/vllm-project/vllm/pull/33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with
abstract methods forward_mha, forward_mqa` and AttributeError: 'bool'
object has no attribute 'process_weights_after_loading' due to
https://github.com/vllm-project/vllm/pull/33284
4. Fix `'AscendSharedFusedMoE' object has no attribute
'_routed_input_transform'`due to
https://github.com/vllm-project/vllm/pull/32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument
'num_active_loras'` due to
https://github.com/vllm-project/vllm/pull/32005
6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'`
due to https://github.com/vllm-project/vllm/pull/27492
7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward,
vllm.moe_forward_shared due to
https://github.com/vllm-project/vllm/pull/33184
8. Add patch to fix the problem "got multiple values for keyword
argument 'add_special_tokens'" due to
https://github.com/vllm-project/vllm/pull/32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 19:31:17 +08:00
Zhijun Chen
0ead5e8681 perf: adaptive block size selection in linear_persistent kernel (#6537)
### What this PR does / why we need it?

**Optimization:** Replaces fixed block sizes (128x128x128) in
`linear_persistent_kernel` with adaptive selection logic that considers:
- Matrix dimensions (M, N, K) 
- Device NPU vector core count
- Data type (float32 vs others)

**Why:** Fixed block sizes lead to suboptimal hardware utilization
across different matrix shapes. Adaptive sizing maximizes occupancy and
memory efficiency for varied workload patterns, improving throughput for
batch-invariant linear operations in LLM inference.

**Details:**
- Small matrices (M < 256): Size-proportional allocation
- Medium matrices (256 ≤ M < 1024): Balanced distribution based on grid
capacity
- Large matrices (M ≥ 1024): Optimized for dominant dimension

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization. The API and numerical results
remain unchanged; only kernel execution efficiency improves.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: DDCHY <843049740@qq.com>
Signed-off-by: zjchenn <zjchenn@gmail.com>
Co-authored-by: DDCHY <843049740@qq.com>
2026-02-04 21:36:26 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch to consistency with MoE Model's prefetching
in terms of code and usage.
Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
guanguan0308
dffac6db73 [Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402)
### What this PR does / why we need it?
Add New Output for Expert Token Count
An additional output tensor expert_token_nums is added to both operators
to meet the requirement of tracking token distribution among experts:

Tensor Name: expert_token_nums
Dimension: 1D tensor
Shape: (local_expert_num,)
Data Type: int32
Semantics: Represents the number of tokens actually received by each
expert on the current card.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
2026-02-03 10:41:06 +08:00
zhangxinyuehfad
26b83f8bde [Bugfix] Improve Triton stability on Ascend for large grids (#6301)
### What this PR does / why we need it?
Improve Triton stability on Ascend for large grids
set `TRITON_ALL_BLOCKS_PARALLEL=1` when grids > 65535

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-03 10:32:27 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
wubin58
4230bc8646 [Bugfix]Modify NPU rotary encoding parameter fields,fix RopeOperation setup failed in condition of self.rotary_dim < self.head_size (#6310)
### What this PR does / why we need it?
change self.head_size to self.rotary_dim.
only the rotary part is processed here, the dimension should be rotary_dim.

Fix bug #6060

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only a small section of code was modified to adjust the parameters, and
all standard tests were passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
Co-authored-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
2026-01-30 21:25:04 +08:00
whx
39f8af9d96 [Main2Main][BugFix] Add shared_experts check for AscendSharedFusedMoE (#6335)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes
Qwen3-Moe models also go into `SharedFusedMoE`, while current
implementation of our `AscendSharedFusedMoE` assumes shared_experts
always exist. This PR adds checking to
`multistream_overlap_shared_expert` and `multistream_overlap_gate` in
order to only enable these features when shared experts exist.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All ci passed

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-29 08:47:20 +08:00
Mercykid-bash
29fb27d3bb BugFix: Fix moe_load accumulation error in ACL graph mode (#6182)
This PR fixes the numerical error in moe_load accumulation under ACL
graph mode on NPU: using += for NPU tensors in graph mode does not throw
errors but leads to incorrect values, so we replace it with the in-place
add_() method to ensure accurate calculation.

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
2026-01-26 17:18:46 +08:00
LI SHENGYONG
611e223b7d [EPLB][Bugfix] EPLB support fp/bf16 (#5531)
### What this PR does / why we need it?
EPLB support dtype of fp/bf16.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
w8a8_dynamic Baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

w8a8_dynamic eplb:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

The fp16 conversation is normal.
The fp16 test is in progress.

Baseline fp16
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

eplb fp16
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-26 14:28:16 +08:00
Shanshan Shen
76ac688388 [MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204)
### What this PR does / why we need it?

Currently, we pad the last dim of qkv to 128 before flash attention (in
`AscendMMEncoderAttention`) to get better performance on Ascend NPU.
However, the qkv padding is executed serially, which may lead to more
overhead when launching `aclnnConstantPadNd` (launch 3 times).

Since the three operations are mutually independent, we stack qkv first
and then pad them in one kernel launch. With this optimization, **TTFT**
has been reduced by **3.15%**, **peak throughput** has been increased by
**4.20%**.

---

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Launch the server:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Run benchmark:

```bash
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 1000 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  122.33    
Total input tokens:                      66638     
Total generated tokens:                  122845    
Request throughput (req/s):              8.17      
Output token throughput (tok/s):         1004.18   
Peak output token throughput (tok/s):    3073.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          1548.90   
---------------Time to First Token----------------
Mean TTFT (ms):                          51757.16  
Median TTFT (ms):                        44853.42  
P99 TTFT (ms):                           110700.14 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          226.06    
Median TPOT (ms):                        206.85    
P99 TPOT (ms):                           935.31    
---------------Inter-token Latency----------------
Mean ITL (ms):                           208.82    
Median ITL (ms):                         96.37     
P99 ITL (ms):                            2183.13   
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  121.47    
Total input tokens:                      66638     
Total generated tokens:                  122860    
Request throughput (req/s):              8.23      
Output token throughput (tok/s):         1011.47   
Peak output token throughput (tok/s):    3202.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          1560.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          50125.08  
Median TTFT (ms):                        46270.85  
P99 TTFT (ms):                           108107.12 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          227.11    
Median TPOT (ms):                        205.13    
P99 TPOT (ms):                           816.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           204.60    
Median ITL (ms):                         92.66     
P99 ITL (ms):                            2219.02   
==================================================
```
- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 10:20:24 +08:00
huangning1995
ce11fd49f3 [Feature] Batch invariant torch.compile (#6107)
### What this PR does / why we need it?
Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to
enable batch-invariant in vllm-ascend, we observed that the performance
of BI in eager mode remains suboptimal.

This PR further integrates batch-invariant with torch.compile, which
improves inference performance by 350% when tested with Qwen3-0.6B.

### Does this PR introduce _any_ user-facing change?
Previously, enabling both aclgraph and Batch-Invariant would cause an
"ub overflow" error. This occurred because transposed input tensors
could produce incorrect stride() values.

To fix this, we now call .contiguous() on the input tensors before
passing them to Triton kernels. This ensures a contiguous memory layout
and prevents transposed tensors from causing incorrect stride
calculations.

### Test Plan
pytest -sv --durations=0
tests/e2e/singlecard/test_aclgraph_batch_invariant.py

### Test Result
```
============================================================================ slowest durations ============================================================================
87.37s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle
77.39s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN
74.04s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail
73.59s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation

(8 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================
```
### Performance
export VLLM_BATCH_INVARIANT=1
vllm serve /home/Qwen3-0.6B \
--served-model-name qwen \
--port 8000 \
--max-num-seqs 256 \
--tensor-parallel-size 1 \
--max-model-len 5500 \
--max-num-batched-tokens 5500 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.9 \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY",
"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
--additional-config
'{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}'

vllm bench serve --served-model-name qwen --trust-remote-code --backend
vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name
random --random-input-len 512 --random-output-len 256 --num-prompts 800
--max-concurrency 8

torch.compile batch invariant performance:
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  477.21    
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         429.16    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1287.48   
---------------Time to First Token----------------
Mean TTFT (ms):                          285.53    
Median TTFT (ms):                        312.70    
P99 TTFT (ms):                           324.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.59     
Median TPOT (ms):                        17.50     
P99 TPOT (ms):                           18.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.59     
Median ITL (ms):                         17.45     
P99 ITL (ms):                            18.76     
==================================================
```
Eager
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  1694.70   
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         120.85    
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          362.54    
---------------Time to First Token----------------
Mean TTFT (ms):                          164.29    
Median TTFT (ms):                        129.71    
P99 TTFT (ms):                           1961.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.81     
Median TPOT (ms):                        65.15     
P99 TPOT (ms):                           72.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81     
Median ITL (ms):                         64.64     
P99 ITL (ms):                            75.72     
==================================================
```

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: huangning1995 <huangning12@huawei.com>
2026-01-26 09:15:06 +08:00
Shaoxu Cheng
fbae41697e [310P]: refactoring for 310p kvcache and some ops class (#6117)
### What this PR does / why we need it?
* Refactor the LayerNorm and activation operator classes to decouple the
310P device implementation from the main branch.
* Refactor `mm_encoder_attention` on 310P to use the
`torch_npu._npu_flash_attention_unpad` operator.
* Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P
so they are no longer padded to 16× alignment.
* Refactor `model_runner` on 310P to align the KV-cache initialization
logic with the mainline implementation.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
use the e2e tests.

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-01-24 20:34:29 +08:00
yjmyl
e90b14140b [feature] add_rms_norm support bias (#5790)
### What this PR does / why we need it?
This PR is to replace addRmsNorm and Add With addRmsNormBias. This way
can lead to a more effecient result.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Full Test Pass

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com>
Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>
2026-01-23 21:09:54 +08:00
SILONG ZENG
78af0c30a3 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #12) (#6177)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/ops/triton/activation/swiglu_quant.py` |
| `vllm_ascend/ops/triton/batch_invariant/matmul.py` |
| `vllm_ascend/ops/triton/batch_invariant/mean.py` |
| `vllm_ascend/ops/triton/batch_invariant/rmsnorm.py` |
| `vllm_ascend/ops/triton/fla/chunk.py` |
| `vllm_ascend/ops/triton/fla/chunk_delta_h.py` |
| `vllm_ascend/ops/triton/fla/chunk_o.py` |
| `vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py` |
| `vllm_ascend/ops/triton/fla/cumsum.py` |
| `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py` |
| `vllm_ascend/ops/triton/fla/l2norm.py` |
| `vllm_ascend/ops/triton/fla/layernorm_guard.py` |
| `vllm_ascend/ops/triton/fla/sigmoid_gating.py` |
| `vllm_ascend/ops/triton/fla/solve_tril.py` |
| `vllm_ascend/ops/triton/fla/utils.py` |
| `vllm_ascend/ops/triton/fla/wy_fast.py` |
| `vllm_ascend/ops/triton/fused_gdn_gating.py` |
| `vllm_ascend/ops/triton/layernorm_gated.py` |
| `vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py` |
| `vllm_ascend/ops/triton/mamba/causal_conv1d.py` |
| `vllm_ascend/ops/triton/reject_sample.py` |
| `vllm_ascend/ops/triton/rope.py` |
| `vllm_ascend/ops/triton/spec_decode/utils.py` |
| `vllm_ascend/ops/triton/triton_utils.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-23 14:59:19 +08:00
Cao Yi
a69ef10c3a [Refactor] Quantization Module Refactor (#5738)
### Summary

This PR refactors the `vllm_ascend/quantization` module to improve code
organization, maintainability, and extensibility. The refactoring
introduces a clear separation of concerns with a registry-based scheme
discovery pattern, abstract base classes for quantization schemes, and
dedicated wrapper classes.

### Key Changes

#### 1. **Modular Directory Structure**

| Before | After |
|--------|-------|
| Flat file structure with mixed responsibilities | Organized into
`methods/` subpackage for schemes |
| Single `quant_config.py` (600+ lines) | Separate config files:
`modelslim_config.py`, `compressed_tensors_config.py` |
| `utils.py` with scheme lookup logic | `methods/registry.py` with
decorator-based registration |

#### 2. **Registry-Based Scheme Discovery**

Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a
decorator-based registry pattern:

```python
# Before: Manual dictionary mapping
ASCEND_QUANTIZATION_METHOD_MAP = {
    "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...},
    ...
}

# After: Decorator-based registration
@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    ...
```

#### 3. **Abstract Base Classes**

Introduced three abstract base classes in `methods/base.py`:
- `AscendLinearScheme` - Base for linear layer quantization
- `AscendMoEScheme` - Base for MoE layer quantization  
- `AscendAttentionScheme` - Base for attention layer quantization

#### 4. **Separated Config and Wrapper Classes**

- **Config classes** (`AscendModelSlimConfig`,
`AscendCompressedTensorsConfig`): Handle config parsing and scheme
selection
- **Wrapper classes** (`AscendLinearMethod`, `AscendFusedMoEMethod`,
etc.): Implement vLLM interfaces and delegate to schemes

#### 5. **Cleaner Public API**

```python
# New clean module interface
from vllm_ascend.quantization import (
    AscendModelSlimConfig,
    AscendCompressedTensorsConfig,
)
from vllm_ascend.quantization.methods import get_scheme_class
```

### Architecture Diagram

```mermaid
classDiagram
    direction TB
    
    class QuantizationConfig {
        <<vLLM Interface>>
        +get_quant_method()
    }
    
    class AscendModelSlimConfig {
        +quant_description
        +get_quant_method()
        -create_scheme_for_layer()
    }
    
    class AscendCompressedTensorsConfig {
        +target_scheme_map
        +get_quant_method()
        -_get_scheme_from_parts()
    }
    
    class AscendLinearMethod {
        <<Wrapper>>
        +quant_method: AscendLinearScheme
        +create_weights()
        +apply()
    }
    
    class AscendFusedMoEMethod {
        <<Wrapper>>
        +quant_method: AscendMoEScheme
        +create_weights()
        +apply()
    }
    
    class AscendLinearScheme {
        <<Abstract>>
        +get_weight()*
        +apply()*
        +get_pertensor_param()
        +get_perchannel_param()
    }
    
    class AscendMoEScheme {
        <<Abstract>>
        +get_weight()*
        +get_dynamic_quant_param()*
        +apply()*
    }
    
    class W8A8DynamicLinear {
        +get_weight()
        +apply()
    }
    
    class W8A8DynamicMoE {
        +get_weight()
        +apply()
    }
    
    QuantizationConfig <|-- AscendModelSlimConfig
    QuantizationConfig <|-- AscendCompressedTensorsConfig
    
    AscendModelSlimConfig ..> AscendLinearMethod : creates
    AscendModelSlimConfig ..> AscendFusedMoEMethod : creates
    AscendCompressedTensorsConfig ..> AscendLinearMethod : creates
    AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates
    
    AscendLinearMethod o-- AscendLinearScheme : delegates to
    AscendFusedMoEMethod o-- AscendMoEScheme : delegates to
    
    AscendLinearScheme <|-- W8A8DynamicLinear
    AscendMoEScheme <|-- W8A8DynamicMoE
```

### Scheme Registration Flow

```mermaid
sequenceDiagram
    participant Module as Scheme Module
    participant Registry as _SCHEME_REGISTRY
    participant Config as QuantConfig
    participant Wrapper as Wrapper Class
    
    Note over Module: At import time
    Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear")
    Registry->>Registry: Store (quant_type, layer_type) -> Class
    
    Note over Config: At runtime
    Config->>Config: Determine quant_type from description
    Config->>Registry: get_scheme_class(quant_type, layer_type)
    Registry-->>Config: Return scheme class
    Config->>Config: scheme = scheme_cls()
    Config->>Wrapper: Create wrapper with scheme
    Wrapper-->>Config: Return wrapper instance
```

### File Changes Summary

| Original Files | Refactored Files |
|----------------|------------------|
| `__init__.py` (empty) | `__init__.py` (exports public API) |
| `quant_config.py` | `modelslim_config.py` + `wrappers.py` |
| `compressed_tensors/` | `compressed_tensors_config.py` |
| `utils.py` | `methods/registry.py` |
| `w8a8_dynamic.py` | `methods/w8a8_dynamic.py` |
| `w8a8.py` | `methods/w8a8_static.py` |
| `w4a4_flatquant_dynamic.py` | `methods/w4a4_flatquant.py` |
| ... | `methods/base.py` (new) |

### Benefits

1. **Extensibility**: Adding new quantization schemes only requires
implementing the base class and adding `@register_scheme` decorator
2. **Maintainability**: Clear separation between config parsing, wrapper
logic, and scheme implementation
3. **Testability**: Abstract base classes enable easier unit testing and
mocking
4. **Discoverability**: Registry pattern makes it easy to list all
supported schemes
5. **Reduced Coupling**: Config classes no longer need to know about all
scheme implementations

___

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-01-23 14:13:47 +08:00
zhangxinyuehfad
819a4459ce Drop vLLM 0.13.0 support (#6069)
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 09:45:08 +08:00
zhangxinyuehfad
9bba0a2a68 [Bugfix] Fix Triton operator usage for multimodal models based on the mrope_interleaved parameter (#6042)
### What this PR does / why we need it?

When running the Qwen2.5-Omni-7B model on Ascend NPU, the engine fails
during the profiling/warmup stage with the following error:
`AclNN_Runtime_Error(EZ9903): rtKernelLaunchWithHandleV2 failed: 507035.
The vector core execution is abnormal.`

error log:
https://github.com/vllm-project/vllm-ascend/actions/runs/21144534911/job/60806765393#step:17:6412

This error is specifically triggered by the `triton_mrope` kernel when
handling the unique `mrope_section` configurations of the Omni model.
Other multimodal models with standard sections (e.g., [16, 24, 24]) or
standard LLMs work correctly with Triton.

Modified vllm_ascend/ops/rotary_embedding.py to add a conditional check
before calling forward_triton.

1. For standard LLMs (mrope_interleaved = True ), it continues to use
Triton for acceleration.

2. For complex configurations (like Qwen2.5-Omni mrope_interleaved =
False ), it now falls back to the native super().forward_oot() path,
which uses the stable torch_npu or PyTorch implementation.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-22 15:46:05 +08:00
zzhxxx
dd8571860d [Feature] Support DSA-CP for Hybrid scenario (#5702)
Signed-off-by: zzhx1 <zzh_201018@outlook.com>

### What this PR does / why we need it?
> Extracted from PR #5513
Based on the Sharded-CP feature PR:#4702;
RFC:https://github.com/vllm-project/vllm/issues/30055

### Support FULL_DECODE_ONLY Mode under PD-Mixed Scenario:
Extends DSA-CP to handle the FULL_DECODE_ONLY execution mode when
running in a prefill-decode mixed (PD-mixed) serving environment,
improving throughput and resource utilization for decode-intensive
workloads.
**In pure prefill nodes:**
- Both q_proj and o_proj are sharded across world ranks, using
**broadcast** for weights distribution.

**In PD-mixed nodes (supporting both prefill and decode):**

- q_proj is fully replicated (not sharded) to avoid communication
overhead during decoding.
- o_proj Using the original TP `RowParallelLinear` method to store
weights

**During prefill execution:**
- o_proj forwards through all_gather to collect weights, reconstructing
the complete o_proj weights on each card.

**During decode (graph replay phase):**
- Additional all_to_all (before o_proj) and reduce_scatter (after
o_proj) are introduced to enable sequence-parallel output aggregation
while maintaining correctness under SFA CP.

### benchmark:
- TTFT increased by **527%**
- TPOT increased by **180%**

<img width="1550" height="938" alt="image"
src="https://github.com/user-attachments/assets/9b7a03d8-a3db-4a99-8923-6e5bfcfecf72"
/>


### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: clrs97 <524936896@qq.com>
2026-01-22 10:12:09 +08:00
ZCG12345
8900e3398b [Ascend] perf: optimize rope embedding with triton kernel for huge performance gain (#5918)
### What this PR does / why we need it?
1. Implement a **high-performance Triton custom kernel** for the rotary
position embedding (RoPE) operator on **Ascend NPU** platform
2. Fix critical bugs in the Triton RoPE kernel registration and
invocation process: including incorrect fake impl function name
matching, wrong torch ops namespace for kernel call, missing self
parameter in cos/sin slice fetching, and syntax errors in function type
annotations.
3. Achieve **extreme performance optimization** for the core RoPE
operator: the single inference latency is reduced from **57.1 μs** to
**9 μs**, with **6.34x performance improvement** and **84.24% latency
reduction**.
4. The RoPE operator is a **hot path** that is executed in every
transformer layer during LLM inference, the optimization will directly
reduce the overall inference latency and improve the throughput of LLM
serving on Ascend NPU.
5. Keep full backward compatibility: the Triton kernel is enabled only
when `HAS_TRITON=True`, and automatically fall back to the original
Ascend NPU native implementation if Triton is not available, no
functional regression.

### Does this PR introduce _any_ user-facing change?
**NO**
- No changes to any public APIs, interfaces or inference behaviors of
vLLM.
- No impact on the text generation quality and correctness of the large
model.
- The optimization is transparent to end users, only the inference speed
(latency/throughput) is improved without any functional change.

### How was this patch tested?
1. **Environment Validation**: Tested on Ascend NPU platform with
vLLM-Ascend framework, Triton library installed and enabled
(`HAS_TRITON=True`).
2. **Kernel Registration Test**: Verified the Triton RoPE kernel
(`rope_forward_triton`) is successfully registered to
`torch.ops._C_ascend` namespace without any
`ValueError/NameError/SyntaxError`.
3. **Functional Correctness Test**: Run large model (GLM4/MoE) inference
on the Ascend NPU platform, the generated text content is **completely
correct** (no garbled text, no logical errors), consistent with the
original implementation.
4. **Performance Benchmark Test**: Measure the single execution latency
of the RoPE operator before/after optimization, confirm the latency is
stably reduced from 57.1 μs to 9 μs, the performance gain is valid and
stable.
5. **Fallback Mechanism Test**: Manually disable Triton
(`HAS_TRITON=False`), verify the code correctly falls back to the
original Ascend NPU native RoPE implementation, no service crash and
normal inference.
6. **Compatibility Test**: Test with different tensor shapes/sizes of
query/key, all cases work correctly with the Triton kernel, no shape
mismatch error.
- operator supply by Hexiang Wang 
- vLLM version: v0.13.0
- vLLM main:
11b6af5280

---------

Signed-off-by: ZCG12345 <2097562023@qq.com>
2026-01-21 22:01:22 +08:00
LeeWenquan
2a618d2454 [Ops] update causal_conv1d_update (#5984)
### What this PR does / why we need it?
Update causal_conv1d_update ops for better perf.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-01-21 16:33:52 +08:00
wangqiankun13
bec8641876 [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476] (#5932)
### What this PR does / why we need it?

In [PR 5040](https://github.com/vllm-project/vllm-ascend/pull/5040), the
`dispatch_gmm_combine_decode` operator was configured with an incorrect
global_bs parameter. This PR is to fix the bug.

The global_bs provided as input should have the same meaning as in the
`moe_distributed_dispatch` operator, specifically: (the maximum batch
size across all cards) * (expert parallel world size).
However, the implementation incorrectly used the variable
max_num_tokens, which does not account for tensor parallelism. This
error likely resulted in an unnecessarily large (overestimated) value.

More info about this operator, please refer to RFC: issue
https://github.com/vllm-project/vllm-ascend/issues/5476

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 80.00 |
- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
2026-01-21 09:26:40 +08:00
LeeWenquan
55b20ac63b [Ops] Add layernorm for qwen3Next (#5765)
### What this PR does / why we need it?
Add layernormFn triton op for qwen3Next model for better performance.

<img width="248" height="526" alt="image"
src="https://github.com/user-attachments/assets/27b47157-5df5-4db1-aa88-1dae799b2bf6"
/>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-01-20 14:43:14 +08:00
aipaes
f58e110afe 【feat】switch for fusion ops gmmswigluquant (#5992)
### What this PR does / why we need it?

Set a additional config parameter to control whether the gmmswigluequant
fuseion operator is enabled; it is enabled by True. / When enabled with
a small number of GPUs, the gmmswigluquant fused operator can cause some
performance degradation.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

#### Perf

test model: GLM 4.6(w8a8)
- single A3 node(ep16, tp16),  async-scheduling, mtp, FULL_DECODE_ONLY
- bs=1, input_lens=32000, ouput_lens=1024

Without this PR: TPOT 32.22.ms
With this PR: TPOT 30.23ms

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-01-19 13:19:25 +00:00