Commit Graph

2474 Commits

Author SHA1 Message Date
linfeng-yuan
700423156f [Triton] Centralize Ascend extension op dispatch in triton_utils (#6937)
### What this PR does / why we need it?

This pull request refactors the dispatch mechanism for the
**triton-ascend-specific operators** `insert_slice`, `extract_slice`,
and `get_element` to ensure compatibility with both CANN 8.5 and 9.0.

A unified helper function, `_resolve_triton_ascend_op`, has been
introduced in `vllm_ascend/ops/triton/triton_utils.py`. This function
dynamically resolves these operators by first attempting to import them
from the `triton.language.extra.cann.extension` module, which is present
in newer CANN versions. If that fails, it falls back to the standard
`triton.language` module.

This approach centralizes operator dispatch logic, allowing individual
Triton kernels to use these functions without being aware of the
underlying Triton/CANN version. All call sites have been updated to use
these new unified functions.

### Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring of operator implementations and does
not introduce any user-facing changes.

### How was this patch tested?

CI is expected to pass with existing tests.

**Testing Context:**
- vLLM version: v0.16.0
- vLLM main: `15d76f74e2fdb12a95ea00f0ca283acf6219a2b7`

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-03 17:10:30 +08:00
linfeng-yuan
cb893bcdb0 [csrc][bugfix] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 (#6936)
### What this PR does / why we need it?
Remove hardcoded ASCEND910_95 usage in csrc custom-op host/tiling code
and
select the SoC target at CMake configure time.

- Probe CANN headers with check_cxx_source_compiles:
prefer platform_ascendc::SocVersion::ASCEND950, fallback to
ASCEND910_95.
- Export the selected enum/config string via shared compile definitions
(VLLM_ASCEND_950_SOC_ENUM / VLLM_ASCEND_950_SOC_CONFIG).
- Apply the shared macros to affected paths (moe_gating_top_k,
add_rms_norm_bias) to avoid per-file hardcoding.
- Keep behavior unchanged; this is an internal build-compatibility fix
for CANN 8.5 and 9.x.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-03 17:08:22 +08:00
Shaoxu Cheng
2064afe380 [300I][Bugfix] fix unquant model weight nd2nz error (#6851)
### What this PR does / why we need it?
- This PR fixes an issue with weight format conversion for unquantized
models running on Ascend 310P devices.

- The changes refactor the logic for converting weights to the
FRACTAL_NZ format. Previously, this was handled in a 310P-specific
linear layer implementation (`AscendUnquantizedLinearMethod310`). This
implementation has been removed, and the logic is now centralized in the
`maybe_trans_nz` utility function. This function now checks if the
device is a 310P and applies the NZ format cast accordingly for
`float16`/`bfloat16` weights.

- This refactoring simplifies the code by removing platform-specific
duplication and ensures correct weight handling for unquantized models
on 310P.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
ut and local test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-03 15:57:26 +08:00
zzzzwwjj
f19f7b1fe2 [doc] fix supported_models (#6930)
### What this PR does / why we need it?

Add Experimental supported model/feature for supported_models.md

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-03-03 09:47:50 +08:00
starmountain1997
248d07566f [CI] nightly test timeout (#6912)
### What this PR does / why we need it?

The nightly test is currently failing due to a
[timeout](https://github.com/vllm-project/vllm-ascend/actions/runs/22547280169/job/65326335134).

As noted in #6778, this issue can be resolved by applying this fix.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run nightly test.

Co-authored-by: guozr <guozr1997@hotmail.com>
2026-03-03 09:31:46 +08:00
Xiaoshuang Wang
f7a8befc20 [CI] Upgrade CANN to 8.5.1 (#6897)
### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-03-03 09:02:42 +08:00
tanhaoan333
15f6564976 [Model]Add Qwen3-Omni quantization Ascend NPU adaptation and optimization (#6828)
### What this PR does / why we need it?
This pull request is for quantization adaptation of Qwen3Omni, and it
achieves operator-level optimization and AUT (Auto-Quantization Tuning)
component optimization through patch-based modifications.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-03 00:07:23 +08:00
wangxiaoteng888
dfa9ff7f2a [P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 (#6898)
### What this PR does / why we need it?
Adapt the recompute feature to vLLM 0.16.0, where the D node forwards
recompute requests to the P node.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-02 23:24:03 +08:00
pu-zhe
5899438a86 [Feat][310p] 310P support w8a8s quantization and saving w8a8sc state (#6878)
### What this PR does / why we need it?
This pull request introduces significant enhancements for 310P device
support, primarily by enabling W8A8S quantization and facilitating the
saving of models with W8A8SC state outputs. It provides an example
script for saving sharded and compressed model states, implements the
core W8A8S quantization method, and integrates metadata generation
within the 310P worker to accurately describe the quantization types of
saved parameters. These changes aim to improve efficiency and
compatibility for quantized models on 310P hardware.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
W8A8S accuarcy test and W8A8SC states save.
<img width="886" height="184" alt="image"
src="https://github.com/user-attachments/assets/e9bcac54-1f69-4d3a-a5b8-221a147ef99d"
/>

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-02 20:09:15 +08:00
linfeng-yuan
68d8d20ca2 [misc] move mxfp_compat into device to decouple from quantization init chain (#6918)
### What this PR does / why we need it?
`mxfp_compat` only provides dtype/symbol compatibility helpers for
different `torch_npu` versions, but it was placed under
`vllm_ascend.quantization`. Importing it from device/ops paths could
trigger `quantization/__init__.py` and pull in heavy quantization method
dependencies, increasing startup coupling and causing import-cycle risk
(especially on 310P paths).

### Does this PR introduce _any_ user-facing change?
No functional behavior change intended.

### How was this patch tested?
CI passed.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 18:17:01 +08:00
pu-zhe
632801b0ad [CI][310P] Add 310p tracked files in CI light. (#6923)
### What this PR does / why we need it?
Add 310p tracked files in CI light.
'vllm_ascend/attention/attention_v1.py'
'vllm_ascend/ops/fused_moe/**'
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI test
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-02 18:03:46 +08:00
whx
16c879cdf7 [Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518)
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-02 17:54:25 +08:00
realliujiaxu
8547520726 [Doc][Misc] Update AGENTS.md with sign-off and PR template requirements (#6892)
### What this PR does / why we need it?

This PR enhances the `AGENTS.md` contribution guidelines with the
following improvements:

1. **Add sign-off requirement** - All commits must include
"Signed-off-by:" line using `git commit -s`
2. **Add PR title format** - Document the `[Type][Module] Description`
format for PR titles
3. **Add PR template reference** - Guide contributors to follow
`.github/PULL_REQUEST_TEMPLATE.md`
4. **Add linting check step** - Require running `bash format.sh ci`
before pushing
5. **Add lint note for all file types** - Emphasize markdown files also
need lint checking
6. **Add fork workflow guidance** - Clarify to push to fork repository,
not main repository
7. **Clarify test section** - Provide specific examples for "How was
this patch tested?"
8. **Add vLLM version note** - Warn about preserving auto-added vLLM
version info

### Does this PR introduce _any_ user-facing change?

No. This is a documentation-only update for contributors.

### How was this patch tested?

- Verified the markdown formatting renders correctly
- Confirmed all links and references are valid
- Ran `bash format.sh ci` locally and all checks passed

---

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-02 16:44:59 +08:00
Yuzhou Tong
9180dd6c51 [BugFix][PCP] Fix presion bugs for pcp/dcp in PD disaggregate (#6876)
### What this PR does / why we need it?
Fix a bug for PD disaggregate of PCP/DCP, some conditions only consider
MLA while ignoring DSA.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
15d76f74e2
- vLLM Ascend main: 81fb7d5779

Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
2026-03-02 16:11:00 +08:00
Shaoxu Cheng
ddc78dbade [300I] support decode-only aclgraph mode (#6849)
### What this PR does / why we need it?
310p aclgraph mode, but has some problems:
- the event-id hardware limit, the num of graph will be limited.
- the cann version support this feature cannot be get from external of
huawei.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
local test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-02 14:15:14 +08:00
dependabot[bot]
86c9109d16 Bump actions/upload-artifact from 6 to 7 (#6906)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 6 to 7.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-02 14:08:28 +08:00
dependabot[bot]
002ec24dd8 Bump actions/download-artifact from 7 to 8 (#6907)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 7 to 8.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-02 14:07:59 +08:00
Eric-dot
3c66a970f2 add mxfp8 moe quantization (#6670)
### What this PR does / why we need it?
support mxfp8 quantization (Qwen MOE )
Using adaptor to make the hardware-specific behavior clearer and more
maintainable
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
Co-authored-by: fangrongcan <f00876277@china.huawei.com>
Co-authored-by: wangyao-i <iwangyao@outlook.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 11:04:06 +08:00
wjunLu
c324053b44 [CI] Revert speedup image building and CI Installation related PRs (#6891)
### What this PR does / why we need it?

Revert speedup image building and CI Installation related PRs

git revert 8835236181
git revert 64fba51275
git revert 263c2f8e8d
git revert 84b00695f8


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-03-02 08:53:10 +08:00
Frank Chen
a77fe932e4 [Platform] Fix CPU binding logic (#6889)
### What this PR does / why we need it?

- Rework CpuAlloc.handle_no_affinity() to build available NUMA nodes
after allowed_cpus filtering, assign NPUs to NUMA nodes via round‑robin,
and split CPUs per NPU with disjoint slices for better balance.
- Improve bind_memory() robustness by deriving the target NUMA from each
NPU’s CPU pool, validating NUMA existence, and skipping binding when
data is missing.
- bind_memory() now only bind the single NUMA node that corresponds to
NPU id, instead of 2 NUMA nodes.
- Fix the issue that all NPUs bind to 0th NUMA node when DP16 due to
global NPU id is not visible across DP domain.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added/updated unit tests:

test_cpu_binding.py
1.   test_binding_mode_table covers A2 vs A3 binding mode mapping.
2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when
affinity info is missing.
3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch
detection.
4.   test_bind_cpus_skip_non_arm covers non‑ARM skip path in bind_cpus.

test_worker_v1.py
1. Updated mocks for enable_cpu_binding default True to align with new
config default.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
2026-03-01 20:30:43 +08:00
realliujiaxu
5e24b26a54 [Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883)
### What this PR does / why we need it?

PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/  \
    --port 8001 \
    --served-model-name auto \
    --max-model-len 1024 \
    --enforce-eager \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --additional-config '{"enable_shared_expert_dp": true}'
```

#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Hello. I have a question. Who are you?"}
  ],
  "max_tokens": 10,
  "temperature": 0.0,
  "ignore_eos_token": true
}'
```

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-01 20:22:50 +08:00
wjunLu
8835236181 [Image] Fix docker image merge tag settings (#6884)
### What this PR does / why we need it?
Fix docker image merge tag settings, to use tag as the branch name.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-03-01 12:20:57 +08:00
Bai Yongbin
9d09488b4a [Feat] support basic pcp&dcp for qwen3next (#6091)
### What this PR does / why we need it?
This PR implements Context Parallelism (CP) support for the Qwen3-Next
model, including PCP (Parallel Context Parallelism) and DCP
(Dynamic/Data Context Parallelism).

- vLLM version: v0.15.0
- vLLM main:
f176443446

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Co-authored-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-02-28 21:44:08 +08:00
wjunLu
64fba51275 [Bugfix] Fix openEuler dockerfile error (#6871)
### What this PR does / why we need it?
This pull request addresses several issues within the openEuler
Dockerfiles : `Dockerfile.a3.openEuler` and `Dockerfile.openEuler` to
ensure correct installation and setup of the Mooncake dependency. The
changes primarily involve fixing incorrect file paths for the
mooncake_installer.sh script and streamlining the declaration of the
MOONCAKE_TAG build argument, leading to more robust and accurate
container builds.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-02-28 20:55:18 +08:00
starmountain1997
5ffae03156 [bugfix] fix capture shape in sp_eagle_fullgraph (#6846)
### What this PR does / why we need it?

This was meant to be merged in #6536, but I accidentally restored a
commit. You can find the relevant discussion
[here](https://github.com/vllm-project/vllm-ascend/pull/6536#issuecomment-3882883471).

Since `self.pass_config.enable_sp` is forcibly set to `False` in the
[source
code](f176443446/vllm/config/compilation.py (L1066)),
this section will no longer verify whether the generated cudagraph
shapes are multiples of both `uniform_decode_query_len`
(`num_speculative_tokens + 1`) and `tensor_parallel_size`.

This PR enables the `num_speculative_tokens + 1` and
`tensor_parallel_size` check upfront. Therefore, it won't silently round
up the `cudagraph_size` and throw a cryptic error for the user.

A typical example of this cryptic error looks like:
```
ValueError: could not broadcast input array from shape (196,) into shape (14,)
```

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

Have passed all test.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-28 17:30:02 +08:00
zyz111222
81fb7d5779 [Doc] add 310P3 guidance of PaddleOCR-VL (#6837)
### What this PR does / why we need it?
add 310P3 guidance of PaddleOCR-VL model, refresh PaddleOCR-VL.md in the
docs/source/tutorials/

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by CI

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zouyizhou <zouyizhou@huawei.com>
2026-02-28 16:03:07 +08:00
luomin2005
3cc8bf15da Support platform.get_device_uuid function (#6777)
### What this PR does / why we need it?
Support platform.get_device_uuid function.
currently, the pytorch.npu.get_device_properties return uuid as full
zero, vllm-ascend implement the interface at first, once the
pytorch.npu.get_device_properties return the real uuid, vllm-ascend will
support without modification.
more details see 
https://github.com/vllm-project/vllm-ascend/issues/6669
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea
root@localhost:/workspace/l00614971/vllm_test# python vllm_test.py 
INFO 02-24 09:43:48 [__init__.py:43] Available plugins for group
vllm.platform_plugins:
INFO 02-24 09:43:48 [__init__.py:45] - ascend -> vllm_ascend:register
INFO 02-24 09:43:48 [__init__.py:48] All plugins in this group will be
loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 02-24 09:43:48 [__init__.py:217] Platform plugin ascend is
activated
device_uuid =  00000000-0000-0000-0000-000000000000

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-02-28 14:17:12 +08:00
wjunLu
263c2f8e8d [CI] Revert auto rebase (#6867)
### What this PR does / why we need it?
Revert auto rebase

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-02-28 11:54:31 +08:00
wangxiyuan
3d563292f3 clean 0.15.0 support (#6852)
Clean up vllm 0.15.0 related code

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-28 09:20:57 +08:00
wjunLu
84b00695f8 [CI] Refactor to speedup image building and CI Installation (#6708)
### What this PR does / why we need it?
1. Refactor  image workflow using cache-from to speedup builds

![build](https://github.com/user-attachments/assets/02135c12-0069-44f8-a3ec-5c2b4282448a)

Simultaneously refactored all Dockerfiles by placing layers that rarely
change before those that change frequently, improving build cache hit
rate.

2. Refactor E2E test using vllm-ascend container images, to skip C
compile while no C code are changed

![e2e](https://github.com/user-attachments/assets/49f5b166-0df3-41e1-8f71-b3bbbed17cfd)

In this case, the job will only replace the source code of vllm-ascend
and install `requirements-dev.txt`, saving about 10min before tests

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-02-28 09:06:00 +08:00
drslark
5666ce03f5 [bugfix] Fixed an accuracy problem of gdn layer in graph (#6822)
### What this PR does / why we need it?

There will be random ouputs if we run model with GDN attention in graph
mode:

```python
prompts = [
    "1. Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
            tensor_parallel_size=4,

            distributed_executor_backend="mp",
            gpu_memory_utilization=0.7,
            speculative_config={
                "method": "qwen3_next_mtp",
                "num_speculative_tokens": 3,
            },
            
            compilation_config={
                "cudagraph_mode": "FULL_DECODE_ONLY",
                "cudagraph_capture_sizes": [8],
            },
            
            max_model_len=4096, 
            enable_prefix_caching=False)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before appling this change, the outputs was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change sovle the problem?**

Now, `query_start_loc` is padded because of `fia`.

But, for `gdn-attention`, padded version of `query_start_loc` will cause
accuracy problem.

So, we need an unpadded version of `query_start_loc` named
`gdn_query_start_loc` and use it in `gdn-attention`, it works fine.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

As described aboved.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: drslark <slarksblood@qq.com>
2026-02-28 08:57:53 +08:00
wangxiyuan
9cd0d6c33d [Doc][Misc] Update release notes for v0.15.0rc1 (#6859)
### What this PR does / why we need it?

This PR updates the release notes for `v0.15.0rc1` to:
- Mark the `310P MoE and W8A8 Support` feature as experimental.
- Add a note for `Kimi-K2.5 Model Support` clarifying that it has known
issues in vLLM 0.15.0 and requires manual patching to work correctly.

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only update.

### How was this patch tested?

N/A (documentation change).

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 22:35:09 +08:00
wjunLu
b60b991005 [CI] Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector (#5441)
### What this PR does / why we need it?

Add nightly test for Qwen3-235B-A22B with  mooncake layerwise connector.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-27 16:31:02 +08:00
lilinsiman
c13d90b766 [Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811)
### What this PR does / why we need it?
[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into
eagle_proposer.py

This pull request significantly refactors the speculative decoding
mechanism by merging Parallel Context Processing (PCP) and Multi-Token
Prediction (MTP) functionalities directly into the eagle_proposer.py.
The changes aim to enhance the efficiency and correctness of distributed
speculative decoding, particularly by enabling the Eagle feature to work
seamlessly with the disable_padded interface. This involves detailed
adjustments to attention metadata, input/output processing, and state
management to ensure proper operation in parallel environments.

1. The PCP and MTP features are migrated to the eagle_proposer.py
2. The Eagle and PCP features are integrated
3. Enable the eagle feature to use the disable_padded interface

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tests and UT

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-02-27 16:06:56 +08:00
Canlin Guo
e4458b2d2b [Main2Main] Upgrade vLLM to 0226 (#6813)
### What this PR does / why we need it?

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-02-27 16:05:21 +08:00
starmountain1997
80316c5824 [DOC] enable both flashcomm1 and cudagraph (#6807)
## What this PR does / why we need it?

This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.

### Changes

- **Enable FlashComm1**: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`
environment variable across all deployment scenarios to enable
FlashComm1 for improved communication performance
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links

## Does this PR introduce _any_ user-facing change?

Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.

## How was this patch tested?

Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.

---

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-27 14:52:55 +08:00
wangxiyuan
3d43ed997e add release note for 0.15.0rc1 (#6839)
Add release note for 0.15.0rc1

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 11:55:55 +08:00
wangxiyuan
a95c0b8b82 [Doc] fix the nit in docs (#6826)
Refresh the doc, fix the nit in the docs

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-27 11:50:27 +08:00
Nengjun Ma
981d803cb7 [CI] Fix doc test fail when load model with error information: 'Stale file handle' (#6832)
### What this PR does / why we need it?

This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.

The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.

### Does this PR introduce _any_ user-facing change?

No, this change only affects the CI test execution environment and has
no impact on users.

### How was this patch tested?

This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-27 09:14:42 +08:00
realliujiaxu
5def28dcd3 [Feat]support sequence parallelism by pass for VL models (#5632) 2026-02-27 08:27:41 +08:00
Yikun Jiang
ed175d6d92 [Doc][Release] Add release note skill (#6824)
### What this PR does / why we need it?
This PR adds the releaseing note skills:
- `SKILL.md`: vLLM Ascend Releasing Note Writer
- `references/ref-past-release-notes-highlight.md`:
And also add a `output/v0.13.0` examples which was used by
2da476d82f

Inspired: https://github.com/simon-mo/release-notes-writing/

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1


Co-authored-by: esmeetu <jasonailu87@gmail.com>

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2026-02-26 21:01:21 +08:00
MengLong Chen
2d49f9079a [BugFix] Support ALL D-Nodes in fullgraph when running MTP in PD (#5472)
### What this PR does / why we need it?
**BUG**
When using prefill-decode disaggregation + MTP + full graph
+asynchronous scheduling, the KV cache pulled by decode nodes from
prefill decodes does not include spec tokens. As a result, the
total_num_scheduled_tokens obtained by decode nodes from the scheduler
lacks spec tokens. When determining whether to enqueue the full graph on
decode nodes, the condition for uniform_decode `
scheduler_output.total_num_scheduled_tokens == self.input_batch.num_reqs
* max_query_len` is not met, leading to the current instance not being
enqueued into the full graph.

The above situation leads to both full graph and eagle mode instances
coexisting in the decode instances. Due to the synchronization wait of
MoeDispatch, the decode instances in full graph are significantly slowed
down by the instance in eagle mode.

**Solution**
The scenario is PD separation + MTP + Full Graph + asynchronous
scheduling.
On the decode nodes, the spec tokens of the request with KV cache from P
need be padded. Then, the padded spec tokens will be rejected by
sampling. This operation ensures that the uniform_decode condition is
satisfied when determining whether decode nodes are included in the full
graph, thereby guaranteeing that all decode instances are present in the
full graph and avoiding synchronous waiting for MoeDispatch.

- vLLM version: v0.15.0
- vLLM main:
5326c89803

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2026-02-26 19:09:05 +08:00
wangxiyuan
532f7a82f2 [Patch][Misc] Cleanup and update patches (#6802)
### What this PR does / why we need it?

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes several obsolete patches: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

### Does this PR introduce _any_ user-facing change?

No. These are internal changes to the patching mechanism and should not
affect users.

### How was this patch tested?

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 14:45:33 +08:00
wangxiyuan
c9d05d10aa [Doc][Misc] Refactor skill documentation and add Claude support instructions (#6817)
### What this PR does / why we need it?
This PR refactors the documentation for vLLM Ascend skills.
- It renames and moves the `vllm-ascend-model-adapter` skill's README to
serve as a new top-level README for the `.agents` directory.
- It adds instructions on how to use the Ascend skills with Claude,
including a new README in the `.claude` directory.
- It updates `.gitignore` to exclude skills copied for Claude's use.
- Add main2main skill

This improves the documentation structure, making it more organized and
providing clear instructions for developers using these skills with
different tools.

### Does this PR introduce _any_ user-facing change?
No, this PR contains only documentation and repository configuration
changes. It does not affect any user-facing code functionality.

### How was this patch tested?
These changes are documentation-only and do not require specific
testing. The correctness of the instructions is being verified through
this review.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 14:42:59 +08:00
pu-zhe
e76b69b9ef [BugFix] [310p] Fix attention accuracy issue (#6803)
### What this PR does / why we need it?
This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-26 14:30:39 +08:00
Canlin Guo
9f8b84e5fc [Misc] Drop patch_rope.py (#6291)
### What this PR does / why we need it?

Part of #5304.

We have align with vLLM's latest change for `RotaryEmbeddingBase`. Don't
need this patch anymore.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-02-26 14:04:53 +08:00
Cao Yi
3953dcf784 [Feature][Quant] Auto-detect quantization format from model files (#6645)
## Summary

- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.

## Details

**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior

**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.

## Files Changed

| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-26 10:59:25 +08:00
starmountain1997
bc1622338c [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)
### What this PR does / why we need it?

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

and we also add aime2025 test for dual-node deepseek 3.2 nightly test.

### How was this patch tested?

test at nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-26 10:58:50 +08:00
Dijurido
169e434f78 [CI] Fix EAGLE CI problems (#6702)
### What this PR does / why we need it?
New FIA operator requires queryT equal to the last element of
actualSequenceLengthQ.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed existing test (test_mtp_eagle_correctness.py).

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: Wangbingjie <w30061490@china.huawei.com>
Co-authored-by: Wangbingjie <w30061490@china.huawei.com>
2026-02-26 10:26:01 +08:00
Li-Yongwen
2870f7c8ad [Feat] Support routing replay (#6696)
### What this PR does / why we need it?

[Feat] Support routing replay
same as https://github.com/vllm-project/vllm-ascend/pull/6666
resubmit  because of DOC failure

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 10:22:47 +08:00