1120 Commits

Author SHA1 Message Date
hucong
d3de7333dc [BugFix][v0.18.0][cherry-pick] Fix embedding prefix caching for APC (#7894)
## What this PR does / why we need it?
pick-from: https://github.com/vllm-project/vllm-ascend/pull/7452
### Problem
Embedding models produce inconsistent outputs when prefix caching is
enabled vs disabled.

### Root Cause
The attention router condition was too broad:
- All `model_runner_type == "pooling"` → `_forward_encoder_attention()`
→ uses `npu_fusion_attention`
- **But `npu_fusion_attention` does NOT support prefix caching**
- Result: Numerical mismatch when KV cache is managed by prefix caching

### Solution
Refine the router condition to check causality:

**Before**: 
```
if attn_metadata.model_runner_type == "pooling":
    → npu_fusion_attention (no prefix caching support)
```

**After**: 
```
if attn_metadata.model_runner_type == "pooling" and not attn_metadata.causal:
    → npu_fusion_attention (for true encoders)
else:
    → npu_fused_infer_attention_score (prefix caching support)
```
### Changes Made

1. **Fixed router condition** (`vllm_ascend/attention/attention_v1.py`
L968)
   - Added `and not attn_metadata.causal` check
   - Effect: Non-causal embeddings now use correct operator

2. **Simplified encoder attention**
(`vllm_ascend/attention/attention_v1.py` L864-877)
   - Removed redundant causal branch (encoders never use causal mask)
   - Reduced from 34 lines to 14 lines

3. **Added test** (`tests/e2e/singlecard/pooling/test_embedding.py`)
- Validates embedding outputs with/without prefix caching are consistent
  
## Does this PR introduce _any_ user-facing change?

### Functional Changes
 **Yes** - Bug fix: Embedding models now produce consistent outputs
with prefix caching

### API Changes
 **No** - All public APIs unchanged

### Configuration Changes
 **No** - No new configuration required

### Backward Compatibility
 **Fully compatible** - Only fixes incorrect behavior

## How was this patch tested?
### New Test
Added `test_embed_models_using_prefix_caching_correctness()` (a rough sketch follows below):
- Tests: `Qwen3-Embedding-0.6B`
- Validates numerical consistency between runs with/without prefix
caching
- Uses long sequences to activate prefix caching
- Tolerance: 1e-2
- vLLM version: v0.18.0
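
For reference, a rough sketch of such a consistency check (the actual test lives in `tests/e2e/singlecard/pooling/test_embedding.py`; the prompts and keyword arguments here are illustrative, and the offline `LLM.embed(...)` API is assumed):

```python
import torch
from vllm import LLM

def embed(enable_prefix_caching: bool):
    llm = LLM(model="Qwen/Qwen3-Embedding-0.6B",
              enable_prefix_caching=enable_prefix_caching)
    # Long prompts with a shared prefix so prefix caching actually kicks in.
    prompts = ["shared long prefix " * 64 + suffix for suffix in ("alpha", "beta")]
    return [torch.tensor(o.outputs.embedding) for o in llm.embed(prompts)]

baseline = embed(enable_prefix_caching=False)
cached = embed(enable_prefix_caching=True)
for b, c in zip(baseline, cached):
    assert torch.allclose(b, c, atol=1e-2)
```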

Signed-off-by: underfituu <hzhucong@163.com>
2026-04-01 16:57:33 +08:00
Nagisa125
2cb9195ff0 [Releases/v0.18.0][CI] Updated the parameters for the single-node test to fix the OOM issue for DeepSeek-V3.2 (#7862)
### What this PR does / why we need it?
Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8
nightly test of vllm-ascend:

- Reduced the value of HCCL_BUFFSIZE

- Lowered the gpu-memory-utilization

Optimize service-side performance:
Updated serving configuration parameters (e.g., max-num-seqs,
cudagraph_capture_sizes, batch_size) to improve inference performance,
bringing it closer to the optimal performance of the current mainline.

Align the performance baseline with the main branch:
Updated the performance baseline according to the latest performance
data.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The test has passed.

https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793

---------

Signed-off-by: wyh145 <1987244901@qq.com>
2026-04-01 10:28:46 +08:00
weiguihua2
59a7526339 [CI][Misc] modify ds3.2+dcp ci (#7841)
### What this PR does / why we need it?

Because the current DCP solution all-gathers the KV cache, performance
deteriorates significantly and the CI may get stuck. This PR temporarily
removes the performance and accuracy benchmarks for DeepSeek-V3.2-W8A8-cp
to prevent CI hangs until the optimization is complete.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified that the configuration file remains valid and that the CI no
longer attempts to run the problematic benchmarks.

pick-from: https://github.com/vllm-project/vllm-ascend/pull/7842

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-04-01 08:58:21 +08:00
pz1116
0b48ddbc8b [Bugfix][0.18.0][KV Pool]Fix KV transfer put logic (#7718)
### What this PR does / why we need it?
Previously, when doing a put for the KV Pool, we found the first
non-existing key and put all blocks starting from that index. However,
if the prefix-cache blocks come from another request and some of them
have been evicted by LRU, we end up putting blocks that still exist in
the pool, causing MooncakeStore to print unnecessary logs in the master
service.

What this PR does:

- Look up all the keys and put only the ones that are missing (see the
sketch below).
- Fix lookup_scheduler in pool_worker so it handles GQA correctly.
- Fix a few existing typos.
- Add a UT, written by Codex.
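
A minimal sketch of the new put logic (the `lookup`/`put` helper names are hypothetical, not the actual MooncakeStore API):

```python
def put_missing_blocks(store, keys, blocks):
    # Look up every key first instead of stopping at the first miss.
    exists = store.lookup(keys)
    for key, block, found in zip(keys, blocks, exists):
        if not found:
            # Only put blocks whose keys are genuinely absent from the pool,
            # so blocks cached by another request are never re-put.
            store.put(key, block)
```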

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
2026-03-31 20:21:23 +08:00
linfeng-yuan
ed4ef1f4e7 [releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794)
### What this PR does / why we need it?
Implement `get_token_bin_counts_and_mask` and `apply_penalties` with
Triton-Ascend kernels. This significantly reduces the latency of the
sampling process when repetition/frequency/presence penalties are
enabled.
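
For context, the unfused PyTorch semantics these kernels accelerate look roughly like the following (a sketch with assumed shapes, not the kernel signatures):

```python
import torch

def apply_penalties_reference(logits, token_bin_counts, token_mask,
                              repetition, frequency, presence):
    # logits: [batch, vocab]; token_bin_counts/token_mask: per-request token statistics
    rep = torch.where(token_mask, repetition.unsqueeze(1), torch.ones_like(logits))
    # Repetition penalty: shrink positive logits / grow negative logits of seen tokens.
    logits = torch.where(logits > 0, logits / rep, logits * rep)
    # Frequency and presence penalties subtract per-token counts and a presence indicator.
    logits = logits - frequency.unsqueeze(1) * token_bin_counts
    logits = logits - presence.unsqueeze(1) * token_mask.float()
    return logits
```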

Cherry-pick from main PR #7569 
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2026-03-31 19:01:51 +08:00
ZT-AIA
66db070423 [cherry-pick][Test]repair for test_compute_slot_mapping (#7836)
### What this PR does / why we need it?
Repair `test_compute_slot_mapping`.

Signed-off-by: ZT-AIA <1028681969@qq.com>
2026-03-31 16:52:58 +08:00
jack
7314bbe2df fix(platform): reimplement MiniMax usage accounting patch (#7835)
## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting

## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching
OpenAIServingChat usage accounting.`

This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.
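
A hedged sketch of the wrapping approach (not the actual patch; names are illustrative):

```python
def wrap_chat_stream(original_generator, usage_tracker):
    # Wrap the original stream generator instead of rewriting its source text.
    async def wrapped(*args, **kwargs):
        async for chunk in original_generator(*args, **kwargs):
            usage_tracker.record(chunk)  # hypothetical helper: accumulates output token ids
            yield chunk
    return wrapped
```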

For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.

## Validation
- `pytest -q
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-31 16:27:00 +08:00
Yang Yuxi
e776d5c0f1 [Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726)
### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.

--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
2026-03-29 15:59:19 +08:00
wangbj127
9cc41c9457 [v0.18.0][Bugfix][EAGLE] Fix FIA pad bug under max concurrency (#7754)
cherry picked from https://github.com/vllm-project/vllm-ascend/pull/7740
Fixes padding problems of FIA op under max concurrency.

- vLLM version: v0.18.0
- vLLM main:
35141a7eed

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-03-29 12:23:44 +08:00
Wang Kunpeng
5df2ddd8db [v0.18.0][Bugfix]Fix Error "AttributeError: 'AscendCompressedTensorsConfig' object has no attribute 'enabling_fa_quant'" (#7748)
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736

**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' object has no attribute
'enabling_fa_quant'`

**Error Cause**
Currently, FA3 quantization supports only weights quantized with
ModelSlim; the newly added methods are not defined in
`AscendCompressedTensorsConfig`.

**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-28 17:03:56 +08:00
jack
f83cb0e6dc [Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710)
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.

It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
emitted to the client, instead of trusting parser-internal buffered
state
- adding focused regression tests for finish backfill and terminal chunk
handling
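
A rough illustration of the backfill idea (the variable names are hypothetical; the real logic lives in `vllm_ascend/patch/platform/patch_glm_tool_call_parser.py`):

```python
def finish_backfill(full_arguments: str, already_sent: str) -> str:
    # The finish chunk is computed from what was actually streamed to the client,
    # not from the parser's internal buffered state.
    return full_arguments[len(already_sent):]
```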

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-28 09:15:04 +08:00
weiguihua2
bc8e87f3db [v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681)
### What this PR does / why we need it?
Fixes an issue that occurs when DCP is combined with the MTP scenario
for DeepSeek-V3.2.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-03-27 14:24:53 +08:00
Li Wang
2c2d8bb015 [cherry-pick][CI] Enforce torchaudio and torchvision compatible with pta (#7688)
### What this PR does / why we need it?
This patch is cherry-picked from #7648.

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-27 11:06:13 +08:00
jack
53cc225cac [v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting (#7700)
### What this PR does / why we need it?
This backports the MiniMax M2 reasoning-token usage accounting fix onto
`releases/v0.18.0` for vllm-ascend.

The release branch does not include the other local GLM patch commit, so
this PR keeps the MiniMax change self-contained by:
- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat
usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited
outputs without depending on the GLM suffix patch

### Does this PR introduce _any_ user-facing change?
Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses
now reports corrected reasoning token counts on the release branch.

### How was this patch tested?
- `python -m compileall
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for
`vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of
`releases/v0.18.0`

No targeted automated regression test exists for this release-branch
backport yet, so I validated syntax and module import compatibility on
the release branch.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-27 10:45:28 +08:00
wangbj127
2ad0ca52a6 Qwen3.5 MoE supports flashcomm v1 (#7644)
cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486
### What this PR does / why we need it?
Multimodal models like Qwen3.5 MoE do the embedding in the model runner,
so when FlashComm is enabled, the first AllGather operation should be
skipped (see the sketch below).
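
Illustrative sketch only (not the actual implementation), assuming vLLM's `tensor_model_parallel_all_gather` helper:

```python
from vllm.distributed import tensor_model_parallel_all_gather

def maybe_first_all_gather(hidden_states, embedded_in_model_runner: bool):
    if embedded_in_model_runner:
        # Multimodal path (e.g. Qwen3.5 MoE): embeddings were produced in the
        # model runner, so the first FlashComm v1 AllGather is skipped.
        return hidden_states
    return tensor_model_parallel_all_gather(hidden_states)
```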

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
2026-03-25 23:09:33 +08:00
linfeng-yuan
05a561129e [Graph][Bugfix] Set default cudagraph max capture size via platform defaults (#7572)
### What this PR does / why we need it?

This PR lets NPU platform provide its own default
`max_cudagraph_capture_size` via
`NPUPlatform.apply_config_platform_defaults()`.

Previously, when cudagraph sizing was left unset, Ascend inherited
vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which
uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's
default to `min(max_num_seqs * decode_query_len, 512)` while keeping the
rest of vLLM's cudagraph sizing logic unchanged.
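
In other words, the Ascend default can be summarized as (a standalone sketch, not the actual platform hook):

```python
def ascend_default_max_capture_size(max_num_seqs: int, decode_query_len: int) -> int:
    # Upstream default would be max_num_seqs * decode_query_len * 2;
    # Ascend drops the "* 2" and caps the value at 512.
    return min(max_num_seqs * decode_query_len, 512)
```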

### Does this PR introduce _any_ user-facing change?

Yes, but only for Ascend when users do not explicitly configure
cudagraph sizing.

If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both
unset, we now use `max_num_seqs * decode_query_len` (capped at `512`)
instead of the upstream `* 2` default. Explicit user settings are
unchanged.

### How was this patch tested?

Add unit tests to cover:

- default max injection via `apply_config_platform_defaults()`
- explicit `max_cudagraph_capture_size` is preserved
- explicit `cudagraph_capture_sizes` are preserved
- Ascend default max no longer uses the upstream `* 2`
- late `_set_cudagraph_sizes()` recomputation reuses the current max
input

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-25 17:57:19 +08:00
Shaoxu Cheng
e0e585a109 [310P]: add torch chunk gated delta rule and 910b parity ut (#7594)
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the  chunk gated delta rule on 310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-25 16:46:43 +08:00
Shaoxu Cheng
3f4087a8f0 [310P]fused recurrent gated delta rule pytorch core and ut (#7398)
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the fused recurrent gated delta rule on
310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT
- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-25 08:53:14 +08:00
SILONG ZENG
1e3c1e76bf [Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511)
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.

Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR

This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.
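
As an example of the bug class that `check-boolean-context-manager` flags:

```python
from contextlib import nullcontext

# Wrong: `and` makes this a boolean expression, so only the second manager is entered.
with nullcontext() and nullcontext():
    pass

# Correct: both context managers are entered.
with nullcontext(), nullcontext():
    pass
```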

### Does this PR introduce _any_ user-facing change?
No.

This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.

### How was this patch tested?
Tested locally in the project virtual environment.

Commands used:
```bash
bash format.sh
```
Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s

To bypass pre-commit hooks, add --no-verify to git commit.
```
**note:**
clang-format is enabled but currently excludes all csrc/**


- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-24 20:03:01 +08:00
lhp-deep
0e3186f07c [model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575)
### What this PR does / why we need it?

This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to
improve performance. The key changes include:
- A new Triton kernel implementation (`_compute_slot_mappings_kernel`)
with NPU-specific optimizations, such as using `tl.gather` to handle
non-contiguous memory access and replacing modulo operations.
- A new method `compute_slot_mappings` in `AscendBlockTables` to use
this new kernel.
- An end-to-end test to verify the correctness of the new kernel against
the reference GPU implementation.

The optimization is needed to avoid performance degradation from scalar
computation on Ascend devices.
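
For context, the semantics of slot mapping can be stated as a plain PyTorch reference (this is only the definition, not the Triton kernel added in this PR):

```python
import torch

def compute_slot_mapping_reference(block_table: torch.Tensor,
                                   positions: torch.Tensor,
                                   block_size: int) -> torch.Tensor:
    # block_table: physical block ids of one request; positions: token positions.
    block_ids = block_table[positions // block_size]
    # Each token position maps to a physical KV-cache slot inside its block.
    return block_ids * block_size + positions % block_size
```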
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
2026-03-24 17:29:14 +08:00
realliujiaxu
5d12446573 [Feat][SP] Support SP for VL MoE models (#7044)
### What this PR does / why we need it?

2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.


### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.


### How was this patch tested?
- Model: Qwen3-VL-30B-A3B, 
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k         | 429.40               | 323.3                  |
| 16k        | 1297.01              | 911.74                |

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-24 17:16:00 +08:00
LeeWenquan
9615bc33fd Fix Qwen3Next CI Config (#7561)
### What this PR does / why we need it?
This PR modifies the Qwen3Next nightly CI config:
(1) Add a nightly CI job.
(2) Set a more precise accuracy standard.

- vLLM version: v0.18.0
- vLLM main:
6a9cceb219

Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
2026-03-24 17:08:17 +08:00
Angazenn
bdb65319a9 [UT] Align input arguments with Ascend(Yarn)RotaryEmbedding with vLLM and add ut (#7358)
### What this PR does / why we need it?
This PR adds missing arguments to `AscendRotaryEmbedding` and
`AscendYarnRotaryEmbedding` to conform with vLLM, and introduces a
corresponding UT.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
2026-03-24 16:02:56 +08:00
Shaoxu Cheng
83bd77c983 [310p]: add rmsnorm gated fallback and unit test (#7424)
### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to
the native implementation.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
ut
- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-24 09:00:11 +08:00
jiaojiao
1de805ce0a [Ops][Misc] Refactor and optimize CausalConv1d for Ascend (#7495)
### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the
`torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant
performance bottlenecks. To address this, we have re-implemented the
optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
1. Accuracy test
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name                   |   Process | Progress   | Time Cost   | Status   | Log Path                                  | Extend Parameters   |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k |   2918978 | NA         | 0:00:01     | finish   | logs/eval/vllm-api-general-chat/gsm8k.out | None                |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      271d0b     accuracy  gen                       96.21
```
2. Modified UT test
`pytest -sv
/home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>
2026-03-24 00:07:12 +08:00
Nengjun Ma
8e0789bb36 [CI] Recover pd disaggregated encoder test case that was incorrectly skipped (#7505)
### What this PR does / why we need it?
[CI] Recover the pd disaggregated encoder test case that was incorrectly
skipped in PR: https://github.com/vllm-project/vllm-ascend/pull/7412

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 21:41:28 +08:00
weijinqian0
bdd90c0088 [model_runner_v2]optimize the performance of the post_update. (#7496)
### What this PR does / why we need it?
- This PR improves operator performance in the `post_update` phase of
`model_runner_v2` on NPUs, which matters for scenarios where
high-performance inference is required.
- With bs = 256, the time cost drops from 26 us to 11 us.

### Does this PR introduce _any_ user-facing change?
No, there are no changes to the API, interface, or other high-level
behaviors that would directly affect the user's code or interaction with
the system beyond the performance improvement.

### How was this patch tested?
CI passed with new added/existing tests. In addition to the regular CI
tests, specific benchmark tests were conducted on NPU hardware to
measure the performance improvement of the `post_update` operators.

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2026-03-23 20:29:55 +08:00
Shaoxu Cheng
13397e9cb7 [310p] Add a PyTorch implementation of the GDN gating operator on 310P (#7430)
### What this PR does / why we need it?
RFC #7394
Add a PyTorch implementation of the GDN gating operator on 310P.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

- vLLM version: v0.17.0
- vLLM main:
4497431df6

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 20:26:39 +08:00
zhangxinyuehfad
886756aea0 [Bugfix][CI] Fix aisbench installation to avoid Gitee authentication (#7536)
### What this PR does / why we need it?
- Pass GITEE_USERNAME (var) and GITEE_TOKEN (secret) as Docker build
  args in nightly image build so Dockerfile can authenticate to Gitee
- In Dockerfile.nightly.a2/a3, embed credentials into clone URL to
  avoid auth failure during `git clone`
- In single-node and multi-node PR test workflows, backup the
  pre-installed benchmark from the nightly image before wiping
  vllm-ascend, then restore it instead of re-cloning from Gitee,
  which is inaccessible from fork PR contexts

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-23 20:16:51 +08:00
liuhy1213-cell
fb283b5820 [CI] Add nightly CI test cases for the GLM-5 (#7429)
### What this PR does / why we need it?
Add nightly CI test cases for the GLM-5
Add model download for the GLM-5

https://github.com/vllm-project/vllm-ascend/actions/runs/23286178651/job/67710409642#logs
- vLLM version: v0.17.0
- vLLM main:
b31e9326a7
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Signed-off-by: liuhy1213-cell <liuhy1213@gmail.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-23 19:14:19 +08:00
Nengjun Ma
8e2c59e1ee Main2main upgrade vllm commit to 03 19 17:00 (#7478)
### What this PR does / why we need it?
Upgrade vllm commit to 2026.03.19.

1. Fix `socket` removed from `StatelessProcessGroup`. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored `StatelessProcessGroup` and
removed the `socket: socket.socket | None` field. Socket ownership was
moved to a new `create_tcp_store()` helper instead of being stored as a
field on the dataclass.

2. Fix `virtual_engine` parameter removed from `set_forward_context()`.
Upstream PR [#37195](https://github.com/vllm-project/vllm/pull/37195)
("[V0 Deprecation] Deprecate virtual engine").

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-23 16:25:57 +08:00
Qiu
71df17f4e6 bugfix(MC2): refactor the comm group of MC2 to be compatible with PP (#7291)
### What this PR does / why we need it?
This PR refactors the communication group of MC2 to keep it consistent
with vllm's EP group, making it compatible with PP.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-23 15:44:21 +08:00
Shaoxu Cheng
5b60b530d6 [Bugfix][310p] the new A5 mmencoder op does not support 310p (#7518)
### What this PR does / why we need it?

Because the new A5 MMEncoder operator was merged, the 310P can no longer
run any VL models. This PR fixes that issue; details at #7046.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e
- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-23 15:40:34 +08:00
Shanshan Shen
5c0d02f689 [Bugfix] Fix multi-instance serving OOM on single card (#7427)
### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/issues/7308.

Subtract `init_non_torch_memory` (which may be occupied by the first
instance) from the total `non_torch_memory` when calculating
`available_kv_cache_memory`, and directly use `non_torch_memory_increase`
(already contained in `non_kv_cache_memory`) in that calculation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Launch two vllm-ascend instances sequentially on a single card.

```bash
# Launch first instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8100 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager

# Launch second instance
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B \
--port 8101 \
--host 0.0.0.0 \
--additional-config='{"enable_cpu_binding":true}'  \
--gpu-memory-utilization 0.3 \
--max-num-seqs 1 \
--max-model-len 2048 \
--max-num-batched-tokens 2048 \
--no-enable-prefix-caching \
--enforce-eager
```

**Before this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340388298034668 GiB
init_non_torch_memory: 0.3616676330566406 GiB
non_torch_memory_before_empty_cache: 0.3896217346191406 GiB
non_torch_memory_increase: 0.0279541015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.3616676330566406 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2336344718933105 GiB
init_non_torch_memory: 18.37220001220703 GiB
non_torch_memory_before_empty_cache: 18.399906158447266 GiB
non_torch_memory_increase: 0.02754974365234375 GiB
non_torch_memory_cleared_by_empty_cache: 18.372356414794922 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: -1.32 GiB
```

**After this PR:**

```bash
# First instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.2340540885925293 GiB
init_non_torch_memory: 0.36182403564453125 GiB
non_torch_memory_before_empty_cache: 0.38979339599609375 GiB
non_torch_memory_increase: 0.0279693603515625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------

# Second instance:
------------------------------------------------------------------
requested_memory: 18.287109375 GiB
non_kv_cache_memory: 1.233344554901123 GiB
init_non_torch_memory: 18.74309539794922 GiB
non_torch_memory_before_empty_cache: 18.770355224609375 GiB
non_torch_memory_increase: 0.02725982666015625 GiB
non_torch_memory_cleared_by_empty_cache: 0.0 GiB
------------------------------------------------------------------
# available_kv_cache_memory = requested_memory - non_kv_cache_memory - non_torch_memory_cleared_by_empty_cache
Available KV cache memory: 17.05 GiB
```

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2026-03-23 14:22:59 +08:00
Canlin Guo
e68464a1d6 [Bugfix] Fix slow hasattr in ACLGraphWrapper.__getattr__ (#7442)
### What this PR does / why we need it?

Follow https://github.com/vllm-project/vllm/pull/37425,
https://github.com/vllm-project/vllm-omni/pull/1982

Copied from them:

Notice that `hasattr(self.model, "flush_pending_metadata")` costs ~6 ms
per decode step when profiling Qwen3 Omni.

The original `CUDAGraphWrapper.__getattr__` raises:
```python
raise AttributeError(f"... cudagraph wrapper: {self.runnable}")
```
When `hasattr()` is called for a non-existent attribute, Python
internally calls `__getattr__`, which constructs this AttributeError. The
`{self.runnable}` interpolation triggers `__repr__()` on the underlying
model (e.g., `Qwen3OmniMoeForConditionalGeneration`), which recursively
traverses the entire nn.Module tree to generate an 18,000+ character
string. This takes ~6-7 ms per call.
Since `hasattr(self.model, "flush_pending_metadata")` is called every
decode step in the Talker forward path, this adds ~6 ms of overhead per
step, severely impacting audio inter-chunk latency (ICL).

```Python
hasattr(self.model, "flush_pending_metadata")
  → getattr(self.model, "flush_pending_metadata")
    → not found in CUDAGraphWrapper.__dict__
    → not found in the CUDAGraphWrapper class hierarchy
    → triggers CUDAGraphWrapper.__getattr__("flush_pending_metadata")
      → hasattr(self.runnable, "flush_pending_metadata")  # runnable also doesn't have it
      → executes raise AttributeError(f"... {self.runnable}")
        → Python needs to construct the exception object
        → the f-string triggers self.runnable.__repr__()
        → Qwen3OmniMoeForConditionalGeneration.__repr__()
          → recursively traverses the entire nn.Module tree
          → generates a 18,000+ character string
          → takes ~6 ms
        → AttributeError object is created
    → hasattr catches the AttributeError and returns False
    → the 18,000-character string is immediately discarded (no one ever sees it)
```
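
A hedged sketch of the fix direction (not necessarily the merged code): keep the `AttributeError` cheap by not interpolating `self.runnable`, whose `__repr__` walks the whole nn.Module tree.

```python
def __getattr__(self, key):
    # Inside the wrapper class; `runnable` is the wrapped model stored in __dict__.
    runnable = self.__dict__["runnable"]
    if hasattr(runnable, key):
        return getattr(runnable, key)
    raise AttributeError(
        f"Attribute {key!r} not found on {type(self).__name__} or its wrapped runnable")
```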

### Does this PR introduce _any_ user-facing change?

NO.

### How was this patch tested?

See https://github.com/vllm-project/vllm-omni/pull/1982


- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-03-23 09:26:24 +08:00
Li Wang
75fae619d5 [Misc] Refactor aclgraph accuracy test to use logprob-based comparison (#7455)
### What this PR does / why we need it?

Replace text-match assertions with a two-tier logprob accuracy check:

- Prefill (token 0): assert token ID is identical between eager baseline
and compiled mode, then verify logprob matches within `atol`.
- Decode (tokens 1-2): if chosen tokens match, compare logprobs
directly; if they differ, cross-lookup the baseline token in the
compiled model's top-20 distribution and assert the assigned logprob is
within `decode_atol` (defaults to 2x atol). This tolerates minor argmax
drift caused by floating-point differences while still catching
distribution divergence.
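
Restated as a sketch (the result data structures are assumed, not the exact test code):

```python
def assert_logprobs_close(baseline, compiled, atol, decode_atol):
    # Prefill token 0: ids must match exactly, logprob within atol.
    assert baseline[0].token_id == compiled[0].token_id
    assert abs(baseline[0].logprob - compiled[0].logprob) <= atol
    # Decode tokens 1-2: tolerate argmax drift via a top-20 cross-lookup.
    for b, c in zip(baseline[1:3], compiled[1:3]):
        if b.token_id == c.token_id:
            assert abs(b.logprob - c.logprob) <= atol
        else:
            assert abs(b.logprob - c.top_logprobs[b.token_id]) <= decode_atol
```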

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-23 09:08:21 +08:00
Qi Mao
9bf9b4b267 [Feature] Optimize Qwen3.5/Qwen3Next GDN prefill by prebuilding chunk metadata (#7487)
### What this PR does / why we need it?
This PR optimizes the Qwen3.5 and Qwen3Next GDN prefill path on Ascend
by reducing host/device synchronization overhead.

The current implementation of the `chunk_gated_delta_rule` path for
variable-length sequences prepares chunk metadata during the forward
pass. This approach triggers frequent CPU intervention and host/device
round-trips. When running prefill-heavy workloads with asynchronous
scheduling enabled, these synchronizations result in execution "bubbles"
and prefill stalling (stuttering). **Note that this does not cause
asynchronous scheduling to fail; rather, it prevents the system from
reaching its theoretical throughput due to these unnecessary stalls.**

To resolve this, the patch moves metadata preparation out of the hot
path:
- **Prebuilt Metadata:** All non-speculative varlen chunk metadata for
GDN is now prebuilt on the CPU.
- **Asynchronous Transfer:** Staging buffers are kept in pinned memory
and transferred to the NPU asynchronously.
- **Integration:** The prebuilt bundle is attached to GDN attention
metadata via `patch_gdn_attn.py` and passed into Triton wrappers.
- **Backward Compatibility:** Triton wrappers fall back to the legacy
preparation path if no prebuilt metadata is provided.
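
A minimal illustration of the staging scheme (sizes and device string are placeholders):

```python
import torch

max_chunks, device = 1024, "npu"
# Prebuild chunk metadata on the CPU into a pinned staging buffer ...
chunk_meta_cpu = torch.zeros(max_chunks, 4, dtype=torch.int32, pin_memory=True)
# ... then transfer it asynchronously so the GDN prefill hot path never blocks on the host.
chunk_meta_dev = chunk_meta_cpu.to(device, non_blocking=True)
```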

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
2026-03-22 23:09:23 +08:00
Qi Mao
9d0b7c8e98 [Platform][BugFix] Preserve hybrid block size on Ascend (#7528)
### What this PR does / why we need it?
This PR fixes a startup regression for Ascend hybrid attention + mamba
models after upgrading to vLLM `0.18.0`: worker initialization now calls
the generic platform hook
`current_platform.update_block_size_for_backend(vllm_config)`, which can
override the block size already computed by the hybrid model-specific
config logic.

### How this PR fixes it

This PR keeps the fix strictly inside `vllm-ascend`.

It adds an Ascend override for
`NPUPlatform.update_block_size_for_backend()`:

- for hybrid models, do not run the generic upstream block-size fallback
- preserve the block size that was already computed by the hybrid
model-specific config logic
- for non-hybrid models, keep the original upstream behavior unchanged
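
A minimal sketch of the override described above (the hybrid-model check and hook wiring are assumptions, not the merged code):

```python
def update_block_size_for_backend(vllm_config, upstream_impl) -> None:
    # Hypothetical predicate: hybrid attention + mamba models already computed
    # their own block size in the model-specific config logic.
    if getattr(vllm_config.model_config, "is_hybrid", False):
        return  # preserve the previously computed block size
    # Non-hybrid models keep the generic upstream behavior.
    upstream_impl(vllm_config)
```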

- vLLM version: v0.18.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-22 11:21:49 +08:00
Zetong Li
84a74f0cb1 [Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348)
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-21 16:57:22 +08:00
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
HongtaoYang
80a4265717 [Feat] Support separate attention backend for target and draft model. (#7342)
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.

It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
2026-03-21 10:48:01 +08:00
linfeng-yuan
88d03a783f [refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024)
### What this PR does / why we need it?
Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business
`**kwargs` with typed request objects and explicit stage boundaries.

- Prepare, dispatch, MLP, and quant stages now have clearer ownership.
- Main MoE path no longer depends on business `kwargs.get(...)` lookups.
- Comm and dispatcher interfaces are request-only on the main path.
- UTs can assert stage-level fields directly instead of inferring
behavior indirectly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-20 23:23:57 +08:00
LI SHENGYONG
4e6dbe0956 [EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/37136 breaks EPLB because
it filters out redundant experts.
PR https://github.com/vllm-project/vllm/pull/37322 fixes it by using
`parallel_config.enable_eplb` to determine whether to skip the
weight-loading filter.
But in vllm-ascend, `parallel_config.enable_eplb` is always false, so
when we use EPLB we temporarily set it to true (see the sketch below).
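
A minimal sketch of the workaround (the Ascend-side EPLB flag name is an assumption):

```python
def maybe_enable_eplb(vllm_config, ascend_config) -> None:
    # Mirror the Ascend EPLB switch into parallel_config so vLLM's weight loader
    # (see vllm-project/vllm#37322) does not filter out redundant experts.
    if getattr(ascend_config, "dynamic_eplb", False):  # hypothetical attribute name
        vllm_config.parallel_config.enable_eplb = True
```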

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00)
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 15:22:55 +08:00
LI SHENGYONG
1e05c4908f [EPLB] Reduce the memory used for batch_isend_irecv (#7344)
### What this PR does / why we need it?

#6729 seems to reduce the NPU memory usage of eplb, but actually moves
the buffer allocation of dist.all_gather_into_tensor to
dist.batch_isend_irecv. Therefore, the overall NPU memory usage is not
reduced. This PR completely reduces the memory usage in this part.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Remaining memory of each rank before the repair.
<img width="649" height="99" alt="image"
src="https://github.com/user-attachments/assets/52a67592-e0e8-4f9a-b194-b84cb848c598"
/>

Remaining memory of each rank after the repair.
<img width="641" height="99" alt="image"
src="https://github.com/user-attachments/assets/0bc2e67c-f328-4dea-98af-d7a459fb4876"
/>

With EPLB disabled.
<img width="543" height="45" alt="image"
src="https://github.com/user-attachments/assets/6dcba19d-4401-44b8-a6d3-c7b35ee983c7"
/>

Memory of weights for each rank.
<img width="648" height="46" alt="image"
src="https://github.com/user-attachments/assets/4db2fd04-98a0-4d26-a026-2e8287102b99"
/>

Estimated memory for EPLB: 15.68  / 48 (layer_num) + 2 * 0.02 = 0.35 GB


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 12:25:58 +08:00
wangyu
7be66cec75 [Test] Add the always_check_nodes parameter to the _wait_for_multiple_servers function in conftest.py for the EPD test case. (#7410)
### What this PR does / why we need it?
This PR add the always_check_nodes parameter to the
_wait_for_multiple_servers function in conftest.py for the EPD test
case.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
1. By running the test:
`pytest -sv test_disaggregated_encoder.py`

2. By running CI.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: yenuo26 <410167048@qq.com>
2026-03-20 11:33:48 +08:00
ichaoren
9d1452c74d [OPS]add split_qkv_tp_rmsnorm_rope ops (#7376)
### What this PR does / why we need it?
This PR introduces a new fused Triton kernel,
`split_qkv_tp_rmsnorm_rope` for Minimax-m2.5.

The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and Neox-style RoPE.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```python
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16
Baseline

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 267.55   | 25.5     | 38.85  |
| 4k/1k@bs4  | 542.4    | 26.51    | 148.06 |

With this PR

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 234.64   | 20.96    | 47.24  |
| 4k/1k@bs4  | 508.36   | 22.16    | 176.69 |


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>
2026-03-19 17:19:18 +08:00
Nengjun Ma
ee804ce23e Main2main upgrade vllm to 0318 commit (#7412)
### What this PR does / why we need it?
Upgrade the vLLM commit to 2026-03-18.

Main content: added a pre-step to some test cases that cleans up NPU
memory and waits (default max 50 s) for the cleanup to complete, fixing
failures caused by previous test cases not releasing NPU memory in time.
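
A rough sketch of such a pre-step (the real helper lives in the test suite and its exact API may differ; `torch.npu` is provided by `torch_npu`):

```python
import time

import torch
import torch_npu  # noqa: F401  (registers the torch.npu backend)

def wait_for_npu_memory_release(max_used_bytes: int, timeout_s: float = 50.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        torch.npu.empty_cache()
        if torch.npu.memory_allocated() <= max_used_bytes:
            return True
        time.sleep(1.0)
    return False
```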

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-03-19 17:17:36 +08:00
ZT-AIA
05afc7f8c3 [CI]repair for ci custom ops (#7461)
### What this PR does / why we need it?
NPU resources are not released immediately after custom operator test
cases run, which causes errors when other operator test cases are
executed.

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2026-03-19 17:13:12 +08:00
Li Wang
83a4065b4b [CI] Add pre-commit check for patch logger (#7446)
### What this PR does / why we need it?
See https://github.com/vllm-project/vllm-ascend/pull/7402; the
pre-commit hook forbids `init_logger(__name__)` in vllm_ascend patch
modules.

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-19 16:53:20 +08:00
aipaes
87d6424b2e [CI] Add nightly CI test cases for the GLM-4.7 model. (#7391)
### What this PR does / why we need it?
Add acc nightly CI test cases for the GLM-4.7 model.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
through CI

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-19 16:43:29 +08:00