Commit Graph

2455 Commits

Author SHA1 Message Date
SILONG ZENG
62ea664aa7 [Lint]Style: Convert test/ to ruff format(Batch #5) (#6747)
### What this PR does / why we need it?
| File Path |
| :--- |
| `tests/e2e/singlecard/compile/backend.py` |
| `tests/e2e/singlecard/compile/test_graphex_norm_quant_fusion.py` |
| `tests/e2e/singlecard/compile/test_graphex_qknorm_rope_fusion.py` |
| `tests/e2e/singlecard/compile/test_norm_quant_fusion.py` |
| `tests/e2e/singlecard/model_runner_v2/test_basic.py` |
| `tests/e2e/singlecard/test_aclgraph_accuracy.py` |
| `tests/e2e/singlecard/test_aclgraph_batch_invariant.py` |
| `tests/e2e/singlecard/test_aclgraph_mem.py` |
| `tests/e2e/singlecard/test_async_scheduling.py` |
| `tests/e2e/singlecard/test_auto_fit_max_mode_len.py` |
| `tests/e2e/singlecard/test_batch_invariant.py` |
| `tests/e2e/singlecard/test_camem.py` |
| `tests/e2e/singlecard/test_completion_with_prompt_embeds.py` |
| `tests/e2e/singlecard/test_cpu_offloading.py` |
| `tests/e2e/singlecard/test_guided_decoding.py` |
| `tests/e2e/singlecard/test_ilama_lora.py` |
| `tests/e2e/singlecard/test_llama32_lora.py` |
| `tests/e2e/singlecard/test_models.py` |
| `tests/e2e/singlecard/test_multistream_overlap_shared_expert.py` |
| `tests/e2e/singlecard/test_quantization.py` |
| `tests/e2e/singlecard/test_qwen3_multi_loras.py` |
| `tests/e2e/singlecard/test_sampler.py` |
| `tests/e2e/singlecard/test_vlm.py` |
| `tests/e2e/singlecard/test_xlite.py` |
| `tests/e2e/singlecard/utils.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-24 15:50:00 +08:00
xleoken
747484cb64 [Bugfix] Fix wrong computed_tokens when meet exception. (#6522)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

Fix wrong computed_tokens when meet exception. This pull request
addresses a bug in the KV transfer mechanism where an exception during
token lookup operations could lead to an incorrect count of
computed_tokens. By modifying the exception handling in both the lookup
and lookup_scheduler functions to return 0 instead of the start index,
the system now correctly indicates that no tokens were successfully
processed when a remote connection failure occurs. This enhancement
improves the robustness and accuracy of token management within the
vllm_ascend distributed KV pool.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

NO.

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: xleoken <xleoken@163.com>
2026-02-24 15:29:30 +08:00
LI SHENGYONG
ff29e029de [EPLB][Bugfix] Bugfix for ineffective dynamic eplb (#6653)
### What this PR does / why we need it?
#6043 deleted the forward_before phase of the dynamic eplb. Currently,
the end-to-end precision is monitored in the UT, and the log is not
printed in the key place. As a result, the eplb does not take effect and
is not intercepted.
1. The forward_before function is added back.
2. Delete unnecessary logs and add key logs.
3. Warm-up of algorithm 3 is added.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


![Snipaste_2026-02-10_15-57-31](https://github.com/user-attachments/assets/03813e5f-3d19-42d8-8118-76223afe8298)

#### The conversation is normal.
Okay, the user is asking, \"What is deep learning?\" I need to explain
this in a clear and concise way. Let me start by recalling what I know
about deep learning. It's a subset of machine learning, right? So first,
I should mention that it's part of machine learning, which itself is a
branch of AI. Then, the key aspect of deep learning is the use of neural
networks with multiple layers. These are called deep neural
networks.\n\nWait, I should define neural networks first. Maybe start
with the basics. A neural network is inspired by the human brain, with
layers of nodes (neurons) that process data. But deep learning
specifically refers to networks with many layers—hence \"deep.\" So the
term \"deep\" comes from the number of layers. \n\nI should explain how
deep learning works. It involves training these networks on large
datasets, allowing them to automatically learn features from the data.
Unlike traditional machine learning, where you might have to manually
extract features, deep learning models can do this automatically. That's
a key point. For example, in image recognition, a deep learning model
can learn to detect edges, shapes, and then more complex patterns
without human intervention.\n\nApplications are important too. The user
might want to know where deep learning is used. Common examples include
image and speech recognition, natural language processing, autonomous
vehicles, and recommendation systems. Maybe mention specific
technologies like self-driving cars using computer vision or virtual
assistants like Siri or Alexa

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-24 14:43:04 +08:00
luomin2005
f41eeeb11e Refactor the ops PyTorch adapter,cleanup for csrc/torch_binding.cpp (#6732)
### What this PR does / why we need it?
Refactor the ops PyTorch adapter,cleanup for csrc/torch_binding.cpp,
more details see
https://github.com/vllm-project/vllm-ascend/issues/6486

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
install the new package to test the new modification, here is the
result:


- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-02-24 09:12:43 +08:00
Nengjun Ma
f0caeeadcb [CI] unlock when load model (#6771)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-14 18:54:04 +08:00
yydyzr
70e26551cf [Doc] modify glm doc (#6770)
### What this PR does / why we need it?
1. add description of another version of glm5-w4a8 weight 
2. update the introduction of  installation 
3. introduce a script to enable bf16 MTP
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
N/A
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>
2026-02-14 16:47:23 +08:00
SILONG ZENG
e2237819a9 [CI]Fixed the spell check function in typos.toml (#6753)
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` actually
ignores all words containing any of the following characters: `u, e, 4,
m, 3, |`

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-14 11:57:26 +08:00
JIACHENG XU
64aea60f2e [EPLB][Nightly] Refactor UT (#6543)
### What this PR does / why we need it?
The basic configs are extracted and reused for eplb UT. This is done so
that if the basic configs are changed later, eplb UT does not need to be
modified repeatedly.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: bigsir007 <xujiacheng12@huawei.com>
Co-authored-by: bigsir007 <xujiacheng12@huawei.com>
2026-02-14 10:56:29 +08:00
xulei
1e77077788 [Bugfix][DispatchFFNCombine] resolve vec error caused by unaligned UB access (#6707)
### What this PR does / why we need it?
1. Fix a vec error caused by unaligned UB accesss in the
DispatchFFNCombine;
2. Fix expert_token_nums tensor defined on host instead of NPU in
moe_comm_method.py
3. Fix multi-core copy issue of expert_token_nums in dispatchffnCombine
op (single aiv copy is sufficient)

### Does this PR introduce _any_ user-facing change?

No, this PR does not introduce any user-facing changes. The fix only
addresses internal memory access logic and does not modify any public
APIs, interfaces, or user-visible behaviors.

### How was this patch tested?

`export VLLM_ASCEND_ENABLE_FUSED_MC2=1`

vLLM version: v0.15.0

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
2026-02-14 10:32:50 +08:00
whx
e2175d9c7e [Lint] Adapt lint tools for windows (#6727)
### What this PR does / why we need it?
If users run bash format.sh with `git bash` on windows system, there
exists `Executable /bin/bash not found` error. This is because in
Windows Git Bash environment, the Bash executable is actually located at
`/usr/bin/bash`, while the `/bin` directory may not exist, or may be
just an empty directory or a broken symlink that does not contain bash.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
With this PR and `pre-commit` installed, windows coders can directly run
`bash format.sh` to clean lint issues.

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-02-13 15:53:16 +08:00
Cao Yi
6de207de88 [main][Docs] Fix typos across documentation (#6728)
## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-13 15:50:05 +08:00
Shaoxu Cheng
b6bc3d2f9d [Feat.][310P]: weightNZ feature with quant or unquant. (#6705)
NZ Format Support for Linear Layers: Implemented support for the NZ
(N-dimensional Z-order) format for linear layer weights on Ascend 310P,
enhancing performance for both quantized and unquantized layers.
Unquantized Linear Method for Ascend 310P: Introduced
AscendUnquantizedLinearMethod310 to specifically handle and apply NZ
format casting to unquantized linear layer weights during the loading
process.
MRotaryEmbedding Integration: Extended Rotary Embedding support by
adding AscendMRotaryEmbedding310 to provide an Ascend-specific
implementation for MRotaryEmbedding.
Quantization Method Updates: Updated the w8a8_static quantization method
to directly transpose weights and apply NZ format casting, ensuring
consistency with the new format.
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-13 15:41:02 +08:00
Shaoxu Cheng
f40256b697 [Feat.][310P] addrmsnorm for 300I DUO (#6704)
### What this PR does / why we need it?
This PR integrates the `npu_add_rms_norm` fused kernel for RMSNorm
operations with residual connections on 310P devices. This change
optimizes the computation by replacing a two-step process (manual
residual addition followed by RMSNorm) with a single, more efficient
fused operation. This is needed to improve the performance of models
utilizing RMSNorm with residual connections on the 310P architecture.

Fixes #

### Does this PR introduce _any_ user-facing change?
No, this PR introduces an internal optimization and does not change any
user-facing APIs or behaviors.

### How was this patch tested?
This patch was tested with updated unit tests
(`test_RMSNorm_forward_310p`) that mock the `npu_add_rms_norm` operation
to verify the correctness of the fused kernel integration.

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-13 15:40:49 +08:00
Icey
7164990904 [Graph][Fusion] Integrating inductor pass and npugraph ex pass (#6354)
### What this PR does / why we need it?
Integrating inductor pass and npugraph ex pass, see RFC:
https://github.com/vllm-project/vllm-ascend/issues/6347

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
all tests passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-13 15:34:55 +08:00
iiiklw
87a0b7b7c7 [bugfix] adapt bugfix for norm_quant_fusion_pass to npugraph_ex (#6726)
### What this PR does / why we need it?

This PR adapts bugfixes from `norm_quant_fusion_pass` to
`graphex_norm_quant_fusion_pass` for the `npugraph_ex` backend.

The main changes are:
- Replaced `torch.ops.npu.npu_add_rms_norm` with
`torch.ops._C_ascend.npu_add_rms_norm_bias`.
- For patterns without bias, `None` is passed as the bias argument.
- For patterns with bias, the separate `add` operation for bias is
removed and the bias is passed directly to `npu_add_rms_norm_bias`. This
improves fusion.

These changes ensure consistency and correctness for RMSNorm and
quantization fusion patterns when using `npugraph_ex`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
2026-02-13 10:10:39 +08:00
taoyao1221
41d056f947 [doc] add A2 series doc for GLM5.md (#6717)
### What this PR does / why we need it?
Added support for A2 in the GLM-5 doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

vLLM version: v0.15.0
vLLM main:
9562912cea

- vLLM version: v0.15.0
- vLLM main:
9562912cea
2026-02-12 16:08:17 +08:00
wangxiaoteng888
b881fab416 [P/D][PCP] mooncake layerwise support pcp function (#6627)
### What this PR does / why we need it?
mooncake layerwise support pcp function
PCP (Prefill Context Parallelism) Support: Introduced explicit support
for Prefill Context Parallelism (PCP) and Decode Context Parallelism
(DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing
for more granular control and awareness of parallel configurations
during data transfer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2026-02-12 11:02:25 +08:00
yejj
8b23554741 [Misc] gen kv events in ascendconnector (#6593)
### What this PR does / why we need it?
refer to https://github.com/vllm-project/vllm-ascend/issues/6391,
Currently adapted the complete process of event publishing in vllm:
* `kv_connector_model_runner_mixin` invoke kv-connector
`get_kv_connector_kv_cache_events` func to collect kvevents
* in `scheduler.py` , it's `update_from_output` func will invoke
`_update_from_kv_xfer_finished` which invoke
`connector.update_connector_output` to collect kv-events from all
kv-worker, and then scheduler will invoke `connector.take_events` api to
collect all kv-events and add it to the events which from
`kv_cache_manager`

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
You can add `--kv-events-config` parameter to the `vllm server` command
to enable this feature.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: yejj710 <abyss1999@163.com>
Co-authored-by: fems14 <1804143737@qq.com>
2026-02-12 11:01:09 +08:00
jiahao.quan
7221045777 [Attention] add gpt-oss support (#5901)
### What this PR does / why we need it?
Please refer to the following link for the historical conversation
https://github.com/vllm-project/vllm-ascend/pull/4467. We have made
updates in light of the comments from the prior PR review. Given the
refactoring of the attention_v1 component, we have carried out necessary
adjustments to fit the newly revised code.

### Does this PR introduce _any_ user-facing change?

1. Modified the code in the Attention section to adapt to the SWA and
Sink features required by gpt-oss.
2. Modified the code in the MoE section to add support for bias and
swigluoai.

### How was this patch tested?
Please refer to the
https://github.com/vllm-project/vllm-ascend/pull/4467 for performance
tests, on the basis of which the accuracy tests from AIME2024 have been
newly added.

![img_v3_02tu_501e88e3-2217-4565-8edf-b9acf4f43f2g](https://github.com/user-attachments/assets/024f8283-18ab-4d4d-ab12-27917b5d7d06)


- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: mikequan0425 <mikequan0425@foxmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: pu-zhe <zpuaa@outlook.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leon_tao <taoyao2@huawei.com>
Co-authored-by: nurxat <738457498@qq.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: mikequan <199741451@qq.com>
Co-authored-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Co-authored-by: pu-zhe <zpuaa@outlook.com>
Co-authored-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Cao Yi <slightwindsec@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: SILONG ZENG <2609716663@qq.com>
2026-02-12 10:55:34 +08:00
lih827
f71812011d [Feature] DispatchGmmCombineDecode support bf16/float16 gmm1/gmm2 weight and support gmm weight with ND format (#6393)
### What this PR does / why we need it?
1. support ND format gmm weight input.
Before this pr, gmm1_weight and gmm2_weight could only be passed as
input to the DispatchGmmCombineDecode operator in NZ data format. After
the modification, they are allowed to be passed in ND data format.
2. support bf16/float16 gmm weight
The current PR modification enables the DispatchGmmCombineDecode
operator to support non-W8A8 scenarios, allowing gmm1_weight and
gmm2_weight to be passed as float16/bfloat16 which is correspond with
input token data type.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: lih827 <383084552@qq.com>
2026-02-12 10:37:41 +08:00
Ronald
f1ffb5fb19 [Feature] adapt to uva buffer and main2main (#6657)
### What this PR does / why we need it?
vllm model runner v2 use uva buffer to prepare input data, but npu
doesn't support uva yet, this pr implement a uvawrapper class to mimic
gpu's uva backend. what's more, this pr make some modifications to adapt
to the newer main branch.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM main:
13397841ab

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-02-12 10:36:31 +08:00
ZYang6263
56269eae0e [BugFix] Fix AddRMSNormQuant not taking effect (#6620)
### What this PR does / why we need it?
Fix the issue where, in graph mode, the fused `AddRMSNormQuant` operator
does not take effect when there is no bias.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2026-02-12 09:26:05 +08:00
Canlin Guo
052cc4e61b [Docs] Fix GLM-5 deploy command (#6711)
This pull request refines the GLM-5 deployment documentation by updating
the Docker run command to include a more comprehensive set of device
mappings and by removing an extraneous quantization flag from the `vllm
serve` commands. These changes aim to correct and clarify the deployment
instructions, ensuring users can successfully set up and run the GLM-5
model as intended.


- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: Canlin Guo <961750412@qq.com>
2026-02-12 08:55:48 +08:00
iiiklw
a0315f6697 [npugraph_ex]enable npugraph_ex by default (#6664)
### What this PR does / why we need it?

This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).


### Does this PR introduce _any_ user-facing change?

Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.

### How was this patch tested?

CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.

Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
2026-02-12 08:44:06 +08:00
rika
b86ea66b0a [doc]add GLM5.md (#6709)
### What this PR does / why we need it?
Add GLM5 doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: nakairika <982275964@qq.com>
2026-02-12 04:00:40 +08:00
yydyzr
ff3a50d011 [Model] GLM5 adaptation (#6642)
### What this PR does / why we need it?
GLM5 adaptation
1. use torch_npu.npu_lightning_indexer for GLM5
2. forbid eagle proposer when fullgraph mode is enabled because of bugs
3. add quatization config for GLM5
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM main:
978a37c823

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-11 22:22:22 +08:00
Zetong Li
140fcaffc3 [Bugfix] Update target probs to target logits in rejection sample (#6685)
### What this PR does / why we need it?
This PR aims to update `target_probs` to `target_logits` in
`rejection_sample`, for catching up with
https://github.com/vllm-project/vllm/pull/32852. Otherwise, sampling
with temperature will incur accuracy problem where tokens can be
accepted or rejected unreasonably.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: Zetong Li <slippersss@126.com>
2026-02-11 21:31:40 +08:00
Angazenn
c0c2eb614e [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450)
### What this PR does / why we need it?

This PR extends original `rope_triton_forward` and
`split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as
inputs. This fully aligns to vLLM RoPE api interface. Compared with
earlier implementation for RoPE, the benefits are:

1. avoiding pre-computation of `cos` `sin` before model execution, which
helps to remove redundant codes.
2. allowing eagle3 draft model to have different rope parameters with
main model (see #6612 ). This help to recover accept rate && accuracy in
that case.

In addition, this kernel change only introduces very small performance
degradation. Those `index_select` or `chunk` operations are now changed
into simple memory access in triton kernel (For example,
https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).

**Highlights**

- **RoPE Cache Unification**: Replaced separate _sin and _cos global
tensors with a unified cos_sin_cache and explicit positions tensor for
Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated Triton kernels
(split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the
cos_sin_cache and positions for more efficient and integrated RoPE
calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a
new custom operation, allowing its use in fused compilation passes and
providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified the rope_forward_oot
function to accept the new cos_sin_cache and positions arguments,
enabling a more flexible and integrated RoPE application within the
system.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Additional test on Qwen3-235b accuracy:

| Aime2024 | GSM8K | Livecodebench |
| -------- | -------- | -------- |
| 83.33 | 96.26 | 70.23 |

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-02-11 21:20:53 +08:00
SILONG ZENG
6bc44bf49b [CI]fix nightly multi node test error for wait for pod ready (#6675)
### What this PR does / why we need it?
Fixes the issue where nightly multi-node tests hang during the "wait for
pod ready" stage due to strict shell mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-11 18:11:00 +08:00
Icey
88773bb101 [main to main] upgrade main 0210 (#6673)
### What this PR does / why we need it?
upgrade vllm commit to `9562912cead1f11e8540fb91306c5cbda66f0007`

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
all tests passed

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-11 18:10:14 +08:00
Cao Yi
53b494b1e4 [main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (#6648)
## Summary
- Remove unused `set_rotation_config` and `apply_rotation` methods from
`AscendW4A4LaosDynamicLinearMethod`
- Remove unused `rotation_type` field and associated conditional
quantization parameters (`heads_rotation`, `kronecker_rotation_n`,
`kronecker_rotation_m`)

These rotation-related functions and parameters are never called in the
current W4A4 LAOS dynamic quantization workflow.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-11 16:38:45 +08:00
whx
bb73478c00 [Test][BugFix] Fix torch.rand usage in triton penalty test (#6680)
### What this PR does / why we need it?
This PR fixes a `TypeError` in
`tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py`
that was causing nightly test failures. The `torch.rand()` function was
being called with the `device` string as a positional argument, which is
incorrect. This has been corrected to use the `device` keyword argument.

Fixes #

### Does this PR introduce _any_ user-facing change?
No, this change only affects a test file.

### How was this patch tested?
CI is expected to pass with this fix.

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-02-11 16:31:49 +08:00
luomin2005
0c1cfa2bac Add Worker Interface:check_health (#6681)
This pull request introduces a new capability to monitor the health of
NPU cards directly from the Worker class. This enhancement allows for
proactive detection of NPU issues by executing the npu-smi command,
improving system reliability and operational visibility within the
vllm_ascend worker environment.

- vLLM version: v0.15.0
- vLLM main:
13397841ab

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-02-11 15:24:48 +08:00
pu-zhe
02886e2641 [Feat] 310p support MoE W8A8 quantizaition (#6641)
### What this PR does / why we need it?
This PR introduces support for W8A8 dynamic quantization for
Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved
by:
- Implementing a new quantization scheme
`AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P
that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly
route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8
quantization.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8
quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components,
including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`,
and the new `w8a8_dynamic` quantization method.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-10 17:17:44 +08:00
jiangyunfan1
1eb07986bf [TEST]add a qwen3-30b acc case with mooncake mempool (#6244)
### What this PR does / why we need it?
This PR adds a case of qwen3-30b w8a8 with mooncake mempool, we need to
test it regual
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2026-02-10 16:26:55 +08:00
LI SHENGYONG
7cf285a77a [MOE Refactor] Remove QuantType in prepare_finalize.py (#6534)
### What this PR does / why we need it?
To prevent confusion between different QuantType classes, we remove**
QuantType in prepare_finalize.py

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:59:58 +08:00
LI SHENGYONG
34eecacace [EPLB] Avoiding eplb's dependency on a specified model (#6528)
### What this PR does / why we need it?
1. Currently, eplb registers different attributes for different models,
but these attributes are not actually used. Now, these attributes are
directly deleted.
2. Add some log about eplb.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
#### Deepseek v3.1 chat
Of course! Here is a comprehensive explanation of deep learning, broken
down for clarity.\n\n### The Simple Analogy: A Child Learning to
Recognize a Cat\n\nImagine teaching a child what a cat is. You don't
give them a rulebook with instructions like \"has pointy ears, whiskers,
and a tail.\" Instead, you show them many pictures, saying \"this is a
cat\" or \"this is not a cat.\" The child's brain gradually learns to
identify the complex patterns—the combination of shapes, colors, and
textures—that define \"cat-ness.\"\n\n**Deep learning is essentially
this, but for computers.** It's a method for teaching computers to learn
from examples and recognize patterns directly from data (like images,
sound, or text) without being explicitly programmed with rigid
rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a
subfield of machine learning, which itself is a subfield of artificial
intelligence (AI).** It uses artificial **neural networks** with many
layers (\"deep\" networks) to model and understand complex patterns in
data.\n\nHere are the key concepts in that definition:\n\n1.
**Artificial Intelligence (AI):** The broad science of making machines
smart and capable of performing tasks that typically require human
intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives
computers the ability to learn from data *without* being explicitly
programmed for every single rule.\n3. **Deep Learning (DL):** A
specific, powerful

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:58:44 +08:00
wangxiyuan
7d4833bce9 [Doc][Misc] Restructure tutorial documentation (#6501)
### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into
three categories: Models, Features, and Hardware. This improves the
organization and navigation of the tutorials, making it easier for users
to find relevant information.

- The single `tutorials/index.md` is split into three separate index
files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective
new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial
sections.

This change makes the documentation structure more logical and scalable
for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial
documentation pages. Users following old links to tutorials will
encounter broken links. It is recommended to set up redirects if the
documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built
and reviewed locally to ensure all links are correct and the pages
render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-10 15:03:35 +08:00
Ronald
77305df398 implement batch invariant with ascendc (#6590)
### What this PR does / why we need it?
there are batch invariant ops implemented by triton and ascendc, this pr
aims to choose which kind of ops to be used to enable batch invariant.
#5487

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-02-10 14:15:26 +08:00
Nengjun Ma
66b60c9440 [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (#6629)
### What this PR does / why we need it?
1. [Refact] Refact MLA/SFA weight prefetch to consist with moe weight
prefetch
2. Remove duplicated o_proj weight prefetch in forward for MLA/SFA

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?

1) Performance result:
Perf test data:
*) MLA:

| | 1st test | 2nd test | Output Token Throughput(Avg) | Performance
improvement percentage |
| --- | --- | --- | --- | --- |
| o_proj duplicate prefetch | 11.9669 token/s | 12.0287 token/s |
11.9978 |
| o_proj no duplicate prefetch | 12.5594 token/s | 12.6216 token/s |
12.5905 | 4.94%| |

single layer performace improve: 5%~8%

*) SFA:

| | 1st test | 2nd test | Output Token Throughput(Avg) | Performance
improvement percentage |
| --- | --- | --- | --- | --- |
| o_proj duplicate prefetch | 13.0523 token/s | 13.1084 token/s |
13.08035 | |
| o_proj no duplicate prefetch | 13.9844 token/s | 14.1678 token/s |
14.0761 | 7.6% |

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-10 14:14:37 +08:00
wangxiyuan
2a826b5fad [Misc] upgrade to vllm main (#6646)
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.

1.
ac32e66cf9
pass file is moved.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
2026-02-10 14:08:59 +08:00
Cao Yi
1c7d1163f5 [main][Docs] Fix spelling errors across documentation (#6649)
Fix various spelling mistakes in the project documentation to improve
clarity and correctness.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-10 11:14:57 +08:00
meihanc
5b8e47cb68 [bugfix]Fix no attribute 'data' when MLAPO is enable (#6601)
### What this PR does / why we need it?

This PR fixes an `AttributeError: 'Parameter' object has no attribute
'data'` that occurs when MLAPO is enabled with vLLM v0.15.0.

The error is caused by a monkey-patch on
`MLAAttention.process_weights_after_loading` which is incompatible with
changes in vLLM v0.15.0. This is likely related to PyTorch's deprecation
of the `.data` attribute on `torch.nn.Parameter` objects.

This change makes the monkey-patch conditional, so it is not applied for
vLLM v0.15.0 and newer versions, resolving the crash.


- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-02-10 09:04:32 +08:00
DreamerLeader
905f0764e0 [DOC]Add Memcache Usage Guide (#6476)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
2026-02-09 21:55:00 +08:00
lilinsiman
9564c6bb5d [main][bugfix] Fix spec acceptance rate problem in vllm_0.15.0 (#6606)
### What this PR does / why we need it?
The speculative inference acceptance rate decreases after the vllm
version is upgraded to v0.15.0. This issue is resolved.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
UT and tests case

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-02-09 21:33:58 +08:00
yupeng
8d44ddacb0 [Test][LoRA] Add e2e test for base model inference (#6624)
### What this PR does / why we need it?

This PR adds an end-to-end test case to verify the correctness of base
model inference when LoRA is enabled. This is to ensure that after a
LoRA base model request issue was fixed, the functionality remains
correct and does not regress. The new test case calls `do_sample` with
`lora_id=0` to target the base model and asserts the output against
expected SQL queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with the new test case. The test can be run with:
```bash
pytest -sv tests/e2e/singlecard/test_llama32_lora.py

Signed-off-by: paulyu12 <507435917@qq.com>
2026-02-09 21:06:49 +08:00
Wang Kunpeng
156976b982 [refactor]Optimized the kvcache usage of Deepseek v3.2 (#6610)
### What this PR does / why we need it?

For deepseek v3.2, DSA use FullAttentionSpec, allocate 2 * mla page size
bytes, and we use half of that for k cache in DSA

However, the actual proportion of k cache is not high, which results in
a large amount of kvcache being wasted. The proportion of discarded
kvcache is (576-128)/(576 x 2) = 0.388.

Run the same script to start DeepSeek V3.2 on a single A3 server. The
following shows the comparison of kvcache usage:
Before refactoring
```
[kv_cache_utils.py:1307] GPU KV cache size: 15,872 tokens
```
After refactoring
```
[kv_cache_utils.py:1307] GPU KV cache size: 25,984 tokens
```

This pull request refactors the KV cache allocation for Deepseek v3.2
models that use sparse attention. It replaces the use of
`FullAttentionSpec` with `MLAAttentionSpec` and introduces a more
principled way of calculating KV cache tensor split factors based on
model configuration.

This change removes hardcoded values and correctly sizes the cache
tensors, leading to optimized memory usage and improved code
maintainability.

### Does this PR introduce _any_ user-facing change?

No, this is an internal optimization and does not introduce any
user-facing changes.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-02-09 18:53:56 +08:00
Qiu
cb7c419bc0 [Feat](sfa,dcp) support dcp for sfa (#6563)
### What this PR does / why we need it?
This PR adds DCP support to the SFA backend.

Please note that due to operator constraints, the current implementation
has to all-gather the entire KV cache and modify the block table to
satisfy the operator input requirements. This results in significantly
increased communication overhead and peak memory usage. Therefore, this
is only a temporary workaround and will be refactored once the operator
provides proper support.

Additionally, because of the above limitations,
`cp_kv_cache_interleave_size` is currently required to be equal to
`block_size`. This restriction will also be removed after the refactor.

#### Test
accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8

| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.35 |

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-02-09 18:52:25 +08:00
GoCHug
80e5812b39 [BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (#6581)
### What this PR does / why we need it?
Issue: If a model such as Ling-1T adopts partial rotary position
embedding (partial RoPE), but config.json uses the rotary_dim parameter
instead of partial_rotary_factor, it will trigger a RuntimeError: The
expanded size of the tensor (128) must match the existing size (64) at
non-singleton dimension 3.
<img width="1681" height="472" alt="image"
src="https://github.com/user-attachments/assets/ba03d7df-ecba-4d6f-9ec1-4dc55f59799e"
/>

This PR addresses an issue where models using partial rotary position
embedding (partial RoPE) with the `rotary_dim` parameter in
`config.json` (instead of `partial_rotary_factor`) would encounter a
`RuntimeError`.

This change adds support for the `rotary_dim` parameter in
`vllm_ascend/ops/rotary_embedding.py` to correctly calculate the
`rope_dim`, resolving the tensor size mismatch error.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The patch was tested successfully with the Ling-1T model, which
previously triggered the error.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: GoCHug <93277779+GoCHug@users.noreply.github.com>
2026-02-09 17:17:52 +08:00
lhp-deep
d060c797ed [fix bug] fix tensor mismatch bug in sigmoid operate test case (#6619)
### What this PR does / why we need it?
This PR fixes a bug in the `test_triton_fusion_ops` test case. The test
compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with
a split implementation. Both paths use a recurrent state tensor.

The bug was that the state tensor was being modified in-place by the
fused kernel call, and this modified tensor was then reused for the
split implementation path. This led to an incorrect comparison and test
failure.

This fix ensures that each path starts with an identical, clean initial
state by creating separate tensors. It also changes the state
initialization from `torch.randn` to `torch.ones` to make the test
deterministic.

### Does this PR introduce _any_ user-facing change?
No, this change only affects a test case and has no user-facing impact.
### How was this patch tested?
The fix is applied directly to the test case. The CI passing for
`test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is
working as expected.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
2026-02-09 16:43:27 +08:00