Commit Graph

535 Commits

Author SHA1 Message Date
Yizhou
39a85c49fa [Refactor] Rename cudagraph_support to aclgraph_support (#3104)
### What this PR does / why we need it?
Updates the `cudagraph_support` attribute to `aclgraph_support` to use
terminology appropriate for the Ascend platform (ACL graphs instead of
CUDA graphs).

This change also explicitly disables graph support for the MLA attention
backend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-23 11:30:31 +08:00
Mercykid-bash
29c173ab48 FlashLB algorithm (#3042)
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel balancing algorithm: FlashLB.

## Motivation
1. The default algorithm adopts a two-stage greedy strategy: 
a. Replica allotment: Determine the number of expert replicas by
minimizing the maximum load per replica (Min Max Replica, MMR).
b. Replica placement: Distribute replicas across devices by repeatedly
assigning the heaviest replica to the least loaded device (Longest
Processing Time First, LPT).

However, this sequential process lacks inter-stage collaborative
optimization, often leading to suboptimal load balancing. For example,
in the simple case shown in the figure below: given 8 logical experts
with hotness values of 600, 560, 120, 120, 20, 10, 10, 10, and 2
replicas allocated per device across 8 devices, the EPLB algorithm
yields a maximum per-device hotness of 232, while our proposed FlashLB
algorithm can reduce this value to 205.

2. The default algorithm relies on the averaged expert hotness over a
fixed time window for optimization. While this provides a coarse
approximation of the hotness distribution, it fails to capture
oscillatory deviations and temporal correlations of expert hotness
observed across iterations in real-world scenarios, limiting
optimization quality.

3. The default algorithm periodically regenerates the expert placement
table. However, it generates the table for each individual layer, and
the new table does not account for correlations with the previous one;
these two factors collectively lead to nearly full-scale expert
reassignment.
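
The two-stage greedy baseline described in item 1 can be sketched as follows. This is a minimal illustration, not the EPLB implementation: the function names `allot_replicas` and `place_replicas` are hypothetical, and the tie-breaking rules are assumptions. Run on the hotness values quoted above, the sketch arrives at the same maximum per-device hotness of 232.

```python
import heapq


def allot_replicas(hotness, total_slots):
    """MMR-style allotment (assumed form): one replica per expert, then keep
    adding a replica to the expert with the highest per-replica load."""
    counts = [1] * len(hotness)
    heap = [(-float(h), i) for i, h in enumerate(hotness)]
    heapq.heapify(heap)
    for _ in range(total_slots - len(hotness)):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-hotness[i] / counts[i], i))
    return counts


def place_replicas(hotness, counts, num_devices, slots_per_device):
    """LPT-style placement: heaviest replica first, onto the least loaded
    device that still has a free slot."""
    replicas = sorted(
        ((hotness[i] / counts[i], i)
         for i in range(len(hotness))
         for _ in range(counts[i])),
        reverse=True,
    )
    loads = [0.0] * num_devices
    placement = [[] for _ in range(num_devices)]
    for load, expert in replicas:
        free = [d for d in range(num_devices)
                if len(placement[d]) < slots_per_device]
        target = min(free, key=lambda d: loads[d])
        placement[target].append(expert)
        loads[target] += load
    return placement, loads


hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = allot_replicas(hotness, total_slots=16)
placement, loads = place_replicas(hotness, counts, num_devices=8, slots_per_device=2)
print(counts, max(loads))
```

Because the replica counts are fixed before placement starts, the second stage cannot revise the first, which is the lack of inter-stage collaboration noted above.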

## FlashLB Algorithm Principle
1. Joint Optimization
FlashLB achieves joint optimization of replica allotment and placement
through group-based decision-making. Each group gradually determines the
replica count and placement for a subset of experts, ensuring that the
expected inter-device load balance (considering both deployed and
pending expert replicas) is holistically optimized. To attain superior
load balancing, FlashLB employs tree search to expand the solution space
while integrating pruning and precompilation techniques for
acceleration, thereby delivering load balancing that is both
high-quality and practically efficient.

2. Multi-Shot Enhancement
FlashLB partitions each profiling interval (e.g., 1024 iterations) into
consecutive smaller sub-intervals (e.g., 16 iterations), each capturing
independent hotness measurements. It then performs multi-shot
optimization to co-optimize these sub-intervals simultaneously—enabling
adaptation to time-variant expert hotness while enhancing robustness.

3. Incremental Adjustment
To reduce the overhead of frequent expert re-deployment, FlashLB
introduces an incremental adjustment scheme operating at both
inter-layer and intra-layer levels:
a. Inter-Layer: Hotness variations are tracked at the layer level. Only
layers with fluctuations exceeding a predefined threshold trigger
re-computation of expert placement, avoiding unnecessary redeployment
for stable layers;
b. Intra-Layer (Optional): A lightweight incremental LPT algorithm
(LPT-Incremental) is applied. Instead of recomputing full placement for
all experts in a layer, it selectively adjusts only the hottest experts
or those with replica count changes, further reducing migration
overhead.

This incremental strategy significantly reduces adjustment costs while
maintaining balanced performance across layers and devices.
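
The inter-layer check in item 3a might look like the sketch below. The fluctuation metric (relative L1 change in per-expert hotness) and the name `layers_to_rebalance` are assumptions for illustration; the PR does not state which metric or threshold FlashLB uses.

```python
def layers_to_rebalance(prev_hotness, curr_hotness, threshold):
    """Return indices of layers whose per-expert hotness changed by more than
    `threshold`, measured as relative L1 change (an assumed metric)."""
    selected = []
    for layer, (prev, curr) in enumerate(zip(prev_hotness, curr_hotness)):
        total = sum(prev) or 1  # avoid division by zero for idle layers
        change = sum(abs(c - p) for p, c in zip(prev, curr)) / total
        if change > threshold:
            selected.append(layer)
    return selected


# Layer 0 is nearly stable; layer 1 shifts all of its load to another expert.
prev = [[100, 100, 100, 100], [400, 0, 0, 0]]
curr = [[105, 95, 100, 100], [0, 400, 0, 0]]
print(layers_to_rebalance(prev, curr, threshold=0.2))
```

Only the layers returned here would have their expert placement recomputed; stable layers keep their existing table.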

## Co-author:

Co-authored-by: Skywalker-EP 173723846@qq.com

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: zhanghw0354 <zhanghaiwencmss@139.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: rjg-lyh <83491835+rjg-lyh@users.noreply.github.com>
Co-authored-by: weichen <132029610+Pr0Wh1teGivee@users.noreply.github.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
2025-09-23 10:27:14 +08:00
hucong
8dd53c8860 [Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2174)
### What this PR does / why we need it?

This PR addresses a critical issue where failures on Node D (the decode
node) cause Node P (the prefill node) to hang because it cannot release
its KV cache.

**Trigger Scenarios:**  
1. Node D fails mid-inference (e.g., network disconnection)  
2. Node D rejects requests at a certain stage (e.g., via API server)  
3. Load-test script termination causes Node P or D to abort queued
requests

**Root Cause Analysis:**  
1. Currently, Node D sends a "KV cache pull complete, release approved"
message to Node P
2. This message is transmitted via the worker connector. If PD
connection breaks or requests are rejected upstream, Node D cannot send
the message
3. Node P will never release KV cache without receiving this message  

**Solution:**  
Following the vLLM community's approach (the NIXL connector timeout
mechanism), we implement:
- A timeout mechanism with comprehensive warnings  
- Updated README documentation  
- Reference: vLLM's optimization PR
[#20139](https://github.com/vllm-project/vllm/pull/20139)
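
A timeout-based release of this kind could be sketched as below. This is an illustration only: the class `PendingKVRelease` and its methods are hypothetical names and do not reflect the connector's actual code.

```python
import time


class PendingKVRelease:
    """Tracks requests whose KV cache is held until the consumer confirms the
    pull, and expires them after a timeout (hypothetical sketch)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.deadlines = {}  # request_id -> time after which we stop waiting

    def track(self, request_id):
        self.deadlines[request_id] = self.clock() + self.timeout_s

    def on_pull_complete(self, request_id):
        # Normal path: the "pull complete" message arrived in time.
        self.deadlines.pop(request_id, None)

    def expire(self):
        """Return the request ids whose KV cache should be force-released."""
        now = self.clock()
        expired = [r for r, d in self.deadlines.items() if d <= now]
        for r in expired:
            del self.deadlines[r]
        return expired
```

The producer would call `expire()` periodically, log a warning for each returned id, and free the corresponding blocks.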
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
None


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-09-23 09:53:34 +08:00
yupeng
704467cd9a [Bugfix][LoRA] Fix bug introduced by upstream vllm#25249 (#3095)
### What this PR does / why we need it?
Fix the LoRA breakage introduced by
https://github.com/vllm-project/vllm/pull/25249.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-09-22 22:26:01 +08:00
Yizhou
3fa7cf6345 [Refactor][Graph] Move graph parameter logic to acl_graph module (#3101)
### What this PR does / why we need it?
This is the follow-up PR of #2128 .

Moves graph parameter management components, including `GraphParams`,
`get_graph_params`, and `set_graph_params`, from the generic `utils.py`
to the more specific `compilation/acl_graph.py`.

Additionally, extracts the `update_attn_params` logic from the
`NPUModelRunner` class into a standalone function within the `acl_graph`
module.

This refactoring improves code organization by centralizing ACL
graph-related logic into its own dedicated module, enhancing modularity
and clarity.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 22:23:14 +08:00
Li Wang
02f89d166f [CI] Update vllm version to 20250922(5aeb925) (#3091)
### What this PR does / why we need it?
This PR bumps the vLLM commit hash to
5aeb925452
and fixes the issues caused by the following upstream changes:
1. https://github.com/vllm-project/vllm/pull/25345 removed the v0
metadata.
2. https://github.com/vllm-project/vllm/pull/25332
3. https://github.com/vllm-project/vllm/pull/25334
4. https://github.com/vllm-project/vllm/pull/23558. Note that this vLLM
commit updates the model registration logic to check that every
registered model lives under the `vllm.model_executor.models` path, which
breaks our custom registration of the deepseek_v3 model (it does not
exist in the vLLM model path). As a temporary workaround, the deepseek_v3
model registration is moved to deepseek_v2.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-22 22:18:13 +08:00
fems14
1c9f0fe26f Fix of DeepSeek Error in KV Pool Mixed Deployment Scenario (#3087)
### What this PR does / why we need it?
A new kv_role "kv_both" is added to run mixed deployment scenarios. The
mixed deployment will involve a decode phase, where with_prefill should
be false.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: fems14 <1804143737@qq.com>
2025-09-22 20:36:41 +08:00
weichen
37a0715eda [Refactor] Adjustments to moe_comm_method selection process (#3001)
### What this PR does / why we need it?
Fix issues mentioned in
https://github.com/vllm-project/vllm-ascend/pull/2791 and some minor
refactoring.
1. Use Enum instead of string.
2. Avoid setting a new property to forward_context in
AscendFusedMoE.forward().
3. Enable TokenDispatcherWithMoge.
4. Remove redundant code.
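
As an illustration of item 1, replacing free-form strings with an Enum turns a typo into an immediate error instead of a silent fallthrough. The enum name and members below are hypothetical and not taken from the PR.

```python
from enum import Enum


class MoECommMethod(Enum):
    """Hypothetical set of MoE communication methods."""
    ALLGATHER = "allgather"
    ALLTOALL = "alltoall"
    MC2 = "mc2"


def parse_comm_method(value):
    """Convert a user-supplied string to the enum, failing loudly on typos."""
    try:
        return MoECommMethod(value.lower())
    except ValueError:
        valid = ", ".join(m.value for m in MoECommMethod)
        raise ValueError(f"unknown moe_comm_method {value!r}; expected one of: {valid}")


print(parse_comm_method("MC2"))
```

Downstream code can then compare with `is MoECommMethod.MC2` rather than matching string literals.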

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
2. Aclgraph & eager


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-22 19:12:58 +08:00
rjg-lyh
bb1f0d5a62 [main] remove the redundant log prints in register_custom_ops.py (#3094)
### What this PR does / why we need it?
This PR removes the redundant log prints in register_custom_ops.py to
make the output cleaner.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-22 17:17:31 +08:00
Yizhou
338231acaf [Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 17:14:28 +08:00
Mengqing Cao
f39bd309b6 [Hybrid KV] Follow up UniformTypeKVCacheSpecs (#3070)
### What this PR does / why we need it?
Follow up on the `UniformTypeKVCacheSpecs` changes introduced by
https://github.com/vllm-project/vllm/pull/25101, which supports different
hidden sizes in uniform-type KV cache specs.

This also fixes the CI issue `TypeError: AttentionGroup.__init__()
missing 1 required positional argument: 'kv_cache_spec'`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tests passed with existing e2e tests.

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-22 15:02:41 +08:00
tianyitang
f1f2c8f5e5 [Perf] Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems (#2962)
### What this PR does / why we need it?
Add the new npu_fused_infer_attention_score op to improve performance in
splitfuse cases and resolve long-seq mask problems.

1. The original op's performance is suboptimal in certain scenarios,
necessitating optimization through the _new op_
(npu_fused_infer_attention_score).
2. For ultra-long sequences (128k), the original operator will allocate
a large attn_mask, which consumes excessive CPU memory. In contrast, the
_new op_ supports a fixed-size compressed mask, effectively resolving
this issue.
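
A rough calculation shows why the full mask is a problem at 128k. The figures below rest on two assumptions that the PR does not state: one byte per mask element, and a 2048 x 2048 compressed mask.

```python
def mask_bytes(rows, cols, bytes_per_element=1):
    """Memory needed for a dense rows x cols attention mask."""
    return rows * cols * bytes_per_element


seq_len = 128 * 1024                      # "128k" tokens
full = mask_bytes(seq_len, seq_len)       # dense seq_len x seq_len mask
compressed = mask_bytes(2048, 2048)       # assumed fixed-size compressed mask

print(f"full mask:       {full / 2**30:.0f} GiB")
print(f"compressed mask: {compressed / 2**20:.0f} MiB")
```

The dense mask grows quadratically with sequence length, while a fixed-size mask stays constant regardless of it.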

NOTE1: The current PR retains the original logic and uses a version
check of the CANN package to determine whether the _new op_ can be
enabled. This ensures no impact on existing users. In future versions,
this version check and the original logic will be deprecated, and the
_new op_ scheduling will be uniformly adopted.
NOTE2: This PR relies on a future CANN version, which is not yet
available.
NOTE3: To enable the new op in chunked prefill, the parameter
additional_config should be set like `--additional-config
'{"ascend_scheduler_config":
{"enabled":true,"enable_chunked_prefill":true}}' \` at least.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed




- vLLM version: v0.10.2
- vLLM main:
6c5f82e5aa

---------

Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
2025-09-22 14:56:14 +08:00
linfeng-yuan
ffdd1a36e2 [bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068)
### What this PR does / why we need it?
When running quantized deepseek models with an unquantized MTP layer,
free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group argument.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We ran vllm online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free_memory increased without the redundant
VRAM buffer for the HCCL communication op (all_to_all_single).

- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-22 14:06:43 +08:00
Icey
14b39d3c70 [1/N][Refactor][Qwen3-Next] remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention (#3019)
### What this PR does / why we need it?
remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        # model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-30B-A3B",
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```

- vLLM version: v0.10.2
- vLLM main:
9d1c50a5ac

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-22 11:24:08 +08:00
Yikun Jiang
b8b68b3dfe [CI] Upgrade vLLM to 20250920 (c60e613) and address config break (#3067)
### What this PR does / why we need it?
Bump main to
c60e6137f0

- Updated imports from `vllm.config` to
`vllm.config.model` (aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 09:49:17 +08:00
Li Wang
12bcbd02bb [CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907)
### What this PR does / why we need it?
1. Bump the vLLM commit to
6d8246aaff
2. Adapt to the upstream multi-modal kwargs changes in
https://github.com/vllm-project/vllm/pull/24548, keeping compatibility
with both vLLM main and `v0.10.2`
3. fix metadata_builder changes introduced by
https://github.com/vllm-project/vllm/pull/23693
4. fix `structured_outputs_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22772
5. fix `moe_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-20 17:37:57 +08:00
Lucas Kabela
53ecd89e8f [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE (#2969)
### What this PR does / why we need it?
This PR prepares for deleting the environment variable
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vLLM requires `fullgraph=True`
to run.

- Fixes https://github.com/vllm-project/vllm/issues/21834

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
See CI 

- vLLM version: v0.10.2
- vLLM main:
99cc41ad50

---------

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2025-09-20 08:22:30 +08:00
zhangxinyuehfad
a22b532d38 [Fixbug] Fix shape not match when sliding_window and dynamic batch_size (#2830)
### What this PR does / why we need it?
Fix a shape mismatch when testing the accuracy of LLM-Research/Phi-4-mini-instruct.

### Does this PR introduce _any_ user-facing change?

Previously, users could not set a dynamic batch_size or use lm_eval to
test accuracy with sliding_window models.

### How was this patch tested?
The accuracy of LLM-Research/Phi-4-mini-instruct is OK:
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```


- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-19 22:35:14 +08:00
22dimensions
0942d9aaab [3/N][Refactor][Quantization]remove packed_modules_mapping from models (#3021)
### What this PR does / why we need it?

Some custom models in vllm-ascend define packed_modules_mapping, which
prevents them from sharing the same model class as the vLLM community.
This PR moves these custom packed_modules_mapping definitions to quant
utils.py. After this PR, some custom models can be removed.

### Does this PR introduce _any_ user-facing change?

tested by CI

### How was this patch tested?

tested by CI

- vLLM version: v0.10.2
- vLLM main:
5089fd749c

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-19 20:50:14 +08:00
Shanshan Shen
8326f15ecf [CustomOp] Register AscendSharedFusedMoE custom op (#2980)
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DeepSeek-V2-Lite` is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```


- vLLM version: v0.10.2
- vLLM main:
486c5599e3

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-09-19 19:05:01 +08:00
sdmyzlp
05a700d370 [Bugfix] Fix async copy bug under single expert scenario (#3005)
Add the missing barrier for the case where no implicit synchronization by
`repeat_interleave` is available. Otherwise, the `non_blocking=True` copy
of `output_splits` and `input_splits` from the NPU may fail to complete
before the later `async_all_to_all` uses them.

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-09-19 14:05:36 +08:00
xuyexiong
2a87b4cecb [Bugfix] Fix specdecoding in chunkedprefill scenario (#3025)
### What this PR does / why we need it?

The speculative decode phase of chunkedprefill took an incorrect path;
speculative decoding should always use the TND layout.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-19 14:05:08 +08:00
Song Zhixin
833cd1b698 [BugFix] Async scheduling and PP compatibility with DP (#2796)
### What this PR does / why we need it?
Based on https://github.com/vllm-project/vllm/pull/23770, this PR fixes
async scheduling and PP compatibility with DP. It also fixes an issue
where finished requests were not processed in async scheduling and PP
cases, as well as possible worker race conditions.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
544fe76b95

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-09-19 11:29:50 +08:00
whx
0a526768f5 [Feature] Support moe multi-stream for aclgraph. (#2946)
This PR puts the calculation of shared experts into a separate stream,
overlapping it with the routed experts.

- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-19 11:06:45 +08:00
Mengqing Cao
367edff5af [HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007)
### What this PR does / why we need it?
This PR fixes a few issues with prefill disaggregation:
1. Fix the prefill disaggregation kvcache address alignment issue:
llmdatadist needs tensor addresses to be aligned to 2M.
2. Fix the prefill disaggregation kvcache shape error: llmdatadist
requires k/v tensors with shape [num_blocks, ...], but the implementation
before this PR used [2, num_blocks, ...], which breaks prefill
disaggregation.
3. Use the hybrid kv cache only when running qwen3_next to fix an accuracy
issue on prefill disaggregation.
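
The alignment requirement in item 1 comes down to rounding an address up to the next multiple of the alignment. A minimal sketch, assuming "2M" means 2 MiB; `align_up` is a hypothetical helper, not a function from this PR.

```python
ALIGNMENT = 2 * 1024 * 1024  # assumed: "2M" interpreted as 2 MiB


def align_up(addr, alignment=ALIGNMENT):
    """Round `addr` up to the nearest multiple of `alignment`."""
    return (addr + alignment - 1) // alignment * alignment


raw_addr = 0x7F00_0012_3456
aligned_addr = align_up(raw_addr)
offset = aligned_addr - raw_addr  # bytes to skip at the start of the buffer
print(hex(aligned_addr), offset)
```

A common way to apply this is to over-allocate by one alignment unit and start the tensor at the aligned offset inside the buffer.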

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tested locally by @liziyu179 

- vLLM version: v0.10.2
- vLLM main:
4f02b77de4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-18 21:43:22 +08:00
Li Wang
01592515b8 [Bugfix] Fix sleep mode level 2 (#1376)
### What this PR does / why we need it?
For sleep mode level 2, we discard both the model weights and the
kv_cache. The problem is that discarding the weights also discards some
tensors representing the model state, exposed through
`model.named_buffers()`, such as `running_mean / running_var` in
BatchNorm and the rope cos-sin cache. If we update the weights but forget
to update the buffers as well, this leads to unpredictable issues.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
5963b98b46

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-18 19:51:52 +08:00
LeeWenquan
f4e3d22432 Remove chunked_prefill_for_mla and fix ring_mla bug (#2781)
### What this PR does / why we need it?
Remove the chunked-prefill-for-MLA branch in mla, and change the dtype of
prefill_mask to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-18 19:43:26 +08:00
linfeng-yuan
79a910ef47 [bugfix][torchair] fix multistream_moe problems in torchair graph mode (#2681)
This PR fixes two problems that occur when `multistream_moe` is enabled in
torchair graph mode:
1. Check `TorchairAscendW8A8DynamicFusedMoEMethod` instead of the incorrect
`AscendW8A8DynamicFusedMoEMethod`.
2. mc2_mask should be chunked regardless of whether `replace_allreduce` is
True or False in the forward function of `TorchairAscendFusedMoE`.

- vLLM version: v0.10.2
- vLLM main:
0fb2551c23

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-18 17:35:04 +08:00
realliujiaxu
af2a886814 refactor linear (#2867)
### What this PR does / why we need it?
The current linear.py has the following issues:

- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.

- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.

To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the class selection logic into
common functions. For future additions of custom communication groups or
forward methods, it will only be necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection
logic.
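
The shape of this refactor can be sketched as follows. All names here (`ColumnOpSketch`, `MlpColumnOpSketch`, `select_column_op`, the group labels, and the `mlp_tp_enabled` flag) are hypothetical stand-ins for illustration; they are not the classes or selection rules in linear.py.

```python
class ColumnOpSketch:
    """Hypothetical base op: bundles a comm group with a forward method."""
    comm_group = "tp"

    def forward(self, x):
        # Stand-in for the real computation; tags the input with its group.
        return f"{self.comm_group}:{x}"


class MlpColumnOpSketch(ColumnOpSketch):
    """Hypothetical variant that uses a dedicated communication group."""
    comm_group = "mlp_tp"


def select_column_op(prefix, mlp_tp_enabled):
    """Single place where the choice of op is made (assumed rule)."""
    if mlp_tp_enabled and "gate_up_proj" in prefix:
        return MlpColumnOpSketch()
    return ColumnOpSketch()


op = select_column_op("model.layers.0.mlp.gate_up_proj", mlp_tp_enabled=True)
print(op.forward("x"))
```

With this structure, supporting a new communication group means adding one subclass and one branch in the selection function, leaving the layer classes untouched.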

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
dd39baf717

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: weijinqian0 <weijinqian@huawei.com>
2025-09-18 14:09:19 +08:00
panchao-hub
a7f8ed38ed [Bugfix]:replace npu_incre_flash_attention with npu_fused_infer_atten… (#2901)
### What this PR does / why we need it?
[Bugfix]: Replace npu_incre_flash_attention with
npu_fused_infer_attention_score so that the tiling can be updated.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2025-09-18 14:06:08 +08:00
xuyexiong
6681dde902 [Feat][Graph] Support MTP for ACL Graph (#2932)
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-18 14:05:33 +08:00
Chao Lei
cef43b524e [Feat] A Connector that supports Mooncake store (#2913)
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable kvcache
reuse in scenarios with system prompts or multi-turn dialogues.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
5963b98b46

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
2025-09-18 14:04:45 +08:00
realliujiaxu
723d460894 [Bugfix] fix kv nz accuracy bug (#2988)
When `enable_kv_nz` is true, the output of DeepSeek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
2b85697031

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-17 21:10:25 +08:00
linfeng-yuan
8bcc0ccd57 [bugfix] fix shared expert dp with hybrid kvcache (#2964)
### What this PR does / why we need it?
https://github.com/vllm-project/vllm-ascend/pull/2849 moves the
implementation of `shared_expert_dp` to torchair deepseek_modeling.
However, calling `set_forward_context` with `enforce_eager` and
`shared_expert_dp` falls back to the implementation in
model_runner_v1.py and sets the global attn_metadata as a dictionary. This
leads to a RuntimeError when attn_metadata is retrieved from the forward
context and used in torchair_deepseek_v2.py. This PR fixes the problem
by introducing the transformation of attn_metadata in this file.

Note that current E2E testing lacks the case of deepseek with
`shared_expert_dp`. We need to add an ST with `shared_expert_dp` in
testing workflow.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
e2e vllm serving with `enable_shared_expert_dp: true` passed.

- vLLM version: v0.10.2
- vLLM main:
de3e53a75b

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-17 20:01:47 +08:00
1Fire4
1f6465c399 Add an option of enable frozen parameter (#2869)
### What this PR does / why we need it?
Add an option to enable frozen parameters.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb

Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
2025-09-17 12:00:44 +08:00
offline893
76844eec78 Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently, dynamic expert load balancing is stop-the-world.
Asynchronous expert load balancing would be better, as it avoids the
following problems:

Host-bound latency:
EPLB involves many CPU operations, such as running the EPLB algorithm,
creating p2p ops, and converting the log2phy expert map, which take a
long CPU time (~1s).
Communication latency: Transfer time is costly in setups without
NVLink. Because the weights of one expert may be transferred to
multiple new positions, one expert can require N send/recv operations,
resulting in long latency. In our tests, batch_isend_irecv took more
than 100ms to transmit the weights of 16 experts on an Ascend A2 server.

SwiftBalancer no longer stops the world. In our tests on NPU it costs
1~2ms per layer while improving decode latency by 5ms-8ms with ep_size =
64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Async CPU computing for the EPLB algorithm and other Python operators.
3. A new EPLB algorithm with less expert rebalancing and almost the same
effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record the expert distribution during forward. We use expert_token_num
after dispatch instead of topk_ids, which gives a much smaller tensor
shape and reduces the cost of HBM recording and the add operator.
2. Do an all-gather for the expert distribution. All-gather is used
instead of all-reduce because it has less traffic volume.
3. Wake up the EPLB worker process with the expert distribution when
num_iterations is reached. Run the EPLB algorithm in the EPLB worker.
4. Generate p2p send/recv ops and other operators such as log2phy, which
cost a long CPU time.
5. Launch ibatch_send_recv in async_stream before forward.
6. After forward, wait for ibatch_send_recv to finish, then update the
expert map and expert weights.
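
The control flow of steps 3 to 6 can be sketched in plain Python. This is only an illustration under stated assumptions: a thread stands in for the EPLB worker process, `rebalance` is a toy greedy policy rather than the PR's algorithm, no weights are actually transferred, and the names `AsyncRebalancer`, `submit`, and `poll` are hypothetical.

```python
import queue
import threading


def rebalance(loads, num_devices):
    """Toy policy (an assumption, not the PR's algorithm): place the
    heaviest expert on the currently least-loaded device."""
    device_load = [0] * num_devices
    placement = {}
    for expert in sorted(range(len(loads)), key=lambda e: -loads[e]):
        device = min(range(num_devices), key=lambda d: device_load[d])
        placement[expert] = device
        device_load[device] += loads[expert]
    return placement


class AsyncRebalancer:
    """Runs rebalance() off the main loop so forward passes are not blocked."""

    def __init__(self, num_devices):
        self.num_devices = num_devices
        self.requests = queue.Queue()
        self.results = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            loads = self.requests.get()
            if loads is None:  # shutdown signal
                break
            self.results.put(rebalance(loads, self.num_devices))

    def submit(self, loads):
        """Wake the worker with a snapshot of the recorded loads."""
        self.requests.put(list(loads))

    def poll(self):
        """Return a finished placement, or None if the worker is still busy."""
        try:
            return self.results.get_nowait()
        except queue.Empty:
            return None

    def close(self):
        self.requests.put(None)
        self.worker.join()


balancer = AsyncRebalancer(num_devices=2)
balancer.submit([600, 560, 120, 120])  # hand the recorded loads to the worker
# ... forward passes would keep running here ...
balancer.close()                       # this demo simply waits for the worker
new_placement = balancer.poll()        # picked up after forward, then applied
```

The main loop only ever calls the non-blocking `poll`, which is the property that removes the stop-the-world pause in this sketch.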
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com


- vLLM version: v0.10.2
- vLLM main:
567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
xuyexiong
ae758dda05 [Bugfix] Fix mtp torchair in pd Disaggregation scenario (#2951)
### What this PR does / why we need it?
1. In memory of #2509, fix MTP torchair in the PD disaggregation scenario.
2. Fix an MLA bug in the spec decoding scenario, where num_decodes !=
num_decode_tokens.
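
A small arithmetic sketch of why the two counts diverge under speculative decoding. It assumes every decode request carries its last sampled token plus a fixed number of draft tokens; the function and parameter names are illustrative only:

```python
def decode_token_count(num_decodes, num_speculative_tokens):
    """Tokens processed in one decode step.

    Assumption: each decode request contributes 1 sampled token plus
    num_speculative_tokens draft tokens awaiting verification.
    """
    return num_decodes * (1 + num_speculative_tokens)


without_spec = decode_token_count(num_decodes=4, num_speculative_tokens=0)
with_one_draft = decode_token_count(num_decodes=4, num_speculative_tokens=1)
```

Only without draft tokens do the request count and the token count coincide, so code that sizes buffers by `num_decodes` can break once drafts are present.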


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
5206ab20ba

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-17 09:07:58 +08:00
rjg-lyh
6b7117dbb7 [main] addrmsnorm + quant fusion optim in Dense Models (#2772)
### What this PR does / why we need it?
This PR fuses the addrmsnorm op and the w8a8 quant op for better performance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
0faf3cc3e8

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-16 22:31:38 +08:00
yiz-liu
88ca8a051c [Feat][Graph] Support DeepSeek with ACL Graph (#2707)
### What this PR does / why we need it?
In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should
be OK with ACL Graph.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Working on it.

- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-16 17:50:17 +08:00
linfeng-yuan
1c5900327b [refactor] refactor deepseek-related files (#2849)
### What this PR does / why we need it?
This PR deletes ~2K lines of DeepSeek modeling code. It falls back from
the CustomDeepseekV2 modules to the original vLLM implementations and
adapts to some vLLM modifications to DeepSeek and MoE.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E  vllm serving with torchair graph mode and eager mode.

- vLLM version: v0.10.2
- vLLM main:
759ef49b15

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-16 14:13:07 +08:00
weichen
18ca7861f6 [Main] [Refactor] Enable MoECommMethod in Eager Mode (#2791)
### What this PR does / why we need it?
1. Replace prepare/finalize operation in fused_moe.py by
moe_comm_method.prepare()/finalize()
2. Replace unified_fused_experts by moe_comm_method.fused_experts() in
fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py
3. Add calling _select_moe_comm_method in spec-decode proposers.
4. w4a8_dynamic does not currently support gatherep, so all2allv is used
instead.
5. Remove redundant code.
### Does this PR introduce _any_ user-facing change?
The AllgatherEP switch is disabled in aclgraph/eager mode; selection simply
follows the rules in modelrunner_v1._select_moe_comm_method().
### How was this patch tested?
e2e & ut


- vLLM version: v0.10.2
- vLLM main:
7f6f2c1182

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-16 11:06:00 +08:00
wangxiyuan
c556038ef0 [New model] Qwen3-next support (#2917)
### What this PR does / why we need it?
Add Qwen3-next support.

### Does this PR introduce _any_ user-facing change?
Yes, users can use Qwen3 next.
Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 (the
tutorial will be available
[here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html))

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884

Co-Authored-By: Angazenn <supperccell@163.com>
Co-Authored-By: zzzzwwjj <1183291235@qq.com>
Co-Authored-By: MengqingCao <cmq0113@163.com>
Co-Authored-By: linfeng-yuan <1102311262@qq.com>
Co-Authored-By: hust17yixuan <303660421@qq.com>
Co-Authored-By: SunnyLee219 <3294305115@qq.com>
Co-Authored-By: maoxx241 <maoxx241@umn.edu>


- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: hust17yixuan <303660421@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: hust17yixuan <303660421@qq.com>
2025-09-16 01:17:42 +08:00
wangxiyuan
382c29f3e1 [BugFix] Fix world size bug in model_runner (#2915)
- Fix world size bug in model_runner to make sure ep>16 runs with MC2 
- enable e2e test for vl

Co-Authored-By: whx-sjtu <2952154980@qq.com>
Co-Authored-By: Icey <1790571317@qq.com>
- vLLM version: v0.10.2
- vLLM main:
3e903b6cb4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-14 12:20:25 +08:00
fan2956
c5a502fd2e main add ascend scheduler support multimodal (#2844)
### What this PR does / why we need it?
On main, AscendScheduler does not support multimodal models because it
lacks scheduled_encoder_inputs, which is needed for multimodal inference.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155

- vLLM version: v0.10.2rc2
- vLLM main:
15b8fef453

---------

Signed-off-by: fan2956 <zhoufan53@huawei.com>
Co-authored-by: zhoufan2956 <zhoufan2956@163.com>
2025-09-14 09:38:51 +08:00
zxr2333
0a27705917 fix mooncake connector adxl hostname usage (#2824)
### What this PR does / why we need it?
This PR is used to adapt the hostname format for Mooncake when using
adxl. When Mooncake uses adxl, it is necessary to set
```USE_ASCEND_DIRECT``` to True in the file
```/Mooncake/mooncake-common/common.cmake``` during compilation. The
mooncake_connector obtains this config by calling
```vllm_config.kv_transfer_config.get_from_extra_config```, determines
whether Mooncake is using adxl, and selects the corresponding hostname
format.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.


- vLLM version: main
- vLLM main:
d21a36f5f9

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-09-13 14:38:48 +08:00
Jiawei Li
e57cca971c Fix the bugs about operator registration by PyTorch Dispatcher (#2786)
**Background:**

There are two principles about operator registration in PyTorch
- The same namespace can be only registered once by `TORCH_LIBRARY`
- The operator signatures can be only registered once by `def`
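
The two rules can be modelled with a toy registry in plain Python. This is not PyTorch's dispatcher; the class `ToyDispatcher`, the operator `my_op`, and the use of `_C` as the previously shared namespace are assumptions made for illustration:

```python
class ToyDispatcher:
    """Toy model of the two registration rules; not PyTorch's dispatcher."""

    def __init__(self):
        self._namespaces = set()
        self._operators = set()

    def library(self, namespace):
        # Rule 1: a namespace can be registered only once.
        if namespace in self._namespaces:
            raise RuntimeError(f"namespace {namespace!r} is already registered")
        self._namespaces.add(namespace)

    def define(self, namespace, schema):
        # Rule 2: an operator signature can be defined only once.
        name = schema.split("(")[0]
        if (namespace, name) in self._operators:
            raise RuntimeError(f"{namespace}::{name} is already defined")
        self._operators.add((namespace, name))


dispatcher = ToyDispatcher()
dispatcher.library("_C")         # assumed to be claimed first by another package
dispatcher.library("_C_ascend")  # a distinct namespace does not collide
dispatcher.define("_C_ascend", "my_op(Tensor x) -> Tensor")  # hypothetical op

try:
    dispatcher.library("_C")     # a second registration of the same namespace
    conflict = False
except RuntimeError:
    conflict = True
```

Here `conflict` ends up `True`, which is the collision that a dedicated namespace sidesteps.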

All custom operators defined in the current repo are used only by
Ascend. This differs from an approach in which vLLM defines a common
operator schema and every accelerator follows that schema and completes
the implementation for its own hardware, which would be conducive to
functional abstraction.

Therefore, we can rename the operator registration namespace to an
Ascend-specific namespace(**_C_ascend**).

Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742


- vLLM version: main
- vLLM main:
f592b3174b

Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
2025-09-13 11:58:52 +08:00
rjg-lyh
585a494baa [Core] Disable the chunked prefill feature in Non-MLA LLMs (#2894)
### What this PR does / why we need it?
This PR forcibly disables the chunked prefill feature in non-MLA
models, as the performance of the operators supporting this
functionality is currently suboptimal. The feature is allowed only if
the user has explicitly enabled chunked prefill in the
ascend_scheduler_config.
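
The rule can be written as a small decision function. This is a sketch of the behaviour as described, not the PR's code; the function name and all three parameter names are hypothetical:

```python
def resolve_chunked_prefill(uses_mla, requested, enabled_in_ascend_scheduler_config):
    """Decide whether chunked prefill stays on.

    Assumptions: MLA models keep whatever was requested, and non-MLA models
    get chunked prefill only when the user explicitly enabled it in the
    ascend_scheduler_config.
    """
    if uses_mla:
        return requested
    return enabled_in_ascend_scheduler_config


non_mla_default = resolve_chunked_prefill(
    uses_mla=False, requested=True, enabled_in_ascend_scheduler_config=False)
non_mla_opt_in = resolve_chunked_prefill(
    uses_mla=False, requested=True, enabled_in_ascend_scheduler_config=True)
```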

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

Related: https://github.com/vllm-project/vllm-ascend/pull/2659

- vLLM version: main
- vLLM main:
d21a36f5f9

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-12 23:17:09 +08:00
Yikun Jiang
756b8a1946 Revert "[Feat] Unquantized linear nz support (#2619)" (#2896)
### What this PR does / why we need it?
This reverts commit 7b2ecc1e9a.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: main
- vLLM main:
64d90c3e4f

Closes: https://github.com/vllm-project/vllm-ascend/issues/2890
Closes: https://github.com/vllm-project/vllm-ascend/issues/2887
Closes: https://github.com/vllm-project/vllm-ascend/issues/2885

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-12 20:51:12 +08:00
rjg-lyh
fc2bcbe21c [Ops] Fix bug in register_custom_ops without forward_context (#2883)
### What this PR does / why we need it?
This PR fixes a bug in register_custom_ops when no forward_context is
available. We add a try-except to handle this situation.
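
A minimal sketch of the try-except pattern, using stand-ins rather than the real vLLM accessors: `_state`, `set_forward_context`, `get_forward_context`, and `read_flag` below are all defined here for illustration, and the choice of `AssertionError` is an assumption.

```python
_state = {"context": None}  # stand-in for the global forward context


def set_forward_context(context):
    """Stand-in setter used only by this sketch."""
    _state["context"] = context


def get_forward_context():
    """Stand-in accessor: fails when no forward context has been set."""
    if _state["context"] is None:
        raise AssertionError("forward context is not set")
    return _state["context"]


def read_flag(name, default=False):
    """Read a flag from the forward context, or fall back to a default."""
    try:
        context = get_forward_context()
    except AssertionError:
        # No forward context is available, so use the default instead of failing.
        return default
    return context.get(name, default)


before = read_flag("some_flag")           # no context yet, default is used
set_forward_context({"some_flag": True})
after = read_flag("some_flag")            # the context now supplies the value
```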

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
7920de0a2a

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-12 16:58:08 +08:00
realliujiaxu
778cb72556 fix bug when rotary_dim is not 128 (#2847)
### What this PR does / why we need it?
`torch_npu.npu_apply_rotary_pos_emb` only supports head_size and
rotary_dim equal to 128. An error occurs when running GLM.
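
The constraint suggests a guard of roughly the following shape. This is a sketch only: the function name and the returned labels are hypothetical, and what the non-fused path actually does in the fix is not stated in this message.

```python
FUSED_ROPE_DIM = 128  # the only size the fused op accepts, per the description


def select_rope_impl(head_size, rotary_dim):
    """Pick the fused op only when both sizes match its requirement."""
    if head_size == FUSED_ROPE_DIM and rotary_dim == FUSED_ROPE_DIM:
        return "fused"
    return "fallback"


full_rotary = select_rope_impl(head_size=128, rotary_dim=128)
partial_rotary = select_rope_impl(head_size=128, rotary_dim=64)  # example sizes
```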

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: main
- vLLM main:
404c85ca72

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-12 09:49:36 +08:00