Commit Graph

111 Commits

Author SHA1 Message Date
wangxiyuan
ac1c2cd9ac [CI] Upgrade vllm version - 0925 (#3167)
Upgrade vLLM to newest commit.

1. Remove the useless func get_state_cls, it has been removed from vLLM
already.
e6750d0b18
2. Fix ut broken by
6160ba4151


- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-25 14:20:10 +08:00
weijinqian0
6aa4253798 [Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085)
What this PR does / why we need it?

there are two sets of sp implementations for moe and dense models. One
is called sequence_parallelism, and the other is flashcomm_v1.
We did the following things:

Merge two sets of code with the same implementation into one.
Remove the implementation of sequence_parallelism, as this solution
cannot support aclgraph.
Does this PR introduce any user-facing change?

No

How was this patch tested?

e2e&ut

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-24 11:29:59 +08:00
lidenghui1110
0f3939e5a9 [Feature]cpu offload connector (#1659)
This PR implements cpu offload connector to enable NPU kv cache offload
to host DRAM.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
Co-authored-by: AlvisGong <gwly0401@163.com>
2025-09-23 14:25:05 +08:00
Yizhou
39a85c49fa [Refactor] Rename cudagraph_support to aclgraph_support (#3104)
### What this PR does / why we need it?
Updates the `cudagraph_support` attribute to `aclgraph_support` to use
terminology appropriate for the Ascend platform (ACL graphs instead of
CUDA graphs).

This change also explicitly disables graph support for the MLA attention
backend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-23 11:30:31 +08:00
Yizhou
3fa7cf6345 [Refactor][Graph] Move graph parameter logic to acl_graph module (#3101)
### What this PR does / why we need it?
This is the follow-up PR of #2128 .

Moves graph parameter management components, including `GraphParams`,
`get_graph_params`, and `set_graph_params`, from the generic `utils.py`
to the more specific `compilation/acl_graph.py`.

Additionally, extracts the `update_attn_params` logic from the
`NPUModelRunner` class into a standalone function within the `acl_graph`
module.

This refactoring improves code organization by centralizing ACL
graph-related logic into its own dedicated module, enhancing modularity
and clarity.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 22:23:14 +08:00
Yizhou
338231acaf [Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Captureing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 17:14:28 +08:00
tianyitang
f1f2c8f5e5 [Perf] Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems (#2962)
### What this PR does / why we need it?
Add new npu_fused_infer_attention_score op to improve perfomance in
splitfuse cases and resolve long-seq mask problems .

1. The original op's performance is suboptimal in certain scenarios,
necessitating optimization through the _new op_
(npu_fused_infer_attention_score)。
2. For ultra-long sequences (128k), the original operator will allocate
a large attn_mask, which consumes excessive CPU memory. In contrast, the
_new op_ supports a fixed-size compressed mask, effectively resolving
this issue.

NOTE1: The current PR retains the original logic and uses a version
check of the CANN package to determine whether the _new op_ can be
enabled. This ensures no impact on existing users. In future versions,
this version check and the original logic will be deprecated, and the
_new op_ scheduling will be uniformly adopted.
NOTE2: This pr relies on future CANN version, which is not available
now.
NOTE3: To enable the new op in chunked prefill, the parameter
additional_config should be set like `--additional-config
'{"ascend_scheduler_config":
{"enabled":true,"enable_chunked_prefill":true}}' \` at least.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed




- vLLM version: v0.10.2
- vLLM main:
6c5f82e5aa

---------

Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
2025-09-22 14:56:14 +08:00
zhangxinyuehfad
a22b532d38 [Fixbug] Fix shape not match when sliding_window and dynamic batch_size (#2830)
### What this PR does / why we need it?
Fix shape not match when test LLM-Research/Phi-4-mini-instruct accuarcy 

### Does this PR introduce _any_ user-facing change?

Users can't set dynamic batch_size or use lm_eval test accuracy when
using models(sliding_window)

### How was this patch tested?
accuarcy of LLM-Research/Phi-4-mini-instruct is ok :
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```


- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-19 22:35:14 +08:00
xuyexiong
2a87b4cecb [Bugfix] Fix specdecoding in chunkedprefill scenario (#3025)
### What this PR does / why we need it?

The speculative decode phase of chunkedprefill has taken an incorrect
path, should always use TND layout for speculative decoding.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-19 14:05:08 +08:00
LeeWenquan
f4e3d22432 Remove chunked_prefill_for_mla and fix ring_mla bug (#2781)
### What this PR does / why we need it?
Remove chunked prefill for mla branch in mla , and change dtype of
prefill_mask to avoid accuracy problem
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-18 19:43:26 +08:00
xuyexiong
6681dde902 [Feat][Graph] Support MTP for ACL Graph (#2932)
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-18 14:05:33 +08:00
realliujiaxu
723d460894 [Bugfix] fix kv nz accuracy bug (#2988)
when `enable_kv_nz` is true, output of Deepseek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
2b85697031

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-17 21:10:25 +08:00
xuyexiong
ae758dda05 [Bugfix] Fix mtp torchair in pd Disaggregation scenario (#2951)
### What this PR does / why we need it?
1. In memory of #2509, Fix mtp torchair in pd Disaggregation scenario
2. fix mla bug in SpecDecoding Scenario, since num_decodes !=
num_decode_tokens


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
5206ab20ba

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-17 09:07:58 +08:00
wangxiyuan
c556038ef0 [New model] Qwen3-next support (#2917)
### What this PR does / why we need it?
Add Qwen3-next support.

### Does this PR introduce _any_ user-facing change?
Yes, users can use Qwen3 next.
Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the
tutorial will be ready in
[here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html)

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884

Co-Authored-By: Angazenn <supperccell@163.com>
Co-Authored-By: zzzzwwjj <1183291235@qq.com>
Co-Authored-By: MengqingCao <cmq0113@163.com>
Co-Authored-By: linfeng-yuan <1102311262@qq.com>
Co-Authored-By: hust17yixuan <303660421@qq.com>
Co-Authored-By: SunnyLee219 <3294305115@qq.com>
Co-Authored-By: maoxx241 <maoxx241@umn.edu>


- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: hust17yixuan <303660421@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: hust17yixuan <303660421@qq.com>
2025-09-16 01:17:42 +08:00
Marco Barletta
6666e5265d Added support for KV connector v1 (#2039)
### What this PR does / why we need it?
- This PR adds the support for the KV connector interface in the V1
architecture, in the same way as vllm. Vllm-ascend currently lacks of
this support, required to support also layerwise management of KV
caches.

- The connector interface allows using external tools and integrate them
with vllm

### Notes:
We are aware of Issue #684 , however that issue does not modify the
attention classes as necessary to perform layerwise management of KV
caches required for connectors like LMCache.

The implementation of this PR ported the necessary code from the vanilla
vllm. The KV connector API is the same as vanilla vllm, supporting the
standard KV connector API.

EDIT: this PR was re-implementing part of the changes merged one hour
before this PR was made on the file model_runner_v1.py. I solved the
conflicts by removing any modification to the model_runner_v1 file,
which now are largely already merged in main. Now this PR is left for
the modifications to the attention_v1 file.

### Does this PR introduce _any_ user-facing change?
The PR does not modify current APIs, but it extends the behavior of
current worker runner and attention classes to save and load KV caches.
In absence of connectors, the behavior should stay untouched.

### How was this patch tested?
- No unit test implemented yet for the worker.

- Tested together with LMCache using
https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py
with the following models:
1 Deepseek-R1-Distill-Qwen-1.5B
2 Qwen3-30B-A3B
3 Deepseek-v2-lite
4 Llama-3.1-8B
LMCache used in both layerwise and non-layerwise mode.

- Performed LMEval on LMCache integrated with vllm-ascend.

Results without LMCache on Qwen3-8B:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8400|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8355|± |0.0102|
 
Results with LMCache Layerwise:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8385|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8332|± |0.0103|


- vLLM version: v0.10.1.1
- vLLM main:
50fede6634

---------

Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
2025-09-08 09:04:22 +08:00
yeyifan
b2f77d3aa8 [fix] prefill unsupport sliding window attention (#2758)
### What this PR does / why we need it?
fix prefill attention bug,not support sliding window.
npu_fused_infer_attention_score head_dim only equal 128, not support
other number.
### Does this PR introduce _any_ user-facing change?
remove prefill phase npu_fused_infer_attention_score
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-09-07 10:34:38 +08:00
1092626063
5b3646ab21 [FEATURE][MTP] Support MTP > 1 (#2708)
### What this PR does / why we need it?
[RFC:Support MTP > 1 for
DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745)

- [x] dp1 tp16
- [x] dp4 tp4
- [x] dp2 tp 8
- [x] torchair graph

- vLLM version: v0.10.1.1
- vLLM main:
c9f7081f9c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-09-05 09:11:22 +08:00
whx
a58013440a [BugFix][MLA] Fix attn_mask bug for ring mla (#2704)
This PR fix a bug related to attention mask used in ring mla. Current
ring mla has supported compressed mask, so we can directly use a 512 *
512 attention mask.

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-04 10:22:46 +08:00
yeyifan
1191a64ae5 [Feat]attention add sliding windows size (#2528)
### What this PR does / why we need it?
Add a sliding window size parameter to attention
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Regarding the `Gemma3` model, set
additional_config={"ascend_scheduler_config": {"enabled":True}}, only
support AscendScheduler
test commond:`python3 -m vllm.entrypoints.openai.api_server --model
gemma3 --additional-config
'{"ascend_scheduler_config":{"enabled":true}}'`


- vLLM version: v0.10.1.1
- vLLM main:
6578e87365

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-08-28 10:37:19 +08:00
LeeWenquan
c8d1df3a3f [Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465)
### What this PR does / why we need it?
In order to support fused kernels, multi-stream, communication
optimization etc, it's better to aggregate all opreations in Attention
layer togather. This PR tries to refactor mla_v1 by moving all MLA
preprocessing ops into mla_v1 attention impl.
Note that new mla_v1 doesn't take torchair into consideration. So this
PR can only be merged after torchair related mla_v1 is isolated into a
new file.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

### Features Test

<img width="506" height="141" alt="image"
src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466"
/>

### Performance After Refact
<img width="648" height="486" alt="image"
src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb"
/>

### Performance Before Refact
<img width="618" height="494" alt="image"
src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1"
/>


- vLLM version: v0.10.1.1
- vLLM main:
e03940762b

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-08-28 10:35:57 +08:00
rjg-lyh
2bfbf9b9b3 [main][bugfix] Fix bugs and refactor cached mask generation logic (#2442)
### What this PR does / why we need it?
This PR fix bugs and refactor cached mask generation logic. Now just
pre-construct and use the cached mask on cpu instead of device on npu.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
9b5f64238f

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-08-27 12:07:29 +08:00
ZhaoJiangJiang
3629bc4431 feat: add mtp ut and fix some bugs (#2453)
### What this PR does / why we need it?
Fix mtp mode ut

### Does this PR introduce _any_ user-facing change?
Nothing

### How was this patch tested?
This can be tested in the same way as a unit test.


- vLLM version: v0.10.0
- vLLM main:
53415653ff

Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com>
Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>
2025-08-22 17:09:08 +08:00
linfeng-yuan
0ca3f48c90 [2/N][refactor] torchair deepseek mla backend refactor (#2459)
### What this PR does / why we need it?
This PR move current unified mla backend to torchair folder and remove
torchair-related code in attention/mla_v1.py (1.3k -> 0.9k).

 
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Running eager mode with mla backend, and torchair mode with code before
[2445](https://github.com/vllm-project/vllm-ascend/pull/2445)


- vLLM version: v0.10.0
- vLLM main:
f571ff8eb6

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-08-21 14:02:30 +08:00
Mengqing Cao
1327f9be1c Fix some ci issue and refactor modelrunner (#2445)
### What this PR does / why we need it?
Fix some ci issue and refactor modelrunner

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
4d9c61993a

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-20 09:01:04 +08:00
Shanshan Shen
83e0f41408 [3/N][Refactor] Move torchair_attention to torchair dir (#2017)
### What this PR does / why we need it?

1. Move `torchair_attention` to `torchair` dir.
2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend`
to reduce duplicate methods.
3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce
duplicate properties.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
0933f9d518

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-19 10:25:22 +08:00
Shanshan Shen
103654ccd6 [Misc] Remove redundant imported envs, using envs_ascend instead (#2193)
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main:
71683ca6f6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:33:39 +08:00
Shanshan Shen
55d0790597 [2/N][Refactor] Refactor V1 attention for better extensibility (#1995)
### What this PR does / why we need it?

Refactor V1 Attention for better extensibility (prepared for torchair
attention refactor).

**Main changes:**
- Move different kinds of foward into their method respectively, e.g.,
`_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`,
`_forward_decode_only()`, `_forward_v1_style()`.

### Does this PR introduce _any_ user-facing change?

No.

- vLLM version: v0.10.0
- vLLM main:
14a5d903ab

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:32:41 +08:00
Wang Kunpeng
dc585f148a [main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198)
### What this PR does / why we need it?
1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to
pure DP for shared experts, enabling more efficient execution.
2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by
using ReduceScatter, made possible by pure DP sharding.
3.AllGather Postponed: Delayed to after QKV down projection to reduce
synchronization impact during prefill.

### How was this patch tested?
Adding ut case in `tests/ut/attention/test_mla_v1.py`

#### How to run

use parameter `--additional_config='{"enable_shared_expert_dp": true}'`

##### a.How to run eager mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true}'

##### b.How to run graph mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2025-08-12 14:12:12 +08:00
zhenghaojiang
eb43a475f4 [Feat] chunkprefill mla support torchair graph (#1772)
chunkprefill mla only support eager mode now,we want to optimaze it by
support torchair graph, the idea is simple, when all the request is
running in decode, use torchair graph to deal with it, else when
chunkprefill or prefill only, use the eager mode

- vLLM version: v0.10.0
- vLLM main:
ebf7605b0d

Signed-off-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>
2025-08-11 19:58:59 +08:00
lbk-sys
c611291661 【main】SP For Qwen3 MoE (#2209)
### What this PR does / why we need it?
Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2,
replacing AllReduce with Reduce-Scatter and AllGather achieves
computational benefits in norm operations while saving one AllGather
communication. This feature is enabled during the P-phase and delivers
notable gains in long-sequence scenarios (e.g., 16k–25k), with
performance improvements reaching 5%–10%.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
``` 
compilation_config={
    "pass_config":{
        "enable_sequence_parallelism": True
    }
},
enable_expert_parallel=True,
```

- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: libaokui <libaokui@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-08-07 09:15:49 +08:00
xuyexiong
26fc36b0e0 [V1] MTP supports torchair (#2145)
### What this PR does / why we need it?
Support MTP  with:

- [x]  V0 Scheduler
- [x]  TorchAir
- [x]  Single DP
- [x]  Multi DP
- [x]  Disaggregate PD

Known issues:
- [ ] Not support V1 Scheduler (chunked prefill), will be supported in a
few weeks
- [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now,
need to comment out the line 171-175 in file
`vllm/vllm/v1/metrics/loggers.py`
```
            if (len(self.engine_indexes) > 1
                and vllm_config.speculative_config is not None):
            raise NotImplementedError("Prometheus metrics with Spec Decoding "
                                      "with >1 EngineCore per AsyncLLM is not "
                                      "supported yet.")
```

To start an online server with torchair enabled, here is an example:
```
python -m vllm.entrypoints.openai.api_server \
 --model="/weights/DeepSeek-R1_w8a8/" \
 --trust-remote-code \
 --max-model-len 40000 \
 --tensor-parallel-size 4 \
 --data_parallel_size 4 \
 --max-num-seqs 16 \
 --no-enable-prefix-caching \
 --enable_expert_parallel \
 --served-model-name deepseekr1 \
 --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
 --quantization ascend \
 --host 0.0.0.0 \
 --port 1234 \
 --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \
 --gpu_memory_utilization 0.9 
``` 

offline example with torchair enabled
```
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=16, temperature=0)
# Create an LLM.
llm = LLM(
    model="/home/data/DeepSeek-R1_w8a8/",
    tensor_parallel_size=16,
    max_num_seqs=16,
    gpu_memory_utilization=0.9,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    max_model_len=2000,
    additional_config = {
       'torchair_graph_config': {
            'enabled': True,
            "graph_batch_sizes": [16],
            'enable_multistream_shared_expert': False,
        },
       "ascend_scheduler_config": {
            "enabled": True
        },
        # 'expert_tensor_parallel_size': 16,
    }
)

# Generate texts from the prompts.
# llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- vLLM version: v0.10.0
- vLLM main:
302962e806

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-08-06 19:37:43 +08:00
Li Wang
2284289880 [MISC] Cherry pick #1291 from v0.9.1-dev (#1825)
### What this PR does / why we need it?
Cherry pick #1291 from v0.9.1-dev, This pr implement the synchronization
of whether `dbo` is enabled across all dp ranks. specifically, it
performed allreduce op across multiple DP ranks, only when all the dp
rank is `enable_dbo`, it is enabled

Co-authored-by: shikang-hangzhou <459956190@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-01 09:08:45 +08:00
Mengqing Cao
4c8842da65 [BugFix] Fix a bug of running chunked-prefill with torchair. (#1378) (#1844)
This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced
before assignment` when running chunked-prefill with torchair. We should
calculate `decode_hs_or_q_c` whether or not torchair graphics mode is
enabled.

backport of #1378
fix https://github.com/vllm-project/vllm-ascend/issues/1369


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-07-31 20:08:45 +08:00
whx
98cadc2146 [Perf] Avoid performing index selection of sin/cos cache every layer (#1890)
Optimize number of index selections of sin/cos cache.

- vLLM version: v0.10.0
- vLLM main:
656c24f1b5

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 18:06:45 +08:00
whx
e7d32ed3f1 [BugFix] Fix the problem that torchair doesn't support tp > 4. (#1508)
This PR removes the restriction that TP cannot be greater than 4 in
torchair scenario, because current newest version of CANN has fixed this
bug.

- vLLM version: v0.10.0
- vLLM main:
04ff4be310

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-28 16:48:05 +08:00
wangxiyuan
4a008c4dac [Misc]Clean up useless import from vllm (#2049)
Clean up useless  import from vllm to make code more clear.

- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 16:01:59 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

A refactoring of forward_context and model_runner_v1, add some context
which is necessary in model inference into forward_context, and refactor
dummy_run logic, make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update mc2_v2 op, and support `active_mask` param;
Update scripts in examples dir;
refactor `dummy_run` logic;
Add soc_version for A2 and A3;

### Does this PR introduce _any_ user-facing change?

No change at user-facing.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopt `LLMDataDist` for kv cache register and `pull_blocks`
style disaggregate prefill implementation. The interface implementation
mainly follows the design of NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be test with the following step:
- Generate the rank table for all machine.
- execute`toy_proxy.py` to launch the disaggregate prefill proxy server,
specify the prefill ip, port and the decode ip, port
- Run the prefill server and decode server.
- send the request to the disaggregate prefill proxy

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
Shanshan Shen
84fc7402c3 [Misc] Refactor AscendMetaData Comments to Make It Clearer (#1967)
### What this PR does / why we need it?
Refactor the comments of `AscendMetaData` to make it clearer.

- vLLM version: v0.9.2
- vLLM main:
f3137cdd81

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-24 19:31:36 +08:00
wangxiyuan
846555cdb5 [Misc] Clean up uesless code in attention (#1933)
Before do attention module refactor, we can do some code cleanup to make
the next step easier.

What this PR does:

1. remove uesless `common_prefix_len` for attention builder
2. remove uesless `is_only_prefill` and `num_input_tokens` in attention
metadata.
3. remove `CommonAttentionMetadata` and ues `query_start_loc` instead,
`CommonAttentionMetadata` is over designed and uesless
4. update the attention backend input parameters to keep the same as
vLLM.
5. Rename attention name to the same style with `ASCEND` prefix

- vLLM version: v0.9.2
- vLLM main:
107111a859

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-24 10:23:34 +08:00
Mengqing Cao
3aa3b46bfe [V1][PP] Support pp with ray backend in V1 (#1800)
### What this PR does / why we need it?
Support pipeline parallel with ray backend in V1Engine.

Fixes #1751

### Does this PR introduce _any_ user-facing change?
Users could specify ray as distributed backend when inferencing with pp

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.9.2
- vLLM main:
32142b3c62

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-23 14:52:52 +08:00
wangxiyuan
7265dc090d [2/4][Refactor] Refactor torchair utils (#1892)
There is a lot torchair specified logic in common code. It results hard
code maintenance. We will create a new torchair module to launch
torchair related logic there. I plan to add 4 PR.

1. Refactor worker
2. Refactor utils (this PR)
- simple change that move all torchair related util function to torchair
module
3. Refactor model_runner
4. Refactor attention

- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-21 19:43:30 +08:00
wangxiyuan
a8b316ac5b [CI] Make AttentionBackend interface compatible to fix broken CI (#1893)
vLLM commit
752c6ade2e
removed `blocksparse_params` for attention backend. This PR does the
same change to make CI happy.


- vLLM version: v0.9.2
- vLLM main:
9499e26e2a

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-21 08:21:06 +08:00
Shanshan Shen
d08ff304cd [Misc][V0 Deprecation] Remove V0 Attention (#1835)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 14:10:13 +08:00
ApsarasX
0fc9b56d40 [Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1%
performance gain.

#### Before Improvement
Profiling
<img width="1147" alt="截屏2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>

Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================
```

#### After Improvement
Profiling
<img width="1200" alt="截屏2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>

Evaluation
```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?




- vLLM version: v0.9.2
- vLLM main:
65393ee064

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-11 08:51:17 +08:00
ApsarasX
643e6f5486 [Bugfix] Fix accuracy problem caused by mask pollution (#1678)
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.

The root cause of this problem is the use of in-place multiplication.
Modifying it to use out-of-place multiplication will resolve the
accuracy problem.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes.

- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 14:06:49 +08:00
wangxiyuan
392fd7239b [Misc] Add attention mask (#1673)
Move attention mark from V0 to common place.
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 09:12:03 +08:00
Angazenn
18495f44b2 [BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)
### What this PR does / why we need it?
This PR fixes a bug that is caused by max_num_tokens_across_dp
calculation. In earlier version, we compute this by graph_pad_size plus
max_num_tokens(actual). This will result in different
max_num_tokens_across_dp across dp ranks. If padding related is
required, this might cause a wrong padding.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed normally.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-07-07 20:03:02 +08:00
wangxiyuan
343955c7ac [CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625)
This commit
78fe77534b
from vllm reverted the change for FusedMoEParallelConfig

This PR do the same to fix the CI error

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-04 17:54:33 +08:00