Commit Graph

75 Commits

Author SHA1 Message Date
Apocalypse
71866d5311 [feature] chunkprefill support pcp&dcp (#3801)
### What this PR does / why we need it?
ChunkPrefill now can support Long Sequence Feature Pcp&Dcp

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI tests passed with self-test


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <3834144971@qq.com>
2025-11-11 09:18:02 +08:00
weiguihua2
1d7cb5880a [Bugfix]fix pcp dcp attn aclgraph (#4066)
### What this PR does / why we need it?
In the DCP-PCP graph mode scenario, there is a shape issue with multiple
batches. This PR fixes this problem.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-08 18:47:12 +08:00
drslark
23b785fdfb [Feat] Adapted mtp function to Qwen3-next (#3918)
### What this PR does / why we need it?

Adapts mtp function to Qwen3-next.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-07 16:39:03 +08:00
drslark
1804b60ec8 [BugFix][main] Adapted to torch_npu.npu_fused_infer_attention_score (#4025)
### What this PR does / why we need it?

Fixes a compatible bug with `torch_npu.npu_fused_infer_attention_score`
which is discribed in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 tells us this solution.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

The environment is same with this issue,
https://github.com/vllm-project/vllm-ascend/issues/4020.

We modify the code according to
https://github.com/vllm-project/vllm-ascend/pull/3918.

And run below codes:

```python
# run with Qwen3-next-mtp

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Outputs:

```text
Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
```

Now, `torch_npu.npu_fused_infer_attention_score` is compatible with
Qwen3-Next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-06 22:00:24 +08:00
weiguihua2
2eebe1dc0a [feat]decode convert bsnd to tnd and fix bug when pcp and dcp (#3980)
### What this PR does / why we need it?
1、in attention_v1 module, convert bsnd t0 tnd when pcp and dcp
2、fix tochair bug: service startup problem

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-06 14:58:24 +08:00
weiguihua2
5453033a41 revert TND modify when dcp pcp (#3948)
### What this PR does / why we need it?
1、revert TND modify when dcp pcp, which is introduced by
f57bdb09fc
2、deal aclgraph pad border issue

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-03 22:22:17 +08:00
wangxiyuan
cc2cd42ad3 Upgrade CANN to 8.3.rc1 (#3945)
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.

TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-03 20:21:07 +08:00
wangxiyuan
fcc9a0eaeb Update torch-npu version to 2.7.1 (#3896)
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-31 17:16:31 +08:00
pichangping
f57bdb09fc [long_seq_optim] BSND to TND and FA_UPDATE replacement (#3778)
### What this PR does / why we need it?
We have optimized the performance of long sequences:First,Modify the
input data format for attention calculation. Instead of using the
original BSND format, remove the logic for converting between TND and
BSND, and directly adopt the TND format.
The TND input format can be directly reused, which shortens the data
flow path. Converting to BSND is an unnecessary processing step.Second,
we switched the output update of the concatenated small operators to the
npu_attention_update fusion operator to improve performance.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: pichangping <1337510399@qq.com>
2025-10-29 09:33:35 +08:00
shiyuan680
00aa0bf33e support prefill cache mode use fia op (#3696)
### What this PR does / why we need it?
support prefill cache mode use fia op for full graph
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

origin
============ Serving Benchmark Result ============
Successful requests:                     30
Maximum request concurrency:             256
Request rate configured (RPS):           0.70
Benchmark duration (s):                  131.63
Total input tokens:                      61363
Total generated tokens:                  61440
Request throughput (req/s):              0.23
Output token throughput (tok/s):         466.77
Peak output token throughput (tok/s):    750.00
Peak concurrent requests:                30.00
Total Token throughput (tok/s):          932.95
---------------Time to First Token----------------
Mean TTFT (ms):                          125.17
Median TTFT (ms):                        121.51
P50 TTFT (ms):                           121.51
P90 TTFT (ms):                           140.91
P99 TTFT (ms):                           182.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.85
Median TPOT (ms):                        43.84
P50 TPOT (ms):                           43.84
P90 TPOT (ms):                           44.28
P99 TPOT (ms):                           44.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.85
Median ITL (ms):                         42.63
P50 ITL (ms):                            42.63
P90 ITL (ms):                            48.74
P99 ITL (ms):                            59.62
==================================================

after
============ Serving Benchmark Result ============
Successful requests:                     30
Maximum request concurrency:             256
Request rate configured (RPS):           0.70
Benchmark duration (s):                  130.10
Total input tokens:                      61363
Total generated tokens:                  61440
Request throughput (req/s):              0.23
Output token throughput (tok/s):         472.26
Peak output token throughput (tok/s):    750.00
Peak concurrent requests:                30.00
Total Token throughput (tok/s):          943.94
---------------Time to First Token----------------
Mean TTFT (ms):                          123.69
Median TTFT (ms):                        122.51
P50 TTFT (ms):                           122.51
P90 TTFT (ms):                           143.69
P99 TTFT (ms):                           165.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.07
Median TPOT (ms):                        43.13
P50 TPOT (ms):                           43.13
P90 TPOT (ms):                           43.50
P99 TPOT (ms):                           43.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.07
Median ITL (ms):                         41.81
P50 ITL (ms):                            41.81
P90 ITL (ms):                            48.11
P99 ITL (ms):                            62.13
==================================================

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-10-27 19:41:07 +08:00
weiguihua2
4312a92a4f [feat]dcp pcp support aclgraph (#3731)
### What this PR does / why we need it?
dcp pcp support  full aclgraph, including mla attention_v1

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-10-27 09:58:23 +08:00
Yizhou
8ab8111fde [Fix] Prevent memory leak in MLA decode graph (#3743)
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.

This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 20:37:33 +08:00
zzzzwwjj
e5676fc36e [main] remove dbo code (#3712)
### What this PR does / why we need it?
Remove codes of dbo.
Currently, vLLM has supported dbo with pr:
https://github.com/vllm-project/vllm/pull/23693.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-10-25 15:53:01 +08:00
Yizhou
3158742a97 [Refactor] Refactor Ascend attention implementation forward (#3714)
### What this PR does / why we need it?
This PR refactors the Ascend attention implementation to align with
vLLM's core interfaces, simplifying the code and improving
maintainability.

### Key Changes:

* **Align with vLLM's Attention Interface**: The `forward` method
signature in `AscendAttentionBackendImpl` now matches the base
`AttentionImpl` in vLLM, removing the custom `trace_flag`.

* **Enable Opaque Attention Operator**: By adding `opaque_attention_op`
to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its
standard `vllm.unified_attention_with_output` operator. This avoids the
need for a custom call path.

*   **Remove Obsolete Code**:
* The custom op `vllm.unified_ascend_attention_with_output` has been
deleted as it is now redundant.
* The `trace_flag` and its associated logic were removed, reducing code
complexity.
* An outdated quantization branch within the attention implementation
was cleaned up.

* **Improve Readability**: Renamed output variables (`output` vs.
`intermediate_output`) and added comments to clarify the in-place nature
of the attention output.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No extra tests needed.

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 08:58:35 +08:00
Mengqing Cao
cea0755b07 [1/N][Refactor] Refactor code to adapt with vllm main (#3612)
### What this PR does / why we need it?
This is the step 1 of refactoring code to adapt with vllm main, and this
pr aligned with
17c540a993

1. refactor deepseek to the latest code arch as of
17c540a993
 
2. bunches of fixes due to vllm changes
- Fix `AscendScheduler` `__post_init__`, caused by
https://github.com/vllm-project/vllm/pull/25075
- Fix `AscendScheduler` init got an unexpected arg `block_size`, caused
by https://github.com/vllm-project/vllm/pull/26296
- Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by
https://github.com/vllm-project/vllm/pull/23485
- Fix `MLAAttention` import,caused by
https://github.com/vllm-project/vllm/pull/25103
- Fix `SharedFusedMoE` import, caused by
https://github.com/vllm-project/vllm/pull/26145
- Fix `LazyLoader` improt, caused by
https://github.com/vllm-project/vllm/pull/27022
- Fix `vllm.utils.swap_dict_values` improt, caused by
https://github.com/vllm-project/vllm/pull/26990
- Fix `Backend` enum import, caused by
https://github.com/vllm-project/vllm/pull/25893
- Fix `CompilationLevel` renaming to `CompilationMode` issue introduced
by https://github.com/vllm-project/vllm/pull/26355
- Fix fused_moe ops, caused by
https://github.com/vllm-project/vllm/pull/24097
- Fix bert model because of `inputs_embeds`, caused by
https://github.com/vllm-project/vllm/pull/25922
- Fix MRope because of `get_input_positions_tensor` to
`get_mrope_input_positions`, caused by
https://github.com/vllm-project/vllm/pull/24172
- Fix `splitting_ops` changes introduced by
https://github.com/vllm-project/vllm/pull/25845
- Fix multi-modality changes introduced by
https://github.com/vllm-project/vllm/issues/16229
- Fix lora bias dropping issue introduced by
https://github.com/vllm-project/vllm/pull/25807
- Fix structured ouput break introduced by
https://github.com/vllm-project/vllm/issues/26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-10-24 16:55:08 +08:00
LookAround0301
b54d44e664 support cp&dcp (#3260)
### What this PR does / why we need it?
This PR adds the Prefill Context Parallelism (PCP) feature, which
corresponds to DCP. For specific implementation details, please refer to
the RFC https://github.com/vllm-project/vllm/issues/25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning
the sequence dimension during the prefill stage.
### Does this PR introduce _any_ user-facing change?
The current implementation primarily includes the following changes:

Modified ModelRunner.py for CP partitioning logic for tokens;
Modified attention_v1.py and mla_v1.py to adapt the GQA/MLA backend to
PCP.
Modified block_tables.py to extend the KV cache storage based on
DCP&PCP;
Added necessary command-line arguments to control parallelism for PCP;
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: chenjie <chenjie137@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: Feng Liu <liufeng248@huawei.com>
Signed-off-by: gaojc <1055866782@qq.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Signed-off-by: z50049692 <zhangmingwei11@huawei.com>
Co-authored-by: chenjie <chenjie137@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Co-authored-by: Feng Liu <liufeng248@huawei.com>
Co-authored-by: gaojc <1055866782@qq.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: z50049692 <zhangmingwei11@huawei.com>
Co-authored-by: w00896881 <wangzixuan40@huawei.com>
2025-10-24 10:32:01 +08:00
ZYang6263
b9e2896eb1 Revert "[Perf] Add FIA interface in FA case" (#3553)
Reverts vllm-project/vllm-ascend#3321
The output dimension mismatch and accuracy issue
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-20 19:56:10 +08:00
ZYang6263
1e78ecbad6 [Perf] Add FIA interface in FA case (#3321)
### What this PR does / why we need it?

Add new npu_fused_infer_attention_score op to improve perfomance in
flash attention case.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-19 12:45:33 +08:00
Angazenn
9547d6f0d9 [Core]Append padding logic for Attention (#3256)
### What this PR does / why we need it?

This PR aims to add padding logic to seq_lens、block_tables when running
in full decode scenario. Before this PR, the number of input tokens with
padding might exceeds corresponding seq_lens. For example, when running
in full decode scenario:

```
input_ids : [1, 3, 0, 0]
seq_lens: [2, 1]
query_start_loc: [0, 1, 2]
```
Here, `input_ids` is padded by 2 tokens while
`seq_lens`/`query_start_loc` are not. The mismatch between `input_ids`
and `seq_lens`/`query_start_loc` might cause some potential bugs. This
PR would change it into :

```
input_ids : [1, 3, 0, 0]
seq_lens: [2, 1, 1, 1]
query_start_loc: [0, 1, 2, 3, 4]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-10-17 21:56:01 +08:00
Angazenn
d9ee491f70 [BugFix]Move to_list in foward_v1 with FIA earlier to build (#3185)
### What this PR does / why we need it?
The current implementation of FIA will introduce an `to_list` operation
for actual_seq_lengths_q and seq_lens,which comsumes extra time. These
operation can be moved earlier into `build` operation of attention
metadata.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-10-17 11:19:41 +08:00
huangdong2022
3a53bbc508 [Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with bias, resolve conflict with weight prefetch (#3465)
### What this PR does / why we need it?
1.qwen3 moe uses add_rms_norm_quant op instead of 'add_rms_norm op and
quant op' during quantization scene.
2.torch_npu.add_rms_norm_quant op fixed accuracy while model weights is
quantized by anti_method m4, m4 quantization is asymmetric outlier
suppression method, it will generate none-zero norm bias,
add_rms_norm_quant op updated to add this parameter to calculate.
3. add torch-npu check

### Does this PR introduce _any_ user-facing change?
new feature works if torch_npu version >= torch_npu-2.7.1.dev20250919

### How was this patch tested?
1.no special parameters to set, no new envs to set. new feature works if
torch_npu version >= torch_npu-2.7.1.dev20250919
2.use qwen3 moe quantization model to test ,such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8,
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: h30027576 <huangdong51@huawei.com>
2025-10-17 09:30:51 +08:00
xuyexiong
02c26dcfc7 [Feat] Supports Aclgraph for bge-m3 (#3171)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
to start an online server with bs 10, each batch's seq length=8192, we
set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For bs10, each batch's seq length=8192, QPS is improved from 85 to 104,
which is a 22% improvement, lots of host bound is reduced.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
2025-10-14 23:07:45 +08:00
XiaoxinWang
9eb62935b8 fix pagedattention to support fullgraph. (#3436)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-14 16:10:09 +08:00
无脸男
ace300a549 [Bugfix] Fix the abnormal NPU memory usage in full graph mode. (#3331)
### What this PR does / why we need it?

In the full graph mode, since paged attention operators updates are
required, the parameters of this operators needs to be retained.
However, the tensor such as query、key cache、value cache, does not need
to be persistently saved, and we can manually release this space by
`weak_ref_tensor` to save the memory.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: WithHades <244036962@qq.com>
2025-10-11 10:20:10 +08:00
panchao-hub
1756efa5fd [Feat][Graph]Support FULL_DECEDE_ONLY mode for MLA models (#3125)
### What this PR does / why we need it?
Adds support for capturing the Multi-Layer Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.

Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.

Known issues:
- Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're
working on a fix
- We are preparing to remove update_mla_attn_params with
auto_dispatch_capture

### Does this PR introduce _any_ user-facing change?
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-10 16:31:20 +08:00
wangxiyuan
ba19dd3183 Revert PTA upgrade PR (#3352)
we notice that torch npu 0919 doesn't work. This PR revert related
change which rely on 0919 version.
Revert PR: #3295  #3205  #3102 

Related: #3353

- vLLM version: v0.11.0
2025-10-10 14:09:53 +08:00
XiaoxinWang
579b7e5f21 add pagedattention to support FULL_DECODE_ONLY. (#3102)
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?


- vLLM version: v0.11.0

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-10-10 08:50:33 +08:00
wangxiyuan
ac1c2cd9ac [CI] Upgrade vllm version - 0925 (#3167)
Upgrade vLLM to newest commit.

1. Remove the useless func get_state_cls, it has been removed from vLLM
already.
e6750d0b18
2. Fix ut broken by
6160ba4151


- vLLM version: v0.10.2
- vLLM main:
b1068903fd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-25 14:20:10 +08:00
weijinqian0
6aa4253798 [Refactor] [SP]The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085)
What this PR does / why we need it?

there are two sets of sp implementations for moe and dense models. One
is called sequence_parallelism, and the other is flashcomm_v1.
We did the following things:

Merge two sets of code with the same implementation into one.
Remove the implementation of sequence_parallelism, as this solution
cannot support aclgraph.
Does this PR introduce any user-facing change?

No

How was this patch tested?

e2e&ut

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-24 11:29:59 +08:00
lidenghui1110
0f3939e5a9 [Feature]cpu offload connector (#1659)
This PR implements cpu offload connector to enable NPU kv cache offload
to host DRAM.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
Co-authored-by: AlvisGong <gwly0401@163.com>
2025-09-23 14:25:05 +08:00
Yizhou
39a85c49fa [Refactor] Rename cudagraph_support to aclgraph_support (#3104)
### What this PR does / why we need it?
Updates the `cudagraph_support` attribute to `aclgraph_support` to use
terminology appropriate for the Ascend platform (ACL graphs instead of
CUDA graphs).

This change also explicitly disables graph support for the MLA attention
backend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-23 11:30:31 +08:00
Yizhou
3fa7cf6345 [Refactor][Graph] Move graph parameter logic to acl_graph module (#3101)
### What this PR does / why we need it?
This is the follow-up PR of #2128 .

Moves graph parameter management components, including `GraphParams`,
`get_graph_params`, and `set_graph_params`, from the generic `utils.py`
to the more specific `compilation/acl_graph.py`.

Additionally, extracts the `update_attn_params` logic from the
`NPUModelRunner` class into a standalone function within the `acl_graph`
module.

This refactoring improves code organization by centralizing ACL
graph-related logic into its own dedicated module, enhancing modularity
and clarity.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 22:23:14 +08:00
Yizhou
338231acaf [Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Captureing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 17:14:28 +08:00
tianyitang
f1f2c8f5e5 [Perf] Add new npu_fused_infer_attention_score op to improve perfomance in splitfuse cases and resolve long-seq mask problems (#2962)
### What this PR does / why we need it?
Add new npu_fused_infer_attention_score op to improve perfomance in
splitfuse cases and resolve long-seq mask problems .

1. The original op's performance is suboptimal in certain scenarios,
necessitating optimization through the _new op_
(npu_fused_infer_attention_score)。
2. For ultra-long sequences (128k), the original operator will allocate
a large attn_mask, which consumes excessive CPU memory. In contrast, the
_new op_ supports a fixed-size compressed mask, effectively resolving
this issue.

NOTE1: The current PR retains the original logic and uses a version
check of the CANN package to determine whether the _new op_ can be
enabled. This ensures no impact on existing users. In future versions,
this version check and the original logic will be deprecated, and the
_new op_ scheduling will be uniformly adopted.
NOTE2: This pr relies on future CANN version, which is not available
now.
NOTE3: To enable the new op in chunked prefill, the parameter
additional_config should be set like `--additional-config
'{"ascend_scheduler_config":
{"enabled":true,"enable_chunked_prefill":true}}' \` at least.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed




- vLLM version: v0.10.2
- vLLM main:
6c5f82e5aa

---------

Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
2025-09-22 14:56:14 +08:00
zhangxinyuehfad
a22b532d38 [Fixbug] Fix shape not match when sliding_window and dynamic batch_size (#2830)
### What this PR does / why we need it?
Fix shape not match when test LLM-Research/Phi-4-mini-instruct accuarcy 

### Does this PR introduce _any_ user-facing change?

Users can't set dynamic batch_size or use lm_eval test accuracy when
using models(sliding_window)

### How was this patch tested?
accuarcy of LLM-Research/Phi-4-mini-instruct is ok :
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```


- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-19 22:35:14 +08:00
wangxiyuan
c556038ef0 [New model] Qwen3-next support (#2917)
### What this PR does / why we need it?
Add Qwen3-next support.

### Does this PR introduce _any_ user-facing change?
Yes, users can use Qwen3 next.
Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the
tutorial will be ready in
[here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html)

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884

Co-Authored-By: Angazenn <supperccell@163.com>
Co-Authored-By: zzzzwwjj <1183291235@qq.com>
Co-Authored-By: MengqingCao <cmq0113@163.com>
Co-Authored-By: linfeng-yuan <1102311262@qq.com>
Co-Authored-By: hust17yixuan <303660421@qq.com>
Co-Authored-By: SunnyLee219 <3294305115@qq.com>
Co-Authored-By: maoxx241 <maoxx241@umn.edu>


- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: hust17yixuan <303660421@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: hust17yixuan <303660421@qq.com>
2025-09-16 01:17:42 +08:00
Marco Barletta
6666e5265d Added support for KV connector v1 (#2039)
### What this PR does / why we need it?
- This PR adds the support for the KV connector interface in the V1
architecture, in the same way as vllm. Vllm-ascend currently lacks of
this support, required to support also layerwise management of KV
caches.

- The connector interface allows using external tools and integrate them
with vllm

### Notes:
We are aware of Issue #684 , however that issue does not modify the
attention classes as necessary to perform layerwise management of KV
caches required for connectors like LMCache.

The implementation of this PR ported the necessary code from the vanilla
vllm. The KV connector API is the same as vanilla vllm, supporting the
standard KV connector API.

EDIT: this PR was re-implementing part of the changes merged one hour
before this PR was made on the file model_runner_v1.py. I solved the
conflicts by removing any modification to the model_runner_v1 file,
which now are largely already merged in main. Now this PR is left for
the modifications to the attention_v1 file.

### Does this PR introduce _any_ user-facing change?
The PR does not modify current APIs, but it extends the behavior of
current worker runner and attention classes to save and load KV caches.
In absence of connectors, the behavior should stay untouched.

### How was this patch tested?
- No unit test implemented yet for the worker.

- Tested together with LMCache using
https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py
with the following models:
1 Deepseek-R1-Distill-Qwen-1.5B
2 Qwen3-30B-A3B
3 Deepseek-v2-lite
4 Llama-3.1-8B
LMCache used in both layerwise and non-layerwise mode.

- Performed LMEval on LMCache integrated with vllm-ascend.

Results without LMCache on Qwen3-8B:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8400|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8355|± |0.0102|
 
Results with LMCache Layerwise:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8385|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8332|± |0.0103|


- vLLM version: v0.10.1.1
- vLLM main:
50fede6634

---------

Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
2025-09-08 09:04:22 +08:00
yeyifan
b2f77d3aa8 [fix] prefill unsupport sliding window attention (#2758)
### What this PR does / why we need it?
fix prefill attention bug,not support sliding window.
npu_fused_infer_attention_score head_dim only equal 128, not support
other number.
### Does this PR introduce _any_ user-facing change?
remove prefill phase npu_fused_infer_attention_score
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-09-07 10:34:38 +08:00
yeyifan
1191a64ae5 [Feat]attention add sliding windows size (#2528)
### What this PR does / why we need it?
Add a sliding window size parameter to attention
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Regarding the `Gemma3` model, set
additional_config={"ascend_scheduler_config": {"enabled":True}}, only
support AscendScheduler
test commond:`python3 -m vllm.entrypoints.openai.api_server --model
gemma3 --additional-config
'{"ascend_scheduler_config":{"enabled":true}}'`


- vLLM version: v0.10.1.1
- vLLM main:
6578e87365

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-08-28 10:37:19 +08:00
Mengqing Cao
1327f9be1c Fix some ci issue and refactor modelrunner (#2445)
### What this PR does / why we need it?
Fix some ci issue and refactor modelrunner

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
4d9c61993a

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
2025-08-20 09:01:04 +08:00
Shanshan Shen
83e0f41408 [3/N][Refactor] Move torchair_attention to torchair dir (#2017)
### What this PR does / why we need it?

1. Move `torchair_attention` to `torchair` dir.
2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend`
to reduce duplicate methods.
3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce
duplicate properties.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
0933f9d518

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-19 10:25:22 +08:00
Shanshan Shen
55d0790597 [2/N][Refactor] Refactor V1 attention for better extensibility (#1995)
### What this PR does / why we need it?

Refactor V1 Attention for better extensibility (prepared for torchair
attention refactor).

**Main changes:**
- Move different kinds of foward into their method respectively, e.g.,
`_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`,
`_forward_decode_only()`, `_forward_v1_style()`.

### Does this PR introduce _any_ user-facing change?

No.

- vLLM version: v0.10.0
- vLLM main:
14a5d903ab

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:32:41 +08:00
lbk-sys
c611291661 【main】SP For Qwen3 MoE (#2209)
### What this PR does / why we need it?
Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2,
replacing AllReduce with Reduce-Scatter and AllGather achieves
computational benefits in norm operations while saving one AllGather
communication. This feature is enabled during the P-phase and delivers
notable gains in long-sequence scenarios (e.g., 16k–25k), with
performance improvements reaching 5%–10%.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
``` 
compilation_config={
    "pass_config":{
        "enable_sequence_parallelism": True
    }
},
enable_expert_parallel=True,
```

- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: libaokui <libaokui@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-08-07 09:15:49 +08:00
Li Wang
2284289880 [MISC] Cherry pick #1291 from v0.9.1-dev (#1825)
### What this PR does / why we need it?
Cherry pick #1291 from v0.9.1-dev, This pr implement the synchronization
of whether `dbo` is enabled across all dp ranks. specifically, it
performed allreduce op across multiple DP ranks, only when all the dp
rank is `enable_dbo`, it is enabled

Co-authored-by: shikang-hangzhou <459956190@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-01 09:08:45 +08:00
wangxiyuan
4a008c4dac [Misc]Clean up useless import from vllm (#2049)
Clean up useless  import from vllm to make code more clear.

- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 16:01:59 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

A refactoring of forward_context and model_runner_v1, add some context
which is necessary in model inference into forward_context, and refactor
dummy_run logic, make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update mc2_v2 op, and support `active_mask` param;
Update scripts in examples dir;
refactor `dummy_run` logic;
Add soc_version for A2 and A3;

### Does this PR introduce _any_ user-facing change?

No change at user-facing.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopt `LLMDataDist` for kv cache register and `pull_blocks`
style disaggregate prefill implementation. The interface implementation
mainly follows the design of NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be test with the following step:
- Generate the rank table for all machine.
- execute`toy_proxy.py` to launch the disaggregate prefill proxy server,
specify the prefill ip, port and the decode ip, port
- Run the prefill server and decode server.
- send the request to the disaggregate prefill proxy

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
Shanshan Shen
84fc7402c3 [Misc] Refactor AscendMetaData Comments to Make It Clearer (#1967)
### What this PR does / why we need it?
Refactor the comments of `AscendMetaData` to make it clearer.

- vLLM version: v0.9.2
- vLLM main:
f3137cdd81

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-24 19:31:36 +08:00
wangxiyuan
846555cdb5 [Misc] Clean up uesless code in attention (#1933)
Before do attention module refactor, we can do some code cleanup to make
the next step easier.

What this PR does:

1. remove uesless `common_prefix_len` for attention builder
2. remove uesless `is_only_prefill` and `num_input_tokens` in attention
metadata.
3. remove `CommonAttentionMetadata` and ues `query_start_loc` instead,
`CommonAttentionMetadata` is over designed and uesless
4. update the attention backend input parameters to keep the same as
vLLM.
5. Rename attention name to the same style with `ASCEND` prefix

- vLLM version: v0.9.2
- vLLM main:
107111a859

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-24 10:23:34 +08:00