1194 Commits

Author SHA1 Message Date
realliujiaxu
6bc770cd78 [Perf] fix async copy for async scheduling (#4113)
### What this PR does / why we need it?
Only CPU tensors with `pin_memory=True` can be asynchronously copied to
the device. Currently, there are two instances where non-pinned CPU
tensors are being copied to the device, which will trigger synchronous
operations, reducing the expected benefits of asynchronous scheduling.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-13 09:11:26 +08:00
22dimensions
c272747d13 Upgrade to 0.11.1 newest vllm commit (#3982)
### What this PR does / why we need it?
Adapt the vllm-ascend main branch to vLLM releases/v0.11.1.

fix `forward context not set` in test_vlm.py caused by:
https://github.com/vllm-project/vllm/pull/23207

fix import `cdiv round` failed caused by:
https://github.com/vllm-project/vllm/pull/27188

fix import `init_cached_hf_modules` failed caused by:
https://github.com/vllm-project/vllm/pull/27567

adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused
by: https://github.com/vllm-project/vllm/pull/27654
- remove unused code in sigmoid_gating.py
- `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`,
`fused_recurrent_gated_delta_rule_fwd`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI 


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-11-12 23:01:19 +08:00
Angazenn
fc7e5cd9dc [main][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4097)
### What this PR does / why we need it?
Currently, we set `seq_lens` in the dummy attn_metadata to
`max_model_len` to get the maximum attention workspace during capturing.
However, always setting it to `max_model_len` causes dummy_run
to execute a long attention when running actual inference. For example,
if there is a single request with `seq_lens` of [8] but `max_model_len` is
131072, the whole process is slowed down by dummy_run because it executes a
fake long-sequence attention. Therefore, we set it to `max_query_len` instead,
which is also consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-12 17:31:39 +08:00
zhangsicheng5
a123f355e9 [feature] support pcp + mtp (in pd co-locate scenario) (#4098)
1. Support pcp + mtp in the pd co-locate scenario
2. llmdatadist connector: pcp-related bugfix and code cleanup

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-11-12 17:22:21 +08:00
XiaoxinWang
1b4ce63ec9 fix fullgraph in ds. (#4016)
### What this PR does / why we need it?
DS doesn't have an 'AscendAttentionMetadataBuilder' class, so it fails in
fullgraph mode.
We resolved the issue by modifying the code to check only for
'GDNAttentionMetadataBuilder', while all other attention cases follow
the default branch.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-12 10:11:43 +08:00
Yizhou
638dbcdb32 [Perf] Remove D2H operations to improve performance (#4063)
### What this PR does / why we need it?
Replace masked in-place assignment with a device-side torch.where so
selection stays on-device, allowing subsequent device ops to be enqueued
earlier and removing an implicit D2H sync, reducing latency by several
hundred μs on Ascend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-12 09:08:55 +08:00
thonean
e38fe92f40 [Misc][Doc] Add service profiling feature with user guide (#3756)
### What this PR does / why we need it?
To support the data collection capabilities of msServiceProfiler on the
vLLM-ascend framework and enable customization of data collection points
via a configuration file, a default profiling configuration has been added
to vllm-ascend, facilitating debugging and optimization for developers
and users.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: minghangc <29514143@qq.com>
2025-11-12 09:07:14 +08:00
zzhxxx
46a41b26d3 oproj TP support acl graph (#4073)
### What this PR does / why we need it?
Refer to #2167; oproj TP now supports ACL graph.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-11-11 19:39:06 +08:00
wangxiyuan
f811a24bf0 Remove VLLM_USE_V1 (#4086)
Drop VLLM_USE_V1 usage.  This env has been removed from vLLM already.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 15:43:39 +08:00
Apocalypse
71866d5311 [feature] chunkprefill support pcp&dcp (#3801)
### What this PR does / why we need it?
ChunkPrefill now supports the long-sequence features PCP and DCP.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI tests passed with self-test


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Apocalypse990923-qshi <qiushixu@usc.edu>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <3834144971@qq.com>
2025-11-11 09:18:02 +08:00
zhaomingyu13
7ffbe73d54 [main][Bugfix] Fix ngram precision issue and open e2e ngram test (#4090)
### What this PR does / why we need it?
Fix ngram precision issue and open e2e ngram test

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-11-11 09:06:24 +08:00
Icey
e04a87f4be [BugFix] Fixes Qwen3-Next enable nz accuracy problem (#4058)
### What this PR does / why we need it?
- Fixes Qwen3-Next enable nz accuracy problem

### Does this PR introduce _any_ user-facing change?
N/A


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-11-10 20:54:57 +08:00
rjg-lyh
a1558b99c2 [Core] Restore scheduling logic under default configuration (#3967)
### What this PR does / why we need it?
This PR reverts the changes introduced in PR #2894. Initially, due to
performance issues with the older version of the chunked prefill ops,
the default behavior was to use the Ascend scheduler to disable the
chunked prefill feature. However, with the improvements in the
performance of the new chunked prefill ops, this interception strategy
has been removed. This change also aligns with the community's default
configuration behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-11-10 17:48:56 +08:00
Levi
0a62e671fb [Feat] flashcomm_v2 optim solution (#3232)
### What this PR does / why we need it?
Supports generalized FlashComm2 optimization, which reduces
communication overhead, decreases RmsNorm computation, and saves one
AllGather step by replacing Allreduce operations in the Attention module
with pre-AlltoAll and post-AllGather operations (used in combination
with FlashComm1). This feature is enabled during the Prefill phase and
is recommended to be used together with FlashComm1, delivering broad
performance improvements, especially in long sequence scenarios with
large tensor parallelism (TP) configurations. Benchmark tests show that
under TP16DP1 configuration, it can improve the prefill performance of
the DeepSeek model by 8% on top of FlashComm1.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zzhxx <2783294813@qq.com>
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: zzhxx <2783294813@qq.com>
2025-11-10 11:01:45 +08:00
lilinsiman
a3ff765c65 [Info][main] Corrected the errors in the information (#4055)
### What this PR does / why we need it?
Corrected the errors in the information

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-11-08 18:48:59 +08:00
weiguihua2
1d7cb5880a [Bugfix]fix pcp dcp attn aclgraph (#4066)
### What this PR does / why we need it?
In the DCP-PCP graph mode scenario, there is a shape issue with multiple
batches. This PR fixes this problem.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-08 18:47:12 +08:00
hucong
48094148f8 [BugFix] Improve the performance of prefixcache features (#4022)
### What this PR does / why we need it?
A bug in the code caused an empty bubble. When the npu_paged_cache_load
operator was called, it forcibly transferred seq_len2 to the device,
which triggered synchronization and interrupted the CPU operator's
launch stream.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-11-08 18:45:31 +08:00
zxr2333
1d81a289d0 [P/D][BugFix]Fix proxy format processing errors & Layerwise connector performance optimization (#4043)
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-11-08 18:44:06 +08:00
wangx700
24d6314718 [Bugfix] fix sleepmode level2 e2e test (#4019)
### What this PR does / why we need it?

Enable the sleepmode level2 e2e test and add check logic to ensure that
nz is not enabled.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

use e2e tests


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangx700 <wangxin700@huawei.com>
2025-11-08 14:11:55 +08:00
offline893
e687d6af85 [BugFix]Fix group list type of mc2. (#4047)
### What this PR does / why we need it?
Fix the accuracy problem of eplb caused by the PTA upgrade.

### How was this patch tested?
Main:

baseline:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |

EPLB:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-11-07 17:41:56 +08:00
drslark
23b785fdfb [Feat] Adapted mtp function to Qwen3-next (#3918)
### What this PR does / why we need it?

Adapts mtp function to Qwen3-next.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-07 16:39:03 +08:00
LookAround0301
79e536d939 [Feat] update op for mla (#4000)
### What this PR does / why we need it?
1. In the mla_v1 module, add the torch_npu.npu_attention_update op when pcp and dcp are enabled

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: LookAround <lixushi@huawei.com>
2025-11-07 09:48:39 +08:00
LookAround0301
f8610b7d67 [long_seq] fix A2 accuracy problem (#4030)
### What this PR does / why we need it?
1. Update prepare_finalize.py: fix the A2 accuracy problem when pcp and dcp are enabled

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: LookAround <lixushi@huawei.com>
2025-11-07 09:29:33 +08:00
Angazenn
e0d58d543b [main][bugfix] Fix a rare bug triggered by _npu_paged_attention in FULL_DECODE_ONLY mode (#3986)
### What this PR does / why we need it?
This PR fixes a bug where the workspace of `_npu_paged_attention` allocated
at setup is smaller than the one needed at execution. In the current
implementation of FULL_DECODE_ONLY with `_npu_paged_attention`, we call
`_npu_paged_attention_get_workspace` when capturing, with `max_model_len`
as `seq_lens`. This assumes that PA with larger `seq_lens` inputs needs
a larger workspace than PA with smaller `seq_lens`. However, there are rare
cases where PA with smaller `seq_lens` needs a larger workspace. So I add
`get_workspace` directly into `update_attn_params`.
This change might introduce a small (≈1%) performance degradation for low
num_tokens (such as 1) in the decode phase, and there are no other known
memory issues. So I think this change is acceptable. We can remove this
if a new attention op (such as `npu_fused_infer_attention_score`) does not
have such problems.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-06 23:08:07 +08:00
drslark
1804b60ec8 [BugFix][main] Adapted to torch_npu.npu_fused_infer_attention_score (#4025)
### What this PR does / why we need it?

Fixes a compatibility bug with `torch_npu.npu_fused_infer_attention_score`,
which is described in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 suggested this solution.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

The environment is the same as in this issue,
https://github.com/vllm-project/vllm-ascend/issues/4020.

We modify the code according to
https://github.com/vllm-project/vllm-ascend/pull/3918.

Then run the code below:

```python
# run with Qwen3-next-mtp
from vllm import LLM, SamplingParams

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Outputs:

```text
Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
```

Now, `torch_npu.npu_fused_infer_attention_score` is compatible with
Qwen3-Next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-06 22:00:24 +08:00
realliujiaxu
22005c64c1 [Bugfix] Add constraints for sequence parallelism (#4014)
### What this PR does / why we need it?
Add constraints for sequence parallelism for unsupported scenarios:
1. tp_size > 1
2. enable_expert_parallel must be True for MoE model

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-06 20:02:03 +08:00
weiguihua2
2eebe1dc0a [feat]decode convert bsnd to tnd and fix bug when pcp and dcp (#3980)
### What this PR does / why we need it?
1. In the attention_v1 module, convert bsnd to tnd when pcp and dcp are enabled
2. Fix torchair bug: service startup problem

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-06 14:58:24 +08:00
Liziqi-77
25b24c02ea [Feat](Mooncake) Supports multiple input suffixes for global_segment_size (#3690)
### What this PR does / why we need it?
- global_segment_size and local_buffer_size use constants for unified
management.
- Newly added support for input formats ending with GB, MB, KB, and B,
while being compatible with existing input methods.
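
As a rough illustration of the suffix handling described above, the sketch below converts such inputs to a byte count. The helper name `parse_size`, the binary multipliers (1 KB = 1024 bytes), and the case-insensitive matching are assumptions made for illustration; they are not taken from the code added by this PR.

```python
import re

# Assumption: binary multiples; the PR may define the units differently.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(GB|MB|KB|B)?\s*$", re.IGNORECASE)


def parse_size(value):
    """Convert an int, or a string such as '4GB', into a byte count."""
    if isinstance(value, int):
        return value  # existing input style: a plain number of bytes
    match = _SIZE_RE.match(str(value))
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    # A missing suffix is treated as bytes.
    return int(float(number) * _UNITS[(unit or "B").upper()])
```

Plain integers and digit-only strings pass through as byte counts, which is how the sketch stays compatible with the older input style.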

### Does this PR introduce _any_ user-facing change?
- Users can use new input methods
- The documentation has also been modified

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: 李子琦 <liziqi_ing@163.com>
2025-11-06 14:48:15 +08:00
zxr2333
b206e831e9 [P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3981)
### What this PR does / why we need it?
Make kv-transfer env variable take effect and Fix load-balance proxy.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-11-06 12:02:47 +08:00
XiaoxinWang
738bf2b720 support qwen3-next full_decode_only mode. (#3949)
### What this PR does / why we need it?
support qwen3-next full_decode_only mode. 
bs=1, max_token=1024
| branch| tps| e2e time|
| --- | --- | --- |
|piecewise  |3.06  | 8.15 |
|fulldecodeonly | 7.2 | 3.47 |

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-05 08:46:05 +08:00
Mengqing Cao
5fed166a99 [ModelRunner][Refactor] Refactor kv cache tensor initialization logic (#3106)
### What this PR does / why we need it?
Refactor kv cache tensor initialization logic. 
1. Unify the kvcache tensor initialization logic of deepseek and normal
models
2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors`
and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.
1. prefill disaggregation scenario
4. deepseek + aclgraph/eager mode
5. qwen3 next


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-11-04 17:26:54 +08:00
realliujiaxu
bedf223771 [Perf] move quant before allgather in Allgather EP (#3420)
### What this PR does / why we need it?
Move quant before allgather in Allgather EP. Relies on
https://github.com/vllm-project/vllm-ascend/pull/3334

Deepseek R1 W8A8 performance on A2 with
`HCCL_ALGO="level0:NA;level1:pipeline"`:
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms)  this PR |
|----------|----------|----------|
| 4k   |  375.21  | 364.99   |
| 16k  | 1465.23   | 1421.75  |
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-11-04 16:49:58 +08:00
zxr2333
15bb5098ad [PD Disaggregation]Set adxl engine as default backend and update README (#3761)
### What this PR does / why we need it?
Set adxl engine as the default Mooncake backend, because Ascend
Transport is no longer maintained.
Update README to include instructions for installing the adxl backend
Mooncake.
### Does this PR introduce _any_ user-facing change?
Users need to compile and install the mooncake backend for adxl
according to the revised README instructions.
### How was this patch tested?
By CI.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-11-04 16:06:39 +08:00
whx
e9bb4491ec [BugFix] Fix deepseek v3.2 mtp bug. (#3900)
### What this PR does / why we need it?
This PR fixes deepseek v3.2 mtp bug.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
All existing CI tests should pass.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-11-04 14:06:59 +08:00
Shanshan Shen
40c7db6559 [MM][Bugfix] Add MoE verification for multi-modal models (#3897)
### What this PR does / why we need it?

Fix #3891.

The empty `moe_comm_method` in the above issue is caused by an incorrect
check for MoE models. Specifically, the method `is_moe_model` only
checks whether a text-only model is a MoE model, without considering
multi-modal models, e.g., `VL` and `Omni`.

Check the config dict recursively to find whether it has a key that
contains "expert", without checking the model architecture.

Note that we can't identify a MoE model by whether it contains a
`FusedMoE` module, because `is_moe_model` is called before model loading
in some places, e.g., when updating the ACLGraph config in
platform initialization.
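
A minimal sketch of the recursive key check described above, assuming the model config is available as nested dicts and lists. The function name `has_expert_key` and the sample config keys are hypothetical and are not taken from the PR.

```python
def has_expert_key(config):
    """Return True if any key, at any nesting depth, contains 'expert'."""
    if isinstance(config, dict):
        for key, value in config.items():
            if "expert" in str(key).lower() or has_expert_key(value):
                return True
        return False
    if isinstance(config, (list, tuple)):
        return any(has_expert_key(item) for item in config)
    return False  # scalars carry no keys


# Hypothetical configs: a dense text model and a multi-modal MoE model
# whose expert settings sit inside a nested sub-config.
dense_cfg = {"hidden_size": 4096, "num_hidden_layers": 32}
multimodal_moe_cfg = {
    "vision_config": {"hidden_size": 1024},
    "text_config": {"num_experts": 64, "hidden_size": 2048},
}
```

Because the check only reads the config, it can run before any model weights or modules are loaded.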

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-04 09:16:19 +08:00
weiguihua2
5453033a41 revert TND modify when dcp pcp (#3948)
### What this PR does / why we need it?
1. Revert the TND modification for dcp/pcp, which was introduced by
f57bdb09fc
2. Deal with the aclgraph pad border issue

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-03 22:22:17 +08:00
wangxiyuan
cc2cd42ad3 Upgrade CANN to 8.3.rc1 (#3945)
### What this PR does / why we need it?
This PR upgrades CANN from 8.2rc1 to 8.3rc1 and removes the CANN version
check logic.

TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base
image for UT is still 8.2. We'll fix it later.


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-03 20:21:07 +08:00
zouyida2052
ec98320285 correct bug to fix the value of max_num_tokens (#3933)
### What this PR does / why we need it?
correct bug to fix the value of max_num_tokens

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-11-03 14:17:51 +08:00
1Fire4
0b9b6d79fe [Feat][UT] Support Deepseekv32 FULL_DECODE_ONLY mode and add unit test of sfa_v1 (#3763)
### What this PR does / why we need it?
- Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode.
- Add unit test for sfa_v1.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
2025-11-03 10:02:47 +08:00
XiaoxinWang
d4c75088a0 [Perf] Move attention update stream out of loop to optimize performance (#3848)
### What this PR does / why we need it?
In the `update_*attn_params` functions, the
`torch.npu.stream(update_stream)` context manager was previously located
inside the for-loop that updates parameters for each layer. This
resulted in redundant stream initiations for every layer, adding
unnecessary overhead.

This commit refactors the code by moving the stream context manager to
wrap the entire for-loop. This ensures that the update stream is
initiated only once per function call, rather than for each layer. This
change saves 90us in each decode model run.
update stream in every layer:
<img width="1720" height="383" alt="image"
src="https://github.com/user-attachments/assets/70e4cb69-5bc1-4180-a67d-c99132134be6"
/>

remove update stream in every layer:
<img width="1269" height="175" alt="image"
src="https://github.com/user-attachments/assets/0e290edb-b0ce-48fe-b032-1b924ade6ae5"
/>
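
The refactor can be sketched in plain Python. `CountingStream` below is a hypothetical stand-in for the `torch.npu.stream(update_stream)` context; it only counts how often the context is entered and does not touch any device API.

```python
class CountingStream:
    """Stand-in for a device stream context manager; it only counts entries."""

    def __init__(self):
        self.entries = 0

    def __enter__(self):
        self.entries += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


def update_per_layer(layers, stream):
    # Before: the context is entered once for every layer.
    for layer in layers:
        with stream:
            layer["updated"] = True


def update_hoisted(layers, stream):
    # After: the context wraps the whole loop and is entered once.
    with stream:
        for layer in layers:
            layer["updated"] = True


layers = [{"updated": False} for _ in range(4)]
before, after = CountingStream(), CountingStream()
update_per_layer(layers, before)
update_hoisted(layers, after)
print(before.entries, after.entries)  # 4 1
```

Both variants perform the same per-layer work; only the number of context entries differs.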

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-03 09:19:57 +08:00
wangxiyuan
fcc9a0eaeb Update torch-npu version to 2.7.1 (#3896)
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-31 17:16:31 +08:00
zhangsicheng5
0f70698d6d [feature] support pcp + mtp (with pd disaggregate) (#3822)
### What this PR does / why we need it?
support pcp + mtp (with pd disaggregate, only pcp in P nodes)

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-10-31 15:43:22 +08:00
rjg-lyh
c1a6aeab46 [main][bugfix] fix valueError in static_forward_context when prefix is empty (#3924)
### What this PR does / why we need it?
This PR temporarily bypasses the scenario where some models in vLLM
trigger a `ValueError` during the process of storing values in
`static_forward_context` when no `prefix` is specified for the linear
layers, which is a bug in some models in vLLM. The official fix will be
addressed by submitting a PR to the vLLM community that specifies a
prefix for the linear layers in each model.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-31 14:55:58 +08:00
Nagisa125
6764777f00 [Bugfix] Fix MTP support for lmhead_tensor_parallel_size (#3915)
### What this PR does / why we need it?
Fix the issue where enabling MTP and setting
lmhead_tensor_parallel_size=16 causes inference to hang.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wyh145 <1987244901@qq.com>
2025-10-31 10:30:28 +08:00
zouyida2052
1966885be2 fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len (#3910)
### What this PR does / why we need it?
1. Revert [bugfix for mtp in fullgraph](0948483642) and support it again
once vllm supports it
2. raise error when cudagraph_capture_sizes can't be an integer multiple
of uniform_decode_query_len
3. bugfix when max_num_seqs=14 in mtp=2 scenario
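
The check in item 2 could look roughly like the sketch below. The function name `check_capture_sizes` and the example values are assumptions made for illustration, not the code added by this PR.

```python
def check_capture_sizes(capture_sizes, uniform_decode_query_len):
    """Raise ValueError if any capture size is not a multiple of the query length."""
    bad = [size for size in capture_sizes if size % uniform_decode_query_len != 0]
    if bad:
        raise ValueError(
            f"capture sizes {bad} are not integer multiples of "
            f"uniform_decode_query_len={uniform_decode_query_len}"
        )
    return list(capture_sizes)


# Illustrative values only: a query length of 3 and sizes that divide evenly.
valid_sizes = check_capture_sizes([3, 6, 12], 3)
```

Failing early with a clear message means a mismatched size is reported at configuration time rather than surfacing later during graph capture.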

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-31 09:24:50 +08:00
wangxiaoteng888
a2b325ee00 [bugfix]cancel tokenize for layerwise_proxy (#3914)
### What this PR does / why we need it?
cancel tokenize for layerwise_proxy

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by ci

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-10-30 23:54:46 +08:00
wangxiaoteng888
2c291bc63f [bugfix] layerwise D first plan (#3866)
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-10-30 22:20:34 +08:00
offline893
627f20ce26 [BugFix]Fix group list type of mc2. (#3864)
### What this PR does / why we need it?
Fix the precision issue caused by the inconsistency between the group
list type used by mc2 and that of eplb.

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: offline0806 <3337230449@qq.com>
2025-10-30 21:39:01 +08:00
Song Zhixin
216fc0e8e4 [feature] Prompt Embeddings Support for v1 Engine (#3026)
### What this PR does / why we need it?
This PR is based on
[19746](https://github.com/vllm-project/vllm/issues/19746) and supports
Prompt Embeddings for the v1 engine on NPU.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

```bash
python examples/prompt_embed_inference.py
```


- vLLM version: v0.11.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-10-30 17:15:57 +08:00
whx
f6149f3894 [Model][3/N] Refactor sfa into mla and remove deepseek_v3_2.py (#3769)
This is the follow-up PR to PR #3189, which continues to refactor sfa
into mla and finally remove deepseek_v3_2.py. This is the last PR of
deepseek modeling refactoring. After this, all deepseek-related model
code is removed from vllm_ascend.

Furthermore, after this PR, deepseek v3.2 can run chunk-prefill with
correct accuracy.

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-30 17:06:38 +08:00