Commit Graph

1136 Commits

Author SHA1 Message Date
jiangyunfan1
44b58b8665 [TEST]Add full graph for multimodal nightly tests (#3968)
### What this PR does / why we need it?
This PR adds full graph for multimodal nightly test, we need to maintain
this senario

### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-11-04 16:47:48 +08:00
ZengSilong
dc1a6cb503 [Test]Add accuracy test for multiple models (#3823)
### What this PR does / why we need it?
Add accuracy test for multiple models:
- Meta_Llama_3.1_8B_Instruct
- Qwen2.5-Omni-7B
- Qwen3-VL-8B-Instruct

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-04 14:46:39 +08:00
zhangxinyuehfad
646fbac7a9 [Test] Add accuracy test for qwen3-8b-w8a8 (#3799)
### What this PR does / why we need it?
Add accuracy test for qwen3-8b-w8a8

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-04 09:23:11 +08:00
wangxiyuan
cc2cd42ad3 Upgrade CANN to 8.3.rc1 (#3945)
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.

TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-03 20:21:07 +08:00
CodeCat
49d74785c4 [Test] Add new e2e test use deepseek-v2-lite in ge graph mode (#3937)
### What this PR does / why we need it?
The current test cases lack end-to-end (e2e) testing for the
deepseek-v2-lite network in ge graph mode.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
2025-11-03 20:10:01 +08:00
Li Wang
8f222f21f1 [CI][Nightly] Fix mooncake build (#3958)
### What this PR does / why we need it?
Fix https://github.com/vllm-project/vllm-ascend/pull/3943

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-03 20:07:47 +08:00
1Fire4
0b9b6d79fe [Feat][UT] Support Deepseekv32 FULL_DECODE_ONLY mode and add unit test of sfa_v1 (#3763)
### What this PR does / why we need it?
- Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode.
- Add unit test for sfa_v1.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
2025-11-03 10:02:47 +08:00
Li Wang
d0cc9c1203 [CI][Nightly] Correct the commit hash available for mooncake (#3943)
### What this PR does / why we need it?
Because the previous commit hash was accidentally deleted or
overwritten. This patch correct the commit hash available for
https://github.com/AscendTransport/Mooncake to make nightly ci happy
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-01 21:52:16 +08:00
wangxiyuan
fcc9a0eaeb Update torch-npu version to 2.7.1 (#3896)
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-31 17:16:31 +08:00
Canlin Guo
f99762eb25 [E2E][MM] Add e2e tests for InternVL model (#3796)
### What this PR does / why we need it?

As a validation for #3664, add end-to-end tests to monitor the InternVL
model and ensure its continuous proper operation. This PR is only for
single-card. So the models that have more parameters than 8B like 78B
are needed to test using multi-cards.
 

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

`pytest -sv tests/e2e/singlecard/multi-modal/test_internvl.py`


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-10-31 15:42:47 +08:00
lilinsiman
1f486b2dd1 [Test] Add new test model for aclgraph single_request (#3888)
### What this PR does / why we need it?
add new test model for aclgraph single_request

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 11:23:13 +08:00
lilinsiman
35a913cf1e add new e2e tests case for aclgraph memory (#3879)
### What this PR does / why we need it?
add new e2e tests case for aclgraph memory

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 09:16:52 +08:00
Li Wang
eb0a2ee2d0 [CI] Optimize nightly CI (#3898)
### What this PR does / why we need it?
This patch mainly fix the the problem of not being able to determine the
exit status of the pod's entrypoint script and some other tiny
optimizations:
1. Shorten wait for server timeout
2. fix typo
3. fix the issue of ais_bench failing to correctly access the proxy URL
in a PD separation scenario.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-30 23:42:20 +08:00
wangxiaoteng888
2c291bc63f [bugfix] layerwise D first plan (#3866)
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-10-30 22:20:34 +08:00
jiangyunfan1
655a229455 [TEST]Add MALPO for aclgraph in nightly test (#3894)
### What this PR does / why we need it?
This PR adds MALPO for deepseek aclgraph, we need to test it nightly
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-30 18:25:54 +08:00
Song Zhixin
216fc0e8e4 [feature] Prompt Embeddings Support for v1 Engine (#3026)
### What this PR does / why we need it?
this PR based on
[19746](https://github.com/vllm-project/vllm/issues/19746), support
Prompt Embeddings for v1 engine on NPU

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

```python
python examples/prompt_embed_inference.py
```


- vLLM version: v0.11.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-10-30 17:15:57 +08:00
xuyexiong
eff3e5fc6f [FEAT] Refactor spec decode to support efficient padded speculation (#3528)
### What this PR does / why we need it?
1. Refactor the file `mtp_proposer.py`, splits torchair related codes
into `mtp_torchair_proposer.py`
2. According to https://github.com/vllm-project/vllm/pull/24539,
implements padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984.
### Does this PR introduce _any_ user-facing change?
User can use `disable_padded_drafter_batch` to disable/enable padded
speculation, default is `False`.
offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```

### How was this patch tested?

- [x] egaer with pad/unpad:
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad

performance test of deepseek-r1 with tp16、dp1
aclgraph with pad ITL: 168ms
aclgraph with unpad ITL: 169ms
original: 178ms


- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-30 16:53:05 +08:00
Meihan-chen
67dd3a4581 [UT] fix skip ut test for test_utils (#3803)
### What this PR does / why we need it?
[UT] fix ut test for test_utils that
https://github.com/vllm-project/vllm-ascend/pull/3612 skipped.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main:
17c540a993

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2025-10-30 15:52:53 +08:00
offline893
14ca1e5cb2 [CI]Fix oom of deepseek-eplb nigtly test. (#3884)
### What this PR does / why we need it?
Fix oom of deepseek-eplb nigtly test

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-30 10:18:07 +08:00
baxingpiaochong
d6ef3df3b3 [Bugfix]fix_mulit_connector_bug (#3332)
### What this PR does / why we need it?
When using multi connector, the multi connector does not define
get_finished_count, which will cause the kv cache to be released
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2025-10-29 23:23:06 +08:00
offline893
5f176ca992 [CI]Fix eplb nightly tests. (#3863)
### What this PR does / why we need it?

Fix eplb nightly tests.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-29 23:06:05 +08:00
Li Wang
4a2ab13743 [CI] Optimize nightly CI (#3858)
### What this PR does / why we need it?
This patch optimize nightly CI:
1. Bug fixes ais_bench get None repo_type error
2. Fix A2 install kubectl error with arm arch
3. Fix the multi_node CI unable to determine whether the job was
successful error
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-29 22:30:19 +08:00
realliujiaxu
74191864b7 [Perf] Delete redundant operations in model_runner and forward_context (#3677)
### What this PR does / why we need it?

Remove redundant operations from `model_runner` and `forward_context`.
This optimization can significantly reduce the idle time (bubble) before
decoding when running models with small parameter counts (e.g.,
Qwen/Qwen2.5-0.5B).

Testing on 800I A2, bubble is reduced from 3.8ms to 2.8ms :
Before
<img width="1655" height="696" alt="image"
src="https://github.com/user-attachments/assets/d7608e52-2438-46dd-8fc9-391fd6274495"
/>

After
<img width="1607" height="774" alt="image"
src="https://github.com/user-attachments/assets/56daf081-2dba-4d2e-99d4-e055187d9806"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-29 15:59:55 +08:00
weichen
0d1859af08 [Bugfix] [MoE] fix error in deepseek when using allgather (#3824)
### What this PR does / why we need it?
After refactoring vllm_ascend/models and FusedMoE, we are unable to pass
`gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result
in error when running deepseek v3/r1 with allgather.
Hence, this pr removes `gate` related computations from FusedMoE module
in eager/aclgraph mode.
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph.
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-29 14:51:39 +08:00
Mengqing Cao
900086fdc6 [HybridKV][Bugfix] Fix Hybrid kvcache sharing bug in same attention type (#3760)
### What this PR does / why we need it?
Part of https://github.com/vllm-project/vllm-ascend/pull/3106
Fix Hybrid kvcache sharing bug in same attention type
Change the `shared_by` logic so that the same attention spec could share
the same buffer instead of allocating more hbm.
After this pr, kvcache memory saved 50% in qwen3-next compared with
before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` could increase to `0.8` on Qwen3-Next when
running on A2 64G/card with tp4

<img width="2833" height="1540" alt="image"
src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe"
/>

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Test pass with the latest e2e test case on qwen3-next

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-29 14:18:52 +08:00
jiangyunfan1
e56b0017a3 [TEST]Add aisbench log and A2 cases (#3841)
### What this PR does / why we need it?
This PR adds 2 more A2 caces which we need to test daily. It also
enhances the logging for aisbench test failures to improve issues
identification
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-28 23:33:15 +08:00
Li Wang
90ae114569 [CI] Fix nightly CI (#3821)
### What this PR does / why we need it?
This patch fix the nightly CI runs
[failure](https://github.com/vllm-project/vllm-ascend/actions/runs/18848144365)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-28 20:40:03 +08:00
Li Wang
f846bd20e4 [CI] Add multi-node test case for a2 (#3805)
### What this PR does / why we need it?
This patch add multi-node test case for a2
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-27 23:10:17 +08:00
jiangyunfan1
9030106a14 [TEST]Add 2P1D multi node cases for nightly test (#3764)
### What this PR does / why we need it?
This PR adds the 2P1D multi node func/acc/perf test cases, we need test
them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-10-27 23:09:15 +08:00
Li Wang
60ee4af6d0 [CI] Add custom op to nightly (#3765)
### What this PR does / why we need it?
1. Add custom op to nightly tests, fix
https://github.com/vllm-project/vllm-ascend/pull/3665
2. Correctly pass github secrets when using workflow_call, see
https://docs.github.com/en/actions/how-tos/reuse-automations/reuse-workflows
3. Fix the single node mutual cancellation issue

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-27 14:07:03 +08:00
weiguihua2
4312a92a4f [feat]dcp pcp support aclgraph (#3731)
### What this PR does / why we need it?
dcp pcp support  full aclgraph, including mla attention_v1

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-10-27 09:58:23 +08:00
ck-hw-1018
7572939b94 add qwq testcase (#3757)
### What this PR does / why we need it?
This PR adds a qwq case for nightly test for qwen-qwq on A3 ,we need
test them daily

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by running the test


- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: ckhw <cuikai1@huawei.com>
2025-10-25 17:11:35 +08:00
zzzzwwjj
e5676fc36e [main] remove dbo code (#3712)
### What this PR does / why we need it?
Remove codes of dbo.
Currently, vLLM has supported dbo with pr:
https://github.com/vllm-project/vllm/pull/23693.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-10-25 15:53:01 +08:00
Icey
d9cdc65854 Upgrade to new vllm commit (#3719)
### What this PR does / why we need it?
Upgrade to new vllm commit:
c9461e05a4

- Fix many imports, caused by
https://github.com/vllm-project/vllm/pull/26908
- Fix import ```sha256```, caused by
https://github.com/vllm-project/vllm/pull/27169
- Remove ```SchedulerConfig.send_delta_data```, caused by
https://github.com/vllm-project/vllm/pull/27142
- Fix ```FusedMoE``` because of dual stream execution, caused by
https://github.com/vllm-project/vllm/pull/26440

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-10-25 15:36:32 +08:00
HuaJiaHeng
11f75883be [Test] add test for prefix cache feature of deepseek (#3733)
### What this PR does / why we need it?
This PR adds a prefix cache case for nightly test for
DeepSeek-r1-0528-W8A8 on A3, we need test them daily.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
2025-10-25 14:08:15 +08:00
weichen
63c363d3de [Refactor] [MoE] Rename moe-related classes & files (#3646)
### What this PR does / why we need it?
1. Rename common_fused_moe.py to fused_moe.py.
2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize
to prepare_finalize.py / PrepareAndFinalize.
3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe.
4. Move vllm_ascend/ops/fused_moe.py to
vllm_ascend/ops/fused_moe/fused_moe.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-25 11:22:03 +08:00
zhangxinyuehfad
8f6f967028 [Test] Add e2e test and accuracy test for Qwen3-Next-80B-A3B-Instruct (#3450)
### What this PR does / why we need it?

Add e2e test and accuracy test for Qwen3-Next-80B-A3B-Instruct

### How was this patch tested?
accuracy test:
https://github.com/vllm-project/vllm-ascend/actions/runs/18771221544/job/53556027634?pr=3450
ci test:
https://github.com/vllm-project/vllm-ascend/actions/runs/18771221530/job/53556027614?pr=3450
<img width="1703" height="562" alt="image"
src="https://github.com/user-attachments/assets/973b6cfa-8240-41e3-893a-5024ff8d0693"
/>



- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-25 10:57:56 +08:00
whx
d5609e2c48 [BugFix] Comment out newly added vlm e2e. (#3736)
This PR comments out newly added vlm e2e test of ascend scheduler
scenario because I found that when running in multi-batch this will
stuck. Need to add this back after dealing with this issue.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 10:34:59 +08:00
whx
e33751ef8b [BugFix][Core] Fix a bug running multi-modal with ascend_scheduler (#3675)
This PR fix the bug related with running multi-modal models with
AscendScheduler. This bug was introduced by PR #2372 by using the same
parameter names as vLLM with different default values. 

Currently I fix this bug by changing the default values of these two
parameters to align with vLLM. 

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-10-25 09:41:33 +08:00
Canlin Guo
8295136575 [UT][fix] Add missing get_ascend_config mock to NPUWorker initialization tests (#3729)
### What this PR does / why we need it?

Enable the unit tests that #3612 skipped.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Unit tests.

- vLLM main:
17c540a993

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-10-25 09:33:16 +08:00
Li Wang
7f73c28a24 [CI][Doc] Optimize multi-node CI (#3565)
### What this PR does / why we need it?
This pull request mainly do the following things:
1. Add a doc for multi-node CI, The main content is the mechanism
principle and how to contribute
2. Simplify the config yaml for more developer-friendly
3. Optimized the mooncake installation script to prevent accidental
failures during installation
4. Fix the workflow to ensure the kubernetes can be apply correctly
5. Add Qwen3-235B-W8A8 disaggregated_prefill test
6. Add GLM-4.5 multi dp test
7. Add 2p1d 4nodes disaggregated_prefill test
8. Refactor nightly tests
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-25 09:23:47 +08:00
shaopeng-666
39b994a987 [Feat] Add mrope fusion op (#3708)
### What this PR does / why we need it?
Add mrope fusion op for qwen2.5-vl. This mrope operator dosen't support
Qwen3-VL currently. Thus could only take affect in qwen2.5-vl

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-25 09:12:18 +08:00
Yizhou
3158742a97 [Refactor] Refactor Ascend attention implementation forward (#3714)
### What this PR does / why we need it?
This PR refactors the Ascend attention implementation to align with
vLLM's core interfaces, simplifying the code and improving
maintainability.

### Key Changes:

* **Align with vLLM's Attention Interface**: The `forward` method
signature in `AscendAttentionBackendImpl` now matches the base
`AttentionImpl` in vLLM, removing the custom `trace_flag`.

* **Enable Opaque Attention Operator**: By adding `opaque_attention_op`
to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its
standard `vllm.unified_attention_with_output` operator. This avoids the
need for a custom call path.

*   **Remove Obsolete Code**:
* The custom op `vllm.unified_ascend_attention_with_output` has been
deleted as it is now redundant.
* The `trace_flag` and its associated logic were removed, reducing code
complexity.
* An outdated quantization branch within the attention implementation
was cleaned up.

* **Improve Readability**: Renamed output variables (`output` vs.
`intermediate_output`) and added comments to clarify the in-place nature
of the attention output.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No extra tests needed.

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 08:58:35 +08:00
wangyu
d301c56d1a [TEST]Add initial multi modal cases of Qwen2.5-VL-32B-Instruct for nightly test (#3707)
### What this PR does / why we need it?
This PR adds the initial multi modal model for nightly test, including 2
cases for Qwen2.5-vl-32b acc/perf test on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test

vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
2025-10-24 17:12:06 +08:00
Mengqing Cao
cea0755b07 [1/N][Refactor] Refactor code to adapt with vllm main (#3612)
### What this PR does / why we need it?
This is the step 1 of refactoring code to adapt with vllm main, and this
pr aligned with
17c540a993

1. refactor deepseek to the latest code arch as of
17c540a993
 
2. bunches of fixes due to vllm changes
- Fix `AscendScheduler` `__post_init__`, caused by
https://github.com/vllm-project/vllm/pull/25075
- Fix `AscendScheduler` init got an unexpected arg `block_size`, caused
by https://github.com/vllm-project/vllm/pull/26296
- Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by
https://github.com/vllm-project/vllm/pull/23485
- Fix `MLAAttention` import,caused by
https://github.com/vllm-project/vllm/pull/25103
- Fix `SharedFusedMoE` import, caused by
https://github.com/vllm-project/vllm/pull/26145
- Fix `LazyLoader` improt, caused by
https://github.com/vllm-project/vllm/pull/27022
- Fix `vllm.utils.swap_dict_values` improt, caused by
https://github.com/vllm-project/vllm/pull/26990
- Fix `Backend` enum import, caused by
https://github.com/vllm-project/vllm/pull/25893
- Fix `CompilationLevel` renaming to `CompilationMode` issue introduced
by https://github.com/vllm-project/vllm/pull/26355
- Fix fused_moe ops, caused by
https://github.com/vllm-project/vllm/pull/24097
- Fix bert model because of `inputs_embeds`, caused by
https://github.com/vllm-project/vllm/pull/25922
- Fix MRope because of `get_input_positions_tensor` to
`get_mrope_input_positions`, caused by
https://github.com/vllm-project/vllm/pull/24172
- Fix `splitting_ops` changes introduced by
https://github.com/vllm-project/vllm/pull/25845
- Fix multi-modality changes introduced by
https://github.com/vllm-project/vllm/issues/16229
- Fix lora bias dropping issue introduced by
https://github.com/vllm-project/vllm/pull/25807
- Fix structured ouput break introduced by
https://github.com/vllm-project/vllm/issues/26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-10-24 16:55:08 +08:00
jiangyunfan1
ec9ec78b53 [TEST]Add initial prefix cache case for nightly test (#3709)
### What this PR does / why we need it?
This PR adds the initial prefix cache case for nightly test for
Qwen3-32b-int8 on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-24 16:33:18 +08:00
lio
cd58a643c5 [UT] Fix test_sample_recovered_tokens_pytorch_autoregressive (#3434)
### What this PR does / why we need it?

This 'test_rejection_sampler' unit test is something wrong.

> def test_sample_recovered_tokens_pytorch_autoregressive(self):
>       output_token_ids = torch.empty(2, dtype=torch.int32)
>       cu_num_draft_tokens = torch.tensor([1, 1])
>       draft_token_ids = torch.tensor([0, 1])

len(draft_token_ids ) = 2, cu_num_draft_tokens should be
torch.tensor([1, 2]) or torch.tensor([2, 2])

I fix it and set cu_num_draft_tokens = torch.tensor([1, 2]). The methods
before and after optimization can pass.

### Does this PR introduce _any_ user-facing change?
No 
### How was this patch tested?
NA

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: lio <1983142975@qq.com>
2025-10-24 11:20:57 +08:00
whx
1b270a64bd [MoE][Multistream] Avoid performing communication in extra stream. (#3582)
This PR moves the communication operation of shared experts out of extra
stream because I found that this might cause rtMemcpy related errors
when running shared experts multistream with aclgraph.

Furthermore, I utilize a global variable as extra stream object to avoid
allocating streams for each layer in full-graph mode.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-24 10:44:38 +08:00
LookAround0301
b54d44e664 support cp&dcp (#3260)
### What this PR does / why we need it?
This PR adds the Prefill Context Parallelism (PCP) feature, which
corresponds to DCP. For specific implementation details, please refer to
the RFC https://github.com/vllm-project/vllm/issues/25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning
the sequence dimension during the prefill stage.
### Does this PR introduce _any_ user-facing change?
The current implementation primarily includes the following changes:

Modified ModelRunner.py for CP partitioning logic for tokens;
Modified attention_v1.py and mla_v1.py to adapt the GQA/MLA backend to
PCP.
Modified block_tables.py to extend the KV cache storage based on
DCP&PCP;
Added necessary command-line arguments to control parallelism for PCP;
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: chenjie <chenjie137@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: Feng Liu <liufeng248@huawei.com>
Signed-off-by: gaojc <1055866782@qq.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Signed-off-by: z50049692 <zhangmingwei11@huawei.com>
Co-authored-by: chenjie <chenjie137@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Co-authored-by: Feng Liu <liufeng248@huawei.com>
Co-authored-by: gaojc <1055866782@qq.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: z50049692 <zhangmingwei11@huawei.com>
Co-authored-by: w00896881 <wangzixuan40@huawei.com>
2025-10-24 10:32:01 +08:00
HuaJiaHeng
062257f624 [Test] add a new Qwen3-32b-int8 test case with feature_stack3 (#3676)
### What this PR does / why we need it?
This PR add a new Qwen3-32b-int8 test case for nightly test. This test
case mainly test the performance and accuracy of Qwen3-32b-int8 with a
new feature.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
2025-10-23 20:43:14 +08:00