Commit Graph

648 Commits

Author SHA1 Message Date
Li Wang
ad366bf908 [Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181)
### What this PR does / why we need it?
This pr fix broken CI:
1. Fix the
ee2eb6ecd8
changes, in this commit, they fused the gate and up projections in the
vision MLP, This can improve performance by reducing one matrix
multiplication. so, this pr do the following things:
- Specify that the two linear layers are fused as `mlp.gate_up_proj`
when loading the weights.
    - Use a SiluAndMul activation function.
2. Fix
aefeea0fde,
Update ModelRunnerOutput parameters to adapt to its changes
3. Fix
[vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356),
fix qwen moe
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
fed5849d3f

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-04 21:37:50 +08:00
hucong
e38fab011d [Doc][PD] Restore the default configuration items in examples/disaggregate_prefill_v1/README.md (#2165)
### What this PR does / why we need it?
- In the D node, the max-num-batched-tokens parameter can be set to a
smaller value since the D node processes at most max-num-seqs batches
concurrently. As the profile_run only needs to handle max-num-seqs
sequences at a time, we can safely set max-num-batched-tokens equal to
max-num-seqs. This optimization will help reduce activation memory
consumption.
- Restore the default configuration items for PD separation.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main:
61dcc280fa

Signed-off-by: underfituu <hzhucong@163.com>
2025-08-04 20:30:53 +08:00
CaveNightingale
957c7f108d [Bugfix][PD] Make multiple Ps and Ds work on a single machine (#2080)
(cherry picked from commit 816375e0c1071d0696dfab1a1ce35674f9f37aa0)

### What this PR does / why we need it?

Suppose that you want to start a prefiller instance with npus `2,3`
only. So you start the instance with `ASCEND_RT_VISIBLE_DEVICES=2,3`.
The current programming will start two workers, whose ranks are `0` and
`1` respectedly. And they will pick the first and second ip addresses of
npus in the ranktable instead of the thirdth and forth ones. But
actually they are using card `2,3` and therefore they can not link with
remote instances when they attempt to transfer the KVCache.

Hence, at most 1 prefiller instance and at most 1 decoder instance can
work on a single machine since they always pick the first npu ip address
in the ranktable currently.

This pull request is proposed to fix the problem. This fix pick ips of
only those devices that are in `ASCEND_RT_VISIBLE_DEVICES` from the
ranktable.

### Does this PR introduce _any_ user-facing change?

If the user use ranktable generated by `gen_ranktable.sh`, they should
not face any change.

### How was this patch tested?
Qwen-0.6B 1P 1D, dp=2, `ASCEND_RT_VISIBLE_DEVICES=2,3` for prefiller and
`ASCEND_RT_VISIBLE_DEVICES=4,5` for decoder.


- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>
2025-08-04 17:22:18 +08:00
yiz-liu
a9480d5f0a [Fix] Adjust use_aclgraph logic (#2156)
### What this PR does / why we need it?
Updates the FusedMoE method to determine whether to use ACL Graph based
on the `torchair_graph_config`

This is equivalent to #2154 on v0.9.1-dev.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-04 15:23:20 +08:00
liu
688350a3bb [bugfixed] fix the bug when run the inference of quantized ds-w8a8-mtp (#2134)
When run the inference of ds-w8a8-mtp, it reported 'ParamllelLMhead has
no attribute 'params_dtype''.

1. add wrapper of vocab_parallel_embedding, fixed the bugs when running
deepseek-w8a8-mtp

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
2025-08-04 15:16:42 +08:00
Pleaplusone
4b3a210c33 Implementation of simple load balance routing proxy server (#1953) (#2124)
### What this PR does / why we need it?
The PR is the cherry-pick from v0.9.1
https://github.com/vllm-project/vllm-ascend/pull/1953

This PR introduce a new load balance proxy server example implementation
for disaggregated pd, which support simple token&kv_cache aware load
balance routing strategy for the disaggregated pd system compared with
origin round robin toy_proxy.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested on real workload and unittest

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-04 10:35:53 +08:00
Mengqing Cao
af04ee9e7a [MoE][Dist] Fix Qwen MoE accuracy bug in DP scenario (#1856)
### What this PR does / why we need it?
Fix Qwen MoE accuracy bug in DP scenario.

Now the implentment of `FusedMoE` in vLLM use `All2AllManager` to
manager different all2all algorithm branch. And the default branch use
`Multicast` in `dispatch` phase and `all_reduce` in `combine` phase,
which are not implented in vLLM-Ascend. This leading to invoking into a
default implentment in `base_communicator`, with empty `dispatch` and
`combine` operations, thus causing the accuracy issue on it.

This pr is a temporary workaround, refacting all2all in vLLM-Ascend
could be a better way.


- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-04 10:24:18 +08:00
Pleaplusone
f939381c6f [Bugfix] Adopt the new changes on disaggregated pd from vllm main branch (#2122)
### What this PR does / why we need it?
We notice that vllm's main branch merged the PR
https://github.com/vllm-project/vllm/pull/21072 and
https://github.com/vllm-project/vllm/pull/21473 to support ray backend
and fix some rebase bug from previous change. Those changes makes the
disaggregate pd in vllm ascend breaks in some scenario.

In this PR, we adopt those changes to make sure the
`llmdatddist_c_mgr_connector` works fine on the newest vllm main branch.

### Does this PR introduce _any_ user-facing change?

No user face change.

### How was this patch tested?
relevant ut will be added to make sure the functionality of those
changes.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-04 10:08:58 +08:00
YuanCheng-coder
ddaded1537 Add ut for envs.py (#2131)
What this PR does / why we need it?
test vllm_ascend/envs.py contains environment variables defination

Does this PR introduce any user-facing change?
N/A

How was this patch tested?
CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
9532a6d563

- vLLM version: v0.10.0
- vLLM main:
b4e081cb15

---------

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-08-02 16:53:44 +08:00
xleoken
bea3d5bbb4 [Bug] Fix run bug in run_dp_server.sh (#2139)
### What this PR does / why we need it?

For `Qwen2.5-0.5B-Instruct` model
- the model's total number of attention heads (14) must be divisible by
tensor parallel size. (4 -> 2)
- the model does not support enable-expert-parallel

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: xleoken <xleoken@163.com>
2025-08-02 16:52:12 +08:00
yangqinghao-cmss
47f688a2f0 Change retrieving remote files to local retrieval. (#2141)
### What this PR does / why we need it?
Using vllm's AudioAsset class to retrieve remote audio
files(https://vllm-public-assets.s3.us-west-2.amazonaws.com) is not
feasible in some cases; it is recommended to switch to local retrieval.

### How was this patch tested?
vllm:main
vllm:ascend:main
results:
```bash
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.62s/it]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.01s/it, est. speed input: 79.03 toks/s, output: 6.31 toks/s]
generated_text: The sport referenced is soccer, and the nursery rhyme is 'Hey Diddle Diddle'.
```

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-02 16:51:22 +08:00
zhangxinyuehfad
e48f32ec59 [CI] Update image for 310p ci (#2155)
### What this PR does / why we need it?
update the latest image for 310p ci test


- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-08-02 16:46:02 +08:00
leo-pony
e467fe1b77 Add qwen-vl model and sampling feature UT for 310I series (#2168)
### What this PR does / why we need it?
Add qwen-vl model and sampling feature UT for  310I series

- vLLM version: v0.10.0
- vLLM main:
e0f63e4a35

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-08-02 11:26:12 +08:00
weijinqian0
6e00aed4d5 [main][Feature]Moe alltoallv communication optimization for unquantized RL training sence (#2088)
It comes from 0.9.1dev
[0.9.1][Feature]Moe alltoallv communication optimization for unquantized
RL training sence & alltoallv support dpo (#1547)

- vLLM version: v0.10.0
- vLLM main:
97608dc276

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
Signed-off-by: taoxudonghaha <justsheldon@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: TaoYu Chen <ctynb@qq.com>
Co-authored-by: taoxudonghaha <justsheldon@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-08-02 09:49:10 +08:00
leo-pony
f0c1f0c828 [Doc] Add qwen vl example in tutorials for 310I series (#2160)
### What this PR does / why we need it?
Add qwen vl example in tutorials for 310I series. 

Model: Qwen2.5-VL-3B-Instruct
Accuracy test result, dataset MMM-val:
| | 910B3 | 310P3 |
| --- | --- | --- |
|Summary|0.455 | 0.46 |
|--art_and_design| 0.558 | 0.566 |
|--business| 0.373 | 0.366 |
|--health_and_medicine|0.513 | 0.52 |
|--science|0.333 | 0.333 |
|--tech_and_engineering|0.362 | 0.380 |
|--humanities_and_social_science|0.691 | 0.691 |

Function test result:

1. On line:
![image](https://github.com/user-attachments/assets/d81bba61-df28-4676-a246-c5d094815ac7)
![image](https://github.com/user-attachments/assets/0be81628-9999-4ef2-93c1-898b3043e09e)

2. Offline:
![image](https://github.com/user-attachments/assets/603275c1-6ed6-4cfc-a6e2-7726156de087)

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-08-02 08:58:56 +08:00
22dimensions
8cf97d8310 [Misc] Add extra checking to torchair_graph_config. (#1939)
### What this PR does / why we need it?

cherry-pick #1675  to main
This PR adds validation checking to torchair_graph_config for better
reliability.

Co-authored-by: whx-sjtu <2952154980@qq.com>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:24:11 +08:00
Li Wang
2284289880 [MISC] Cherry pick #1291 from v0.9.1-dev (#1825)
### What this PR does / why we need it?
Cherry pick #1291 from v0.9.1-dev, This pr implement the synchronization
of whether `dbo` is enabled across all dp ranks. specifically, it
performed allreduce op across multiple DP ranks, only when all the dp
rank is `enable_dbo`, it is enabled

Co-authored-by: shikang-hangzhou <459956190@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-01 09:08:45 +08:00
22dimensions
9e65da990e [Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132)
### What this PR does / why we need it?

cherry-pick #1501 from 0.9.1-dev to main

Currently, Ray is not compatible with ACL Graph, so we need to fall back
to eager mode when using the Ray backend.

co-authored: Yizhou Liu <liu_yizhou@outlook.com>

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:06:09 +08:00
yangqinghao-cmss
99fa0ac882 [BugFix] update the kv transfer config (#2121)
### What this PR does / why we need it?
The functions KVTransferConfig.from_cli and AscendHcclConnector are
missing in the latest vLLM version. To resolve this, I propose modifying
the kv_connector to use LLMDataDistCMgrConnector, which depends on [PR
#2079](https://github.com/vllm-project/vllm-ascend/pull/2079)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vllm:main
vllm-ascend:mian
results:
```bash
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 374.27it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 66.06it/s, est. speed input: 449.08 toks/s, output: 66.51 toks/s]
Prefill node is finished.
INFO 07-31 09:18:30 [model_runner_v1.py:2282] Graph capturing finished in 36 secs, took 0.21 GiB
INFO 07-31 09:18:30 [core.py:201] init engine (profile, create kv cache, warmup model) took 52.49 seconds
INFO 07-31 09:18:30 [factory.py:74] Creating v1 connector with name: LLMDataDistCMgrConnector and engine_id: 28c8ced8-575c-4f87-840a-48d04d0edf7e
INFO 07-31 09:18:30 [platform.py:157] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 07-31 09:18:30 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 76
INFO 07-31 09:18:30 [utils.py:359] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes
INFO 07-31 09:18:30 [llm.py:293] Supported_tasks: ['generate']
Waiting for prefill node to finish...
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 709.70it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.23it/s, est. speed input: 109.70 toks/s, output: 260.01 toks/s]
Prompt: 'Hello, how are you today?', Generated text: " I'm a computer program, so I don't have feelings. But I can"
Prompt: 'Hi, what is your name?', Generated text: ' I am a computer programmer. I have a question about the programming language I am'
Prompt: 'Tell me a very long story.', Generated text: ' I want to read it. I want to read it. I want to read'
Prompt: 'what is your favourite book?', Generated text: " I'm sorry, but as an AI language model, I don't have personal"
Cleanup prefill resources
All process done
```

- vLLM version: v0.10.0
- vLLM main:
9cb497bfa3

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-01 08:56:55 +08:00
Li Wang
968e6791d3 [Misc] Add data preprocess functions to qwen2.5_vl_without_padding (#2148)
### What this PR does / why we need it?
Cherry pick #1705 from v0.9.1-dev
Compared qwen2_5_vl.py, qwen2_5_vl_without_padding.py missing some
funtions. The purpose of this PR is to supplement these.

add:
- rot_pos_emb(self, grid_thw: torch.Tensor)
- get_window_index(self, grid_thw)
- _process_image_input(self, image_input)
- _process_video_input(self, video_input)

Co-authored-by: zheliuyu
[15750543867@163.com](mailto:15750543867@163.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main:
207b750e19

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-01 08:54:02 +08:00
Li Wang
e3b3ffb875 [Misc] Disable quantization in mindie_turbo (#2147)
### What this PR does / why we need it?
cherry pick #1749 from v0.9.1-dev
since the interface in vllm-ascend has changed so quickly, the
quantization function in mindie_turbo is no longer needed, so it needs
to be discarded.

Co-authored-by: zouyida [zouyida@huawei.com](mailto:zouyida@huawei.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main:
207b750e19

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-01 08:53:00 +08:00
leo-pony
c62f346f5d Fixed 310p failure when using the sampler feature (#2151)
### What this PR does / why we need it?
Fixed 310p failure when using the sampler feature.
The root cause is: torch_npu.npu_top_k_top_p uses the operator
aclnnApplyTopKTopP, but aclnnApplyTopKTopP currently does not support
310P.
First PR that has the issue is #1308.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.10.0
- vLLM main:
207b750e19

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-08-01 08:43:08 +08:00
Icey
86bdde1ca8 Enable pytest and yaml style accuracy test (#2073)
### What this PR does / why we need it?

This PR enabled pytest and yaml style accuracy test, users now can
enable accuracy test by running:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \
          --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md

pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1970

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-31 21:39:13 +08:00
huangxialu
9c9a7cd90b [main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112)
backport of v0.9.1-dev:
https://github.com/vllm-project/vllm-ascend/pull/1902

origin main npu_moe_gating_top_k_softmax:
https://github.com/vllm-project/vllm-ascend/pull/1355

- vLLM version: v0.10.0
- vLLM main:
055bd3978e

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-31 21:05:56 +08:00
Ronald1995
e8660d7978 ut:add ut for qwen2_5_vl (#2143)
### What this PR does / why we need it?
add ut for qwen2_5_vl

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:46:17 +08:00
Ronald1995
cb0a303080 ut:add e2e test for external launcher (#2091)
### What this PR does / why we need it?
This pr add e2e testcase to make sure initialize LLM by
external_launcher method is ok.

### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:37:42 +08:00
Mengqing Cao
4c8842da65 [BugFix] Fix a bug of running chunked-prefill with torchair. (#1378) (#1844)
This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced
before assignment` when running chunked-prefill with torchair. We should
calculate `decode_hs_or_q_c` whether or not torchair graphics mode is
enabled.

backport of #1378
fix https://github.com/vllm-project/vllm-ascend/issues/1369


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-07-31 20:08:45 +08:00
daniel
db310c6ec9 add ut for device allocator/camem and mutistream/layers (#2037)
What this PR does / why we need it?

test device allocator/camem and mutistream/layers contains resource
allocation and stream ops
Does this PR introduce any user-facing change?

N/A
How was this patch tested?

CI passed with new added test.


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 1024daniel <xxltju324@gmail.com>
2025-07-31 19:17:27 +08:00
zhanghw0354
2008152c48 [main][bugfix]Fix vLLM startup failure when inferring DeepSeek R1 model in DP scenario (#2020)
### What this PR does / why we need it?
Fix vLLM startup failure when inferring DeepSeek R1 model in DP
scenario.
When running vLLM inference for the DeepSeek R1 model in DP32+TP1
configuration, the vLLM service fails to start with the following error.
<img width="1786" height="918" alt="21b2011042d4f77f36f5243fa64d9c18"
src="https://github.com/user-attachments/assets/df1963fe-587e-43ca-822e-a9094d0034fb"
/>
The root cause is a missing else branch after [this line of
code](d629f0b2b5/vllm_ascend/ops/fused_moe.py (L1411)).
This PR fixes the issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-07-31 15:30:28 +08:00
CaranLic
7c90ba5fe8 [Test] add ut for decorator.py/deepseek_mtp.py (#2127)
### What this PR does / why we need it?
add ut for decorator.py/deepseek_mtp.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new tests
- vLLM version: v0.10.0
- vLLM main:
055bd3978e

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-07-31 15:21:15 +08:00
Joey Gao
6192bc95c0 [Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051)
bugfix cherry-pick from v0.9.1-dev
https://github.com/vllm-project/vllm-ascend/pull/2007
### What this PR does / why we need it?
Minimum reproducing code:
```python
# test.py
from vllm import LLM, SamplingParams
 
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240)
 
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
```
```bash
export USE_OPTIMIZED_MODEL=0
python test.py
```
exception as follow:
```
[rank0]:   File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward
[rank0]:     q = torch_npu.npu_rotary_mul(q, cos, sin)
[rank0]:   File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul)
```

In `AscendQwen2_5_VisionAttention_Without_Padding`,
`torch_npu.npu_rotary_mul(q, cos, sin)`, `cos`/`sin` on cpu, but `q` on
npu, so there will be an error.

`qwen2_5_vl_without_padding.py` need this bugfix, because
`AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in
wen2_5_vl_without_padding.py is from vllm and `inv_freq` will create on
cpu.

40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim))
```
`qwen2_5_vl.py` do not need, because
`AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py rewrite
`AscendQwen2_5_VisionRotaryEmbedding` and `inv_freq` will create on
device.
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim))
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: pjgao <gaopengju3@huawei.com>
Co-authored-by: pjgao <gaopengju3@huawei.com>
2025-07-31 15:18:54 +08:00
ApsarasX
72eceff94d [Bugfix] grammar_bitmask IndexError caused by outdated apply_grammar_bitmask method (#2022)
### What this PR does / why we need it?
Fix #2033 

Sync https://github.com/vllm-project/vllm/pull/14702 to solve
`grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask`
method

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by upstream vllm


- vLLM version: v0.10.0
- vLLM main:
6e599eebe8

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-31 09:03:27 +08:00
Mengqing Cao
75e28d0356 [Build][Ray] Fix protobuf version in Dockerfile (#2028)
### What this PR does / why we need it?
Fix protobuf version in Dockerfile to resolve `AttributeError: 'str'
object has no attribute 'DESCRIPTOR' when packaging message to dict`
using protobuf. will remove version specification after
https://github.com/ray-project/ray/pull/54910 is merged

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
0e36abf993

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 22:49:20 +08:00
Ronald1995
3386e09a40 ut:add ut for qwen2_vl.py (#2096)
### What this PR does / why we need it?
add ut for qwen2_vl.py

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
555e7225bc

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-30 22:31:47 +08:00
Mengqing Cao
936df1cb9b [Doc] Fix cann related urls (#2106)
### What this PR does / why we need it?
Fix cann related urls in installation doc.

### Does this PR introduce _any_ user-facing change?
The users install cann manually could use the correct url after this pr

### How was this patch tested?
N/A

- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 22:31:30 +08:00
Ruri
4fcca137a7 [main][Feature] Support Qwen3 W4A8 quantization (#2060)
### What this PR does / why we need it?

Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model

Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
2025-07-30 14:57:14 +08:00
zhangxinyuehfad
6874d666fa [CI]Add e2e test for 310p (#1879)
### What this PR does / why we need it?
Add e2e test for 310p:
trigger conditions:tag, labels(ready-for-test, e2e-310p-test), schedule
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
runner: linux-aarch64-310p-1, linux-aarch64-310p-4
model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base,
Qwen/Qwen2.5-7B-Instruct

- vLLM version: v0.10.0
- vLLM main:
b917da442b

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-30 14:52:16 +08:00
YuanCheng-coder
34dd24adf2 add ut for vocab_parallel_embedding (#2067)
### What this PR does / why we need it?

test vllm_ascend/ops/vocab_parallel_embedding.py contains vocab parallel
embedding forward

CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
2cc571199b


- vLLM version: v0.10.0
- vLLM main:
05cbbe20c5

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-07-30 14:35:45 +08:00
Yikun Jiang
d9f82ebfce [misc] Add reminder comment when PR submitted (#2092)
### What this PR does / why we need it?
Add reminder comment when PR submitted

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test locally:
https://github.com/Yikun/vllm-ascend/pull/51#issuecomment-3132425126
This PR will take effect after this PR merged.


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-30 10:14:33 +08:00
hongfugui
1dbb888275 [Bugfix] LoRA logits einsum dimension mismatch in add_lora_logits (#1583)
### What this PR does / why we need it?
This PR fixes a tensor shape mismatch in `add_lora_logits`.

Previously, `lora_a_stacked` was passed as shape `[num_loras, in_dim,
rank]`, which does not match the expected einsum pattern `"bi, boi ->
bo"` used in `bgmv_shrink`.

This causes runtime errors like:
RuntimeError: einsum(): subscript i has size 3 for operand 1 which does
not broadcast with previously seen size 4

![image](https://github.com/user-attachments/assets/63029479-49ae-4c3c-b995-f6805d15ad06)

This fix transposes `lora_a_stacked` and `lora_b_stacked` to match the
expected shapes:
- `lora_a`: `[num_loras, rank, in_dim]`
- `lora_b`: `[num_loras, out_dim, rank]`

All unit tests pass after this fix.
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
import torch
import pytest
from unittest.mock import patch, PropertyMock, ANY
from vllm_ascend.lora.punica_wrapper.punica_npu import PunicaWrapperNPU

@pytest.fixture
def wrapper_cpu():
    cfg = {"max_num_batched_tokens": 10, "max_batches": 2, "device": "cpu"}
    w = PunicaWrapperNPU(**cfg)
    w.is_prefill = True
    w.no_lora = False
    return w

def test_add_lora_logits(wrapper_cpu):
    batch_size = 2
    hidden_size = 4
    lora_rank = 3
    vocab_size = 5
    
    y = torch.zeros(batch_size, vocab_size)
    x = torch.randn(batch_size, hidden_size)
    
    num_loras = 1
    lora_a = torch.randn(num_loras, hidden_size, lora_rank)
    lora_b = torch.randn(num_loras, lora_rank, vocab_size)
    
    with patch.object(wrapper_cpu.__class__, "sampler_indices", 
                     new_callable=PropertyMock) as mock_idx:

        mock_idx.return_value = torch.zeros(batch_size, dtype=torch.long)

        wrapper_cpu.add_lora_logits(y, x, lora_a, lora_b, scale=1.0)

        assert y.shape == (batch_size, vocab_size)
        assert not torch.allclose(y, torch.zeros_like(y))

Signed-off-by: hongfugui <hongfugui_yewu@cmss.chinamobile.com>
2025-07-30 09:50:36 +08:00
Mengqing Cao
d80b0cca5d [CI] Fix test on pyhccl to 2 cards (#2094)
### What this PR does / why we need it?
Fix test on pyhccl to 2 cards

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0d0cc9e150

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 09:08:00 +08:00
wangxiyuan
9b67c87b14 [Refactor]Refactor sampler (#2050)
Refactor Sampler implementation from patch way to inherit from vLLM
Sampler interface.

Next step: Make the op `TopKTopPSampler` in vLLM support custom ops
register mechanism

- vLLM version: v0.10.0
- vLLM main:
61a6905ab0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-30 08:47:22 +08:00
whx
b6a7f07c70 [Perf][MoE] Improve MoE multistream parallel performace. (#1891)
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 23:53:19 +08:00
leo-pony
4df8e0027c [e2e]Fixed the issue that pyhccl e2e cannot run continuously with other tests (#1246)
### What this PR does / why we need it?
1.Fixed the issue that pyhccl e2e cannot run continuously with other
tests.
2.Cleaned up the resources occupied by the dynamic_npugraph_batchsize
e2e test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a e2e test

e2e multi-cards tests local running successfully.


- vLLM version: v0.9.2
- vLLM main:
0df4d9b06b

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-29 19:38:30 +08:00
Shanshan Shen
61fc35184b [Doc] Add performance tuning doc to main (#1392)
### What this PR does / why we need it?
Add performance tuning doc to main.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1387


- vLLM version: v0.9.1
- vLLM main:
923147b5e8

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-29 19:36:34 +08:00
taoxudonghaha
540336edc9 Add Custom Kernels For LoRA Performance (#1884)
### What this PR does / why we need it?
Add two custom kernels(bgmv_shrink and bgmv expand) to solve the
performance of LoRA
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
we add Unit Test file to test the custom ascendc kernel. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
Based on the actual test of the QWen2.5 7B model using vllm-ascend
version v0.9.2.rc1, the TTFT, TPOT and throughput have increased by
about 70%.

- vLLM version: v0.9.2
- vLLM main:
40d86ee412

---------

Signed-off-by: taoxudonghaha <justsheldon@163.com>
2025-07-29 19:27:50 +08:00
TaoYu Chen
2da281ec5a bump default python version to 3.11 (#2072)
### What this PR does / why we need it?
Bump default python version to 3.11, see #1980 

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
pass CI

- vLLM version: v0.10.0
- vLLM main:
12a223ef9b

Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
2025-07-29 19:07:17 +08:00
Li Wang
f60bb474f9 [CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065)
### What this PR does / why we need it?
Currently our workflow run time takes about 3 hours in total, which
seriously affects the developer experience, so it is urgent to have a
optimization, after this pr, It is expected that the running time of the
full CI can be shortened to 1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
a2480251ec

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-29 18:59:05 +08:00
curryliu
ca8007f584 [Feature] Enable inference support for Deepseekr1-w8a8-MTP (#1994)
Support the inference of the Deepseekr1-w8a8-mtp model with
statically-quantized shared_head in MTP layers.

- vLLM version: v0.9.2
- vLLM main:
6eca337ce0

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
2025-07-29 18:51:57 +08:00
whx
98cadc2146 [Perf] Avoid performing index selection of sin/cos cache every layer (#1890)
Optimize number of index selections of sin/cos cache.

- vLLM version: v0.10.0
- vLLM main:
656c24f1b5

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 18:06:45 +08:00