626 Commits

Author SHA1 Message Date
Icey
86bdde1ca8 Enable pytest and yaml style accuracy test (#2073)
### What this PR does / why we need it?

This PR enables a pytest- and YAML-based accuracy test. Users can now
run the accuracy test with:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \
          --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md

pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1970

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-31 21:39:13 +08:00
huangxialu
9c9a7cd90b [main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112)
backport of v0.9.1-dev:
https://github.com/vllm-project/vllm-ascend/pull/1902

origin main npu_moe_gating_top_k_softmax:
https://github.com/vllm-project/vllm-ascend/pull/1355

- vLLM version: v0.10.0
- vLLM main:
055bd3978e

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-31 21:05:56 +08:00
Ronald1995
e8660d7978 ut:add ut for qwen2_5_vl (#2143)
### What this PR does / why we need it?
add ut for qwen2_5_vl

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:46:17 +08:00
Ronald1995
cb0a303080 ut:add e2e test for external launcher (#2091)
### What this PR does / why we need it?
This PR adds an e2e test case to make sure that initializing the LLM via the
external_launcher method works.

### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:37:42 +08:00
Mengqing Cao
4c8842da65 [BugFix] Fix a bug of running chunked-prefill with torchair. (#1378) (#1844)
This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced
before assignment` when running chunked-prefill with torchair.
`decode_hs_or_q_c` should be calculated whether or not torchair graph mode is
enabled.
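
The failure mode described above can be reproduced in plain Python. This is only an illustrative sketch: `forward_buggy`, `forward_fixed`, `raises_unbound`, and the string placeholders are hypothetical stand-ins, not the actual vllm-ascend code.

```python
def forward_buggy(hidden_states, torchair_graph_enabled):
    # The variable is bound on one branch only ...
    if torchair_graph_enabled:
        decode_hs_or_q_c = f"q_c({hidden_states})"
    # ... so reading it here fails whenever that branch was skipped.
    return decode_hs_or_q_c


def forward_fixed(hidden_states, torchair_graph_enabled):
    # Bind the variable unconditionally; keep only mode-specific work in the branch.
    decode_hs_or_q_c = f"q_c({hidden_states})"
    if torchair_graph_enabled:
        pass  # graph-mode-only handling would go here
    return decode_hs_or_q_c


def raises_unbound(fn, *args):
    # Helper: report whether calling fn hits the "referenced before assignment" error.
    try:
        fn(*args)
    except UnboundLocalError:
        return True
    return False
```

Calling `raises_unbound(forward_buggy, "h", False)` reports the error, while `forward_fixed` returns a value in both modes.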

backport of #1378
fix https://github.com/vllm-project/vllm-ascend/issues/1369


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-07-31 20:08:45 +08:00
daniel
db310c6ec9 add ut for device allocator/camem and multistream/layers (#2037)
### What this PR does / why we need it?

Add unit tests for device allocator/camem and multistream/layers, covering resource
allocation and stream ops.

### Does this PR introduce any user-facing change?

N/A

### How was this patch tested?

CI passed with the newly added tests.


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 1024daniel <xxltju324@gmail.com>
2025-07-31 19:17:27 +08:00
zhanghw0354
2008152c48 [main][bugfix]Fix vLLM startup failure when inferring DeepSeek R1 model in DP scenario (#2020)
### What this PR does / why we need it?
Fix vLLM startup failure when inferring DeepSeek R1 model in DP
scenario.
When running vLLM inference for the DeepSeek R1 model in DP32+TP1
configuration, the vLLM service fails to start with the following error.
<img width="1786" height="918" alt="21b2011042d4f77f36f5243fa64d9c18"
src="https://github.com/user-attachments/assets/df1963fe-587e-43ca-822e-a9094d0034fb"
/>
The root cause is a missing else branch after [this line of
code](d629f0b2b5/vllm_ascend/ops/fused_moe.py (L1411)).
This PR fixes the issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-07-31 15:30:28 +08:00
CaranLic
7c90ba5fe8 [Test] add ut for decorator.py/deepseek_mtp.py (#2127)
### What this PR does / why we need it?
add ut for decorator.py/deepseek_mtp.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new tests
- vLLM version: v0.10.0
- vLLM main:
055bd3978e

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-07-31 15:21:15 +08:00
Joey Gao
6192bc95c0 [Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051)
bugfix cherry-pick from v0.9.1-dev
https://github.com/vllm-project/vllm-ascend/pull/2007
### What this PR does / why we need it?
Minimum reproducing code:
```python
# test.py
from vllm import LLM, SamplingParams
 
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240)
 
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
```
```bash
export USE_OPTIMIZED_MODEL=0
python test.py
```
The exception is as follows:
```
[rank0]:   File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward
[rank0]:     q = torch_npu.npu_rotary_mul(q, cos, sin)
[rank0]:   File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul)
```

In `AscendQwen2_5_VisionAttention_Without_Padding`, the call
`torch_npu.npu_rotary_mul(q, cos, sin)` receives `cos`/`sin` on the CPU while `q` is on
the NPU, which raises the error above.

`qwen2_5_vl_without_padding.py` needs this bugfix because
`AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in
qwen2_5_vl_without_padding.py comes from vllm, where `inv_freq` is created on the
CPU.

40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim))
```
`qwen2_5_vl.py` does not need this fix, because
`AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py reimplements the
rotary embedding, so `inv_freq` is created on the
device.
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim))
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: pjgao <gaopengju3@huawei.com>
Co-authored-by: pjgao <gaopengju3@huawei.com>
2025-07-31 15:18:54 +08:00
ApsarasX
72eceff94d [Bugfix] grammar_bitmask IndexError caused by outdated apply_grammar_bitmask method (#2022)
### What this PR does / why we need it?
Fix #2033 

Sync https://github.com/vllm-project/vllm/pull/14702 to solve
`grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask`
method

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by upstream vllm


- vLLM version: v0.10.0
- vLLM main:
6e599eebe8

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-31 09:03:27 +08:00
Mengqing Cao
75e28d0356 [Build][Ray] Fix protobuf version in Dockerfile (#2028)
### What this PR does / why we need it?
Fix the protobuf version in the Dockerfile to resolve `AttributeError: 'str'
object has no attribute 'DESCRIPTOR'` when packaging a message to a dict
using protobuf. The version pin will be removed after
https://github.com/ray-project/ray/pull/54910 is merged.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.0
- vLLM main:
0e36abf993

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 22:49:20 +08:00
Ronald1995
3386e09a40 ut:add ut for qwen2_vl.py (#2096)
### What this PR does / why we need it?
add ut for qwen2_vl.py

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
555e7225bc

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-30 22:31:47 +08:00
Mengqing Cao
936df1cb9b [Doc] Fix cann related urls (#2106)
### What this PR does / why we need it?
Fix cann related urls in installation doc.

### Does this PR introduce _any_ user-facing change?
Users who install CANN manually can use the correct URLs after this PR.

### How was this patch tested?
N/A

- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 22:31:30 +08:00
Ruri
4fcca137a7 [main][Feature] Support Qwen3 W4A8 quantization (#2060)
### What this PR does / why we need it?

Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model

Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
2025-07-30 14:57:14 +08:00
zhangxinyuehfad
6874d666fa [CI]Add e2e test for 310p (#1879)
### What this PR does / why we need it?
Add e2e test for 310p:
- trigger conditions: tag, labels (ready-for-test, e2e-310p-test), schedule
- image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
- runner: linux-aarch64-310p-1, linux-aarch64-310p-4
- model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base,
  Qwen/Qwen2.5-7B-Instruct

- vLLM version: v0.10.0
- vLLM main:
b917da442b

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-30 14:52:16 +08:00
YuanCheng-coder
34dd24adf2 add ut for vocab_parallel_embedding (#2067)
### What this PR does / why we need it?

Add unit tests for vllm_ascend/ops/vocab_parallel_embedding.py, covering the vocab parallel
embedding forward pass.

CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
2cc571199b


- vLLM version: v0.10.0
- vLLM main:
05cbbe20c5

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-07-30 14:35:45 +08:00
Yikun Jiang
d9f82ebfce [misc] Add reminder comment when PR submitted (#2092)
### What this PR does / why we need it?
Add reminder comment when PR submitted

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test locally:
https://github.com/Yikun/vllm-ascend/pull/51#issuecomment-3132425126
This change takes effect after this PR is merged.


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-30 10:14:33 +08:00
hongfugui
1dbb888275 [Bugfix] LoRA logits einsum dimension mismatch in add_lora_logits (#1583)
### What this PR does / why we need it?
This PR fixes a tensor shape mismatch in `add_lora_logits`.

Previously, `lora_a_stacked` was passed as shape `[num_loras, in_dim,
rank]`, which does not match the expected einsum pattern `"bi, boi ->
bo"` used in `bgmv_shrink`.

This causes runtime errors like:
RuntimeError: einsum(): subscript i has size 3 for operand 1 which does
not broadcast with previously seen size 4

![image](https://github.com/user-attachments/assets/63029479-49ae-4c3c-b995-f6805d15ad06)

This fix transposes `lora_a_stacked` and `lora_b_stacked` to match the
expected shapes:
- `lora_a`: `[num_loras, rank, in_dim]`
- `lora_b`: `[num_loras, out_dim, rank]`
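
The shape requirement of the `"bi, boi -> bo"` pattern can be checked without any tensor library. The sketch below is a pure-Python illustration under stated assumptions: `shape_of`, `einsum_bi_boi_bo`, and `transpose_last_two` are hypothetical helpers, `w` stands for the per-row selected LoRA-A weights, and the numbers are made up for the example.

```python
def shape_of(t):
    """Return the shape of a rectangular nested list."""
    shape = []
    while isinstance(t, list):
        shape.append(len(t))
        t = t[0]
    return tuple(shape)


def einsum_bi_boi_bo(x, w):
    """Pure-Python equivalent of einsum("bi, boi -> bo", x, w)."""
    b, i = shape_of(x)
    wb, o, wi = shape_of(w)
    # Subscripts b and i must have the same size in both operands.
    if (wb, wi) != (b, i):
        raise ValueError(f"subscript mismatch: x is {(b, i)}, w is {(wb, o, wi)}")
    return [[sum(x[n][k] * w[n][r][k] for k in range(i)) for r in range(o)]
            for n in range(b)]


def transpose_last_two(t):
    """Swap the last two axes of a 3-D nested list."""
    return [[list(col) for col in zip(*m)] for m in t]


x = [[1, 2, 3, 4]]  # [batch=1, in_dim=4]
lora_a = [[[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]]  # [1, in_dim=4, rank=3]
shrunk = einsum_bi_boi_bo(x, transpose_last_two(lora_a))  # weights now [1, rank=3, in_dim=4]
```

Passing `lora_a` untransposed raises the mismatch (size 3 against size 4 for subscript `i`), which mirrors the runtime error quoted above.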

All unit tests pass after this fix.
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```python
import torch
import pytest
from unittest.mock import patch, PropertyMock, ANY
from vllm_ascend.lora.punica_wrapper.punica_npu import PunicaWrapperNPU

@pytest.fixture
def wrapper_cpu():
    cfg = {"max_num_batched_tokens": 10, "max_batches": 2, "device": "cpu"}
    w = PunicaWrapperNPU(**cfg)
    w.is_prefill = True
    w.no_lora = False
    return w

def test_add_lora_logits(wrapper_cpu):
    batch_size = 2
    hidden_size = 4
    lora_rank = 3
    vocab_size = 5
    
    y = torch.zeros(batch_size, vocab_size)
    x = torch.randn(batch_size, hidden_size)
    
    num_loras = 1
    lora_a = torch.randn(num_loras, hidden_size, lora_rank)
    lora_b = torch.randn(num_loras, lora_rank, vocab_size)
    
    with patch.object(wrapper_cpu.__class__, "sampler_indices", 
                     new_callable=PropertyMock) as mock_idx:

        mock_idx.return_value = torch.zeros(batch_size, dtype=torch.long)

        wrapper_cpu.add_lora_logits(y, x, lora_a, lora_b, scale=1.0)

        assert y.shape == (batch_size, vocab_size)
        assert not torch.allclose(y, torch.zeros_like(y))
```

Signed-off-by: hongfugui <hongfugui_yewu@cmss.chinamobile.com>
2025-07-30 09:50:36 +08:00
Mengqing Cao
d80b0cca5d [CI] Fix test on pyhccl to 2 cards (#2094)
### What this PR does / why we need it?
Fix test on pyhccl to 2 cards

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0d0cc9e150

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 09:08:00 +08:00
wangxiyuan
9b67c87b14 [Refactor]Refactor sampler (#2050)
Refactor Sampler implementation from patch way to inherit from vLLM
Sampler interface.
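
The difference between the two approaches can be sketched in a few lines. This is a generic illustration, not the vLLM Sampler API: `BaseSampler`, `DeviceSampler`, and the `pick` hook are hypothetical names chosen for the example.

```python
class BaseSampler:
    """Hypothetical stand-in for an upstream sampler interface."""

    def sample(self, logits):
        return self.pick(logits)

    def pick(self, logits):
        # Default behaviour: index of the first largest logit.
        return max(range(len(logits)), key=logits.__getitem__)


class DeviceSampler(BaseSampler):
    """Subclass that overrides one hook instead of patching the base class."""

    def pick(self, logits):
        # Illustrative device-specific variant: on ties, prefer the last index.
        best = max(logits)
        return max(i for i, v in enumerate(logits) if v == best)
```

With inheritance the upstream class stays untouched, so `BaseSampler().sample(...)` keeps its original behaviour while `DeviceSampler` carries the customization.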

Next step: Make the op `TopKTopPSampler` in vLLM support custom ops
register mechanism

- vLLM version: v0.10.0
- vLLM main:
61a6905ab0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-30 08:47:22 +08:00
whx
b6a7f07c70 [Perf][MoE] Improve MoE multistream parallel performance. (#1891)
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 23:53:19 +08:00
leo-pony
4df8e0027c [e2e]Fixed the issue that pyhccl e2e cannot run continuously with other tests (#1246)
### What this PR does / why we need it?
1. Fixed the issue that pyhccl e2e cannot run continuously with other
tests.
2. Cleaned up the resources occupied by the dynamic_npugraph_batchsize
e2e test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is an e2e test.

The e2e multi-card tests ran successfully locally.


- vLLM version: v0.9.2
- vLLM main:
0df4d9b06b

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-29 19:38:30 +08:00
Shanshan Shen
61fc35184b [Doc] Add performance tuning doc to main (#1392)
### What this PR does / why we need it?
Add performance tuning doc to main.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1387


- vLLM version: v0.9.1
- vLLM main:
923147b5e8

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-29 19:36:34 +08:00
taoxudonghaha
540336edc9 Add Custom Kernels For LoRA Performance (#1884)
### What this PR does / why we need it?
Add two custom kernels (bgmv_shrink and bgmv_expand) to improve
LoRA performance.
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
We add unit test files to test the custom AscendC kernels. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
In an actual test of the Qwen2.5 7B model using vllm-ascend
version v0.9.2.rc1, TTFT, TPOT and throughput improved by
about 70%.

- vLLM version: v0.9.2
- vLLM main:
40d86ee412

---------

Signed-off-by: taoxudonghaha <justsheldon@163.com>
2025-07-29 19:27:50 +08:00
TaoYu Chen
2da281ec5a bump default python version to 3.11 (#2072)
### What this PR does / why we need it?
Bump default python version to 3.11, see #1980 

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
pass CI

- vLLM version: v0.10.0
- vLLM main:
12a223ef9b

Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
2025-07-29 19:07:17 +08:00
Li Wang
f60bb474f9 [CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065)
### What this PR does / why we need it?
Currently our workflow takes about 3 hours in total, which
seriously affects the developer experience, so optimization is urgent.
After this PR, the running time of the
full CI is expected to drop to 1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
a2480251ec

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-29 18:59:05 +08:00
curryliu
ca8007f584 [Feature] Enable inference support for Deepseekr1-w8a8-MTP (#1994)
Support the inference of the Deepseekr1-w8a8-mtp model with
statically-quantized shared_head in MTP layers.

- vLLM version: v0.9.2
- vLLM main:
6eca337ce0

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
2025-07-29 18:51:57 +08:00
whx
98cadc2146 [Perf] Avoid performing index selection of sin/cos cache every layer (#1890)
Optimize number of index selections of sin/cos cache.
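
The idea can be illustrated with a toy cache that counts how often rows are gathered. This is a minimal sketch, assuming the optimization amounts to gathering once per forward pass and sharing the result across layers; `RotaryCache`, `forward_per_layer`, and `forward_hoisted` are hypothetical names, not the vllm-ascend implementation.

```python
import math


class RotaryCache:
    """Toy cos/sin cache that counts how often rows are gathered by position."""

    def __init__(self, max_pos):
        self.cos = [math.cos(p) for p in range(max_pos)]
        self.sin = [math.sin(p) for p in range(max_pos)]
        self.num_selects = 0

    def select(self, positions):
        self.num_selects += 1
        return ([self.cos[p] for p in positions],
                [self.sin[p] for p in positions])


def forward_per_layer(cache, positions, num_layers):
    # Every layer repeats the same gather.
    for _ in range(num_layers):
        cos, sin = cache.select(positions)
    return cache.num_selects


def forward_hoisted(cache, positions, num_layers):
    # Gather once per forward pass and share the result across layers.
    cos, sin = cache.select(positions)
    for _ in range(num_layers):
        _ = (cos, sin)  # each layer would consume the shared values here
    return cache.num_selects
```

For a model with `num_layers` layers, the hoisted version performs one gather per forward pass instead of `num_layers`.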

- vLLM version: v0.10.0
- vLLM main:
656c24f1b5

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 18:06:45 +08:00
wangxiyuan
0190b68f51 [Misc]Remove PD v0 code (#2047)
Cleanup V0 disaggregated prefill code for V0 Engine.

part of https://github.com/vllm-project/vllm-ascend/issues/1620

TODO: enable v1 e2e test.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 19:09:22 +08:00
Yikun Jiang
935e9d4c9d Pin transformers to fix v0.9.1 doctest (#2048)
### What this PR does / why we need it?
Pin transformers to fix v0.9.1 doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest passed


- vLLM version: v0.10.0
- vLLM main:
c657369841

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-28 17:51:56 +08:00
huangxialu
1a25b0a2dd [Test] add ut for qwen3_moe.py (#2055)
### What this PR does / why we need it?
Add ut for qwen3_moe.py

### Does this PR introduce _any_ user-facing change?
No.


- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-28 17:37:13 +08:00
whx
e7d32ed3f1 [BugFix] Fix the problem that torchair doesn't support tp > 4. (#1508)
This PR removes the restriction that TP cannot be greater than 4 in
the torchair scenario, because the latest version of CANN has fixed this
bug.

- vLLM version: v0.10.0
- vLLM main:
04ff4be310

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-28 16:48:05 +08:00
wangxiyuan
4a008c4dac [Misc]Clean up useless import from vllm (#2049)
Clean up unused imports from vllm to make the code clearer.

- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 16:01:59 +08:00
wangxiyuan
34cfdf5520 [Misc] Fix logger bug (#2024)
1. Remove useless logger
2. Fix logger bug, same problem as
https://github.com/vllm-project/vllm-ascend/pull/515

- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 15:59:09 +08:00
LeeWenquan
3ad582c9a9 [Test] Add ut for files in /attention (#1944)
### What this PR does / why we need it?
Add ut for files in folder /attention
### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
139a7f07bd

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-28 15:54:40 +08:00
Ronald1995
32a9c5f694 [Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)
### What this PR does / why we need it?
By default, the vLLM RowParallelLinear forward function executes matmul and
allreduce separately. This PR uses torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a single fused kernel, which yields about a 20%
performance gain in eager mode.
### Does this PR introduce _any_ user-facing change?
This PR introduces a new environment variable, `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE`, to
control whether the feature is enabled.
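
A feature flag of this kind is typically read once and used to choose a code path. The sketch below only illustrates that pattern: the accepted value (`"1"` enables, anything else disables), the helper names, and the returned labels are assumptions, not the actual vllm-ascend logic.

```python
import os


def matmul_allreduce_enabled(environ=os.environ):
    # Assumption: "1" enables the feature; unset or any other value disables it.
    return environ.get("VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE", "0") == "1"


def pick_linear_path(is_prefill, tp_size, environ=os.environ):
    # Per the PR title, the fused kernel targets the prefill phase with TP enabled.
    if matmul_allreduce_enabled(environ) and is_prefill and tp_size > 1:
        return "fused_matmul_allreduce"
    return "separate_matmul_then_allreduce"
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.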

### How was this patch tested?
The patch is covered by a new unit test file, `test_patch_linear.py`.


- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-28 15:13:37 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

Refactor forward_context and model_runner_v1: add context that is
necessary for model inference to forward_context, and refactor the
dummy_run logic to make it more reasonable.
Some details for this PR:

- Add `ascend_forward_context`
- Update the mc2_v2 op and support the `active_mask` param
- Update scripts in the examples dir
- Refactor the `dummy_run` logic
- Add soc_version for A2 and A3

### Does this PR introduce _any_ user-facing change?

No user-facing change.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Wang Kunpeng
e3a2443c3a [main][Doc] add mla pertoken quantization FAQ (#2018)
### What this PR does / why we need it?
When using deepseek series models generated by the --dynamic parameter,
if torchair graph mode is enabled, we should modify the configuration
file in the CANN package to prevent incorrect inference results.

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-07-27 08:47:51 +08:00
Yikun Jiang
5b579ddafe Upgrade CANN to 8.2.RC1 (A3) (#2043)
### What this PR does / why we need it?
Upgrade CANN to 8.2.RC1

### Does this PR introduce _any_ user-facing change?
Yes, A3 images now use 8.2.RC1

### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
de509ae8eb

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 23:10:27 +08:00
Mengqing Cao
ed2ab8a197 [CI/Build] Upgrade CANN to 8.2.RC1 (#1653)
### What this PR does / why we need it?
Upgrade CANN to 8.2.rc1

Backport: https://github.com/vllm-project/vllm-ascend/pull/1653

### Does this PR introduce _any_ user-facing change?
Yes, docker image will use 8.2.RC1

### How was this patch tested?
CI passed

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 22:37:46 +08:00
zhangxinyuehfad
d1c640841b [Bugfix] Fix num_hidden_layers when Qwen2-Audio 7B (#1803)
### What this PR does / why we need it?
Fix the `num_hidden_layers` lookup for Qwen2-Audio 7B and #1760:
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
  File "/workspace/test1.py", line 58, in <module>
    main(audio_count)
  File "/workspace/test1.py", line 38, in main
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
  File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
    config = VllmConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
    current_platform.check_and_update_config(self)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
    update_aclgraph_sizes(vllm_config)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
    num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```
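
The traceback shows a direct attribute read failing on a composite config. One defensive way to read the value is sketched below, assuming the layer count lives on a nested `text_config` when it is missing at the top level. The stand-in classes, the layer counts, and `get_num_hidden_layers` are hypothetical, and this is not necessarily how the PR resolves it.

```python
class TextConfig:
    num_hidden_layers = 32  # illustrative value


class AudioLikeConfig:
    """Composite config: the layer count lives on a nested text_config."""
    text_config = TextConfig()


class PlainConfig:
    num_hidden_layers = 28  # illustrative value


def get_num_hidden_layers(hf_config, default=None):
    # Prefer the top-level attribute when present.
    n = getattr(hf_config, "num_hidden_layers", None)
    if n is not None:
        return n
    # Assumption: multimodal configs expose the language model under text_config.
    text_config = getattr(hf_config, "text_config", None)
    return getattr(text_config, "num_hidden_layers", default)
```

Using `getattr` with a default avoids the `AttributeError` and lets the caller decide what to do when neither location provides the value.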

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes: https://github.com/vllm-project/vllm-ascend/issues/1780
https://github.com/vllm-project/vllm-ascend/issues/1760
https://github.com/vllm-project/vllm-ascend/issues/1276
https://github.com/vllm-project/vllm-ascend/issues/359

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-26 20:13:00 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopts `LLMDataDist` for KV cache registration and a `pull_blocks`
style disaggregated prefill implementation. The interface implementation
mainly follows the design of the NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be tested with the following steps:
- Generate the rank table for all machines.
- Execute `toy_proxy.py` to launch the disaggregated prefill proxy server,
specifying the prefill IP and port and the decode IP and port.
- Run the prefill server and the decode server.
- Send the request to the disaggregated prefill proxy.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
Li Wang
2f50304c19 [Bugfix] Add get_supported_tasks interface to fix broken CI (#2023)
### What this PR does / why we need it?
Added `get_supported_tasks` interface to adapt to vllm
[changes](46d81d6951 (diff-80ee7e2a62f9dcfbb8a312dc4e3948557e97ef187290daebbcae1e28596bda29))
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
5ac3168ee3

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-26 08:20:21 +08:00
Li Wang
bdfb065b5d [1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Make clean code

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
29c6fbe58c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-25 22:16:10 +08:00
Li Wang
d629f0b2b5 [CI] Remove transformers installation (#2014)
### What this PR does / why we need it?
Remove the transformers installation. The transformers version bug has been
fixed by
e936e401de.
It is now safe to remove the version limit.

- vLLM version: v0.9.2
- vLLM main:
40d86ee412

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-25 15:20:37 +08:00
Ronald1995
e561a2c6ec ut:add ut for qwen2_5_vl_without_padding.py (#1988)
### What this PR does / why we need it?
This PR adds ut for qwen2_5_vl_without_padding.py

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
this is only a ut test


- vLLM version: v0.9.2
- vLLM main:
9c8b2c2a8a

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-25 14:12:44 +08:00
SunnyLee151064
ae560f7131 [Test] Add uts for files in /core (#1957)
### What this PR does / why we need it?

Add uts for files in folder /core

### Does this PR introduce _any_ user-facing change?

No

- vLLM version: v0.9.2
- vLLM main:
5a19a6c670

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-25 09:48:19 +08:00
Icey
6bc82cf6a7 Enable image push CI for build file and csrc has changes (#1977)
### What this PR does / why we need it?
- Fixes image CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.9.2
- vLLM main:
f3137cdd81

Signed-off-by: Icey <1790571317@qq.com>
2025-07-24 21:19:41 +08:00
JohnJan
cfdd45ed00 [Bug] Fix duplicate 'torch.' prefix in qwen-vl (#1986)
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>

### What this PR does / why we need it?
Fix duplicate 'torch.' prefix in qwen2-vl, qwen2.5-vl

- vLLM version: v0.9.2
- vLLM main:
dde295a934
2025-07-24 20:16:00 +08:00