### What this PR does / why we need it?
Fix protobuf version in Dockerfile to resolve `AttributeError: 'str'
object has no attribute 'DESCRIPTOR' when packaging message to dict`
using protobuf. will remove version specification after
https://github.com/ray-project/ray/pull/54910 is merged
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0e36abf993
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
add ut for qwen2_vl.py
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
not involved
- vLLM version: v0.10.0
- vLLM main:
555e7225bc
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Fix cann related urls in installation doc.
### Does this PR introduce _any_ user-facing change?
The users install cann manually could use the correct url after this pr
### How was this patch tested?
N/A
- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model
Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`
1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```
2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--model vllm-ascend/Qwen3-8B-W4A8 \
--port 8000 \
--quantization ascend \
--tensor_parallel_size 2 \
--enforce-eager
```
- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea
---------
Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
### What this PR does / why we need it?
test vllm_ascend/ops/vocab_parallel_embedding.py contains vocab parallel
embedding forward
CI passed with new added test.
vLLM version: v0.10.0
vLLM main:
2cc571199b
- vLLM version: v0.10.0
- vLLM main:
05cbbe20c5
Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
### What this PR does / why we need it?
Add reminder comment when PR submitted
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test locally:
https://github.com/Yikun/vllm-ascend/pull/51#issuecomment-3132425126
This PR will take effect after this PR merged.
- vLLM version: v0.10.0
- vLLM main:
0e36abf993
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR fixes a tensor shape mismatch in `add_lora_logits`.
Previously, `lora_a_stacked` was passed as shape `[num_loras, in_dim,
rank]`, which does not match the expected einsum pattern `"bi, boi ->
bo"` used in `bgmv_shrink`.
This causes runtime errors like:
RuntimeError: einsum(): subscript i has size 3 for operand 1 which does
not broadcast with previously seen size 4

This fix transposes `lora_a_stacked` and `lora_b_stacked` to match the
expected shapes:
- `lora_a`: `[num_loras, rank, in_dim]`
- `lora_b`: `[num_loras, out_dim, rank]`
All unit tests pass after this fix.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
import torch
import pytest
from unittest.mock import patch, PropertyMock, ANY
from vllm_ascend.lora.punica_wrapper.punica_npu import PunicaWrapperNPU
@pytest.fixture
def wrapper_cpu():
cfg = {"max_num_batched_tokens": 10, "max_batches": 2, "device": "cpu"}
w = PunicaWrapperNPU(**cfg)
w.is_prefill = True
w.no_lora = False
return w
def test_add_lora_logits(wrapper_cpu):
batch_size = 2
hidden_size = 4
lora_rank = 3
vocab_size = 5
y = torch.zeros(batch_size, vocab_size)
x = torch.randn(batch_size, hidden_size)
num_loras = 1
lora_a = torch.randn(num_loras, hidden_size, lora_rank)
lora_b = torch.randn(num_loras, lora_rank, vocab_size)
with patch.object(wrapper_cpu.__class__, "sampler_indices",
new_callable=PropertyMock) as mock_idx:
mock_idx.return_value = torch.zeros(batch_size, dtype=torch.long)
wrapper_cpu.add_lora_logits(y, x, lora_a, lora_b, scale=1.0)
assert y.shape == (batch_size, vocab_size)
assert not torch.allclose(y, torch.zeros_like(y))
Signed-off-by: hongfugui <hongfugui_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Fix test on pyhccl to 2 cards
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0d0cc9e150
Signed-off-by: MengqingCao <cmq0113@163.com>
Refactor Sampler implementation from patch way to inherit from vLLM
Sampler interface.
Next step: Make the op `TopKTopPSampler` in vLLM support custom ops
register mechanism
- vLLM version: v0.10.0
- vLLM main:
61a6905ab0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.
- vLLM version: v0.10.0
- vLLM main:
2cc571199b
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1.Fixed the issue that pyhccl e2e cannot run continuously with other
tests.
2.Cleaned up the resources occupied by the dynamic_npugraph_batchsize
e2e test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This is a e2e test
e2e multi-cards tests local running successfully.
- vLLM version: v0.9.2
- vLLM main:
0df4d9b06b
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add two custom kernels(bgmv_shrink and bgmv expand) to solve the
performance of LoRA
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
we add Unit Test file to test the custom ascendc kernel. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
Based on the actual test of the QWen2.5 7B model using vllm-ascend
version v0.9.2.rc1, the TTFT, TPOT and throughput have increased by
about 70%.
- vLLM version: v0.9.2
- vLLM main:
40d86ee412
---------
Signed-off-by: taoxudonghaha <justsheldon@163.com>
### What this PR does / why we need it?
Bump default python version to 3.11, see #1980
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
pass CI
- vLLM version: v0.10.0
- vLLM main:
12a223ef9b
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
### What this PR does / why we need it?
Currently our workflow run time takes about 3 hours in total, which
seriously affects the developer experience, so it is urgent to have a
optimization, after this pr, It is expected that the running time of the
full CI can be shortened to 1h40min.
- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
a2480251ec
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Support the inference of the Deepseekr1-w8a8-mtp model with
statically-quantized shared_head in MTP layers.
- vLLM version: v0.9.2
- vLLM main:
6eca337ce0
Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
### What this PR does / why we need it?
Pin transformers to fix v0.9.1 doctest
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
doctest passed
- vLLM version: v0.10.0
- vLLM main:
c657369841
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add ut for qwen3_moe.py
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.10.0
- vLLM main:
18cc33dd60
Signed-off-by: huangxialu <huangxialu1@huawei.com>
This PR removes the restriction that TP cannot be greater than 4 in
torchair scenario, because current newest version of CANN has fixed this
bug.
- vLLM version: v0.10.0
- vLLM main:
04ff4be310
Signed-off-by: whx-sjtu <2952154980@qq.com>
Clean up useless import from vllm to make code more clear.
- vLLM version: v0.10.0
- vLLM main:
18cc33dd60
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add ut for files in folder /attention
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
139a7f07bd
---------
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.
### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
A refactoring of forward_context and model_runner_v1, add some context
which is necessary in model inference into forward_context, and refactor
dummy_run logic, make it more reasonable.
Some details for this PR:
Add `ascend_forward_context`;
Update mc2_v2 op, and support `active_mask` param;
Update scripts in examples dir;
refactor `dummy_run` logic;
Add soc_version for A2 and A3;
### Does this PR introduce _any_ user-facing change?
No change at user-facing.
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
57c22e57f9
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
When using deepseek series models generated by the --dynamic parameter,
if torchair graph mode is enabled, we should modify the configuration
file in the CANN package to prevent incorrect inference results.
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Upgrade CANN to 8.2.RC1
### Does this PR introduce _any_ user-facing change?
Yes, A3 image are using 8.2.rc1
### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
de509ae8eb
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Upgrade CANN to 8.2.rc1
Backport: https://github.com/vllm-project/vllm-ascend/pull/1653
### Does this PR introduce _any_ user-facing change?
Yes, docker image will use 8.2.RC1
### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix num_hidden_layers when Qwen2-Audio 7B and #1760 :
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
File "/workspace/test1.py", line 58, in <module>
main(audio_count)
File "/workspace/test1.py", line 38, in main
llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
config = VllmConfig(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
current_platform.check_and_update_config(self)
File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
update_aclgraph_sizes(vllm_config)
File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/1780https://github.com/vllm-project/vllm-ascend/issues/1760https://github.com/vllm-project/vllm-ascend/issues/1276https://github.com/vllm-project/vllm-ascend/issues/359
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed
- vLLM version: v0.9.2
- vLLM main:
7728dd77bb
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Make clean code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Remove transformers installation, The transformers version bug has been
fixed by
e936e401de.
We can safe to remove the version limit now
- vLLM version: v0.9.2
- vLLM main:
40d86ee412
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
this pr is to add ut for qwen2_5_vl_without_padding.py
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
this is only a ut test
- vLLM version: v0.9.2
- vLLM main:
9c8b2c2a8a
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Add uts for files in folder /core
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
5a19a6c670
---------
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
- Fixes image CI
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.9.2
- vLLM main:
f3137cdd81
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Fix duplicate 'torch.' prefix in qwen2-vl, qwen2.5-vl
- vLLM version: v0.9.2
- vLLM main:
dde295a934
### What this PR does / why we need it?
Refactor the comments of `AscendMetaData` to make it clearer.
- vLLM version: v0.9.2
- vLLM main:
f3137cdd81
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
More discussion can be found
[here](https://github.com/ascend-gha-runners/docs/issues/23).
The infra team deployed a internal registry since both `m.daocloud.io`
and `quay.io` suffered a unstable connect quality.
CI will benefit both the connection and download speed by switching to
the internal registry.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
tested locally
- vLLM version: v0.9.2
- vLLM main:
6b46c4b653
---------
Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
### What this PR does / why we need it?
Add some uts for files in folder /multistream
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
b77c7d327f
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
Add some ut for files in folder /distributed
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Add prefix parameter to parent class initialization to avoid parameter
naming conflicts
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
32142b3c62
What this PR does / why we need it?
Add uts for deepseek_v2
Does this PR introduce any user-facing change?
No
How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
f3137cdd81
---------
Signed-off-by: 张帮政 <zhangbangzheng@huawei.com>
Before do attention module refactor, we can do some code cleanup to make
the next step easier.
What this PR does:
1. remove uesless `common_prefix_len` for attention builder
2. remove uesless `is_only_prefill` and `num_input_tokens` in attention
metadata.
3. remove `CommonAttentionMetadata` and ues `query_start_loc` instead,
`CommonAttentionMetadata` is over designed and uesless
4. update the attention backend input parameters to keep the same as
vLLM.
5. Rename attention name to the same style with `ASCEND` prefix
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Change AI Vector core number getting function to glibc ABI free
function. After this PR merged in, there should been no glibc ABI
problems for bump torch version to 2.7.1.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
f59ec35b7f
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add UT for patches in vLLM Ascend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Irrelevant
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>