### What this PR does / why we need it?
Currently our full workflow takes about 3 hours to run, which seriously
hurts the developer experience, so optimizing it is urgent. After this
PR, the full CI run time is expected to drop to about 1h40min.
- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
a2480251ec
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Pin transformers to fix v0.9.1 doctest
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
doctest passed
- vLLM version: v0.10.0
- vLLM main:
c657369841
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add ut for qwen3_moe.py
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.10.0
- vLLM main:
18cc33dd60
Signed-off-by: huangxialu <huangxialu1@huawei.com>
### What this PR does / why we need it?
Add UTs for files in the /attention folder
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
139a7f07bd
---------
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
vLLM's `RowParallelLinear` forward function executes allreduce and
matmul separately. This patch uses `torch_npu.npu_mm_all_reduce_base` to
execute the matmul and allreduce as a single fused kernel, which gains
about a 20% performance improvement in eager mode.
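A minimal sketch of the difference, assuming `torch_npu.npu_mm_all_reduce_base(x1, x2, hcomm, reduce_op=..., bias=...)` semantics (the exact signature may vary by torch_npu version; function names here are illustrative):
```python
import torch
import torch.distributed as dist
import torch_npu  # Ascend PyTorch adapter


def forward_separate(x, weight, group, bias=None):
    # Baseline: matmul, then a separate all-reduce kernel launch
    out = torch.matmul(x, weight.t())
    dist.all_reduce(out, group=group)
    return out if bias is None else out + bias


def forward_fused(x, weight, hcomm_info, bias=None):
    # Fused path: matmul + all-reduce in a single kernel
    return torch_npu.npu_mm_all_reduce_base(
        x, weight.t(), hcomm_info, reduce_op="sum", bias=bias)
```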
### Does this PR introduce _any_ user-facing change?
This PR introduces a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether the feature is enabled.
### How was this patch tested?
The patch is tested by adding a new test file `test_patch_linear.py` to
guard the patch with UTs.
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
A refactoring of forward_context and model_runner_v1: add context that
is needed during model inference into forward_context, and refactor the
dummy_run logic to make it more reasonable.
Some details for this PR (see the sketch after this list):
- Add `ascend_forward_context`;
- Update the mc2_v2 op and support the `active_mask` param;
- Update scripts in the examples dir;
- Refactor the `dummy_run` logic;
- Add soc_version for A2 and A3.
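A minimal sketch of the forward-context idea, not the actual implementation (field names are illustrative):
```python
from contextlib import contextmanager

_ascend_forward_context = {}


@contextmanager
def set_ascend_forward_context(attn_metadata, num_tokens, with_prefill):
    # Stash per-step inference state so ops can read it during forward,
    # including during dummy_run warm-up passes.
    _ascend_forward_context.update(attn_metadata=attn_metadata,
                                   num_tokens=num_tokens,
                                   with_prefill=with_prefill)
    try:
        yield
    finally:
        _ascend_forward_context.clear()
```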
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
57c22e57f9
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Fix missing `num_hidden_layers` when running Qwen2-Audio 7B (#1760):
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
File "/workspace/test1.py", line 58, in <module>
main(audio_count)
File "/workspace/test1.py", line 38, in main
llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
config = VllmConfig(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
current_platform.check_and_update_config(self)
File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
update_aclgraph_sizes(vllm_config)
File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```
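A minimal sketch of a defensive lookup that avoids the crash (the helper name and fallback are illustrative, not the exact fix):
```python
def get_num_hidden_layers(hf_config) -> int:
    # Multimodal configs such as Qwen2AudioConfig keep the LLM settings in
    # a nested sub-config, so fall back to text_config when the top-level
    # attribute is missing.
    if hasattr(hf_config, "num_hidden_layers"):
        return hf_config.num_hidden_layers
    text_config = getattr(hf_config, "text_config", None)
    if text_config is not None:
        return getattr(text_config, "num_hidden_layers", 1)
    return 1  # conservative default so graph-size tuning still proceeds
```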
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/1780, https://github.com/vllm-project/vllm-ascend/issues/1760, https://github.com/vllm-project/vllm-ascend/issues/1276, https://github.com/vllm-project/vllm-ascend/issues/359
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed
- vLLM version: v0.9.2
- vLLM main:
7728dd77bb
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR adds a UT for qwen2_5_vl_without_padding.py.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This is a UT-only change.
- vLLM version: v0.9.2
- vLLM main:
9c8b2c2a8a
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Add UTs for files in the /core folder
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
5a19a6c670
---------
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
Add some UTs for files in the /multistream folder
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
b77c7d327f
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
Add some UTs for files in the /distributed folder
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
Add UTs for deepseek_v2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
f3137cdd81
---------
Signed-off-by: 张帮政 <zhangbangzheng@huawei.com>
Before doing the attention module refactor, we can do some code cleanup
to make the next step easier.
What this PR does:
1. Remove the useless `common_prefix_len` for the attention builder.
2. Remove the useless `is_only_prefill` and `num_input_tokens` in
attention metadata.
3. Remove `CommonAttentionMetadata` and use `query_start_loc` instead;
`CommonAttentionMetadata` is over-designed and useless.
4. Update the attention backend input parameters to keep them the same
as vLLM's.
5. Rename the attention backends to the same style with an `ASCEND`
prefix.
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Change the function that gets the AI Vector core number to a
glibc-ABI-free function. After this PR is merged, there should be no
glibc ABI problems when bumping the torch version to 2.7.1.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
f59ec35b7f
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add UT for patches in vLLM Ascend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Irrelevant
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Support pipeline parallel with ray backend in V1Engine.
Fixes #1751
### Does this PR introduce _any_ user-facing change?
Users can specify ray as the distributed backend when running inference
with pipeline parallelism.
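For example, a hedged usage sketch (model name and sizes are illustrative):
```python
from vllm import LLM

# Pipeline parallelism with the ray distributed backend
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",
)
```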
### How was this patch tested?
CI passed with new added test.
- vLLM version: v0.9.2
- vLLM main:
32142b3c62
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Optimizes the performance of the Qwen3 quantization model by registering
a custom model and adding the AddRmsNormQuant operation. Subsequent PRs
will focus on performance optimizations based on this custom model.
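For reference, a pure-PyTorch sketch of the semantics that an AddRmsNormQuant op fuses into one kernel (names and the int8 scheme here are illustrative):
```python
import torch


def add_rms_norm_quant(x, residual, weight, scale, eps=1e-6):
    # Residual add + RMSNorm + quantization, expressed unfused
    x = x + residual
    norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight
    quant = torch.clamp(torch.round(norm / scale), -128, 127).to(torch.int8)
    return quant, x  # quantized activation and the new residual
```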
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
According to issue
https://github.com/vllm-project/vllm-ascend/issues/1298, this pull
request adds unit test code for schedule_config.py.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2
### What this PR does / why we need it?
Use a base test class to avoid patching everywhere.
Followup here: https://github.com/vllm-project/vllm-ascend/pull/1566
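A minimal sketch of the base-test pattern (the patch target here is illustrative):
```python
import unittest
from unittest import mock


class TestBase(unittest.TestCase):
    """Shared base so individual test files no longer patch by hand."""

    def setUp(self):
        # The real base test patches whatever platform/device hooks are
        # needed to run on machines without NPUs.
        patcher = mock.patch("vllm_ascend.utils.is_310p", return_value=False)
        patcher.start()
        self.addCleanup(patcher.stop)


class TestSomething(TestBase):
    def test_case(self):
        ...  # runs with the shared patches already in place
```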
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT CI passed
- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
There is a lot of torchair-specific logic in common code, which makes
the code hard to maintain. We will create a new torchair module to keep
torchair-related logic there. I plan to add 4 PRs:
1. Refactor worker
2. Refactor utils (this PR)
- a simple change that moves all torchair-related util functions to the
torchair module
3. Refactor model_runner
4. Refactor attention
- vLLM version: v0.9.2
- vLLM main:
8188196a1c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is a lot of torchair-specific logic in common code, which makes
the code hard to maintain. We will create a new torchair module to keep
torchair-related logic there. I plan to add 4 PRs:
1. Refactor worker (this PR)
- create the torchair module and move torchair-related code in the
worker to the new module
2. Refactor utils
3. Refactor model_runner
4. Refactor attention
- vLLM version: v0.9.2
- vLLM main:
8188196a1c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Remove ETP/EP maintained in the main branch. We drop this as there are
no relevant scenarios using ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert needs to be sliced.
This is part of the #1422 backport.
Fixes: https://github.com/vllm-project/vllm-ascend/issues/1396, https://github.com/vllm-project/vllm-ascend/issues/1154
### Does this PR introduce _any_ user-facing change?
We'll no longer maintain etp/ep in vllm-ascend, and will use the tp/ep
in vllm instead.
### How was this patch tested?
CI passed with new added and existing test.
- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a
Signed-off-by: MengqingCao <cmq0113@163.com>
vLLM commit
752c6ade2e
removed `blocksparse_params` for the attention backend. This PR makes
the same change to keep CI happy.
- vLLM version: v0.9.2
- vLLM main:
9499e26e2a
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix the e2e data parallel test: add resource release code and give the
engines more time to pause their processing loops before exiting.
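A minimal sketch of the teardown pattern (names are illustrative, not the exact test code):
```python
import gc
import time


def shutdown_engines(engines):
    # Stop each engine's processing loop, wait briefly for the loops to
    # pause, then release held resources before the process exits.
    for engine in engines:
        engine.shutdown()
    time.sleep(1.0)
    engines.clear()
    gc.collect()
```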
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.9.2
- vLLM main:
5895afd780
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
maybe fixes
[#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Test Qwen3-32B tp=4 with:
```bash
vllm serve --port 1234 Qwen/Qwen3-32B \
--served-model-name Qwen3-32B \
--tensor-parallel-size 4 \
--swap-space 16 \
--max-model-len 6000 \
--load-format dummy \
--disable-log-stats \
--disable-log-requests
```
Requests: batch_size=128, input/output tokens=1024.
**In 0.9.2rc1**
```text
=====================================================
Total TPS with prefill(tokens/s) : 785.1395
Total TPS without prefill : 846.6809
Mean TPS with prefill : 6.1339
Mean TPS without prefill : 6.6147
=====================================================
Mean TTFT(ms) : 10307.8123
Max TTFT(ms) : 21423.0733
Min TTFT(ms) : 362.3602
=====================================================
Mean TPOT(ms) : 151.3051
Max TPOT(ms) : 159.4649
Min TPOT(ms) : 140.899
=====================================================
Total Time(s) : 175.6032
Request Throughput(requests/s) : 0.7289
=====================================================
```
**Apply this PR**
```text
=====================================================
Total TPS with prefill(tokens/s) : 811.0014
Total TPS without prefill : 876.4423
Mean TPS with prefill : 6.3359
Mean TPS without prefill : 6.8472
=====================================================
Mean TTFT(ms) : 10263.8382
Max TTFT(ms) : 21151.2547
Min TTFT(ms) : 375.9136
=====================================================
Mean TPOT(ms) : 146.1686
Max TPOT(ms) : 154.0957
Min TPOT(ms) : 136.8879
=====================================================
Total Time(s) : 169.8579
Request Throughput(requests/s) : 0.7536
=====================================================
```
The TPOT performance gap between these two sets of data is about 3%.
- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33
Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
### What this PR does / why we need it?
We'll refactor `CustomOp` in vllm-ascend starting from this PR.
Use the function `CustomOp.register_oot` to register the custom op,
taking `AscendQuickGELU` as an example:
```python
from vllm_ascend.ops.activation import AscendQuickGELU
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```
This is a quick adaptation to the `CustomOp.register_oot` mechanism from
vllm 0.9.2. As a further step, we can remove the inheritance from
`QuickGELU` and write our own `QuickGELU` entirely (see the sketch
below).
Part of https://github.com/vllm-project/vllm-ascend/pull/1647
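A hedged sketch of what such a standalone op could look like, assuming vLLM's `CustomOp.forward_oot` out-of-tree hook:
```python
import torch
from vllm.model_executor.custom_op import CustomOp


class AscendQuickGELU(CustomOp):
    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # QuickGELU: x * sigmoid(1.702 * x), written directly instead of
        # inheriting from vLLM's QuickGELU.
        return x * torch.sigmoid(1.702 * x)
```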
- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add tests for the func wrapper file.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added test.
- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33
Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
There are some duplicate tests for the ascend scheduler. This PR removes
them to make the tests clearer.
After this PR, the singlecard e2e cost time is reduced from 47 min to
46 min.
- vLLM version: v0.9.2
- vLLM main:
1eb2b9c102
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
V1 is enabled by default, so there is no need to set it by hand now.
This PR removes the useless setting from the examples and tests.
- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add accuracy CI for DP, EP, and TP
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
35514b682a
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
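For reference, a pure-PyTorch sketch of the top-k + top-p filtering being optimized (the PR's fast path may differ):
```python
import torch


def apply_top_k_top_p(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    # Top-k: mask everything below the k-th largest logit
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: drop the tail once cumulative probability exceeds p
    sorted_logits, idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    tail = probs.cumsum(dim=-1) - probs > p
    sorted_logits = sorted_logits.masked_fill(tail, float("-inf"))
    return logits.scatter(-1, idx, sorted_logits)
```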
### Does this PR introduce _any_ user-facing change?
Use `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` to enable this feature.
### How was this patch tested?
E2E & UT
- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine
### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1
### How was this patch tested?
Manually tested
- vLLM version: v0.9.2
- vLLM main:
31d5c1797f
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The optimization for non-DeepSeek `select_experts` is to replace
`gating_topk_softmax` with softmax+topk+to, which reduces the op from
37us to 14us on bf16/fp16 for qwen3-235b.
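A minimal sketch of the softmax+topk+to path (shapes and dtypes are illustrative):
```python
import torch


def select_experts(router_logits: torch.Tensor, top_k: int):
    # softmax over experts, topk, then a dtype cast (the "+to" step)
    scores = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    return topk_weights.to(router_logits.dtype), topk_ids.to(torch.int32)
```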
- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
### What this PR does / why we need it?
Now there is no need to calculate `num_draft_tokens` when allocating
slots.
This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test
- vLLM version: v0.9.2
- vLLM main:
cc876d0f29
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml
Enable pre-commit to avoid low-level errors as much as possible.
This PR is one step of #1241; the purpose is to make the linting system
clearer and more convenient. This step mainly enables the following
hooks: yapf, actionlint, ruff, typos, isort, mypy, png-lint,
signoff-commit, enforce-import-regex-instead-of-re.
TODO:
- clang-format(check for csrc with google style)
need clean code, disable for now
- pymarkdown
need clean code, disable for now
- shellcheck
need clean code, disable for now
### Does this PR introduce _any_ user-facing change?
Only developer UX change:
https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally
```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```
### How was this patch tested?
CI passed with new added/existing test.
Co-authored-by: Yikun <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.
The root cause of this problem is the use of in-place multiplication.
Modifying it to use out-of-place multiplication will resolve the
accuracy problem.
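A minimal sketch of the in-place vs out-of-place difference (shapes and names are illustrative):
```python
import torch

attn_mask_cache = torch.ones(128, 128)  # shared across requests


def get_mask_buggy(length: int, scale: float) -> torch.Tensor:
    # BUG: mul_ writes through the view and corrupts the shared cache,
    # breaking later requests that reuse it.
    return attn_mask_cache[:length, :length].mul_(scale)


def get_mask_fixed(length: int, scale: float) -> torch.Tensor:
    # Out-of-place multiply returns a fresh tensor; the cache is untouched.
    return attn_mask_cache[:length, :length] * scale
```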
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Yes.
- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b
---------
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
To solve this error in the long-term-test CI:
```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```
Replace the hf model with modelscope model.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
vLLM has released 0.9.2. This PR drops 0.9.1 support.
- vLLM version: v0.9.1
- vLLM main:
b942c094e3
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Sometimes loading the torchair cache fails because of fluctuating NPU
memory, so this PR adds a new cache that saves the old kv cache bytes to
avoid a possible crash while loading the torchair graph cache.
2. When caching is enabled but the cache does not exist yet, the first
compilation introduces Dynamo guard overhead. In this case we compile
directly twice up front to skip the guards (this brings 3-4 ms of TPOT
optimization).
### Does this PR introduce _any_ user-facing change?
Adds a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control the kv cache floating tolerance (see the sketch below).
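A minimal sketch of how such a tolerance check could work (the default value and helper name are illustrative):
```python
import os

# Default of 512 MB is an assumption for illustration only
TOLERANCE_MB = int(os.getenv(
    "VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE", "512"))


def kv_cache_matches(cached_bytes: int, current_bytes: int) -> bool:
    # Accept the cached torchair graph when the kv cache size drifted by
    # no more than the configured tolerance.
    return abs(cached_bytes - current_bytes) <= TOLERANCE_MB * 1024 * 1024
```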
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
1fd471e957
Signed-off-by: boying <897013703@qq.com>
### What this PR does / why we need it?
Add ut for test_pooling_model_runner.py
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
python -m unittest test_pooling_model_runner.py
- vLLM version: v0.9.1
- vLLM main:
2e610deb72
---------
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>