Commit Graph

64 Commits

Author SHA1 Message Date
leo-pony
e467fe1b77 Add qwen-vl model and sampling feature UT for 310I series (#2168)
### What this PR does / why we need it?
Add qwen-vl model and sampling feature UT for  310I series

- vLLM version: v0.10.0
- vLLM main:
e0f63e4a35

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-08-02 11:26:12 +08:00
weijinqian0
6e00aed4d5 [main][Feature]Moe alltoallv communication optimization for unquantized RL training sence (#2088)
It comes from 0.9.1dev
[0.9.1][Feature]Moe alltoallv communication optimization for unquantized
RL training sence & alltoallv support dpo (#1547)

- vLLM version: v0.10.0
- vLLM main:
97608dc276

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
Signed-off-by: taoxudonghaha <justsheldon@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: TaoYu Chen <ctynb@qq.com>
Co-authored-by: taoxudonghaha <justsheldon@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-08-02 09:49:10 +08:00
22dimensions
9e65da990e [Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132)
### What this PR does / why we need it?

cherry-pick #1501 from 0.9.1-dev to main

Currently, Ray is not compatible with ACL Graph, so we need to fall back
to eager mode when using the Ray backend.

co-authored: Yizhou Liu <liu_yizhou@outlook.com>

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:06:09 +08:00
Icey
86bdde1ca8 Enable pytest and yaml style accuracy test (#2073)
### What this PR does / why we need it?

This PR enabled pytest and yaml style accuracy test, users now can
enable accuracy test by running:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \
          --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md

pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1970

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-31 21:39:13 +08:00
huangxialu
9c9a7cd90b [main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112)
backport of v0.9.1-dev:
https://github.com/vllm-project/vllm-ascend/pull/1902

origin main npu_moe_gating_top_k_softmax:
https://github.com/vllm-project/vllm-ascend/pull/1355

- vLLM version: v0.10.0
- vLLM main:
055bd3978e

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-31 21:05:56 +08:00
Ronald1995
cb0a303080 ut:add e2e test for external launcher (#2091)
### What this PR does / why we need it?
This pr add e2e testcase to make sure initialize LLM by
external_launcher method is ok.

### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:37:42 +08:00
Mengqing Cao
4c8842da65 [BugFix] Fix a bug of running chunked-prefill with torchair. (#1378) (#1844)
This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced
before assignment` when running chunked-prefill with torchair. We should
calculate `decode_hs_or_q_c` whether or not torchair graphics mode is
enabled.

backport of #1378
fix https://github.com/vllm-project/vllm-ascend/issues/1369


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-07-31 20:08:45 +08:00
Ruri
4fcca137a7 [main][Feature] Support Qwen3 W4A8 quantization (#2060)
### What this PR does / why we need it?

Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model

Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
2025-07-30 14:57:14 +08:00
zhangxinyuehfad
6874d666fa [CI]Add e2e test for 310p (#1879)
### What this PR does / why we need it?
Add e2e test for 310p:
trigger conditions:tag, labels(ready-for-test, e2e-310p-test), schedule
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
runner: linux-aarch64-310p-1, linux-aarch64-310p-4
model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base,
Qwen/Qwen2.5-7B-Instruct

- vLLM version: v0.10.0
- vLLM main:
b917da442b

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-30 14:52:16 +08:00
Mengqing Cao
d80b0cca5d [CI] Fix test on pyhccl to 2 cards (#2094)
### What this PR does / why we need it?
Fix test on pyhccl to 2 cards

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0d0cc9e150

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-30 09:08:00 +08:00
leo-pony
4df8e0027c [e2e]Fixed the issue that pyhccl e2e cannot run continuously with other tests (#1246)
### What this PR does / why we need it?
1.Fixed the issue that pyhccl e2e cannot run continuously with other
tests.
2.Cleaned up the resources occupied by the dynamic_npugraph_batchsize
e2e test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a e2e test

e2e multi-cards tests local running successfully.


- vLLM version: v0.9.2
- vLLM main:
0df4d9b06b

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-29 19:38:30 +08:00
taoxudonghaha
540336edc9 Add Custom Kernels For LoRA Performance (#1884)
### What this PR does / why we need it?
Add two custom kernels(bgmv_shrink and bgmv expand) to solve the
performance of LoRA
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
we add Unit Test file to test the custom ascendc kernel. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
Based on the actual test of the QWen2.5 7B model using vllm-ascend
version v0.9.2.rc1, the TTFT, TPOT and throughput have increased by
about 70%.

- vLLM version: v0.9.2
- vLLM main:
40d86ee412

---------

Signed-off-by: taoxudonghaha <justsheldon@163.com>
2025-07-29 19:27:50 +08:00
Li Wang
f60bb474f9 [CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065)
### What this PR does / why we need it?
Currently our workflow run time takes about 3 hours in total, which
seriously affects the developer experience, so it is urgent to have a
optimization, after this pr, It is expected that the running time of the
full CI can be shortened to 1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
a2480251ec

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-29 18:59:05 +08:00
Yikun Jiang
935e9d4c9d Pin transformers to fix v0.9.1 doctest (#2048)
### What this PR does / why we need it?
Pin transformers to fix v0.9.1 doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest passed


- vLLM version: v0.10.0
- vLLM main:
c657369841

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-28 17:51:56 +08:00
zhangxinyuehfad
d1c640841b [Bugfix] Fix num_hidden_layers when Qwen2-Audio 7B (#1803)
### What this PR does / why we need it?
Fix num_hidden_layers when Qwen2-Audio 7B and #1760 :
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
  File "/workspace/test1.py", line 58, in <module>
    main(audio_count)
  File "/workspace/test1.py", line 38, in main
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
  File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
    config = VllmConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
    current_platform.check_and_update_config(self)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
    update_aclgraph_sizes(vllm_config)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
    num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes: https://github.com/vllm-project/vllm-ascend/issues/1780
https://github.com/vllm-project/vllm-ascend/issues/1760
https://github.com/vllm-project/vllm-ascend/issues/1276
https://github.com/vllm-project/vllm-ascend/issues/359

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-26 20:13:00 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopt `LLMDataDist` for kv cache register and `pull_blocks`
style disaggregate prefill implementation. The interface implementation
mainly follows the design of NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be test with the following step:
- Generate the rank table for all machine.
- execute`toy_proxy.py` to launch the disaggregate prefill proxy server,
specify the prefill ip, port and the decode ip, port
- Run the prefill server and decode server.
- send the request to the disaggregate prefill proxy

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
SunnyLee151064
ae560f7131 [Test] Add uts for files in /core (#1957)
### What this PR does / why we need it?

Add uts for files in folder /core

### Does this PR introduce _any_ user-facing change?

No

- vLLM version: v0.9.2
- vLLM main:
5a19a6c670

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-25 09:48:19 +08:00
leo-pony
b5ad70e1a6 [Optimize]Change AI Vector core number getting function to glibc ABI free funcition (#1974)
### What this PR does / why we need it?
Change AI Vector core number getting function to glibc ABI free
function. After this PR merged in, there should been no glibc ABI
problems for bump torch version to 2.7.1.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
f59ec35b7f

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-24 10:00:19 +08:00
Mengqing Cao
3aa3b46bfe [V1][PP] Support pp with ray backend in V1 (#1800)
### What this PR does / why we need it?
Support pipeline parallel with ray backend in V1Engine.

Fixes #1751

### Does this PR introduce _any_ user-facing change?
Users could specify ray as distributed backend when inferencing with pp

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.9.2
- vLLM main:
32142b3c62

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-23 14:52:52 +08:00
rjg-lyh
9a3bdf2162 [main] Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance (#1806)
### What this PR does / why we need it?
Optimizes the performance of the Qwen3 quantization model by registering
a custom model and adding the AddRmsNormQuant operation. Subsequent PRs
will focus on performance optimizations based on this custom model.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-07-22 19:03:13 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there is no
relevant scenarios to use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert is needed to be sliced

This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
leo-pony
2ee90461d0 Fix e2e data parallel test: add resource release code (#1881)
### What this PR does / why we need it?
Fix e2e data parallel test: add resource release code and give more time
to engine to pause their processing loops before exiting.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
5895afd780

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-19 11:39:48 +08:00
Mengqing Cao
574fe407eb [1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841)
### What this PR does / why we need it?
We'll refator `CustomOp` in vllm-ascend from this pr on. 

Use function `CustomOp.register_oot` to achieve the customop registery,
taking `AscendQuickGELU` as an example:
```python
from vllm_ascend.ops.activation import AscendQuickGELU
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```

This is a quick adapt for `CustomOp.register_oot` mechanism from vllm
0.9.2. For further step, we can remove inherit from `QuickGELU` can
write our own `QuickGELU` at all.

Part of https://github.com/vllm-project/vllm-ascend/pull/1647



- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-18 23:07:14 +08:00
Shanshan Shen
8a91e6e59c [Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
wangxiyuan
ef99fe1c54 [Test] Clean up duplicate test for ascend scheduler (#1819)
There are some duplicate tests for ascend scheduler. This PR remove them
to make the test clear.

After this PR. the singlecard e2e cost time is reduced from 47min to
46min.

- vLLM version: v0.9.2
- vLLM main:
1eb2b9c102

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-16 17:57:48 +08:00
Shanshan Shen
f96100fad5 [Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00
wangxiyuan
787010a637 [Test] Remove VLLM_USE_V1 in example and tests (#1733)
V1 is enabled by default, no need to set it by hand now. This PR remove
the useless setting in example and tests

- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 12:49:57 +08:00
wangxiyuan
494b0f474f [CI]Fix broken CI (#1773)
This PR fixed the broken CI. It require
https://github.com/vllm-project/vllm/pull/20900 merged first.

- vLLM version: v0.9.2
- vLLM main:
e8cc53af5e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 00:54:20 +08:00
zhangxinyuehfad
1b4a2f3817 [CI] Add accuracy ci for DP and EP and TP and ETP (#1140)
### What this PR does / why we need it?

Add accuracy ci for DP and EP and TP

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
35514b682a

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-11 17:25:17 +08:00
Pr0Wh1teGivee
d13fb0766e [Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
### Does this PR introduce _any_ user-facing change?
Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature
### How was this patch tested?
e2e & ut

















- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
weiguihua2
aa4240c67f Support pipeline parallel in V1 Engine (#1700)
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine

### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1

### How was this patch tested?
Manully test














- vLLM version: v0.9.2
- vLLM main:
31d5c1797f

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-07-11 15:30:51 +08:00
ttanzhiqiang
ee40d3d850 use npu_moe_gating_top_k_softmax (#1355)
### What this PR does / why we need it?
The optimization solution for non-deepseek select_experts is to replace
gating_topk_softmax with softmax+topk+to, which is optimized from 37us
to 14us on bf16/fp16 of qwen3-235b

- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:55:06 +08:00
Mengqing Cao
cc210f46e6 [AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718)
### What this PR does / why we need it?

Now there is no need to calculate `num_draft_tokens` when allocating
slots.

This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test






- vLLM version: v0.9.2
- vLLM main:
cc876d0f29

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 18:47:45 +08:00
Li Wang
c7446438a9 [1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid some low level error  AMAP.

This pr is one step of #1241, The purpose is make linting system more
clear and convenient, on this step, Mainly did the following things:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO: 
- clang-format(check for csrc with google style)
need clean code, disable for now 
- pymarkdown
need clean code, disable for now 
- shellcheck
need clean code, disable for now 

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new added/existing test.

Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
Shanshan Shen
6af35f60cc [Bugfix][CI] Remove V0 Spec Decode CI (#1656)
### What this PR does / why we need it?

To solve the error in the CI of long term test:

```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```

Replace the hf model with modelscope model.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
2025-07-09 15:53:58 +08:00
wangxiyuan
830332ebfc Clean up v0.9.1 code (#1672)
vllm has released 0.9.2. This PR drop 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
Mengqing Cao
7efa4e92fe [CI] Fix oom in chunk prefill (#1622)
### What this PR does / why we need it?
Add the resource clear logic to fix oom issue when testing
`tests/e2e/singlecard/core/ascend_scheduler`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 10:14:40 +08:00
ApsarasX
c58accc15e [Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)
### What this PR does / why we need it?
Fix the shape of the `npu_moe_init_routing` input parameters to support
aclgraph mode on qwen3-moe

In addition to this PR, resolving the `gatherv3` error might be
necessary. See related PR
https://github.com/vllm-project/vllm-ascend/pull/1297
https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu  for providing the idea

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
Mengqing Cao
dd22ac38b2 [CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly is disabled due to not work now)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
zhangxinyuehfad
4e910186de [CI/UT] Unify model usage via ModelScope in CI (#1207)
### What this PR does / why we need it?
Unify Model Usage via ModelScope

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-04 10:52:17 +08:00
Angazenn
a5f33590d3 [CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allows deepseek to use torchair. Now we
can also use it with pangu. Besides, we add a support model list to
control which type of models that can use torchair.

### How was this patch tested?
We have test it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and model generates answer normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
wangxiyuan
a45dfde283 [CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1.
c1909e7e8c
changed moeConfig init way
2.
48fb076cbc
changed input batch logic.

This PR address these change to vllm-ascend.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
Mengqing Cao
59237ea788 [CI/UT] Add test for chunk prefill and prefix cache on v1/AscendScheduler (#1505)
### What this PR does / why we need it?
Add test for chunked prefill and prefix cache on v1/AscendScheduler

Covered scenarios:
- `Qwen/Qwen3-0.6B-Base` and `deepseek-ai/DeepSeek-V2-Lite-Chat` ---
multicard CI time increased by 19 min
- `V1 + default scheduler` vs `V1 + default scheduler + enable prefix
cache`
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable prefix
cache` vs `V1 + Ascend scheduler + enable prefix cache + enable chunked
prefill`
- `Qwen/Qwen3-0.6B-Base` --- singlecard CI time increased by 8 min
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable chunked
prefill`

should rebase after #1498 and #1446
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:57:03 +08:00
wangxiyuan
641a4e6092 [CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
vllm change
7f280d69c9
break vllm-ascend.

This PR Fix the broken CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone
0e43813120 [ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

Found ci boken by the vllm's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`,
Modify the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
pass ci will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
wangxiyuan
a054f0f4ca [CI] change to new ds model (#1513)
Previous, the DeepSeek V3 Pruning weight is not correct, the moe layer
is not tested. We update a new Pruning model to enable moe layer
compute.

This PR fix the CI to address the new weight.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-30 19:02:29 +08:00
Li Wang
5f8241c25c [V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Change as little existing code as possible to add v1 pooling task's
support, notice that i move down the `vllm.v1.worker.gpu_input_batch` to
vllm-ascend, Considering the frequent changes in upstream interfaces, in
order to decouple, so i move it here
### How was this patch tested?
CI passed with new added/existing test, and I have a simple test was
first conducted locally which is adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, just like
bellow:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
sdmyzlp
53c2d58ae1 Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might be executed with non-compiled mode, despite of
`torchair_graph_config.enabled`, causing multistream mla to fail, which
assumes torchair compiled mode for decode when
`torchair_graph_config.enabled == True`.
Augment that assumption to fix this.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline, and by graph mode mla e2e testcase.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
sharonyunyun
941269a6c5 adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in MLA layer's TP group,to remove
stridedsliced and all_gather in MOE layer.
when tp > 1, It is enabled during the decode phase of the graph mode
when enable_multistream_moe、MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR can bring
3% gain in the decode stage.

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00