xc-llm-ascend

Author	SHA1	Message	Date
Icey	2a9d02e080	[Bugfix] eagle and eagle3 spec decode failures and enable e2e test (#2979 ) ### What this PR does / why we need it? - Fix the bug https://github.com/vllm-project/vllm-ascend/issues/2978 - Enable e2e test, - Adapt to scenarios where Speculative tokens are greater than 2, - Fix the bug that causes Eagle3 inference failures under high concurrency and improve the acceptance rate of draft models, by https://github.com/vllm-project/vllm-ascend/pull/2794 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with new added/existing test. Co-authored-by: hukongyi [hukongyi@cmbchina.com](mailto:hukongyi@cmbchina.com) Co-authored-by: guanyuzhu [zhuguanyu@huawei.com](mailto:zhuguanyu@huawei.com) Co-authored-by: liumail680 [liumail680@163.com](mailto:liumail680@163.com) - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-25 14:39:12 +08:00
Li Wang	12bcbd02bb	[CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907 ) ### What this PR does / why we need it? 1. This pr bump vllm commit to `6d8246aaff` 2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548 abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable 3. fix metadata_builder changes introduced by https://github.com/vllm-project/vllm/pull/23693 4. fix `structured_outputs_config` changes introduced by https://github.com/vllm-project/vllm/pull/22772 5. fix `moe_config` changes introduced by https://github.com/vllm-project/vllm/pull/22537 Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> - vLLM version: v0.10.2 - vLLM main: `c60e6137f0` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-09-20 17:37:57 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
xuyexiong	6681dde902	[Feat][Graph] Support MTP for ACL Graph (#2932 ) ### What this PR does / why we need it? This PR depends on the merge of #2707 and has adapted the aclgraph functionality to support MTP. ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `2b85697031` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-18 14:05:33 +08:00
wangxiyuan	382c29f3e1	[BugFix] Fix world size bug in model_runner (#2915 ) - Fix world size bug in model_runner to make sure ep>16 runs with MC2 - enable e2e test for vl Co-Authored-By: whx-sjtu <2952154980@qq.com> Co-Authored-By: Icey <1790571317@qq.com> - vLLM version: v0.10.2 - vLLM main: `3e903b6cb4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-14 12:20:25 +08:00
Jiawei Li	e57cca971c	Fix the bugs about operator registration by PyTorch Dispatcher (#2786 ) Background: There are two principles about operator registration in PyTorch - The same namespace can be only registered once by `TORCH_LIBRARY` - The operator signatures can be only registered once by `def` Considering that all custom operators defined in the current repo are only used by Ascend, instead of defining a common operator schema by vLLM, all accelerators then follow this operator schema and complete the implementation based on their respective hardware, which is conducive to functional abstraction. Therefore, we can rename the operator registration namespace to an Ascend-specific namespace(_C_ascend). Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742 - vLLM version: main - vLLM main: `f592b3174b` Signed-off-by: FFFrog <ljw1101.vip@gmail.com>	2025-09-13 11:58:52 +08:00
无脸男	c3c2221503	[Feat]support dynamic quantization in allgather (#2841 ) ### What this PR does / why we need it? [Feat]support dynamic quantization in allgather ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: main - vLLM main: `5931b7e5d9` Signed-off-by: withHades <244036962@qq.com> Signed-off-by: WithHades <244036962@qq.com>	2025-09-11 18:47:20 +08:00
jiangpeng	2b9269b581	[Perf][V1] Fully overlap model execution (#2783 ) This PR is based on top of [#23569](https://github.com/vllm-project/vllm/pull/23569) and [#24219](https://github.com/vllm-project/vllm/pull/24219). ### What this PR does / why we need it? This PR allows the model runner to function asynchronously when using async scheduling. This allows full overlap of the cpu operations (including prepare_inputs) and the model forward pass. This diff is functional and does not support speculative decoding, PP, or guided decoding. Expected speedup is 5-10% over the current async scheduling. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? server ``` python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\ --trust-remote-code --enforce-eager \ --distributed-executor-backend=mp \ -tp=4 \ --port 8006 \ --max-model-len 32000 \ --block-size 128 \ --gpu-memory-utilization 0.99 ``` client ``` python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \ --dataset-name random --random-input-len 2048 --random-output-len 2048 \ --ignore-eos\ --num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \ --metric-percentiles 90 --base-url http://localhost:8006 --save-result \ --result-dir $PROFILER_DIR ``` benchmark test based on Qwen3-32B TPOT result: \|\|forward async\| scheduler async \|sync\| \|-\|-\|-\|-\| \|avg\|41.73\|41.86\|44.20\| \|improve0\|0.3%\|0\|0\| \|improve1\|5.58%\|0\|0\| benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result: \|\|forward async\|sync\| \|-\|-\|-\| \|avg\|23.22\|29.16\| \|improve\|20.3%\|0\| - vLLM version: main - vLLM main: `e93f4cc9e3` Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com> Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com> Co-authored-by: Ronald1995 <ronaldautomobile@163.com>	2025-09-11 16:35:36 +08:00
weichen	a041d4f328	[main] [refactor] refactor common_fused_moe.py (#2706 ) ### What this PR does / why we need it? 1. Move prepare/finalize operation from moe_comm_method to /ops/moe/fused_moe_prepare_and_finalize 2. Adapt to token_dispatcher in moe_comm_method 3. Move moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize to /ops/moe ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: weichen <calvin_zhu0210@outlook.com> Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-08 20:09:50 +08:00
1092626063	5b3646ab21	[FEATURE][MTP] Support MTP > 1 (#2708 ) ### What this PR does / why we need it? [RFC：Support MTP > 1 for DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745) - [x] dp1 tp16 - [x] dp4 tp4 - [x] dp2 tp 8 - [x] torchair graph - vLLM version: v0.10.1.1 - vLLM main: `c9f7081f9c` Signed-off-by: 1092626063 <1092626063@qq.com>	2025-09-05 09:11:22 +08:00
sherie	f86596a66c	allgather use fusedop. (#2689 ) ### What this PR does / why we need it? Use 'npu_moe_init_routing_v2' and 'npu_moe_token_unpermute' to replace 'npu_moe_init_routing', 'npu_moe_compute_expert_tokens', and 'npu_moe_finalize_routing' to optimize performance ### Does this PR introduce _any_ user-facing change? \| branch\| tps\| TTFT \|TPOT \| \| --- \| --- \| --- \|--- \| \|main \|733.98 \| 280.05 \|34.30 \| \|main+fusedop \| 740.33 \| 273.34 \|33.99 \| ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-09-04 11:56:29 +08:00
wangxiyuan	24d4dad7b2	[CI] Enable MTP torchair e2e test (#2705 ) enable MTP torchair e2e test - vLLM version: v0.10.1.1 - vLLM main: `ce30dca5c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 08:57:43 +08:00
wangxiyuan	0829b4873f	[CI] recover e2e test (#2688 ) 1. recover the skipped test. 2. remove pangu eager mode test, it's tested by torchair mode already. 3. skip pangu test until the bug is fixed. - vLLM version: v0.10.1.1 - vLLM main: `56d04089ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 18:49:17 +08:00
xuyexiong	214b32a346	[V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632 ) ### What this PR does / why we need it? Fix MTP torchair bug caused by torchair refactor and moe refactor Depends on PRs: fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 torchair multi DP fix: https://github.com/vllm-project/vllm-ascend/pull/2626 ### Does this PR introduce _any_ user-facing change? When DP is enabled, running the MTP online server requires disabling server log stats with `--disable-log-stats`, because the current metrics do not support multi-DP. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7c8271cd1e` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-02 11:12:41 +08:00
wangxiyuan	fef18b60bc	Refactor e2e CI (#2276 ) Refactor E2E CI to make it clear and faster 1. remove some useless e2e tests 2. remove some useless functions 3. Make sure all tests run with VLLMRunner to avoid oom error 4. Make sure all ops tests end with torch.empty_cache to avoid oom error 5. run the tests one by one to avoid resource limit error - vLLM version: v0.10.1.1 - vLLM main: `a344a5aa0a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 09:02:22 +08:00
weichen	320edde2df	[main] [refactor] refactor fused_moe.py to enable token_dispatchers (#2570 ) ### What this PR does / why we need it? Enable token_dispatcher to replace fused_experts_with_xxx in eager mode ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: `704432af3c` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: sherie <963372609@qq.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com> Co-authored-by: shiyuan680 <72335504+shiyuan680@users.noreply.github.com>	2025-08-28 10:13:35 +08:00
wangxiyuan	f22077daa6	[Embedding] Recover embedding function (#2483 ) Fix broken embedding function. It's broken by http://github.com/vllm-project/vllm/pull/23162 - vLLM version: v0.10.1.1 - vLLM main: `efc88cf64a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-27 09:22:01 +08:00
s30076806	6a4ec186e7	[Qwen-moe] Remove the minor operation arange (#2373 ) ### What this PR does / why we need it? Integrate the arange operator to reduce the time spent and improve performance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `56dcf4e7e9` --------- Signed-off-by: s30076806 <songjiayang2@h-partners.com>	2025-08-27 09:13:31 +08:00
Shanshan Shen	0767d51dd5	[Structured Output][CI] Add test for `outlines` backend for structured output in CI (#2283 ) ### What this PR does / why we need it? Add test for `outlines` backend for structured output in CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests have all passed with: ```bash pytest -sv tests/e2e/singlecard/test_guided_decoding.py ``` - vLLM version: v0.10.0 - vLLM main: `53415653ff` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-25 09:59:13 +08:00
linfeng-yuan	4af5b80606	[Scheduler] validate max_num_batched_tokens and max_model_len in AscendSchedulerConfig (#2434 ) ### What this PR does / why we need it? Add configuration check logic for ascend scheduler: if chunked_prefill is disabled, max_num_batched_tokens couldn't be less than max_model_len, following vLLM; ### Does this PR introduce _any_ user-facing change? users cannot set max_num_batched_tokens smaller than max_model_len with ascend scheduler ### How was this patch tested? CI and vllm serving passed - vLLM version: v0.10.0 - vLLM main: `f77a0802b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-23 19:39:44 +08:00
ZhaoJiangJiang	3629bc4431	feat: add mtp ut and fix some bugs (#2453 ) ### What this PR does / why we need it? Fix mtp mode ut ### Does this PR introduce _any_ user-facing change? Nothing ### How was this patch tested? This can be tested in the same way as a unit test. - vLLM version: v0.10.0 - vLLM main: `53415653ff` Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com> Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>	2025-08-22 17:09:08 +08:00
Mengqing Cao	60ac4fb576	[QuickFix] Skip failed ut to recover CI quickly (#2484 ) ### What this PR does / why we need it? Skip failed UTs to recover CI quickly. Related UTs: - `test_embed_models_correctness`: revert me when pooler is adapted with the latest vllm main - `test_check_and_update_config_enforce_eager_mode`: revert me when the occasional failure is fixed - vLLM version: v0.10.0 - vLLM main: `8896eb72eb` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-22 14:14:51 +08:00
Mengqing Cao	1327f9be1c	Fix some ci issue and refactor modelrunner (#2445 ) ### What this PR does / why we need it? Fix some ci issue and refactor modelrunner ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `4d9c61993a` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-20 09:01:04 +08:00
xleoken	2a763b8326	[Bug] Fix bug in test_chunked.py (#1992 ) ### What this PR does / why we need it? 1. Remove the return statement, it will always skip following logic. 2. Update `deepseek` to `Qwen2.5-Instruct` for OOM in github e2e test env. 3. Fix the comparison logic ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Local Test. - vLLM version: v0.10.0 - vLLM main: `0933f9d518` Signed-off-by: xleoken <xleoken@163.com>	2025-08-19 10:23:47 +08:00
shiyuan680	e14f2ef669	refactor select_experts of moe module (#2150 ) ### What this PR does / why we need it? This PR refactors select_experts of the MoE module. It merges the implementations of the quantized and non-quantized methods into a new class, used in a vLLM-like way as ExpertsSelector.select_experts ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? test in qwen3-moe and all ut. - vLLM version: v0.10.0 - vLLM main: `e18859298d` Signed-off-by: yangcheng <yangcheng104@huawei.com> Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>	2025-08-14 11:50:53 +08:00
whx	29aaba5f84	[Perf][MTP] Optimize reject sampler in greedy situation. (#2137 ) This PR port optimization in PR #2002 to main and makes it cleaner. - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-08-11 17:37:49 +08:00
Pleaplusone	c0f0b70813	[core] Support capture custom ops into aclgraph (#2113 ) ### What this PR does / why we need it? Thanks to PR https://github.com/vllm-project/vllm-ascend/pull/426, vllm-ascend supports aclgraph inference to reduce host overhead. However, the capability of aclgraph strongly relies on the functionality provided by `torch.compile`, which is the key feature supported in torch 2.x. Therefore, capturing a custom op into aclgraph is only possible when it can be recognized and captured by `torch.compile`. In this PR, we register the meta implementation of the current custom ops to enable fx graph capture. With that in place, inserting those custom ops into aclgraph becomes natural for the Ascend runtime. ### Does this PR introduce _any_ user-facing change? No user-facing change. ### How was this patch tested? Tested in unit tests; we will integrate the `rotary_embedding` op into a small custom model and use `torch.compile` and aclgraph to capture and replay it to verify its functionality. - vLLM version: v0.10.0 - vLLM main: `1b99028069` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-11 15:59:42 +08:00
wangxiyuan	9260910c8d	[CI] Fix broken CI (#2302 ) 1. disable test_eagle_ccorrectness test, we'll reopen it once the oom error is fixed. 2. drop transformers version limit for main, since vLLM relies on >=4.55.0, see: `65552b476b` 3. fix kv_connector_output bug, see: `796bae07c5` - vLLM version: v0.10.0 - vLLM main: `d1af8b7be9` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 11:22:32 +08:00
Icey	0bd5ff5299	Fix accuracy test config and add DeepSeek-V2-Lite test (#2261 ) ### What this PR does / why we need it? This PR fix accuracy test related to https://github.com/vllm-project/vllm-ascend/pull/2073, users can now perform accuracy tests on multiple models simultaneously and generate different report files by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/models/configs/accuracy.txt ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? <img width="1648" height="511" alt="image" src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd" /> - vLLM version: v0.10.0 - vLLM main: `766bc8162c` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-08 11:09:16 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and cleanup infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837), this patch depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ install depending: ``` pip3 install pyyaml pip3 install setuptools ``` install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
leo-pony	e467fe1b77	Add qwen-vl model and sampling feature UT for 310I series (#2168 ) ### What this PR does / why we need it? Add qwen-vl model and sampling feature UT for 310I series - vLLM version: v0.10.0 - vLLM main: `e0f63e4a35` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-02 11:26:12 +08:00
22dimensions	9e65da990e	[Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132 ) ### What this PR does / why we need it? cherry-pick #1501 from 0.9.1-dev to main Currently, Ray is not compatible with ACL Graph, so we need to fall back to eager mode when using the Ray backend. co-authored: Yizhou Liu <liu_yizhou@outlook.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:06:09 +08:00
Icey	86bdde1ca8	Enable pytest and yaml style accuracy test (#2073 ) ### What this PR does / why we need it? This PR enabled pytest and yaml style accuracy test, users now can enable accuracy test by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \ --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1970 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-31 21:39:13 +08:00
huangxialu	9c9a7cd90b	[main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112 ) backport of v0.9.1-dev: https://github.com/vllm-project/vllm-ascend/pull/1902 origin main npu_moe_gating_top_k_softmax: https://github.com/vllm-project/vllm-ascend/pull/1355 - vLLM version: v0.10.0 - vLLM main: `055bd3978e` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-31 21:05:56 +08:00
zhangxinyuehfad	6874d666fa	[CI]Add e2e test for 310p (#1879 ) ### What this PR does / why we need it? Add e2e test for 310p: trigger conditions：tag, labels(ready-for-test, e2e-310p-test), schedule image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10 runner: linux-aarch64-310p-1, linux-aarch64-310p-4 model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base, Qwen/Qwen2.5-7B-Instruct - vLLM version: v0.10.0 - vLLM main: `b917da442b` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-30 14:52:16 +08:00
taoxudonghaha	540336edc9	Add Custom Kernels For LoRA Performance (#1884 ) ### What this PR does / why we need it? Add two custom kernels(bgmv_shrink and bgmv expand) to solve the performance of LoRA ### Does this PR introduce _any_ user-facing change? no user-facing change ### How was this patch tested? we add Unit Test file to test the custom ascendc kernel. See vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py Based on the actual test of the QWen2.5 7B model using vllm-ascend version v0.9.2.rc1, the TTFT, TPOT and throughput have increased by about 70%. - vLLM version: v0.9.2 - vLLM main: `40d86ee412` --------- Signed-off-by: taoxudonghaha <justsheldon@163.com>	2025-07-29 19:27:50 +08:00
Li Wang	f60bb474f9	[CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065 ) ### What this PR does / why we need it? Currently our workflow takes about 3 hours in total, which seriously affects the developer experience, so optimization is urgent. After this PR, the running time of the full CI is expected to drop to 1h40min. - Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB) - Change TP4 ---> TP2 * 2 max-parallel - Move DeepSeek-V2-Lite-W8A8 to single card test ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `a2480251ec` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-29 18:59:05 +08:00
zhangxinyuehfad	d1c640841b	[Bugfix] Fix num_hidden_layers when Qwen2-Audio 7B (#1803 ) ### What this PR does / why we need it? Fix num_hidden_layers when Qwen2-Audio 7B and #1760 ： ``` INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode Traceback (most recent call last): File "/workspace/test1.py", line 58, in <module> main(audio_count) File "/workspace/test1.py", line 38, in main llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct", File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__ self.llm_engine = LLMEngine.from_engine_args( File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args vllm_config = engine_args.create_engine_config(usage_context) File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config config = VllmConfig( File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__ s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s) File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__ current_platform.check_and_update_config(self) File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config update_aclgraph_sizes(vllm_config) File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__ return super().__getattribute__(key) AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers' ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? 
Closes: https://github.com/vllm-project/vllm-ascend/issues/1780 https://github.com/vllm-project/vllm-ascend/issues/1760 https://github.com/vllm-project/vllm-ascend/issues/1276 https://github.com/vllm-project/vllm-ascend/issues/359 - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-26 20:13:00 +08:00
Yikun Jiang	17a430f7b8	Upgrade vLLM to v0.10.0 (#1927 ) ### What this PR does / why we need it? - Upgrade to v0.10.0 - Drop v0.9.2 version compatibility - Add patch for `vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py` as workaround of `f3a683b7c9` for v0.10.0 and also add e2e test `test_models_prompt_logprobs` - Pin transformers<4.54.0 as workaround of https://github.com/vllm-project/vllm-ascend/issues/2034 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test locally: `VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs` - CI passed - vLLM version: v0.9.2 - vLLM main: `7728dd77bb` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 15:43:29 +08:00
SunnyLee151064	ae560f7131	[Test] Add uts for files in /core (#1957 ) ### What this PR does / why we need it? Add uts for files in folder /core ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `5a19a6c670` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-25 09:48:19 +08:00
leo-pony	b5ad70e1a6	[Optimize]Change AI Vector core number getting function to glibc ABI free function (#1974 ) ### What this PR does / why we need it? Change AI Vector core number getting function to glibc ABI free function. After this PR is merged, there should be no glibc ABI problems when bumping the torch version to 2.7.1. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `f59ec35b7f` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-07-24 10:00:19 +08:00
Mengqing Cao	574fe407eb	[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841 ) ### What this PR does / why we need it? We'll refactor `CustomOp` in vllm-ascend from this PR on. Use function `CustomOp.register_oot` to register the custom op, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adaptation to the `CustomOp.register_oot` mechanism from vllm 0.9.2. As a further step, we can remove the inheritance from `QuickGELU` and write our own `QuickGELU` entirely. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-18 23:07:14 +08:00
Shanshan Shen	8a91e6e59c	[Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871 ) ### What this PR does / why we need it? This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `ca4eb82bcb` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-18 23:06:03 +08:00
wangxiyuan	ef99fe1c54	[Test] Clean up duplicate test for ascend scheduler (#1819 ) There are some duplicate tests for ascend scheduler. This PR removes them to make the tests clear. After this PR, the singlecard e2e time is reduced from 47min to 46min. - vLLM version: v0.9.2 - vLLM main: `1eb2b9c102` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-16 17:57:48 +08:00
Shanshan Shen	f96100fad5	[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805 ) ### What this PR does / why we need it? Remove V0 related codes of test, example, platform. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:58:55 +08:00
wangxiyuan	787010a637	[Test] Remove VLLM_USE_V1 in example and tests (#1733 ) V1 is enabled by default, no need to set it by hand now. This PR removes the useless setting in example and tests - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 12:49:57 +08:00
wangxiyuan	494b0f474f	[CI]Fix broken CI (#1773 ) This PR fixes the broken CI. It requires https://github.com/vllm-project/vllm/pull/20900 to be merged first. - vLLM version: v0.9.2 - vLLM main: `e8cc53af5e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 00:54:20 +08:00
Pr0Wh1teGivee	d13fb0766e	[Perf] add patch to optimize apply_topk_topp (#1732 ) ### What this PR does / why we need it? Performance optimization for apply_top_k_top_p ### Does this PR introduce _any_ user-facing change? Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature ### How was this patch tested? e2e & ut - vLLM version: v0.9.2 - vLLM main: `6a9e6b2abf` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-07-11 15:32:02 +08:00
ttanzhiqiang	ee40d3d850	use npu_moe_gating_top_k_softmax (#1355 ) ### What this PR does / why we need it? The optimization for non-DeepSeek select_experts is to replace softmax+topk+to with gating_topk_softmax, which reduces the time from 37us to 14us on bf16/fp16 of qwen3-235b - vLLM version: v0.9.2 - vLLM main: `1a4f35e2ea` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-11 08:55:06 +08:00
Mengqing Cao	cc210f46e6	[AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718 ) ### What this PR does / why we need it? Now there is no need to calculate `num_draft_tokens` when allocating slots. This PR follows the changes in vllm: https://github.com/vllm-project/vllm/pull/20701 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test - vLLM version: v0.9.2 - vLLM main: `cc876d0f29` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-10 18:47:45 +08:00


65 Commits