xc-llm-ascend

Author	SHA1	Message	Date
dependabot[bot]	ca274001b0	Bump actions/download-artifact from 4 to 5 (#2311 ) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5. - vLLM version: v0.10.0 - vLLM main: `ebf7605b0d` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-08-11 16:02:12 +08:00
wangxiyuan	9260910c8d	[CI] Fix broken CI (#2302 ) 1. Disable the test_eagle_ccorrectness test; we'll re-enable it once the OOM error is fixed. 2. Drop the transformers version limit for main, since vLLM relies on >=4.55.0, see: `65552b476b` 3. Fix the kv_connector_output bug, see: `796bae07c5` - vLLM version: v0.10.0 - vLLM main: `d1af8b7be9` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 11:22:32 +08:00
Icey	3e65c406b8	Fix accuracy test create PR (#2274 ) ### What this PR does / why we need it? Fix create PR of accuracy test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Local testing: https://github.com/nv-action/vllm-benchmarks/pull/87 - vLLM version: v0.10.0 - vLLM main: `099c046463` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-08 14:12:11 +08:00
Icey	0bd5ff5299	Fix accuracy test config and add DeepSeek-V2-Lite test (#2261 ) ### What this PR does / why we need it? This PR fix accuracy test related to https://github.com/vllm-project/vllm-ascend/pull/2073, users can now perform accuracy tests on multiple models simultaneously and generate different report files by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/models/configs/accuracy.txt ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? <img width="1648" height="511" alt="image" src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd" /> - vLLM version: v0.10.0 - vLLM main: `766bc8162c` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-08 11:09:16 +08:00
lbk-sys	c611291661	[main] SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE now supports sequence parallelism (SP). In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
Wang Kunpeng	8a59367d0c	[main][Feature] Support deepseek w4a8 quantization (#2172 ) ### What this PR does / why we need it? Supports Deepseek-R1 w4a8 quantization. Since R1 w4a8 uses mixed quantization, only the MoE layer uses w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and `tests/ut/quantization/test_quantizer.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC` to test deepseek w4a8_dynamic quantized model #### 1. How to get weights using Modelslim ##### Installation steps Use the branch master, the commit id is: 298e175d69b3b855111a1e09bbe2fcd12fdb4e24 git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim bash install.sh ##### The required transformers environment transformers>=4.48.2 ##### Generate w4a8 weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md Execute the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96) chapter Reference command: python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path} --mindie_format ##### Adapt to vllm-ascend Since `--mindie_format` produces weights in MindIE format, some adaptation is needed before vllm-ascend can use them: rename `quant_model_description_w8a8_dynamic.json` to `quant_model_description.json`, and add `"group_size": 256` Modification in `config.json`: `"model_type":deepseekv2` is changed to `"model_type":deepseek_v3`; `quantization_config` is removed; Tip: the group_size must match the weights. If the w4a8 weights are not generated using msmodelslim, you can check the group_size in quantization_config in config.json. #### 2. How to run w4a8 ##### a. How to run eager mode export VLLM_USE_V1=1 # v1 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager ##### b. How to run graph mode export VLLM_USE_V1=1 # v1 export HCCL_BUFFSIZE=1024 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' e.g.: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-06 10:17:44 +08:00
wangxiyuan	292fb8f696	[1/N][Refactor] torchair model runner refactor (#2205 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic. Following the workflow in #2203, this is the first PR. What this PR does: create the new torchair model runner; more functions will be added later - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 18:43:04 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and clean up the infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837). This patch also depends on: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ Install dependencies: ``` pip3 install pyyaml pip3 install setuptools ``` Install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
zhangxinyuehfad	e48f32ec59	[CI] Update image for 310p ci (#2155 ) ### What this PR does / why we need it? update the latest image for 310p ci test - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-02 16:46:02 +08:00
weijinqian0	6e00aed4d5	[main][Feature] MoE alltoallv communication optimization for unquantized RL training scenario (#2088 ) It comes from 0.9.1dev: [0.9.1][Feature] MoE alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo (#1547) - vLLM version: v0.10.0 - vLLM main: `97608dc276` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: curryliu <120010041@link.cuhk.edu.cn> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com> Signed-off-by: taoxudonghaha <justsheldon@163.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: TaoYu Chen <ctynb@qq.com> Co-authored-by: taoxudonghaha <justsheldon@163.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-08-02 09:49:10 +08:00
Icey	86bdde1ca8	Enable pytest and yaml style accuracy test (#2073 ) ### What this PR does / why we need it? This PR enabled pytest and yaml style accuracy test, users now can enable accuracy test by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \ --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1970 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-31 21:39:13 +08:00
Ruri	4fcca137a7	[main][Feature] Support Qwen3 W4A8 quantization (#2060 ) ### What this PR does / why we need it? Adding `W4A8_DYNAMIC` quantization support for linear. Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC` to test qwen3 w4a8_dynamic quantized model Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim` of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409` 1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim` ```shell git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409 bash install.sh ``` 2. Serve model using `vllm` ```shell VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \ --model vllm-ascend/Qwen3-8B-W4A8 \ --port 8000 \ --quantization ascend \ --tensor_parallel_size 2 \ --enforce-eager ``` - vLLM version: v0.10.0 - vLLM main: `4cd7fe6cea` --------- Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>	2025-07-30 14:57:14 +08:00
zhangxinyuehfad	6874d666fa	[CI]Add e2e test for 310p (#1879 ) ### What this PR does / why we need it? Add e2e test for 310p: - trigger conditions: tag, labels (ready-for-test, e2e-310p-test), schedule - image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10 - runner: linux-aarch64-310p-1, linux-aarch64-310p-4 - model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base, Qwen/Qwen2.5-7B-Instruct - vLLM version: v0.10.0 - vLLM main: `b917da442b` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-30 14:52:16 +08:00
Yikun Jiang	d9f82ebfce	[misc] Add reminder comment when PR submitted (#2092 ) ### What this PR does / why we need it? Add reminder comment when PR submitted ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test locally: https://github.com/Yikun/vllm-ascend/pull/51#issuecomment-3132425126 This PR will take effect after this PR merged. - vLLM version: v0.10.0 - vLLM main: `0e36abf993` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-30 10:14:33 +08:00
TaoYu Chen	2da281ec5a	bump default python version to 3.11 (#2072 ) ### What this PR does / why we need it? Bump default python version to 3.11, see #1980 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI - vLLM version: v0.10.0 - vLLM main: `12a223ef9b` Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>	2025-07-29 19:07:17 +08:00
Li Wang	f60bb474f9	[CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065 ) ### What this PR does / why we need it? Currently our workflow takes about 3 hours in total, which seriously affects the developer experience, so an optimization is urgently needed. After this PR, the running time of the full CI is expected to be shortened to 1h40min. - Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB) - Change TP4 ---> TP2 * 2 max-parallel - Move DeepSeek-V2-Lite-W8A8 to single card test ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `a2480251ec` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-29 18:59:05 +08:00
Mengqing Cao	ed2ab8a197	[CI/Build] Upgrade CANN to 8.2.RC1 (#1653 ) ### What this PR does / why we need it? Upgrade CANN to 8.2.rc1 Backport: https://github.com/vllm-project/vllm-ascend/pull/1653 ### Does this PR introduce _any_ user-facing change? Yes, docker image will use 8.2.RC1 ### How was this patch tested? CI passed - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 22:37:46 +08:00
Pleaplusone	df0ec55162	Disaggregate prefill for kv cache register style (#950 ) ### What this PR does / why we need it? This PR adopts `LLMDataDist` for kv cache registration and a `pull_blocks` style disaggregated prefill implementation. The interface implementation mainly follows the design of NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953 . This PR can be tested with the following steps: - Generate the rank table for all machines. - Execute `toy_proxy.py` to launch the disaggregated prefill proxy server, specifying the prefill ip and port and the decode ip and port. - Run the prefill server and decode server. - Send the request to the disaggregated prefill proxy. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Signed-off-by: liziyu179 <3475441767@qq.com> Signed-off-by: underfitc <hucong24@huawei.com> Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Co-authored-by: liziyu179 <3475441767@qq.com> Co-authored-by: underfitc <hucong24@huawei.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-07-26 17:15:47 +08:00
Yikun Jiang	17a430f7b8	Upgrade vLLM to v0.10.0 (#1927 ) ### What this PR does / why we need it? - Upgrade to v0.10.0 - Drop v0.9.2 version compatibility - Add patch for `vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py` as workaround of `f3a683b7c9` for v0.10.0 and also add e2e test `test_models_prompt_logprobs` - Pin transformers<4.54.0 as workaround of https://github.com/vllm-project/vllm-ascend/issues/2034 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test locally: `VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs` - CI passed - vLLM version: v0.9.2 - vLLM main: `7728dd77bb` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 15:43:29 +08:00
Li Wang	d629f0b2b5	[CI] Remove transformers installation (#2014 ) ### What this PR does / why we need it? Remove the transformers installation. The transformers version bug has been fixed by `e936e401de`, so it is now safe to remove the version limit - vLLM version: v0.9.2 - vLLM main: `40d86ee412` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-25 15:20:37 +08:00
Icey	6bc82cf6a7	Enable image push CI for build file and csrc has changes (#1977 ) ### What this PR does / why we need it? - Fixes image CI ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.9.2 - vLLM main: `f3137cdd81` Signed-off-by: Icey <1790571317@qq.com>	2025-07-24 21:19:41 +08:00
li chaoran	ff97740b8d	Use mirror images (#1912 ) ### What this PR does / why we need it? More discussion can be found [here](https://github.com/ascend-gha-runners/docs/issues/23). The infra team deployed an internal registry since both `m.daocloud.io` and `quay.io` suffered from unstable connection quality. CI will benefit in both connection stability and download speed by switching to the internal registry. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? tested locally - vLLM version: v0.9.2 - vLLM main: `6b46c4b653` --------- Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>	2025-07-24 10:47:05 +08:00
li chaoran	3e39d7234c	[CI] Switching to infra cache server to reduce network pressure (#1792 ) ### What this PR does / why we need it? This PR introduces the infra cache server to speed up apt/pip package installation ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Tested locally; with this config, network bandwidth usage drops from 100% to 5% when a new PR is submitted. <img width="807" height="334" alt="image" src="https://github.com/user-attachments/assets/16f03bce-4531-4c71-ab6e-8308dc2c022c" /> - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>	2025-07-18 18:39:25 +08:00
Icey	875a920d4a	[Platform] Add support for Atlas A3 series (#1794 ) ### What this PR does / why we need it? Add support for Ascend A3 and remove latest tag ### Does this PR introduce _any_ user-facing change? User can run vLLM on Atlas A3 series ### How was this patch tested? CI passed with: - remove latest tag test: https://github.com/wxsIcey/wxs-vllm-ascend/actions/runs/16267635040/job/45926924765 - E2E image build for A3 - CI test on A3 with e2e test and longterm test - Unit test is missing because it needs real A3 hardware to run Closes: https://github.com/vllm-project/vllm-ascend/issues/1696 - vLLM version: v0.9.2 - vLLM main: `d0dc4cfca4` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-17 11:13:02 +08:00
wangxiyuan	bf2549856f	[CI] Fix changes CI to recover codecov (#1799 ) Add the `checkout` action before `dorny/paths-filter` to make it work in the `push` case. It is a known issue that `dorny/paths-filter` works without `checkout` in the `pull_request` case but fails in the `push` case. More detail is here: https://github.com/dorny/paths-filter/issues/60#issuecomment-1464281021 The push CI works after this PR. The test result is here: https://github.com/wangxiyuan/vllm-ascend/actions/runs/16285606468/job/45983607539 - vLLM version: v0.9.2 - vLLM main: `d4d309409f` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 15:01:13 +08:00
wangxiyuan	787010a637	[Test] Remove VLLM_USE_V1 in example and tests (#1733 ) V1 is enabled by default, so there is no need to set it by hand now. This PR removes the useless setting in examples and tests - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 12:49:57 +08:00
zhangxinyuehfad	cabfb2bc31	[Test] Resolve vllm-ascend version accuracy test (#1769 ) ### What this PR does / why we need it? Resolve vllm-ascend version for accuracy test ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `66f6fbd393` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-14 15:43:37 +08:00
Li Wang	9cd4ac76a1	[CI] Remove benchmark patch and increase the scheduler frequency (#1762 ) ### What this PR does / why we need it? This PR does the following: 1. Remove the `benchmark_datasets.py` patch 2. Increase the scheduler frequency to 2 times per day; due to the recent large number of daily submissions, we need to increase the default test time (6h) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `247102f07f` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-13 20:00:35 +08:00
Yikun Jiang	eff4b5791c	Recover offline_inference_npu.py to make doctest pass (#1756 ) ### What this PR does / why we need it? Rename offline_inference_npu_v1.py to offline_inference_npu.py to recover doctest ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.9.2 - vLLM main: `a8593237c0` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-12 12:36:35 +08:00
zhangxinyuehfad	1b4a2f3817	[CI] Add accuracy ci for DP and EP and TP and ETP (#1140 ) ### What this PR does / why we need it? Add accuracy ci for DP and EP and TP ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `35514b682a` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-11 17:25:17 +08:00
zhangxinyuehfad	1cd27da5fb	[Test] Remove VLLM_USE_V1 in accuracy test (#1739 ) ### What this PR does / why we need it? Remove VLLM_USE_V1 in accuracy test Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-11 15:29:11 +08:00
wangxiyuan	011fd73a48	[CI] Make CI tracker more clear (#1720 ) 1. enable lint check for all change 2. only run ut and e2e if it's the code change. 3. only run ut and disable e2e if the change is ut only. 4. disable wheel build for push case 5. run unit test when pr is merged 6. remove useless pytest.ini - vLLM version: v0.9.2 - vLLM main: `fdfd409f8f` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-10 16:03:23 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commits hooks (#1256 ) ### What this PR does / why we need it? Follow the vllm-project/vllm lint approach: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to avoid low-level errors as much as possible. This PR is one step of #1241; the purpose is to make the linting system clearer and more convenient. This step mainly covers the following: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format (check for csrc with google style) needs code cleanup, disabled for now - pymarkdown needs code cleanup, disabled for now - shellcheck needs code cleanup, disabled for now ### Does this PR introduce _any_ user-facing change? Only developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with new added/existing test. Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00
Yikun Jiang	997f156a51	Use ci_vllm_version when recording vLLM commit (#1689 ) ### What this PR does / why we need it? Use ci_vllm_version when recording vllm commit Followup on https://github.com/vllm-project/vllm-ascend/pull/1623 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test manually. $ python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"' v0.9.2 - Test on my local repo: https://github.com/Yikun/vllm-ascend/pull/35 - vLLM version: v0.9.1 - vLLM main: `49e8c7ea25` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-10 11:07:27 +08:00
Shanshan Shen	6af35f60cc	[Bugfix][CI] Remove V0 Spec Decode CI (#1656 ) ### What this PR does / why we need it? To solve the error in the CI of long term test: ```bash modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/ ``` Replace the hf model with modelscope model. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `71d1d75b7a` --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-07-09 15:53:58 +08:00
wangxiyuan	830332ebfc	Clean up v0.9.1 code (#1672 ) vLLM has released 0.9.2. This PR drops 0.9.1 support. - vLLM version: v0.9.1 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:52:24 +08:00
Icey	0d4bc03946	Fix wheel glibc version incompatibility (#1582 ) ### What this PR does / why we need it? - Fixes https://github.com/vllm-project/vllm-ascend/issues/1533 ### How was this patch tested? 1. Run the image ``` docker run \ --name cann_container \ --device /dev/davinci6 \ --device /dev/davinci_manager \ --device /dev/devmm_svm \ --device /dev/hisi_hdc \ -v /usr/local/dcmi:/usr/local/dcmi \ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ -v /etc/ascend_install.info:/etc/ascend_install.info \ -it quay.io/ascend/cann:8.1.rc1-910b-openeuler22.03-py3.11 bash ``` 2. Install package torch=2.5.1 torch-npu=2.5.1.post1.dev20250619 vllm=0.9.1 vllm-ascend=vllm_ascend-0.1.dev1+g02ac443-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl Artifact download URL: https://github.com/vllm-project/vllm-ascend/actions/runs/16039661265/artifacts/3454481370 3. Test offline script ``` from vllm import LLM, SamplingParams import os os.environ["VLLM_USE_V1"] = "1" prompts = [ "Hello, my name is", ] llm = LLM(model="Qwen3/Qwen3-1.7B") outputs = llm.generate(prompts) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` 4. Results ![result](https://github.com/user-attachments/assets/20f9d923-00ce-4a2d-8598-9b216045705d) - vLLM version: v0.9.2 - vLLM main: `b942c094e3` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-08 18:46:02 +08:00
Yikun Jiang	e4e9ea02ab	Upgrade vLLM version to v0.9.2 (#1652 ) ### What this PR does / why we need it? This patch upgrades the vLLM version to v0.9.2. It does not remove the v0.9.1 compatibility code, to keep the review easy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `14601f5fba` - Accuracy test with 0.9.2: https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-08 14:18:17 +08:00
Mengqing Cao	f2a20393a2	[CI] Fix mypy check in CI (#1655 ) ### What this PR does / why we need it? Fix mypy check in CI: https://github.com/vllm-project/vllm-ascend/actions/runs/16115919385/job/45469646509?pr=1654 Mypy failed due to the greater numpy version. We need to pin `numpy=1.26.4` in vllm-ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-07 20:19:16 +08:00
Yikun Jiang	493768eb30	Record vLLM commit in PR description (#1623 ) ### What this PR does / why we need it? This patch enables the vllm commits recording and also cleanup unused commit msg note in PR. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - CI passed - Test on https://github.com/Yikun/vllm-ascend/pull/33 and vllm commit refreshed as expected. Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-07 10:20:38 +08:00
zhangxinyuehfad	14373f65d7	[Test] Remove V0 accuracy test and enable MoE and VL test on V1 (#1574 ) ### What this PR does / why we need it? Update accuracy test 1. remove accuracy report on V0 2. add parallel and execution mode 3. add Qwen/Qwen3-30B-A3B and remove Qwen/Qwen2.5-7B-Instruct ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-06 11:10:19 +08:00
Mengqing Cao	dd22ac38b2	[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136 ) ### What this PR does / why we need it? 1. run deepseek acc ut per pr --- multicard CI time increased by 9 min 2. run spec decode e2e test on v1 per pr --- singlecard CI time increased by 3 min (partly disabled because it does not work now) ~~3. align the output of whether dbo is enabled or not~~ The generated results with and without dbo cannot be aligned. https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136 4. skip V0 mtp test due to failure in https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816 5. fix some version conflicts ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-04 18:05:45 +08:00
zhangxinyuehfad	4e910186de	[CI/UT] Unify model usage via ModelScope in CI (#1207 ) ### What this PR does / why we need it? Unify Model Usage via ModelScope ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-04 10:52:17 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now it can also be used with pangu. In addition, we add a supported-model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Yikun Jiang	aa5fa07478	Only enable single version for wheel pr build (#1571 ) ### What this PR does / why we need it? Only enable single version for wheel pr build to speedup PR triggered CI ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-02 14:50:34 +08:00
Li Wang	f39365d2ea	[Benchmark] Fix error msg upload in performance benchmark (#1559 ) ### What this PR does / why we need it? Make sure that None parameters are not passed in for `--error` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed locally Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-02 14:06:08 +08:00
Li Wang	6db7dc2c85	[Benchmark] Refactor perf script to use benchmark cli (#1524 ) ### What this PR does / why we need it? Since the `vllm bench` CLI is now mature enough for production use (it supports more datasets), we no longer need to copy vLLM code; with vLLM installed, we can use the benchmark CLI directly ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-30 23:42:04 +08:00
leo-pony	53ec583bbb	[Docs] Update Atlas 300I series doc and fix CI lint (#1537 ) ### What this PR does / why we need it? - Update Atlas 300I series doc: clean up unused parameters and enable optimized ops - Fix code spell CI ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-30 23:34:00 +08:00
Li Wang	5f8241c25c	[V1][ModelRunner] Support pooling model for v1 engine (#1359 ) ### What this PR does / why we need it? Add support for v1 pooling tasks while changing as little existing code as possible. Note that I moved `vllm.v1.worker.gpu_input_batch` down into vllm-ascend to decouple it from the frequently changing upstream interfaces. ### How was this patch tested? CI passed with new and existing tests. I also first ran a simple local test adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, shown below:
```python
import os

import torch
from vllm import LLM

# Pull model weights from ModelScope.
os.environ["VLLM_USE_MODELSCOPE"] = "True"


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
# Query-to-document similarity: rows are queries, columns are documents.
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
--------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-30 16:31:12 +08:00
dependabot[bot]	790c810bf7	Bump actions/github-script from 6 to 7 (#1519 ) Bumps [actions/github-script](https://github.com/actions/github-script) from 6 to 7. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-06-30 16:04:41 +08:00


170 Commits