xc-llm-ascend

Author	SHA1	Message	Date
Mengqing Cao	6eddbd2521	[CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889 ) Initialize PD Disaggreate UT --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-29 10:17:12 +08:00
wangxiyuan	f6e5decc10	[CI] upgrade to vllm 0.9.0 (#959 ) Upgrade to vllm 0.9.0. 0.8.5 will not be supported any more. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-28 21:18:41 +08:00
wangxiyuan	e2a0c19cea	[CI] Refactor CI (#952 ) 1. remove some useless test functions and files 2. fix format.sh problem 3. enable full test for singlecard and multicard 4. move long-term tests to the long_term folder. Tests of this kind run only when labeled and in the daily test. They include: spec decode, accuracy test ## After refactor: There are 4 test modules - `singlecard`: contains the tests running on one NPU. It'll be run for each PR and daily test. - `multicard`: contains the tests running on multiple NPUs. It'll be run for each PR and daily test. - `long_term`: contains the tests that take a long time (now includes `spec decode` and `accuracy` tests). It'll be run for PRs labeled `long-term-test` and daily test. - `e2e`: contains the tests for doc and pd features. It'll be run for PRs labeled `pd-test` and daily test. ## Todo: 1. some tests are skipped; they should be fixed and re-enabled in the future. 2. pyhccl test for multicard doesn't work at all. It should be enabled as well. 3. ensure long-term-test passes in the daily test. ### Known issue Now, the `ready` label is required to start the pd test or long-term test. And when `long-term-test` or `pd-test` is labeled after the other one, the previously labeled test will be re-run. So the labeled tests should be run in the following steps: 1. decide which tests need to run, then label the PR: `long-term-test` or `pd-test` or both. 2. add the `ready-for-test` label, then the tests will be run. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-28 06:31:35 +08:00
Angazenn	9f5ab59e30	[WIP][BugFix]Fix accuracy issues caused by wrong etp_size passed into FusedMoEParallelConfig when using vLLM 0.9.0 (#961 ) ### What this PR does / why we need it? This PR fixes accuracy issues introduced by the code that adapts to `FusedMoEParallelConfig` in vLLM 0.9.0. The `tp_size` used to split weights is wrongly passed. The root cause is that the vLLM community and vLLM-Ascend use different methods to decide whether to use Expert Parallel. vLLM: vLLM uses a flag `enable_expert_parallel` to indicate whether to use EP and uses the following code to decide `ep_size`: ``` use_ep = (dp_size_ * tp_size_ > 1 and vllm_parallel_config.enable_expert_parallel) dp_size = dp_size_ dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0 tp_size, tp_rank = flatten_tp_across_dp(dp_rank) if not use_ep: return FusedMoEParallelConfig(tp_size=tp_size, tp_rank=tp_rank, dp_size=dp_size, dp_rank=dp_rank, ep_size=1, ep_rank=0, use_ep=False) # DP + EP / TP + EP / DP + TP + EP assert use_ep # In EP, each device owns a set of experts fully. There is no tensor # parallel update tp_size, tp_rank, ep_size and ep_rank to reflect that. ep_size = tp_size ep_rank = tp_rank return FusedMoEParallelConfig(tp_size=1, tp_rank=0, dp_size=dp_size, dp_rank=dp_rank, ep_size=ep_size, ep_rank=ep_rank, use_ep=True) ``` vLLM-Ascend: vLLM-Ascend uses `etp` to specify Tensor Parallel in MoE. ``` self.ep_size = get_ep_group().world_size self.tp_size = get_etp_group().world_size self.dp_size = (dp_size if dp_size is not None else get_dp_group().world_size) ``` So there will be conflicts if we simply combine these code paths. Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-27 15:16:17 +08:00
Shuqiao Li	01e3d59eae	add workflow to build and release wheel (#775 ) ### What this PR does / why we need it? This is a continuation of #716. This PR adds a workflow to build and release the wheel, and also to release the source to PyPI. There are 3 conditions that trigger the workflow: 1. PR to `main` and `-dev` 2. push to `main` and `-dev` 3. push tag with name of `v*` Release to PyPI will only be done under condition 3. Under conditions 1 and 2, it will generate the .tar.gz and build the .whl, and upload them to GitHub artifacts, but will not release. Update: the .whl will also be built and uploaded to GitHub artifacts by a scheduled task. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All trigger conditions are well tested with my fork repo. --------- Signed-off-by: Shuqiao Li <celestialli@outlook.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-26 14:18:26 +08:00
Mengqing Cao	a0c3e9ba50	[Bugfix] Adjust inputbatch to be compatible with latest vllm (#945 ) Adjust inputbatch to be compatible with the latest vllm, as the kvcache group feature has been redone in https://github.com/vllm-project/vllm/pull/18593 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-26 10:33:28 +08:00
Angazenn	1f9fb869ad	[BugFix] Fix accuracy bugs for unquantized deepseekv3 models (#897 ) ### What this PR does / why we need it? This PR fixes two accuracy bugs introduced by PR #819 when running deepseekv3 series models: 1. #819 adds `all_to_all` communication in quantized cases, but `all_gather` && `reduce_scatter` are removed in both quantized and unquantized cases. When running unquantized deepseekv3 models with `ep_size == world_size`, the moe modules fail to communicate. Therefore, this PR adds `all_to_all` communication in the unquantized case to solve this accuracy issue. 2. Use `ep_size` rather than `dp_size` to decide whether to use `all_to_all` in moe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-24 14:29:36 +08:00
yiz-liu	17f05b1089	[Feature] Add CustomQwen3MoeForCausalLM model (#925 ) Tweak packed_modules_mapping to support W8A8 weights. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-23 15:50:48 +08:00
jiangpeng	df58fb80ee	Spec decode support for V1 Engine (#874 ) ### What this PR does / why we need it? Add spec decode support for the V1 Engine - Currently, Ascend does not support the triton kernel. PyTorch is used to rewrite the `rejection_sampler.py` triton kernel. However, PyTorch is not as good as Triton. Therefore, Ascend C will be used to implement this function in the future. - Currently, spec decode supports only the ngram algorithm. The eagle algorithm needs to be further adapted. ### Does this PR introduce _any_ user-facing change? No user-facing change. ### How was this patch tested? Tested by `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and `tests/sample/test_rejection_sampler.py`, which cover the base function of the rejection sampler and the e2e function of spec decode. Signed-off-by: ponix-j <657511300@qq.com>	2025-05-23 14:25:46 +08:00
Angazenn	a970b27e2d	[WIP][Perf]remove unnecessary padding before MLA V1 prefill (#917 ) ### What this PR does / why we need it? Currently, the implementation for MLA V1 pads q, k, v to `head_dim` 256 to conform to the early MLA kernel. But the new MLA kernel supports a `head_dim` that can't be divided by 128. Therefore we can remove those unnecessary paddings to boost performance. ### Does this PR introduce _any_ user-facing change? No. Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-23 14:14:06 +08:00
ttanzhiqiang	dc6172efd3	update attention nz and mla nz(Improve TPOP 6ms performance) (#909 ) ### What this PR does / why we need it? Update attention nz and mla nz modules to improve TPOP 6ms performance. Convert W_UV and W_UK_T to NPU format in mla_v1.py. Convert layer.weight to NPU format in w8a8.py. Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-05-23 10:18:10 +08:00
Jade Zheng	7153d8890b	[Feature] Impl v1 disaggregated prefill in ascend scheduler (#852 ) Implement save kv cache logic for v1 disaggregated prefill in ascend scheduler This PR adds support for saving kv cache in the ascend scheduler, which is part of the v1 disaggregated prefill design. The load functionality is not yet implemented. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-05-23 10:15:29 +08:00
rjg-lyh	b434f37b46	[V1] Revert the default value of enable_chunked_prefill in additional… (#935 ) ### What this PR does / why we need it? Revert the default value of enable_chunked_prefill to 'False' in additional_scheduler_config. In engine v1, enable_chunked_prefill is forcibly set to True in VllmConfig, which causes it to be perceived as True in check_and_update_config(). As a result, when the v0 scheduler is enabled, the chunked prefill feature remains active, leading to the failure of the v0 scheduler and causing it to fall back to the native v1 scheduling logic. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-23 10:06:50 +08:00
yangpuPKU	46df67a5e9	[bugfix] Improve log level and info for custom ops build (#937 ) ### What this PR does / why we need it? Fix the bug of #703, where vllm wrongly raised the ERROR: Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'. The format for reporting a vllm_ascend_C import failure is unified as warning("Failed to import vllm_ascend_C:%s", e). ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: yangpuPKU <604425840@qq.com>	2025-05-23 10:05:57 +08:00
yupeng	8ddc0a1002	[DOC] mark v1 multi-lora functional (#932 ) ### What this PR does / why we need it? Update feature support for lora ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? preview Signed-off-by: paulyu <paulyu0307@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-05-22 19:53:14 +08:00
yupeng	0f53b138f6	[V1][LoRA][Test] V1 Engine LoRA support & e2e test (#893 ) ### What this PR does / why we need it? Add V1Engine LoRA support. Add LoRA e2e test on single card and multiple cards. ### Does this PR introduce _any_ user-facing change? support lora for V1 ### How was this patch tested? CI passed with new added test --------- Signed-off-by: jesse <szxfml@gmail.com> Signed-off-by: paulyu <paulyu0307@gmail.com> Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: jesse <szxfml@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-05-22 19:20:51 +08:00
Mengqing Cao	7aa4f85f10	[Bugfix][kvcache] revert multiple kv cache groups (#923 ) Revert multiple kv cache groups related changes as this feature is reverted in vllm https://github.com/vllm-project/vllm/pull/18459 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-22 15:15:33 +08:00
rjg-lyh	b4d6672d01	[BugFix] Fix chunked prefill bugs in engine v1 (#844 ) ### What this PR does / why we need it? Fix the bugs that occur when running the deepseek model in engine v1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-22 10:33:50 +08:00
yiz-liu	a73bd6caf4	[Fix] Set div_mode to False and fix view_as position (#912 ) ### What this PR does / why we need it? Set div_mode to False to use the ACLNN kernel, which is crucial when using ACL Graph. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-22 09:57:25 +08:00
hfadzxy	58b413752b	[Doc] Support XLM-RoBERTa-based and MiniCPM3 model (#820 ) ### What this PR does / why we need it? support XLM-RoBERTa-based and MiniCPM3 model --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-05-21 15:44:54 +08:00
22dimensions	d5401a08be	[DOC] update modelslim version (#908 ) 1. update modelslim version to fix deepseek related issues 2. add note for "--quantization ascend" Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-21 09:12:02 +08:00
Wan_Danfeng	5cf9ff18e9	[Performance]: Custom AscendC Kernel of Multi-Step Prepare Input (#814 ) ### What this PR does / why we need it? - According to https://github.com/vllm-project/vllm-ascend/issues/807, this PR adds a custom AscendC kernel for multi-step. - It also fixes a bug we found in multi_step_runner.py when using multi-step on the V0 Engine. ### Does this PR introduce _any_ user-facing change? no user-facing change ### How was this patch tested? We add a unit test file and an offline inference file to test the custom AscendC kernel. See test/ops/test_multi_step.py and examples/offline_multi_step.py --------- Signed-off-by: wan_danfeng <wonderful199082@126.com>	2025-05-20 09:31:30 +08:00
22dimensions	00e0243561	enable online serving quantization (#877 ) For online serving, "ascend" quantization method is not a choice natively, so we need to add "ascend" quantization method to quantization methods list and the user can enable quantization using "vllm serve --quantization ascend" command. --------- Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-17 17:36:04 +08:00
22dimensions	a8730e7a3c	[Doc] update quantization docs with QwQ-32B-W8A8 example (#835 ) 1. replace deepseek-v2-lite model with the more practical model QwQ 32B 2. fix some incorrect commands 3. replace modelslim version with a more formal tag Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-17 15:25:17 +08:00
wangxiyuan	7326644513	[CI] Fix qwen2.5 vl CI failure (#888 ) The [vllm commit](`67da5720d4`) changed the input and rotary position embedding for qwen 2.5 vl, which breaks CI. This PR fixes the CI failure for qwen2.5 vl in quick Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-17 05:13:32 +08:00
Mengqing Cao	df16c4f2bc	[CI/UT] Ignore vllm/tests/test_vllm_port.py (#887 ) Ignore `vllm/tests/test_vllm_port.py` in ut, as it is not related to vllm-ascend and it is breaking CI Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-16 18:52:59 +08:00
Mengqing Cao	7a325b2e2d	[Bugfix][Model] Fix fusedmoe and make modelrunner_v1 compatible with latest vllm (#867 ) ### What this PR does / why we need it? This PR fixes the CI failure caused by vllm. 1. add moe_config for fused_moe 2. adapt to the kv cache group change from vllm. Currently vllm-ascend doesn't support this feature; this is just a quick fix for backward compatibility. fix: #872 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-16 12:14:55 +08:00
hfadzxy	fd515cd60b	[Doc][BugFix]Fix Release Compatibility Matrix (#865 ) ### What this PR does / why we need it? Fix Release Compatibility Matrix Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-05-15 15:38:38 +08:00
Angazenn	1e67089bc9	[BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant (#819 ) ### What this PR does / why we need it? 1. This PR introduces native `all_to_all` communication operator to fix `allgather` bugs when dp_size > 1. Besides, it adds a naive implementation of force-load-balance when doing profile runs. 2. The operator `npu_dequant_swiglu_quant` only supports input hidden_states with dtype `torch.int32`. This tensor occupies space of `global_bs * seq_len * topk * hidden_size`, which might be very large as `ep_size` grows. Therefore we need to disable this operator and use original `swiglu` && `quantize`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By performing offline inference: ![image](https://github.com/user-attachments/assets/e003d5dc-0753-41ae-9303-e87f73ac6828) --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-15 09:19:55 +08:00
wangxiyuan	68fb63428b	[CI] Patch torch.library.infer_schema for fused moe ops to fix CI (#854 ) Make sure the pytorch infer_schema check is patched before the cases that use fused moe ops: 1. model register 2. quantization loading 3. fused moe ut Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-14 19:49:09 +08:00
Yikun Jiang	508242425c	[CI][1/N] Add basic ci for PD disaggregation (#830 ) ### What this PR does / why we need it? Add basic CI for PD disaggregation, and enable it on schedule and when labeled with `module:pd` - Updated `.github/actionlint.yaml` to add a new self-hosted runner configuration: `linux-arm64-npu-static-8`. - Introduced a new GitHub Actions workflow `.github/workflows/vllm_ascend_test_pd.yaml` for PD disaggregation testing: - Scheduled to run daily at 23:00 UTC and triggered by pull request label `module:pd`. - Added steps for basic installation; other steps will be added in a follow-up PR Related: https://github.com/vllm-project/vllm-ascend/issues/841 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - CI passed - No trigger by default - Trigger only if we tag with pd Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-14 18:04:16 +08:00
Yikun Jiang	59e02502b1	[CI] Add e2e test frame work and doctest (#730 ) ### What this PR does / why we need it? Add quickstart doctest CI ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - CI passed - Run `/vllm-ascend/tests/e2e/run_doctests.sh` Related: https://github.com/vllm-project/vllm-ascend/issues/725 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-14 09:27:54 +08:00
wangxiyuan	857f489cbf	[CI] Patch torch.library.infer_schema for torch 2.5 backward compatibility (#837 ) Patch torch.library.infer_schema for torch 2.5 backward compatibility - Introduced a new module `patch_utils` under `vllm_ascend/patch/worker/patch_common/`. - Added a function `ascend_direct_register_custom_op` to handle custom operator registration with backward compatibility for PyTorch < 2.7 (such as torch 2.5.1). - Implemented type conversion logic for annotations to ensure compatibility across different PyTorch versions. - Registered the function `ascend_direct_register_custom_op` to `utils.direct_register_custom_op`. - Updated `__init__.py` to include `patch_utils` as the first import. - Ensured `patch_utils` is available for use in other patch files and skipped isort checks for `patch_utils` import. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-14 09:20:55 +08:00
cxcxflying	e564470338	[Attention][Kernel]moe support for llama4 and mllama4 (#740 ) ### What this PR does / why we need it? moe support for llama4 and mllama4 in vllm-ascend ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? start server: python -m vllm.entrypoints.openai.api_server --model /data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct \ --max-num-seqs=256 \ --max-model-len=8192 \ --tensor-parallel-size=8 \ --block-size=128 \ --dtype bfloat16 \ --host=0.0.0.0 \ --port=8000 \ --gpu-memory-utilization=0.9 \ --trust-remote-code client: python online_server.py --model-path /data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct --image-path /data/nfs/w60040464/cherry_blossom.jpg --docker-ip 7.242.108.253 --served-port 8000 --text "what is the content of this image?" result: {'id': 'chatcmpl-2b709a5d2e1a4017991ec4ba8248686a', 'object': 'chat.completion', 'created': 1747056823, 'model': '/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'The image depicts a tower, likely Tokyo Skytree, framed by branches of a cherry blossom tree. The tower is white and has a distinctive shape, with a large sphere at the top and a long, thin spire extending from it. The branches of the cherry blossom tree are in the foreground, with pink flowers blooming on them. The background is a clear blue sky.\n\nKey Features:\n\n* Tower: White, spherical shape at the top, long thin spire\n', 'tool_calls': []}, 'logprobs': None, 'finish_reason': 'length', 'stop_reason': None}], 'usage': {'prompt_tokens': 2340, 'total_tokens': 2440, 'completion_tokens': 100, 'prompt_tokens_details': None}, 'prompt_logprobs': None} Signed-off-by: chenxu <chenxu68@huawei.com> Co-authored-by: chenxu <chenxu68@huawei.com> Co-authored-by: evian <eviantai@u.nus.edu>	2025-05-13 19:12:40 +08:00
hfadzxy	217211d8a3	[Misc][Doc] Add the latest stable release url (#826 ) ### What this PR does / why we need it? Add the latest stable release url Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-05-13 12:53:23 +08:00
rjg-lyh	c6ac399091	[Bugfix] Fix the method of importing environment variables in DeepSee… (#817 ) ### What this PR does / why we need it? Fix the method of importing environment variables in DeepSeek model to support successful compilation via aclgraph. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-13 12:52:30 +08:00
wangxiyuan	6193ba679b	[CI] add codespell CI and fix format.sh (#827 ) 1. Fix format check error to make format.sh work 2. Add codespell check CI 3. Add the missing required package for vllm-ascend. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-12 22:04:48 +08:00
whx	5998704c08	[BugFix] Fix ascend scheduler bugs. (#822 ) This PR fixes two bugs in AscendScheduler: 1. When running with high concurrency, the length of the running queue may exceed the limit of max_num_seqs 2. When some requests are preempted and recomputing is activated, the logic of computing new tokens is wrong. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-05-12 21:15:17 +08:00
yiz-liu	701b0fd95e	[Enhancement] Add padding for ACL Graph (#803 ) ### What this PR does / why we need it? Add padding for ACL Graph and refactor graph batch size adjustments to utils.py --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-12 20:26:22 +08:00
NeverRaR	efabd722eb	feat: support torchair graph mode in v1 engine (#789 ) ### What this PR does / why we need it? support torchair graph mode with v1 engine --------- Signed-off-by: boying <897013703@qq.com>	2025-05-12 19:14:07 +08:00
hfadzxy	4a2505f81f	[accuracy test]Update cann version and huggingface-hub version for Qwen3 (#823 ) ### What this PR does / why we need it? 1. update cann version to 8.1.0 for multimodal 2. fix huggingface-hub version to adapt to qwen3 3. change Qwen3-8B to Qwen-8B-Base, Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-05-12 19:12:48 +08:00
yiz-liu	5305a2ccf9	[Bugfix] Tweak distributed process group initialization and add dummy… (#816 ) fix batch execution method to enable DP in V1 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-12 17:31:29 +08:00
Li Wang	4df1e99614	[CI] Re-enable `vllm-empty/tests/benchmarks` (#812 ) ### What this PR does / why we need it? Since [#17962](https://github.com/vllm-project/vllm/pull/17962?notification_referrer_id=NT_kwDOCexQHLUxNjM0MTM3OTEwNDoxNjY0ODE5NDg#event-17608938997) has merged, the vllm openapi server can now launch normally on python==3.10, so we re-enable the related tests Signed-off-by: wangli <wangli858794774@gmail.com>	2025-05-12 15:50:48 +08:00
Li Wang	8e4e791fcd	[CI] Add deepseek-v2-lite test (#631 ) ### What this PR does / why we need it? Add deepseek-v2-lite test, part of #499 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-05-12 14:59:17 +08:00
Li Wang	cdece86f2c	[Bugfix] Add max_num_batched_tokens to InputBatch to make main CI pass (#806 ) ### What this PR does / why we need it? 1. Fix V1 error found by [nightly_ci](https://github.com/vllm-project/vllm-ascend/actions/runs/14950004754/job/41998136610), broken by [[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders #17483](https://github.com/vllm-project/vllm/pull/17483), make `InputBatch` parameter consistent with vllm. 2. Disable benchmark and fix it in upstream. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-12 00:36:56 +08:00
Li Wang	218f21de21	[Benchmarks] Add qwen2.5-7b test (#763 ) ### What this PR does / why we need it? - Add qwen2.5-7b test - Optimize the documentation to be more developer-friendly Signed-off-by: xuedinge233 <damow890@gmail.com> Co-authored-by: xuedinge233 <damow890@gmail.com>	2025-05-10 09:47:42 +08:00
wemaster	19c8e134e4	[CI/UT] fix spec ut in vllm-ascend main and vllm main (#759 ) ### What this PR does / why we need it? #### 1. fix spec ut in vllm-ascend main and vllm main As https://github.com/vllm-project/vllm-ascend/pull/694 and https://github.com/vllm-project/vllm-ascend/pull/749 verify, the spec UT passes with vllm-ascend main and vllm 0.8.5, but CI fails with vllm-ascend main and vllm main. I found that the cause is a triton bug, https://github.com/triton-lang/triton/issues/2266, but I did not figure out why the bug does not affect vllm-ascend main with vllm 0.8.5; maybe the usage of triton changed between vllm 0.8.5 and the latest main. As the bug report describes, I changed the minimum block_size in the UT from 8 to 16, and the modification was verified locally to be effective. #### 2. modify the skip form of some cases. I changed some commented-out cases to the skipif form, which is more standardized. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-05-10 09:45:56 +08:00
Li Wang	58d2f85c4a	[CI] Fix schedule trigger bug (#757 ) ### What this PR does / why we need it? This PR aims to fix nightly ci [broken](https://github.com/vllm-project/vllm-ascend/actions/runs/14848150987) We have a workflow containing multiple triggers: - push events (to the default branch) - pull requests (against the default branch) - scheduled events Our paths-filter action works great for the first two use-cases, detecting the context and base to compare against. However, it fails for scheduled events giving the error `This action requires 'base' input to be configured or 'repository.default_branch' to be set in the event payload.` For the scheduling trigger event, we choose to skip this filter because we don't need its results: ``` - name: Check for changes in Speculative Decode if: github.event_name != 'schedule' ``` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-05-10 09:45:07 +08:00
Yikun Jiang	804ebb17bd	[Doc] Move Release Compatibility Matrix to top and remove v0.7.x rc info (#799 ) ### What this PR does / why we need it? - Move Release Compatibility Matrix to top - Remove v0.7.x rc info because the v0.7.3 final release is already published - Rename vllm-ascend to vLLM Ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-09 16:41:50 +08:00
rjg-lyh	fa99f89e93	[Core] Support the features of prefix cache and chunked prefill in v0/v1 (#782 ) ### What this PR does / why we need it? Support the features of prefix cache and chunked prefill in v0/v1. --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-09 16:39:28 +08:00
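
The expert-parallel selection quoted in commit 9f5ab59e30 can be sketched in isolation. This is a minimal standalone illustration, not the actual vLLM or vLLM-Ascend code: `MoEParallelSketch` and `make_moe_parallel` are hypothetical names, ranks are passed in as plain integers instead of being read from process groups, and the flattening rule (`tp_size = dp_size * tp_size`, `tp_rank = dp_rank * tp_size + tp_rank`) is an assumption about what `flatten_tp_across_dp` does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEParallelSketch:
    """Hypothetical stand-in for FusedMoEParallelConfig (illustration only)."""
    tp_size: int
    tp_rank: int
    dp_size: int
    dp_rank: int
    ep_size: int
    ep_rank: int
    use_ep: bool


def make_moe_parallel(tp_size_: int, tp_rank_: int, dp_size_: int,
                      dp_rank_: int,
                      enable_expert_parallel: bool) -> MoEParallelSketch:
    """Mirror the branch structure quoted in the commit message."""
    use_ep = dp_size_ * tp_size_ > 1 and enable_expert_parallel
    dp_rank = dp_rank_ if dp_size_ > 1 else 0

    # Assumption: TP is flattened across DP so weights shard over all devices.
    tp_size = dp_size_ * tp_size_
    tp_rank = dp_rank * tp_size_ + tp_rank_

    if not use_ep:
        # Pure tensor parallelism inside MoE: no expert parallel group.
        return MoEParallelSketch(tp_size, tp_rank, dp_size_, dp_rank,
                                 ep_size=1, ep_rank=0, use_ep=False)

    # With EP each device owns whole experts, so the MoE tp_size collapses
    # to 1 and the flattened size/rank become the EP size/rank.
    return MoEParallelSketch(1, 0, dp_size_, dp_rank,
                             ep_size=tp_size, ep_rank=tp_rank, use_ep=True)


cfg_ep = make_moe_parallel(4, 1, 2, 1, enable_expert_parallel=True)
cfg_tp = make_moe_parallel(4, 1, 2, 1, enable_expert_parallel=False)
```

Under these assumptions the MoE `tp_size` is 1 whenever EP is enabled. A backend that instead takes its MoE `tp_size` from a separate `etp` group, as the vLLM-Ascend snippet in that commit does, can therefore disagree with the value produced here, which is the conflict the commit describes.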


288 Commits