### What this PR does / why we need it?
Add V1 Engine LoRA support.
Add LoRA e2e tests on single card and multiple cards.
### Does this PR introduce _any_ user-facing change?
Support LoRA for the V1 Engine.
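For reference, a minimal offline-inference sketch of how LoRA can be exercised; the model and adapter paths are placeholders, not part of this PR:
```python
import vllm
from vllm.lora.request import LoRARequest

if __name__ == '__main__':
    # enable_lora turns on LoRA support; the adapter path below is a placeholder.
    llm = vllm.LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_lora=True)
    outputs = llm.generate(
        ["Give me a short introduction to large language models."],
        lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
    )
    print(outputs[0].outputs[0].text)
```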
### How was this patch tested?
CI passed with newly added tests.
---------
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: paulyu <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: jesse <szxfml@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
### What this PR does / why we need it?
Fix bugs when running the DeepSeek model with engine v1.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with newly added/existing tests.
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR fixes the CI failure caused by upstream vllm changes.
1. Add moe_config for fused_moe.
2. Adapt to the kv cache group change from vllm. vllm-ascend does not
support this feature yet; this is just a quick fix for backward
compatibility.
fix: #872
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add padding for ACL Graph and refactor the graph batch size adjustments
into utils.py.
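A hedged sketch of the kind of helper this refactor moves into utils.py: pad the runtime batch size up to the nearest captured graph batch size. The names here are illustrative, not the actual vllm-ascend code:
```python
import bisect
from typing import List

def pad_to_graph_batch_size(num_tokens: int, graph_batch_sizes: List[int]) -> int:
    """Return the smallest captured graph batch size >= num_tokens.

    graph_batch_sizes must be sorted ascending. Falls back to the largest
    captured size if num_tokens exceeds every captured batch size.
    """
    idx = bisect.bisect_left(graph_batch_sizes, num_tokens)
    if idx == len(graph_batch_sizes):
        return graph_batch_sizes[-1]
    return graph_batch_sizes[idx]

# Example: with graphs captured for [1, 2, 4, 8, 16], a batch of 5 is padded to 8.
assert pad_to_graph_batch_size(5, [1, 2, 4, 8, 16]) == 8
```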
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Support prefix caching and chunked prefill in v0/v1.
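As a usage reference, both features can be toggled through the standard vLLM engine arguments; the model name below is a placeholder:
```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
    enable_chunked_prefill=True,   # split long prefills into smaller chunks
)
```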
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR adds a new capability: aclgraph_batch_sizes can now be adjusted
dynamically per model. Before this PR, the aclgraph_batch_sizes passed
from vllm to vllm-ascend were always too large, which could cause errors
such as "The resources are insufficient" when running different models.
Now the code dynamically adjusts aclgraph_batch_sizes based on the
model's number of hidden layers and the parallel config, for example:
a. for Qwen2.5-7B, the aclgraph_batch_sizes list has 33 entries in total;
b. for Qwen2.5-72B, the aclgraph_batch_sizes list has 11 entries in total.
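A rough sketch of the adjustment idea (a hypothetical helper, not the actual implementation): the number of captured batch sizes shrinks as the number of hidden layers and the parallel size grow, so deeper models capture fewer graphs.
```python
def adjust_graph_batch_sizes(original_sizes, num_hidden_layers, parallel_size):
    """Illustrative heuristic: keep only as many aclgraph batch sizes as the
    model can afford, assuming a roughly constant stream/resource budget."""
    budget = 1800  # illustrative resource budget, not a real constant
    max_graphs = max(1, budget // (num_hidden_layers * max(1, parallel_size)))
    if len(original_sizes) <= max_graphs:
        return original_sizes
    # Keep an evenly spaced subset, always retaining the largest size.
    step = len(original_sizes) / max_graphs
    picked = {original_sizes[int(i * step)] for i in range(max_graphs)}
    picked.add(original_sizes[-1])
    return sorted(picked)
```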
Signed-off-by: chris668899 <15105191595@126.com>
### What this PR does / why we need it?
vLLM Ascend side follow-up on:
[Core] Remove prompt string from engine core data structures
df6f3ce883
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
b411418ff0
This vllm commit changes the sampler usage. This PR adapts to the change
on main and makes sure it works for 0.8.4 as well.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature provided
by the v1 engine.
1. Register unified_ascend_attention_with_output for piecewise_graph to
split the graph.
2. Support NPUGraph to accelerate kernel launch.
### Does this PR introduce _any_ user-facing change?
NPUGraph is now enabled by default. Users can disable the npugraph
feature by setting enforce_eager.
This has corresponding requirements for the versions of torch_npu and
CANN: they need to support graph capture.
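For example, graph mode can be turned off per model instance via the standard enforce_eager flag (model name is a placeholder):
```python
from vllm import LLM

# Fall back to eager (non-graph) execution if your torch_npu / CANN
# versions do not support graph capture.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)
```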
### How was this patch tested?
It is enabled by default, so it is covered by the existing tests.
---------
Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Add notes for the DeepSeek patch and remove some unnecessary comments.
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Add `apply_grammar_bitmask()` method to model runner.
This method is necessary for `xgrammar` structured output.
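A rough sketch of the idea (not the actual model runner code), assuming xgrammar's apply_token_bitmask_inplace utility: the method masks the logits of tokens the grammar forbids in place before sampling.
```python
import torch
import xgrammar as xgr

def apply_grammar_bitmask(logits: torch.Tensor, bitmask: torch.Tensor) -> None:
    """Illustrative only: set logits of grammar-forbidden tokens to -inf.

    `bitmask` is the packed token bitmask produced by the structured-output
    backend; xgrammar applies it to the logits tensor in place.
    """
    xgr.apply_token_bitmask_inplace(logits, bitmask)
```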
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
This PR adds AscendScheduler to the vllm v1 engine.
This scheduler currently supports a v0-style prefill-first scheduling
strategy.
More scheduling policies will be supported by this scheduler in the future.
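A simplified sketch of the prefill-first idea (illustrative pseudologic with hypothetical names, not the AscendScheduler source): waiting prefill requests are admitted before any running decode requests in each step.
```python
def schedule_step(waiting, running, token_budget):
    """Illustrative v0-style prefill-first scheduling.

    `waiting` holds requests whose prompts still need prefilling,
    `running` holds requests that are decoding token by token.
    """
    scheduled = []
    # 1. Prefill first: admit waiting requests while the budget allows.
    while waiting and waiting[0].num_prompt_tokens <= token_budget:
        req = waiting.pop(0)
        token_budget -= req.num_prompt_tokens
        scheduled.append(req)
    # 2. Only then schedule decodes (one token each) for running requests.
    for req in running:
        if token_budget <= 0:
            break
        token_budget -= 1
        scheduled.append(req)
    return scheduled
```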
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.
Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.
Therefore, an environment variable is added here to dynamically set the
size of the pre-constructed mask matrix based on requirements.
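A minimal sketch of the idea (the env var name and helper are illustrative, not the real vllm-ascend code): build a causal mask once at a configurable size, slice it when the sequence fits, and only build a larger mask on the fly when it does not.
```python
import os
import torch

# Hypothetical env var name; the real one lives in vllm-ascend's envs module.
_PRE_MASK_LEN = int(os.getenv("VLLM_ASCEND_PRE_MASK_LEN", "2048"))
_PRE_MASK = torch.triu(
    torch.ones(_PRE_MASK_LEN, _PRE_MASK_LEN, dtype=torch.bool), diagonal=1)

def get_attn_mask(seq_len: int) -> torch.Tensor:
    if seq_len <= _PRE_MASK_LEN:
        # Fast path: slice the pre-constructed mask.
        return _PRE_MASK[:seq_len, :seq_len]
    # Slow path: build a larger mask on demand (costs time and memory).
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
```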
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
Fix CI by updating mypy and pinning the numpy version.
_the modification of model_runner_v1 is just to make CI happy_
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed
Signed-off-by: MengqingCao <cmq0113@163.com>
1. Doc: fix broken link
2. Doc: make the Chinese version consistent with the English version
3. Remove the unused file `test.py`
4. Update `collect_env.py`
5. Fix v1 import error
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add support for V1 Engine.
Please note that this is just the initial version; some places may still
need to be fixed or optimized in the future, so feel free to leave
comments for us.
### Does this PR introduce _any_ user-facing change?
To use V1 Engine on NPU device, you need to set the env variable shown
below:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
If you are using vllm for offline inference, you must add a `__main__`
guard like:
```python
if __name__ == '__main__':
    llm = vllm.LLM(...)
```
Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
### How was this patch tested?
I have tested the online serving with `Qwen2.5-7B-Instruct` using this
command:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
Query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>