xc-llm-ascend

Author	SHA1	Message	Date
wemaster	54c0e63df7	[MTP] follow custom deepseek modeling changes to support graph mode (#636 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? As custom deepseek modeling do some changes to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so i follow it to change custom deepseek_mtp modeling. And some modifications for k>1 were not carried over by the https://github.com/vllm-project/vllm-ascend/pull/429, now i add it. In order to better take care of the MTP feature in the vllm-ascend repository, I added cases related to graph mode(torchair), but i skip it since torchair can not correctly clean up memory in vllmrunner. Also i add some case for MTP quantization weights, but test weight is not ready, so i skip it and i will open it when test quant weights is ready. https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change(https://github.com/vllm-project/vllm-ascend/issues/660) issue, I added the relevant changes. ### Does this PR introduce _any_ user-facing change? now, u can use following method to use mtp in deepseek v3/r1 float or quant weights with eager mode. ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, enforce_eager=True, trust_remote_code=True, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` or use mtp in deepseek v3/r1 float or quant weights with graph mode（torchair） ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, trust_remote_code=True, additional_config={ 'enable_graph_mode': True, }, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` add notes: 1. now, we support k>1, so u can set num_speculative_tokens > 1 if there is sufficient redundant computing power; 2. MTP is not supported in V1, we will support it when vLLM does it in https://github.com/vllm-project/vllm/issues/13500. 3. if u run MTP failed by `segmentation fault`, u can follow v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236 file `vllm_ascend/patch/patch_metrics.py` method `__npu_async_metrics_collector_init__` ### How was this patch tested? local tested passed and test by CI Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-28 21:18:53 +08:00
Mengqing Cao	be9e3e8545	[Bugfix] Fix triton placeholder patch period (#704 ) Fix triton placeholder patch period Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-28 18:52:03 +08:00
Li Wang	58f9d932d3	[Doc] Update faqs (#699 ) ### What this PR does / why we need it? Update faqs to make it more clear Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-28 18:48:23 +08:00
Li Wang	d0a0c81ced	[Doc] Add deepsee-v2-lite w8a8 quantization turorial (#630 ) ### What this PR does / why we need it? Add deepsee-v2-lite w8a8 quantization turorial --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-28 17:14:26 +08:00
wangxiyuan	5de3646522	[MISC] Make vllm version configurable (#651 ) Sometimes, user install a dev/editable version of vllm. In this case, we should make sure vllm-ascend works as well. This PR add a new env `VLLM_VERSION`. It's used for developers who edit vllm. In this case, developers should set thie env to make sure which vllm version is installed and used. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 14:19:06 +08:00
dependabot[bot]	8849cf1eda	Bump actions/setup-python from 5.5.0 to 5.6.0 (#697 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.5.0 to 5.6.0. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-04-28 14:06:38 +08:00
Icey	ee7a0e2cd4	Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 (#689 ) ### What this PR does / why we need it? Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 ### Does this PR introduce _any_ user-facing change? No Signed-off-by: Icey <1790571317@qq.com>	2025-04-28 11:45:46 +08:00
Pleaplusone	38f34e359f	[Fix] fix deepseek v0 attention eager mode (#671 ) ### What this PR does / why we need it? `reshape_and_cache_siso` seems have some funcitonality issues, use torch op combination replace this custom op by default. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-28 08:53:06 +08:00
Yikun Jiang	413657ae43	[FOLLOWUP][DOC] Fix pip install cmd in installation.md (#680 ) ### What this PR does / why we need it? Fix pip install cmd in installation.md Followup on: https://github.com/vllm-project/vllm-ascend/pull/661 ### Does this PR introduce _any_ user-facing change? No, doc only ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 18:37:25 +08:00
Yikun Jiang	2e20797934	[BUILD] Upgrade torch-npu to 2.5.1 (#661 ) ### What this PR does / why we need it? The torch-npu 2.5.1 are published: https://pypi.org/project/torch-npu/2.5.1/ It's time to remove all torch-npu dev version from vllm-ascend code base ### Does this PR introduce _any_ user-facing change? Yes, using torch-npu 2.5.1 ### How was this patch tested? - [ ] CI passed - [ ] Manually test - [ ] Grep all `dev2025` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 17:28:29 +08:00
Jade Zheng	fa4a5d980e	[Bugfix] Remove redundant tensor creation and unused code (#656 ) ### What this PR does / why we need it? Eliminated duplicate `block_table` tensor initialization and cleaned up unused code segments. This resolves an issue where the second creation was overwriting the first, potentially leading to unexpected behavior. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-04-27 14:09:16 +08:00
Mengqing Cao	ba3d8aae94	[Model][MiniCPM] support MiniCPM (#645 ) ### What this PR does / why we need it? This pr support minicpm in branch main. see https://github.com/vllm-project/vllm-ascend/pull/164 ### How was this patch tested? test locally with minicpm --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-27 11:27:24 +08:00
Yikun Jiang	742f679c7d	Remove prompt string from engine core data structures (#663 ) ### What this PR does / why we need it? vLLM Ascend side followup on: [Core] Remove prompt string from engine core data structures `df6f3ce883` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-26 23:15:58 +08:00
wangxiyuan	c99c4c8c70	[Doc] Update feature support list (#650 ) 1. remove Chinese doc. The content is out of data and we don't have enough time to maintain it. 2. Update feature support matrix. Refresh the content and add V1 status. --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-26 10:27:29 +08:00
wangxiyuan	3879d9cad9	[CI] Fix sample backward compatibility problem (#648 ) `b411418ff0` this vllm commit change the sample usage. This PR adapt the change for main and make sure it works for 0.8.4 as well. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-25 11:53:26 +08:00
yiz-liu	d785e78563	[V1] Make V1 engine backward compatible (#637 ) ### What this PR does / why we need it? Enforce eager mode in the V1 engine ahead of the upcoming CANN and torch_npu releases. ### Does this PR introduce _any_ user-facing change? After this change, users will no longer need to manually set enforce_eager=True. ### How was this patch tested? Test it with regular offline inference examples. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-24 17:20:11 +08:00
Li Wang	bd70ce828c	[CI] Add qwen2.5-vl test (#643 ) ### What this PR does / why we need it? Part of #499 Add qwen2.5-vl test on single npu, v1 engine is excluded because qwen2.5-vl has some problems with v1 now, at the same time, this test can also make #639 more credible Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-24 17:12:12 +08:00
Li Wang	a9c6b52205	[Bugfix] Fix qwen2.5-vl positon input bug (#639 ) ### What this PR does / why we need it? Fix qwen2.5-vl positon input bug, fix #625 `TypeError: 'NoneType' object is not iterable` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-24 15:21:57 +08:00
Li Wang	866ce7168c	[Benchmark] Download model from modelscope (#634 ) ### What this PR does / why we need it? - Run benchmark scripts will Download model from modelscope Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-24 14:48:24 +08:00
Bug Hunter Yan	05bdcbeae4	support aclgraph (#426 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> This PR supports the access of vllm-acend to the piecewise_graph feature provided by the v1 engine. 1. register unifiled_ascend_attention_with_output for piecewise_graph to split graph. 2. support NPUGraph to accelerate kernel launch. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> support npugraph to default， Users can disenable the npugraph feature by configuring enforce_eager. This has corresponding requirements for the versions of torch_npu and CANN, and they need to support graph capture. ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> it turn to default --------- Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-23 20:56:24 +08:00
zzzzwwjj	5c6d05a59e	support deepseek quant & mix-parallel with graphmode (#585 ) ### What this PR does / why we need it? 1. support deepseek with w8a8 quant; 2. support deepseek with mix-parallel(multi-DP, EP+TP); 3. support deepseek with graphmode. --------- Signed-off-by: wen-jie666 <wenjie39@huawei.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wen-jie666 <wenjie39@huawei.com>	2025-04-23 16:23:25 +08:00
Pleaplusone	e74331a1ed	Add dp initialize patch with hccl backend (#626 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> Add dp stateless process group initialization path with hccl backend as vllm-ascend patch. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-23 15:47:51 +08:00
RongRongStudio	848e041a54	Using EvalScope evaluation (#611 ) ### What this PR does / why we need it? Using EvalScope to hava a evaluation (include eval and test): - https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage - https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test locally --------- Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-23 00:50:09 +08:00
Shanshan Shen	4a0ce3660e	[Misc] Remove some parts of metrics patch (#603 ) ### What this PR does / why we need it? Remove some parts of metrics patch, since the `cuda` hard code has been fixed by https://github.com/vllm-project/vllm/pull/14411. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-22 18:45:21 +08:00
Li Wang	cf6ab42ee2	[CI]Add guided decoding test (#422 ) ### What this PR does / why we need it? After extensive testing, we are happy to say that guided_decoding is fully supported by npu, in this pr, we add guided_decoding integrated with our test, mainly does the following things: 1. test v0 supported backends including ` "outlines", "lm-format-enforcer","xgrammar"` 2. test v1 supported backends including ` "guidance", "xgrammar"` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-22 17:50:06 +08:00
wangxiyuan	538a69c145	[Patch] format patch module to make it more clear (#601 ) Format patch module to make it more clear. Add the patch doc description, the new patch must follow this guide. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-22 14:13:00 +08:00
Shuqiao Li	ad845bfe82	fix doc to mention env setting for v0.7.3-dev (#602 ) ### What this PR does / why we need it? fix doc to mention env setting for v0.7.3-dev Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-22 14:11:41 +08:00
Pleaplusone	d12a057df8	Add note for deepseek related docs and remove unnecessary comments (#590 ) ### What this PR does / why we need it? Add notes for deepseek's patch and remove some of the unnecessary comments --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-22 09:59:09 +08:00
Mengqing Cao	c5850d302d	[Doc] Update installation (#596 ) Many users facing a failed installation when using `pip install -e .`, this is mainly introduced by the released `torch-npu` version conflict with `torch>=2.5.1`. This conflict mainly exist in the temp env of pyproject build. This pr updates installation tutorial by using `python setup.py develop` to quick fix this. cc @wangxiyuan --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-22 09:04:20 +08:00
paulyu12	a8d633f629	[Bugfix] fix import error (#600 ) ### What this PR does / why we need it? Fix the import error that https://github.com/vllm-project/vllm-ascend/issues/592 mentioned. Signed-off-by: paulyu <paulyu0307@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-04-22 08:57:25 +08:00
wemaster	0ae9ee0f8a	[BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593 ) ### What this PR does / why we need it? The pr will fix some bug about spec decode / MTP The pr add a mtp e2e UT `test_mtp_correctness.py` vllm_ascend/attention/attention.py 1. add support `self.attn_mask_cache` only has 1 element to cover scene in which both spec docode and chunked prefill are enabled. vllm_ascend/distributed/parallel_state.py 1. remove 2 assert because spec decode worker would use init_worker twice vllm_ascend/models/deepseek_mtp.py 1. remove unused params; 2. add support w8a8 in `CustomDeepSeekMTP` vllm_ascend/quantization/quant_config.py 1. use `AscendUnquantizedFusedMoEMethod` instead of `UnquantizedFusedMoEMethod` other 1. replace `from vllm.logger import init_logger` to `from vllm.logger import logger` all of the vllm-ascend project ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-21 19:25:51 +08:00
Shuqiao Li	5442b463fd	add doc for patch_config (#574 ) ### What this PR does / why we need it? add doc for patch_config ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No code changed. Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-21 10:33:38 +08:00
Yikun Jiang	96d6fa7c90	[Docker] Fix openEuler image suffix (#586 ) ### What this PR does / why we need it? There was a bug when we release v0.8.4rc1 (openEuler image tag was wrong set to 0.8.4rc1), according doc of docker-meta-action, it should be append suffix: ``` tags: \| type=pep440,enable=true,priority=900,prefix=,suffix=,pattern=,value= ``` This patch just fix openEuler image suffix to make pep440 tag rule work. This patch also remove the cache step because the cache step bring more than 10mins export, but reduce less time in next trigger. ### Does this PR introduce _any_ user-facing change? Yes, docker image tag set to right ### How was this patch tested? I test with in my fork repo by setting default branch: - release a tag: v0.7.88rc1 (pep440 tag) - The log show `--label org.opencontainers.image.version=v0.7.88rc1-openeuler` is right rule https://github.com/Yikun/vllm-ascend/actions/runs/14560411481/job/40842950165#step:9:205 Related: https://github.com/vllm-project/vllm-ascend/pull/489 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-21 08:55:26 +08:00
Yikun Jiang	12cae04db9	[quantization] Support w8a8 quantization (#580 ) ### What this PR does / why we need it? Add a `VLLMAscendQuantizer` to support w8a8 static (W8A8) and dynamic on linear and moe (W8A8_DYNAMIC), the quantizer will be enable if a model has [quantize filed](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27). If MindIE Turbo is installed, the MindIE Turbo Quantizer will apply, otherwise will use VLLMAscendQuantizer directly. - This patch fix installation docs to make installation work - This patch enable norm quantization by patch `RMSNorm.__init__`, `RMSNorm.forward_oot`, `NPUModelRunnerBase.load_model` - Add `AscendW8A8LinearMethod` for W8A8 - Add `AscendW8A8DynamicLinearMethod` and `AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC - Add a e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` ### Does this PR introduce _any_ user-facing change? Yes, support w8a8 quantization. After this patch supported, users can use below commands to run w8a8 models: ``` vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B" ``` ### How was this patch tested? 0. CI passed: add e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` 1. From @Yikun: I test Qwen2.5-0.5B-Instruct-w8a8 for functional test all is well, pls refer to https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613 2. From @dingdingchaomian : Use qwen2.5-72b-instruct model and deepseek-v2-lite-chat tested, both models were quantized using Ascend's msmodelslim tool: - Qwen2.5-72b-instruct were tested twice, one for w8a8 static and one for w8a8 dynamic. - Deepseek-v2-lite-chat were tested once because its quantization used both static and dynamic w8a8. Models were tested using both off line inference and online serving, and both work well. The inference codes are exactly the same with the examples in https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with model path and tensor parallel number changed. --------- Signed-off-by: dingdingchaomian <wangce21@huawei.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: dingdingchaomian <wangce21@huawei.com> Co-authored-by: Angazenn <zengyanjia@huawei.com> Co-authored-by: liujiaxu <liujiaxu4@huawei.com> Co-authored-by: ApsarasX <apsarax@outlook.com> Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>	2025-04-20 18:14:05 +08:00
Pleaplusone	1a1f9a6d89	port deepseekv2 and mtp to main branch (#429 ) ### What this PR does / why we need it? This PR ports all the deepseek graph mode code and mtp code from v0.7.3 to the main branch --------- Signed-off-by: SidaoY <1024863041@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: mengwei805 <mengwei25@huawei.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: q00832892 <qiaoyang19@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: SidaoY <1024863041@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com> Co-authored-by: mengwei805 <mengwei25@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-04-19 17:38:18 +08:00
Yikun Jiang	086423dc35	[Docker] Bump Dockerfile version to v0.8.4 (#577 ) ### What this PR does / why we need it? Bump Dockerfile version to v0.8.4 ### Does this PR introduce _any_ user-facing change? docker image are using v0.8.4 version vLLM ### How was this patch tested? CI passed Closes: https://github.com/vllm-project/vllm-ascend/pull/571 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-18 19:15:17 +08:00
Shuqiao Li	a127cc83f8	catch ImportError when C code not compiled (#575 ) ### What this PR does / why we need it? Found a problem when ImportError raised but not ModuleNotFoundError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-18 18:11:49 +08:00
Shanshan Shen	985b0548b0	[Doc] Update v0.8.4 release note, add contents for structured output feature (#576 ) ### What this PR does / why we need it? Update v0.8.4 release note: - Add contents for structured output feature. - Remove redundant `(` in spec decoding. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Preview Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-18 17:44:16 +08:00
Shanshan Shen	65c1f4579f	[V1][Structured Output] Add `apply_grammar_bitmask()` method to model runner (#555 ) ### What this PR does / why we need it? Add `apply_grammar_bitmask()` method to model runner. This method is necessary for `xgrammar` structured output. --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-18 16:47:55 +08:00
Mengqing Cao	2c903bc7ac	[Doc] Update doc for custom ops build (#570 ) - update doc about custom ops compile --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-18 15:35:10 +08:00
Mengqing Cao	b91f9a5afd	[Doc][Build] Update build doc and faq (#568 ) Update build doc and faq about deepseek w8a8 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-18 14:16:41 +08:00
wangxiyuan	e66ded5679	[Doc] Add release note for 0.8.4rc1 (#557 ) Add release note for 0.8.4rc1, we'll release 0.8.4rc1 now. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-18 13:24:36 +08:00
Shanshan Shen	7eeff60715	[Doc] Update FAQ doc (#561 ) ### What this PR does / why we need it? Update FAQ doc to make `docker pull` more clear Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-18 13:13:13 +08:00
Shuqiao Li	84563fc65d	Add sleep mode feature for Ascend NPU (#513 ) ### What this PR does / why we need it? This PR adds sleep mode feature for vllm-ascend, when sleeps, we do mainly two things: - offload model weights - discard kv cache RLHF tools(such as https://github.com/volcengine/verl and https://github.com/OpenRLHF/OpenRLHF) have a strong need of sleep mode to accelerate the training process. This PR may solve #375 and #320 . ### Does this PR introduce _any_ user-facing change? No existing user interfaces changed. Users will have two new methods(`sleep()` and `wake_up()`) to use. ### How was this patch tested? This PR is tested with Qwen/Qwen2.5-0.5B-Instruct. At first, we have free NPU memory M1. After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)` executed, we have free NPU memory M2. M2 < M1. Then we call `llm.sleep(level=1)`, we have free NPU memory M3. We have M3 > M2, M3 is very close to M1. Plus, we have the same output tokens before sleep and after wake up, with the config of `SamplingParams(temperature=0, max_tokens=10)` and with the same input tokens of course. This PR is utilizing the CMake procedure of #371 , thanks a lot. Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-18 13:11:39 +08:00
wangxiyuan	42c7fbb10e	[Misc] Fix import error and address nits to make CI happy (#563 ) 1. Add `vllm_version_is` function to check vllm version. 2. `ensure_kv_transfer_initialized` and `get_kv_transfer_group ` have been moved to other place in vllm main branch via `3408e47159` , this patch fix the import error. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-18 12:23:32 +08:00
Pleaplusone	66a0837963	adopt rope in vllm-ascend (#530 ) ### What this PR does / why we need it? Adopt custom kernel rotary embedding in actual model inference, customized rotary_embedding will generate contiguous query and key in the cpp side to reduce the overhead of two contiguous and index_select compared with rotary_embedding in torch_npu. For now, rotary_embedding can only support the scenario of `is_neox = true`, non-neox version rope will be updated soon in the future. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-18 08:56:05 +08:00
whx	23f85e3f74	[BugFix] Fix scheduler problems in last PR. (#558 ) This PR Fixes scheduler problems in last PR: 1. change position of DT test to validate it. 2. fix format of copyright. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-04-18 08:49:48 +08:00
Mengqing Cao	6ee7f5cf71	[SpecDecode] Add spec decode support (#500 ) ### What this PR does / why we need it? Backport: https://github.com/vllm-project/vllm-ascend/pull/252 This support speculative decoding in Ascend, including speculating with a draft model、by matching n-grams in the prompt、using MLP speculators and using EAGLE based draft models. Backport: https://github.com/vllm-project/vllm-ascend/pull/423 spec decode MultiStepWorker support TP1DraftModelRunner fully, support run the draft_model_runner with multi-step prepare on the NPU directly and support draft_model_runner use MLA. 1. before this pr, `MultiStepWorker` would not step into the branch using NPU prepare, but only into the branch using CPU prepare (`line 52` of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has `no effect` on the `correct operation` of speculative decoding and the performance of the two branches is basically the same as of the current version, I support entering this branch in this PR. In general, there are two main changes in `patch_multi_step_worker.py`: first, the `is_cuda_like()` check is removed and the `TP1DraftModelRunner` rewritten in vllm_ascend is used; second, the `supports_gpu_multi_step()` function is made to return true on NPU devices when outer Multi_step_worker could work correct. 3. before this pr, `TP1DraftModelRunner` only supports Attention on NPU, but not MLA. The relevant adaptation is in `vllm_ascend/worker/draft_model_runner.py`. Although I don’t know why the `input_positions` of `model_input.attn_metadata` in vllm-ascend needs to be added in `execute_model`, it is done in `model_runner.py`, so I also made corresponding changes. Otherwise, when atten_backend is MLA, it will prompt that input_positions cannot be found. 4. I commented out two lines in `draft_model_runner.py` in `line118` to support the scenario of K>1. ``` # lora_mapping=model_input.lora_mapping, # lora_requests=model_input.lora_requests, ``` I added comments. In the future, when vllm-ascend supports lora feature, the changes here can be restored. TODO： - [ ] revert the patch when the related issues are addressed in vllm ### How was this patch tested? CI passed with new added test. - e2e test for medusa proposer: tests/singlecard/spec_decode/e2e/test_medusa_correctness.py - e2e test for mlp proposer: tests/singlecard/spec_decode/e2e/test_mlp_correctness.py - e2e test for n-gram proposer: tests/singlecard/spec_decode/e2e/test_ngram_correctness.py Tests for patched files: - tests/singlecard/spec_decode/test_dynamic_spec_decode.py - tests/singlecard/spec_decode/test_multi_step_worker.py - tests/singlecard/spec_decode/test_ngram_worker.py - tests/singlecard/spec_decode/test_spec_decode_worker.py --------- Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: mengwei805 <mengwei25@huawei.com>	2025-04-17 20:16:32 +08:00
Mengqing Cao	b71f193cb0	[Model][Doc] Update model support list (#552 ) Update model support list cc @Yikun plz help review, thanks! Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-17 19:32:20 +08:00
whx	20dff4deff	[Scheduler] Add AscendScheduler. (#543 ) This PR adds AscendScheduler to vllm v1 engine. This scheduler currently supports v0-style prefill-first scheduling strategy. In the future more schedule methods will be supported by this scheduler. --------- Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-04-17 19:31:50 +08:00

1 2 3 4 5

204 Commits