xc-llm-ascend

Author	SHA1	Message	Date
Li Wang	90aabaeb2e	[Doc] Add benchmark guide (#635 ) ### What this PR does / why we need it? Add benchmark developer guide --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-30 09:17:59 +08:00
wangxiyuan	f8350569e6	[CI] upgrade vllm to 0.8.5 (#715 ) 1. Upgrade vllm to 0.8.5 2. Drop 0.8.4 support 3. Keep doc to 0.8.4rc2 until we release 0.8.5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:15:50 +08:00
wangxiyuan	95e7aa4736	[Platform] format platform to make it more clear (#610 ) Platform should only contain functions that are based on vllm. This PR moves the unrelated functions to the right place to make the platform module clearer. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:03:10 +08:00
wangxiyuan	b917361ca5	[MISC] Clean up torch_npu (#688 ) torch_npu 2.5.1 now supports autoload. This patch: 1. removes unneeded torch_npu imports 2. replaces `torch_npu.npu` with `torch.npu`. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-29 18:03:38 +08:00
Pleaplusone	0329fad927	[Perf] Deepseekv3 performance optimization for eager mode (#598 ) ### What this PR does / why we need it? Deepseek v3 currently uses vanilla chunked prefill in the MLA part, which is computationally inefficient but necessary for chunked prefill. Since PR https://github.com/vllm-project/vllm-ascend/pull/543 brought the v0 scheduler into vllm-ascend, we can now use torch_npu._npu_flash_attention inside the MLA backend for better performance. Some redundant computation inside the rope is also removed. This PR should bring some performance gain for deepseek eager mode inference. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-29 17:12:03 +08:00
ApsarasX	87975fa058	[Bugfix] Fix early return in CustomDeepseekV2MoE.forward during profile_run (#682 ) ### What this PR does / why we need it? Fix #674 to avoid KVCache overallocation and OOM risks. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-04-29 17:06:19 +08:00
Li Wang	7aee9228f0	[CI] Add nightly CI (#668 ) ### What this PR does / why we need it? Add nightly CI for basic function and model usability --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-29 16:35:52 +08:00
Li Wang	d6be63e11d	[CI] Add Qwen3-0.6B-Base test (#717 ) ### What this PR does / why we need it? Add Qwen3-0.6B-Base for integration test Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-29 14:35:19 +08:00
wangxiyuan	0dae55a9a3	[MISC] fix format check error (#654 ) This PR makes format.sh work as expected. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-29 11:14:19 +08:00
wangxiyuan	1fce70a2fb	[Model] Support common fused moe ops for moe model, such as Qwen3Moe (#709 ) vllm-ascend currently supports moe only for deepseek. We should add common moe support back. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 21:57:01 +08:00
Jade Zheng	40bd602485	[Feature] Use reshape_and_cache fused op (#706 ) Replace torch function with reshape_and_cache fused op for better performance. The `reshape_and_cache` function wasn't working because it expected torch.int32 tensor, but a torch.int64 tensor was provided. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-04-28 21:54:42 +08:00
Yikun Jiang	d39855b075	Update installation and tutorial doc (#711 ) ### What this PR does / why we need it? Update installation and tutorial doc ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-28 21:52:17 +08:00
wangxiyuan	5995d23532	[Doc] Add 0.8.4rc2 release note (#705 ) Add 0.8.4rc2 release note Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 21:51:35 +08:00
wemaster	54c0e63df7	[MTP] follow custom deepseek modeling changes to support graph mode (#636 ) ### What this PR does / why we need it? The custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so this PR makes the corresponding changes to the custom deepseek_mtp modeling. Some modifications for k>1 were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429 and are added here. To better cover the MTP feature in the vllm-ascend repository, I added cases for graph mode (torchair), but they are skipped because torchair cannot correctly clean up memory in vllmrunner. I also added cases for MTP quantization weights, but they are skipped until the test quant weights are ready. https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change issue (https://github.com/vllm-project/vllm-ascend/issues/660), so I added the relevant changes. ### Does this PR introduce _any_ user-facing change? You can now use MTP with deepseek v3/r1 float or quant weights in eager mode as follows: ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, enforce_eager=True, trust_remote_code=True, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` or with deepseek v3/r1 float or quant weights in graph mode (torchair): ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, trust_remote_code=True, additional_config={ 'enable_graph_mode': True, }, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` Notes: 1. k>1 is now supported, so you can set num_speculative_tokens > 1 if there is sufficient redundant computing power; 2. MTP is not supported in V1; we will support it when vLLM does in https://github.com/vllm-project/vllm/issues/13500. 3. If MTP fails with `segmentation fault`, you can follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236, file `vllm_ascend/patch/patch_metrics.py`, method `__npu_async_metrics_collector_init__`. ### How was this patch tested? Tested locally and by CI. Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-28 21:18:53 +08:00
Mengqing Cao	be9e3e8545	[Bugfix] Fix triton placeholder patch period (#704 ) Fix triton placeholder patch period Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-28 18:52:03 +08:00
Li Wang	58f9d932d3	[Doc] Update faqs (#699 ) ### What this PR does / why we need it? Update faqs to make it more clear Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-28 18:48:23 +08:00
Li Wang	d0a0c81ced	[Doc] Add deepseek-v2-lite w8a8 quantization tutorial (#630 ) ### What this PR does / why we need it? Add deepseek-v2-lite w8a8 quantization tutorial --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-28 17:14:26 +08:00
wangxiyuan	5de3646522	[MISC] Make vllm version configurable (#651 ) Sometimes users install a dev/editable version of vllm. In this case, we should make sure vllm-ascend works as well. This PR adds a new env `VLLM_VERSION` for developers who edit vllm. Such developers should set this env to specify which vllm version is installed and used. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 14:19:06 +08:00
dependabot[bot]	8849cf1eda	Bump actions/setup-python from 5.5.0 to 5.6.0 (#697 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.5.0 to 5.6.0. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-04-28 14:06:38 +08:00
Icey	ee7a0e2cd4	Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 (#689 ) ### What this PR does / why we need it? Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 ### Does this PR introduce _any_ user-facing change? No Signed-off-by: Icey <1790571317@qq.com>	2025-04-28 11:45:46 +08:00
Pleaplusone	38f34e359f	[Fix] fix deepseek v0 attention eager mode (#671 ) ### What this PR does / why we need it? `reshape_and_cache_siso` seems to have some functionality issues; use a combination of torch ops in place of this custom op by default. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-28 08:53:06 +08:00
Yikun Jiang	413657ae43	[FOLLOWUP][DOC] Fix pip install cmd in installation.md (#680 ) ### What this PR does / why we need it? Fix pip install cmd in installation.md Followup on: https://github.com/vllm-project/vllm-ascend/pull/661 ### Does this PR introduce _any_ user-facing change? No, doc only ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 18:37:25 +08:00
Yikun Jiang	2e20797934	[BUILD] Upgrade torch-npu to 2.5.1 (#661 ) ### What this PR does / why we need it? The torch-npu 2.5.1 are published: https://pypi.org/project/torch-npu/2.5.1/ It's time to remove all torch-npu dev version from vllm-ascend code base ### Does this PR introduce _any_ user-facing change? Yes, using torch-npu 2.5.1 ### How was this patch tested? - [ ] CI passed - [ ] Manually test - [ ] Grep all `dev2025` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 17:28:29 +08:00
Jade Zheng	fa4a5d980e	[Bugfix] Remove redundant tensor creation and unused code (#656 ) ### What this PR does / why we need it? Eliminated duplicate `block_table` tensor initialization and cleaned up unused code segments. This resolves an issue where the second creation was overwriting the first, potentially leading to unexpected behavior. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-04-27 14:09:16 +08:00
Mengqing Cao	ba3d8aae94	[Model][MiniCPM] support MiniCPM (#645 ) ### What this PR does / why we need it? This pr support minicpm in branch main. see https://github.com/vllm-project/vllm-ascend/pull/164 ### How was this patch tested? test locally with minicpm --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-27 11:27:24 +08:00
Yikun Jiang	742f679c7d	Remove prompt string from engine core data structures (#663 ) ### What this PR does / why we need it? vLLM Ascend side followup on: [Core] Remove prompt string from engine core data structures `df6f3ce883` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-26 23:15:58 +08:00
wangxiyuan	c99c4c8c70	[Doc] Update feature support list (#650 ) 1. Remove the Chinese doc. The content is out of date and we don't have enough time to maintain it. 2. Update the feature support matrix. Refresh the content and add V1 status. --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-26 10:27:29 +08:00
wangxiyuan	3879d9cad9	[CI] Fix sample backward compatibility problem (#648 ) `b411418ff0` this vllm commit change the sample usage. This PR adapt the change for main and make sure it works for 0.8.4 as well. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-25 11:53:26 +08:00
yiz-liu	d785e78563	[V1] Make V1 engine backward compatible (#637 ) ### What this PR does / why we need it? Enforce eager mode in the V1 engine ahead of the upcoming CANN and torch_npu releases. ### Does this PR introduce _any_ user-facing change? After this change, users will no longer need to manually set enforce_eager=True. ### How was this patch tested? Test it with regular offline inference examples. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-24 17:20:11 +08:00
Li Wang	bd70ce828c	[CI] Add qwen2.5-vl test (#643 ) ### What this PR does / why we need it? Part of #499 Add qwen2.5-vl test on single npu, v1 engine is excluded because qwen2.5-vl has some problems with v1 now, at the same time, this test can also make #639 more credible Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-24 17:12:12 +08:00
Li Wang	a9c6b52205	[Bugfix] Fix qwen2.5-vl position input bug (#639 ) ### What this PR does / why we need it? Fix qwen2.5-vl position input bug, fix #625 `TypeError: 'NoneType' object is not iterable` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-24 15:21:57 +08:00
Li Wang	866ce7168c	[Benchmark] Download model from modelscope (#634 ) ### What this PR does / why we need it? - Running the benchmark scripts will download the model from modelscope Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-24 14:48:24 +08:00
Bug Hunter Yan	05bdcbeae4	support aclgraph (#426 ) ### What this PR does / why we need it? This PR lets vllm-ascend use the piecewise_graph feature provided by the v1 engine. 1. register unifiled_ascend_attention_with_output for piecewise_graph to split the graph. 2. support NPUGraph to accelerate kernel launch. ### Does this PR introduce _any_ user-facing change? NPUGraph is enabled by default; users can disable it by setting enforce_eager. This requires torch_npu and CANN versions that support graph capture. ### How was this patch tested? It is enabled by default. --------- Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-23 20:56:24 +08:00
zzzzwwjj	5c6d05a59e	support deepseek quant & mix-parallel with graphmode (#585 ) ### What this PR does / why we need it? 1. support deepseek with w8a8 quant; 2. support deepseek with mix-parallel(multi-DP, EP+TP); 3. support deepseek with graphmode. --------- Signed-off-by: wen-jie666 <wenjie39@huawei.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wen-jie666 <wenjie39@huawei.com>	2025-04-23 16:23:25 +08:00
Pleaplusone	e74331a1ed	Add dp initialize patch with hccl backend (#626 ) ### What this PR does / why we need it? Add dp stateless process group initialization path with hccl backend as vllm-ascend patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-23 15:47:51 +08:00
RongRongStudio	848e041a54	Using EvalScope evaluation (#611 ) ### What this PR does / why we need it? Use EvalScope to run an evaluation (including eval and test): - https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage - https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test locally --------- Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-23 00:50:09 +08:00
Shanshan Shen	4a0ce3660e	[Misc] Remove some parts of metrics patch (#603 ) ### What this PR does / why we need it? Remove some parts of metrics patch, since the `cuda` hard code has been fixed by https://github.com/vllm-project/vllm/pull/14411. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-22 18:45:21 +08:00
Li Wang	cf6ab42ee2	[CI]Add guided decoding test (#422 ) ### What this PR does / why we need it? After extensive testing, we are happy to say that guided_decoding is fully supported by npu, in this pr, we add guided_decoding integrated with our test, mainly does the following things: 1. test v0 supported backends including ` "outlines", "lm-format-enforcer","xgrammar"` 2. test v1 supported backends including ` "guidance", "xgrammar"` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-22 17:50:06 +08:00
wangxiyuan	538a69c145	[Patch] format patch module to make it more clear (#601 ) Format patch module to make it more clear. Add the patch doc description, the new patch must follow this guide. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-22 14:13:00 +08:00
Shuqiao Li	ad845bfe82	fix doc to mention env setting for v0.7.3-dev (#602 ) ### What this PR does / why we need it? fix doc to mention env setting for v0.7.3-dev Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-22 14:11:41 +08:00
Pleaplusone	d12a057df8	Add note for deepseek related docs and remove unnecessary comments (#590 ) ### What this PR does / why we need it? Add notes for deepseek's patch and remove some of the unnecessary comments --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-22 09:59:09 +08:00
Mengqing Cao	c5850d302d	[Doc] Update installation (#596 ) Many users face a failed installation when using `pip install -e .`. This is mainly caused by the released `torch-npu` version conflicting with `torch>=2.5.1`, and the conflict mainly exists in the temp env of the pyproject build. This PR updates the installation tutorial to use `python setup.py develop` as a quick fix. cc @wangxiyuan --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-22 09:04:20 +08:00
paulyu12	a8d633f629	[Bugfix] fix import error (#600 ) ### What this PR does / why we need it? Fix the import error that https://github.com/vllm-project/vllm-ascend/issues/592 mentioned. Signed-off-by: paulyu <paulyu0307@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-04-22 08:57:25 +08:00
wemaster	0ae9ee0f8a	[BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593 ) ### What this PR does / why we need it? This PR fixes some bugs in spec decode / MTP and adds an MTP e2e UT `test_mtp_correctness.py`. vllm_ascend/attention/attention.py 1. support the case where `self.attn_mask_cache` has only 1 element, to cover the scenario in which both spec decode and chunked prefill are enabled. vllm_ascend/distributed/parallel_state.py 1. remove 2 asserts because the spec decode worker would use init_worker twice vllm_ascend/models/deepseek_mtp.py 1. remove unused params; 2. add support for w8a8 in `CustomDeepSeekMTP` vllm_ascend/quantization/quant_config.py 1. use `AscendUnquantizedFusedMoEMethod` instead of `UnquantizedFusedMoEMethod` other 1. replace `from vllm.logger import init_logger` with `from vllm.logger import logger` across the vllm-ascend project ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-21 19:25:51 +08:00
Shuqiao Li	5442b463fd	add doc for patch_config (#574 ) ### What this PR does / why we need it? add doc for patch_config ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No code changed. Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-21 10:33:38 +08:00
Yikun Jiang	96d6fa7c90	[Docker] Fix openEuler image suffix (#586 ) ### What this PR does / why we need it? There was a bug when we released v0.8.4rc1 (the openEuler image tag was wrongly set to 0.8.4rc1). According to the docker-meta-action docs, the suffix should be appended: ``` tags: | type=pep440,enable=true,priority=900,prefix=,suffix=,pattern=,value= ``` This patch fixes the openEuler image suffix so that the pep440 tag rule works. It also removes the cache step, because the cache export takes more than 10 minutes but saves less time in the next trigger. ### Does this PR introduce _any_ user-facing change? Yes, the docker image tag is now set correctly ### How was this patch tested? I tested in my fork repo by setting the default branch: - release a tag: v0.7.88rc1 (pep440 tag) - The log shows `--label org.opencontainers.image.version=v0.7.88rc1-openeuler`, which follows the right rule https://github.com/Yikun/vllm-ascend/actions/runs/14560411481/job/40842950165#step:9:205 Related: https://github.com/vllm-project/vllm-ascend/pull/489 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-21 08:55:26 +08:00
Yikun Jiang	12cae04db9	[quantization] Support w8a8 quantization (#580 ) ### What this PR does / why we need it? Add a `VLLMAscendQuantizer` to support w8a8 static (W8A8) and dynamic on linear and moe (W8A8_DYNAMIC). The quantizer is enabled if a model has a [quantize field](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27). If MindIE Turbo is installed, the MindIE Turbo Quantizer is applied; otherwise VLLMAscendQuantizer is used directly. - This patch fixes the installation docs to make installation work - This patch enables norm quantization by patching `RMSNorm.__init__`, `RMSNorm.forward_oot`, `NPUModelRunnerBase.load_model` - Add `AscendW8A8LinearMethod` for W8A8 - Add `AscendW8A8DynamicLinearMethod` and `AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC - Add an e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` ### Does this PR introduce _any_ user-facing change? Yes, w8a8 quantization is supported. With this patch, users can run w8a8 models with the command below: ``` vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B" ``` ### How was this patch tested? 0. CI passed: add e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` 1. From @Yikun: I ran a functional test on Qwen2.5-0.5B-Instruct-w8a8 and all is well, pls refer to https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613 2. From @dingdingchaomian : Tested with the qwen2.5-72b-instruct and deepseek-v2-lite-chat models, both quantized using Ascend's msmodelslim tool: - Qwen2.5-72b-instruct was tested twice, once for w8a8 static and once for w8a8 dynamic. - Deepseek-v2-lite-chat was tested once because its quantization used both static and dynamic w8a8. Models were tested using both offline inference and online serving, and both work well. The inference code is exactly the same as the examples in https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with the model path and tensor parallel number changed. --------- Signed-off-by: dingdingchaomian <wangce21@huawei.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: dingdingchaomian <wangce21@huawei.com> Co-authored-by: Angazenn <zengyanjia@huawei.com> Co-authored-by: liujiaxu <liujiaxu4@huawei.com> Co-authored-by: ApsarasX <apsarax@outlook.com> Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>	2025-04-20 18:14:05 +08:00
Pleaplusone	1a1f9a6d89	port deepseekv2 and mtp to main branch (#429 ) ### What this PR does / why we need it? This PR ports all the deepseek graph mode code and mtp code from v0.7.3 to the main branch --------- Signed-off-by: SidaoY <1024863041@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: mengwei805 <mengwei25@huawei.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: q00832892 <qiaoyang19@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: SidaoY <1024863041@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com> Co-authored-by: mengwei805 <mengwei25@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-04-19 17:38:18 +08:00
Yikun Jiang	086423dc35	[Docker] Bump Dockerfile version to v0.8.4 (#577 ) ### What this PR does / why we need it? Bump Dockerfile version to v0.8.4 ### Does this PR introduce _any_ user-facing change? docker image are using v0.8.4 version vLLM ### How was this patch tested? CI passed Closes: https://github.com/vllm-project/vllm-ascend/pull/571 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-18 19:15:17 +08:00
Shuqiao Li	a127cc83f8	catch ImportError when C code not compiled (#575 ) ### What this PR does / why we need it? Found a problem when ImportError raised but not ModuleNotFoundError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-18 18:11:49 +08:00
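
Commit a127cc83f8 above notes that a missing compiled extension can raise `ImportError` rather than `ModuleNotFoundError`. The following is a minimal sketch of why catching `ImportError` covers both cases; it is not the code from that commit, and the helper `load_optional_extension` and the module name passed to it are hypothetical, chosen only for illustration.

```python
import importlib


def load_optional_extension(module_name: str):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so this clause
        # handles both a missing module and a module that fails to load.
        return None


# Hypothetical module name, used only to exercise the fallback path.
ext = load_optional_extension("hypothetical_compiled_kernels_xyz")
custom_kernels_available = ext is not None
```

Catching only `ModuleNotFoundError` would let a plain `ImportError` (for example, one raised while loading a partially built extension) propagate, which appears to be the situation the commit describes.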


217 Commits