xc-llm-ascend

Author	SHA1	Message	Date
ApsarasX	324f819b92	[Perf] Optimize fused_experts quantization code to save npu memory (#784 ) ### What this PR does / why we need it? In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned a new variable name. If we want to save NPU memory, we have to manually `del` these variables to end their lifecycle, which fills the code with `del` statements and looks inelegant. Therefore, I plan to name the output of most operators `hidden_states`, thereby ending the lifecycle of the previous `hidden_states`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-05-09 15:09:37 +08:00
Jade Zheng	2c685e3b61	[Bugfix] Correct method call for _set_cos_sin_cache (#774 ) This change ensures proper functionality for longer sequences by correctly invoking the _set_cos_sin_cache method with self as the first argument. For example, with DeepSeek R1, if this change isn't made, the program will crash when the input sequence exceeds 4096. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-05-09 12:55:57 +08:00
zzzzwwjj	5301649108	[Doc] Add notes for OOM in FAQs (#786 ) ### What this PR does / why we need it? add notes for OOM in faqs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-08 16:28:29 +08:00
chris668899	6c020883a8	[WIP]Add Func: aclgraph_batch_size auto-adjust to different model (#771 ) ### What this PR does / why we need it? This PR adds a new function: aclgraph_batch_size can be dynamically adjusted to different models. Before this PR, the aclgraph_batch_sizes passed from vllm to vllm-ascend were always too large, which could result in an ERROR when running on different models, with the message: "The resources are insufficient". With this PR, the code dynamically adjusts aclgraph_batch_sizes depending on the model's hidden_layer_nums and parallel config, for example: a. for Qwen2.5-7B, the aclgraph_batch_size length is 33 total; b. for Qwen2.5-72B, the aclgraph_batch_size length is 11 total; Signed-off-by: chris668899 <15105191595@126.com>	2025-05-08 16:23:33 +08:00
yiz-liu	2e3520e285	[Bugfix] Fix output tensor shape in vanilla_chunked_prefill and update import paths for model_loader (#773 ) ### What this PR does / why we need it? Fix output tensor shape in vanilla_chunked_prefill function. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Run offline inference on DeepSeek models. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-08 14:19:26 +08:00
Yikun Jiang	ec27af346a	[Doc] Add 0.8.5rc1 release note (#756 ) ### What this PR does / why we need it? Add 0.8.5rc1 release note and bump vllm version to v0.8.5.post1 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-06 23:46:35 +08:00
linfeng-yuan	2cd036ee8e	[Bugfix] fix accuracy problem for quantized deepseek models (#768 ) ### What this PR does / why we need it? The root cause of the bug is that numerical computations involving NaNs cannot eliminate them. We addressed it by using `masked_fill_` to eliminate NaNs while avoiding memory-wasting `torch.where` approach. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This patch was tested with vllm v0.8.5 and vllm-ascend master. I run deepseek_v3 model with offline inference scripts (examples/dp_offline/run_dp.sh & data_parallel.py). Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-05-06 22:09:56 +08:00
ApsarasX	d6e9417652	[Bugfix] Fix masked_fill_ function typo (#769 ) ### What this PR does / why we need it? Fix function name typo, make `mask_fill_` to `masked_fill_` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-05-06 21:54:52 +08:00
Yikun Jiang	afe1767c17	[Core] Cleanup triton patch which has been fixed in vllm (#764 ) ### What this PR does / why we need it? - Revert "Re-patch TritonPlaceholder on main to make CI happy (#753)" because upstream main CI already merged: https://github.com/vllm-project/vllm/pull/17446 - Keep 0.8.5.post1 compatible ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-06 18:52:15 +08:00
linfeng-yuan	b0dbe5f8e1	[Bug fix] fix a typo in setup.py (#762 ) ### What this PR does / why we need it? Fix a typo in setup.py. Currently, it does not affect any functionality or interfaces. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-05-06 17:01:26 +08:00
Yikun Jiang	5897dc5bbe	[Build] Bump vLLM version to v0.8.5.post1 (#755 ) ### What this PR does / why we need it? Bump vllm version to v0.8.5.post1 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-06 11:44:12 +08:00
sunbaosong	d6bfae8eee	support 32K model len on deepseek r1 W8A8 (#728 ) ### What this PR does / why we need it? Optimize NPU memory usage. https://github.com/vllm-project/vllm-ascend/issues/723 vllm v0.8.4.rc2 and DeepSeek R1 can only support a model length of 16K. When attempting to run with a model length of 32K, an "Out of Memory" (OOM) error will occur. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: sunbaosong <13793883820@163.com>	2025-05-06 10:12:07 +08:00
Yikun Jiang	79538b5d73	Upgrade CANN version to 8.1.rc1 (#747 ) ### What this PR does / why we need it? Make CANN version bump separately from https://github.com/vllm-project/vllm-ascend/pull/708 - Upgrade CANN version to 8.1.rc1 - Add prefix to speed up download `m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10` - Address trailing space for Dockerfile.openEuler - Add note for `/workspace` and `/vllm-workspace` as followup of https://github.com/vllm-project/vllm-ascend/pull/741 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed Co-authored-by: MengqingCao <cmq0113@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-06 05:44:18 +08:00
Yikun Jiang	d7e1110c8e	Re-patch TritonPlaceholder on main to make CI happy (#753 ) ### What this PR does / why we need it? Re-patch TritonPlaceholder on main to make CI happy - Add triton patch back until https://github.com/vllm-project/vllm/pull/17446 resolved - Move patch_main before patch_common to resolve minicpm triton import issue - Add `0.8.5` and `0.8.5.post1` to make patch work on 0.8.5 all versions Related: - https://github.com/vllm-project/vllm-ascend/pull/704 - https://github.com/vllm-project/vllm-ascend/pull/690 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All CI passed include main Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-05 23:22:24 +08:00
Yikun Jiang	d2ead057ae	Re-enable Speculative Decode test for vLLM v0.8.5 (#749 ) ### What this PR does / why we need it? Re-enable Speculative Decode test for vLLM v0.8.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-02 14:44:48 +08:00
whx	8b194ad12e	[Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694 ) ### What this PR does / why we need it? - This PR proposes a P2P version of Disaggregated Prefill based on llm_datadist, which manages data transfer. - This solution reworks the previous offline single-node Disaggregated Prefill solution, and now supports multi-node and online serving. - Currently this solution supports the 1P1D situation of Deepseek hybrid parallelism (P: TP+EP, D: DP+EP). Note that the xPyD situation is considered in the solution design, and will be supported soon within the v1 engine. --------- Signed-off-by: hw_whx <wanghexiang7@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: ganyi <pleaplusone.gy@gmail.com>	2025-05-01 22:31:36 +08:00
linfeng-yuan	84e2ed898b	performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731 ) ### What this PR does / why we need it? 1. Improve inference speed and usability for deepseek models with NPU graph mode. 2. Modify some code to adapt to CANN 8.1.RC1.beta1. 3. Add a switch for NPU graph mode and its cache. ### Does this PR introduce _any_ user-facing change? This PR provides an experimental configuration to enable NPU graph mode for Deepseek models. Users can set additional_config={'enable_graph_mode': True} to try this feature. Note that this feature currently supports only the V0 engine. ### How was this patch tested? This patch was tested with the newest torch_npu 2.5.1 (https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1 toolkit&nnal&kernels (https://www.hiascend.com/developer/download/community/result?module=cann) released in 25/30 April. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-05-01 13:51:42 +08:00
Mengqing Cao	399b03830d	[Build][Bugfix] Fix source code path to avoid reference error (#726 ) ### What this PR does / why we need it? Fix source code path to avoid reference error in docker image fix https://github.com/vllm-project/vllm-ascend/issues/725 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-30 17:38:13 +08:00
Pleaplusone	3a628891ab	[Feature] Add quant description file for new quant model generated by modelslim (#719 ) ### What this PR does / why we need it? After discussing the quantization model format with MindStudio, we decided to support another quant format which may be used in the new modelslim tool. In that case, `quantization_config` may be removed from the `config.json` file and `quant_model_description.json` will be used for quantization configuration. ### Does this PR introduce _any_ user-facing change? Yes, using the latest quantization format ### How was this patch tested? Test locally Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-30 16:51:56 +08:00
hfadzxy	affca6f348	[Test] Add accuracy test report workflow (#542 ) ### What this PR does / why we need it? 1. Provide accuracy test report for development branch release. 2. Models and datasets for accuracy test: Qwen2.5-7B-Instruct (ceval-val, gsm8k, mmlu); Qwen3-8B (ceval-val, gsm8k, mmlu); Llama-3.1-8B-Instruct (ceval-val, gsm8k, mmlu); Qwen2.5-VL-7B-Instruct (mmmu_val). ### Does this PR introduce _any_ user-facing change? This PR will display the accuracy test report of the release version in docs/source/developer_guide/accuracy_report: Qwen2.5-7B-Instruct.md, Qwen3-8B.md, Llama-3.1-8B-Instruct.md, Qwen2.5-VL-7B-Instruct.md Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-04-30 14:53:58 +08:00
zouyida2052	ba9714ccee	Optimize qwen2_vl and qwen2_5_vl (#701 ) ### What this PR does / why we need it? Optimize qwen2_vl and qwen2_5_vl. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Tested on a 1080p picture with tp=1, bs=1 on Qwen2-VL and Qwen2.5-VL: the duration of every fa op dropped from 11ms to 9ms, roughly a 22% perf boost. --------- Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: zouyida2052 <zouyida@huawei.com>	2025-04-30 14:22:38 +08:00
Li Wang	90aabaeb2e	[Doc] Add benchmark guide (#635 ) ### What this PR does / why we need it? Add benchmark developer guide --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-30 09:17:59 +08:00
wangxiyuan	f8350569e6	[CI] upgrade vllm to 0.8.5 (#715 ) 1. Upgrade vllm to 0.8.5 2. Drop 0.8.4 support 3. Keep doc to 0.8.4rc2 until we release 0.8.5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:15:50 +08:00
wangxiyuan	95e7aa4736	[Platform] format platform to make it more clear (#610 ) Platform should only contain the functions that are based on vllm. This PR moves the unrelated functions to the right place to make platform clearer. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:03:10 +08:00
wangxiyuan	b917361ca5	[MISC] Clean up torch_npu (#688 ) torch_npu 2.5.1 supports autoload now. This patch does: 1. remove useless torch_npu import 2. replace `torch_npu.npu` with `torch.npu`. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-29 18:03:38 +08:00
Pleaplusone	0329fad927	[Perf] Deepseekv3 performance optimization for eager mode (#598 ) ### What this PR does / why we need it? Deepseek v3 currently adopts vanilla chunked prefill on the MLA part, which is inefficient for computing but necessary for chunked prefill. Since PR https://github.com/vllm-project/vllm-ascend/pull/543 brought the v0 scheduler into vllm-ascend, we can now adopt torch_npu._npu_flash_attention inside the mla backend for a further performance boost. There was also some redundant computation inside the rope, which is removed as well. This PR should bring some performance gain for deepseek eager mode inference. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-29 17:12:03 +08:00
ApsarasX	87975fa058	[Bugfix] Fix early return in CustomDeepseekV2MoE.forward during profile_run (#682 ) ### What this PR does / why we need it? Fix #674 to avoid KVCache overallocation and OOM risks. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-04-29 17:06:19 +08:00
Li Wang	7aee9228f0	[CI] Add nightly CI (#668 ) ### What this PR does / why we need it? Add nightly CI for basic function and model usability --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-29 16:35:52 +08:00
Li Wang	d6be63e11d	[CI] Add Qwen3-0.6B-Base test (#717 ) ### What this PR does / why we need it? Add Qwen3-0.6B-Base for integration test Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-29 14:35:19 +08:00
wangxiyuan	0dae55a9a3	[MISC] fix format check error (#654 ) This PR makes format.sh work as expected. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-29 11:14:19 +08:00
wangxiyuan	1fce70a2fb	[Model] Support common fused moe ops for moe model, such as Qwen3Moe (#709 ) vllm-ascend currently supports moe only for deepseek. We should add common moe support back. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 21:57:01 +08:00
Jade Zheng	40bd602485	[Feature] Use reshape_and_cache fused op (#706 ) Replace torch function with reshape_and_cache fused op for better performance. The `reshape_and_cache` function wasn't working because it expected torch.int32 tensor, but a torch.int64 tensor was provided. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-04-28 21:54:42 +08:00
Yikun Jiang	d39855b075	Update installation and tutorial doc (#711 ) ### What this PR does / why we need it? Update installation and tutorial doc ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-28 21:52:17 +08:00
wangxiyuan	5995d23532	[Doc] Add 0.8.4rc2 release note (#705 ) Add 0.8.4rc2 release note Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 21:51:35 +08:00
wemaster	54c0e63df7	[MTP] follow custom deepseek modeling changes to support graph mode (#636 ) ### What this PR does / why we need it? Custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so I changed the custom deepseek_mtp modeling accordingly. Some modifications for k>1 were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429, so I add them now. To better take care of the MTP feature in the vllm-ascend repository, I added cases related to graph mode (torchair), but I skip them since torchair cannot correctly clean up memory in vllmrunner. I also added some cases for MTP quantization weights, but the test weights are not ready, so I skip them and will enable them when the test quant weights are ready. https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change issue (https://github.com/vllm-project/vllm-ascend/issues/660), so I added the relevant changes. ### Does this PR introduce _any_ user-facing change? Now you can use MTP in deepseek v3/r1 float or quant weights with eager mode as follows: ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, enforce_eager=True, trust_remote_code=True, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` or use MTP in deepseek v3/r1 float or quant weights with graph mode (torchair): ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, trust_remote_code=True, additional_config={ 'enable_graph_mode': True, }, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` Notes: 1. k>1 is now supported, so you can set num_speculative_tokens > 1 if there is sufficient redundant computing power; 2. MTP is not supported in V1; we will support it when vLLM does in https://github.com/vllm-project/vllm/issues/13500. 3. If MTP fails with a `segmentation fault`, you can follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236, file `vllm_ascend/patch/patch_metrics.py`, method `__npu_async_metrics_collector_init__`. ### How was this patch tested? Tested locally and by CI. Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-28 21:18:53 +08:00
Mengqing Cao	be9e3e8545	[Bugfix] Fix triton placeholder patch period (#704 ) Fix triton placeholder patch period Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-28 18:52:03 +08:00
Li Wang	58f9d932d3	[Doc] Update faqs (#699 ) ### What this PR does / why we need it? Update faqs to make it more clear Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-28 18:48:23 +08:00
Li Wang	d0a0c81ced	[Doc] Add deepseek-v2-lite w8a8 quantization tutorial (#630 ) ### What this PR does / why we need it? Add deepseek-v2-lite w8a8 quantization tutorial --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-28 17:14:26 +08:00
wangxiyuan	5de3646522	[MISC] Make vllm version configurable (#651 ) Sometimes users install a dev/editable version of vllm. In this case, we should make sure vllm-ascend works as well. This PR adds a new env `VLLM_VERSION`. It is intended for developers who edit vllm: they should set this env to indicate which vllm version is installed and used. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-28 14:19:06 +08:00
dependabot[bot]	8849cf1eda	Bump actions/setup-python from 5.5.0 to 5.6.0 (#697 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.5.0 to 5.6.0. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-04-28 14:06:38 +08:00
Icey	ee7a0e2cd4	Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 (#689 ) ### What this PR does / why we need it? Update openEuler dockerfile for COMPILE_CUSTOM_KERNELS=1 ### Does this PR introduce _any_ user-facing change? No Signed-off-by: Icey <1790571317@qq.com>	2025-04-28 11:45:46 +08:00
Pleaplusone	38f34e359f	[Fix] fix deepseek v0 attention eager mode (#671 ) ### What this PR does / why we need it? `reshape_and_cache_siso` seems to have some functionality issues, so a combination of torch ops replaces this custom op by default. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-28 08:53:06 +08:00
Yikun Jiang	413657ae43	[FOLLOWUP][DOC] Fix pip install cmd in installation.md (#680 ) ### What this PR does / why we need it? Fix pip install cmd in installation.md Followup on: https://github.com/vllm-project/vllm-ascend/pull/661 ### Does this PR introduce _any_ user-facing change? No, doc only ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 18:37:25 +08:00
Yikun Jiang	2e20797934	[BUILD] Upgrade torch-npu to 2.5.1 (#661 ) ### What this PR does / why we need it? torch-npu 2.5.1 has been published: https://pypi.org/project/torch-npu/2.5.1/ It's time to remove all torch-npu dev versions from the vllm-ascend code base ### Does this PR introduce _any_ user-facing change? Yes, using torch-npu 2.5.1 ### How was this patch tested? - [ ] CI passed - [ ] Manually test - [ ] Grep all `dev2025` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 17:28:29 +08:00
Jade Zheng	fa4a5d980e	[Bugfix] Remove redundant tensor creation and unused code (#656 ) ### What this PR does / why we need it? Eliminated duplicate `block_table` tensor initialization and cleaned up unused code segments. This resolves an issue where the second creation was overwriting the first, potentially leading to unexpected behavior. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-04-27 14:09:16 +08:00
Mengqing Cao	ba3d8aae94	[Model][MiniCPM] support MiniCPM (#645 ) ### What this PR does / why we need it? This PR supports minicpm in branch main. See https://github.com/vllm-project/vllm-ascend/pull/164 ### How was this patch tested? Tested locally with minicpm --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-27 11:27:24 +08:00
Yikun Jiang	742f679c7d	Remove prompt string from engine core data structures (#663 ) ### What this PR does / why we need it? vLLM Ascend side followup on: [Core] Remove prompt string from engine core data structures `df6f3ce883` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-26 23:15:58 +08:00
wangxiyuan	c99c4c8c70	[Doc] Update feature support list (#650 ) 1. Remove Chinese doc. The content is out of date and we don't have enough time to maintain it. 2. Update feature support matrix. Refresh the content and add V1 status. --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-26 10:27:29 +08:00
wangxiyuan	3879d9cad9	[CI] Fix sample backward compatibility problem (#648 ) The vllm commit `b411418ff0` changed the sample usage. This PR adapts to the change on main and makes sure it works for 0.8.4 as well. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-25 11:53:26 +08:00
yiz-liu	d785e78563	[V1] Make V1 engine backward compatible (#637 ) ### What this PR does / why we need it? Enforce eager mode in the V1 engine ahead of the upcoming CANN and torch_npu releases. ### Does this PR introduce _any_ user-facing change? After this change, users will no longer need to manually set enforce_eager=True. ### How was this patch tested? Test it with regular offline inference examples. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-24 17:20:11 +08:00
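
The fused_experts change listed above (#784) relies on a general Python behaviour: rebinding a name drops the reference to the object it previously held, so no explicit `del` is needed. The sketch below only illustrates that idea and is not the code from the PR. `FakeTensor` and `step` are hypothetical stand-ins for an NPU tensor and an operator, and the immediate release assumes CPython's reference counting.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for an NPU tensor, used only to observe lifetime."""

    def __init__(self, name: str) -> None:
        self.name = name


def step(x: FakeTensor, name: str) -> FakeTensor:
    """Hypothetical operator: returns a new buffer derived from its input."""
    return FakeTensor(name)


def pipeline_with_new_names(x: FakeTensor):
    # Each intermediate gets its own name, so all of them stay alive
    # until the function returns unless each one is `del`-ed by hand.
    a = step(x, "a")
    probe = weakref.ref(a)
    b = step(a, "b")
    c = step(b, "c")
    first_still_alive = probe() is not None
    return c, first_still_alive


def pipeline_with_rebinding(x: FakeTensor):
    # Reusing one name drops the only reference to the previous output
    # as soon as the next operator returns (CPython refcounting assumed).
    hidden_states = step(x, "a")
    probe = weakref.ref(hidden_states)
    hidden_states = step(hidden_states, "b")
    hidden_states = step(hidden_states, "c")
    first_still_alive = probe() is not None
    return hidden_states, first_still_alive


if __name__ == "__main__":
    print(pipeline_with_new_names(FakeTensor("in"))[1])
    print(pipeline_with_rebinding(FakeTensor("in"))[1])
```

The weak reference acts as a probe: it does not keep the first intermediate alive, so it reports whether anything else still does.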

288 Commits