xc-llm-ascend

Author	SHA1	Message	Date
Angazenn	1e67089bc9	[BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant (#819 ) ### What this PR does / why we need it? 1. This PR introduces a native `all_to_all` communication operator to fix `allgather` bugs when dp_size > 1. It also adds a naive implementation of force-load-balance for profile runs. 2. The operator `npu_dequant_swiglu_quant` only supports input hidden_states with dtype `torch.int32`. This tensor occupies space of `global_bs * seq_len * topk * hidden_size`, which can be very large as `ep_size` grows. Therefore we disable this operator and use the original `swiglu` && `quantize`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By performing offline inference: ![image](https://github.com/user-attachments/assets/e003d5dc-0753-41ae-9303-e87f73ac6828) --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-15 09:19:55 +08:00
wangxiyuan	857f489cbf	[CI] Patch torch.library.infer_schema for torch 2.5 backward compatibility (#837 ) Patch torch.library.infer_schema for torch 2.5 backward compatibility - Introduced a new module `patch_utils` under `vllm_ascend/patch/worker/patch_common/`. - Added a function `ascend_direct_register_custom_op` to handle custom operator registration with backward compatibility for PyTorch < 2.7 (such as torch 2.5.1). - Implemented type conversion logic for annotations to ensure compatibility across different PyTorch versions. - Registered the function `ascend_direct_register_custom_op` to `utils.direct_register_custom_op`. - Updated `__init__.py` to include `patch_utils` as the first import. - Ensured `patch_utils` is available for use in other patch files and skipped isort checks for `patch_utils` import. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-14 09:20:55 +08:00
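The annotation conversion mentioned in #837 can be pictured with a small sketch. It assumes the incompatibility is that older schema inference accepts `typing.List[int]` but not the built-in `list[int]`; the commit message does not say which conversions are applied. `to_typing_annotation` and `convert_annotations` are hypothetical helpers, not the code in `patch_utils`.

```python
from typing import List, get_args, get_origin


def to_typing_annotation(annotation):
    """Rewrite built-in generic lists (list[int]) as typing.List[int].

    Hypothetical helper for illustration only; any other annotation is
    returned unchanged.
    """
    if get_origin(annotation) is list:
        args = get_args(annotation)
        if len(args) == 1:
            # Convert nested element types as well, e.g. list[list[int]].
            return List[to_typing_annotation(args[0])]
    return annotation


def convert_annotations(func):
    """Return a copy of func's annotations with list generics converted."""
    return {
        name: to_typing_annotation(ann)
        for name, ann in func.__annotations__.items()
    }


def example_op(x: list[int], scale: float) -> list[float]:
    """Toy operator used only to demonstrate the conversion."""
    return [v * scale for v in x]
```

A registration wrapper could apply such a conversion before handing the function to schema inference. Whether the real patch works this way is not stated in the log.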
yiz-liu	5305a2ccf9	[Bugfix] Tweak distributed process group initialization and add dummy… (#816 ) fix batch execution method to enable DP in V1 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-12 17:31:29 +08:00
Yikun Jiang	afe1767c17	[Core] Cleanup triton patch which has been fixed in vllm (#764 ) ### What this PR does / why we need it? - Revert "Re-patch TritonPlaceholder on main to make CI happy (#753)" because upstream main CI already merged: https://github.com/vllm-project/vllm/pull/17446 - Keep 0.8.5.post1 compatible ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-06 18:52:15 +08:00
Yikun Jiang	d7e1110c8e	Re-patch TritonPlaceholder on main to make CI happy (#753 ) ### What this PR does / why we need it? Re-patch TritonPlaceholder on main to make CI happy - Add triton patch back until https://github.com/vllm-project/vllm/pull/17446 resolved - Move patch_main before patch_common to resolve minicpm triton import issue - Add `0.8.5` and `0.8.5.post1` to make patch work on 0.8.5 all versions Related: - https://github.com/vllm-project/vllm-ascend/pull/704 - https://github.com/vllm-project/vllm-ascend/pull/690 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All CI passed include main Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-05 23:22:24 +08:00
wangxiyuan	f8350569e6	[CI] upgrade vllm to 0.8.5 (#715 ) 1. Upgrade vllm to 0.8.5 2. Drop 0.8.4 support 3. Keep doc to 0.8.4rc2 until we release 0.8.5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:15:50 +08:00
wangxiyuan	95e7aa4736	[Platform] format platform to make it more clear (#610 ) Platform should only contain the functions that are based on vllm. This PR moves the unrelated functions to the right place to make platform clearer. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:03:10 +08:00
wangxiyuan	b917361ca5	[MISC] Clean up torch_npu (#688 ) torch_npu 2.5.1 support autoload now. This patch does: 1. remove useless torch_npu import 2. replace `torch_npu.npu` to `torch.npu`. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-29 18:03:38 +08:00
wemaster	54c0e63df7	[MTP] follow custom deepseek modeling changes to support graph mode (#636 ) ### What this PR does / why we need it? Custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so this PR makes the corresponding changes to the custom deepseek_mtp modeling. Some modifications for k>1 were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429; this PR adds them. To better cover the MTP feature in the vllm-ascend repository, I added cases for graph mode (torchair), but they are skipped because torchair cannot correctly clean up memory in vllmrunner. I also added cases for MTP quantization weights, but the test weights are not ready, so they are skipped until the quant test weights are available. https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change issue (https://github.com/vllm-project/vllm-ascend/issues/660), so I added the relevant changes. ### Does this PR introduce _any_ user-facing change? You can now use MTP with deepseek v3/r1 float or quant weights in eager mode as follows:
```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```
Or use MTP with deepseek v3/r1 float or quant weights in graph mode (torchair):
```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```
Notes: 1. k>1 is now supported, so you can set num_speculative_tokens > 1 if there is sufficient redundant computing power; 2. MTP is not supported in V1; we will support it when vLLM does in https://github.com/vllm-project/vllm/issues/13500. 3. If MTP fails with a `segmentation fault`, you can follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236, file `vllm_ascend/patch/patch_metrics.py`, method `__npu_async_metrics_collector_init__`. ### How was this patch tested? Tested locally and by CI. Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-28 21:18:53 +08:00
Mengqing Cao	be9e3e8545	[Bugfix] Fix triton placeholder patch period (#704 ) Fix triton placeholder patch period Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-28 18:52:03 +08:00
Mengqing Cao	ba3d8aae94	[Model][MiniCPM] support MiniCPM (#645 ) ### What this PR does / why we need it? This pr support minicpm in branch main. see https://github.com/vllm-project/vllm-ascend/pull/164 ### How was this patch tested? test locally with minicpm --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-27 11:27:24 +08:00
Pleaplusone	e74331a1ed	Add dp initialize patch with hccl backend (#626 ) ### What this PR does / why we need it? Add dp stateless process group initialization path with hccl backend as vllm-ascend patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-23 15:47:51 +08:00
Shanshan Shen	4a0ce3660e	[Misc] Remove some parts of metrics patch (#603 ) ### What this PR does / why we need it? Remove some parts of metrics patch, since the `cuda` hard code has been fixed by https://github.com/vllm-project/vllm/pull/14411. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-22 18:45:21 +08:00
wangxiyuan	538a69c145	[Patch] format patch module to make it more clear (#601 ) Format patch module to make it more clear. Add the patch doc description, the new patch must follow this guide. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-22 14:13:00 +08:00
Pleaplusone	d12a057df8	Add note for deepseek related docs and remove unnecessary comments (#590 ) ### What this PR does / why we need it? Add notes for deepseek's patch and remove some of the unnecessary comments --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-22 09:59:09 +08:00
wemaster	0ae9ee0f8a	[BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593 ) ### What this PR does / why we need it? This PR fixes some bugs in spec decode / MTP and adds an MTP e2e UT, `test_mtp_correctness.py`. vllm_ascend/attention/attention.py: 1. support the case where `self.attn_mask_cache` has only 1 element, to cover the scenario in which both spec decode and chunked prefill are enabled. vllm_ascend/distributed/parallel_state.py: 1. remove 2 asserts because the spec decode worker would use init_worker twice. vllm_ascend/models/deepseek_mtp.py: 1. remove unused params; 2. add w8a8 support in `CustomDeepSeekMTP`. vllm_ascend/quantization/quant_config.py: 1. use `AscendUnquantizedFusedMoEMethod` instead of `UnquantizedFusedMoEMethod`. Other: 1. replace `from vllm.logger import init_logger` with `from vllm.logger import logger` throughout the vllm-ascend project. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-21 19:25:51 +08:00
Shuqiao Li	5442b463fd	add doc for patch_config (#574 ) ### What this PR does / why we need it? add doc for patch_config ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No code changed. Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-21 10:33:38 +08:00
Pleaplusone	1a1f9a6d89	port deepseekv2 and mtp to main branch (#429 ) ### What this PR does / why we need it? This PR ports all the deepseek graph mode code and mtp code from v0.7.3 to the main branch --------- Signed-off-by: SidaoY <1024863041@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: mengwei805 <mengwei25@huawei.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: q00832892 <qiaoyang19@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: SidaoY <1024863041@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com> Co-authored-by: mengwei805 <mengwei25@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-04-19 17:38:18 +08:00
Shuqiao Li	84563fc65d	Add sleep mode feature for Ascend NPU (#513 ) ### What this PR does / why we need it? This PR adds a sleep mode feature for vllm-ascend. When sleeping, it mainly does two things: - offload model weights - discard kv cache. RLHF tools (such as https://github.com/volcengine/verl and https://github.com/OpenRLHF/OpenRLHF) have a strong need for sleep mode to accelerate the training process. This PR may solve #375 and #320. ### Does this PR introduce _any_ user-facing change? No existing user interfaces changed. Users will have two new methods (`sleep()` and `wake_up()`) to use. ### How was this patch tested? This PR is tested with Qwen/Qwen2.5-0.5B-Instruct. At first, we have free NPU memory M1. After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)` executes, we have free NPU memory M2, with M2 < M1. Then we call `llm.sleep(level=1)` and have free NPU memory M3, with M3 > M2 and M3 very close to M1. In addition, the output tokens are the same before sleep and after wake up, using `SamplingParams(temperature=0, max_tokens=10)` and the same input tokens. This PR uses the CMake procedure of #371, thanks a lot. Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-18 13:11:39 +08:00
wangxiyuan	42c7fbb10e	[Misc] Fix import error and address nits to make CI happy (#563 ) 1. Add a `vllm_version_is` function to check the vllm version. 2. `ensure_kv_transfer_initialized` and `get_kv_transfer_group` have been moved elsewhere in the vllm main branch via `3408e47159`; this patch fixes the import error. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-18 12:23:32 +08:00
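A version gate like the `vllm_version_is` function added in #563 could look roughly like the following. The sketch takes the installed version as an explicit argument so that it runs without vllm installed. The real function's signature and comparison rules are not shown in this log, so `version_is`, `pick_import_path`, and the exact-match semantics are assumptions.

```python
def version_is(installed: str, target: str) -> bool:
    """Return True when the installed version string equals the target.

    Assumption: exact string match after trimming whitespace, so
    "0.8.5" and "0.8.5.post1" count as different versions.
    """
    return installed.strip() == target.strip()


def pick_import_path(installed: str) -> str:
    """Choose between two placeholder module paths based on the version."""
    if version_is(installed, "0.8.4"):
        return "old.location"  # placeholder, not a real vllm path
    return "new.location"  # placeholder, not a real vllm path
```

Under exact matching, every supported release has to be listed explicitly, which fits the note in #753 about adding both `0.8.5` and `0.8.5.post1`.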
Mengqing Cao	6ee7f5cf71	[SpecDecode] Add spec decode support (#500 ) ### What this PR does / why we need it? Backport: https://github.com/vllm-project/vllm-ascend/pull/252 This supports speculative decoding on Ascend, including speculating with a draft model, by matching n-grams in the prompt, using MLP speculators, and using EAGLE based draft models. Backport: https://github.com/vllm-project/vllm-ascend/pull/423 The spec decode MultiStepWorker fully supports TP1DraftModelRunner, supports running the draft_model_runner with multi-step prepare directly on the NPU, and supports draft_model_runner using MLA. 1. Before this PR, `MultiStepWorker` would not step into the branch using NPU prepare, but only into the branch using CPU prepare (`line 52` of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has `no effect` on the `correct operation` of speculative decoding and the performance of the two branches is basically the same as of the current version, this PR enables entering the NPU branch. In general, there are two main changes in `patch_multi_step_worker.py`: first, the `is_cuda_like()` check is removed and the `TP1DraftModelRunner` rewritten in vllm_ascend is used; second, the `supports_gpu_multi_step()` function is made to return true on NPU devices when the outer Multi_step_worker can work correctly. 3. Before this PR, `TP1DraftModelRunner` only supported Attention on NPU, but not MLA. The relevant adaptation is in `vllm_ascend/worker/draft_model_runner.py`. Although I don't know why the `input_positions` of `model_input.attn_metadata` in vllm-ascend needs to be added in `execute_model`, it is done in `model_runner.py`, so I made the corresponding changes as well. Otherwise, when atten_backend is MLA, it reports that input_positions cannot be found. 4. I commented out two lines in `draft_model_runner.py` at `line118` to support the scenario of K>1.
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
I added comments. In the future, when vllm-ascend supports the lora feature, the changes here can be restored. TODO: - [ ] revert the patch when the related issues are addressed in vllm ### How was this patch tested? CI passed with newly added tests. - e2e test for medusa proposer: tests/singlecard/spec_decode/e2e/test_medusa_correctness.py - e2e test for mlp proposer: tests/singlecard/spec_decode/e2e/test_mlp_correctness.py - e2e test for n-gram proposer: tests/singlecard/spec_decode/e2e/test_ngram_correctness.py Tests for patched files: - tests/singlecard/spec_decode/test_dynamic_spec_decode.py - tests/singlecard/spec_decode/test_multi_step_worker.py - tests/singlecard/spec_decode/test_ngram_worker.py - tests/singlecard/spec_decode/test_spec_decode_worker.py --------- Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: mengwei805 <mengwei25@huawei.com>	2025-04-17 20:16:32 +08:00
wangxiyuan	bbe7ccd366	[MISC] Add patch module (#526 ) This PR adds a patch module for vllm: 1. platform patch: the patch is registered when the platform is loaded 2. worker patch: the patch is registered when the worker is started. In detail: 1. patch_common: patch for main and the 0.8.4 version 2. patch_main: patch for the main version 3. patch_0_8_4: patch for the 0.8.4 version	2025-04-16 09:28:58 +08:00
Mengqing Cao	4544e99d88	[dist] revert communicator patch (#66 ) ### What this PR does / why we need it? Revert communicator patch as https://github.com/vllm-project/vllm/pull/13208 has been merged. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? test locally by https://github.com/vllm-project/vllm-ascend/pull/30#issuecomment-2650251266 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-17 11:42:33 +08:00
Mengqing Cao	b88443b6c6	[dist] fix communicator patch (#58 ) ### What this PR does / why we need it? fix communicator patch so parallel could work. see #52 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-14 10:45:49 +08:00
wangxiyuan	f762ee89cc	[Communicator] Add monkey patch (#30 ) Some PRs for plugin support are not merged by vllm yet. This PR adds a monkey patch to vllm-ascend so that vllm-ascend works with vllm directly. The patch code should be removed once the related functionality is supported natively by vllm. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-02-11 19:15:35 +08:00
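The monkey-patch approach described in #30 follows a common pattern: replace an attribute on an already-imported module so that later lookups see the new implementation, and keep the original so the patch can be removed later. The sketch below uses a stand-in namespace rather than vllm. `fake_upstream`, `get_communicator`, `apply_patch`, and `remove_patch` are hypothetical names and do not reflect the actual patch code.

```python
import types

# Stand-in for an upstream module; not a real vllm module.
fake_upstream = types.SimpleNamespace()
fake_upstream.get_communicator = lambda: "default"


def ascend_get_communicator() -> str:
    """Replacement implementation supplied by the plugin."""
    return "ascend"


def apply_patch(module) -> None:
    """Swap in the replacement, keeping the original for later removal."""
    module._original_get_communicator = module.get_communicator
    module.get_communicator = ascend_get_communicator


def remove_patch(module) -> None:
    """Restore the original once upstream supports the feature."""
    module.get_communicator = module._original_get_communicator
    del module._original_get_communicator
```

Callers that look up `module.get_communicator` at call time pick up the replacement. Code that imported the function by name before the patch was applied would still hold the original, which is one reason patch ordering matters.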

25 Commits