xc-llm-ascend

Author	SHA1	Message	Date
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and cleanup infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837), this patch depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ install depending: ``` pip3 install pyyaml pip3 install setuptools ``` install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
22dimensions	9e65da990e	[Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132 ) ### What this PR does / why we need it? cherry-pick #1501 from 0.9.1-dev to main Currently, Ray is not compatible with ACL Graph, so we need to fall back to eager mode when using the Ray backend. co-authored: Yizhou Liu <liu_yizhou@outlook.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:06:09 +08:00
wangxiyuan	af56ae3ed1	[1/4][Refactor] Refactor torchair worker (#1885 ) There is a lot of torchair-specific logic in common code, which makes the code hard to maintain. We will create a new torchair module and move the torchair-related logic there. I plan to add 4 PRs: 1. Refactor worker (this PR): create the torchair module and move torchair-related code in worker to the new module 2. Refactor utils 3. Refactor model_runner 4. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 11:50:46 +08:00
Mengqing Cao	8cfd257992	[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681 ) ### What this PR does / why we need it? Remove ETP/EP maintained in branch main. We drop this because there are currently no relevant scenarios that use ETP; we may later advocate implementing expert tensor parallelism in vLLM to support scenarios where experts need to be sliced. This is part of the #1422 backport. Fixes https://github.com/vllm-project/vllm-ascend/issues/1396 https://github.com/vllm-project/vllm-ascend/issues/1154 ### Does this PR introduce _any_ user-facing change? We'll no longer maintain etp/ep in vllm-ascend, and will use the tp/ep in vllm instead. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.9.2 - vLLM main: `fe8a2c544a` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-21 09:08:04 +08:00
Mengqing Cao	574fe407eb	[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841 ) ### What this PR does / why we need it? We'll refactor `CustomOp` in vllm-ascend from this PR on. Use the function `CustomOp.register_oot` to register custom ops, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adaptation to the `CustomOp.register_oot` mechanism from vllm 0.9.2. As a further step, we can stop inheriting from `QuickGELU` and write our own `QuickGELU` entirely. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-18 23:07:14 +08:00
Shanshan Shen	f96100fad5	[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805 ) ### What this PR does / why we need it? Remove V0 related codes of test, example, platform. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:58:55 +08:00
wangxiyuan	7bdada58eb	[Misc] Remove VLLM_USE_V1 usage in code (#1764 ) We plan to remove V0 code from this version. The first step is to delete v0 usage. Related: https://github.com/vllm-project/vllm-ascend/issues/1620 - vLLM version: v0.9.2 - vLLM main: `61e20828da` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:16 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now we can also use it with pangu. Besides, we add a supported-model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Zhu Yi Lin	b308a7a258	support pangumoe w8a8c8 and docs (#1477 ) ### What this PR does / why we need it? support pangu moe w8a8c8 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. Signed-off-by: zhuyilin <809721801@qq.com>	2025-06-28 18:51:07 +08:00
Yikun Jiang	097e7149f7	[Platform] Add initial experimental support for Atlas 300I series (#1333 ) ### What this PR does / why we need it? Add initial experimental support for Ascend 310P. This patch squashes the PRs below into one to help validation: - https://github.com/vllm-project/vllm-ascend/pull/914 - https://github.com/vllm-project/vllm-ascend/pull/1318 - https://github.com/vllm-project/vllm-ascend/pull/1327 ### Does this PR introduce _any_ user-facing change? Users can run vLLM on the Atlas 300I DUO series ### How was this patch tested? CI passed with: - E2E image build for 310P - CI test on A2 with e2e test and longterm test - Unit tests are missing because they need a real 310P image; they will be added in a separate PR later. - Manual e2e test: - Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322 - Pangu MGoE 72B The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended. #### ENV information CANN, NNAL version: 8.1.RC1 > [!IMPORTANT] > PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ format and calling NNAL operators on 310P #### Code example ##### Build vllm-ascend from source code ```shell # download source code as vllm-ascend cd vllm-ascend export SOC_VERSION=Ascend310P3 pip install -v -e . cd .. ``` ##### Run offline inference ```python from vllm import LLM, SamplingParams prompts = ["水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。", "水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。"] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10) # Create an LLM. 
llm = LLM( model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096, max_num_seqs=4, dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P disable_custom_all_reduce=True, trust_remote_code=True, tensor_parallel_size=2, compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]}, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` --------- Signed-off-by: Vincent Yuan <farawayboat@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: Vincent Yuan <farawayboat@gmail.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com>	2025-06-21 09:00:16 +08:00
Shanshan Shen	2cd8ecdc4f	[Bugfix][Spec Decode] Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode (#1258 ) ### What this PR does / why we need it? Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode. Find more details at mengwei805's comment in https://github.com/vllm-project/vllm-ascend/pull/1123. ### Does this PR introduce _any_ user-facing change? The user will not be aware of `VLLM_ASCEND_ACL_OP_INIT_MODE` (`ACL_OP_INIT_MODE`). ### How was this patch tested? Test scripts: ```python from vllm import LLM, SamplingParams prompts = [ "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM( model="Qwen/Qwen2.5-1.5B-Instruct", tensor_parallel_size=1, speculative_config={ "method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4, }, ) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Results: ``` Adding requests: 100%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:00<00:00, 76.70it/s] Processed prompts: 100%\|███████████████████████████████████████████████████████████████\| 1/1 [00:00<00:00, 1.33it/s, est. speed input: 6.64 toks/s, output: 21.26 toks/s] Prompt: 'The future of AI is', Generated text: ' bright\n\n04/15/2020\n\nBy: James' ``` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-18 17:50:20 +08:00
zhuo97	f5404dc650	Fix the device error when using ray as vllm-ascend backend (#884 ) 1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES 2. Add lazy init for vllm_ascend_C Signed-off-by: zhuo97 <1103045176@qq.com>	2025-06-16 21:03:16 +08:00
wangxiyuan	69b817ed65	[CI] Add unit test framework (#1201 ) This PR adds the unit test framework to enable ut for vLLM Ascend. Unit tests run on CPU machines. Like the e2e tests, they run once the lint check has passed. For unit tests, this PR creates a new folder called `ut` under the `tests` module. The test files in `ut` should mirror the layout of the code in `vllm-ascend`. File names should start with the `test_` prefix. For example, in this PR `test_ascend_config.py` is added to test `ascend_config.py`. A new file `worker/test_worker_v1.py` is also added as a placeholder. This file should be the unit test for `vllm-ascend/worker/worker_v1.py`. Additionally, a new `fake_weight` folder is added; it contains the config.json from `facebook/opt-125m`, so that the tests do not always visit huggingface. TODO: We should add all the unit test files one by one in the future. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-16 18:32:28 +08:00
yiz-liu	6003afa6d2	[BugFix] Fix data parallel (#940 ) ### What this PR does / why we need it? With this PR, we can migrate to the native `data_parallel.py` in vllm examples and remove the version in vllm-ascend. At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable difficulties; therefore, we must employ a temporary workaround and manually specify the device. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-09 14:08:18 +08:00
zzzzwwjj	f1543d5e0d	[bugfix] fix deepseek accuracy (#1118 ) ### What this PR does / why we need it? Fix deepseek accuracy in the mix-parallel case. Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-06-07 21:11:36 +08:00
Li Wang	a2552e10e4	[Worker][V1] Support sleep mode for v1 (#1084 ) ### What this PR does / why we need it? Support sleep mode for v1 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-06 21:54:02 +08:00
Shanshan Shen	5d0e9fd19a	[Misc] Add `ACL_OP_INIT_MODE` env var and set default to `1` (#597 ) ### What this PR does / why we need it? Fix the bug in torch 2.5.1 that raises a segmentation fault when `pin_memory` is enabled while creating a tensor using `torch.tensor`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-06 20:22:51 +08:00
wangxiyuan	dab19d5dca	[BugFix] Fix ascend config check (#1092 ) Fix the ascend config check logic: 1. refactor check_ascend_config to make it clear: 1. torchair graph should not work with enforce_eager=True 2. aclgraph should not work with torchair graph 3. add refresh config for rlhf case 4. fix a typo in model runner 5. change expert_tensor_parallel_size default to 0 to keep the same as before Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 18:54:37 +08:00
wangxiyuan	e1ab6d318e	[Misc] Refactor additional_config (#1029 ) More and more config options are being added to additional_config. This PR provides a new AscendConfig that manages these options in a simpler way, making the code cleaner and more readable. This PR also adds the `additional_config` doc for users, and adds test_ascend_config.py to make sure the new AscendConfig works as expected. TODO: Add e2e test with torchair and deepseek once the CI resource is available. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-05 16:28:01 +08:00
NeverRaR	da9acfca60	feat: support data parallel for deepseek (#1012 ) ### What this PR does / why we need it? feat: support data parallel for deepseek ### Does this PR introduce _any_ user-facing change? Yes, support dp for deepseek ### How was this patch tested? ``` export VLLM_ENABLE_MC2=0 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh nohup python -m vllm.entrypoints.openai.api_server --model=/path/to/DeepSeek-R1-W8A8 \ --quantization ascend \ --served-model-name auto \ --trust-remote-code \ --distributed-executor-backend=mp \ --port 8006 \ -tp=8 \ -dp=2 \ --max-num-seqs 24 \ --max-model-len 4096 \ --max-num-batched-tokens 4096 \ --block-size 128 \ -O 0 \ --no-enable-prefix-caching \ --additional-config '{"torchair_graph_batch_sizes":[24],"expert_tensor_parallel_size":16,"ascend_scheduler_config":{},"enable_graph_mode":true}' \ --gpu-memory-utilization 0.95 &> run.log & disown ``` Signed-off-by: boying <897013703@qq.com>	2025-06-04 18:31:41 +08:00
Shanshan Shen	068c3a0167	[Bugfix] Add verification for `quant_action.choices` to avoid `TypeError` (#1046 ) ### What this PR does / why we need it? When I run vllm-ascend, I get this error msg: ```bash Traceback (most recent call last): File "/home/sss/software/miniconda3/envs/vllm-v1/bin/vllm", line 8, in <module> sys.exit(main()) File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/main.py", line 50, in main cmd.subparser_init(subparsers).set_defaults( File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/serve.py", line 101, in subparser_init serve_parser = make_arg_parser(serve_parser) File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 254, in make_arg_parser parser = AsyncEngineArgs.add_cli_args(parser) File "/home/sss/github/vllm-project/vllm/vllm/engine/arg_utils.py", line 1582, in add_cli_args current_platform.pre_register_and_update(parser) File "/home/sss/github/vllm-project/vllm-ascend/vllm_ascend/platform.py", line 80, in pre_register_and_update if ASCEND_QUATIZATION_METHOD not in quant_action.choices: TypeError: argument of type 'NoneType' is not iterable [ERROR] 2025-06-03-02:53:42 (PID:6005, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception ``` This is because the `choices` attribute in `quant_action` can be `None` and we don't check it. ```bash # quant_action _StoreAction(option_strings=['--quantization', '-q'], dest='quantization', nargs=None, const=None, default=None, type=<class 'str'>, choices=None, required=False, help='Method used to quantize the weights. If `None`, we first check the\n`quantization_config` attribute in the model config file. If that is\n`None`, we assume the model weights are not quantized and use `dtype` to\ndetermine the data type of the weights.', metavar=None) ``` Thus, I have added check for the `choices` to handle the scenario of `choices=None`. ### Does this PR introduce _any_ user-facing change? yes, vllm server with ascend quantization works now. 
### How was this patch tested? By running the `vllm serve --quantization ascend` command. Related: https://github.com/vllm-project/vllm/issues/19004 Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-03 11:44:45 +08:00
yiz-liu	5a1689fc64	[Fix] Fix update_aclgraph_sizes when running MoE models (#913 ) ### What this PR does / why we need it? Fix update_aclgraph_sizes when running MoE models. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-30 15:17:11 +08:00
Mengqing Cao	a93bed4535	[aclgraph] implement NPUPiecewiseBackend to enable aclgraph (#836 ) ### What this PR does / why we need it? 1. Implement `NPUPiecewiseBackend` to enable aclgraph 2. Enable aclgraph by default in V1, but raise an error when running deepseek and a warning when running models other than qwen ### How was this patch tested? CI passed with the new ut --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-29 11:58:26 +08:00
rjg-lyh	b434f37b46	[V1] Revert the default value of enable_chunked_prefill in additional… (#935 ) ### What this PR does / why we need it? Revert the default value of enable_chunked_prefill to 'False' in additional_scheduler_config. In engine v1, enable_chunked_prefill is forcibly set to True in VllmConfig, which causes it to be perceived as True in check_and_update_config(). As a result, when the v0 scheduler is enabled, the chunked prefill feature remains active, leading to the failure of the v0 scheduler and causing it to fall back to the native v1 scheduling logic. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-23 10:06:50 +08:00
yangpuPKU	46df67a5e9	[bugfix] Improve log level and info for custom ops build (#937 ) ### What this PR does / why we need it? Fix the bug in #703, where vllm wrongly raised the ERROR: Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'. Failures to import vllm_ascend_C are now reported in a unified format via warning("Failed to import vllm_ascend_C:%s", e). ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: yangpuPKU <604425840@qq.com>	2025-05-23 10:05:57 +08:00
rjg-lyh	b4d6672d01	[BugFix] Fix chunked prefill bugs in engine v1 (#844 ) ### What this PR does / why we need it? Fix the bugs that occur when running the deepseek model in engine v1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added/existing tests. --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-22 10:33:50 +08:00
22dimensions	00e0243561	enable online serving quantization (#877 ) For online serving, "ascend" quantization method is not a choice natively, so we need to add "ascend" quantization method to quantization methods list and the user can enable quantization using "vllm serve --quantization ascend" command. --------- Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-17 17:36:04 +08:00
yiz-liu	701b0fd95e	[Enhancement] Add padding for ACL Graph (#803 ) ### What this PR does / why we need it? Add padding for ACL Graph and refactor graph batch size adjustments to utils.py --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-12 20:26:22 +08:00
NeverRaR	efabd722eb	feat: support torchair graph mode in v1 engine (#789 ) ### What this PR does / why we need it? support torchair graph mode with v1 engine --------- Signed-off-by: boying <897013703@qq.com>	2025-05-12 19:14:07 +08:00
rjg-lyh	fa99f89e93	[Core] Support the features of prefix cache and chunked prefill in v0/v1 (#782 ) ### What this PR does / why we need it? Support the features of prefix cache and chunked prefill in v0/v1. --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-09 16:39:28 +08:00
linfeng-yuan	84e2ed898b	performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731 ) ### What this PR does / why we need it? 1. Improve inference speed and usability for deepseek models with NPU graph mode. 2. Modify some code to adapt to CANN 8.1.RC1.beta1. 3. Add a switch for NPU graph mode and its cache. ### Does this PR introduce _any_ user-facing change? This PR provides an experimental configuration to enable NPU graph mode for Deepseek models. Users can set additional_config={'enable_graph_mode': True} to try this feature. Note that this feature currently supports only the V0 engine. ### How was this patch tested? This patch was tested with the newest torch_npu 2.5.1 (https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1 toolkit&nnal&kernels (https://www.hiascend.com/developer/download/community/result?module=cann) released on 25/30 April. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-05-01 13:51:42 +08:00
wangxiyuan	95e7aa4736	[Platform] format platform to make it more clear (#610 ) Platform should only contain the functions inherited from vllm. This PR moves the unrelated functions to the right place to make platform clearer. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:03:10 +08:00
wangxiyuan	b917361ca5	[MISC] Clean up torch_npu (#688 ) torch_npu 2.5.1 supports autoload now. This patch: 1. removes useless torch_npu imports 2. replaces `torch_npu.npu` with `torch.npu`. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-29 18:03:38 +08:00
Pleaplusone	0329fad927	[Perf] Deepseekv3 performance optimization for eager mode (#598 ) ### What this PR does / why we need it? Deepseek v3 currently uses vanilla chunked prefill in the MLA part, which is computationally inefficient but necessary for chunked prefill. Since PR https://github.com/vllm-project/vllm-ascend/pull/543 brought the v0 scheduler into vllm-ascend, we can now use torch_npu._npu_flash_attention inside the mla backend for a further performance boost. Some redundant computation inside the rope is also removed. This PR should bring some performance gain for deepseek eager mode inference. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-29 17:12:03 +08:00
yiz-liu	d785e78563	[V1] Make V1 engine backward compatible (#637 ) ### What this PR does / why we need it? Enforce eager mode in the V1 engine ahead of the upcoming CANN and torch_npu releases. ### Does this PR introduce _any_ user-facing change? After this change, users will no longer need to manually set enforce_eager=True. ### How was this patch tested? Test it with regular offline inference examples. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-24 17:20:11 +08:00
Bug Hunter Yan	05bdcbeae4	support aclgraph (#426 ) ### What this PR does / why we need it? This PR lets vllm-ascend use the piecewise_graph feature provided by the v1 engine. 1. register unifiled_ascend_attention_with_output for piecewise_graph to split the graph. 2. support NPUGraph to accelerate kernel launch. ### Does this PR introduce _any_ user-facing change? npugraph is now enabled by default. Users can disable the npugraph feature by configuring enforce_eager. This places corresponding requirements on the versions of torch_npu and CANN, which need to support graph capture. ### How was this patch tested? It is turned on by default. --------- Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn> Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-04-23 20:56:24 +08:00
zzzzwwjj	5c6d05a59e	support deepseek quant & mix-parallel with graphmode (#585 ) ### What this PR does / why we need it? 1. support deepseek with w8a8 quant; 2. support deepseek with mix-parallel(multi-DP, EP+TP); 3. support deepseek with graphmode. --------- Signed-off-by: wen-jie666 <wenjie39@huawei.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wen-jie666 <wenjie39@huawei.com>	2025-04-23 16:23:25 +08:00
Pleaplusone	1a1f9a6d89	port deepseekv2 and mtp to main branch (#429 ) ### What this PR does / why we need it? This PR ports all the deepseek graph mode code and mtp code from v0.7.3 to the main branch --------- Signed-off-by: SidaoY <1024863041@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: mengwei805 <mengwei25@huawei.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: q00832892 <qiaoyang19@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: SidaoY <1024863041@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com> Co-authored-by: mengwei805 <mengwei25@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-04-19 17:38:18 +08:00
Shuqiao Li	84563fc65d	Add sleep mode feature for Ascend NPU (#513 ) ### What this PR does / why we need it? This PR adds sleep mode feature for vllm-ascend, when sleeps, we do mainly two things: - offload model weights - discard kv cache RLHF tools(such as https://github.com/volcengine/verl and https://github.com/OpenRLHF/OpenRLHF) have a strong need of sleep mode to accelerate the training process. This PR may solve #375 and #320 . ### Does this PR introduce _any_ user-facing change? No existing user interfaces changed. Users will have two new methods(`sleep()` and `wake_up()`) to use. ### How was this patch tested? This PR is tested with Qwen/Qwen2.5-0.5B-Instruct. At first, we have free NPU memory M1. After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)` executed, we have free NPU memory M2. M2 < M1. Then we call `llm.sleep(level=1)`, we have free NPU memory M3. We have M3 > M2, M3 is very close to M1. Plus, we have the same output tokens before sleep and after wake up, with the config of `SamplingParams(temperature=0, max_tokens=10)` and with the same input tokens of course. This PR is utilizing the CMake procedure of #371 , thanks a lot. Signed-off-by: Shuqiao Li <celestialli@outlook.com>	2025-04-18 13:11:39 +08:00
Pleaplusone	66a0837963	adopt rope in vllm-ascend (#530 ) ### What this PR does / why we need it? Adopt custom kernel rotary embedding in actual model inference, customized rotary_embedding will generate contiguous query and key in the cpp side to reduce the overhead of two contiguous and index_select compared with rotary_embedding in torch_npu. For now, rotary_embedding can only support the scenario of `is_neox = true`, non-neox version rope will be updated soon in the future. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-18 08:56:05 +08:00
whx	20dff4deff	[Scheduler] Add AscendScheduler. (#543 ) This PR adds AscendScheduler to vllm v1 engine. This scheduler currently supports v0-style prefill-first scheduling strategy. In the future more schedule methods will be supported by this scheduler. --------- Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-04-17 19:31:50 +08:00
paulyu12	697908f5cd	[Platform][Worker][ModelRunner] Add LoRA & Multi-LoRA support (#521 ) ### What this PR does / why we need it? According to this RFC [[RFC]: Join the MultiLora and MultiLora Dynammic Serving feature develop #396](https://github.com/vllm-project/vllm-ascend/issues/396) and this [vLLM Ascend Roadmap Q2 2025 #448](https://github.com/vllm-project/vllm-ascend/issues/448), this pull request adds the relevant code to support (1) Multi-LoRA and (2) Multi-LoRA Dynamic Serving. LoRA reference is here: [LoRA reference](https://docs.vllm.ai/en/latest/features/lora.html) ### Does this PR introduce _any_ user-facing change? The following OpenAI HTTP APIs will be supported: /v1/load_lora_adapter /v1/unload_lora_adapter ### How was this patch tested? git clone https://github.com/vllm-project/vllm.git cd vllm/examples/offline_inference/ && python3 multilora_inference.py --------- Signed-off-by: paulyu <paulyu0307@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-04-17 16:48:46 +08:00
hfadzxy	9935d45728	[CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460 ) ### What this PR does / why we need it? Add model basic accuracy test(Qwen2.5-0.5B-Instruct) Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-04-17 14:59:56 +08:00
Shanshan Shen	415ed027fa	[V1][Platform] Remove `supports_structured_output()` in platform (#531 ) ### What this PR does / why we need it? Remove `supports_structured_output()` in platform. This method is no longer needed because upstream has deleted it. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-16 09:30:33 +08:00
eeethenQ	44a8301424	[Feature] Add PD separation feature (#432 ) ### What this PR does / why we need it? Adapt the Disaggregated Prefill feature to Ascend devices ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? A usage example is provided along with the PR, in examples/offline_disaggregated_prefill_npu.py To run it, do this ``` export PROMPT_DEVICE_ID=0,1 export DECODE_DEVICE_ID=2,3 python examples/offline_disaggregated_prefill_npu.py ``` --------- Signed-off-by: ZihuiQian <qianzihui@huawei.com> Co-authored-by: ZihuiQian <qianzihui@huawei.com>	2025-04-15 15:11:35 +08:00
wangxiyuan	f6af1d2471	[MISC] fix logger (#515 ) The logger in vllm-ascend doesn't work. This PR fixes the issue. Fix: https://github.com/vllm-project/vllm-ascend/issues/431 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-15 10:18:05 +08:00
Shanshan Shen	1d88dacf9f	[V1][Platform] Add `supports_structured_output()` method to Platform (#475 ) ### What this PR does / why we need it? Add `supports_structured_output()` method to Platform, find more details at https://github.com/vllm-project/vllm/pull/16148. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-07 19:11:51 +08:00
Li Wang	3f9752f8ee	[Bugfix]Lazy import vllm config (#462 ) ### What this PR does / why we need it? Lazy import vllm config to avoid circular imports --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-03 16:03:08 +08:00
Pleaplusone	ce8259975e	[core] Support custom ascendc kernels in vllm-ascend (#233 ) This PR adds support for the custom ascendc kernel rotary_embedding in vllm-ascend; the related CMakeLists and setuptools changes are also added in this PR. Related: https://github.com/vllm-project/vllm-ascend/issues/156 --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-03 14:52:34 +08:00
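
Note on commit 068c3a0167 (#1046): the `TypeError` in that traceback comes from testing membership against `quant_action.choices` while it is `None`. The guard the commit describes can be sketched with plain `argparse` as below. This is only an illustration under stated assumptions, not the actual code in `vllm_ascend/platform.py`: the helper name `add_ascend_choice`, the constant spelling `ASCEND_QUANTIZATION_METHOD`, and the sample choices `awq`/`gptq` are hypothetical, and the lookup relies on argparse's private `_actions` list.

```python
import argparse

# Hypothetical constant for this sketch; the traceback in the commit spells it differently.
ASCEND_QUANTIZATION_METHOD = "ascend"


def add_ascend_choice(parser: argparse.ArgumentParser) -> None:
    """Append "ascend" to the --quantization choices, tolerating choices=None."""
    # argparse keeps every registered action in the private `_actions` list.
    quant_action = next(
        (a for a in parser._actions if a.dest == "quantization"), None
    )
    if quant_action is None:
        return  # the option is not registered at all
    if quant_action.choices is None:
        return  # unrestricted option: `x not in None` would raise TypeError
    if ASCEND_QUANTIZATION_METHOD not in quant_action.choices:
        quant_action.choices = [*quant_action.choices, ASCEND_QUANTIZATION_METHOD]


# Case 1: choices=None, the situation shown in the traceback.
open_parser = argparse.ArgumentParser()
open_parser.add_argument("--quantization", "-q", type=str, default=None)
add_ascend_choice(open_parser)  # returns quietly instead of raising

# Case 2: a restricted list of methods gets "ascend" appended.
strict_parser = argparse.ArgumentParser()
strict_parser.add_argument("--quantization", "-q", type=str, choices=["awq", "gptq"])
add_ascend_choice(strict_parser)
args = strict_parser.parse_args(["--quantization", "ascend"])
```

Returning early when `choices` is `None` leaves an unrestricted option unrestricted, so any method name, including `ascend`, is still accepted by the parser.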


64 Commits