xc-llm-ascend/vllm_ascend/ops/activation.py

#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#

import torch
from vllm.model_executor.layers.activation import QuickGELU, SiluAndMul


class AscendQuickGELU(QuickGELU):

    def forward_oot(self, x: torch.tensor) -> torch.Tensor:
        import torch_npu

        out = torch_npu.npu_fast_gelu(x)
        return out


class AscendSiluAndMul(SiluAndMul):

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        import torch_npu

        from vllm_ascend.utils import AscendDeviceType, get_ascend_device_type

        torch.ops.vllm.maybe_prefetch_mlp_down_proj(x)
        if get_ascend_device_type() == AscendDeviceType._310P:
            out = torch_npu.npu_swiglu(x.to(torch.float32)).to(torch.float16)
        else:
            out = torch_npu.npu_swiglu(x)
        torch.ops.vllm.maybe_wait_prefetch_done(out)
        return out
[Core] Cherry pick from 0.7.1 to keep the main code newest (#127) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-02-21 17:07:37 +08:00			`#`
			`# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.`
			`#`
			`# Licensed under the Apache License, Version 2.0 (the "License");`
			`# you may not use this file except in compliance with the License.`
			`# You may obtain a copy of the License at`
			`#`
			`# http://www.apache.org/licenses/LICENSE-2.0`
			`#`
			`# Unless required by applicable law or agreed to in writing, software`
			`# distributed under the License is distributed on an "AS IS" BASIS,`
			`# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.`
			`# See the License for the specific language governing permissions and`
			`# limitations under the License.`
[CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460) ### What this PR does / why we need it? Add model basic accuracy test(Qwen2.5-0.5B-Instruct) Signed-off-by: hfadzxy <starmoon_zhang@163.com> 2025-04-17 14:59:56 +08:00			`# This file is a part of the vllm-ascend project.`
[Core] Cherry pick from 0.7.1 to keep the main code newest (#127) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-02-21 17:07:37 +08:00			`#`

			`import torch`
Optimize qwen2_vl and qwen2_5_vl (#701) ### What this PR does / why we need it? Optimize qwen2_vl and qwen2_5_vl. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Testing this PR on 1080p picture with tp=1, bs=1 on Qwen2-VL and Qwen2.5-VL, every fa op's during time lasting from 11ms to 9ms, got roughly 22% perf boost. --------- Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> 2025-04-30 14:22:38 +08:00			`from vllm.model_executor.layers.activation import QuickGELU, SiluAndMul`
[Core] Cherry pick from 0.7.1 to keep the main code newest (#127) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-02-21 17:07:37 +08:00
[Platform] Add initial experimental support for Altlas 300I series (#1333) ### What this PR does / why we need it? Add initial experimental support for Ascend 310P, this patch squash below PR into one to help validation: - https://github.com/vllm-project/vllm-ascend/pull/914 - https://github.com/vllm-project/vllm-ascend/pull/1318 - https://github.com/vllm-project/vllm-ascend/pull/1327 ### Does this PR introduce _any_ user-facing change? User can run vLLM on Altlas 300I DUO series ### How was this patch tested? CI passed with: - E2E image build for 310P - CI test on A2 with e2e test and longterm test - Unit test missing because need a real 310P image to have the test, will add in a separate PR later. - Manually e2e test: - Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322 - Pangu MGoE 72B The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended. #### ENV information CANN, NNAL version: 8.1.RC1 > [!IMPORTANT] > PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ format and calling NNAL operators on 310P #### Code example ##### Build vllm-ascend from source code ```shell # download source code as vllm-ascend cd vllm-ascend export SOC_VERSION=Ascend310P3 pip install -v -e . cd .. ``` ##### Run offline inference ```python from vllm import LLM, SamplingParams prompts = ["水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。", "水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。"] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10) # Create an LLM. llm = LLM( model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096, max_num_seqs=4, dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P disable_custom_all_reduce=True, trust_remote_code=True, tensor_parallel_size=2, compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]}, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` --------- Signed-off-by: Vincent Yuan <farawayboat@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: Vincent Yuan <farawayboat@gmail.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com> 2025-06-21 09:00:16 +08:00
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`class AscendQuickGELU(QuickGELU):`
[Core] Cherry pick from 0.7.1 to keep the main code newest (#127) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-02-21 17:07:37 +08:00
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`def forward_oot(self, x: torch.tensor) -> torch.Tensor:`
			`import torch_npu`
[Core] Cherry pick from 0.7.1 to keep the main code newest (#127) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-02-21 17:07:37 +08:00
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`out = torch_npu.npu_fast_gelu(x)`
			`return out`
[Core] Cherry pick from 0.7.1 to keep the main code newest (#127) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-02-21 17:07:37 +08:00

[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`class AscendSiluAndMul(SiluAndMul):`
Optimize qwen2_vl and qwen2_5_vl (#701) ### What this PR does / why we need it? Optimize qwen2_vl and qwen2_5_vl. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Testing this PR on 1080p picture with tp=1, bs=1 on Qwen2-VL and Qwen2.5-VL, every fa op's during time lasting from 11ms to 9ms, got roughly 22% perf boost. --------- Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> 2025-04-30 14:22:38 +08:00
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`def forward_oot(self, x: torch.Tensor) -> torch.Tensor:`
			`import torch_npu`
Optimize qwen2_vl and qwen2_5_vl (#701) ### What this PR does / why we need it? Optimize qwen2_vl and qwen2_5_vl. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Testing this PR on 1080p picture with tp=1, bs=1 on Qwen2-VL and Qwen2.5-VL, every fa op's during time lasting from 11ms to 9ms, got roughly 22% perf boost. --------- Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> 2025-04-30 14:22:38 +08:00
[refact] unified soc_version code (#4359) ### What this PR does / why we need it? Currently, there are two paths to judge the chip type in code, `get_ascend_soc_version` use `get_soc_version` api in torch_npu, and `is_310p` `use _build_info.__soc_version__`, which generate when install. We need to unify the two paths. We need to unify these codes based on the following points: 1. We need to ensure consistency in chip type judgment between compiling and running states; 2. In compiling state, we need chip type to complete op's compilation, but in running state, we only need device type(910B/910_93/310P/910_95/etc) to make code branch judgement; 3. In compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's api. Based on the above points, we have made the following changes: 1. When user set env `SOC_VERSION`, use it; when not set, query soc_version by `npu-smi`; 2. generate device_type based on soc_version when compiling, and write `__device_type__` instead of `__soc_version__` in `_build_info.py`; 3. In running state, use `__device_type__` to judge code branch. ### Does this PR introduce _any_ user-facing change? When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default, we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in the list `soc_to_device` in `setup.py`. - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2918c1b49c88c29783c86f78d2c4221cb9622379 Signed-off-by: zzzzwwjj <1183291235@qq.com> 2025-11-26 14:28:55 +08:00			`from vllm_ascend.utils import AscendDeviceType, get_ascend_device_type`
Optimize qwen2_vl and qwen2_5_vl (#701) ### What this PR does / why we need it? Optimize qwen2_vl and qwen2_5_vl. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Testing this PR on 1080p picture with tp=1, bs=1 on Qwen2-VL and Qwen2.5-VL, every fa op's during time lasting from 11ms to 9ms, got roughly 22% perf boost. --------- Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> 2025-04-30 14:22:38 +08:00
[main] mlp weight prefetch in Qwen Dense Models (#2816) ### What this PR does / why we need it? This PR prefetchs the weight of mlp layers in Qwen Dense Models to optimize the performance in Decode phase mainly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: https://github.com/vllm-project/vllm/commit/a1213fae5f8fe624c4b954768d23a86e384abdb1 Signed-off-by: rjg-lyh <1318825571@qq.com> Co-authored-by: Shuming19 <313093131@qq.com> 2025-09-11 21:20:09 +08:00			`torch.ops.vllm.maybe_prefetch_mlp_down_proj(x)`
[refact] unified soc_version code (#4359) ### What this PR does / why we need it? Currently, there are two paths to judge the chip type in code, `get_ascend_soc_version` use `get_soc_version` api in torch_npu, and `is_310p` `use _build_info.__soc_version__`, which generate when install. We need to unify the two paths. We need to unify these codes based on the following points: 1. We need to ensure consistency in chip type judgment between compiling and running states; 2. In compiling state, we need chip type to complete op's compilation, but in running state, we only need device type(910B/910_93/310P/910_95/etc) to make code branch judgement; 3. In compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's api. Based on the above points, we have made the following changes: 1. When user set env `SOC_VERSION`, use it; when not set, query soc_version by `npu-smi`; 2. generate device_type based on soc_version when compiling, and write `__device_type__` instead of `__soc_version__` in `_build_info.py`; 3. In running state, use `__device_type__` to judge code branch. ### Does this PR introduce _any_ user-facing change? When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default, we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in the list `soc_to_device` in `setup.py`. - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2918c1b49c88c29783c86f78d2c4221cb9622379 Signed-off-by: zzzzwwjj <1183291235@qq.com> 2025-11-26 14:28:55 +08:00			`if get_ascend_device_type() == AscendDeviceType._310P:`
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`out = torch_npu.npu_swiglu(x.to(torch.float32)).to(torch.float16)`
			`else:`
			`out = torch_npu.npu_swiglu(x)`
[main] mlp weight prefetch in Qwen Dense Models (#2816) ### What this PR does / why we need it? This PR prefetchs the weight of mlp layers in Qwen Dense Models to optimize the performance in Decode phase mainly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: main - vLLM main: https://github.com/vllm-project/vllm/commit/a1213fae5f8fe624c4b954768d23a86e384abdb1 Signed-off-by: rjg-lyh <1318825571@qq.com> Co-authored-by: Shuming19 <313093131@qq.com> 2025-09-11 21:20:09 +08:00			`torch.ops.vllm.maybe_wait_prefetch_done(out)`
[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841) ### What this PR does / why we need it? We'll refator `CustomOp` in vllm-ascend from this pr on. Use function `CustomOp.register_oot` to achieve the customop registery, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adapt for `CustomOp.register_oot` mechanism from vllm 0.9.2. For further step, we can remove inherit from `QuickGELU` can write our own `QuickGELU` at all. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: https://github.com/vllm-project/vllm/commit/8dfb45ca3379b3a789ec529af4bf725daa07f10d --------- Signed-off-by: MengqingCao <cmq0113@163.com> 2025-07-18 23:07:14 +08:00			`return out`