xc-llm-ascend/vllm_ascend/_310p/ops/layernorm.py

import torch
import torch_npu
from vllm.model_executor.layers.layernorm import RMSNormGated

from vllm_ascend.ops.layernorm import AscendGemmaRMSNorm, AscendRMSNorm


class AscendRMSNorm310(AscendRMSNorm):
    """RMSNorm for 310P devices, built on the torch_npu RMSNorm operators."""

    def forward_oot(
        self,
        x: torch.Tensor,
        residual: torch.Tensor | None = None,
    ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
        if residual is not None:
            # Fused residual add + RMSNorm; the middle return value is not needed here.
            x, _, residual = torch_npu.npu_add_rms_norm(x, residual, self.weight, self.variance_epsilon)
            if self.bias is not None:
                x.add_(self.bias)
            return x, residual

        # No residual: plain RMSNorm; the second return value is not needed here.
        x, _ = torch_npu.npu_rms_norm(x, self.weight, self.variance_epsilon)
        if self.bias is not None:
            x.add_(self.bias)
        return x


class AscendGemmaRMSNorm310(AscendGemmaRMSNorm):
    """Gemma-style RMSNorm for 310P devices; the applied scale is ``1.0 + self.weight``."""

    def forward_oot(
        self,
        x: torch.Tensor,
        residual: torch.Tensor | None = None,
    ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
        if residual is not None:
            # Add in x's dtype, then hand the sum back as the new residual
            # in the residual's original dtype.
            orig_dtype = residual.dtype
            x = x + residual.to(x.dtype)
            residual = x.to(orig_dtype)
            x, _ = torch_npu.npu_rms_norm(x, 1.0 + self.weight, self.variance_epsilon)
            return x, residual

        x, _ = torch_npu.npu_rms_norm(x, 1.0 + self.weight, self.variance_epsilon)
        return x


class AscendRMSNormGated310(RMSNormGated):
    def forward_oot(
        self,
        x: torch.Tensor,
        z: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # 310P should not depend on the Triton-gated layernorm path.
        # Reuse the upstream native implementation directly.
        return super().forward_native(x, z)
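
The arithmetic behind `AscendRMSNorm310` and `AscendGemmaRMSNorm310` above can be sketched in plain Python for readers without NPU hardware. This is a minimal reference, assuming the usual RMSNorm definition and that the fused residual path is equivalent to adding first and then normalizing the sum. The helper names `rms_norm`, `add_rms_norm`, and `gemma_rms_norm` are illustrative only and are not part of `vllm_ascend` or `torch_npu`; the sketch works on one row of floats and leaves out bias and dtype casting.

```python
import math


def rms_norm(x, weight, eps):
    """Reference RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def add_rms_norm(x, residual, weight, eps):
    """Residual path (assumed semantics): add first, then normalize the sum.

    Returns (normalized, new_residual), where new_residual is the
    un-normalized sum that is passed on as the next residual.
    """
    summed = [a + b for a, b in zip(x, residual)]
    return rms_norm(summed, weight, eps), summed


def gemma_rms_norm(x, weight, eps):
    """Gemma-style variant: the stored weight is an offset from 1."""
    return rms_norm(x, [1.0 + w for w in weight], eps)


if __name__ == "__main__":
    normed, new_residual = add_rms_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0], eps=1e-6)
    print(normed, new_residual)
```

With a weight of all ones, the normalized row has a mean square close to 1, which is a quick sanity check when comparing a device kernel against a reference like this.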

Blame history (each commit message is shown once; the code it covers appears in the listing above and is not repeated):

[310P]: refactoring for 310p kvcache and some ops class (#6117)

### What this PR does / why we need it?
* Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch.
* Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator.
* Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment.
* Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
use the e2e tests.

- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-01-24 20:34:29 +08:00
Blamed lines: the `torch`, `torch_npu`, and `vllm_ascend.ops.layernorm` imports; the `AscendRMSNorm310.forward_oot` signature, its `return` statements, and the bias handling in its no-residual path; all of `AscendGemmaRMSNorm310`.

[310p]: add rmsnorm gated fallback and unit test (#7424)

### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to the native implementation.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
ut

- vLLM version: v0.17.0
- vLLM main: https://github.com/vllm-project/vllm/commit/4497431df654e46fb1fb5e64bf8611e762ae5d87

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-24 09:00:11 +08:00
Blamed lines: the `RMSNormGated` import and all of `AscendRMSNormGated310`.

[Feat.][310P] addrmsnorm for 300I DUO (#6704)

### What this PR does / why we need it?
This PR integrates the `npu_add_rms_norm` fused kernel for RMSNorm operations with residual connections on 310P devices. This change optimizes the computation by replacing a two-step process (manual residual addition followed by RMSNorm) with a single, more efficient fused operation. This is needed to improve the performance of models utilizing RMSNorm with residual connections on the 310P architecture.

Fixes #

### Does this PR introduce _any_ user-facing change?
No, this PR introduces an internal optimization and does not change any user-facing APIs or behaviors.

### How was this patch tested?
This patch was tested with updated unit tests (`test_RMSNorm_forward_310p`) that mock the `npu_add_rms_norm` operation to verify the correctness of the fused kernel integration.

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-13 15:40:49 +08:00
Blamed lines: the `torch_npu.npu_add_rms_norm` call and the bias handling that follows it in the residual path of `AscendRMSNorm310`.

[Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425)

### What this PR does / why we need it?
- Replace the RoPE operator implementation.
- Refactor some leftover implementations of 300I DUO in the main branch.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main: https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-02 16:12:04 +08:00
Blamed lines: the `torch_npu.npu_rms_norm` call in the no-residual path of `AscendRMSNorm310`.
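
The gated fallback in `AscendRMSNormGated310` relies on a simple pattern: the device-specific subclass overrides the out-of-tree entry point and routes it to the portable one. The toy sketch below illustrates only that pattern. The class names `NormOpSketch` and `NormOpSketch310` are hypothetical, and the arithmetic is a placeholder, not the real gated RMSNorm.

```python
class NormOpSketch:
    """Hypothetical stand-in for an op with several backend entry points."""

    def forward_native(self, x, z=None):
        # Portable path. Placeholder math: multiply by the gate if one is given.
        gate = 1.0 if z is None else z
        return x * gate

    def forward_oot(self, x, z=None):
        # Default out-of-tree path: stands in for a fused kernel that the
        # target device cannot run.
        raise NotImplementedError("fused kernel unavailable in this sketch")


class NormOpSketch310(NormOpSketch):
    """Device-specific subclass that never touches the fused path."""

    def forward_oot(self, x, z=None):
        # Skip the fused kernel and reuse the portable implementation.
        return self.forward_native(x, z)


if __name__ == "__main__":
    op = NormOpSketch310()
    print(op.forward_oot(2.0, 3.0))
```

Keeping the override this thin means the subclass inherits any later fix to the native path without further changes.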