# xc-llm-ascend/vllm_ascend/flash_common3_context.py

from dataclasses import dataclass

import torch
from vllm.model_executor.layers.linear import LinearBase


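@dataclass
class FlashCommon3Context:
    """State shared across the MoE forward pass for FlashCommon3 (multistream overlap).

    Every field defaults to None and is filled in once it becomes available.
    """

    gate: LinearBase | None = None  # linear layer used as the MoE gate
    topk_weights: torch.Tensor | None = None  # routing weights of the selected experts
    topk_ids: torch.Tensor | None = None  # indices of the selected experts
    row_idx: torch.Tensor | None = None
    shared_experts: torch.nn.Module | None = None  # shared-experts module
    shared_out: torch.Tensor | None = None  # output tensor of the shared experts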


# Module-level singleton; created lazily by set_flash_common3_context().
_flash_common3_context: FlashCommon3Context | None = None


def get_flash_common3_context() -> FlashCommon3Context | None:
    """Return the shared context, or None if it has not been created yet."""
    return _flash_common3_context


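def set_flash_common3_context(
    topk_weights: torch.Tensor | None = None,
    topk_ids: torch.Tensor | None = None,
    shared_experts: torch.nn.Module | None = None,
    shared_out: torch.Tensor | None = None,
) -> None:
    """Create the shared context on first use and update the given fields.

    Arguments left as None do not overwrite existing values, so a field
    cannot be reset to None through this function. `gate` and `row_idx`
    are not parameters here; set them on the object returned by
    get_flash_common3_context().
    """
    global _flash_common3_context
    if _flash_common3_context is None:
        _flash_common3_context = FlashCommon3Context()

    # Partial update: only the fields that were explicitly passed are changed.
    if topk_weights is not None:
        _flash_common3_context.topk_weights = topk_weights
    if topk_ids is not None:
        _flash_common3_context.topk_ids = topk_ids
    if shared_experts is not None:
        _flash_common3_context.shared_experts = shared_experts
    if shared_out is not None:
        _flash_common3_context.shared_out = shared_out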


# Change history (from the blame view of this file):
# - 2025-12-14: [Perf] enable prefill flashcommon3 (#4065), signed off by
#   AlvisGong and chenxiao. MoE multistream overlap to improve performance;
#   tested with --additional-config '{"multistream_overlap_gate": true}'
#   on vLLM v0.12.0.
# - 2026-01-16: [Lint] Style: Convert `vllm-ascend/compilation` to ruff format
#   (#5912), signed off by MrZ20. The blame attributes the `| None` annotation
#   lines to this commit; tested on vLLM v0.13.0.