xc-llm-ascend/vllm_ascend/worker/v2/utils.py

from contextlib import contextmanager

import torch
import vllm
from vllm.logger import logger

from vllm_ascend.worker.v2.block_table import AscendBlockTables
from vllm_ascend.worker.v2.model_states import init_asecnd_model_state


@contextmanager
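def torch_cuda_wrapper():
    """Alias selected torch.cuda APIs to their torch.npu counterparts.

    The aliases are not restored on exit: the ``finally`` block is a no-op,
    so the patch stays in effect after the ``with`` block ends.
    """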
    try:
        torch.cuda.Event = torch.npu.Event
        torch.cuda.Stream = torch.npu.Stream
        torch.cuda.stream = torch.npu.stream
        torch.cuda.default_stream = torch.npu.default_stream
        torch.cuda.current_stream = torch.npu.current_stream
        torch.cuda.graph_pool_handle = torch.npu.graph_pool_handle
        torch.cuda.CUDAGraph = torch.npu.NPUGraph
        torch.cuda.graph = torch.npu.graph
        torch.cuda.synchronize = torch.npu.synchronize
        torch.cuda.set_stream = torch.npu.set_stream
        torch.cuda.current_device = torch.npu.current_device
        torch.cuda.mem_get_info = torch.npu.mem_get_info
        logger.info_once("Wrapping torch.cuda with torch.npu.")
        yield
    finally:
        pass


@contextmanager
def block_table_wrapper():
    try:
        # vllm-ascend needs the slot mapping initialized as torch.int32,
        # whereas vLLM defaults to torch.int64.
        vllm.v1.worker.gpu.model_runner.BlockTables = AscendBlockTables
        logger.info_once("Wrapping BlockTables with AscendBlockTables.")
        yield
    finally:
        pass


@contextmanager
def model_states_wrapper():
    try:
        # prepare_attn in AscendModelState differs from vLLM's version,
        # so init_model_state has to be overridden.
        vllm.v1.worker.gpu.model_runner.init_model_state = init_asecnd_model_state
        logger.info_once("Wrapping init_model_state with init_asecnd_model_state.")
        yield
    finally:
        pass
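
The three wrappers above assign the replacement attributes and leave them in place, since each `finally` block only runs `pass`. For comparison, the following is a minimal sketch of a restoring variant of the same pattern. It is not part of vllm-ascend: `patch_attrs`, `fake_cuda`, and `fake_npu` are hypothetical names, and the two namespaces are stand-ins for `torch.cuda` and `torch.npu` so that the sketch runs without an NPU.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-ins for torch.cuda / torch.npu (assumption, not real modules).
fake_cuda = SimpleNamespace(synchronize=lambda: "cuda")
fake_npu = SimpleNamespace(synchronize=lambda: "npu")

_MISSING = object()  # sentinel for attributes that did not exist before patching


@contextmanager
def patch_attrs(target, source, names):
    """Temporarily set target.<name> = source.<name>, restoring on exit."""
    # Remember the previous values so they can be put back afterwards.
    saved = {name: getattr(target, name, _MISSING) for name in names}
    try:
        for name in names:
            setattr(target, name, getattr(source, name))
        yield
    finally:
        for name, old in saved.items():
            if old is _MISSING:
                # The attribute was absent before; remove it if it was added.
                if hasattr(target, name):
                    delattr(target, name)
            else:
                setattr(target, name, old)


with patch_attrs(fake_cuda, fake_npu, ["synchronize"]):
    inside = fake_cuda.synchronize()   # resolved through the patched attribute
after = fake_cuda.synchronize()        # original attribute is back in place
```

Whether restoring is desirable depends on the caller: code that captured the patched attribute inside the `with` block keeps its reference either way, while later lookups on the target see the original again.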

Commit history for this file (from the blame view):

- implement model runner v2 basic framework (#5051), 2025-12-18 15:51:54 +08:00
  What this PR does / why we need it: implements the basic framework of model runner v2 in vllm-ascend; end-to-end functionality is not guaranteed by this PR.
  User-facing change: envs.VLLM_USE_V2_MODEL_RUNNER decides whether model runner v2 is used.
  Tested with: vLLM version v0.12.0; vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9
  Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

- [Feature] support eager mode in model runner v2 (#5210), 2025-12-29 15:28:34 +08:00
  What this PR does / why we need it: #5051 only implemented a basic framework for model runner v2 and still had bugs in end-to-end functionality; this PR aims to enable basic functionality. Model runner v2 plans: https://github.com/vllm-project/vllm-ascend/issues/5208
  Tested with: vLLM version release/v0.13.0; vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9
  Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

- [Feature] support aclgraph for model runner v2 (#7110), 2026-03-13 09:11:46 +08:00
  What this PR does / why we need it: supports aclgraph for model runner v2 (see RFC #5208). It adapts to the newest commit of the vLLM main branch, supplies a unified interface for the extra forward context shared by model runner v1 and model runner v2, and implements graph mode for the main model.
  User-facing change: none.
  Tested with: vLLM version v0.16.0; vLLM main: https://github.com/vllm-project/vllm/commit/4034c3d32e30d01639459edd3ab486f56993876d
  Signed-off-by: Ronald1995 <ronaldautomobile@163.com>