adopt rope in vllm-ascend (#530)

### What this PR does / why we need it?
Adopt the custom rotary embedding kernel in actual model inference. The
customized rotary_embedding generates contiguous query and key on the
C++ side, avoiding the overhead of two `contiguous` calls and an
`index_select` incurred by the rotary_embedding in torch_npu. For now,
the custom rotary_embedding only supports the `is_neox = true` case; the
non-neox variant of rope will follow in a future update.
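For reference, in the neox layout the kernel currently supports, each head dimension in the first half is paired and rotated with the corresponding dimension in the second half. A minimal pure-Python sketch of that rotation (illustrative only — not the kernel's actual implementation, and the function name `neox_rope` is made up for this example):

```python
import math

def neox_rope(x, pos, base=10000.0):
    # Neox-style rotary embedding on a single head vector:
    # dim i in the first half pairs with dim i + d/2 in the second half,
    # and each pair is rotated by pos * base^(-2i/d).
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * (base ** (-2.0 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```

Since each pair is a plane rotation, the transform preserves the vector's norm; position 0 is the identity.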
---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Pleaplusone
2025-04-18 08:56:05 +08:00
committed by GitHub
parent 23f85e3f74
commit 66a0837963
5 changed files with 37 additions and 49 deletions

@@ -23,7 +23,9 @@ import torch
import torch_npu # noqa: F401
import vllm.envs as envs
from vllm.logger import logger
from vllm.platforms import Platform, PlatformEnum
CUSTOM_OP_ENABLED = False
try:
# register custom ops into torch_library here
import vllm_ascend.vllm_ascend_C # type: ignore # noqa: F401
@@ -35,8 +37,8 @@ except ImportError as e:
logging.warning(
"Warning: Failed to register custom ops, all custom ops will be disabled"
)
from vllm.platforms import Platform, PlatformEnum
else:
CUSTOM_OP_ENABLED = True
if TYPE_CHECKING:
from vllm.config import ModelConfig, VllmConfig
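The diff above follows a guarded-import pattern: try to load the compiled extension, and fall back to disabling all custom ops if it is missing. A self-contained sketch of that pattern (the module name `vllm_ascend.vllm_ascend_C` is taken from the diff; on a machine without the compiled extension the fallback branch runs):

```python
import logging

# Assume custom ops are unavailable until the compiled extension loads.
CUSTOM_OP_ENABLED = False
try:
    # Importing the extension registers the custom ops into torch's library.
    import vllm_ascend.vllm_ascend_C  # type: ignore # noqa: F401
except ImportError:
    logging.warning(
        "Warning: Failed to register custom ops, "
        "all custom ops will be disabled")
else:
    # Only flip the flag when the import (and thus registration) succeeded.
    CUSTOM_OP_ENABLED = True
```

Using try/except/else keeps the flag assignment out of the try body, so an unrelated ImportError raised deeper inside the extension cannot leave the flag incorrectly set.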