adopt rope in vllm-ascend (#530)

### What this PR does / why we need it?
Adopt the custom rotary embedding kernel in actual model inference. The custom rotary_embedding produces contiguous query and key tensors on the C++ side, which avoids the overhead of the two `contiguous` calls and the `index_select` that the rotary_embedding path in torch_npu requires. For now, rotary_embedding supports only the `is_neox = true` case; the non-neox version of rope will be added in a future update.
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
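For readers unfamiliar with the `is_neox` distinction mentioned in the description, the following is a minimal host-side sketch of the two pairing layouts. It is not the Ascend kernel from this PR: `applyRotary` and its parameters are hypothetical names chosen for illustration, and the code assumes a single head vector whose whole length is rotated, with precomputed cos/sin values for one token position.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference helper, NOT part of vllm-ascend.
// Rotates one head vector `x` in place. `cosv` and `sinv` each hold
// x.size() / 2 values for the current token position (assumed precomputed).
template <bool isNeox>
void applyRotary(std::vector<float>& x,
                 const std::vector<float>& cosv,
                 const std::vector<float>& sinv) {
    const std::size_t half = cosv.size();
    for (std::size_t i = 0; i < half; ++i) {
        // Neox style pairs element i with element i + half.
        // Non-neox style pairs adjacent elements 2i and 2i + 1.
        const std::size_t a = isNeox ? i : 2 * i;
        const std::size_t b = isNeox ? i + half : 2 * i + 1;
        const float xa = x[a];
        const float xb = x[b];
        // Standard 2D rotation of the pair (xa, xb) by the angle for index i.
        x[a] = xa * cosv[i] - xb * sinv[i];
        x[b] = xb * cosv[i] + xa * sinv[i];
    }
}
```

Under these assumptions, the only difference between the two modes is which elements are paired, which is why a kernel can support one layout before the other.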
2025-04-18 08:56:05 +08:00
parent 23f85e3f74
commit 66a0837963
5 changed files with 37 additions and 49 deletions
--- a/csrc/kernels/pos_encoding_kernels.cpp
+++ b/csrc/kernels/pos_encoding_kernels.cpp
@@ -28,9 +28,9 @@
 using vllm_ascend::AccType;
 using vllm_ascend::local_mem_copy;
 template <typename scalar_t, bool isNeox> class RotaryEmbedding {
-    // NOTE(ganyi): we use 32K as load stride for pipe, need to find another way to
+    // NOTE(ganyi): we use 512B as load stride for pipe, need to find another way to
    // retrive this size from runtime for more Soc support
-    static int constexpr loadSize = 1024 * 4;
+    static int constexpr loadSize = 512;
    using dst_t = scalar_t;
    using acc_t = typename AccType<scalar_t>::type;
    // only half tensor have cast instruct to int8, hardcode acc_dst_t as half