[Perf] move quant before allgather in Allgather EP (#3420)

### What this PR does / why we need it?

Move quantization before the all-gather in Allgather EP. This depends on https://github.com/vllm-project/vllm-ascend/pull/3334.

DeepSeek R1 W8A8 performance on A2 with `HCCL_ALGO="level0:NA;level1:pipeline"`:

| Seq length | Mean TTFT (ms), main | Mean TTFT (ms), this PR |
|----------|----------|----------|
| 4k | 375.21 | 364.99 |
| 16k | 1465.23 | 1421.75 |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: 83f478bb19

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
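
For intuition about why quantizing before the all-gather might reduce TTFT, the sketch below estimates the per-rank payload that has to be gathered. It is not taken from this PR: it assumes bf16 activations before quantization and dynamic per-token int8 quantization with one float32 scale per token. The helper name `allgather_payload_bytes`, the hidden size, and the token count are illustrative assumptions only.

```python
def allgather_payload_bytes(num_tokens, hidden_size, elem_bytes,
                            scale_bytes_per_token=0):
    """Bytes one rank contributes to an all-gather of its activations.

    Hypothetical helper for back-of-the-envelope estimates only.
    """
    return num_tokens * (hidden_size * elem_bytes + scale_bytes_per_token)


# Illustrative values (assumptions, not measurements from this PR).
hidden_size = 7168
num_tokens = 4096

# Gather first, quantize afterwards: bf16 activations cross the wire.
bf16_bytes = allgather_payload_bytes(num_tokens, hidden_size, elem_bytes=2)

# Quantize first, gather afterwards: int8 activations plus one fp32
# scale per token cross the wire (assumed quantization scheme).
int8_bytes = allgather_payload_bytes(num_tokens, hidden_size, elem_bytes=1,
                                     scale_bytes_per_token=4)

ratio = int8_bytes / bf16_bytes
print(f"bf16: {bf16_bytes} B, int8+scale: {int8_bytes} B, ratio: {ratio:.3f}")
```

Under these assumptions the gathered payload is roughly halved. The measured TTFT gain in the table above is smaller, which is expected if communication is only part of the end-to-end prefill time.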
2025-11-04 16:49:58 +08:00
parent 44b58b8665
commit bedf223771
10 changed files with 160 additions and 66 deletions
--- a/vllm_ascend/ops/register_custom_ops.py
+++ b/vllm_ascend/ops/register_custom_ops.py
@@ -36,7 +36,7 @@ def _maybe_all_gather_and_maybe_unpad_impl(
            x = tensor_model_parallel_all_gather(x, 0)
            pad_size = forward_context.pad_size
            if pad_size > 0:
-                x = x[:-pad_size, :]
+                x = x[:-pad_size]
        else:
            x = get_ep_group().all_gather(x, 0)
            # unpad
@@ -50,8 +50,7 @@ def _maybe_all_gather_and_maybe_unpad_impl(
            offset = 0
            for idx in range(dp_size):
                num_tokens_dp = num_tokens_across_dp_cpu[idx]
-                result[offset:offset +
-                       num_tokens_dp, :] = x[idx, :num_tokens_dp, :]
+                result[offset:offset + num_tokens_dp] = x[idx, :num_tokens_dp]
                offset += num_tokens_dp
            x = result