[Perf] move quant before allgather in Allgather EP (#3420)
### What this PR does / why we need it?
move quant before allgather in Allgather EP, rely on
https://github.com/vllm-project/vllm-ascend/pull/3334
Deepseek R1 W8A8 performance on A2 with
`HCCL_ALGO="level0:NA;level1:pipeline"`:
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|----------|----------|----------|
| 4k | 375.21 | 364.99 |
| 16k | 1465.23 | 1421.75 |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
This commit is contained in:
@@ -36,7 +36,7 @@ def _maybe_all_gather_and_maybe_unpad_impl(
|
||||
x = tensor_model_parallel_all_gather(x, 0)
|
||||
pad_size = forward_context.pad_size
|
||||
if pad_size > 0:
|
||||
x = x[:-pad_size, :]
|
||||
x = x[:-pad_size]
|
||||
else:
|
||||
x = get_ep_group().all_gather(x, 0)
|
||||
# unpad
|
||||
@@ -50,8 +50,7 @@ def _maybe_all_gather_and_maybe_unpad_impl(
|
||||
offset = 0
|
||||
for idx in range(dp_size):
|
||||
num_tokens_dp = num_tokens_across_dp_cpu[idx]
|
||||
result[offset:offset +
|
||||
num_tokens_dp, :] = x[idx, :num_tokens_dp, :]
|
||||
result[offset:offset + num_tokens_dp] = x[idx, :num_tokens_dp]
|
||||
offset += num_tokens_dp
|
||||
x = result
|
||||
|
||||
|
||||
Reference in New Issue
Block a user