update to vllm 12-19 (#5223)
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
Fix vllm break:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4%
TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558)
Fix Solution: Add the now-necessary `all2all_backend` parameter. The
impact of this parameter on the original `set_splitting_ops_for_v1`
implementation is only that graph mode is disabled in `vllm` if
`deepep_high_throughput` is enabled; it has no effect on the
`vllm-ascend` logic.
2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention
interface ] (https://github.com/vllm-project/vllm/pull/30684)
Fix Solution: The reason why the GPU does not need to convert qkv to 3D
is that the GPU's flash_attention operator is compatible with 3D and 4D
(b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator
only supports 3D (s b ( h d)). Therefore, we need to introduce the
reshape_qkv_to_3d operation.
4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue
in upgrade vllm code:
https://github.com/vllm-project/vllm-ascend/issues/5297
### How was this patch tested?
Co-authored-by: zxwang <1476209578@qq.com>
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
This commit is contained in:
@@ -58,6 +58,30 @@ class AscendMMEncoderAttention(MMEncoderAttention):
|
||||
multimodal_config=multimodal_config,
|
||||
)
|
||||
|
||||
def reshape_qkv_to_3d(
|
||||
self,
|
||||
query: torch.Tensor,
|
||||
key: torch.Tensor,
|
||||
value: torch.Tensor,
|
||||
bsz: int,
|
||||
q_len: int,
|
||||
kv_len: int,
|
||||
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
|
||||
"""
|
||||
Reshape query, key, value to 3D tensors:
|
||||
(batch_size * seq_len, num_heads, head_size)
|
||||
"""
|
||||
query = query.view(bsz * q_len, self.num_heads, self.head_size)
|
||||
key = key.view(bsz * kv_len, self.num_kv_heads, self.head_size)
|
||||
value = value.view(bsz * kv_len, self.num_kv_heads, self.head_size)
|
||||
self.num_queries_per_kv = self.num_heads // self.num_kv_heads
|
||||
if (num_repeat := self.num_queries_per_kv) > 1:
|
||||
# Handle MQA and GQA
|
||||
key = torch.repeat_interleave(key, num_repeat, dim=1)
|
||||
value = torch.repeat_interleave(value, num_repeat, dim=1)
|
||||
|
||||
return query, key, value
|
||||
|
||||
def forward_oot(
|
||||
self,
|
||||
query: torch.Tensor,
|
||||
@@ -86,6 +110,13 @@ class AscendMMEncoderAttention(MMEncoderAttention):
|
||||
v = F.pad(v, (0, pad_len), mode="constant", value=0)
|
||||
|
||||
context_layer = torch.empty_like(q)
|
||||
|
||||
if cu_seqlens is None:
|
||||
cu_seqlens = torch.arange(0, (bsz + 1) * q_len,
|
||||
step=q_len,
|
||||
dtype=torch.int32,
|
||||
device=query.device)
|
||||
|
||||
cu_seqlens = torch.diff(cu_seqlens).to("cpu")
|
||||
|
||||
# operator requires pta version >= 2.5.1
|
||||
|
||||
Reference in New Issue
Block a user