update to vllm 12-19 (#5223)
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
Fix vLLM breakage:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](https://github.com/vllm-project/vllm/pull/29558)
   Fix: add the now-required `all2all_backend` parameter. Its only impact on the original `set_splitting_ops_for_v1` behavior is that vLLM disables graph mode when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. A sketch of the updated call follows this list.
2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](https://github.com/vllm-project/vllm/pull/30684)
   Fix: the GPU does not need to convert qkv to 3D because its flash_attention operator accepts both the 4D and 3D layouts (`b s h d` and `s b (h d)`), while the NPU's flash_attention_unpad operator only supports the 3D layout (`s b (h d)`). We therefore introduce a `reshape_qkv_to_3d` step; a shape walkthrough follows the corresponding hunk below.
3. Skip the Tencent-Hunyuan/HunyuanOCR test case, which fails after the vLLM upgrade; see https://github.com/vllm-project/vllm-ascend/issues/5297.
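
Below is a minimal sketch of the updated call for fix 1, mirroring the `NPUPlatform` hunk at the bottom of this diff. The `VllmConfig` import is assumed to match the vLLM commit pinned by this PR, and the wrapper function name is ours for illustration only.

```python
from vllm.config import VllmConfig  # assumes the vLLM commit pinned by this PR


def configure_splitting_ops(vllm_config: VllmConfig) -> None:
    """Pass the all2all backend and data-parallel size so vLLM itself can
    decide to disable graph mode when deepep_high_throughput is enabled;
    the vllm-ascend side of set_splitting_ops_for_v1 is unaffected."""
    compilation_config = vllm_config.compilation_config
    compilation_config.set_splitting_ops_for_v1(
        all2all_backend=vllm_config.parallel_config.all2all_backend,
        data_parallel_size=vllm_config.parallel_config.data_parallel_size,
    )
```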
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
`.github/workflows/bot_pr_create.yaml`:

```diff
@@ -34,7 +34,7 @@ jobs:
     steps:
       - name: Get vLLM version
         run: |
-          VLLM_COMMIT=ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9
+          VLLM_COMMIT=5fbfa8d9ef15948599631baeb91e8220b2ee9bcc
           echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV

       - name: Checkout repository
```
`.github/workflows/pr_test_full.yaml`:

```diff
@@ -74,7 +74,7 @@ jobs:
     name: e2e-full
     strategy:
       matrix:
-        vllm_version: [v0.13.0]
+        vllm_version: [5fbfa8d9ef15948599631baeb91e8220b2ee9bcc, v0.13.0]
     needs: [changes]
     if: ${{ needs.changes.outputs.e2e_tracker == 'true' }}
     uses: ./.github/workflows/_e2e_test.yaml
```
`.github/workflows/pr_test_light.yaml`:

```diff
@@ -42,7 +42,7 @@ jobs:
   lint:
     uses: ./.github/workflows/_pre_commit.yml
     with:
-      vllm: v0.13.0
+      vllm: 5fbfa8d9ef15948599631baeb91e8220b2ee9bcc
   changes:
     runs-on: linux-aarch64-a2-0
     outputs:
@@ -90,7 +90,7 @@ jobs:
       SOC_VERSION: ascend910b1
     strategy:
       matrix:
-        vllm_version: [v0.13.0]
+        vllm_version: [5fbfa8d9ef15948599631baeb91e8220b2ee9bcc, v0.13.0]

     steps:
       - name: Free up disk space
@@ -160,7 +160,7 @@ jobs:
     name: e2e-light
     strategy:
       matrix:
-        vllm_version: [v0.13.0]
+        vllm_version: [5fbfa8d9ef15948599631baeb91e8220b2ee9bcc, v0.13.0]
       # Note (yikun): If CI resource are limited we can split job into two chain jobs
     needs: [lint, changes]
     # only trigger e2e test after lint passed and the change is e2e related with pull request.
```
```diff
@@ -50,7 +50,7 @@ If you're using v0.7.3, don't forget to install [mindie-turbo](https://pypi.org/
 For main branch of vLLM Ascend, we usually make it compatible with the latest vLLM release and a newer commit hash of vLLM. Please note that this table is usually updated. Please check it regularly.

 | vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
 |-------------|--------------|------------------|-------------|--------------------|
-| main | v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
+| main | 5fbfa8d9ef15948599631baeb91e8220b2ee9bcc, v0.13.0 tag | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |

 ## Release cadence
```
```diff
@@ -781,11 +781,6 @@ PROMPT_CONFIGS = {
             "fps": 1,
         },
     },
-    "hunyuan-vl": {
-        "model": "Tencent-Hunyuan/HunyuanOCR",
-        "prompt_fn": hunyuan_prompt,
-        "mm_processor_kwargs": {},
-    },
 }
```
```diff
@@ -58,6 +58,30 @@ class AscendMMEncoderAttention(MMEncoderAttention):
             multimodal_config=multimodal_config,
         )

+    def reshape_qkv_to_3d(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        bsz: int,
+        q_len: int,
+        kv_len: int,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Reshape query, key, value to 3D tensors:
+        (batch_size * seq_len, num_heads, head_size)
+        """
+        query = query.view(bsz * q_len, self.num_heads, self.head_size)
+        key = key.view(bsz * kv_len, self.num_kv_heads, self.head_size)
+        value = value.view(bsz * kv_len, self.num_kv_heads, self.head_size)
+        self.num_queries_per_kv = self.num_heads // self.num_kv_heads
+        if (num_repeat := self.num_queries_per_kv) > 1:
+            # Handle MQA and GQA
+            key = torch.repeat_interleave(key, num_repeat, dim=1)
+            value = torch.repeat_interleave(value, num_repeat, dim=1)
+
+        return query, key, value
+
     def forward_oot(
         self,
         query: torch.Tensor,
```
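For fix 2, the following standalone snippet walks through the same shape handling as `reshape_qkv_to_3d` above, using hypothetical sizes (not taken from the PR) to show the 4D-to-3D flattening and the MQA/GQA head expansion:

```python
import torch

# Hypothetical sizes for illustration only.
bsz, q_len, kv_len = 2, 16, 16
num_heads, num_kv_heads, head_size = 8, 2, 64

# 4D layout (batch, seq, heads, head_size) as handed over by the new
# MMEncoderAttention interface.
query = torch.randn(bsz, q_len, num_heads, head_size)
key = torch.randn(bsz, kv_len, num_kv_heads, head_size)
value = torch.randn(bsz, kv_len, num_kv_heads, head_size)

# Flatten to the 3D layout (batch * seq, heads, head_size) expected by the
# NPU flash_attention_unpad path described in the PR.
query = query.view(bsz * q_len, num_heads, head_size)
key = key.view(bsz * kv_len, num_kv_heads, head_size)
value = value.view(bsz * kv_len, num_kv_heads, head_size)

# MQA/GQA: replicate each kv head so key/value match the query head count.
num_repeat = num_heads // num_kv_heads
if num_repeat > 1:
    key = torch.repeat_interleave(key, num_repeat, dim=1)
    value = torch.repeat_interleave(value, num_repeat, dim=1)

print(query.shape, key.shape, value.shape)
# torch.Size([32, 8, 64]) torch.Size([32, 8, 64]) torch.Size([32, 8, 64])
```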
```diff
@@ -86,6 +110,13 @@ class AscendMMEncoderAttention(MMEncoderAttention):
             v = F.pad(v, (0, pad_len), mode="constant", value=0)

+        context_layer = torch.empty_like(q)
+
+        if cu_seqlens is None:
+            cu_seqlens = torch.arange(0, (bsz + 1) * q_len,
+                                      step=q_len,
+                                      dtype=torch.int32,
+                                      device=query.device)

         cu_seqlens = torch.diff(cu_seqlens).to("cpu")

         # operator requires pta version >= 2.5.1
```
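The `cu_seqlens is None` fallback above builds cumulative offsets for an equal-length batch and then converts them to per-sequence lengths with `torch.diff`. A tiny numeric check of that arithmetic, with hypothetical sizes:

```python
import torch

bsz, q_len = 2, 4  # hypothetical sizes for illustration
cu_seqlens = torch.arange(0, (bsz + 1) * q_len, step=q_len, dtype=torch.int32)
print(cu_seqlens)              # tensor([0, 4, 8], dtype=torch.int32)
print(torch.diff(cu_seqlens))  # tensor([4, 4], dtype=torch.int32)
```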
```diff
@@ -232,7 +232,11 @@ class NPUPlatform(Platform):
                 "using only ACL Graph mode")
             assert compilation_config.mode == CompilationMode.VLLM_COMPILE, \
                 "When enabling VLLM_COMPILE aclgraph, please make sure compilation_config.mode == CompilationMode.VLLM_COMPILE and compilation_config.cudagraph_mode == CUDAGraphMode.VLLM_COMPILE"
-            compilation_config.set_splitting_ops_for_v1()
+            compilation_config.set_splitting_ops_for_v1(
+                all2all_backend=vllm_config.parallel_config.all2all_backend,
+                data_parallel_size=vllm_config.parallel_config.
+                data_parallel_size,
+            )
             compilation_config.use_inductor = False
             compilation_config.splitting_ops.extend(["vllm::mla_forward"])
             update_aclgraph_sizes(vllm_config)
```