[long_seq_Feat] support chunk prefill (#4158)
### What this PR does / why we need it?
1. Qwen GQA `attention_v1` optimization.
2. DeepSeek MLA refactor: switch from all-gathering q to all-gathering kv (a minimal sketch follows below).
3. Model-runner refactor for chunked prefill; unused code removed.
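
A minimal sketch of the communication change in item 2, using `torch.distributed`. The function names, tensor shapes, and the toy `attention` helper are illustrative assumptions, not this repo's actual MLA implementation; the point is only that the collective moves kv instead of q:

```python
# Illustrative sketch only: names and shapes are hypothetical, and the
# attention helper is a toy stand-in for the real MLA kernel on NPU.
import torch
import torch.distributed as dist


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Toy scaled-dot-product attention, stand-in for the device kernel.
    scores = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v


def mla_prefill_allgather_q(q, k, v, world_size):
    # Old pattern: every rank all-gathers q, then attends to its local kv.
    q_parts = [torch.empty_like(q) for _ in range(world_size)]
    dist.all_gather(q_parts, q)
    return attention(torch.cat(q_parts, dim=0), k, v)


def mla_prefill_allgather_kv(q, k, v, world_size):
    # New pattern: all-gather kv instead and keep q local, so the
    # collective carries the compressed latent kv rather than the full q.
    k_parts = [torch.empty_like(k) for _ in range(world_size)]
    v_parts = [torch.empty_like(v) for _ in range(world_size)]
    dist.all_gather(k_parts, k)
    dist.all_gather(v_parts, v)
    return attention(q, torch.cat(k_parts, dim=0), torch.cat(v_parts, dim=0))
```
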
- vLLM version: v0.11.0
- vLLM main: 2918c1b49c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
@@ -484,9 +484,6 @@ class TestAscendMLAImpl(TestBase):
        chunk_ctx.chunk_seq_lens = [torch.tensor([8])]
        chunk_ctx.chunk_seq_lens_npu = [torch.tensor([8])]
        chunk_ctx.starts = [torch.tensor([0])]
        chunk_ctx.max_chunk_num = 1
        chunk_ctx.mask_for_non_zero_chunk = [True]
        chunk_ctx.local_chunked_kv_lens = [[[[8]]]]

        prefill_meta = MagicMock()
        prefill_meta.chunked_context = chunk_ctx
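
For context, here is a hypothetical sketch of how chunked-context metadata shaped like the mock above could be consumed: iterate over chunks, skip masked-out ones, and slice each chunk's span out of the kv cache. The field names mirror the test; the loop itself is an assumption for illustration, not the actual model-runner code:

```python
# Hypothetical consumer of chunked-context metadata; field names mirror the
# mock in the test above, the slicing logic is illustrative only.
import torch
from types import SimpleNamespace

chunk_ctx = SimpleNamespace(
    chunk_seq_lens=[torch.tensor([8])],
    starts=[torch.tensor([0])],
    max_chunk_num=1,
    mask_for_non_zero_chunk=[True],
)

def iter_chunk_kv(ctx, kv_cache: torch.Tensor):
    # Yield the kv-cache slice for every non-empty chunk.
    for i in range(ctx.max_chunk_num):
        if not ctx.mask_for_non_zero_chunk[i]:
            continue  # chunk is padding, nothing cached for it
        start = int(ctx.starts[i])
        length = int(ctx.chunk_seq_lens[i])
        yield kv_cache[start:start + length]

kv_cache = torch.randn(16, 64)           # toy cache: 16 tokens, head_dim 64
(first_chunk,) = iter_chunk_kv(chunk_ctx, kv_cache)
assert first_chunk.shape == (8, 64)      # one chunk of 8 cached tokens
```
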