[long_seq_Feat] support chunk prefill (#4158)
### What this PR does / why we need it?
1. Qwen GQA `attention_v1` optimization.
2. DeepSeek MLA refactor: switch from all-gathering q to all-gathering kv (a minimal sketch follows below).
3. Model-runner refactor for chunked prefill; unused code removed.
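
A minimal sketch of the communication change in item 2, using `torch.distributed`. The function names, tensor shapes, and the toy `attention` helper are illustrative assumptions, not this repo's actual MLA implementation; the point is only that the collective moves kv instead of q:

```python
# Illustrative sketch only: names and shapes are hypothetical, and the
# attention helper is a toy stand-in for the real MLA kernel on NPU.
import torch
import torch.distributed as dist


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Toy scaled-dot-product attention, stand-in for the device kernel.
    scores = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v


def mla_prefill_allgather_q(q, k, v, world_size):
    # Old pattern: every rank all-gathers q, then attends to its local kv.
    q_parts = [torch.empty_like(q) for _ in range(world_size)]
    dist.all_gather(q_parts, q)
    return attention(torch.cat(q_parts, dim=0), k, v)


def mla_prefill_allgather_kv(q, k, v, world_size):
    # New pattern: all-gather kv instead and keep q local, so the
    # collective carries the compressed latent kv rather than the full q.
    k_parts = [torch.empty_like(k) for _ in range(world_size)]
    v_parts = [torch.empty_like(v) for _ in range(world_size)]
    dist.all_gather(k_parts, k)
    dist.all_gather(v_parts, v)
    return attention(q, torch.cat(k_parts, dim=0), torch.cat(v_parts, dim=0))
```
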
- vLLM version: v0.11.0
- vLLM main: 2918c1b49c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
@@ -484,9 +484,6 @@ class TestAscendMLAImpl(TestBase):
        chunk_ctx.chunk_seq_lens = [torch.tensor([8])]
        chunk_ctx.chunk_seq_lens_npu = [torch.tensor([8])]
        chunk_ctx.starts = [torch.tensor([0])]
        chunk_ctx.max_chunk_num = 1
        chunk_ctx.mask_for_non_zero_chunk = [True]
        chunk_ctx.local_chunked_kv_lens = [[[[8]]]]

        prefill_meta = MagicMock()
        prefill_meta.chunked_context = chunk_ctx
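
For context, here is a hypothetical sketch of how chunked-context metadata shaped like the mock above could be consumed: iterate over chunks, skip masked-out ones, and slice each chunk's span out of the kv cache. The field names mirror the test; the loop itself is an assumption for illustration, not the actual model-runner code:

```python
# Hypothetical consumer of chunked-context metadata; field names mirror the
# mock in the test above, the slicing logic is illustrative only.
import torch
from types import SimpleNamespace

chunk_ctx = SimpleNamespace(
    chunk_seq_lens=[torch.tensor([8])],
    starts=[torch.tensor([0])],
    max_chunk_num=1,
    mask_for_non_zero_chunk=[True],
)

def iter_chunk_kv(ctx, kv_cache: torch.Tensor):
    # Yield the kv-cache slice for every non-empty chunk.
    for i in range(ctx.max_chunk_num):
        if not ctx.mask_for_non_zero_chunk[i]:
            continue  # chunk is padding, nothing cached for it
        start = int(ctx.starts[i])
        length = int(ctx.chunk_seq_lens[i])
        yield kv_cache[start:start + length]

kv_cache = torch.randn(16, 64)           # toy cache: 16 tokens, head_dim 64
(first_chunk,) = iter_chunk_kv(chunk_ctx, kv_cache)
assert first_chunk.shape == (8, 64)      # one chunk of 8 cached tokens
```
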