[Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465)

### What this PR does / why we need it?
To support fused kernels, multi-stream execution, communication optimization, and similar work, it is better to aggregate all operations in the Attention layer together. This PR refactors mla_v1 by moving all MLA preprocessing ops into the mla_v1 attention impl. Note that the new mla_v1 does not take torchair into consideration, so this PR can only be merged after the torchair-related mla_v1 is isolated into a new file.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

### Features Test
<img width="506" height="141" alt="image" src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466" />

### Performance After Refactor
<img width="648" height="486" alt="image" src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb" />

### Performance Before Refactor
<img width="618" height="494" alt="image" src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1" />

- vLLM version: v0.10.1.1
- vLLM main: e03940762b

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-08-28 10:35:57 +08:00
parent 320edde2df
commit c8d1df3a3f
5 changed files with 410 additions and 345 deletions
--- a/docs/source/user_guide/configuration/additional_config.md
+++ b/docs/source/user_guide/configuration/additional_config.md
@@ -31,6 +31,7 @@ The following table lists the additional configuration options available in vLLM
 | `refresh`                     | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf or ut/e2e test case.     |
 | `expert_map_path`             | str  | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
 | `chunked_prefill_for_mla`     | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
+| `enable_prefetch`     | bool | `False` | Whether to enable weight prefetch. |
 | `kv_cache_dtype`     | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
 | `enable_shared_expert_dp`     | bool | `False` | When the shared expert in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported to use. |
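
As a minimal sketch of how the new `enable_prefetch` row might be used, the snippet below builds an additional-config mapping with that option turned on and serializes it to JSON. Only the option name, its type, and its default come from the table. How the mapping is handed to the engine (for example through an `additional_config` argument or an equivalent command-line flag) is an assumption here and should be checked against the rest of this documentation page.

```python
import json

# Option name and bool type come from the table above; the documented default is False.
# Assumption: the mapping is passed through whatever additional-config entry point
# the surrounding documentation describes.
additional_config = {"enable_prefetch": True}

# JSON form of the same mapping, usable as a single command-line argument value.
cli_value = json.dumps(additional_config)
print(cli_value)  # {"enable_prefetch": true}
```

Leaving the key out keeps the documented default of `False`, so weight prefetch stays off unless it is set explicitly.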