[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072)

By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: 83f478bb19 --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>
2025-12-31 14:24:04 +08:00
parent a539ae753a
commit 38570cfeb6
8 changed files with 163 additions and 95 deletions
--- a/docs/source/user_guide/configuration/additional_config.md
+++ b/docs/source/user_guide/configuration/additional_config.md
@@ -48,6 +48,7 @@ The following table lists additional configuration options available in vLLM Asc
 | `num_wait_worker_iterations`        | int  | `30`    | The forward iterations when the EPLB worker will finish CPU tasks. In our test default value 30 can cover most cases. |
 | `expert_map_record_path`            | str  | `None`  | Save the expert load calculation results to a new expert table in the specified directory.                |
 | `init_redundancy_expert`            | int  | `0`     | Specify redundant experts during initialization.                                                          |
+| `enable_kv_nz`                      | bool | `False` | Whether to enable kvcache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek).                                      |

 The details of each configuration option are as follows:

@@ -105,7 +106,8 @@ An example of additional configuration is as follows:
        "embedding_tensor_parallel_size": 8,
        "mlp_tensor_parallel_size": 8,
    },
+    "enable_kv_nz": False,
    "multistream_overlap_shared_expert": True,
-    "refresh": False,
+    "refresh": False
 }
 ```