[Feature]cpu offload connector (#1659)
This PR implements a CPU offload connector that enables offloading the NPU
KV cache to host DRAM.
- vLLM version: v0.10.2
- vLLM main: 5aeb925452
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
Co-authored-by: AlvisGong <gwly0401@163.com>
@@ -156,8 +156,9 @@ def mla_forward(
     else:
         attn_metadata = forward_context.attn_metadata
     kv_cache = self.mla_attn.kv_cache[forward_context.virtual_engine]
-    self.mla_attn.impl.forward(hidden_states, kv_cache, attn_metadata,
-                               need_gather_q_kv, output)
+    self.mla_attn.impl.forward(self.mla_attn.layer_name, hidden_states,
+                               kv_cache, attn_metadata, need_gather_q_kv,
+                               output)
     return
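The hunk above adds the layer name as the first argument to `impl.forward`. One plausible reason, given the PR's stated goal, is that a per-layer offload connector needs to know which layer's KV cache it is loading or saving. The sketch below illustrates that idea only. `ToyOffloadConnector`, `ToyAttentionImpl`, their method names, and the layer name string are hypothetical stand-ins and are not taken from this PR or from vLLM.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOffloadConnector:
    """Hypothetical stand-in for a KV offload connector (not the PR's class)."""
    host_store: dict = field(default_factory=dict)  # layer_name -> host copy of KV
    events: list = field(default_factory=list)      # call log, for illustration

    def wait_for_layer_load(self, layer_name: str) -> None:
        # A real connector would block here until the host-to-device copy
        # for this layer has finished. The toy version only records the call.
        self.events.append(("wait", layer_name))

    def save_kv_layer(self, layer_name: str, kv_cache: list) -> None:
        # Copy the layer's KV data to "host" memory, keyed by layer name.
        self.host_store[layer_name] = list(kv_cache)
        self.events.append(("save", layer_name))


class ToyAttentionImpl:
    """Hypothetical impl whose forward takes layer_name first, like the new call."""

    def __init__(self, connector: ToyOffloadConnector):
        self.connector = connector

    def forward(self, layer_name, hidden_states, kv_cache, attn_metadata,
                need_gather_q_kv, output):
        # The layer name lets the impl address the right layer in the connector.
        self.connector.wait_for_layer_load(layer_name)
        # Placeholder for attention: extend the cache and echo the input.
        kv_cache.extend(hidden_states)
        output[:] = hidden_states
        self.connector.save_kv_layer(layer_name, kv_cache)
        return output


connector = ToyOffloadConnector()
impl = ToyAttentionImpl(connector)
kv_cache = []
output = [0.0, 0.0]
impl.forward("model.layers.0.self_attn", [1.0, 2.0], kv_cache, None, False, output)
```

Without the layer name, the impl in this sketch would have no key under which to store or fetch a layer's cache, which is why the call site has to pass it through.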