[Bugfix] fix pcp qwen full graph FIA bug (#6037)
### What this PR does / why we need it?
In the PCP full-graph Qwen model scenario, this PR fixes an inconsistency between the Q tensor shape and the actual q length passed to the FIA operator.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
```
@@ -440,11 +440,8 @@ def update_attn_dcp_pcp_params(update_stream, forward_context, runtime_shape):
        pad_tensor = np.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype)
        actual_seq_lengths_kv = np.concatenate([actual_seq_lengths_kv, pad_tensor])
        actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q[: attn_metadata.num_decode_tokens]
        if runtime_shape - len(actual_seq_lengths_q):
            actual_seq_lengths_q = actual_seq_lengths_q + [actual_seq_lengths_q[-1]] * (
                runtime_shape - len(actual_seq_lengths_q)
            )
        actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q
    if dcp_size > 1:
        num_heads = num_heads * dcp_size
```
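The list-padding step in the hunk above can be sketched standalone. This is a minimal illustration, not code from the patch: `pad_actual_seq_lengths_q` is a hypothetical helper name, and the interpretation (padding the per-request cumulative q-length list up to the fixed shape captured by the full graph, by repeating the last entry so padded slots contribute zero-length queries) is inferred from the diff.

```python
def pad_actual_seq_lengths_q(actual_seq_lengths_q, runtime_shape):
    """Pad a cumulative q-length list to the captured full-graph size.

    A captured full graph runs with a fixed batch shape (`runtime_shape`),
    so the length-metadata list handed to the attention operator must have
    exactly `runtime_shape` entries. Repeating the final cumulative value
    marks each padded slot as a zero-length query.
    """
    pad = runtime_shape - len(actual_seq_lengths_q)
    if pad > 0:
        # Hypothetical padding strategy mirroring the diff: repeat the
        # last prefix sum for every missing slot.
        actual_seq_lengths_q = actual_seq_lengths_q + [actual_seq_lengths_q[-1]] * pad
    return actual_seq_lengths_q


# Example: 3 real requests padded to a graph captured for 5 slots.
padded = pad_actual_seq_lengths_q([2, 5, 7], 5)
```

Because the entries are cumulative, two equal consecutive values mean the extra slot covers no tokens, which keeps the operator's view of q lengths consistent with the padded Q tensor shape.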