[main][bugfix] Fixed the problem that eagle3 will crash in FULL_DECODE_ONLY (#7290)
### What this PR does / why we need it?
This PR fixes two problems. Both occur in `FULL_DECODE_ONLY` mode, where `num_tokens` must be padded to one of the values in `cudagraph_capture_sizes`.
1. The length of `seq_lens_list` in the drafter's `attn_metadata` is one
shorter than expected, which raises a kernel exception and crashes vLLM.
e.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is correctly padded to [4, 8, 12, 20], but
`seq_lens_list` = [5742, 4700, 7996] is not padded.
2. Although the length of `seq_lens_list` in the target's `attn_metadata`
matches the expected value in `FULL_DECODE_ONLY`, the data at the end of
the list is corrupted.
e.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is correctly padded to [4, 8, 12, 20], but
`seq_lens_list` = [5742, 4700, 7996, 5738] ends with a corrupted value.
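To make the failure mode concrete, here is a minimal sketch (not the vLLM code; the helper name `pad_to_capture_size` is hypothetical) of how a batch size gets padded up to the nearest captured graph size, which is what `actual_seq_lengths_q` already does correctly and what `seq_lens_list` was missing:

```python
def pad_to_capture_size(num_tokens: int, cudagraph_capture_sizes: list[int]) -> int:
    """Return the smallest capture size >= num_tokens.

    Falls back to num_tokens itself if no capture size is large enough
    (hypothetical helper for illustration only).
    """
    for size in sorted(cudagraph_capture_sizes):
        if size >= num_tokens:
            return size
    return num_tokens

# With the PR's example values: 3 requests, one capture size of 20.
print(pad_to_capture_size(3, [20]))  # -> 20
```

Every per-request list fed to a captured graph must have this padded length, which is exactly what the two problems above violate.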
- vLLM version: v0.17.0
- vLLM main: 4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
@@ -559,6 +559,8 @@ class SpecDecodeBaseProposer(EagleProposer):

```python
            common_attn_metadata.num_reqs = num_reqs_padded
            common_attn_metadata.query_start_loc = self.runner.query_start_loc.gpu[: num_reqs_padded + 1]
            common_attn_metadata.query_start_loc_cpu = self.runner.query_start_loc.cpu[: num_reqs_padded + 1]
            common_attn_metadata.seq_lens = self.runner.seq_lens.gpu[:num_reqs_padded]
            common_attn_metadata.seq_lens_cpu = self.runner.seq_lens.cpu[:num_reqs_padded]
        else:
            num_input_tokens = num_tokens
```
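The hunk above slices the runner's persistent buffers to `num_reqs_padded` instead of `num_reqs`. A toy model of that change, using plain Python lists in place of the runner's tensors (buffer contents are made-up example values):

```python
num_reqs = 3
num_reqs_padded = 4  # padded up to match a captured graph shape

# Persistent buffer sized for the maximum batch; tail entries already zeroed.
seq_lens_buffer = [5742, 4700, 7996, 0, 0, 0, 0, 0]

# Before the fix: slicing to num_reqs gives length 3, one short of what
# the captured kernel expects.
seq_lens_unpadded = seq_lens_buffer[:num_reqs]

# After the fix: slicing to num_reqs_padded gives length 4, with a zero pad.
seq_lens_padded = seq_lens_buffer[:num_reqs_padded]

print(seq_lens_unpadded)  # [5742, 4700, 7996]
print(seq_lens_padded)    # [5742, 4700, 7996, 0]
```

Because the buffer's tail is pre-zeroed, taking the longer slice is enough to produce a correctly padded `seq_lens_list` for the drafter.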
@@ -758,11 +758,11 @@ class NPUModelRunner(GPUModelRunner):

```python
        self.gdn_query_start_loc.copy_to_gpu()

        self.seq_lens.np[:num_reqs] = self.input_batch.num_computed_tokens_cpu[:num_reqs] + num_scheduled_tokens
        self.seq_lens.cpu[num_reqs:].fill_(0)
        self.seq_lens.copy_to_gpu()

        # Fill unused with -1. Needed for reshape_and_cache in attention_cp
        self.query_start_loc.gpu[num_reqs + 1 :].fill_(-1)
        self.seq_lens.gpu[num_reqs:].fill_(0)

        # Copy the tensors to the NPU.
        self._prepare_input_ids(scheduler_output, total_num_scheduled_tokens, cu_num_tokens)
```
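The second hunk addresses the corrupted tail by zeroing the CPU-side padding before the device copy (and additionally zeroing the device-side tail). A toy reproduction of why ordering matters, with plain Python lists standing in for the CPU and NPU tensors (values are the made-up examples from the description):

```python
cpu_buf = [5742, 4700, 7996, 5738]  # entry 3 is stale from a previous step
num_reqs = 3

# Buggy order: copy to the device first, then zero only the CPU side.
# The stale 5738 reaches the device, which is the corruption the PR observed.
device_buggy = list(cpu_buf)
cpu_after_buggy = list(cpu_buf)
cpu_after_buggy[num_reqs:] = [0] * (len(cpu_after_buggy) - num_reqs)

# Fixed order: zero the padded tail first, then copy to the device.
cpu_fixed = list(cpu_buf)
cpu_fixed[num_reqs:] = [0] * (len(cpu_fixed) - num_reqs)
device_fixed = list(cpu_fixed)

print(device_buggy)  # [5742, 4700, 7996, 5738]
print(device_fixed)  # [5742, 4700, 7996, 0]
```

With the fix, the padded region of `seq_lens` that a captured graph reads always holds zeros rather than whatever the buffer contained on a previous step.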