correct bug to fix the value of max_num_tokens (#3933)
### What this PR does / why we need it?
Correct a bug in the computation of `max_num_tokens`: it was mistakenly set to the tensor-parallel size instead of `max_num_reqs * uniform_decode_query_len`.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
```diff
@@ -117,7 +117,7 @@ class NPUTorchairModelRunner(NPUModelRunner):
         # NOTE: To be clear, we need to make sure that during graph capture, the number of
         # tokens is less than or equal to mc2_tokens_capacity. According to _set_cudagraph_sizes,
         # the max number of tokens in graph is min(max_num_seqs * uniform_decode_query_len, 512).
-        max_num_tokens = self.parallel_config.tensor_parallel_size
+        max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len
         tp_size = self.parallel_config.tensor_parallel_size
         # Use integer arithmetic for ceiling division.
         max_graph_batch_size = self.calculate_new_torchair_graph_batch_size(
```
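The comment in the diff states that the number of tokens captured in the graph is `min(max_num_seqs * uniform_decode_query_len, 512)`, and that the subsequent sizing uses integer ceiling division. The sketch below illustrates that arithmetic in isolation; the function names and the standalone form are assumptions for illustration, not the actual `NPUTorchairModelRunner` implementation.

```python
def ceil_div(a: int, b: int) -> int:
    # Integer ceiling division, as referenced by the diff's comment
    # ("Use integer arithmetic for ceiling division").
    return (a + b - 1) // b

def graph_capture_tokens(max_num_reqs: int,
                         uniform_decode_query_len: int,
                         cap: int = 512) -> int:
    # Mirrors the NOTE in the diff: the max number of tokens in the
    # captured graph is min(max_num_reqs * uniform_decode_query_len, cap).
    return min(max_num_reqs * uniform_decode_query_len, cap)

# Hypothetical values for illustration only:
tokens = graph_capture_tokens(max_num_reqs=16, uniform_decode_query_len=4)
print(tokens)            # 16 * 4 = 64, under the 512 cap
print(ceil_div(tokens, 2))  # e.g. splitting across 2 ranks -> 32
```

With the corrected line, `max_num_tokens` grows with the request count and decode query length rather than being pinned to the tensor-parallel size, which is what keeps it comparable against `mc2_tokens_capacity` during graph capture.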