[Bugfix] Fix incorrect layer count for MTP models in update_aclgraph_sizes (#7064)
## Summary
- Fix incorrect layer count calculation for MTP (Multi-Token Prediction)
models in `update_aclgraph_sizes()` function
- For MTP models, the draft model's layer count is stored in
`num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not
in the standard `num_hidden_layers` field
- Directly accessing `draft.hf_config.num_hidden_layers` returns the
main model's layer count instead of the MTP draft model's layer count
## Bug Description
In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function
calculates `resources_per_graph` for speculative decoding scenarios.
When calculating the resources needed for the draft model, the original
code directly accessed:
```python
resources_per_graph += draft.hf_config.num_hidden_layers + 1
```
This works correctly for standard draft models, but **fails for MTP
models** (like DeepSeek-V3's MTP or Qwen3.5's MTP) because:
1. MTP models store their layer count in model-specific fields:
- `num_nextn_predict_layers` (DeepSeek-V3 MTP)
- `mtp_num_hidden_layers` (Qwen3.5 MTP)
2. The `num_hidden_layers` field in these models contains the **main
model's** layer count, not the MTP layer count
3. This leads to **grossly overestimating** the `resources_per_graph`,
which in turn causes the calculated `max_batch_sizes` to be
unnecessarily small
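A small self-contained sketch of the overestimation, using made-up layer counts and `SimpleNamespace` stand-ins for the HF config objects (the numbers and objects are illustrative, not taken from any real checkpoint):

```python
from types import SimpleNamespace

# Hypothetical DeepSeek-V3-style MTP draft config: the real MTP depth is 1,
# but hf_config still carries the main model's layer count (61 here).
mtp_hf_config = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)

num_hidden_layers = 61  # main model's layer count

# Buggy accounting: the draft is charged for 61 layers instead of 1.
buggy = (num_hidden_layers + 1) + (mtp_hf_config.num_hidden_layers + 1)
# Correct accounting reads the MTP-specific field.
fixed = (num_hidden_layers + 1) + (mtp_hf_config.num_nextn_predict_layers + 1)

print(buggy, fixed)  # 124 64
```

With `resources_per_graph` inflated roughly 2x, the derived `max_batch_sizes` shrinks correspondingly.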
## Fix
Use `draft.get_total_num_hidden_layers()` instead of directly accessing
`draft.hf_config.num_hidden_layers`. This method correctly handles
different model types through the `model_arch_config_convertor`
infrastructure, returning the appropriate layer count for:
- Standard draft models → `num_hidden_layers`
- DeepSeek-V3 MTP → `num_nextn_predict_layers`
- Qwen3.5 MTP → `mtp_num_hidden_layers`
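A minimal sketch of the kind of field dispatch `get_total_num_hidden_layers()` performs (this is not vLLM's actual implementation; the helper name `resolve_draft_layer_count` and the `SimpleNamespace` configs are hypothetical, only the field names come from the model configs above):

```python
from types import SimpleNamespace

def resolve_draft_layer_count(hf_config) -> int:
    # Check the MTP-specific fields first, since MTP configs may also
    # carry the main model's num_hidden_layers.
    for field in ("num_nextn_predict_layers",  # DeepSeek-V3 MTP
                  "mtp_num_hidden_layers"):    # Qwen3.5 MTP
        if hasattr(hf_config, field):
            return getattr(hf_config, field)
    # Standard draft models fall back to the usual field.
    return hf_config.num_hidden_layers

# Usage with stand-in configs (field values are illustrative):
deepseek_mtp = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)
qwen_mtp = SimpleNamespace(num_hidden_layers=48, mtp_num_hidden_layers=1)
standard_draft = SimpleNamespace(num_hidden_layers=24)

counts = [resolve_draft_layer_count(c)
          for c in (deepseek_mtp, qwen_mtp, standard_draft)]
print(counts)  # [1, 1, 24]
```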
🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
```diff
@@ -485,7 +485,10 @@ def update_aclgraph_sizes(vllm_config: VllmConfig) -> None:
     resources_per_graph = num_hidden_layers + 1
     # For suffix decoding, use the suffix path when no draft_model_config is provided.
     if (spec := vllm_config.speculative_config) and (draft := spec.draft_model_config):
-        resources_per_graph += draft.hf_config.num_hidden_layers + 1
+        # Use get_total_num_hidden_layers() to correctly handle MTP models,
+        # which store layer count in num_nextn_predict_layers or
+        # mtp_num_hidden_layers (for Qwen3.5) instead of num_hidden_layers.
+        resources_per_graph += draft.get_total_num_hidden_layers() + 1

     # TODO: Find out whether we need to take into account the pp_size
     num_comm_groups = sum(
```