[Bugfix] Fix incorrect layer count for MTP models in update_aclgraph_sizes (#7064)

## Summary
- Fix incorrect layer count calculation for MTP (Multi-Token Prediction)
models in `update_aclgraph_sizes()` function
- For MTP models, the draft model's layer count is stored in
`num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not
in the standard `num_hidden_layers` field
- Directly accessing `draft.hf_config.num_hidden_layers` returns the
main model's layer count instead of the MTP draft model's layer count

## Bug Description
In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function
calculates `resources_per_graph` for speculative decoding scenarios.
When calculating the resources needed for the draft model, the original
code directly accessed:

```python
resources_per_graph += draft.hf_config.num_hidden_layers + 1
```

This works correctly for standard draft models, but **fails for MTP
models** (like DeepSeek-V3's MTP or Qwen3.5's MTP) because:
1. MTP models store their layer count in model-specific fields:
   - `num_nextn_predict_layers` (DeepSeek-V3 MTP)
   - `mtp_num_hidden_layers` (Qwen3.5 MTP)
2. The `num_hidden_layers` field in these models contains the **main
model's** layer count, not the MTP layer count
3. This leads to **grossly overestimating** the `resources_per_graph`,
which in turn causes the calculated `max_batch_sizes` to be
unnecessarily small

## Fix
Use `draft.get_total_num_hidden_layers()` instead of directly accessing
`draft.hf_config.num_hidden_layers`. This method correctly handles
different model types through the `model_arch_config_convertor`
infrastructure, returning the appropriate layer count for:
- Standard draft models → `num_hidden_layers`
- DeepSeek-V3 MTP → `num_nextn_predict_layers`
- Qwen3.5 MTP → `mtp_num_hidden_layers`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
wanghuanjun2113
2026-03-09 16:14:51 +08:00
committed by GitHub
parent 4b4961ba5f
commit dec04ec8d8

View File

@@ -485,7 +485,10 @@ def update_aclgraph_sizes(vllm_config: VllmConfig) -> None:
resources_per_graph = num_hidden_layers + 1
# For suffix decoding, use the suffix path when no draft_model_config is provided.
if (spec := vllm_config.speculative_config) and (draft := spec.draft_model_config):
resources_per_graph += draft.hf_config.num_hidden_layers + 1
# Use get_total_num_hidden_layers() to correctly handle MTP models,
# which store layer count in num_nextn_predict_layers or
# mtp_num_hidden_layers (for Qwen3.5) instead of num_hidden_layers.
resources_per_graph += draft.get_total_num_hidden_layers() + 1
# TODO: Find out whether we need to take into account the pp_size
num_comm_groups = sum(