xc-llm-ascend/vllm_ascend
Shaoxu Cheng e5343d6eb3 [310P][Bugfix]: fix ngram graph replay accuracy error (#7134)
### What this PR does / why we need it?
On the 310P device, when ACLGraph runs together with the n-gram
speculative decoding algorithm, both graph capture and graph replay
rely on `uniform_decode_query_len` and do not depend on
`attention_state`. On 310P this produces an unexpected inversion: in
the decode-only state, execution does **not** enter the graph, while in
the split-fuse state (that is, the chunked prefill state), it enters
graph execution directly.

The issue can be resolved by forcibly setting `uniform_decode_query_len`
to `1`, so that 310P captures only the decode-only graph, and replay is
then controlled through `attention_state`.
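
A toy model may help picture the inversion and the fix. It is only a sketch and is not taken from the vllm-ascend code. The names `replay_by_query_len` and `replay_by_state` and the `AttentionState` enum are hypothetical. The rule that a batch replays the graph when every query length equals `uniform_decode_query_len`, and the pre-fix value of `1 + num_speculative_tokens`, are assumptions made for illustration.

```python
from enum import Enum, auto


class AttentionState(Enum):
    """Hypothetical stand-in for the attention states named in this PR."""
    DECODE_ONLY = auto()
    CHUNKED_PREFILL = auto()  # the "split-fuse" state


def replay_by_query_len(query_lens, uniform_decode_query_len):
    """Assumed pre-fix rule: replay depends only on the query lengths."""
    return all(q == uniform_decode_query_len for q in query_lens)


def replay_by_state(query_lens, state, uniform_decode_query_len=1):
    """Assumed post-fix rule: length is pinned to 1, state gates replay."""
    return (
        state is AttentionState.DECODE_ONLY
        and all(q == uniform_decode_query_len for q in query_lens)
    )


# Assumption: 2 speculative tokens, so the pre-fix uniform length is 3.
num_speculative_tokens = 2
pre_fix_len = 1 + num_speculative_tokens

decode_batch = [1, 1, 1]   # decode-only: one token per request
chunked_batch = [3, 3]     # chunked prefill whose chunks happen to be 3 long

# Pre-fix: the decode-only batch misses the graph, the chunked one hits it.
pre_fix = (
    replay_by_query_len(decode_batch, pre_fix_len),
    replay_by_query_len(chunked_batch, pre_fix_len),
)

# Post-fix: only the decode-only batch replays the graph.
post_fix = (
    replay_by_state(decode_batch, AttentionState.DECODE_ONLY),
    replay_by_state(chunked_batch, AttentionState.CHUNKED_PREFILL),
)
```

In this model, pinning the length to `1` means the captured graph corresponds to the decode-only shape, and checking `attention_state` keeps chunked prefill batches out of graph replay even when their lengths happen to match.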

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-12 17:08:08 +08:00