xc-llm-ascend/vllm_ascend at e5343d6eb32f863f72dbb0e503c3fbaefca7c924

EngineX/xc-llm-ascend

Shaoxu Cheng e5343d6eb3 [310P][Bugfix]: fix ngram graph replay accuracy error (#7134 )

### What this PR does / why we need it?
On the 310P device, when ACLGraph runs together with the n-gram
speculative decoding algorithm, both graph capture and graph replay
depend on `uniform_decode_query_len` and not on `attention_state`.
On 310P this inverts the intended behavior: in the decode-only state,
execution does **not** enter the graph, while in the split-fuse state
(that is, the chunked prefill state) it enters graph execution
directly.

The issue can be resolved by forcibly setting `uniform_decode_query_len`
to `1`, so that 310P captures only the decode-only graph, and replay is
then controlled through `attention_state`.
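
The gating change described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the vllm-ascend code: `AttentionState`, `Batch`, `replay_by_query_len`, and `replay_by_attention_state` are hypothetical names, and the token-count check is an assumed stand-in for however the real dispatcher matches a batch to a captured graph. The example value `3` for `uniform_decode_query_len` is likewise an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class AttentionState(Enum):
    # Hypothetical stand-in for the attention states named in the PR text.
    DECODE_ONLY = auto()
    CHUNKED_PREFILL = auto()  # called "split-fuse" in the PR description


@dataclass
class Batch:
    num_reqs: int
    num_tokens: int
    attention_state: AttentionState


def replay_by_query_len(batch: Batch, uniform_decode_query_len: int) -> bool:
    # Toy gate that looks only at the token count, never at attention_state.
    return batch.num_tokens == batch.num_reqs * uniform_decode_query_len


def replay_by_attention_state(batch: Batch) -> bool:
    # Toy gate after the fix: query length pinned to 1, state decides replay.
    uniform_decode_query_len = 1
    return (
        batch.attention_state is AttentionState.DECODE_ONLY
        and batch.num_tokens == batch.num_reqs * uniform_decode_query_len
    )


# Assumed example batches: four requests each.
decode = Batch(num_reqs=4, num_tokens=4, attention_state=AttentionState.DECODE_ONLY)
mixed = Batch(num_reqs=4, num_tokens=12, attention_state=AttentionState.CHUNKED_PREFILL)

# Gate keyed on query length alone (assumed value 3): the decode-only batch
# misses the graph, while the chunked-prefill batch happens to match it.
print(replay_by_query_len(decode, 3), replay_by_query_len(mixed, 3))

# Gate keyed on attention_state with query length forced to 1.
print(replay_by_attention_state(decode), replay_by_attention_state(mixed))
```

In this toy setup the first gate admits the chunked-prefill batch and rejects the decode-only one, while the second gate does the opposite, which mirrors the symptom and the fix that the PR text describes.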

### Does this PR introduce _any_ user-facing change?
NO

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>

2026-03-12 17:08:08 +08:00

[310P][Bugfix]: fix ngram graph replay accuracy error (#7134 )

2026-03-12 17:08:08 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

[eagle][cp] fix eagle_cp enable bug2 (#7079 )

2026-03-10 16:32:49 +08:00

[bugfix] fix pass bug: pass really rope dim for npu_rotary_embedding (#6880 )

2026-03-06 19:35:17 +08:00

[BugFix]Fix recomputed scheduler bug (#7137 )

2026-03-11 00:32:19 +08:00

[misc] move mxfp_compat into device to decouple from quantization init chain (#6918 )

2026-03-02 18:17:01 +08:00

device_allocator

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

improve the ttft when use mooncake (#6125 )

2026-03-12 16:13:48 +08:00

Support per-step heat collection and enhance FlashLB for multi-stage load balancing (#6477 )

2026-03-12 15:49:09 +08:00

[Main2Main] Upgrade vLLM to 0226 (#6813 )

2026-02-27 16:05:21 +08:00

[Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650 )

2026-03-11 15:43:15 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6 ) (#6001 )

2026-01-24 22:08:33 +08:00

Support per-step heat collection and enhance FlashLB for multi-stage load balancing (#6477 )

2026-03-12 15:49:09 +08:00

[main][bugfix] Fixed the problem of speculative decoding in FULL mode (#7148 )

2026-03-12 14:51:12 +08:00

Refactor quantization layer name mapping to leverage vLLM built-in mappers (#7050 )

2026-03-12 15:48:14 +08:00

[Feature] Add docs of batch invariance and make some extra operators patch (#6910 )

2026-03-05 09:12:40 +08:00

[main][bugfix] Fixed the problem of speculative decoding in FULL mode (#7148 )

2026-03-12 14:51:12 +08:00

[Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644 )

2026-03-12 16:14:45 +08:00

[Feat]Xlite Qwen3 MoE Support Data Parallel (#6715 )

2026-03-09 17:53:35 +08:00

__init__.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

ascend_config.py

refactor: add a check before layer_sharding logging (#7186 )

2026-03-12 11:56:04 +08:00

ascend_forward_context.py

[EPLB][bugfix] Bugfix for fused mc2 (#6794 )

2026-03-09 11:26:57 +08:00

batch_invariant.py

[Feature] Add docs of batch invariance and make some extra operators patch (#6910 )

2026-03-05 09:12:40 +08:00

cpu_binding.py

[CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (#6945 )

2026-03-03 17:20:52 +08:00

envs.py

[MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618 )

2026-02-09 15:38:58 +08:00

flash_common3_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

meta_registration.py

[Ops][Refactor] Remove custom rotary_embedding operator (#6523 )

2026-02-07 09:24:05 +08:00

platform.py

Revert "[Feature][Quant] Auto-detect quantization format from model f… (#6873 )

2026-03-10 11:27:32 +08:00

profiling_config.py

[Core][Misc] Clean up ProfileExecuteDuration (#6461 )

2026-02-01 20:06:01 +08:00

utils.py

[Build] Add support for Ascend950 chip (#7151 )

2026-03-12 10:25:51 +08:00