xc-llm-ascend/vllm_ascend at f2956ce94427f6a3af44f8028b7b0fb3f7c087de

EngineX/xc-llm-ascend

wangbj127 f2956ce944 [v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133)

Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858

### What this PR does / why we need it?
This PR fixes a `RuntimeError` (dimension mismatch) that occurs when
Sequence Parallelism (SP) is enabled and the padding added for SP causes
`num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases,
`_pad_query_start_loc_for_fia` adds a dummy request, increasing
`num_reqs_padded`. This mismatch between the actual number of requests
and the padded number of requests leads to errors in downstream token
count computations (e.g., `compute_num_computed_tokens`).
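
To make the mismatch concrete, here is a minimal sketch in plain Python. `round_up` and `pad_query_start_loc` are hypothetical stand-ins, not the actual `_pad_query_start_loc_for_fia` implementation, and the SP size of 4 is an assumed value chosen for illustration.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (hypothetical helper)."""
    return -(-x // multiple) * multiple


def pad_query_start_loc(query_start_loc: list[int], num_tokens_padded: int) -> list[int]:
    """Append one dummy request that covers the padding tokens, if there are any.

    Hypothetical stand-in for the padding step described above.
    """
    num_tokens_unpadded = query_start_loc[-1]
    if num_tokens_padded > num_tokens_unpadded:
        return query_start_loc + [num_tokens_padded]
    return list(query_start_loc)


# Three real requests with 3, 2 and 4 tokens (cumulative offsets).
query_start_loc = [0, 3, 5, 9]
num_reqs = len(query_start_loc) - 1
num_tokens_unpadded = query_start_loc[-1]

# Assumed SP size of 4: the token count is padded from 9 up to 12.
num_tokens_padded = round_up(num_tokens_unpadded, 4)

padded = pad_query_start_loc(query_start_loc, num_tokens_padded)
num_reqs_padded = len(padded) - 1

# Per-request state still has one entry per real request, so any
# computation that pairs it with the padded request count disagrees.
num_computed_tokens = [10, 20, 30]
print(num_reqs, num_reqs_padded, len(num_computed_tokens))
```

In this sketch the padded offsets describe four requests while the per-request state describes three, which is the kind of length disagreement that would surface as a dimension mismatch in tensor code.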

For the SP-enabled case, the fix modifies the restrictive condition
`num_tokens_padded == num_tokens_unpadded` that is checked when reverting
the dummy request padding. SP padding is handled by stripping it after
communication, so it should not be treated as an additional request in
the attention metadata.
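
One possible reading of the change can be sketched as follows. The function name, its parameters, and the exact form of the condition are assumptions made for illustration; the real patch may be structured differently.

```python
def resolve_num_reqs_padded(
    num_reqs: int,
    num_reqs_padded: int,
    num_tokens_unpadded: int,
    num_tokens_padded: int,
    sp_enabled: bool,
) -> int:
    """Pick the request count that attention metadata should describe.

    Hypothetical helper: it reverts the dummy request when no tokens were
    padded, and (assumed behavior of the fix) also when SP is enabled,
    because SP padding is stripped after communication.
    """
    if num_tokens_padded == num_tokens_unpadded or sp_enabled:
        return num_reqs
    return num_reqs_padded


# SP enabled, 9 tokens padded to 12: the dummy request is dropped.
with_sp = resolve_num_reqs_padded(3, 4, 9, 12, sp_enabled=True)

# SP disabled with the same padding: the dummy request is kept.
without_sp = resolve_num_reqs_padded(3, 4, 9, 12, sp_enabled=False)

print(with_sp, without_sp)
```

Under these assumptions the request count stays aligned with the per-request state whenever the extra tokens come only from SP padding.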

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
vLLM version: v0.18.0
vLLM-Ascend version: releases/v0.18.0

Signed-off-by: Wangbj127 <wangbj1207@126.com>

2026-04-17 22:50:22 +08:00

[BugFix][310p][Cherry-pick] Handle null quantization config in ShardedStateLoader310&[Feature][310P] Support W8A8 dynamic linear method (#8296)

2026-04-16 16:53:39 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804)

2025-11-28 18:06:39 +08:00

[BugFix][0.18.0] Fix quant_bias missing in w8a8_static when flashcomm1 is enabled for GLM-5 (#8304)

2026-04-17 22:46:36 +08:00

[Feat][SP] Suport SP for VL MoE models (#7044)

2026-03-24 17:16:00 +08:00

[v0.18.0][Misc] Recompute scheduler upgrade to vLLM 0.18.0 (#7720)

2026-03-27 18:24:53 +08:00

[A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573)

2026-03-25 17:20:28 +08:00

device_allocator

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2) (#5977)

2026-01-19 08:59:46 +08:00

[BugFix][v0.18.0] require piecewise cudagraph for layerwise AscendSto… (#8282)

2026-04-16 10:40:14 +08:00

[V0.18.0][EPLB][BugFix] Fix moe_load precision in allgather (#7890)

2026-04-02 09:20:31 +08:00

upgrade to 0.18.0 (#7502)

2026-03-21 16:05:38 +08:00

[Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)

2026-03-15 17:55:42 +08:00

[ModelLoader][Feature] Add rfork support for fast model loading (#7392)

2026-03-25 16:40:30 +08:00

[BugFix][0.18.0] Fix quant_bias missing in w8a8_static when flashcomm1 is enabled for GLM-5 (#8304)

2026-04-17 22:46:36 +08:00

[BugFix][v0.18.0] fix remote KV waiting promotion in balance scheduler (#8280)

2026-04-17 10:06:36 +08:00

[BugFix][0.18.0] Fix quant_bias missing in w8a8_static when flashcomm1 is enabled for GLM-5 (#8304)

2026-04-17 22:46:36 +08:00

[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794)

2026-03-31 19:01:51 +08:00

[v0.18.0][BugFix] Fix Qwen3.5 MoE flash comm v1 shared expert shape error of mtp layer on A2 (#8004)

2026-04-13 17:36:09 +08:00

[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133)

2026-04-17 22:50:22 +08:00

Main2main upgrade to vllm 0317 afternoon (#7409)

2026-03-18 23:24:27 +08:00

__init__.py

[ModelLoader][Feature] Add rfork support for fast model loading (#7392)

2026-03-25 16:40:30 +08:00

ascend_config.py

[feat] support dispatch_v2/combine_v2 hierarchy communication (#7698)

2026-03-27 09:20:16 +08:00

ascend_forward_context.py

[Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297)

2026-03-23 17:05:02 +08:00

batch_invariant.py

[CI] Add pre-commit check for patch logger (#7446)

2026-03-19 16:53:20 +08:00

cpu_binding.py

[BugFix] Enforce C locale for CPU binding subprocess parsing (#8261)

2026-04-16 16:17:10 +08:00

envs.py

[Misc] Drop Prefetch MLP Env (#7357)

2026-03-19 14:27:27 +08:00

flash_common3_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912)

2026-01-16 20:57:46 +08:00

meta_registration.py

[Ops][Refactor] Remove custom rotary_embedding operator (#6523)

2026-02-07 09:24:05 +08:00

platform.py

[BugFix] Add async communication check for capturing mode (#8149)

2026-04-12 21:52:54 +08:00

profiling_config.py

[Core][Misc] Clean up ProfileExecuteDuration (#6461)

2026-02-01 20:06:01 +08:00

utils.py

[BugFix][DSv32] Fix DSA-CP PD role gating for deepseek v3.2 (v0.18.0) (#8291)

2026-04-17 10:05:40 +08:00