1120 Commits

Author SHA1 Message Date
LoganJane
270c5cb8cd [CI] Add nightly CI test cases for the Kimi-K2.5 (#7416)
### What this PR does / why we need it?
Add nightly CI test cases for the Kimi-K2.5.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: LoganJane <loganJane73@hotmail.com>
Signed-off-by: LoganJane <42287016+LoganJane@users.noreply.github.com>
2026-03-19 11:02:29 +08:00
pz1116
3effc4bc70 [Doc][KV Pool]Revision KV Pool User Guide (#7434)
### What this PR does / why we need it?
Revise the KV Pool user guide:
1. Revise the Mooncake environment variables and kv connector extra configs.
2. Delete `use_ascend_direct` from the kv connector extra config, as it is
deprecated.
3. Delete `kv_buffer_device` and `kv_rank` from the P2P Mooncake config
(see the sketch after this list).
4. Unify the default `max-model-len` and `max-num-batch-tokens` in the
examples given.
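
For example, a minimal launch sketch after the revision (the model path and connector name are placeholders; the remaining fields follow vLLM's `KVTransferConfig`, with the removed keys omitted):

```shell
vllm serve /path/to/model \
  --kv-transfer-config '{
    "kv_connector": "MooncakeConnector",
    "kv_role": "kv_producer",
    "kv_connector_extra_config": {}
  }'
```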

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
2026-03-19 10:13:13 +08:00
Nengjun Ma
8b79d4de52 Main2main upgrade to vllm 0317 afternoon (#7409)
### What this PR does / why we need it?

1. Fix `TypeError: get_attn_backend() remove variable`: [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)

2. Fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)

3. Fix `RuntimeError: device_allocator not a DeviceAllocator`: [Replace
memory related torch.cuda
APIs](https://github.com/vllm-project/vllm/pull/37031)

4. Fix [Support multiple KV groups in
OffloadingSpec](https://github.com/vllm-project/vllm/pull/36610), which removed
`self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar
to a tuple of per-group block sizes, adding `block_size_factor`.

5. Fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063), which renamed
`get_eagle3_aux_hidden_state_layers()` to
`get_eagle3_default_aux_hidden_state_layers()` and added a
`supports_eagle3()` guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E


- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-18 23:24:27 +08:00
jiangmengyu18
305820f1a9 [Bugfix] fix bug about model type of qwen3_vl_8b_instruct_w8a8 (#7383)
### What this PR does / why we need it?
Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
2026-03-18 20:30:03 +08:00
SparrowMu
fb8e22ec00 [DOC] MiniMax-M2.5 model intro (#7296)
### What this PR does / why we need it?
1. Add nightly test on MiniMax-M2.5 with deployment method on A3
2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-18 20:14:36 +08:00
liuhy1213-cell
58725b8b24 [doc] add Prefill-Decode Disaggregation doc for GLM5.md (#7300)
### What this PR does / why we need it?
Add a Prefill-Decode Disaggregation doc for GLM5.md.
Benchmark reference: W8A8, 65k-1.5k, concurrency 80, prefix cache 90%, TPS 2054.

- vLLM version: v0.17.0

- vLLM main:
4034c3d32e
---------
Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com>
Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>
2026-03-18 17:00:31 +08:00
zhangyiming
1c954ff264 [main2main] upgrade vllm to 0308 (#7213)
### What this PR does / why we need it?
Update main2main to vLLM 0308.
Breaking changes:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-03-18 09:24:43 +08:00
lilinsiman
8f278fc101 [eagle3][pcp] fix bug for eagle3 and cp enable (#7309)
### What this PR does / why we need it?
This PR fixes the bug when eagle3 and CP are enabled together, introduced by
the parallel speculative inference PR.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests and ut

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-17 16:14:45 +08:00
pichangping
3f39ac9c8d [Feature]Supports DSv3.1 PD separation and C8 quantization (#7222)
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>

### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the NPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode can be used, and
MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation values support dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> \
  --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
  --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot \
  --trust_remote_code True --fa_quant --dynamic --anti_method m6

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-16 22:49:05 +08:00
wangx700
22d0e1d3d7 [model_runner_v2]optimize the performance of the _topk_log_softmax_kernel (#7221)
### What this PR does / why we need it?
Optimize the performance of the Triton operator `_topk_log_softmax_kernel`
in model_runner_v2 to 1.04x H100, which is 7% of its original value (issue
https://github.com/vllm-project/vllm-ascend/issues/5208).

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangx700 <wangxin700@huawei.com>
2026-03-16 16:49:10 +08:00
rjg-lyh
4d443b9228 [bugfix] restore pr-7029 and fix patch error (#7294)
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.

The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.

This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.

### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.
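
For reference, a minimal launch sketch (the model path is a placeholder; the option name and value follow the description above):

```shell
vllm serve /path/to/dsv3.2-w8a8c8 \
  --additional-config '{"enable_sparse_c8": "true"}'
```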

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-16 15:39:42 +08:00
zhaomingyu13
9320365dab [Test][Feature] Add e2e test for QuaRot model with eagle3 (#7128)
### What this PR does / why we need it?
Add an e2e test for the QuaRot model with eagle3 that runs both the QuaRot
model and the float model, and then compares their acceptance rates. The
QuaRot model was adapted for eagle3 in PRs #6914 and #7038.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-16 15:35:55 +08:00
LICO67373
71c21f76f5 [Refactor] Replace npu_ring_mla with FIA in MLA prefill (#5704)
### What this PR does / why we need it?

**Refactor: Replace npu_ring_mla with FIA in MLA prefill**

This PR refactors the MLA (Multi-head Latent Attention) prefill implementation
by replacing `npu_ring_mla` with `npu_fused_infer_attention_score` (FIA)
operator, unifying the attention backend with the standard attention
implementation.

**Key changes:**

1. **Core prefill refactoring (`mla_v1.py`)**
   - Replace `npu_ring_mla` with `npu_fused_infer_attention_score` in `_forward_prefill` and `_compute_prefill_context`
   - Use TND layout with `softmax_lse_flag=True` for prefill attention
   - Use `npu_attention_update` to merge multiple chunk outputs with LSE (Log-Sum-Exp)
   - Change `attn_mask` from `get_final_mla_mask()` to `get_splitfuse_attn_mask()` for FIA compatibility

2. **Data type handling**
   - Add automatic float16 → bfloat16 conversion (FIA with TND layout only supports bfloat16)
   - Convert output back to original dtype after FIA computation

3. **Metadata optimization**
   - Pre-calculate `actual_seq_lengths_q` in `AscendMLAPrefillMetadata`
   - Pre-calculate `chunk_actual_seq_lengths_kv_list` in `ChunkedContextMetadata`
   - Move `torch.cumsum` operations from forward pass to metadata building phase

4. **CP compatibility (`mla_cp.py`)**
   - Add `_ring_mla_mask_builder` to get `npu_ring_mla`-compatible masks for Context Parallel scenarios
   - Add `chunk_actual_seq_lengths_kv_list` field to `CPChunkedContextMetadata`

**Why we need it:**
- **Backend unification**: Aligns MLA prefill with standard attention
implementation (`attention_v1.py`)
- **Better chunked context support**: FIA + `npu_attention_update`
provides native LSE-based output merging (see the sketch below)
- **Future compatibility**: Prepares for eventual `npu_ring_mla` removal
across the codebase
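
For context, a minimal PyTorch sketch of how LSE-based merging of per-chunk outputs works (this illustrates the math only and is not the `npu_attention_update` implementation):

```python
import torch

def merge_chunk_outputs(o1, lse1, o2, lse2):
    # o1, o2: partial attention outputs over two disjoint KV chunks,
    #         shape [num_tokens, num_heads, head_dim]
    # lse1, lse2: per-(token, head) log-sum-exp of the attention logits,
    #         shape [num_tokens, num_heads]
    lse = torch.logaddexp(lse1, lse2)          # merged log-sum-exp
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)   # renormalization weight of chunk 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)   # renormalization weight of chunk 2
    # Weighted sum equals attention computed over the full KV range.
    return w1 * o1 + w2 * o2, lse
```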

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes - same
behavior, unified backend.

---
- Related issue: #5463 (item 7)
- vLLM version: v0.14.1

Signed-off-by: lico67373 <918688502@qq.com>
2026-03-16 10:33:09 +08:00
pppeng
7e85f2ff97 [CI] Add test_qwen3_5.py (#7133)
### What this PR does / why we need it?
Add test_qwen3_5.py for base scenarios tp4 on Qwen3.5-27B and
Qwen3.5-35B-A3B.

- vLLM version: main
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 22:19:02 +08:00
Mengqing Cao
0c299f79b9 Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)" (#7288)
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduced an issue where the patch doesn't work with the recompute scheduler
enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-15 20:19:09 +08:00
yupeng
29f195a91c [Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)
### What this PR does / why we need it?
Fix the error reported while initializing the qwen3-reranker-0.6b model
with `--enable-lora`, and add a test case to verify the fix.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 17:55:42 +08:00
Angazenn
ce5544bfc1 [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103)
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` combined with PD disaggregation is still not
supported in vLLM v0.17.0 (see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvements in the future.
3. `--mamba-cache-mode align` requires copying mamba states during
forward steps. vLLM uses a Triton kernel to implement this. However, the
original version runs into some bugs on Ascend hardware, so we patch a
new Triton kernel to avoid them.

### Does this PR introduce _any_ user-facing change?
To use the mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function (see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch-native version, so it may cause trouble for
users who cannot use Triton.
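
For example, a minimal launch sketch (the model path and parallel size are placeholders):

```shell
vllm serve /path/to/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --mamba-cache-mode align
```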

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-15 09:44:09 +08:00
Cao Yi
5ec610e832 [Feature][Quant] Reapply auto-detect quantization format and support remote model ID (#7111)
### What this PR does / why we need it?
Reapply the auto-detect quantization format feature (originally in
#6645, reverted in #6873) and extend it to support remote model
identifiers (e.g., `org/model-name`).

Changes:
- Reapply auto-detection of quantization method from model files
(`quant_model_description.json` for ModelSlim, `config.json` for
compressed-tensors)
- Add `get_model_file()` utility to handle file retrieval from both
local paths and remote repos (HuggingFace Hub / ModelScope)
- Update `detect_quantization_method()` to accept remote repo IDs with
optional `revision` parameter
- Update `maybe_update_config()` to work with remote model identifiers
- Add platform-level `auto_detect_quantization` support
- Add unit tests and e2e tests for both local and remote model ID
scenarios

Closes #6836

### Does this PR introduce _any_ user-facing change?

Yes. When `--quantization` is not explicitly specified, vllm-ascend will
now automatically detect the quantization format from the model files
for both local directories and remote model IDs.
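
For example, a minimal sketch (the model ID is a placeholder):

```shell
# --quantization is omitted; the format is auto-detected from
# quant_model_description.json (ModelSlim) or config.json (compressed-tensors).
vllm serve org/model-name
```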

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-13 22:53:25 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main.
- Fix the eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552, mainly by using
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0 fixed by a later PR,
https://github.com/vllm-project/vllm/pull/30515.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
rjg-lyh
7ed9e9de69 [Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)
### What this PR does / why we need it?
This PR mainly supports W8A8C8 in dsv3.2/glm5 with the lightning_indexer_quant
ops in the pd-mix stage.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.

The next steps are planned as follows:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
kx
df1ee8070d [feat][spec decode]Unified draft parallel (#6766)
### What this PR does / why we need it?
Implement a unified parallelized speculative decoding in vLLM
Ascend, which can simultaneously support parallel speculative inference
schemes such as PARD and P-Eagle; refer to
https://github.com/vllm-project/vllm-ascend/pull/6565 and
https://github.com/vllm-project/vllm-ascend/pull/4078

### How was this patch tested?

run with parallel drafting script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

base script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811

benchmark script:
MAX_CONCURRENCY=1
NUM_PROMPTS=80
vllm bench serve --port 8811 \
    --temperature 0 \
    --model /model/Llama-3.1-8B-Instruct \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts ${NUM_PROMPTS} \
    --max-concurrency ${MAX_CONCURRENCY} \
    --seed 1234

Test results:
base (without spec decode): TTFT 79.46 ms, TPOT 26.99 ms,
output token throughput 36.75 tok/s
this PR (with parallel drafting): TTFT 72.24 ms, TPOT 13.45 ms,
output token throughput 72.98 tok/s
per-position acceptance (from position 0 to 7):
79.48%, 56.93%, 40%, 27.90%, 19.79%, 14.25%, 10.57%, 7.61%.

----------------------------------------------------------------------
Script to run on a Qwen3 model:
export target=/model/Qwen3-1.7B
export draft=/model/PARD-Qwen3-0.6B
export CUDA_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=1

vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

cc  @NickJudyHvv
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
2026-03-13 14:07:35 +08:00
Ronald
c980e68d40 [Feature] support aclgraph for model runner v2 (#7110)
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2; please see RFC
#5208. The PR contains these modifications:
- adapt to the newest commit of the vLLM main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for the main model.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-13 09:11:46 +08:00
Li Wang
7fe0469e27 [CI][Misc] Use offline mode for model downloads (#7179)
### What this PR does / why we need it?
1. For all parts of the current test modules that involve model
downloads, add the `local_files_only` parameter to specify offline
mode; this ensures that CI will not fail due to network instability.
2. Install modelscope from a fixed commit until its next release.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Check whether the env or arg `local_files_only` works:
1) set the env:
```shell
export HF_HUB_OFFLINE=1
```
2) run the script
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()

model="Qwen/Qwen3-0.6B"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
it works well:
```shell
2026-03-06 06:40:12,546 - modelscope - WARNING - We can not confirm the cached file is for revision: master
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
{'architectures': ['Qwen3ForCausalLM'], 'attention_bias': False, 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_size': 1024, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 40960, 'max_window_layers': 28, 'model_type': 'qwen3', 'num_attention_heads': 16, 'num_hidden_layers': 28, 'num_key_value_heads': 8, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000, 'sliding_window': None, 'tie_word_embeddings': True, 'torch_dtype': 'bfloat16', 'transformers_version': '4.51.0', 'use_cache': True, 'use_sliding_window': False, 'vocab_size': 151936, '_commit_hash': None}
```
3) test the model repo does not cached locally when the env
`HF_HUB_OFFLINE`==True
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()


model="FireRedTeam/FireRed-OCR"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
and the result is as expected:
```shell
  File "/workspace/demo.py", line 12, in <module>
    config_dict, _ = PretrainedConfig.get_config_dict(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 189, in patch_get_config_dict
    model_dir = get_model_dir(pretrained_model_name_or_path,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 164, in get_model_dir
    model_dir = snapshot_download(
                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 137, in snapshot_download
    return _snapshot_download(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 283, in _snapshot_download
    raise ValueError(
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable look-ups and downloads online, set 'local_files_only' to False
```
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-13 08:52:24 +08:00
drslark
de93790d08 [main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158)
### What this PR does / why we need it?

The merged graph of the drafter in `FULL` mode is currently broken.

This PR fixes it.

Also, `actual_seq_lengths_q` in `model_runner` was found to be redundant, so
it has been removed.

It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and
https://github.com/vllm-project/vllm-ascend/pull/7148.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Test code is shown as below:

```python
prompts = [
    "1.Who are you?",
    "2. Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
    model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_num_seqs=32,
    # enforce_eager=True,
    disable_log_stats=False,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    async_scheduling=True,

    speculative_config={
        "enforce_eager": True,
        "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
        "disable_padded_drafter_batch": False,
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
    
    compilation_config={
        "cudagraph_mode": "FULL",
        "cudagraph_num_of_warmups": 1,
    },

    max_model_len=4096, 
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
```

The result before:

```text
   File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia
     graph_params.events[num_tokens].append(event)
     ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
 KeyError: 132
```

The result after:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 242
num_draft_tokens: 726
num_accepted_tokens: 156
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.07
```

We also test `FULL_DECODE_ONLY` mode.

The result is:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 244
num_draft_tokens: 732
num_accepted_tokens: 155
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.06
```

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-12 18:38:50 +08:00
Li Wang
0a171b5cdd [Test][BugFix] Fix dispatch_gmm_combine_decode test stability (#7097)
### What this PR does / why we need it?
This patch fixes the nightly failure:
1. Each case uses a copy of the global kwargs instead of a reference to
prevent parameter pollution between use cases (see the sketch after this list).
2. Add weight initialization in the scenario of `eplb` + `w8a8_dynamic`.
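
A minimal sketch of the copy-instead-of-reference pattern (names here are illustrative, not the actual test code):

```python
import copy

GLOBAL_KWARGS = {"dtype": "bfloat16", "num_experts": 8}  # shared defaults

def build_case_kwargs(overrides: dict) -> dict:
    # Deep-copy the shared defaults so per-case mutations cannot leak into
    # other test cases ("parameter pollution").
    kwargs = copy.deepcopy(GLOBAL_KWARGS)
    kwargs.update(overrides)
    return kwargs
```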

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```python
pytest -sv tests/e2e/nightly/single_node/ops/multicard_ops_a3/test_dispatch_gmm_combine_decode.py
```

```shell
===================================================================== 3 passed, 4 warnings in 194.86s (0:03:14) ======================================================================
```
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 17:22:44 +08:00
XiaoxinWang
37d1bd8c50 fixed fia pad logic in graph mode. (#7144)
### What this PR does / why we need it?
related to vllm PR #34043 this pr delete func
‘relax_for_mixed_batch_cudagraphs’, num_reqs no longer equals the actual
number of requests, due to fia operator requires that
query_start_loc[-1] equals the total number of computed tokens, so this
func delete cause the ifa error.
In full graph mode, set num_reqs_paded = num_reqs to fix the error
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-12 14:50:54 +08:00
meihanc
da01a74009 Revert "[CI] fix skiped e2e test when upgrade vllm version (#6654)" (#7166)
This reverts commit f6db47f103.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-11 23:03:15 +08:00
shiyuan680
3b6b3c4214 [MODELRUNNERV2]fix penality ops (#7013)
### What this PR does / why we need it?
Fix the penalty ops for the new version, achieving a 10% performance
improvement.

### How was this patch tested?
pytest
‎tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shiyuan680 <917935075@qq.com>
2026-03-11 17:13:34 +08:00
yupeng
830f39dd70 [Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650)
### What this PR does / why we need it?
Fix issue #6143.

### Does this PR introduce _any_ user-facing change?
Allow starting the server with `--enable-lora && --fully-sharded-loras
&& --tensor_parallel_size 2`.

### How was this patch tested?
pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-11 15:43:15 +08:00
zhangxinyuehfad
67d40f23fd [CI]Upgrade nightly multi-node-tests max-parallel to 2 (#7035)
### What this PR does / why we need it?

1. Increase nightly multi-node test max-parallel from 1 to 2, and fix
resource conflicts that arise when tests run concurrently.
2. Fix parse-trigger job: Add an if condition so it only runs on
schedule, workflow_dispatch, or PRs labeled nightly-test
3. Adjust nightly schedule: Shift trigger time from 24:00 to 23:45
(UTC+8)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-10 16:25:51 +08:00
pu-zhe
5df450bca4 [Feat] [310p] Support w8a8sc quantization method (#7075)
### What this PR does / why we need it?
New quantization method: introduced support for the W8A8SC static linear
quantization scheme specifically for 310P hardware, enabling more
efficient model compression.
Refactored save_sharded_state_310.py to avoid a multi-process issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
W8A8SC quant E2E test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-10 16:13:20 +08:00
Li Wang
33234aa0c5 Revert "[Feature][Quant] Auto-detect quantization format from model f… (#6873)
This reverts commit 3953dcf784 to keep
the basic functions available.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-10 11:27:32 +08:00
yupeng
40f7d93f1a [bugfix][LoRA] Fix the lora accuracy issue introduced by the upstream vLLM changed. (#6958)
### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue that was introduced by the upstream PR
https://github.com/vllm-project/vllm/pull/32005

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
2026-03-10 10:43:18 +08:00
meihanc
f6db47f103 [CI] fix skiped e2e test when upgrade vllm version (#6654)
### What this PR does / why we need it?
Fix the skipped test_aclgraph_capture_replay.py when upgrading the vLLM version.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-10 09:55:35 +08:00
SILONG ZENG
43df2cb2fc [Lint]Style: Convert test/ to ruff format(Batch #1) (#6738)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `tests/e2e/310p/multicard/test_vl_model_multicard.py` |
| `tests/e2e/310p/singlecard/test_vl_model_singlecard.py` |
| `tests/e2e/310p/test_utils.py` |
| `tests/e2e/conftest.py` |
| `tests/e2e/model_utils.py` |
| `tests/e2e/models/conftest.py` |
| `tests/e2e/models/test_lm_eval_correctness.py` |
| `tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py` |
| `tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py` |
| `tests/e2e/multicard/2-cards/test_data_parallel.py` |
| `tests/e2e/multicard/2-cards/test_disaggregated_encoder.py` |
| `tests/e2e/multicard/2-cards/test_expert_parallel.py` |
| `tests/e2e/multicard/2-cards/test_external_launcher.py` |
| `tests/e2e/multicard/2-cards/test_full_graph_mode.py` |
| `tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py` |
| `tests/e2e/multicard/2-cards/test_offline_inference_distributed.py` |
| `tests/e2e/multicard/2-cards/test_offline_weight_load.py` |
| `tests/e2e/multicard/2-cards/test_pipeline_parallel.py` |
| `tests/e2e/multicard/2-cards/test_prefix_caching.py` |
| `tests/e2e/multicard/2-cards/test_quantization.py` |
| `tests/e2e/multicard/2-cards/test_qwen3_moe.py` |
| `tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py` |
| `tests/e2e/multicard/2-cards/test_qwen3_performance.py` |
| `tests/e2e/multicard/2-cards/test_shared_expert_dp.py` |
| `tests/e2e/multicard/2-cards/test_single_request_aclgraph.py` |
| `tests/e2e/multicard/2-cards/test_sp_pass.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-10 09:52:50 +08:00
ZT-AIA
ee5347e824 [qwen3 next ]add ascend c casual_conv1d_fn (#6661)
### What this PR does / why we need it?
add ascend c casual_conv1d_fn

- vLLM version: v0.15.0
- vLLM main:
13397841ab
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-09 23:29:49 +08:00
Hexiang Wang
48b624e4cc [BugFix] Fix implementation bug of triton rope_siso (#7082)
### What this PR does / why we need it?
The previous implementation of the Triton rope_siso kernel was missing the
store of the second half of the RoPE results, which resulted in:

1. an accuracy problem in the neox-style scenario
2. a UB overflow in the non-neox-style scenario

This PR fixes it and supplements a nightly test case for it.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-09 23:08:43 +08:00
Qiu
13adcbe44b feat(attention_cp): support chunked prefill for Qwen3Next with PCP&DCP (#6900)
### What this PR does / why we need it?
Support chunked prefill for Qwen3Next with PCP&DCP

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-09 17:55:09 +08:00
LeeWenquan
65eae6de7b Add Ascend Ops recurrent_gated_delta_rule (#6725)
### What this PR does / why we need it?
Change the recurrent_gated_delta_rule op from Triton to an Ascend C version
for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-03-09 14:14:14 +08:00
JIACHENG XU
23bf5d4d48 [EPLB][bugfix] Bugfix for fused mc2 (#6794)
### What this PR does / why we need it?
This pull request addresses a bug related to the fused mc2 functionality
within the EPLB (Expert Parallelism Load Balancing) system, specifically
impacting quantization and MoE communication.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: Spicy-Stick <873805887@qq.com>
Signed-off-by: root <root@localhost.localdomain>
2026-03-09 11:26:57 +08:00
ZhaoJiangJiang
a51d6366b9 [Bugfix] Qwen3Next support FlashComm1 (#6830)
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP) and resolve precision problems in shared_out when FlashComm1
is enabled.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
2026-03-06 17:14:08 +08:00
Zetong Li
a2696006d1 [Refactor][EAGLE] 8/N delete mtp_proposer (re-pull) (#7033)
### What this PR does / why we need it?
**NOTE: This PR is a re-pull of #7016, since CI mistakenly marked the
unfinished PR as having passed.**

This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, it is now OK to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-06 17:11:22 +08:00
Fager10086
c5dfa8d645 [OPS]add split_qkv_rmsnorm_mrope ops (#6730)
### What this PR does / why we need it?
This PR adds a split_qkv_rmsnorm_mrope kernel with interleaved mRoPE support
for Qwen3.5 and Qwen3-VL to improve performance.

### Does this PR introduce _any_ user-facing change?
No.

### How to use?
```python
real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope(
            qkv=qkv,
            q_weight=q_weight,
            k_weight=k_weight,
            cos_sin=cos_sin,
            num_q_heads=num_q_heads,
            num_kv_heads=num_kv_heads,
            head_size=head_size,
            eps=eps,
            mrope_section=mrope_section,
            is_interleaved=is_interleaved,
            rope_dim=rope_dim,
            has_gate=has_gate,
    )
```
### How was this patch tested?
- vLLM version: v0.16.0
- Accuracy test script:
```shell
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py
```

---------

Signed-off-by: Fager <865071616@qq.com>
Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com>
Signed-off-by: fager <865071616@qq.com>
2026-03-06 16:18:37 +08:00
xiaocongtou6
bc0fd7ca72 [Feat]Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for DeepSeek v3.2. (#6940)
### What this PR does / why we need it?
Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for
DeepSeek v3.2.

### How was this patch tested?
Test output:

{"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":"
the head of state and head of government of the United States,
indirectly elected to a four-year term by the American people through
the Electoral College. The officeholder leads the executive branch of
the federal government and is the commander-in-chief of the United
States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":"
Paris. This is the largest city in France and its main political,
cultural and commercial center. The modern location of the city is the
north of the central part of the country, on the banks of the Seine
River Seine River Seine in
3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":"
now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and
artificial intelligence (AI) is at the forefront of this transformation.
From self-driving cars to virtual assistants, AI is already making a
significant impact on our daily
lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":"
a 3rd year student at the University of Lincoln studying Media
Production. This blog is about my work throughout my final year on the
course.\n\n## Tuesday 3 May 2016\n### Final Major Project -
Evaluation\n\nFor my final project
I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null}

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: xiaocongtou6 <2066962956@qq.com>
Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>
2026-03-06 16:10:24 +08:00
wanghengkang
c49ce18ea5 [Test] Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p (#6977)
### What this PR does / why we need it?
Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: gcw_61wqY8cy <wanghengkang1@huawei.com>
2026-03-06 14:25:10 +08:00
wangxiyuan
16c3b0b822 Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030)
Reverts vllm-project/vllm-ascend#7016, as it breaks the E2E test.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
2026-03-06 11:24:05 +08:00
frank
18b52afe2b [Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827)
### What this PR does / why we need it?

This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a
new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the
prefill stage (i.e., large batch sizes). The implementation now
dynamically selects between the existing decode kernel and the new
prefill kernel based on the batch size, which improves performance for
large batch scenarios.

Additionally, the RoPE implementation is updated to support partial
rotation dimensions (`rope_dim`), making the operator more flexible.

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization and is not expected to introduce
any user-facing changes.

### How was this patch tested?

CI should pass with existing tests. The new prefill path is triggered
when the batch size is larger than the number of available vector cores.
The partial RoPE feature can be tested by passing the `rope_dim`
argument.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: guzhiyong <guzhiyong5@h-partners.com>
Signed-off-by: frank <2547457096@qq.com>
Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>
2026-03-06 09:30:31 +08:00
Zetong Li
a60e179c7f [Refactor][EAGLE] 8/N delete mtp_proposer (#7016)
### What this PR does / why we need it?
This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, it is now OK to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-06 09:10:57 +08:00
SILONG ZENG
bd571cf6d6 [Main2Main] Upgrade vLLM to 0303 (#6944)
### What this PR does / why we need it?
Breaking changes:
- https://github.com/vllm-project/vllm/pull/34102 
Disable_full param replaced with valid_modes/invalid_modes API
- https://github.com/vllm-project/vllm/pull/35503
Now must return float compilation_time
- https://github.com/vllm-project/vllm/pull/35564
New sequence_lengths param added
- https://github.com/vllm-project/vllm/pull/33807
A check was performed (if runner_backend != "auto")
- https://github.com/vllm-project/vllm/pull/34861
`BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to
check process group state
- https://github.com/vllm-project/vllm/pull/35274

**Important change:**
- https://github.com/vllm-project/vllm/pull/28672

`matcher_utils` directly accesses `torch.ops._C.*` during the import
phase. In the Ascend environment, some unregistered ops trigger
`AttributeError`, causing e2e initialization failure.

https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323

https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29

This PR adds temporary compatibility placeholders (rms_norm,
fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant,
silu_and_mul) to
`vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to
ensure no crashes during the import phase. Upstream repairs will be
considered later.
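
A hypothetical sketch of what such a placeholder might look like (the op name and schema here are assumptions, not the actual patch):

```python
import torch
from torch.library import Library

# Register a schema-only definition so that `torch.ops._C.rms_norm` resolves
# at import time; no kernel is attached, since the fusion matcher only needs
# the attribute to exist on Ascend.
_compat_lib = Library("_C", "FRAGMENT")
_compat_lib.define(
    "rms_norm(Tensor! result, Tensor input, Tensor weight, float epsilon) -> ()"
)
```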

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
2026-03-06 09:08:52 +08:00
Cao Yi
50441e4650 [BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835)
## Problem
When MTP is enabled, prefill requests with `prompt_tokens ==
num_spec_tokens + 1` are incorrectly classified as decode requests,
causing accuracy issues.

## Root Cause
The `uniform_decode` condition only checked:
- `max_num_scheduled_tokens == uniform_decode_query_len`
- `num_tokens == max_num_scheduled_tokens * num_reqs`

This is insufficient because a prefill request with a specific prompt
length satisfies these conditions as well.

## Fix
Add `is_all_decode` check to ensure all requests have
`num_computed_tokens > 0` before classifying as uniform decode, since
decode requests must have computed at least one token.
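
A minimal sketch of the tightened classification (a simplified illustration of the check described above, not the actual model-runner code):

```python
def is_uniform_decode(num_scheduled_tokens: list[int],
                      num_computed_tokens: list[int],
                      uniform_decode_query_len: int) -> bool:
    """Treat the batch as uniform decode only if every request has already
    computed at least one token; a pure prefill has num_computed_tokens == 0."""
    num_reqs = len(num_scheduled_tokens)
    num_tokens = sum(num_scheduled_tokens)
    max_num_scheduled_tokens = max(num_scheduled_tokens)
    is_all_decode = all(n > 0 for n in num_computed_tokens)
    return (max_num_scheduled_tokens == uniform_decode_query_len
            and num_tokens == max_num_scheduled_tokens * num_reqs
            and is_all_decode)
```
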
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-05 17:33:10 +08:00