Commit Graph

2615 Commits

Author SHA1 Message Date
zhaomingyu13
9320365dab [Test][Feature] Add e2e test for QuaRot model with eagle3 (#7128)
### What this PR does / why we need it?
Add an e2e test for QuaRot model with eagle3 that runs both the QuaRot
model and the float model, and then compares their acceptance rates. The
QuaRot model adapting eagle3 PR(#6914, #7038)

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-16 15:35:55 +08:00
LICO67373
71c21f76f5 [Refactor] Replace npu_ring_mla with FIA in MLA prefill (#5704)
### What this PR does / why we need it?

**Refactor: Replace npu_ring_mla with FIA in MLA prefill**

This PR refactors the MLA (Multi-Layer Attention) prefill implementation
by replacing `npu_ring_mla` with `npu_fused_infer_attention_score` (FIA)
operator, unifying the attention backend with the standard attention
implementation.

**Key changes:**

1. **Core prefill refactoring (`mla_v1.py`)**
- Replace `npu_ring_mla` with `npu_fused_infer_attention_score` in
`_forward_prefill` and `_compute_prefill_context`
   - Use TND layout with `softmax_lse_flag=True` for prefill attention
- Use `npu_attention_update` to merge multiple chunk outputs with LSE
(Log-Sum-Exp)
- Change `attn_mask` from `get_final_mla_mask()` to
`get_splitfuse_attn_mask()` for FIA compatibility

2. **Data type handling**
- Add automatic float16 → bfloat16 conversion (FIA with TND layout only
supports bfloat16)
   - Convert output back to original dtype after FIA computation

3. **Metadata optimization**
   - Pre-calculate `actual_seq_lengths_q` in `AscendMLAPrefillMetadata`
- Pre-calculate `chunk_actual_seq_lengths_kv_list` in
`ChunkedContextMetadata`
- Move `torch.cumsum` operations from forward pass to metadata building
phase

4. **CP compatibility (`mla_cp.py`)**
- Add `_ring_mla_mask_builder` to get `npu_ring_mla`-compatible masks
for Context Parallel scenarios
- Add `chunk_actual_seq_lengths_kv_list` field to
`CPChunkedContextMetadata`

**Why we need it:**
- **Backend unification**: Aligns MLA prefill with standard attention
implementation (`attention_v1.py`)
- **Better chunked context support**: FIA + `npu_attention_update`
provides native LSE-based output merging
- **Future compatibility**: Prepares for eventual `npu_ring_mla` removal
across the codebase

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes - same
behavior, unified backend.

---
- Related issue: #5463 (item 7)
- vLLM version: v0.14.1

Signed-off-by: lico67373 <918688502@qq.com>
2026-03-16 10:33:09 +08:00
Mengqing Cao
e20f0b1a0d [ReleaseNote] Add release note for v0.17.0rc1 (#7240)
### What this PR does / why we need it?
This pull request adds the release notes for `v0.17.0rc1`. It also
updates version numbers across various documentation files, including
`README.md`, `README.zh.md`,
`docs/source/community/versioning_policy.md`, and `docs/source/conf.py`
to reflect the new release.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
2026-03-15 22:47:47 +08:00
pppeng
7e85f2ff97 [CI] Add test_qwen3_5.py (#7133)
### What this PR does / why we need it?
Add test_qwen3_5.py for base scenarios tp4 on Qwen3.5-27B and
Qwen3.5-35B-A3B.

- vLLM version: main
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 22:19:02 +08:00
Mengqing Cao
0c299f79b9 Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)" (#7288)
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduces an issue that the patch doesn't work with recompute scheduler
enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-15 20:19:09 +08:00
yupeng
29f195a91c [Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)
### What this PR does / why we need it?
Fix the error that reports while initializing qwen3-reranker-0.6b model
with `--enable-lora`.
And add a testcase to verify the fix.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 17:55:42 +08:00
Qiu
7daccf4b64 Perf(PP): support PP with async send/recv. (#7143)
### What this PR does / why we need it?
Follow up the PR https://github.com/vllm-project/vllm/pull/33368, this
PR provides async send/recv support for PP in vllm-ascend.

---
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-15 09:45:09 +08:00
Angazenn
ce5544bfc1 [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103)
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` && PD disaggregation is still not
supported yet in vLLM v0.17.0(see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvements in the future.
3. `--mamba-cache-mode align` requires to copy mamba states during
forward steps. vLLM uses a triton kernel to implement it. However, the
original version run into some bugs on Ascend hardwares. Thus we patch a
new triton kernel to avoid this bug.

### Does this PR introduce _any_ user-facing change?
To use mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function(see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch native version, thus it might have trouble if
users can't use triton.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-15 09:44:09 +08:00
bazingazhou233-hub
c69291eefc [Doc] Add USE_MODELSCOPE_HUB=0 to lm-eval guide (#7279)
## Summary
- Add `USE_MODELSCOPE_HUB=0` to both Online and Offline lm-eval sections
- Add explanatory notes about Docker containers launching with
`VLLM_USE_MODELSCOPE=True`

The Docker containers set `VLLM_USE_MODELSCOPE=True`, which causes
lm-eval to download datasets from ModelScope instead of HuggingFace,
resulting in "Repo not exists" errors. Setting `USE_MODELSCOPE_HUB=0`
disables this behavior.

Fixes #607
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
2026-03-14 22:41:02 +08:00
bazingazhou233-hub
9e6c547d98 [Doc] Replace deprecated full_cuda_graph with cudagraph_mode in Qwen2.5-Omni (#7286)
## Summary
- Replace `full_cuda_graph: 1` with `cudagraph_mode: FULL_DECODE_ONLY`
in both single-NPU and multi-NPU examples
- `full_cuda_graph` is deprecated and falls back to `NONE` on NPU

Fixes #4696
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
2026-03-14 22:38:36 +08:00
NJX
bb506a1c99 [Doc][Installation] Clarify SOC_VERSION for CPU-only source builds (#7278)
### What this PR does / why we need it?
- Clarify that `SOC_VERSION` must be set when building from source in a
CPU-only environment where `npu-smi` is unavailable.
- Add concrete `SOC_VERSION` examples (A2/A3/300I/A5) and point users to
`Dockerfile*` defaults.
- Improve the `setup.py` error message so users get actionable guidance
when `SOC_VERSION` is missing.

Fixes #6816.

### Does this PR introduce _any_ user-facing change?
- Yes. Documentation is updated and the build-time error message is more
informative.

### How was this patch tested?
- (Local) Syntax check: `python -m compileall setup.py`.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-14 22:38:25 +08:00
DreamerLeader
199df03524 [BugFix]Fix CI errors “ascend_transport.so: cannot open shared object file: No such file or directory” (#7242)
### What this PR does / why we need it?
Conditional Import for Mooncake: The import of
mooncake.engine.TransferEngine was moved into a try-except block within
the GlobalTE class's constructor. This ensures that mooncake is only
imported when needed and provides a clear error message with
installation instructions if it's missing.
### Does this PR introduce _any_ user-facing change?
The error message "ascend_transport.so: cannot open shared object file:
No such file or directory" in the CI is fixed to ensure the normal
running of the CI.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
2026-03-14 21:23:05 +08:00
Mengqing Cao
e7aa2c285c [SpecDecode] Fix Draft model proposer (#7230)
### What this PR does / why we need it?
This pr fix the Unified draft parallel feature. 
1. In Draft model proposer, there are exceed 1 attention layers in
target model, thus removing the assertion on layer number.
2. we should get block size through `draft_attn_groups` instead of
`attn_metadata_builder` after 0.17.0.
3. `attn_update_stack_num_spec_norm` shouldn't be done when unified
draft parallel is enabled

### How was this patch tested?
Test pass with
`tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`,
which is already included in CI

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-14 18:26:37 +08:00
Hexiang Wang
0ad52517a1 Revert "Refactor quantization layer name mapping to leverage vLLM built-in mappers" (#7237)
Reverts vllm-project/vllm-ascend#7050, which breaks kimi-k2.5 and
qwen-omin.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
2026-03-14 00:05:54 +08:00
Cao Yi
5ec610e832 [Feature][Quant] Reapply auto-detect quantization format and support remote model ID (#7111)
### What this PR does / why we need it?
Reapply the auto-detect quantization format feature (originally in
#6645, reverted in #6873) and extend it to support remote model
identifiers (e.g., `org/model-name`).

Changes:
- Reapply auto-detection of quantization method from model files
(`quant_model_description.json` for ModelSlim, `config.json` for
compressed-tensors)
- Add `get_model_file()` utility to handle file retrieval from both
local paths and remote repos (HuggingFace Hub / ModelScope)
- Update `detect_quantization_method()` to accept remote repo IDs with
optional `revision` parameter
- Update `maybe_update_config()` to work with remote model identifiers
- Add platform-level `auto_detect_quantization` support
- Add unit tests and e2e tests for both local and remote model ID
scenarios

Closes #6836

### Does this PR introduce _any_ user-facing change?

Yes. When `--quantization` is not explicitly specified, vllm-ascend will
now automatically detect the quantization format from the model files
for both local directories and remote model IDs.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-13 22:53:25 +08:00
Junyuan
6852a2e267 [feat] add LMCacheAscendConnector (#6882)
### What this PR does / why we need it?

LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.

We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.

### Does this PR introduce _any_ user-facing change?

Users can specify the kvconnector using `--kv-transfer-config`, allowing
them to freely choose which kvconnector to use, without any user-facing
change.

### How was this patch tested?

Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: chloroethylene <jjysama@gmail.com>
2026-03-13 17:41:35 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
rjg-lyh
7ed9e9de69 [Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)
### What this PR does / why we need it?
This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops
in pd-mix stage mainly.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.

The next steps are planned in two parts:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
kx
df1ee8070d [feat][spec decode]Unified draft parallel (#6766)
### What this PR does / why we need it?
Implement a unified parallelized speculative decoding in VLLM
Ascend,which can simultaneously support parallel speculative inference
schemes such as Pard, P-Eagle, etc. refer to
https://github.com/vllm-project/vllm-ascend/pull/6565 and
https://github.com/vllm-project/vllm-ascend/pull/4078

### How was this patch tested?

run with parallel drafting script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

base script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811

benchmark script:
MAX_CONCURRENCY=1
NUM_PROMPTS=80
vllm bench serve --port 8811 \
    --temperature 0 \
    --model /model/Llama-3.1-8B-Instruct \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts ${NUM_PROMPTS} \
    --max-concurrency ${MAX_CONCURRENCY} \
    --seed 1234

test results :
base(without spec decode): TTFT 79.46ms TPOT 26.99ms
output_tokens_throughput 36.75 tok/s
this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms
output_tokens_throughput 72.98 tok/s
per-position acceptance(from position 0 to 7):
79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%.

----------------------------------------------------------------------
run on qwen3 model script :
export target=/model/Qwen3-1.7B
export draft=/model/PARD-Qwen3-0.6B
export CUDA_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=1

vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

cc  @NickJudyHvv
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
2026-03-13 14:07:35 +08:00
pppeng
6ee7ffb98a Add Qwen3_5 to model list (#7130)
### What this PR does / why we need it?
The pr aims to add new models like Qwen3.5-35B-A3B/Qwen3.5-27B to model
list for testing.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
2026-03-13 11:42:28 +08:00
Qiu
c377e73933 Perf(PP): support PP with async scheduling. (#7136)
### What this PR does / why we need it?
Follow up the PR https://github.com/vllm-project/vllm/pull/32618, this
PR provides async scheduling support for PP in vllm-ascend.
---
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-13 10:27:23 +08:00
Ronald
c980e68d40 [Feature] support aclgraph for model runner v2 (#7110)
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2, please see RFC
#5208. The PR contains these modifications:
- adapt to newest commit of vllm main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for main model. 

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-13 09:11:46 +08:00
Li Wang
1f71da80eb [CI] Fix server start failure when long weight loading (#7098)
### What this PR does / why we need it?
When loading large models (e.g., 163 shards), weight loading can exceed
the default 600s timeout. Engine startup timeout with the error:
```shell
TimeoutError: Timed out waiting for engines to send initial message on input socket.
```
We should increase the `VLLM_ENGINE_READY_TIMEOUT_S ` to avoid it
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-13 08:52:56 +08:00
Li Wang
7fe0469e27 [CI][Misc] Use offline mode for model downloads (#7179)
### What this PR does / why we need it?
1. For all parts of the current test module involving the millisecond
download model, add the `local_file_only` parameter to specify offline
mode; this ensures that CI will not fail due to network instability.
2. Install modelscope from a fixed commit until it next release
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
 check if the env or arg `local_files_only` works
1) set the env:
```shell
export HF_HUB_OFFLINE=1
```
2) run the script
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()

model="Qwen/Qwen3-0.6B"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
it works well:
```shell
2026-03-06 06:40:12,546 - modelscope - WARNING - We can not confirm the cached file is for revision: master
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
{'architectures': ['Qwen3ForCausalLM'], 'attention_bias': False, 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_size': 1024, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 40960, 'max_window_layers': 28, 'model_type': 'qwen3', 'num_attention_heads': 16, 'num_hidden_layers': 28, 'num_key_value_heads': 8, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000, 'sliding_window': None, 'tie_word_embeddings': True, 'torch_dtype': 'bfloat16', 'transformers_version': '4.51.0', 'use_cache': True, 'use_sliding_window': False, 'vocab_size': 151936, '_commit_hash': None}
```
3) test the model repo does not cached locally when the env
`HF_HUB_OFFLINE`==True
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()


model="FireRedTeam/FireRed-OCR"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
and the result is as expected:
```shell
  File "/workspace/demo.py", line 12, in <module>
    config_dict, _ = PretrainedConfig.get_config_dict(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 189, in patch_get_config_dict
    model_dir = get_model_dir(pretrained_model_name_or_path,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 164, in get_model_dir
    model_dir = snapshot_download(
                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 137, in snapshot_download
    return _snapshot_download(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 283, in _snapshot_download
    raise ValueError(
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable look-ups and downloads online, set 'local_files_only' to False
```
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-13 08:52:24 +08:00
zxr2333
fe4cad24e9 [BugFix]fix qwen3.5 reshape_kvcache bug (#7209)
### What this PR does / why we need it?

This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the
Mamba cache for models like Qwen3.5. The previous implementation did not
correctly handle cases where the KV cache tensors have different data
types. This change ensures that slicing is performed based on byte
offsets before reshaping the tensors, which correctly handles
heterogeneous dtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By CI.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-12 23:51:40 +08:00
drizzlezyk
5fe7942bbd [CI] add action for issue labeler on issue open/edit (#7208)
### What this PR does / why we need it?
New Workflow File bot_issue_manage.yaml

Automatically runs when issues are opened or edited
Uses the official GitHub Issue Labeler action to categorize issues
Label Configuration issue-labeler.yml

Defines regex patterns for model-specific labels (310p, GLM5, Qwen 3.5,
DeepSeek, Kimi K2, Kimi K2.5)
Enables automatic issue classification based on title/content matching

### Does this PR introduce _any_ user-facing change?
No. This PR only introduces internal GitHub Actions workflow and
configuration changes. There are no API, interface, or behavior changes
visible to end users. It purely improves the issue management process on
GitHub.

### How was this patch tested?
- GitHub Actions workflow syntax is valid and follows the official
GitHub documentation
- The issue labeler action (github/issue-labeler@v3.4) is a
well-maintained official GitHub action
- Configuration file follows the expected YAML format for the
issue-labeler action
- Regex patterns for model names have been verified for correct syntax

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: drizzlezyk <drizzlezyk@163.com>
2026-03-12 20:16:17 +08:00
wangbj127
0c659e91ed [MTP][Bugfix] Fix GLM5-W8A8 precision issues caused by rotary quant MTP weights (#7139)
### What this PR does / why we need it?
When GLM5 target model uses rotary quant, the final hidden states passes
to MTP need to do an extra rotary.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
2026-03-12 20:01:24 +08:00
drslark
de93790d08 [main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158)
### What this PR does / why we need it?

The merged graph of draft in `FULL` mode is broken now.

This pr solves it.

Also, `actual_seq_lengths_q` in `model_runner` is found redundant, so,
it is removed.

It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and
https://github.com/vllm-project/vllm-ascend/pull/7148.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Test code is shown as below:

```python
prompts = [
    "1.Who are you?",
    "2. Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
    model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_num_seqs=32,
    # enforce_eager=True,
    disable_log_stats=False,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    async_scheduling=True,

    speculative_config={
        "enforce_eager": True,
        "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
        "disable_padded_drafter_batch": False,
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
    
    compilation_config={
        "cudagraph_mode": "FULL",
        "cudagraph_num_of_warmups": 1,
    },

    max_model_len=4096, 
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
```

The result before:

```text
   File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia
     graph_params.events[num_tokens].append(event)
     ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
 KeyError: 132
```

The result after:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 242
num_draft_tokens: 726
num_accepted_tokens: 156
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.07
```

We also test `FULL_DECODE_ONLY` mode.

The result is:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 244
num_draft_tokens: 732
num_accepted_tokens: 155
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.06
```

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-12 18:38:50 +08:00
Li Wang
88c56e3bf2 [Misc] Fix main lint to make CI happy (#7204)
### What this PR does / why we need it?
Fix lint failed due to the merging of a previous PR.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 18:27:48 +08:00
Li Wang
0a171b5cdd [Test][BugFix] Fix dispatch_gmm_combine_decode test stability (#7097)
### What this PR does / why we need it?
This patch fix the nightly failure
1. Each case uses a copy of the global kwargs instead of a reference to
prevent parameter pollution between use cases.
2. Add weight initialization in the scenario of `eplb` + `w8a8_dynamic`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```python
pytest -sv tests/e2e/nightly/single_node/ops/multicard_ops_a3/test_dispatch_gmm_combine_decode.py
```

```shell
===================================================================== 3 passed, 4 warnings in 194.86s (0:03:14) ======================================================================
```
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 17:22:44 +08:00
Li Wang
d866e6b238 [Bugfix] Fixed permission issues with the automatic PR submission workflow (#7142)
### What this PR does / why we need it?
Auto submit a pull request via
https://github.com/vllm-ascend-ci/vllm-ascend, the workflow looks like:
1. get a new config.yaml via run e2e tests
2. push the changed `config.yaml` to a new branch of
https://github.com/vllm-ascend-ci/vllm-ascend
3. submit a pull request to vllm-ascend via gh cli
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 17:18:59 +08:00
Shaoxu Cheng
e5343d6eb3 [310P][Bugfix]: fix ngram graph replay accuracy error (#7134)
### What this PR does / why we need it?
On the 310P device, when running ACLGraph together with the n-gram
speculative decoding algorithm, both graph capture and graph replay
require `uniform_decode_query_len` and do not depend on
`attention_state`. This leads to a rather interesting and unexpected
issue on 310P: during decode-only, execution does **not** enter the
graph, while in the split-fuse state (that is, the chunked prefill
state), it instead enters graph execution directly.

The issue can be resolved by forcibly setting `uniform_decode_query_len`
to `1`, so that 310P captures only the decode-only graph, and replay is
then controlled through `attention_state`.

### Does this PR introduce _any_ user-facing change?
NO

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-12 17:08:08 +08:00
Ronald
bfd049aa2c [Lint] fix typos error in epd_load_balance_proxy_layerwise_server_example.py (#7199)
### What this PR does / why we need it?
his PR fixes a typo in two function names in the
`epd_load_balance_proxy_layerwise_server_example.py` example script. The
function names `aquire_aborted_pd_requests` and
`aquire_aborted_prefiller_requests` were misspelled and have been
corrected to `acquire_aborted_pd_requests` and
`acquire_aborted_prefiller_requests` respectively. This improves code
readability and correctness.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-12 17:04:38 +08:00
tfhddd
21fea86b08 feat: [CI] Introduce uv to accelerate pip install (#7127)
### What this PR does / why we need it?
Integrates uv: Significantly accelerates pip install execution and
resolves concurrency issues caused by traditional pip caching
mechanisms.

Why pip install uc-manager is explicitly added:
This project depends on uc-manager. However, installing it via uv pip
install uc-manager currently fails due to a known issue. An issue has
already been filed with the upstream uv repository to address this.
Consequently, we explicitly invoke pip install uc-manager as a temporary
workaround to ensure the build succeeds.
https://github.com/ModelEngine-Group/unified-cache-management/issues/736

Why use UV_SYSTEM_PYTHON: 1:
No virtual environment has been created yet; this configuration has the
same effect as directly using `pip install`.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: tfhddd <2272751277@qq.com>
2026-03-12 16:47:23 +08:00
shaopeng-666
592661e787 [Doc] EPD doc and load-balance proxy example (#6221)
Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2026-03-12 16:17:17 +08:00
无脸男
09d26754cd [Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644)
### What this PR does / why we need it?

Fix the issue where no exception is thrown when graph capture fails.


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: WithHades <244036962@qq.com>
2026-03-12 16:14:45 +08:00
xleoken
77b43492ae improve the ttft when use mooncake (#6125)
### What this PR does / why we need it?
improve performance of mooncake by change the log level from info to
debug
### ENV
2P + 4D, EP

1. benchmark script
```
evalscope perf \
  --parallel 512 \
  --number 1024 \
  --model deepseek \
  --url http://localhost:9000/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 2 \
  --min-tokens 2 \
  --prefix-length 0 \
  --min-prompt-length 512 \
  --max-prompt-length 512 \
  --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814  \
  --extra-args '{"ignore_eos": true}' \
  --rate 2
```

2. before patch
```
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  209.484  |
+-----------------------------------+-----------+
| Number of concurrency             |  512      |
+-----------------------------------+-----------+
| Request rate (req/s)              |    6      |
+-----------------------------------+-----------+
| Total requests                    | 1024      |
+-----------------------------------+-----------+
| Succeed requests                  | 1022      |
+-----------------------------------+-----------+
| Failed requests                   |    2      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |    9.7573 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2507.62   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    4.8786 |
+-----------------------------------+-----------+
| Average latency (s)               |    7.0561 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    5.7444 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.3117 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.3117 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request |    2      |
+-----------------------------------+-----------+
2026-01-22 14:56:32 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.6062  | 0.5113  |  0.5113  |    1.234    |     512      |       2       |     0.0888     |    22.8338    |
|     25%     |  0.7248  | 0.5639  |  0.5639  |   1.4114    |     512      |       2       |      0.2       |    51.3919    |
|     50%     |  0.9092  | 0.7748  |  0.7748  |   1.6767    |     512      |       2       |     1.1935     |   306.7171    |
|     66%     |  1.0745  | 1.0345  |  1.0345  |   3.1308    |     512      |       2       |     1.3395     |   344.2495    |
|     75%     |  7.0812  | 1.5389  |  1.5389  |   10.0016   |     512      |       2       |     1.417      |   364.1808    |
|     80%     | 10.6944  | 1.8552  |  1.8552  |   13.3717   |     512      |       2       |     1.4778     |   379.7911    |
|     90%     | 19.2342  | 2.4325  |  2.4326  |   22.5105   |     512      |       2       |     1.6208     |   416.5381    |
|     95%     | 24.4399  | 2.8289  |  2.8289  |   26.0329   |     512      |       2       |     1.7548     |   450.9942    |
|     98%     | 45.0941  | 3.4098  |  3.4098  |   45.6287   |     512      |       2       |     1.8193     |   467.5476    |
|     99%     | 46.2786  | 3.8492  |  3.8492  |   46.9282   |     512      |       2       |     1.8576     |   477.4157    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

3. after patch
```
Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  191.613  |
+-----------------------------------+-----------+
| Number of concurrency             |  512      |
+-----------------------------------+-----------+
| Request rate (req/s)              |    6      |
+-----------------------------------+-----------+
| Total requests                    | 1024      |
+-----------------------------------+-----------+
| Succeed requests                  | 1024      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   10.6882 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2746.87   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    5.3441 |
+-----------------------------------+-----------+
| Average latency (s)               |    2.0407 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.7989 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.2419 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.2419 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request |    2      |
+-----------------------------------+-----------+
2026-01-22 15:10:31 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.5727  | 0.5051  |  0.5051  |   1.1761    |     512      |       2       |     1.0368     |   266.4696    |
|     25%     |  0.6497  | 0.5324  |  0.5324  |   1.3159    |     512      |       2       |     1.1763     |   302.3184    |
|     50%     |  0.7767  | 0.6908  |  0.6908  |   1.4793    |     512      |       2       |     1.3521     |   347.4944    |
|     66%     |  0.8711  | 0.7912  |  0.7912  |   1.5916    |     512      |       2       |     1.4518     |   373.1092    |
|     75%     |  0.9125  | 0.8797  |  0.8797  |   1.7008    |     512      |       2       |     1.521      |   390.9018    |
|     80%     |  0.9381  | 0.9442  |  0.9442  |   1.7657    |     512      |       2       |     1.5749     |   404.7606    |
|     90%     |  0.994   | 1.0818  |  1.0818  |   1.9289    |     512      |       2       |     1.7006     |   437.0518    |
|     95%     |  1.0369  | 1.2454  |  1.2454  |   2.2154    |     512      |       2       |     1.7937     |   460.9731    |
|     98%     |  1.1237  | 18.8814 | 18.8814  |   19.4607   |     512      |       2       |     1.8755     |   482.0097    |
|     99%     |  1.6752  | 24.4406 | 24.4406  |   25.4734   |     512      |       2       |     1.907      |   490.0993    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

---------

Signed-off-by: xleoken <xleoken@163.com>
2026-03-12 16:13:48 +08:00
Hexiang Wang
f244f3c4a9 [BugFix] Fix problem of extra processes on rank0 device (#7107)
### What this PR does / why we need it?
Currently when tp>1, we have extra processes on tp rank0 device which
consumes extra HBM memory. This is caused by `import
torch_npu._inductor` before set_device which introduces extra
initialization of device.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
All ci passed.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-12 15:59:03 +08:00
herizhen
e5024d0264 [doc] Add Ascend PyTorch Profiler section (#7117)
### What this PR does / why we need it?
add Ascend PyTorch Profiler section

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Documentation Format Checks
Technical Content Validation
Build Verification
Version Compatibility
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: herizhen <1270637059@qq.com>
2026-03-12 15:51:00 +08:00
Mercykid-bash
132f3c5d0a Support per-step heat collection and enhance FlashLB for multi-stage load balancing (#6477)
# Feature: FlashLB algorithm

## Purpose

This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel load balancing algorithm: FlashLB.
1. The default algorithm adopts two separate sub-procedures to optimize
expert replication and placement independently:

a. **Expert Replica Allotment Sub-procedure** : Determines the number of
replicas for all experts. At each step, it greedily adds one more
replica to the expert with the highest per-replica load, aiming to
minimize load skew at the expert replica granularity (Min Max Replica,
MMR).

b. **Expert Replica Placement Sub-procedure** : Distributes all replicas
across devices. First, it sorts the generated replicas in descending
order of hotness, then iteratively places the currently hottest replica
onto the device with the lowest cumulative load and available slots.
However, this simplistic combination of two separate procedures lacks
synergy and often leads to sub-optimal load balancing. For example, in
the simple scenario illustrated below: Given 8 logical experts with
hotness values [600, 560, 120, 120, 20, 10, 10, 10], and 2 replicas
allocated per device across 8 devices, the default EPLB algorithm
results in a maximum per-device hotness of 232 (peak-average load ratio
1.28), while our proposed FlashLB algorithm reduces this value to 205
(peak-average load ratio 1.13).

<figure><img
src="https://github.com/user-attachments/assets/b9b10fab-651e-4524-9942-adbca8d044a4"
width="90%"</figure>

2. The default algorithm simply aggregates hotness measurements across
the entire profiling window. While this provides a coarse approximation
of the hotness distribution, it fails to capture the time-phased
variations and temporal correlations in expert hotness (both within and
between experts) across iterations—phenomena that have been observed in
real-world scenarios. Such single-point hotness estimation degrades the
solution quality of the load balancing algorithm.

3. The default algorithm regularly recalculates updated expert placement
results for all layers without discrimination. Considering that
excessive expert updates can impact Service Level Objectives (SLOs),
such full-scale redeployment leads to excessively high adjustment
overhead, which negatively affects end-to-end performance.

## FlashLB Algorithm Principle

### 1. Joint Optimization of Replica Allotment and Placement

FlashLB achieves joint optimization of replica allotment and placement
through a novel tree search approach, combined with carefully designed e
Fl fficient pruning and lightweight look-ahead estimation. We partition
all experts into several subsets, and for each subset, hierarchically
determine the optimal replica count and placement. Leveraging efficient
pruning and lightweight look-ahead estimation, the process consistently
aims to optimize the globally expected inter-device load balance degree
(considering both deployed and unexplored experts) while ensuring
sufficient computational efficiency. Additionally, precompilation
techniques are employed for acceleration, delivering load balancing that
is both high-quality and practically efficient.
### 2. Multi-Episode Enhancement

Instead of performing full-duration averaging like the default
algorithm, FlashLB partitions each profiling interval (e.g., 1024
iterations) into multiple consecutive smaller episodes (e.g., 16
iterations). This preserves hotness fluctuation and correlation
information. It then constructs a multi-objective optimization problem
to co-optimize these episodes simultaneously, enabling adaptability to
interleaved hotness patterns and improving statistical robustness.

### 3. Layer-wise Cherry-Picking Redeployment

To reduce the overhead of frequent expert redeployment, FlashLB
introduces a cherry-picking redeployment scheme. During each algorithmic
decision cycle, it real-time tracks load balance degree of all layers
and triggers expert placement updates only for those layers whose
peak-average ratio exceeds a predefined threshold. This avoids
unnecessary redeployment for stable layers, significantly reducing
adjustment overhead and thereby improving end-to-end performance gains.

## Co-author:

Co-authored-by: Skywalker-EP 173723846@qq.com

This PR mainly introduces two key optimizations for load balancing
scheduling:
1. **Add per-step heat collection function**:
Support real-time collection of per-step heat information during model
inference. This enables more fine-grained load balancing decisions by
taking per-step heat as the optimization target, improving scheduling
accuracy for dynamic and fluctuating workloads.

2. **Update FlashLB algorithm**:
Upgrade the FlashLB scheduling logic to better adapt to multi-stage heat
distribution scenarios. The improved algorithm can comprehensively
perceive and utilize multi-stage heat characteristics, achieving more
stable and efficient load balancing under complex expert deployment and
dynamic traffic patterns.

---------

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
Signed-off-by: xuzewei28 <xuzewei2@h-partners.com>
Co-authored-by: xuzewei28 <xuzewei2@h-partners.com>
2026-03-12 15:49:09 +08:00
Feng-xiaosuo
abe72d7cb9 Refactor quantization layer name mapping to leverage vLLM built-in mappers (#7050)
…the quantization layer name

### What this PR does / why we need it?
This PR modifies the loading logic for layer name prefixes in quantized
models. The goal is to reduce or eliminate the need for point-to-point
(hardcoded) modifications by leveraging the built-in mapper mechanism
already provided in vLLM's model code. For models that do not yet have a
corresponding mapper, the original point-to-point modification approach
has been retained to ensure backward compatibility.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The changes were validated using an offline deployment script to launch
and verify multiple multimodal models. Testing confirmed that the
updated loading logic correctly handles layer name prefixes across
different model architectures, with no regression in model
initialization or inference behavior.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
2026-03-12 15:48:14 +08:00
drslark
fb0d6dd175 [main][bugfix] Fixed the problem of speculative decoding in FULL mode (#7148)
### What this PR does / why we need it?

Fixed the error of speculative decoding in FULL mode when `num_spec + 1`
not in `cudagraph_capture_sizes`.

Now, we can run speculative decoding in FULL mode, but with drafter as
eager.

It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 .

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Test code is shown as below:

```python
prompts = [
    "1.Who are you?",
    "2. Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
    model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_num_seqs=32,
    # enforce_eager=True,
    disable_log_stats=False,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    async_scheduling=True,

    speculative_config={
        "enforce_eager": True,
        "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
        "disable_padded_drafter_batch": False,
        "method": "eagle3",
        "num_speculative_tokens": 2,
    },
    
    compilation_config={
        "cudagraph_mode": "FULL",
        "cudagraph_num_of_warmups": 1,
    },

    max_model_len=4096, 
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
```

The result before:

```text
   File "/vllm-workspace/vllm/vllm/v1/cudagraph_dispatcher.py", line 140, in _create_padded_batch_descriptor
     assert num_tokens_padded % uniform_decode_query_len == 0
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 AssertionError
```

The result after:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 249
num_draft_tokens: 498
num_accepted_tokens: 149
mean acceptance length: 1.60
--------------------------------------------------
acceptance at token 0: 0.43
acceptance at token 1: 0.17
```

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-12 14:51:12 +08:00
XiaoxinWang
37d1bd8c50 fixed fia pad logic in graph mode. (#7144)
### What this PR does / why we need it?
related to vllm PR #34043 this pr delete func
‘relax_for_mixed_batch_cudagraphs’, num_reqs no longer equals the actual
number of requests, due to fia operator requires that
query_start_loc[-1] equals the total number of computed tokens, so this
func delete cause the ifa error.
In full graph mode, set num_reqs_paded = num_reqs to fix the error
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-12 14:50:54 +08:00
MengLong Chen
bbffe58b63 [Doc] fix DSV3.1 PD configs (#7187)
### What this PR does / why we need it?
Modify the `kv_port` and `engine_id` config of DeepSeek-V3.1/R1 in the
2P1D scenario

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2026-03-12 14:24:49 +08:00
Qiu
aa0143e55d refactor: add a check before layer_sharding logging (#7186)
### What this PR does / why we need it?
We should only display this log message when layer_sharding is enabled.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-12 11:56:04 +08:00
linfeng-yuan
5f3826b093 [Build] Add support for Ascend950 chip (#7151)
### What this PR does / why we need it?
This PR adds support for the Ascend950 chip. This includes:
- Updating build scripts (`CMakeLists.txt` and `setup.py`) to recognize
the Ascend950 chip and set appropriate compilation flags.
- Disabling a set of custom operators that are not yet supported on the
Ascend950 hardware target.
- Performing a codebase-wide refactoring of `pipe_barrier()` calls to
the namespaced `AscendC::PipeBarrier<>()` for improved code consistency
and adherence to the latest API standards.

Ascend950DT e2e passed (Qwen3-32B-MXFP8) and CI passed
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-12 10:25:51 +08:00
meihanc
da01a74009 Revert "[CI] fix skiped e2e test when upgrade vllm version (#6654)" (#7166)
This reverts commit f6db47f103.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-11 23:03:15 +08:00
shiyuan680
3b6b3c4214 [MODELRUNNERV2]fix penality ops (#7013)
### What this PR does / why we need it?
fix penality ops for new version, and achieved a 10% performance
improvement

### How was this patch tested?
pytest
‎tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shiyuan680 <917935075@qq.com>
2026-03-11 17:13:34 +08:00
yupeng
830f39dd70 [Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650)
### What this PR does / why we need it?
Fix the issue #6143 .

### Does this PR introduce _any_ user-facing change?
Allow to start the server with "--enable-lora && --fully-sharded-loras
&& --tensor_parallel_size 2".

### How was this patch tested?
pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-11 15:43:15 +08:00
pz1116
a7f91fce71 [KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (#7146)
### What this PR does / why we need it?
Currently, we call lookup_client for looking up token hit in KV Pool,
however, when token length < block size, the key will be empty and there
is no point to lookup in KV Pool backend since there will never be a
hit.
Hence, add early return in `get_num_new_matched_tokens` when `token_len`
< `block_size`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: fems14 <1804143737@qq.com>
2026-03-11 15:05:34 +08:00