444 Commits

Author SHA1 Message Date
Yang Yuxi
e776d5c0f1 [Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726)
### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.

--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
2026-03-29 15:59:19 +08:00
Wang Kunpeng
5df2ddd8db [v0.18.0][Bugfix]Fix Error "AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute 'enabling_fa_quant'" (#7748)
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736

**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`

**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.

**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-28 17:03:56 +08:00
Wangbei25
dd55736ee4 fix uncompatible between fc1 and non-sp-padding (#7643)
cherry pick https://github.com/vllm-project/vllm-ascend/pull/7614
### What this PR does / why we need it?
fix uncompatible between fc1 and non-sp-padding
After PR
[non-sp-padding](https://github.com/vllm-project/vllm-ascend/pull/7297),
kimi2.5 open flashcomm1 will raise an error : The expanded size of the
tensor do not match the existing size at non-singleton dimension 0.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM-Ascend main: 9976e685b7

Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
2026-03-25 23:23:37 +08:00
realliujiaxu
5d12446573 [Feat][SP] Suport SP for VL MoE models (#7044)
### What this PR does / why we need it?

2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.


### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.


### How was this patch tested?
- Model: Qwen3-VL-30B-A3B, 
- TP4 DP2
- 100 reqs
- max concurrency 1

| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k         | 429.40               | 323.3                  |
| 16k        | 1297.01              | 911.74                |

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-24 17:16:00 +08:00
zxr2333
67aad1fce8 [BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506)
### What this PR does / why we need it?
1. When the FullGraph mode is used, the branches in the Triton operator
are compiled and fixed during the graph capture process, causing the
branch condition in the `fused_recurrent_gated_delta_rule` operator,
which checks whether `ssm_state_indices >= 0` before writing to the SSM
cache, to become invalid. Now, the write operation is performed
regardless of the value. This results in the operator performing address
offset calculations and writing to the SSM cache based on the -1 offset
after -1 is used for padding in vLLM GDN backend. Since the conv cache
and SSM cache in vLLM Ascend implementation are actually a single
continuous tensor divided into two parts, this leads to data overwriting
and the generation of NaN values.
This PR addresses two cases where padding -1 is required in the GDN
metadata builder. The same logic is used to replace the padding with 0
to avoid the problem of memory overwriting, because block 0 is a
reserved block.
2. Fix layerwise connector bug for mamba cache sending on heterogeneous
TP.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-24 15:15:55 +08:00
Levi
9976e685b7 [Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297)
### What this PR does / why we need it?
Fix multi dp padding logic for eager mode, bacause its will cause rank0
load imbalance in kimi-k2.5-w4a8 with the all the padding tokens router
to rank0. And the fix can also apply to other model in multi dp.
- before
hbm usage:
<img width="2229" height="733" alt="image"
src="https://github.com/user-attachments/assets/50479b6d-cfd0-4206-8e80-974024652997"
/>

preformance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0179       1667.7803    1673.3437    35.2973      35.2775      35.3784     
32           480          0.4725       2764.8027    1905.2137    40.8030      40.6978      41.0179     
64           960          0.7820       4123.7096    3485.6153    48.0461      48.1598      48.2971     
100          1500         1.0852       6216.7988    5714.0082    52.9323      53.0613      54.6304     
108          1620         1.1040       6277.4892    5798.7425    56.3862      56.9224      57.2901     
116          1740         1.1680       6563.3293    6039.5659    56.9894      57.4027      57.5786     
128          1920         1.2555       7822.5551    7604.1662    57.7660      58.1768      58.2717     
192          2880         1.4314       9212.1953    9131.3461    58.9905      59.1683      59.2791     
256          3840         1.4480       9028.0812    8913.7937    59.0092      59.2385      59.3516  
```


- after
hbm usage:
<img width="2246" height="1005" alt="image"
src="https://github.com/user-attachments/assets/d0936481-5a58-4bc5-a6f1-b92735d47885"
/>


preformance:
```shell
Concurrency  NumPrompts   QPS          TTFT_Avg     TTFT_P50     TPOT_Avg     TPOT_P50     TPOT_P90    
============ ============ ============ ============ ============ ============ ============ ============
1            15           0.0181       601.4171     600.9774     35.6270      35.6254      35.6480     
32           480          0.4455       720.8782     724.2889     45.4250      45.4755      45.6318     
64           960          0.8445       729.6209     728.2149     47.0464      47.0896      47.1985     
100          1500         1.2601       723.4834     724.6673     48.3108      48.3844      48.5355     
108          1620         1.3409       727.1509     720.6772     48.8962      48.9409      49.0489     
116          1740         1.4080       679.9799     677.6119     49.1253      49.1983      49.3087     
128          1920         1.4155       680.6284     674.9436     49.2193      49.2450      49.3763     
192          2880         1.4422       684.6577     676.7833     49.2059      49.2264      49.3229     
256          3840         1.4558       685.2462     678.1709     49.2191      49.2351      49.3419    
```
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: fny-coder <985619145@qq.com>
2026-03-23 17:05:02 +08:00
XiaoxinWang
cbf46fad3c fixed graph mode bug. (#7460)
### What this PR does / why we need it?
In fulldecodeonly mode, num_req_padded was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. Therefore, we added a check
for compilation_config.cudagraph_mode to the conditional logic, ensuring
that padding is applied only in FULL mode.


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-22 10:09:37 +08:00
Zetong Li
84a74f0cb1 [Bugfix] Fix padding logic in eagle proposer for kimi25 (#7348)
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-21 16:57:22 +08:00
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
HongtaoYang
80a4265717 [Feat] Support separate attention backend for target and draft model. (#7342)
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.

It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
2026-03-21 10:48:01 +08:00
LI SHENGYONG
4e6dbe0956 [EPLB][Bugfix] Set parallel_config.enable_eplb to true to load redundant experts (#7470)
### What this PR does / why we need it?
pr: https://github.com/vllm-project/vllm/pull/37136 break eplb because
it filters out redundant experts.
pr: https://github.com/vllm-project/vllm/pull/37322 fix it due to use
parallel_config.enable_eplb to determine whether to skip the weight
loading filter.
But in vllm-ascend, parallel_config.enable_eplb is always false. When we
use eplb, we temporarily set it to true.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?

![Snipaste_2026-03-19_16-13-01](https://github.com/user-attachments/assets/b3a4911e-36b3-4c31-951c-7c091f416d00)
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-20 15:22:55 +08:00
Nengjun Ma
8b79d4de52 Main2main upgrade to vllm 0317 afternoon (#7409)
### What this PR does / why we need it?

1.fix "TypeError: get_attn_backend() remove variable": [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)

2.fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)

3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace
memory related torch.cuda
APIs"](https://github.com/vllm-project/vllm/pull/37031)

4.fix [Support multiple KV groups in OffloadingSpec
](https://github.com/vllm-project/vllm/pull/36610) removed
self.offloaded_block_size and changed self.gpu_block_size from a scalar
to a tuple of per-group block sizes, adding block_size_factor.

5.fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed
get_eagle3_aux_hidden_state_layers() to
get_eagle3_default_aux_hidden_state_layers() and added a
supports_eagle3() guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E


- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-18 23:24:27 +08:00
zhangyiming
1c954ff264 [main2main] upgrade vllm to 0308 (#7213)
### What this PR does / why we need it?
Update main2main to vllm 0308.
breaks:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-03-18 09:24:43 +08:00
zxr2333
5645ca8392 [BugFix]A2 MOE method&& layerwise MTP bugfix && Mamba gdn_metadata bugfix (#7364)
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. For A2, the number of experts each single card cannot be greater than
16 when using MC2. The PR fixed the error in the A2 moe communication
method selection, which would cause the selection of an incorrect
communication method when the number of model experts exceeds 256. For
example, when using an A2 16-cards model to load the PD-disaggregation D
node with Qwen3.5 series models, the incorrect MC2 method would be
chosen.
2. Fixed the issue where the layerwise connector sends the kv-cache of
the MTP layer multiple times when `num_spec_tokens` > 1. Now, the
kv-cache is sent only when the MTP layer is forward for the first time.
3. Fix the accuracy issue of qwen3.5 when using MTP for PD
disaggregation. The cause is that `num_decode_draft_tokens` does not
consider that `spec_tokens` are not existed during the first inference
when PD disaggregation (`spec_tokens` are generated during the first
inference). However, `spec_tokens_padding` is added by
`recomputed_scheduler`. As a result, `gdn_metadata` incorrectly
considers that the prefill with a length of 2 is performed.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 23:03:45 +08:00
pichangping
3f39ac9c8d [Feature]Supports DSv3.1 PD separation and C8 quantization (#7222)
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>

### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints: 
1. Only the PD separation mode can be used and
MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path
<path/quant_weight>
--anti_dataset../common/deepseek_anti_prompt_50_v3_1.json
--calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot
--trust_remote_code True --fa_quant --dynamic --anti_method m6

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-16 22:49:05 +08:00
drslark
a6f6e919e6 [main][bugfix] Fixed the problem that eagle3 will crash in FULL_DECODE_ONLY (#7290)
### What this PR does / why we need it?
Two problems have been solved in this pr.
These problems occur in the `FULL_DECODE_ONLY` mode that `num_tokens`
should be padded to some value in `cudagraph_capture_sizes`.

1. We found the length of `seq_lens_list` in drafter's `attn_metadata`
is 1 shorter than expected. It will raise a kernel exception to make
vllm crash.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But
`seq_lens_list` = [5742, 4700, 7996], it is not padded.

3. Though the length of `seq_lens_list` in target's `attn_metadata` is
the same as expected in `FULL_DECODE_ONLY`, some data are corrupted at
the end of the list.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But
`seq_lens_list` = [5742, 4700, 7996, 5738], it has corrupted at the end
of the list.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-16 20:41:36 +08:00
rjg-lyh
4d443b9228 [bugfix] restore pr-7029 and fix patch error (#7294)
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.

The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.

This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.

### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-16 15:39:42 +08:00
Mengqing Cao
0c299f79b9 Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)" (#7288)
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduces an issue that the patch doesn't work with recompute scheduler
enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-15 20:19:09 +08:00
Angazenn
ce5544bfc1 [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103)
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` && PD disaggregation is still not
supported yet in vLLM v0.17.0(see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvements in the future.
3. `--mamba-cache-mode align` requires to copy mamba states during
forward steps. vLLM uses a triton kernel to implement it. However, the
original version run into some bugs on Ascend hardwares. Thus we patch a
new triton kernel to avoid this bug.

### Does this PR introduce _any_ user-facing change?
To use mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function(see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch native version, thus it might have trouble if
users can't use triton.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-15 09:44:09 +08:00
Mengqing Cao
e7aa2c285c [SpecDecode] Fix Draft model proposer (#7230)
### What this PR does / why we need it?
This pr fix the Unified draft parallel feature. 
1. In Draft model proposer, there are exceed 1 attention layers in
target model, thus removing the assertion on layer number.
2. we should get block size through `draft_attn_groups` instead of
`attn_metadata_builder` after 0.17.0.
3. `attn_update_stack_num_spec_norm` shouldn't be done when unified
draft parallel is enabled

### How was this patch tested?
Test pass with
`tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`,
which is already included in CI

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-14 18:26:37 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
rjg-lyh
7ed9e9de69 [Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)
### What this PR does / why we need it?
This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops
in pd-mix stage mainly.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.

The next steps are planned in two parts:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
kx
df1ee8070d [feat][spec decode]Unified draft parallel (#6766)
### What this PR does / why we need it?
Implement a unified parallelized speculative decoding in VLLM
Ascend,which can simultaneously support parallel speculative inference
schemes such as Pard, P-Eagle, etc. refer to
https://github.com/vllm-project/vllm-ascend/pull/6565 and
https://github.com/vllm-project/vllm-ascend/pull/4078

### How was this patch tested?

run with parallel drafting script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

base script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811

benchmark script:
MAX_CONCURRENCY=1
NUM_PROMPTS=80
vllm bench serve --port 8811 \
    --temperature 0 \
    --model /model/Llama-3.1-8B-Instruct \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts ${NUM_PROMPTS} \
    --max-concurrency ${MAX_CONCURRENCY} \
    --seed 1234

test results :
base(without spec decode): TTFT 79.46ms TPOT 26.99ms
output_tokens_throughput 36.75 tok/s
this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms
output_tokens_throughput 72.98 tok/s
per-position acceptance(from position 0 to 7):
79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%.

----------------------------------------------------------------------
run on qwen3 model script :
export target=/model/Qwen3-1.7B
export draft=/model/PARD-Qwen3-0.6B
export CUDA_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=1

vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

cc  @NickJudyHvv
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
2026-03-13 14:07:35 +08:00
Qiu
c377e73933 Perf(PP): support PP with async scheduling. (#7136)
### What this PR does / why we need it?
Follow up the PR https://github.com/vllm-project/vllm/pull/32618, this
PR provides async scheduling support for PP in vllm-ascend.
---
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-13 10:27:23 +08:00
zxr2333
fe4cad24e9 [BugFix]fix qwen3.5 reshape_kvcache bug (#7209)
### What this PR does / why we need it?

This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the
Mamba cache for models like Qwen3.5. The previous implementation did not
correctly handle cases where the KV cache tensors have different data
types. This change ensures that slicing is performed based on byte
offsets before reshaping the tensors, which correctly handles
heterogeneous dtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By CI.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-12 23:51:40 +08:00
drslark
de93790d08 [main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158)
### What this PR does / why we need it?

The merged graph of draft in `FULL` mode is broken now.

This pr solves it.

Also, `actual_seq_lengths_q` in `model_runner` is found redundant, so,
it is removed.

It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and
https://github.com/vllm-project/vllm-ascend/pull/7148.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Test code is shown as below:

```python
prompts = [
    "1.Who are you?",
    "2. Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
    model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_num_seqs=32,
    # enforce_eager=True,
    disable_log_stats=False,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    async_scheduling=True,

    speculative_config={
        "enforce_eager": True,
        "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
        "disable_padded_drafter_batch": False,
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
    
    compilation_config={
        "cudagraph_mode": "FULL",
        "cudagraph_num_of_warmups": 1,
    },

    max_model_len=4096, 
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
```

The result before:

```text
   File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia
     graph_params.events[num_tokens].append(event)
     ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
 KeyError: 132
```

The result after:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 242
num_draft_tokens: 726
num_accepted_tokens: 156
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.07
```

We also test `FULL_DECODE_ONLY` mode.

The result is:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 244
num_draft_tokens: 732
num_accepted_tokens: 155
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.06
```

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-12 18:38:50 +08:00
无脸男
09d26754cd [Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644)
### What this PR does / why we need it?

Fix the issue where no exception is thrown when graph capture fails.


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: WithHades <244036962@qq.com>
2026-03-12 16:14:45 +08:00
XiaoxinWang
37d1bd8c50 fixed fia pad logic in graph mode. (#7144)
### What this PR does / why we need it?
related to vllm PR #34043 this pr delete func
‘relax_for_mixed_batch_cudagraphs’, num_reqs no longer equals the actual
number of requests, due to fia operator requires that
query_start_loc[-1] equals the total number of computed tokens, so this
func delete cause the ifa error.
In full graph mode, set num_reqs_paded = num_reqs to fix the error
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-12 14:50:54 +08:00
lilinsiman
a5ea699e29 [eagle][cp] fix eagle_cp enable bug2 (#7079)
### What this PR does / why we need it?
Fix acceptance and high-concurrency bug in eagle3 and cp enabled

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests and ut

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-10 16:32:49 +08:00
yupeng
40f7d93f1a [bugfix][LoRA] Fix the lora accuracy issue introduced by the upstream vLLM changed. (#6958)
### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue that introduced by the upstream PR
https://github.com/vllm-project/vllm/pull/32005

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
2026-03-10 10:43:18 +08:00
王远
82fdd40d49 [Feat]Xlite Qwen3 MoE Support Data Parallel (#6715)
### What this PR does / why we need it?
This patch adds support for the Qwen3-MoE data parallel in Xlite. For
more details about Xlite, please refer to the following
link:[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md).

online server config:
```shell
port=$1
log=$2
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_NZ=0
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
ip=127.0.0.1
python -m vllm.entrypoints.openai.api_server \
        --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B  \
        --tensor-parallel-size 2 \
        --enable-expert-parallel \
        --data-parallel-size 4 \
        --gpu-memory-utilization 0.9 \
        --max-num-batched-tokens 32768 \
        --data-parallel-size-local 4 \
        --max-num-seqs=200 \
        --block-size 128 \
        --max-model-len 6656 \
        --trust-remote-code \
        --disable-log-requests \
        --served-model-name qwen \
        --no-enable-prefix-caching \
	--additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \
	--compilation-config '{"cudagraph_capture_sizes":[1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
	--async-scheduling \
	--host ${ip} \
	--port ${port} > ${log} 2>&1 &
``` 
test_config:
```shell
vllm bench serve \
    --max-concurrency ${maxconcurrency} \
    --num-prompts ${num_prompts} \
    --host ${HOST} \
    --port ${PORT} \
    --model ${MODEL_NAME} \
    --dataset-name random \
    --backend openai-chat \
    --random-input-len 512 \
    --random-output-len 512  \
    --random-range-ratio 0.2 \
    --temperature 0.6 \
    --metric-percentiles "50,90,99" \
    --tokenizer ${TOKENIZER_PATH} \
    --endpoint /v1/chat/completions \
    --ignore-eos
``` 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.16.0
- vLLM main:
c86cdcbcd2

Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com>
Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>
2026-03-09 17:53:35 +08:00
Cao Yi
cb4c7de856 [Perf] Optimize MTP execution by reordering state update operation (#6844)
## Summary
- Move `_update_states_after_model_execute` call from after main model
sampling to after draft model execution
- This reordering reduces pipeline bubbles between main model and draft
model execution
- No accuracy impact - the state update operation is independent of
draft token proposal

## Performance Impact
Reduces idle time between main model and draft model execution stages,
improving overall MTP (Multi-Token Prediction) performance.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
2026-03-09 15:55:27 +08:00
zxr2333
d39d80830c [KVCache]Qwen3.5 supports contiguous tensor hybrid-attn kv-cache (#6887)
### What this PR does / why we need it?
Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid
model, such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv
tensors, and SSM tensors must be contiguous. Therefore, this PR uses the
following solution to generate the KV cache:
tensor1: [(kv_padding), conv                      , ...]
tensor2: [k                   , ssm                       , ...]
tensor3: [v                   , (mamba_padding), ...]
Under this scheme, although some waste may occur, the tensors of all
caches are guaranteed to be contiguous.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-09 15:28:40 +08:00
Cao Yi
aef9d4249d [Perf] Avoid CPU sync in mrope_positions copy by using full tensor copy (#7014)
### What this PR does / why we need it?

The index-select operation `mrope_positions.gpu[:,
:total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU
synchronization, which blocks subsequent operator dispatch and causes
bubbles visible in Profiling.

This PR changes to full tensor copy
(`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync
point. The trade-off is a negligible increase in memory usage since
`mrope_positions.cpu` is a small tensor.

**Result:** ~2-3% TPOT improvement with the profiling bubbles
eliminated.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Verified via Profiling that the CPU sync bubble is eliminated and TPOT
is reduced by 2-3%.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
2026-03-09 14:46:37 +08:00
wangxiaoteng888
a3f4f6b10b [P/D][Bugfix] Layerwise stacking MTP error. (#7036)
### What this PR does / why we need it?
The community has added a cleaning mechanism for the metadata after the
main model finishes running. The MTP layer should not clean the
metadata, and a new condition has been added to avoid cleaning it.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-09 10:55:43 +08:00
drslark
6a7115fa0d [main][feature] Support quarot for eagle3 without embedding (#7038)
### What this PR does / why we need it?
If some `eagle3` model without embed_tokens works with `quarot` target
model, the acceptence rate will drop.
We solve it in this PR.
The relative vllm pr is https://github.com/vllm-project/vllm/pull/36225.

- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-09 10:43:06 +08:00
lilinsiman
01d3515dcf [eagle][cp][bugfix] Fix the bug in eagle and cp enabled (#6981)
### What this PR does / why we need it?
When eagle and cp are enabled at the same time, there is an error in
pcp_allgather due to hidden_states. This PR fixes this issue.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-06 20:49:49 +08:00
ZhaoJiangJiang
a51d6366b9 [Bugfix] Qwen3Next support FlashComm1 (#6830)
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP)
and resolve precision problems in shared_out when both FlashComm1 is
enabled.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
2026-03-06 17:14:08 +08:00
Zetong Li
a2696006d1 [Refactor][EAGLE] 8/N delete mtp_proposer (re-pull) (#7033)
### What this PR does / why we need it?
**NOTE: This PR is re-pull of #7016 since ci mistakenly marked
unfinished pr as having passed.**

This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, now it should be ok to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-06 17:11:22 +08:00
wangxiyuan
16c3b0b822 Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030)
Reverts vllm-project/vllm-ascend#7016
It breaks E2E test
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
2026-03-06 11:24:05 +08:00
Zetong Li
a60e179c7f [Refactor][EAGLE] 8/N delete mtp_proposer (#7016)
### What this PR does / why we need it?
This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, now it should be ok to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-03-06 09:10:57 +08:00
SILONG ZENG
bd571cf6d6 [Main2Main] Upgrade vLLM to 0303 (#6944)
### What this PR does / why we need it?
break:
- https://github.com/vllm-project/vllm/pull/34102 
Disable_full param replaced with valid_modes/invalid_modes API
- https://github.com/vllm-project/vllm/pull/35503
Now must return float compilation_time
- https://github.com/vllm-project/vllm/pull/35564
New sequence_lengths param added
- https://github.com/vllm-project/vllm/pull/33807
A check was performed (if runner_backend != "auto")
- https://github.com/vllm-project/vllm/pull/34861
`BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to
check process group state
- https://github.com/vllm-project/vllm/pull/35274

**Important change:**
- https://github.com/vllm-project/vllm/pull/28672

`matcher_utils` directly accesses `torch.ops._C.*` during the import
phase. In the Ascend environment, some unregistered ops trigger
`AttributeError`, causing e2e initialization failure.

https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323

https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29

This PR adds temporary compatibility placeholders (rms_norm,
fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant,
silu_and_mul) to
`vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to
ensure no crashes during the import phase. Upstream repairs will be
considered later.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
2026-03-06 09:08:52 +08:00
Cao Yi
50441e4650 [BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835)
## Problem
When MTP is enabled, prefill requests with `prompt_tokens ==
num_spec_tokens + 1` are incorrectly classified as decode requests,
causing accuracy issues.

## Root Cause
The `uniform_decode` condition only checked:
- `max_num_scheduled_tokens == uniform_decode_query_len`
- `num_tokens == max_num_scheduled_tokens * num_reqs`

This is insufficient because a prefill request with specific prompt
length satisfies these conditions as well.

## Fix
Add `is_all_decode` check to ensure all requests have
`num_computed_tokens > 0` before classifying as uniform decode, since
decode requests must have computed at least one token.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-05 17:33:10 +08:00
LI SHENGYONG
5a3744c542 [EPLB] The profiling can collect the time required for adjusting the eplb. (#7001)
### What this PR does / why we need it?
To analyze the overhead of the dynamic eplb adjustment framework in
detail, we added the time consumption of the adjustment to the print
information in profiling mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


![Snipaste_2026-03-05_11-42-28](https://github.com/user-attachments/assets/41c2b82a-5dfa-4e39-8b50-f4649deed30c)

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-03-05 16:10:57 +08:00
wangxiyuan
13777bf3f0 [Spec Decode]clean up spec decode interface (#6947)
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.

This is the first step. In the future we should remove the class
register and just add few Ascend specified method once the arch in vLLM
is ready.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:30:10 +08:00
zhaomingyu13
52d9086f64 [Bugfix] Fix the acceptance rates dorp issue when applying eagle3 to QuaRot model (#6914)
### What this PR does / why we need it?
When using the target model after rotational quantization, the
acceptance rate decreases because the fc weight of the draft model has
not undergone rotational quantization(issue: #6445). We fixed this issue
by performing rotation quantization on the fc weight of the draft model
in the same way as the main model when loading draft model.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-04 11:29:49 +08:00
weiguihua2
5b05b3a090 [feat]ds3.2 pcp support mtp and chunkprefill (#6917)
### What this PR does / why we need it?
ds3.2 pcp supports the combination of MTP and chunkprefill features.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-03-03 19:03:50 +08:00
realliujiaxu
5e24b26a54 [Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883)
### What this PR does / why we need it?

PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/  \
    --port 8001 \
    --served-model-name auto \
    --max-model-len 1024 \
    --enforce-eager \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --additional-config '{"enable_shared_expert_dp": true}'
```

#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Hello. I have a question. Who are you?"}
  ],
  "max_tokens": 10,
  "temperature": 0.0,
  "ignore_eos_token": true
}'
```

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-01 20:22:50 +08:00
Bai Yongbin
9d09488b4a [Feat] support basic pcp&dcp for qwen3next (#6091)
### What this PR does / why we need it?
This PR implements Context Parallelism (CP) support for the Qwen3-Next
model, including PCP (Parallel Context Parallelism) and DCP
(Dynamic/Data Context Parallelism).

- vLLM version: v0.15.0
- vLLM main:
f176443446

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: Bai Yongbin <845473182@qq.com>
Co-authored-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-02-28 21:44:08 +08:00
starmountain1997
5ffae03156 [bugfix] fix capture shape in sp_eagle_fullgraph (#6846)
### What this PR does / why we need it?

This was meant to be merged in #6536, but I accidentally restored a
commit. You can find the relevant discussion
[here](https://github.com/vllm-project/vllm-ascend/pull/6536#issuecomment-3882883471).

Since `self.pass_config.enable_sp` is forcibly set to `False` in the
[source
code](f176443446/vllm/config/compilation.py (L1066)),
this section will no longer verify whether the generated cudagraph
shapes are multiples of both `uniform_decode_query_len`
(`num_speculative_tokens + 1`) and `tensor_parallel_size`.

This PR enables the `num_speculative_tokens + 1` and
`tensor_parallel_size` check upfront. Therefore, it won't silently round
up the `cudagraph_size` and throw a cryptic error for the user.

A typical example of this cryptic error looks like:
```
ValueError: could not broadcast input array from shape (196,) into shape (14,)
```

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

Have passed all test.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-28 17:30:02 +08:00