### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.
--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736
**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' object has no attribute
'enabling_fa_quant'`
**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.
**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.
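A minimal sketch of the guard idea, not the actual patch: the two config classes below are hypothetical stand-ins, and `fa_quant_enabled` is an illustrative helper name. It only assumes that `enabling_fa_quant` may be missing on some config classes.

```python
class ModelSlimLikeConfig:
    """Hypothetical stand-in for a config that defines the FA3 helper."""

    def __init__(self, fa_quant: bool) -> None:
        self._fa_quant = fa_quant

    def enabling_fa_quant(self) -> bool:
        return self._fa_quant


class CompressedTensorsLikeConfig:
    """Hypothetical stand-in for a config that defines no FA3 helper."""


def fa_quant_enabled(quant_config: object) -> bool:
    # Look the helper up defensively so that a config class which never
    # defined it reports "disabled" instead of raising AttributeError.
    flag = getattr(quant_config, "enabling_fa_quant", None)
    if callable(flag):
        flag = flag()
    return bool(flag)


print(fa_quant_enabled(ModelSlimLikeConfig(True)))
print(fa_quant_enabled(CompressedTensorsLikeConfig()))
```

Callers would then run the FA3-specific code path only when this check returns true.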
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.
### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.
### How was this patch tested?
- Model: Qwen3-VL-30B-A3B,
- TP4 DP2
- 100 reqs
- max concurrency 1
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
1. When FullGraph mode is used, the branches in the Triton operator are
compiled and fixed during graph capture. This invalidates the branch
condition in the `fused_recurrent_gated_delta_rule` operator that checks
whether `ssm_state_indices >= 0` before writing to the SSM cache, so the
write is performed regardless of the value. Because the vLLM GDN backend
pads with -1, the operator computes address offsets and writes to the SSM
cache based on the -1 offset. Since the conv cache and SSM cache in the
vLLM Ascend implementation are actually a single contiguous tensor
divided into two parts, this leads to data overwriting and the
generation of NaN values.
This PR addresses the two cases where the GDN metadata builder pads with
-1. In both cases the padding value is replaced with 0 to avoid the
memory overwrite, because block 0 is a reserved block.
2. Fix layerwise connector bug for mamba cache sending on heterogeneous
TP.
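The padding problem in item 1 can be illustrated with a toy model in plain Python. This is only a sketch: the slot counts, the layout order (conv region first, SSM region second), and the helper names are assumptions for illustration, not the real kernel or cache layout.

```python
CONV_SLOTS = 4  # hypothetical number of conv-cache slots
SSM_SLOTS = 4   # hypothetical number of SSM-cache slots (slot 0 is reserved)


def write_ssm_states(buffer, state_indices, values):
    """Write every value at its SSM slot with no `>= 0` check,
    mimicking a branch that was fixed at graph-capture time."""
    for idx, value in zip(state_indices, values):
        # The SSM region is assumed to start right after the conv region,
        # so index -1 resolves to the last conv slot.
        buffer[CONV_SLOTS + idx] = value


def pad_indices(indices, padded_len, pad_value):
    """Pad the index list up to the captured batch size."""
    return indices + [pad_value] * (padded_len - len(indices))


shared = ["conv"] * CONV_SLOTS + ["ssm"] * SSM_SLOTS

# Padding with -1: the padded request clobbers the conv region.
bad = list(shared)
write_ssm_states(bad, pad_indices([1, 2], 3, -1), ["a", "b", "junk"])

# Padding with 0: the padded request lands in reserved SSM slot 0.
good = list(shared)
write_ssm_states(good, pad_indices([1, 2], 3, 0), ["a", "b", "junk"])

print(bad)
print(good)
```

With 0 as the padding value, the stray write only touches the reserved slot, so live conv and SSM entries stay intact.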
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
In fulldecodeonly mode, num_req_padded was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. Therefore, we added a check
for compilation_config.cudagraph_mode to the conditional logic, ensuring
that padding is applied only in FULL mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
This PR aims to fix padding logic in eagle proposer for kimi25. Main
changes involve:
1. modify the way to obtain draft model attention builder and backend
2. add block table padding & related tensor slicing in common metadata
when `draft_step>1` for solving fia verifying error
3. replace block table in `update_graph_params` for solving fia
verifying error
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR enables separate attention backend configuration for target and
draft models in speculative decoding, decoupling the previously bound
attention backend settings between the two models.
It solves the compatibility issue where some draft models do not support
the attention backend used by the target model, and allows users to
select the optimal attention backend for each model individually to
maximize inference performance. The change is fully backward compatible.
---------
Signed-off-by: SidaoY <1024863041@qq.com>
### What this PR does / why we need it?
https://github.com/vllm-project/vllm/pull/37136 breaks EPLB because it
filters out redundant experts.
https://github.com/vllm-project/vllm/pull/37322 fixes this by using
`parallel_config.enable_eplb` to determine whether to skip the weight
loading filter.
However, in vllm-ascend, `parallel_config.enable_eplb` is always false,
so we temporarily set it to true when EPLB is used.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
1. Fix "TypeError: get_attn_backend() remove variable", caused by
[Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122).
2. Adapt to [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027).
3. Fix "RuntimeError: device_allocator not a DeviceAllocator", caused by
[Replace memory related torch.cuda
APIs](https://github.com/vllm-project/vllm/pull/37031).
4. Adapt to [Support multiple KV groups in
OffloadingSpec](https://github.com/vllm-project/vllm/pull/36610), which
removed `self.offloaded_block_size` and changed `self.gpu_block_size`
from a scalar to a tuple of per-group block sizes, adding
`block_size_factor`.
5. Adapt to [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063), which
renamed `get_eagle3_aux_hidden_state_layers()` to
`get_eagle3_default_aux_hidden_state_layers()` and added a
`supports_eagle3()` guard before calling it.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main:
8a680463fa
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. On A2, the number of experts per card cannot exceed 16 when using
MC2. This PR fixes the A2 MoE communication method selection, which
chose an incorrect communication method when the number of model experts
exceeded 256. For example, when loading the PD-disaggregation D node
with Qwen3.5 series models on 16 A2 cards, the MC2 method was
incorrectly chosen.
2. Fixed the issue where the layerwise connector sends the kv-cache of
the MTP layer multiple times when `num_spec_tokens` > 1. Now, the
kv-cache is sent only when the MTP layer runs its forward pass for the
first time.
3. Fix the accuracy issue of Qwen3.5 when using MTP with PD
disaggregation. The cause is that `num_decode_draft_tokens` does not
account for `spec_tokens` not yet existing during the first inference
under PD disaggregation (`spec_tokens` are generated during the first
inference), while `spec_tokens_padding` is still added by
`recomputed_scheduler`. As a result, `gdn_metadata` incorrectly treats
the step as a prefill of length 2.
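The constraint in item 1 can be sketched as a simple eligibility check. The function name and the use of ceiling division for an uneven expert split are assumptions for illustration; the real selection logic in vllm-ascend considers more inputs.

```python
MAX_EXPERTS_PER_CARD_MC2 = 16  # limit stated above for A2 with MC2


def can_use_mc2_on_a2(num_experts: int, num_cards: int) -> bool:
    """Return True if the per-card expert count fits the MC2 limit."""
    # Ceiling division: assume the busiest card holds the rounded-up share.
    experts_per_card = -(-num_experts // num_cards)
    return experts_per_card <= MAX_EXPERTS_PER_CARD_MC2


# 16 cards can host at most 16 * 16 = 256 experts under this limit.
print(can_use_mc2_on_a2(256, 16))
print(can_use_mc2_on_a2(512, 16))
```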
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode is supported, and
MooncakeLayerwiseConnector is used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> \
--save_path <path/quant_weight> \
--anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
--calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot \
--trust_remote_code True --fa_quant --dynamic --anti_method m6
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This PR solves two problems.
Both occur in `FULL_DECODE_ONLY` mode, where `num_tokens` must be padded
to a value in `cudagraph_capture_sizes`.
1. The length of `seq_lens_list` in the drafter's `attn_metadata` is 1
shorter than expected, which raises a kernel exception and crashes vLLM.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is correctly padded to [4, 8, 12, 20], but
`seq_lens_list` = [5742, 4700, 7996] is not padded.
2. Although the length of `seq_lens_list` in the target's
`attn_metadata` is as expected in `FULL_DECODE_ONLY`, some data at the
end of the list is corrupted.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is correctly padded to [4, 8, 12, 20], but
`seq_lens_list` = [5742, 4700, 7996, 5738] is corrupted at the end of
the list.
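Both cases above come down to making `seq_lens_list` as long as the padded `actual_seq_lengths_q` while keeping only the real requests' values. The sketch below shows that idea; the helper name and the padding value of 0 are assumptions for illustration, not the values used in the patch.

```python
def pad_seq_lens(seq_lens, num_reqs, padded_num_reqs, pad_value=0):
    """Keep the first `num_reqs` entries and pad up to `padded_num_reqs`."""
    if len(seq_lens) < num_reqs:
        raise ValueError("seq_lens is shorter than the number of requests")
    if padded_num_reqs < num_reqs:
        raise ValueError("padded_num_reqs must be >= num_reqs")
    # Dropping everything past num_reqs discards any stale tail values.
    valid = list(seq_lens[:num_reqs])
    return valid + [pad_value] * (padded_num_reqs - num_reqs)


actual_seq_lengths_q = [4, 8, 12, 20]
target_len = len(actual_seq_lengths_q)

# Case 1: the drafter's list is one entry short.
drafter = pad_seq_lens([5742, 4700, 7996], 3, target_len)
# Case 2: the target's list has the right length but a stale tail.
target = pad_seq_lens([5742, 4700, 7996, 5738], 3, target_len)
print(drafter, target)
```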
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.
The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.
This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.
### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.
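As a hedged sketch, the option could be serialized for the `--additional-config` flag of `vllm serve` as follows. The string value `"true"` is taken from the description above; the surrounding variable names are illustrative only.

```python
import json

# Option name and value as described above.
additional_config = {"enable_sparse_c8": "true"}

# JSON payload suitable for passing after --additional-config.
cli_args = ["--additional-config", json.dumps(additional_config)]
print(" ".join(cli_args))
```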
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduces an issue that the patch doesn't work with recompute scheduler
enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.
Note:
1. `--mamba-cache-mode align` && PD disaggregation is still not
supported yet in vLLM v0.17.0(see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvements in the future.
3. `--mamba-cache-mode align` requires copying mamba states during
forward steps. vLLM implements this with a triton kernel. However, the
original version runs into some bugs on Ascend hardware, so we patch in
a new triton kernel to avoid them.
### Does this PR introduce _any_ user-facing change?
To use mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function(see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch native version, thus it might have trouble if
users can't use triton.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
This PR fixes the unified draft parallel feature.
1. In the draft model proposer, the target model can have more than one
attention layer, so the assertion on the layer number is removed.
2. After 0.17.0, the block size should be obtained through
`draft_attn_groups` instead of `attn_metadata_builder`.
3. `attn_update_stack_num_spec_norm` should not be run when unified
draft parallel is enabled.
### How was this patch tested?
Test pass with
`tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`,
which is already included in CI
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops
in pd-mix stage mainly.
Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.
The next steps are planned as follows:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the
Mamba cache for models like Qwen3.5. The previous implementation did not
correctly handle cases where the KV cache tensors have different data
types. This change ensures that slicing is performed based on byte
offsets before reshaping the tensors, which correctly handles
heterogeneous dtypes.
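The byte-offset idea can be sketched with the standard library alone. This is not the `reshape_kvcache_tensors` implementation: the function name, the `(format, count)` spec shape, and the dtypes are assumptions chosen to show why slicing in bytes before reinterpreting works for mixed dtypes.

```python
import struct


def split_by_byte_offsets(raw, specs):
    """Split one raw buffer into typed views.

    `specs` is a list of (struct format char, element count) pairs.
    Offsets are tracked in bytes, so each region may use its own dtype.
    """
    views = []
    offset = 0
    base = memoryview(raw)
    for fmt, count in specs:
        nbytes = struct.calcsize(fmt) * count
        # Slice in bytes first, then reinterpret the slice as the dtype.
        views.append(base[offset:offset + nbytes].cast(fmt))
        offset += nbytes
    return views


# 4 float32 values followed by 6 int16 values in one contiguous buffer.
raw = bytearray(4 * 4 + 2 * 6)
float_part, int_part = split_by_byte_offsets(raw, [("f", 4), ("h", 6)])
float_part[0] = 1.5
int_part[5] = 7
print(len(float_part), len(int_part))
```

Slicing by element index with a single assumed dtype would place the second region at the wrong address whenever the element sizes differ.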
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The merged graph of draft in `FULL` mode is broken now.
This pr solves it.
Also, `actual_seq_lengths_q` in `model_runner` is found redundant, so,
it is removed.
It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and
https://github.com/vllm-project/vllm-ascend/pull/7148.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Test code is shown as below:
```python
prompts = [
"1.Who are you?",
"2. Who are you?",
]
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
max_num_seqs=32,
# enforce_eager=True,
disable_log_stats=False,
distributed_executor_backend="mp",
gpu_memory_utilization=0.7,
async_scheduling=True,
speculative_config={
"enforce_eager": True,
"model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
"disable_padded_drafter_batch": False,
"method": "eagle3",
"num_speculative_tokens": 3,
},
compilation_config={
"cudagraph_mode": "FULL",
"cudagraph_num_of_warmups": 1,
},
max_model_len=4096,
enable_prefix_caching=False,
)
outputs = llm.generate(prompts, sampling_params)
```
The result before:
```text
File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia
graph_params.events[num_tokens].append(event)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 132
```
The result after:
```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 242
num_draft_tokens: 726
num_accepted_tokens: 156
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.07
```
We also test `FULL_DECODE_ONLY` mode.
The result is:
```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 244
num_draft_tokens: 732
num_accepted_tokens: 155
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.06
```
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Fix the issue where no exception is thrown when graph capture fails.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
Related to vLLM PR #34043, which deleted the function
`relax_for_mixed_batch_cudagraphs`. As a result, num_reqs no longer
equals the actual number of requests. Because the fia operator requires
that query_start_loc[-1] equals the total number of computed tokens,
deleting this function causes the ifa error.
In full graph mode, set num_reqs_paded = num_reqs to fix the error.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Fix acceptance and high-concurrency bug in eagle3 and cp enabled
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
tests and ut
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue introduced by the upstream PR
https://github.com/vllm-project/vllm/pull/32005
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
## Summary
- Move `_update_states_after_model_execute` call from after main model
sampling to after draft model execution
- This reordering reduces pipeline bubbles between main model and draft
model execution
- No accuracy impact - the state update operation is independent of
draft token proposal
## Performance Impact
Reduces idle time between main model and draft model execution stages,
improving overall MTP (Multi-Token Prediction) performance.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid
model, such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv
tensors, and SSM tensors must be contiguous. Therefore, this PR uses the
following solution to generate the KV cache:
tensor1: [(kv_padding), conv , ...]
tensor2: [k , ssm , ...]
tensor3: [v , (mamba_padding), ...]
Under this scheme, although some waste may occur, the tensors of all
caches are guaranteed to be contiguous.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The index-select operation `mrope_positions.gpu[:,
:total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU
synchronization, which blocks subsequent operator dispatch and causes
bubbles visible in Profiling.
This PR changes to full tensor copy
(`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync
point. The trade-off is a negligible increase in memory usage since
`mrope_positions.cpu` is a small tensor.
**Result:** ~2-3% TPOT improvement with the profiling bubbles
eliminated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified via Profiling that the CPU sync bubble is eliminated and TPOT
is reduced by 2-3%.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
The community has added a cleaning mechanism for the metadata after the
main model finishes running. The MTP layer should not clean the
metadata, and a new condition has been added to avoid cleaning it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
If an `eagle3` model without embed_tokens works with a `quarot` target
model, the acceptance rate drops.
This PR fixes it.
The related vLLM PR is https://github.com/vllm-project/vllm/pull/36225.
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
When eagle and cp are enabled at the same time, there is an error in
pcp_allgather due to hidden_states. This PR fixes this issue.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP) and resolve precision problems in shared_out when
FlashComm1 is enabled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
### What this PR does / why we need it?
**NOTE: This PR is re-pull of #7016 since ci mistakenly marked
unfinished pr as having passed.**
This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, now it should be ok to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, now it should be ok to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: Zetong Li <slippersss@126.com>
## Problem
When MTP is enabled, prefill requests with `prompt_tokens ==
num_spec_tokens + 1` are incorrectly classified as decode requests,
causing accuracy issues.
## Root Cause
The `uniform_decode` condition only checked:
- `max_num_scheduled_tokens == uniform_decode_query_len`
- `num_tokens == max_num_scheduled_tokens * num_reqs`
This is insufficient because a prefill request with specific prompt
length satisfies these conditions as well.
## Fix
Add `is_all_decode` check to ensure all requests have
`num_computed_tokens > 0` before classifying as uniform decode, since
decode requests must have computed at least one token.
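The fix can be sketched as a standalone predicate. The function signature and the per-request list inputs are assumptions for illustration; only the three conditions come from the description above.

```python
def is_uniform_decode(num_scheduled_tokens, num_computed_tokens,
                      uniform_decode_query_len):
    """Classify a batch as uniform decode.

    Both list arguments hold one entry per request.
    """
    if not num_scheduled_tokens:
        return False
    num_reqs = len(num_scheduled_tokens)
    max_num_scheduled_tokens = max(num_scheduled_tokens)
    num_tokens = sum(num_scheduled_tokens)
    shape_matches = (
        max_num_scheduled_tokens == uniform_decode_query_len
        and num_tokens == max_num_scheduled_tokens * num_reqs
    )
    # Added check: a decode request has computed at least one token.
    is_all_decode = all(n > 0 for n in num_computed_tokens)
    return shape_matches and is_all_decode


num_spec_tokens = 1
query_len = num_spec_tokens + 1

# A fresh prefill whose prompt length equals num_spec_tokens + 1.
prefill = is_uniform_decode([2], [0], query_len)
# Two ongoing decode requests.
decode = is_uniform_decode([2, 2], [10, 7], query_len)
print(prefill, decode)
```

Without the `is_all_decode` term, the first call would also satisfy the shape conditions and be treated as decode.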
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
To analyze the overhead of the dynamic eplb adjustment framework in
detail, we added the time consumption of the adjustment to the print
information in profiling mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
This pull request refactors the speculative decoding proposer interface
to align with upstream vLLM, removing the local `Proposer` interface and
renaming methods to `propose`.
This is the first step. In the future we should remove the class
register and just add few Ascend specified method once the arch in vLLM
is ready.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
When using the target model after rotational quantization, the
acceptance rate decreases because the fc weight of the draft model has
not undergone rotational quantization(issue: #6445). We fixed this issue
by performing rotation quantization on the fc weight of the draft model
in the same way as the main model when loading draft model.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
### What this PR does / why we need it?
ds3.2 pcp supports the combination of MTP and chunkprefill features.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \
--port 8001 \
--served-model-name auto \
--max-model-len 1024 \
--enforce-eager \
--tensor-parallel-size 2 \
--data-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--additional-config '{"enable_shared_expert_dp": true}'
```
#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{"role": "user", "content": "Hello. I have a question. Who are you?"}
],
"max_tokens": 10,
"temperature": 0.0,
"ignore_eos_token": true
}'
```
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
This was meant to be merged in #6536, but I accidentally restored a
commit. You can find the relevant discussion
[here](https://github.com/vllm-project/vllm-ascend/pull/6536#issuecomment-3882883471).
Since `self.pass_config.enable_sp` is forcibly set to `False` in the
[source
code](f176443446/vllm/config/compilation.py (L1066)),
this section will no longer verify whether the generated cudagraph
shapes are multiples of both `uniform_decode_query_len`
(`num_speculative_tokens + 1`) and `tensor_parallel_size`.
This PR enables the `num_speculative_tokens + 1` and
`tensor_parallel_size` check upfront. Therefore, it won't silently round
up the `cudagraph_size` and throw a cryptic error for the user.
A typical example of this cryptic error looks like:
```
ValueError: could not broadcast input array from shape (196,) into shape (14,)
```
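An upfront check of this kind could look like the sketch below. The function name and the error message are hypothetical; only the two divisibility requirements come from the description above.

```python
def validate_capture_sizes(capture_sizes, num_speculative_tokens,
                           tensor_parallel_size):
    """Fail fast if a capture size is not a multiple of both factors."""
    uniform_decode_query_len = num_speculative_tokens + 1
    invalid = [
        size for size in capture_sizes
        if size % uniform_decode_query_len != 0
        or size % tensor_parallel_size != 0
    ]
    if invalid:
        raise ValueError(
            f"cudagraph capture sizes {invalid} must be multiples of both "
            f"uniform_decode_query_len={uniform_decode_query_len} and "
            f"tensor_parallel_size={tensor_parallel_size}"
        )
    return list(capture_sizes)


print(validate_capture_sizes([8, 16], num_speculative_tokens=3,
                             tensor_parallel_size=4))
```

Raising a descriptive error at configuration time replaces the later broadcast failure with a message that names the offending sizes.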
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
All tests passed.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: guozr <guozr1997@hotmail.com>