Commit Graph

188 Commits

Author SHA1 Message Date
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure kernel_block_sizes is int for Eagle drafter
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
SILONG ZENG
eb92e7d50e [Bugfix] Restore balance scheduling patch for v0.17.0 (#7479)
### What this PR does / why we need it?
Restore previously introduced patches:
- https://github.com/vllm-project/vllm-ascend/pull/5212

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-19 20:12:57 +08:00
ichaoren
9d1452c74d [OPS]add split_qkv_tp_rmsnorm_rope ops (#7376)
### What this PR does / why we need it?
This PR introduces a new fused Triton kernel,
`split_qkv_tp_rmsnorm_rope` for Minimax-m2.5.

The implementation includes two Triton kernels:
1. `_split_qkv_and_compute_local_qk_var_kernel`: Splits the QKV input
and computes the local variance for RMSNorm.
2. `_apply_global_rmsnorm_kernel`: Applies global RMSNorm (considering
TP all-reduce for variance) and Neox-style RoPE.

### Does this PR introduce _any_ user-facing change?
Does not.

### How was this patch tested?
```python
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_tp_rmsnorm_rope.py
```
### Test Data
A3 TP16
基线  

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 267.55   | 25.5     | 38.85  |
| 4k/1k@bs4  | 542.4    | 26.51    | 148.06 |

测试线

| data       | TTFT(ms) | TPOT(ms) | TPS    |
|------------|---------:|---------:|-------:|
| 4k/1k@bs1  | 234.64   | 20.96    | 47.24  |
| 4k/1k@bs4  | 508.36   | 22.16    | 176.69 |


- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: xutianyi <xutianyi5@huawei.com>
Co-authored-by: xutianyi <xutianyi5@huawei.com>
2026-03-19 17:19:18 +08:00
Li Wang
83a4065b4b [CI] Add pre-commit check for patch logger (#7446)
### What this PR does / why we need it?
See https://github.com/vllm-project/vllm-ascend/pull/7402, pre-commit
hook will forbid init_logger(__name__) in vllm_ascend patch modules

- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-19 16:53:20 +08:00
pu-zhe
e8f7b2e3f1 [Refactor] [310p] Support Mamba Cache and support attn_head_size larger than 128 (#7372)
### What this PR does / why we need it?
1. Mamba Cache Support on 310P: Implemented logic to correctly
initialize and allocate KV cache for Mamba models on the 310P platform,
including handling of state tensors and page size alignment.
2. Increased Attention Head Size Support: Modified the attention backend
to support attn_head_size larger than 128 by dynamically selecting
appropriate kernel block sizes based on hardware limitations (e.g.,
block_size * head_size <= 16384).
3. Refactored KV Cache Allocation: Consolidated and improved the KV
cache allocation mechanism, moving from separate size calculation and
allocation steps to a unified _allocate_kv_cache_tensors method that
handles both Attention and Mamba specific cache structures.
4. Dynamic Mamba Config Patching: Introduced conditional loading of
Mamba configuration patches, specifically using patch_mamba_config_310
for the 310P platform to ensure platform-specific optimizations and
validations.
5. Reserve reasonable memory to allocate KV cache to avoid OOM issue
with default gpu_memory_utilization.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3.5 E2E test
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-19 09:16:22 +08:00
Nengjun Ma
8b79d4de52 Main2main upgrade to vllm 0317 afternoon (#7409)
### What this PR does / why we need it?

1.fix "TypeError: get_attn_backend() remove variable": [Refactor
`check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122)

2.fix [Rename `compile_ranges_split_points` to
`compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027)

3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace
memory related torch.cuda
APIs"](https://github.com/vllm-project/vllm/pull/37031)

4.fix [Support multiple KV groups in OffloadingSpec
](https://github.com/vllm-project/vllm/pull/36610) removed
self.offloaded_block_size and changed self.gpu_block_size from a scalar
to a tuple of per-group block sizes, adding block_size_factor.

5.fix [Consolidate
SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed
get_eagle3_aux_hidden_state_layers() to
get_eagle3_default_aux_hidden_state_layers() and added a
supports_eagle3() guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E


- vLLM version: v0.17.0
- vLLM main:
8a680463fa

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
2026-03-18 23:24:27 +08:00
Angazenn
ec34bf0062 [Misc]fix logger which does not take effects in patches (#7402)
### What this PR does / why we need it?
This PR fixes the logger initialization in patches so that the log info
can be displayed as expected.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.17.0
- vLLM main:
4497431df6

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-18 17:13:12 +08:00
zhangyiming
1c954ff264 [main2main] upgrade vllm to 0308 (#7213)
### What this PR does / why we need it?
Update main2main to vllm 0308.
breaks:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-03-18 09:24:43 +08:00
pichangping
3f39ac9c8d [Feature]Supports DSv3.1 PD separation and C8 quantization (#7222)
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>

### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints: 
1. Only the PD separation mode can be used and
MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path
<path/quant_weight>
--anti_dataset../common/deepseek_anti_prompt_50_v3_1.json
--calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot
--trust_remote_code True --fa_quant --dynamic --anti_method m6

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-16 22:49:05 +08:00
rjg-lyh
4d443b9228 [bugfix] restore pr-7029 and fix patch error (#7294)
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.

The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.

This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.

### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-16 15:39:42 +08:00
Mengqing Cao
0c299f79b9 Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)" (#7288)
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduces an issue that the patch doesn't work with recompute scheduler
enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-15 20:19:09 +08:00
Angazenn
ce5544bfc1 [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103)
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` && PD disaggregation is still not
supported yet in vLLM v0.17.0(see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvements in the future.
3. `--mamba-cache-mode align` requires to copy mamba states during
forward steps. vLLM uses a triton kernel to implement it. However, the
original version run into some bugs on Ascend hardwares. Thus we patch a
new triton kernel to avoid this bug.

### Does this PR introduce _any_ user-facing change?
To use mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function(see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch native version, thus it might have trouble if
users can't use triton.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-15 09:44:09 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
rjg-lyh
7ed9e9de69 [Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)
### What this PR does / why we need it?
This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops
in pd-mix stage mainly.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.

The next steps are planned in two parts:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
Ronald
c980e68d40 [Feature] support aclgraph for model runner v2 (#7110)
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2, please see RFC
#5208. The PR contains these modifications:
- adapt to newest commit of vllm main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for main model. 

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-13 09:11:46 +08:00
wangbj127
0c659e91ed [MTP][Bugfix] Fix GLM5-W8A8 precision issues caused by rotary quant MTP weights (#7139)
### What this PR does / why we need it?
When GLM5 target model uses rotary quant, the final hidden states passes
to MTP need to do an extra rotary.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
2026-03-12 20:01:24 +08:00
drslark
fb0d6dd175 [main][bugfix] Fixed the problem of speculative decoding in FULL mode (#7148)
### What this PR does / why we need it?

Fixed the error of speculative decoding in FULL mode when `num_spec + 1`
not in `cudagraph_capture_sizes`.

Now, we can run speculative decoding in FULL mode, but with drafter as
eager.

It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 .

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Test code is shown as below:

```python
prompts = [
    "1.Who are you?",
    "2. Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
    model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_num_seqs=32,
    # enforce_eager=True,
    disable_log_stats=False,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    async_scheduling=True,

    speculative_config={
        "enforce_eager": True,
        "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
        "disable_padded_drafter_batch": False,
        "method": "eagle3",
        "num_speculative_tokens": 2,
    },
    
    compilation_config={
        "cudagraph_mode": "FULL",
        "cudagraph_num_of_warmups": 1,
    },

    max_model_len=4096, 
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
```

The result before:

```text
   File "/vllm-workspace/vllm/vllm/v1/cudagraph_dispatcher.py", line 140, in _create_padded_batch_descriptor
     assert num_tokens_padded % uniform_decode_query_len == 0
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 AssertionError
```

The result after:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 249
num_draft_tokens: 498
num_accepted_tokens: 149
mean acceptance length: 1.60
--------------------------------------------------
acceptance at token 0: 0.43
acceptance at token 1: 0.17
```

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-12 14:51:12 +08:00
SparrowMu
54668e73c5 [Model] Support Minimax-m2.5 on NPU (#7105)
### What this PR does / why we need it?

Initial version to support minimax-m2.5 on vllm-ascend. 
This commit coverting original fp8 weight to a quantilized bf16 to
support Minimax-m2.5 on NPU.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

### Test Report
Self tested precision summary, where the official precision score of
AIME2025 is 86.3
<img width="426" height="84" alt="image"
src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a"
/>

---------

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-11 00:12:02 +08:00
zxr2333
239683c7a6 [P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022)
### What this PR does / why we need it?
Mooncake Layerwise Connector supports hybrid attention manager with
multiple kvcache groups.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By CI.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-10 23:59:20 +08:00
pppeng
0f289fa2a8 Add patch_qwen3_5 for triton ops fused_recurrent_gated_delta_rule (#7109)
### What this PR does / why we need it?

The ops `torch_npu.npu_recurrent_gated_delta_rule` currently does not
support `ssm_state` inputs in float32 format,
we temporarily retain the _forward_core implementation with triton for
Qwen3_5

---------

Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
2026-03-10 23:28:58 +08:00
ZT-AIA
ee5347e824 [qwen3 next ]add ascend c casual_conv1d_fn (#6661)
### What this PR does / why we need it?
add ascend c casual_conv1d_fn

- vLLM version: v0.15.0
- vLLM main:
13397841ab
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-09 23:29:49 +08:00
tanhaoan333
57c554a23f [bugfix]Fix parameter ordering bug in _merge_multimodal_embeddings (#7068)
### What this PR does / why we need it?

This PR fixes a bug in the `_merge_multimodal_embeddings` function where
the parameter order was incorrect. The `multimodal_embeddings` and
`is_multimodal` parameters were swapped, which would lead to runtime
errors when the function is called with positional arguments.

This change corrects the function signature to align with its expected
usage, ensuring that multimodal embeddings are correctly merged.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix for an internal utility function and has no
user-facing impact.

### How was this patch tested?

The correctness of this fix is validated by existing tests for
multimodal functionality. With the incorrect function signature, these
tests would fail due to argument type mismatches. CI passing confirms
the fix is effective.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
2026-03-09 16:05:52 +08:00
zxr2333
d39d80830c [KVCache]Qwen3.5 supports contiguous tensor hybrid-attn kv-cache (#6887)
### What this PR does / why we need it?
Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid
model, such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv
tensors, and SSM tensors must be contiguous. Therefore, this PR uses the
following solution to generate the KV cache:
tensor1: [(kv_padding), conv                      , ...]
tensor2: [k                   , ssm                       , ...]
tensor3: [v                   , (mamba_padding), ...]
Under this scheme, although some waste may occur, the tensors of all
caches are guaranteed to be contiguous.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-09 15:28:40 +08:00
LeeWenquan
65eae6de7b Add Ascend Ops recurrent_gated_delta_rule (#6725)
### What this PR does / why we need it?
Change recurrent_gated_delta_rule ops from triton to ascend C version
for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-03-09 14:14:14 +08:00
drslark
6a7115fa0d [main][feature] Support quarot for eagle3 without embedding (#7038)
### What this PR does / why we need it?
If some `eagle3` model without embed_tokens works with `quarot` target
model, the acceptence rate will drop.
We solve it in this PR.
The relative vllm pr is https://github.com/vllm-project/vllm/pull/36225.

- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-09 10:43:06 +08:00
ZhaoJiangJiang
a51d6366b9 [Bugfix] Qwen3Next support FlashComm1 (#6830)
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP)
and resolve precision problems in shared_out when both FlashComm1 is
enabled.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
2026-03-06 17:14:08 +08:00
SILONG ZENG
bd571cf6d6 [Main2Main] Upgrade vLLM to 0303 (#6944)
### What this PR does / why we need it?
break:
- https://github.com/vllm-project/vllm/pull/34102 
Disable_full param replaced with valid_modes/invalid_modes API
- https://github.com/vllm-project/vllm/pull/35503
Now must return float compilation_time
- https://github.com/vllm-project/vllm/pull/35564
New sequence_lengths param added
- https://github.com/vllm-project/vllm/pull/33807
A check was performed (if runner_backend != "auto")
- https://github.com/vllm-project/vllm/pull/34861
`BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to
check process group state
- https://github.com/vllm-project/vllm/pull/35274

**Important change:**
- https://github.com/vllm-project/vllm/pull/28672

`matcher_utils` directly accesses `torch.ops._C.*` during the import
phase. In the Ascend environment, some unregistered ops trigger
`AttributeError`, causing e2e initialization failure.

https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323

https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29

This PR adds temporary compatibility placeholders (rms_norm,
fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant,
silu_and_mul) to
`vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to
ensure no crashes during the import phase. Upstream repairs will be
considered later.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
2026-03-06 09:08:52 +08:00
songjianquan
43c8da3574 [Feat]fused_qkvzba_split_reshape supports token number greater than 65536 (#6740)
### What this PR does / why we need it?

This pull request optimizes the fused_qkvzba_split_reshape_cat Triton
kernel for Qwen3-Next GatedDeltaNet model and removes the previous
conditional restrictions in the forward pass.
Key changes:
1. Refactored Triton kernel implementation: The
fused_qkvzba_split_reshape_cat_kernel has been optimized with a new
loop-based approach that supports arbitrary num_v_heads / num_k_heads
ratios and batch sizes. The kernel now uses configurable ROWS_PER_ITER
for better memory utilization .
2. The optimized kernel now handles all scenarios directly without
requiring a fallback path using fix_query_key_value_ordering and
torch.cat.

### Does this PR introduce _any_ user-facing change?
No. This is an internal optimization of the Triton kernel implementation
and does not introduce any user-facing changes.

### How was this patch tested?
CI is expected to pass with existing tests.

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: songjianquan <songjianquan1@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-05 14:41:38 +08:00
zhaomingyu13
52d9086f64 [Bugfix] Fix the acceptance rates dorp issue when applying eagle3 to QuaRot model (#6914)
### What this PR does / why we need it?
When using the target model after rotational quantization, the
acceptance rate decreases because the fc weight of the draft model has
not undergone rotational quantization(issue: #6445). We fixed this issue
by performing rotation quantization on the fc weight of the draft model
in the same way as the main model when loading draft model.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-04 11:29:49 +08:00
Canlin Guo
e4458b2d2b [Main2Main] Upgrade vLLM to 0226 (#6813)
### What this PR does / why we need it?

Breaking:
1. https://github.com/vllm-project/vllm/pull/33452
2. https://github.com/vllm-project/vllm/pull/33451
3. https://github.com/vllm-project/vllm/pull/32567
4. https://github.com/vllm-project/vllm/pull/32344

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-02-27 16:05:21 +08:00
wangxiyuan
532f7a82f2 [Patch][Misc] Cleanup and update patches (#6802)
### What this PR does / why we need it?

This PR performs a cleanup and update of the patch mechanism in
`vllm-ascend`.

- Removes several obsolete patches: `patch_deepseek.py`.
- Updates the central patch documentation in
`vllm_ascend/patch/__init__.py` to reflect these removals and additions,
re-numbering and re-organizing the patch list for better clarity.

### Does this PR introduce _any_ user-facing change?

No. These are internal changes to the patching mechanism and should not
affect users.

### How was this patch tested?

CI passed with new added/existing test.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 14:45:33 +08:00
Canlin Guo
9f8b84e5fc [Misc] Drop patch_rope.py (#6291)
### What this PR does / why we need it?

Part of #5304.

We have align with vLLM's latest change for `RotaryEmbeddingBase`. Don't
need this patch anymore.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-02-26 14:04:53 +08:00
Li-Yongwen
2870f7c8ad [Feat] Support routing replay (#6696)
### What this PR does / why we need it?

[Feat] Support routing replay
same as https://github.com/vllm-project/vllm-ascend/pull/6666
resubmit  because of DOC failure

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 10:22:47 +08:00
LoganJane
ed051737e9 [Bugfix] Support Kimi-K2.5 models (#6755)
### What this PR does / why we need it?
This PR supports the Kimi-K2.5 models on the NPU of bf16 and w4a8
weights.
The corresponding PR in the vllm community has been merged:
https://github.com/vllm-project/vllm/pull/34501

### Does this PR introduce _any_ user-facing change?
- No.

### How was this patch tested?
We test the Kimi-K2.5 weights. The weights path:
https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8
Successfully ran on 910B NPU using vllm-ascend by the w4a8 weights.

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: LoganJane <LoganJane73@hotmail.com>
2026-02-25 14:51:46 +08:00
Ronald
f1ffb5fb19 [Feature] adapt to uva buffer and main2main (#6657)
### What this PR does / why we need it?
vllm model runner v2 use uva buffer to prepare input data, but npu
doesn't support uva yet, this pr implement a uvawrapper class to mimic
gpu's uva backend. what's more, this pr make some modifications to adapt
to the newer main branch.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM main:
13397841ab

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-02-12 10:36:31 +08:00
iiiklw
a0315f6697 [npugraph_ex]enable npugraph_ex by default (#6664)
### What this PR does / why we need it?

This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).


### Does this PR introduce _any_ user-facing change?

Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.

### How was this patch tested?

CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.

Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
2026-02-12 08:44:06 +08:00
Canlin Guo
b7aa511daa [Patch] Remove the patch of MiniCPM (#5975)
### What this PR does / why we need it?

Part of #5304.

After https://github.com/vllm-project/vllm/pull/32523 merge, we could
remove the patch of `MiniCPMAttention`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Test it locally.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-02-09 14:07:44 +08:00
SILONG ZENG
19b5d44ea8 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10) (#6173)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|`vllm_ascend/ops/layer_shard_linear.py`|
|`vllm_ascend/ops/linear.py`|
|`vllm_ascend/ops/linear_op.py`|
|`vllm_ascend/worker/worker.py`|
| ` vllm_ascend/patch/worker/patch_bert.py` |
| ` vllm_ascend/patch/worker/patch_deepseek.py` |
| ` vllm_ascend/patch/worker/patch_distributed.py` |
| ` vllm_ascend/patch/worker/patch_module.py` |
| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` |
| ` vllm_ascend/patch/worker/patch_qwen3_next.py` |
| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` |
| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` |
| ` vllm_ascend/patch/worker/patch_rope.py` |
| ` vllm_ascend/patch/worker/patch_triton.py` |
| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` |
| ` vllm_ascend/patch/worker/patch_v2_egale.py` |
|` vllm_ascend/worker/npu_input_batch.py`|
|` vllm_ascend/worker/v2/aclgraph_utils.py`|
|` vllm_ascend/worker/v2/attn_utils.py`|
|` vllm_ascend/worker/v2/model_runner.py`|
|` vllm_ascend/worker/v2/sample/gumbel.py`|
|` vllm_ascend/worker/v2/sample/penalties.py`|
|` vllm_ascend/worker/v2/sample/sampler.py`|
|` vllm_ascend/worker/v2/spec_decode/__init__.py`|
|` vllm_ascend/worker/v2/spec_decode/eagle.py`|
|` vllm_ascend/worker/v2/states.py`|
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-06 15:35:06 +08:00
meihanc
922e5c163b [main2main] upgrade vllm main 0202 (#6560)
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required
positional argument: 'is_sequence_parallel'` due to
https://github.com/vllm-project/vllm/pull/32567
2. Fix ` TypeError: '>' not supported between instances of 'MagicMock'
and 'int'` due to https://github.com/vllm-project/vllm/pull/33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with
abstract methods forward_mha, forward_mqa` and AttributeError: 'bool'
object has no attribute 'process_weights_after_loading' due to
https://github.com/vllm-project/vllm/pull/33284
4. Fix `'AscendSharedFusedMoE' object has no attribute
'_routed_input_transform'`due to
https://github.com/vllm-project/vllm/pull/32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument
'num_active_loras'` due to
https://github.com/vllm-project/vllm/pull/32005
6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'`
due to https://github.com/vllm-project/vllm/pull/27492
7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward,
vllm.moe_forward_shared due to
https://github.com/vllm-project/vllm/pull/33184
8. Add patch to fix the problem "got multiple values for keyword
argument 'add_special_tokens'" due to
https://github.com/vllm-project/vllm/pull/32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 19:31:17 +08:00
LeeWenquan
b1de6cbb31 [Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047)
### What this PR does / why we need it?
Fix a bug in the repo and add a test case for MTP + Full Decode Only +
Qwen3Next.
The _build_dummy_attn_metadata function in NPUModelRunner seems losed a
query_star_loc.copy_to_gpu operation, which will lead to difference
between query_start_loc and query_start_loc_cpu, and they are required
to be same in MTP + Full Decode Only + Qwen3Next case.

Before this pr:
`self.query_start_loc = [0, 0, 0, 0, ... , 0]
self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]`

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-02-03 14:26:21 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
wangxiyuan
7a5b345dc4 [Misc] Drop deepseek patch (#6288)
We patched deepseek before since we notice asserterror raised by
transformers. Now due to transformers upgrade, the patch looks useless
now. Let's remove it.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-29 14:45:50 +08:00
meihanc
fea197ad50 [Main2Main] Upgrade vllm commit to 0123 (#6169)
### What this PR does / why we need it?
1.  Upgrade vllm commit to: 0115
(8471b27df97c3eb79f891802fc0e858f8f7ac6a0)
Modify import paths due to the refactors:
https://github.com/vllm-project/vllm/pull/32245
https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913
2. Upgrade vllm commit to: 0119
(9a1f16da1e423ede2c2f52a9850cbfbb39cefe96)
Fix `WorkerProc.__init__() missing 1 required positional argument:
'is_driver_worker'` due to
https://github.com/vllm-project/vllm/pull/28506
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569
3. Upgrade vllm commit to:
0120(148117ea2e689cd43df4be6892671a17cdae5833)
1. Add `skip_compiled` param in `set_forward_context` due to
https://github.com/vllm-project/vllm/pull/30385
2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to
https://github.com/vllm-project/vllm/pull/24322
change `self.max_num_tokens =
vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size`
3. Modify UT import paths due to the
refactors:https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946
4. Upgrade vllm commit to:
0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9)
1. vLLM switched `uses_mrope` from target to draft model config, making
`positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's
direct self.positions access and tests missing
`draft_model_config.uses_mrope`.
https://github.com/vllm-project/vllm/pull/32048
2. Moved bs_to_padded_graph_size from CompilationConfig to
CudagraphDispatcher due to the refactor
https://github.com/vllm-project/vllm/pull/30143
3. Remove unused `maybe_setup_kv_connector` due to
https://github.com/vllm-project/vllm/pull/32077
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834
6. Upgrade vllm commit to:
0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5)
Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig
due to https://github.com/vllm-project/vllm/pull/32414
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054
8. Upgrade vllm commit to:
0123(dc917cceb877dfd13f98c538c4c96158047d98bd)
Setting temperature=0.0 due to the removal of the default temperature
value in https://github.com/vllm-project/vllm/pull/32723
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wjunLu <wjunlu217@gmail.com>
2026-01-27 08:44:36 +08:00
Canlin Guo
2d3b8a51f9 [Patch] Remove the patch of ECExampleConnector (#5976)
### What this PR does / why we need it?
Part of #5304.

https://github.com/vllm-project/vllm/pull/30225 has been merged now. We
don't need this patch anymore.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-01-26 17:10:03 +08:00
SILONG ZENG
4e53c1d900 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6) (#6001)
### What this PR does / why we need it?
| File Path |
| :--- |
| ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` |
| ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` |
| ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` |
| ` vllm_ascend/eplb/core/eplb_utils.py` |
| ` vllm_ascend/eplb/core/eplb_worker.py` |
| ` vllm_ascend/eplb/core/policy/policy_abstract.py` |
| ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` |
| ` vllm_ascend/eplb/core/policy/policy_factory.py` |
| ` vllm_ascend/eplb/core/policy/policy_flashlb.py` |
| ` vllm_ascend/eplb/core/policy/policy_random.py` |
| ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` |
| ` vllm_ascend/eplb/eplb_updator.py` |
| ` vllm_ascend/eplb/utils.py` |
| ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` |
| ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` |
| ` vllm_ascend/model_loader/netloader/interaction/elastic.py` |
| ` vllm_ascend/model_loader/netloader/load.py` |
| ` vllm_ascend/model_loader/netloader/netloader.py` |
| ` vllm_ascend/model_loader/netloader/utils.py` |
| ` vllm_ascend/patch/platform/__init__.py` |
| ` vllm_ascend/patch/platform/patch_balance_schedule.py` |
| ` vllm_ascend/patch/platform/patch_ec_connector.py` |
| ` vllm_ascend/patch/platform/patch_mamba_config.py` |
| ` vllm_ascend/patch/platform/patch_multiproc_executor.py` |
| ` vllm_ascend/patch/platform/patch_sched_yield.py` |


- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
zhangxinyuehfad
819a4459ce Drop vLLM 0.13.0 support (#6069)
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 09:45:08 +08:00
Icey
c929bd1e8d [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (#5034)
This PR add `MatmulAllreduceRmsnorm` operator and introduces a graph
fusion pass for `matmul_allreduce_rmsnorm` operations. The
implementation includes a new configuration flag, a pattern matching
pass using `torch._inductor.pattern_matcher`.

Co-authored-by: Trunrain [270250579@qq.com](mailto:270250579@qq.com)

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: tongrunze <t00574058@china.huawei.com>
2026-01-19 09:28:07 +08:00
Shaoxu Cheng
1ffca8673f [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776)
### What this PR does / why we need it?
Add basic 310p support. Only dense models work with eager mode now.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
2026-01-17 11:49:18 +08:00
wjunLu
c11a05c4e1 [Main2Main] Upgrade vllm commit to 0113 (#5839)
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)

- Modify import paths due to the refactors
https://github.com/vllm-project/vllm/pull/31916
https://github.com/vllm-project/vllm/pull/32054

- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional
arguments but 3 were given` due to
https://github.com/vllm-project/vllm/pull/24498

- Skip the async-scheduling tests in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never
verified
https://github.com/vllm-project/vllm/pull/31998

- Skip some pooling tests, which are caused by
https://github.com/vllm-project/vllm/pull/32148
where vllm is also failed
https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4

We will reopen those tests when main2main reachs
https://github.com/vllm-project/vllm/pull/32243

- Skip some cases in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are
broken by
https://github.com/vllm-project/vllm/pull/32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 09:48:53 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer realtaed codes into a dedicated folder, which has already
been done in vLLM

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00