1665 Commits

Author SHA1 Message Date
Qiu
a88937f5cb [bugfix](cp) replace None with zeros/inf tensor to avoid TypeError (#5837)
### What this PR does / why we need it?
When there is no KV cache on some devices, the `_compute_prefill_context`
function returns `None`, which is unexpected. This PR replaces `None`
with full zeros/-inf tensors to avoid a TypeError.
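
For illustration, a minimal sketch of the pattern (the function name and shapes are hypothetical, not the actual vllm-ascend code): when a device holds no KV cache, return placeholder tensors instead of `None` so the caller's merge step never sees a `NoneType`.

```python
import torch

def compute_prefill_context(query, kv_cache):
    # Sketch only: when this device has no KV cache, return placeholder
    # tensors instead of None so downstream reduction does not hit a TypeError.
    if kv_cache is None or kv_cache.numel() == 0:
        # The attention output contributes nothing; a log-sum-exp of -inf
        # means "no weight" when partial results are merged across devices.
        attn_out = torch.zeros_like(query)
        lse = torch.full(query.shape[:-1], float("-inf"),
                         dtype=torch.float32, device=query.device)
        return attn_out, lse
    ...  # normal chunked-prefill context computation
```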

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-14 20:57:48 +08:00
zhaomingyu13
01805fbd7d Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519)"(#5902)
This reverts commit d886b81971; it breaks the P/D (prefill/decode disaggregation) functionality.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-01-14 20:55:10 +08:00
LICO67373
2a6d95c389 [Cleanup] Remove dead code make_attention_mask function (#5818)
### What this PR does / why we need it?

This PR removes the unused `make_attention_mask` function from
`vllm_ascend/worker/v2/attn_utils.py`.

**Why it's dead code:**
- After PR #4870 (attention mask unification refactor), attention mask
generation has been centralized in the `AttentionMaskBuilder` singleton
class
- The mask is now generated directly by metadata builders when needed
(e.g., `AscendAttentionMetadataBuilder`, `AscendMLAMetadataBuilder`)
- The `make_attention_mask` function is no longer called anywhere in the
codebase
- The function's parameters (including `attn_mask` and `spec_attn_mask`)
were also removed from `build_attn_metadata` in the same refactor

**Changes:**
- Remove `make_attention_mask` function (24 lines) from
`vllm_ascend/worker/v2/attn_utils.py`

### Does this PR introduce _any_ user-facing change?

No. This is a code cleanup that removes dead code. No user-facing
behavior changes.

### How was this patch tested?

- Verified that `make_attention_mask` is not called anywhere in the
codebase (via `grep`)
- CI tests pass to ensure no regressions
- The function has been unused since PR #4870 was merged
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2026-01-14 16:52:51 +08:00
Ronald
e20813f441 [Feature] implement eagle spec decoding for model runner v2 (#5840)
### What this PR does / why we need it?
This PR implements eagle spec decoding for model runner v2; please see
RFC https://github.com/vllm-project/vllm-ascend/issues/5208

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: v0.13.0

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-01-14 09:18:05 +08:00
LHXuuu
0415e694cd [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718)
### What this PR does / why we need it?
When using the LLM Compressor quantization tool from the vLLM community
to generate quantized weights, the vLLM Ascend engine needs to be
adapted to support the compressed-tensors quantization format.

1. Support MoE model W8A8 Int8 dynamic weights.
2. Specify the W4A16 quantization configuration.

Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-01-14 09:17:26 +08:00
LI SHENGYONG
ecf2fa482e [EPLB][Bugfix] Get expert map from layers (#5817)
### What this PR does / why we need it?
The initialization method of expert_map used by the eplb module is
different from that used by the fused_moe module. This PR deletes the
expert_map initialization method used by the eplb module to make the
initialization methods consistent.

#### Before bugfix

```text
self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63], device='npu:1', dtype=torch.int32)

self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32)
```

### How was this patch tested?

#### qwen3-235B-w8a8 aime
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-14 09:16:51 +08:00
drslark
48ec97821a [Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816)
### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp.

The problem is described in
https://github.com/vllm-project/vllm-ascend/issues/5825.

It also adds a much more precise way to determine whether the drafter
should use `sp` or not.

Also, it changes the drafter's `eager` mode to be truly eager on the
frontend to avoid an `fx-graph` problem.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

For simplicity, we test it as in
https://github.com/vllm-project/vllm-ascend/issues/5825.

And we get the same result as `eagle3` with `sp` disabled.

```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```

* vLLM version: v0.13.0
* vLLM main:
2f4e6548ef

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-14 09:00:37 +08:00
liziyu
e1bed43cff [P/D] bugfix for p node force free request (#5431)
### What this PR does / why we need it?
Fix the bug where the P-node's scheduler deadlocks after it force-frees a
request due to a timeout and then receives the completed KV cache pulled
by the D-node again. Fixed by adding a list to record all requests.
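
An illustrative sketch of the tracking idea (class and method names are made up, not the actual connector code): remember which requests have been force-freed, and drop any late KV-cache pull for them instead of letting it wedge the scheduler.

```python
class PNodeRequestTracker:
    """Sketch only: record requests the prefill node has force-freed."""

    def __init__(self) -> None:
        self._force_freed: set[str] = set()

    def mark_force_freed(self, request_id: str) -> None:
        # Called when the timeout fires and the request is freed on the P-node.
        self._force_freed.add(request_id)

    def should_ignore_pull(self, request_id: str) -> bool:
        # A completed pull from the D-node for a force-freed request is stale.
        return request_id in self._force_freed
```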


- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-14 08:51:31 +08:00
zhangxinyuehfad
f7b904641e [Main2Main] Upgrade vllm commit to 0109 (#5752)
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. Remove `init_cached_hf_modules` due to
https://github.com/vllm-project/vllm/pull/31786
2. Fix the spec_decode e2e test, which was broken by
https://github.com/vllm-project/vllm/pull/29821
3. Fix `vllm.v1.attention.backends.utils` due to
https://github.com/vllm-project/vllm/pull/31891
4. Fix `self.seq_lens - query_lens` on the same device due to
https://github.com/vllm-project/vllm/pull/31773
5. Skip the model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-13 19:14:43 +08:00
liziyu
eed9e366a7 [Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (#5846)
### What this PR does / why we need it?
Fix the layerwise connector when the decoder TP size is greater than the
number of KV heads. In this case, the prefiller should push the KV cache
to all decoder NPUs.
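
A rough sketch of the routing idea, assuming KV heads are replicated evenly across decode ranks (names and the exact mapping are illustrative, not the connector's actual logic):

```python
def decode_target_ranks(prefill_kv_head: int, num_kv_heads: int,
                        decode_tp_size: int) -> list[int]:
    # When decode TP exceeds the number of KV heads, each head is replicated
    # on several decode ranks, so the prefiller must push its KV block to
    # every rank holding a replica, not just one.
    if decode_tp_size <= num_kv_heads:
        ranks_per_head = 1
    else:
        ranks_per_head = decode_tp_size // num_kv_heads
    first = prefill_kv_head * ranks_per_head
    return list(range(first, first + ranks_per_head))
```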

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-13 17:30:33 +08:00
Shanshan Shen
d350c2ada6 [CustomOp][Perf] Merge Q/K split to simplify AscendApplyRotaryEmb for better performance (#5799)
### What this PR does / why we need it?
- Use upstream util functions (`_pre_process()` and `_post_process()`) to
reduce redundant code. (Find more details at
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L184-L213)
- Merge Q/K split to simplify the logic of calling
`torch_npu.npu_rotary_mul()` for better performance (TPOT has been
reduced by **6.22%**).

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
####  Functional test

Launch the server:

```bash
export VLLM_USE_MODELSCOPE=True
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```

Query the server:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustrate? How does it look?"}
            ]}
        ],
        "max_tokens": 100
    }'
```

Output:

```
{"id":"chatcmpl-b2911ab6989ef098","object":"chat.completion","created":1768202780,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design, with \"TONGYI\" being more prominent and \"Qwen\" being slightly smaller and positioned below it. The font style is modern and clean, with \"TONGYI\" having a slightly bolder appearance compared to \"Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":178,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

####  Benchmark

Run:

```bash
export VLLM_USE_MODELSCOPE=False
export HF_ENDPOINT="https://hf-mirror.com"
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 10 \
--no-stream
```

Before this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  5.96      
Total input tokens:                      7191      
Total generated tokens:                  996       
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         167.05    
Peak output token throughput (tok/s):    261.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          1373.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          964.43    
Median TTFT (ms):                        858.48    
P99 TTFT (ms):                           1691.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.08     
Median TPOT (ms):                        40.86     
P99 TPOT (ms):                           241.30    
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.16     
Median ITL (ms):                         33.61     
P99 ITL (ms):                            250.30    
==================================================
```

After this PR:

```
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Benchmark duration (s):                  5.71      
Total input tokens:                      7191      
Total generated tokens:                  996       
Request throughput (req/s):              1.75      
Output token throughput (tok/s):         174.45    
Peak output token throughput (tok/s):    279.00    
Peak concurrent requests:                10.00     
Total token throughput (tok/s):          1433.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          992.14    
Median TTFT (ms):                        938.30    
P99 TTFT (ms):                           1728.71   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          59.16     
Median TPOT (ms):                        37.65     
P99 TPOT (ms):                           234.89    
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.55     
Median ITL (ms):                         30.73     
P99 ITL (ms):                            170.72    
==================================================
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-13 15:47:23 +08:00
lhchg
4b679984de enable ep32 for dispatch_ffn_combine (#5787)
### What this PR does / why we need it?
Support enabling ep32 for dispatch_ffn_combine.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Single operator tested

---------

Signed-off-by: lhchg <lhao_cheng@163.com>
2026-01-13 14:35:52 +08:00
weijinqian0
1ccb9acd9a [Refactor] Provide a framework to accommodate operators for different hardware devices (#5735)
Comes from: https://github.com/vllm-project/vllm-ascend/issues/5463

Reason:

As hardware versions iterate, the operators may go through many
revisions, which can lead to short-term compatibility differences.
Therefore, an intermediate adaptation layer is provided to accommodate
these short-term differences in operators.
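
As an illustration, a registry-style sketch of such an adaptation layer (names are made up, not the actual vllm-ascend API): each operator registers one implementation per SoC version, and callers resolve the best match at runtime, falling back to a generic default.

```python
from typing import Callable, Dict

_OP_REGISTRY: Dict[str, Dict[str, Callable]] = {}

def register_op(name: str, soc_version: str):
    # Decorator that records an implementation for a given hardware version.
    def wrap(fn: Callable) -> Callable:
        _OP_REGISTRY.setdefault(name, {})[soc_version] = fn
        return fn
    return wrap

def get_op(name: str, soc_version: str) -> Callable:
    # Resolve the version-specific implementation, or fall back to "default".
    impls = _OP_REGISTRY.get(name, {})
    return impls.get(soc_version, impls["default"])

@register_op("rotary_embedding", "ascend910b")
def rotary_embedding_910b(*args, **kwargs):
    ...  # version-specific kernel call

@register_op("rotary_embedding", "default")
def rotary_embedding_default(*args, **kwargs):
    ...  # generic fallback implementation
```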


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2026-01-13 09:53:26 +08:00
Rozwel-dx
8d571286dd [Refactor] Modify the binding logic to allocate CPU cores for each NPU card (#5555)
### What this PR does / why we need it?
Modify the binding logic to allocate CPU cores for each NPU card based
on NUMA affinity, while isolating acl_thread/release_thread and other
processes to prevent mutual interference.
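
A rough sketch of the idea, assuming a simple contiguous core layout and ignoring the NUMA topology lookup the real code performs (names are illustrative):

```python
import os

def bind_worker_to_cpu_cores(npu_rank: int, cores_per_npu: int = 8) -> set:
    # Give each NPU card a disjoint slice of CPU cores so acl_thread,
    # release_thread and worker processes do not interfere with each other.
    start = npu_rank * cores_per_npu
    cores = set(range(start, start + cores_per_npu))
    os.sched_setaffinity(0, cores)  # 0 == the calling process
    return cores
```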

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

c85cc045f8

Signed-off-by: rowzwel_dx <1392851715@qq.com>
- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: Rozwel-dx <1392851715@qq.com>
2026-01-13 09:21:28 +08:00
zhaomingyu13
d886b81971 [BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519)
### What this PR does / why we need it?
According to the official documentation, the parameter
`"draft_tensor_parallel_size": 1` is supposed to apply to the Eagle3
model. However, actual debugging showed that the tensor parallel size
(tp) of the Eagle model was the same as that of the target model; the tp
setting for the draft model did not take effect as expected.

**Note:** This feature has not been combined and tested with `sp`
and `dp`. It will be adapted later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```python
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            tensor_parallel_size=4,
            gpu_memory_utilization=0.9,
            enforce_eager=True,
            speculative_config={
                "method": "eagle3",
                "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
                "draft_tensor_parallel_size": 1,
                "num_speculative_tokens": 3,
            },
        )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Fixes vllm-project/vllm#31345

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarksblood@qq.com>
2026-01-13 09:14:30 +08:00
shiyuan680
7af3b880c1 support triton of mrope (#5664)
### What this PR does / why we need it?
This PR supports using a Triton mrope like cuda_forward, whose
performance is equal to the AscendC ops.
The Triton op requires CANN 8.5.0.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested Qwen3-VL-235B accuracy on TextVQA:
- native: 81.82
- NPU Triton: 81.58
- CUDA Triton: 81.52
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: shiyuan680 <917935075@qq.com>
2026-01-13 09:13:51 +08:00
DreamerLeader
db7cf9b0ca [bugfix] A2 Environment Pooling for Memcache Compatibility (#5601)
### What this PR does / why we need it?
When running memcache in the A2 environment, the logic for registering
memory needs to be added. Additionally, there is a link establishment
conflict between memcache and HCCS during initialization in A2, so the
link should be established in advance.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Co-authored-by: fangjianwei <f30058701@china.huawei.com>
2026-01-13 09:07:38 +08:00
LICO67373
c8a324ab73 [Refactor] Add comments for Metadata classes in attention module (#5789)
### What this PR does / why we need it?

Add docstrings for Metadata and MetadataBuilder classes in the attention
module to improve code readability.

Related to #5463 (Item 11: Add some comments for CommonMetadata and
others)

**Modified files:**
- `vllm_ascend/attention/context_parallel/common_cp.py`: Added comments
for `AscendPCPMetadata`, `CPChunkedContextMetadata`,
`AscendMetadataForPrefill`, `AscendMetadataForDecode`
- `vllm_ascend/attention/utils.py`: Added comments for
`AscendPrefillContextParallelMetadata`
- `vllm_ascend/attention/mla_v1.py`: Added comments for
`ChunkedContextMetadata`, `AscendMLADecodeMetadata`
- `vllm_ascend/attention/attention_v1.py`: Added comments for
`AscendMetadata`, `AscendAttentionMetadataBuilder`
- `vllm_ascend/attention/context_parallel/attention_cp.py`: Added
comments for `AscendAttentionCPMetadataBuilder`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation only, no functional changes.

Signed-off-by: lico67373 <918688502@qq.com>
2026-01-13 08:46:50 +08:00
LiuYi-Up
dde547e900 [Bugfix] bugfix for the order of dummy run pad and sync (#5777)
### What this PR does / why we need it?

This PR addresses an issue in piecewise graph mode when Multi-Token
Prediction (MTP) is enabled. Specifically, the original dummy run
sequence performs the following steps in order:

1. Sync DP (input length = 1 + k)
2. Dispatch (input length = 1 + k, with padding==graph size)

However, in the model execution phase, the sequence differs, resulting
in:

1. Padding (input length = 1, with padding)
2. Sync DP (input length = 1 + k)
3. Dispatch (input length 1 + k != graph size 1 + k, with padding)

This discrepancy leads to a mismatch between the input sizes used in the
model execution and those expected by the dispatch graph, causing an
inconsistency in graph size.

This PR ensures that the dispatch graph size aligns correctly by
modifying the sequence of operations during model execution to match the
dummy run sequence, resolving the mismatch issue.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: LiuYi-UP <1150854440@qq.com>
2026-01-13 08:44:10 +08:00
Qiu
5f4b13ab3d [bugfix](cp) align max_context_chunk to cp_virtual_block_size (#5767)
### What this PR does / why we need it?
In the chunked prefill scenario, CP needs to align the
`max_context_chunk` to the `cp_virtual_block_size`, but the current
implementation only aligns it to the `block_size`. For
PD-disaggregation, `cp_kv_cache_interleave_size` is typically set equal
to `block_size`, in which case `cp_virtual_block_size=block_size *
dcp_size * pcp_size`. Under specific conditions, this can lead to
misalignment of certain chunks, subsequently triggering assertion check
errors.
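
A small sketch of the alignment rule described above (the function name is illustrative, not the actual code):

```python
def align_max_context_chunk(max_context_chunk: int, block_size: int,
                            dcp_size: int, pcp_size: int) -> int:
    # Round the chunk down to a multiple of the CP virtual block size
    # instead of the plain block size.
    cp_virtual_block_size = block_size * dcp_size * pcp_size
    return (max_context_chunk // cp_virtual_block_size) * cp_virtual_block_size

# Example: block_size=128, dcp_size=2, pcp_size=2 gives a virtual block of 512,
# so a 1000-token chunk is trimmed to 512 rather than 896 (the block_size-only
# result), keeping every chunk boundary aligned across CP ranks.
```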

### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-12 20:11:46 +08:00
wangyongjun
4453c60262 [bugfix]limit graph replay sync (#5761)
### What this PR does / why we need it?
When the graph mode is piecewise, replaying with a synchronize hurts
performance; the sync costs almost 250 us.

![123](https://github.com/user-attachments/assets/04d2a1f3-1f57-4dbb-85ce-b250f2ee7ff0)

### Does this PR introduce _any_ user-facing change?
The replay now only synchronizes when the graph mode contains full mode.
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: wangyongjun <wangyongjun7@huawei.com>
2026-01-12 16:46:21 +08:00
gh924
6880c1b383 [Feature] Support for cross-attention and whisper model (#5592)
### What this PR does / why we need it?
To solve the problem described in issue
https://github.com/vllm-project/vllm-ascend/issues/2262:

- support for cross-attention when the model is encoder-decoder
- support for whisper model

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: gh924 <guihao2@huawei.com>
Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>
2026-01-11 11:38:45 +08:00
zzhxxx
db12c1e2c8 [Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701)
### What this PR does / why we need it?
> Extracted from PR #5513
Based on the Sharded-CP feature PR:#4702;
RFC:https://github.com/vllm-project/vllm/issues/30055

### All-gather KV Cache for Communication Overlap:
- This PR adjusts the calculation order in the SFA.
- Split `index_select` into `indexer_select_pre_process` and
`indexer_select_post_process`.
- Combine `nope`, `rope` and `index-k` into a single tensor to perform an
asynchronous all-gather (see the sketch below).
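
A rough sketch of the overlap idea using plain `torch.distributed` collectives; the helper names and packing layout are illustrative, not the actual SFA code:

```python
import torch
import torch.distributed as dist

def allgather_kv_with_overlap(nope, rope, index_k, group, overlapped_compute):
    # Pack the three tensors into one buffer so a single collective is issued.
    packed = torch.cat([nope, rope, index_k], dim=-1)
    out = torch.empty(group.size() * packed.shape[0], *packed.shape[1:],
                      dtype=packed.dtype, device=packed.device)
    # Launch the all-gather asynchronously ...
    work = dist.all_gather_into_tensor(out, packed, group=group, async_op=True)
    partial = overlapped_compute()  # ... and overlap independent computation
    work.wait()                     # gathered KV is ready for the post-process
    return out, partial
```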

### benchmark:
input=40k && num_batch_token=20k
- before:
```
Mean TTFT (ms):                          2614.52
Median TTFT (ms):                        3148.03
P50 TTFT (ms):                           3148.03
P90 TTFT (ms):                           3163.48
P99 TTFT (ms):                           3170.20
```

- after:
```
Mean TTFT (ms):                          2529.92
Median TTFT (ms):                        3051.69
P50 TTFT (ms):                           3051.69
P90 TTFT (ms):                           3067.31
P99 TTFT (ms):                           3072.15
```

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2026-01-11 09:47:27 +08:00
lilinsiman
c5744e2350 [main][bugfix] Fix fullgraph padding bug in mtp eagle refactor (#5692)
### What this PR does / why we need it?
The condition for determining padding in full graph mode combined with
MTP and PCP has been modified to accommodate corner cases where the
graph capture sizes are manually specified.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut and tests

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-01-10 23:07:48 +08:00
zxr2333
78b554dda9 [P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722)
### What this PR does / why we need it?
Add new function to mooncake layerwise connector, including:
1. supports sparse attention, for DeepSeek-V3.2
2. Distribute transfer tasks to redundant kv_head cards

This PR is related to [[RFC]: CDCP Scheduling for Disaggregated
Prefilling with KV Cache Layerwise Push
Support](https://github.com/vllm-project/vllm-ascend/issues/4842)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2026-01-10 23:04:16 +08:00
Feng-xiaosuo
c316679e65 adapt to minimax_m2 (#5624)
### What this PR does / why we need it?
This PR fixes Minimax model loading in the vLLM Ascend backend by:

- Adding a model type check for "minimax" and "minimax_m2" to replace the
"mlp" prefix with "block_sparse_moe"
- Implementing special handling for Minimax expert layer naming
conventions
- Adding a Minimax configuration to packed_modules_model_mapping for
proper qkv_proj and experts module handling

Without these changes, Minimax models fail to load on Ascend devices due
to incompatible layer naming and module packing.
### Does this PR introduce _any_ user-facing change?
Yes. Users can now successfully load and run Minimax models on Ascend
hardware with vLLM. This enables inference capabilities for this model
family on Ascend devices.

### How was this patch tested?
Local testing:
- Verified model loading for minimax-xxx and minimax_m2-xxx model
variants on Atlas 800I A2 hardware
- Tested inference with sample prompts using vLLM's OpenAI-compatible API
server

Benchmark validation:
- Compared throughput and latency metrics against the GPU baseline
- Verified memory usage stays within expected limits for different batch
sizes
- Tested multi-card inference scenarios with tensor parallelism

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
2026-01-10 23:01:35 +08:00
Levi
ecd4232698 [Feat] flashcomm2+oshard Generalized (#4723)
### What this PR does / why we need it?
[FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf)
introduces redundant storage of the o_proj matrix, which imposes
pressure on GPU memory. We propose the FlashComm2+Oshard approach by
integrating the shared linear layer feature (#2931). This approach
distributes weights layer-by-layer to each GPU and accesses the o_proj
of each layer via asynchronous broadcast operations, thereby alleviating
memory pressure while achieving nearly lossless performance compared to
the original FlashComm2. This PR implements a generalized
FlashComm2+Oshard solution.

Using following env to support flashcomm2 with oshard

```shell
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1
--additional-config '{
  "layer_sharding": ["o_proj"]
}'
```

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2026-01-10 22:57:57 +08:00
wangxiaoteng888
aa987ffe87 [P/D][bugfix]Fix the PCP port mapping error issue (#5706)
### What this PR does / why we need it?
Fix the PCP port mapping error. In a multi-node PD separation
scenario, when the PCP feature is enabled, there is an issue with the
ZMQ transmission port: the IP and port received by the D side
do not match. The cause of this issue is an error in the port mapping
update strategy logic.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-10 22:43:52 +08:00
fems14
ff4c1a47b3 [bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751)
### What this PR does / why we need it?
1. Fixed memory retention on certain GPUs caused by missing PUT
operations.

2. Fixed performance degradation resulting from architectural
incompatibilities in the underlying refactor.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-09 17:46:23 +08:00
wangyao-i
3b997fdd32 support mxfp8 quantization (qwen dense) (#5723)
### What this PR does / why we need it?
Support mxfp8 quantization (Qwen linear layers).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef


Signed-off-by: wangyao <iwangyao@outlook.com>
2026-01-09 16:26:31 +08:00
SILONG ZENG
09b3f9d91b [CI]Add Disaggregated PD Nightly Test for Qwen3-235B and Qwen3-VL-235B (#5502)
### What this PR does / why we need it?
This PR adds online **Disaggregated Prefill/Decode** performance and
accuracy tests for the **Qwen3-235B-A22B** and
**Qwen3-VL-235B-A22B-Instruct** models to the Nightly test suite.

These test configurations simulate the deployment of massive MoE and
Vision-Language models in **a dual-node (32 NPU)** environment,
utilizing Mooncake (KVCache Transfer) technology to achieve efficient KV
cache transfer between the Prefill node and the Decode node.

#### Test Configuration
**Qwen3-235B-A22B**
- Model: Qwen/Qwen3-235B-A22B
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Disaggregated Prefill & Decode
- Node 0 (Producer/Prefill): **DP2 + TP8 + EP + FLASHCOMM1 +
FUSED_MC2**.
- Node 1 (Consumer/Decode): **DP4 + TP4 + EP + FLASHCOMM1 + FUSED_MC2 +
FULL_DECODE_ONLY**.
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs2800.
  - Accuracy: vllm-ascend/gsm8k-lite.

**Qwen3-VL-235B-A22B-Instruct**
- Model: Qwen/Qwen3-VL-235B-A22B-Instruct
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Disaggregated Prefill & Decode
  - Node 0 (Producer/Prefill): **DP2 + TP8 + EP**.
  - Node 1 (Consumer/Decode): **DP4 + TP4 + EP + FULL_DECODE_ONLY**.
- Benchmarks:
  - Performance: vllm-ascend/textvqa-perf-1080p.
  - Accuracy: vllm-ascend/textvqa-lite.

### How was this patch tested?
Nightly test action on CI

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-09 16:25:20 +08:00
1092626063
f63c1341d9 [Feature] GLM4.6 support mtp with fullgraph (#5460)
### What this PR does / why we need it?
Support MTP with full graph mode for GLM4.6 to improve performance.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV

vllm serve /weight/glm4.6_w8a8_with_float_mtp \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name glm \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
  --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --async-scheduling
```

Test case:

```bash
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model glm \
  --tokenizer /weight/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1
```
- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: 1092626063 <1092626063@qq.com>
2026-01-09 16:07:42 +08:00
ice_rain
09682e0751 [Bugfix] Fix matmul allreduce precision issue by using original weight (#4939)
### What this PR does / why we need it?

This PR fixes a precision issue from improper tensor maintenance in
`vllm_ascend/ops/linear_op.py` under the Verl reinforcement learning
(RL) scenario. Issue:
https://github.com/vllm-project/vllm-ascend/issues/5747
Key changes:
1. Remove the custom class member `self.weight_t` in
`vllm_ascend/ops/linear_op.py`;
2. Adjust the input logic of the `npu_mm_all_reduce_base` operator to
directly fetch weight parameters from the model's `nn.Parameters`,
instead of using pre-created Tensors.

> In the vllm model, it is recommended to avoid creating additional
parameter copies (such as self.weight_t) for computation; if already
created, they must be synchronized with the model's original parameters.
This is because parameter synchronization between training and inference
in the Verl reinforcement learning (RL) scenario may cause memory
address changes to nn.Parameters, and unsynchronized extra Tensors will
reference old memory without updating with the parameters—ultimately
leading to precision issues.
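
A minimal sketch of the anti-pattern and the fix, using a hypothetical layer class rather than the actual `linear_op.py` code:

```python
import torch
import torch.nn as nn

class RowParallelLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        # BAD (old behavior): a detached transposed copy made at init time
        # never follows parameter swaps done by the RL trainer.
        # self.weight_t = self.weight.t().contiguous()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GOOD (this PR's idea): derive the layout from the live parameter on
        # every call, so the matmul always sees the current weights.
        return torch.matmul(x, self.weight.t())
```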
### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: icerain-alt <450125138@qq.com>
Co-authored-by: Shangwei-Li <lishangwei@mail.ustc.edu.cn>
2026-01-09 16:05:32 +08:00
zzhxxx
64d29875f9 [Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698)
### What this PR does / why we need it?
Based on the Sharded-CP feature
PR:https://github.com/vllm-project/vllm-ascend/pull/4702;
RFC:https://github.com/vllm-project/vllm/issues/30055

This PR officially integrates Deepseek V3.2's DSA-CP support on the
basis of https://github.com/vllm-project/vllm-ascend/pull/4702,
improving inference efficiency and scalability under mixed
prefill-decode workloads. The main improvements include:
- Replace the implementations of o_proj, q_b_proj, and kv_b_proj with
custom_op for TP=1.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2026-01-09 15:58:40 +08:00
ZT-AIA
e11ff8e535 [BugFix] Fix the error when using Ascend custom operators with rank=128 (#5394)
### What this PR does / why we need it?
The custom Ascend operators sgmv_expand and sgmv_shrink only apply to
the scenario where the rank is 8, 16, 32, or 64. When rank >= 128, the
operator indexes out of range, causing the model to report an error.
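
A small sketch of the resulting dispatch guard (names are illustrative, not the actual kernel-selection code):

```python
SUPPORTED_LORA_RANKS = (8, 16, 32, 64)

def use_custom_sgmv(rank: int) -> bool:
    # The custom sgmv_expand/sgmv_shrink kernels only cover these ranks;
    # larger ranks (e.g. 128) must take the generic fallback path instead
    # of indexing outside the kernels' supported range.
    return rank in SUPPORTED_LORA_RANKS
```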
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Depends on this commit https://github.com/vllm-project/vllm/pull/31408 
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2026-01-09 15:57:43 +08:00
lhchg
dc99cfdc15 [CustomOp] support TensorList for dispatchFFNCombine (#5665)
### What this PR does / why we need it?
Support TensorList for dispatch_ffn_combine, in order to adapt to EPLB.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Single Operator Testing

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
2026-01-09 15:56:29 +08:00
Wang Xiaoran
3ce5a34468 [BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711)
### What this PR does / why we need it?
This PR fixes a bug in the Xlite
backend (https://atomgit.com/openeuler/GVirt/issues/1). The direct cause
of the problem is that the XModel::PrepareAttn function obtained an
illegal number of tokens to be inferred, -540. This illegal value is due
to the padding feature of inference in graph mode and the residual state
across steps. This issue is triggered when a prefill request is newly
added in a step and a decode ends simultaneously. It is first fixed
using num_decode_tokens instead of attn_metadata.num_decodes.
1. In graph mode, vllm_ascend has padding characteristics. In the
_prepare_inputs function, if the number of tokens to be inferred is less
than the set threshold (8 in this case), the attn_metadata.num_decode
array will be expanded to 8.
2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc
of NPUModelRunner to record the tokens to be inferred. Due to poor
coordination with the graph mode padding mechanism when crossing steps,
in some cases (such as when a decode request is completed in a certain
step and a new prefill request is added at the same time), negative
values may be calculated for attn_metadata.query_lens.
3. After type conversion, the negative values in query_lens cause an
overflow. Xlite detects that the number of tokens to be inferred for the
decode request is too large and triggers a "decode len too long" alert.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Same with https://atomgit.com/openeuler/GVirt/issues/1
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wwwumr <1127858301@qq.com>
2026-01-09 15:55:30 +08:00
Rui Kang
be941cab71 [BugFix] NetLoader: No backend type associated with device type npu (#5700)
### What this PR does / why we need it?
This PR fixes a bug in NetLoader
[PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888). The
bug was caused by
[PR#3612](https://github.com/vllm-project/vllm-ascend/pull/3612)
([1/N][Refactor] Refactor code to adapt with vllm main), which removed
the `stateless_init_device_torch_dist_pg` function from platform.py,
leading to a failure in the call. This PR adds a way to create a
stateless process group that does not depend on external code.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Same with
[PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888)
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: destinysky <kangrui10@126.com>
2026-01-09 15:54:54 +08:00
whx
ee2ed573f1 [BugFix][DS 3.2] Fix ds indexer accuracy problem caused by rope. (#4641)
### What this PR does / why we need it?
The rotary algorithm in the DeepSeek indexer should be neox-style
instead of gptj-style. PR #4413 fixed this accuracy bug with a new
Triton kernel; this PR fixes the original PyTorch version.
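
For reference, the two rotation layouts differ as follows (a standard PyTorch sketch, not the vllm-ascend kernel code); pairing the indexer's cos/sin tables with the wrong layout silently corrupts the rotary phase and shows up as an accuracy drop.

```python
import torch

def rotate_neox(x: torch.Tensor) -> torch.Tensor:
    # Neox-style: split the head dim into two halves and rotate across them.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_gptj(x: torch.Tensor) -> torch.Tensor:
    # GPT-J-style: rotate interleaved even/odd pairs instead of halves.
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)
```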

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with existing test.


- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-09 14:11:44 +08:00
Chenxi Qian
40eb3e1836 [OP] Enable custom op aclnnMoeInitRoutingCustom (#5332)
### What this PR does / why we need it?
This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR
#5251

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2026-01-09 09:35:18 +08:00
zhenwenqi2024
97f6be8108 [feature]dcp&pcp support mlapo (#5672)
### What this PR does / why we need it?
MLAPO in DeepSeek is a huge performance improvement for decode; this PR
supports PCP & DCP with MLAPO.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-08 23:49:23 +08:00
Yizhou
f4605c2b3c [Fix] Fixes speculative decode indexing and unpad condition for attention metadata (#5626)
### What this PR does / why we need it?
This addresses the issue brought up by #5356 and #4963, and we believe
the unnecessary conditions are the root cause.

Change the unpad trigger to be driven by actual size mismatches
(num_reqs vs base_num_reqs or scheduled vs input token counts) rather
than specific speculative-method flags. Then remove brittle workarounds
that forced request counts and sliced query start locations.

This prevents incorrect indexing and length mismatches during
speculative decoding and makes metadata unpadding more robust across
scheduling modes.
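
A minimal sketch of the reworked trigger (names are illustrative, not the actual metadata-builder code):

```python
def needs_unpad(num_reqs: int, base_num_reqs: int,
                num_scheduled_tokens: int, num_input_tokens: int) -> bool:
    # Unpad whenever the padded sizes actually differ, independent of which
    # speculative-decoding method is active.
    return (num_reqs != base_num_reqs
            or num_scheduled_tokens != num_input_tokens)
```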

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Tested by existing cases.

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-08 19:41:08 +08:00
cookieyyds
8b3a7a9e87 [bugfix] Support dsv3.2 enable both mtp and full_decode_only (#5679)
### What this PR does / why we need it?
PR #5230 introduced a problem: when both MTP and full_decode_only are
enabled for the DSV3.2 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
2026-01-08 15:47:31 +08:00
drslark
ccbc5e2ba1 [Feat][Bugfix][main] Adapted SP to eagle3 (#5562)
### What this PR does / why we need it?

Adapted sp to eagle3.

There may still be some problems, e.g., accuracy in some scenarios,
`sp`+`dp`...

We will fix them later.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

We tested it mainly in a new `e2e`.

```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```

```text
.

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) =============
```

It passed.

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-08 15:33:52 +08:00
Nengjun Ma
48811bc0b8 Optimize the print info format when deprecated code is used in vllm-ascend (#5696)
### What this PR does / why we need it?
Optimize the format of the warning message printed when deprecated
code is detected in vllm-ascend.

### Does this PR introduce _any_ user-facing change?
NA

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-08 09:26:49 +08:00
Aoxuan Chen
8763953f56 [Feature] add the magicmtp speculative decoding acceleration algorithm (#5542)
### What this PR does / why we need it?

1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. Added Triton and PyTorch implementations, and added E2E test cases.

### Does this PR introduce _any_ user-facing change?
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: chenaoxuan <cax1165@163.com>
2026-01-08 09:15:55 +08:00
lidenghui1110
481138e1d2 [bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (#4311)
### What this PR does / why we need it?
The `get_kv_cache_spec` function in model_runner changed a lot and
caused errors in the CPU-offloading connector, which is copied from
model_runner. This PR adapts to the newly implemented
`get_kv_cache_spec` to fix it.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-01-08 09:15:09 +08:00
zzhxxx
f7db812ed7 [refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181)
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration: The feature is opt-in via the additional_config
argument:
```
--additional-config '{
  "layer_sharding": ["o_proj", "q_b_proj"]
}'
```

This is orthogonal to standard tensor parallelism and weight replication
strategies. It is treated as a separate, explicit feature. It can be
used in any scenario, combined with the flashcomm2
(https://github.com/vllm-project/vllm-ascend/pull/3232) feature
or the ShardedCP #4702 feature, to achieve significant performance gains.



- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2026-01-08 09:05:02 +08:00
zxr2333
20a8cf061b [BugFix][P/D] Fix pre-create link parameter error (#5694)
### What this PR does / why we need it?
Fix a pre-create link parameter error: `batch_transfer_sync_write`
requires a list.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2026-01-08 08:41:10 +08:00
ZCG12345
3be8e33fe9 [Kernel] Add moe_gating_top_k operator support for Ascend NPU (#5579)
### What this PR does / why we need it?

1. Replace moe_gating_top_k from torch_npu with a custom op.
2. Enable the renorm function of moe_gating_top_k in the softmax scenario.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test needed.

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: ZCG12345 <2097562023@qq.com>
2026-01-07 21:42:31 +08:00