Commit Graph

487 Commits

Author SHA1 Message Date
rjg-lyh
fc2bcbe21c [Ops] Fix bug in register_custom_ops without forward_context (#2883)
### What this PR does / why we need it?
This PR fixed the bug in register_custom_ops without forward_context. We
set try-except to consider this situation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
7920de0a2a

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-12 16:58:08 +08:00
realliujiaxu
778cb72556 fix bug when rotary_dim is not 128 (#2847)
### What this PR does / why we need it?
`torch_npu.npu_apply_rotary_pos_emb` only support head_size and
rotary_dim equal 128. Error occurs when running GLM

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: main
- vLLM main:
404c85ca72

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-12 09:49:36 +08:00
22dimensions
f5a97e8fa5 [Quantization] register AscendQuantRMSNorm for quantization (#2856)
### What this PR does / why we need it?

modelslim will generate self.bias for rms norm in quantization, since
RMSNorm in vllm has no this parameter, so its nesscesary
to create a AscendQuantRmsNorm.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

tested by deepseek-v3.1-w8a8

<img width="2496" height="592" alt="image"
src="https://github.com/user-attachments/assets/004c6e76-3d7a-4a1f-b59f-a14304012663"
/>


- vLLM version: main
- vLLM main:
d6249d0699

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-11 23:14:02 +08:00
wyu0-0
eab3635850 [Bugfix] Retrieve num_redundant_experts from eplb_config in torchair qwen3_moe.py (#2857)
### What this PR does / why we need it?
This PR addresses a configuration retrieval issue related to EPLB
(Expert Parallel Load Balancing) settings in qwen3_moe.py.

The key change is adjusting the source of num_redundant_experts to
correctly fetch from the eplb_config sub-structure within
parallel_config, rather than directly from parallel_config. This aligns
with the updated configuration hierarchy for EPLB-related parameters.

This change references `vllm_ascend/models/qwen3_moe.py`

https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/models/qwen3_moe.py#L255-L257

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?

run bash as follows and test pass
```
source /sfs_turbo/humpy/B080/cann_b080/ascend-toolkit/set_env.sh
source /sfs_turbo/humpy/B080/cann_b080/nnal/atb/set_env.sh
#export HCCL_BUFFSIZE=300

# export HCCL_SOCKET_IFNAME="eth0"
# export TP_SOCKET_IFNAME="eth0"
# export GLOO_SOCKET_IFNAME="eth0"
# export HCCL_IF_IP=33.215.118.231

export VLLM_USE_V1=1
export VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ=1
export TASK_QUEUE_ENABLE=1
# export VLLM_VERSION=0.9.1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

rm -rf ./.torchair_cache/
rm -rf ./dynamo_*
rm -rf /root/ascend/log/debug/plog/*

python -m vllm.entrypoints.openai.api_server \
    --model=/sfs_turbo/tzq/model/Qwen/Qwen3-235B-A22B/ \
    --served-model-name auto \
    --port 8006 \
    -tp 1 \
    -dp 16 \
    --enable_expert_parallel \
    --max-num-seqs 48 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes_init":false,"graph_batch_sizes":[1, 8, 16, 24, 48]}, "ascend_scheduler_config":{"enabled":false}, "refresh":true}' \
    --kv-transfer-config \
    '{
        "kv_connector": "SharedStorageConnector",
        "kv_buffer_device": "npu",
        "kv_role": "kv_consumer",
        "kv_parallel_size": 2,
        "kv_port": "20002",
        "engine_id": "decode-'${NODE_RANK}'",
        "kv_rank": 1,
        "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 16
             },
             "decode": {
                    "dp_size": 16,
                    "tp_size": 1
             }
        }
    }' \
    2>&1 disown

```

- vLLM version: main
- vLLM main:
0ae43dbf8c

Signed-off-by: wyu0-0 <woshilynn@163.com>
2025-09-11 22:15:19 +08:00
Angazenn
aeffe27b30 [Perf]set moe w2_weight default to be nz (#2842)
### What this PR does / why we need it?

This PR sets the default format of GMM w2_weight in w8a8_dynamic to be
NZ to improve performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: main
- vLLM main:
e40827280b

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-09-11 21:40:54 +08:00
wuweiqiang24
9615dea3a7 Refactor tensor_parallel and comm_utils (#2814)
### What this PR does / why we need it?
1. Move ops/comm_utils to ops/moe/comm_utils
2. Move distributed/tensor_parallel/gather_from_sequence_parallel_region
to ops/moe/comm_utils
3. Delete distributed/tensor_parallel

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
e2e & ut

- vLLM version: main
- vLLM main:
a1213fae5f

---------

Signed-off-by: wuweiqiang24 <1005334931@qq.com>
Signed-off-by: wuweiqiang24 <wuweiqiang11@huawei.com>
2025-09-11 21:26:36 +08:00
rjg-lyh
0005479b9c [main] mlp weight prefetch in Qwen Dense Models (#2816)
### What this PR does / why we need it?
This PR prefetchs the weight of mlp layers in Qwen Dense Models to
optimize the performance in Decode phase mainly.

### Does this PR introduce _any_ user-facing change?
 No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
a1213fae5f

Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Shuming19 <313093131@qq.com>
2025-09-11 21:20:09 +08:00
无脸男
c3c2221503 [Feat]support dynamic quantization in allgather (#2841)
### What this PR does / why we need it?
[Feat]support dynamic quantization in allgather
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: main
- vLLM main:
5931b7e5d9

Signed-off-by: withHades <244036962@qq.com>
Signed-off-by: WithHades <244036962@qq.com>
2025-09-11 18:47:20 +08:00
6lazijiamo
bd3dedea61 support qwen25 vl w8a8 quantization (#2778)
### What this PR does / why we need it?
support qwen25 vl w8a8 quantization
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
62f66be1f7

---------

Signed-off-by: lijiaojiao <lijiaojiao990304@163.com>
Co-authored-by: lijiaojiao <lijiaojiao990304@163.com>
2025-09-11 16:40:51 +08:00
jiangpeng
2b9269b581 [Perf][V1] Fully overlap model execution (#2783)
This PR is based on top of
[#23569](https://github.com/vllm-project/vllm/pull/23569) and
[#24219](https://github.com/vllm-project/vllm/pull/24219).

### What this PR does / why we need it?
This PR allows the model runner to function asynchronously when using
async scheduling. This allows full overlap of the cpu operations
(including prepare_inputs) and the model forward pass. This diff is
functional and does not support speculative decoding, PP, or guided
decoding.

Expected speedup is 5-10% over the current async scheduling.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
server
```
python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\
	--trust-remote-code --enforce-eager \
	--distributed-executor-backend=mp \
	-tp=4 \
	--port 8006 \
	--max-model-len 32000 \
	--block-size 128 \
	--gpu-memory-utilization 0.99
```
client
```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
  --dataset-name random --random-input-len 2048 --random-output-len 2048 \
  --ignore-eos\
  --num-prompts 48 --max-concurrency 48  --request-rate inf --temperature 0 \
  --metric-percentiles 90  --base-url http://localhost:8006 --save-result \
  --result-dir $PROFILER_DIR
```

benchmark test based on Qwen3-32B TPOT result:
||forward async| scheduler async |sync|
|-|-|-|-|
|avg|41.73|41.86|44.20|
|improve0|0.3%|0|0|
|improve1|5.58%|0|0|

benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result:
||forward async|sync|
|-|-|-|
|avg|23.22|29.16|
|improve|20.3%|0|


- vLLM version: main
- vLLM main:
e93f4cc9e3

Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com>
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>
2025-09-11 16:35:36 +08:00
zhaozx-cn
923cdaeba3 fix ascend fused moe spelling error (#2863)
### What this PR does / why we need it?
fix ascend fused moe spelling error

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

0ae43dbf8c

- vLLM version: main
- vLLM main:
fcc0a3130a

Signed-off-by: zhaozixin <zhaozixin1@huawei.com>
Co-authored-by: zhaozixin <zhaozixin1@huawei.com>
2025-09-11 14:35:46 +08:00
zhaozx-cn
b9a0a75c78 fix qwen torchair attention PrefillCacheHit (#2787)
### What this PR does / why we need it?
Fix qwen torchair attention PrefillCacheHit 
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vLLM version: v0.10.1.1
vLLM main:
e599e2c65e

- vLLM version: main
- vLLM main:
0b9a612fa3

Signed-off-by: zhaozixin <zhaozixin1@huawei.com>
Co-authored-by: zhaozixin <zhaozixin1@huawei.com>
2025-09-11 14:26:59 +08:00
anon189Ty
7b2ecc1e9a [Feat] Unquantized linear nz support (#2619)
### What this PR does / why we need it?
Currently, when executing to the Linear layer of the model in
vLLM-Ascend, the weights input format is ND in unquantized case and
skipped ascend case, which is slower than FRACTAL_NZ.
This PR supplements the execution logic for Linear layer. When
VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and CANN version is 8.3, the weights
of the Linear layer will be converted to FRACTAL_NZ, in both unquantized
case and skipped ascend case.

- vLLM version: main
- vLLM main:
267c80d31f

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-09-11 11:40:00 +08:00
liziyu
5691104249 LLMdatadist connector adapt the distributed KV aggregation (#2718)
### What this PR does / why we need it?
LLMdatadist connector adapt the distributed KV aggregation for the main
branch. Change the P node from returning "finish sending" only when TP0
responds to returning "finish sending" as soon as each NPU receives it.
The D node will send a finish receive signal to the corresponding tp
rank of the P node.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
gsm8k test
2*A3 1P 1D
P: dp2 tp8 D:dp 4 tp4
P: dp2 tp8 D:dp 2 tp8


- vLLM version: main
- vLLM main:
cc99baf14d

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-11 11:37:41 +08:00
Mengqing Cao
c2fdd4b8bc [CI/UT] Fix UTs on register customop and warm up model (#2862)
### What this PR does / why we need it?
Fix UTs on register customop and warm up model

### How was this patch tested?
CI passed with existing test.

Co-authored-by: Icey <1790571317@qq.com>

- vLLM version: main
- vLLM main:
cc99baf14d

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-11 11:30:16 +08:00
lilinsiman
b7df04de9b debug_aclgraph_sizes_capture (#2827)
### What this PR does / why we need it?
1. Solved the problem that in the Qwen3 Moe model case, opening DP would
use an extra stream, causing ACLgraph sizes capture error
2. After experimentation, it was found that in many cases, some
operators would occupy more streams than expected. Therefore, the buffer
area for streams in ACLgraph was not large enough. After discussion,
extra 120 streams were added as buffer.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: main
- vLLM main:
0ae43dbf8c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-09-10 22:50:48 +08:00
huangxialu
88d7af62be [main] adjust the position of warm_up_atb (#2823)
### What this PR does / why we need it?
Adjust the position of warm_up_atb.

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
CI passed with existing test.

- vLLM version: main
- vLLM main:
b23fb78623

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-09-10 14:06:38 +08:00
Li Wang
22b425765a [Bugfix] Fix broken CI (#2825)
### What this PR does / why we need it?
1. Initial support disable tp for integrating with
[vllm-commit](https://github.com/vllm-project/vllm/pull/23024)
2. [vllm@commit](https://github.com/vllm-project/vllm/pull/23673) now
use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add
the integration

- vLLM version: main
- vLLM main:
e40827280b

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-10 13:29:29 +08:00
Icey
aa4d2a91ed Refactor AscendMultiHeadLatentAttention (#2826)
### What this PR does / why we need it?
Register AscendMultiHeadLatentAttention as CustomOP, following vllm changes

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: main
- vLLM main:
b23fb78623

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-10 11:26:11 +08:00
CaranLic
168ad600b5 [main] add pd transfer for ascend scheduler (#2753)
### What this PR does / why we need it?
For offline scenarios, adjust the scheduling process to prioritize the
prefill phase of all requests, then process the decode phase of all
requests.

### How was this patch tested?

```
max_num_seqs=24,
additional_config={
    "ascend_scheduler_config":{
        "enabled": True,
        "enable_pd_transfer": True,
        "decode_max_num_seqs": 24,
        "enable_chunked_prefill": False
    }
},
```
| input | output | num prompts | max_num_seqs | dp | tp | scheduler |
tps |
| ------ | ------ | ---------- | ---------------- | ---- | ---- |
---------------- | --------------- |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | v1 | 234.06 |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | pd transfer | 239.59(+2.4%) |
| dapo-math-17K| 2K | 384 | 24 | 4 | 1 | v1 | 222.85 |
| dapo-math-17K| 2K | 384 | 24 | 4 | 1 | pd transfer | 225.81(+1.3%) |


- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-09-10 08:46:39 +08:00
Mengqing Cao
edf1f600ad [CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840)
### What this PR does / why we need it?
Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1

### Does this PR introduce _any_ user-facing change?
branch main of vllm-ascend will not be compatible with vllm v0.10.1 and
v0.10.1.1

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-10 08:43:10 +08:00
sherie
93e28e6862 add weight transpose check. (#2756)
### What this PR does / why we need it?
In reinforcement learning scenarios, weight updates are required, but
the current inference applies a transpose operation to the weights,
altering their shape. This causes a shape mismatch with the training
weights, triggering an error during weight updates.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-09 20:33:43 +08:00
yiz-liu
e13c4ddb42 [Fix] Fix SharedFusedMoE (#2817)
### What this PR does / why we need it?
Really strange that `register_oot` doesn't work with `SharedFusedMoE`,
so we have to add this patch, for now.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
This PR won't have any effect in DeepSeek since we currently still stick
with the old `CustomDeepseekV2`.

- vLLM version: v0.10.1.1
- vLLM main:
0cdd213641

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-09 18:19:56 +08:00
rjg-lyh
7a205dbaa8 [main] Optimize rope in Qwen Models (#2571)
### What this PR does / why we need it?
Optimize rope by caching sin and cos at the first layer in Qwen Models.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.1.1
- vLLM main:
562663a044

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn>
Co-authored-by: ZYang6263 <zy626375@gmail.com>
2025-09-09 14:28:14 +08:00
rjg-lyh
1bbb20ea13 [main] flashcomm_v1 optim in Qwen Dense Models (#2802)
### What this PR does / why we need it?
Flashcomm_v1 optim in Qwen Dense Models.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
5e537f45b4

Co-authored-by: 1024daniel <xxltju324@gmail.com>
2025-09-08 22:52:24 +08:00
zzzzwwjj
4df8df5b94 [bugfix] fix deepseek rope sincoscache re-generation (#2744)
### What this PR does / why we need it?
The current implementation will result in duplicate generation of
`sin_cos_cache` in rope when `kv_seqlen` > 4k, because the
initialization length of the `sin_cos_cache` is only 4k.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
After this PR merged, sin_cos_cache will not increase in forward func,
so `test_native_rope_deepseek_forward_cache_handling` is not necessary.

- vLLM version: v0.10.1.1
- vLLM main:
60f0843ef8

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-09-08 22:03:34 +08:00
wangxiyuan
7d6d9449a8 [Misc] Move lora patch file into lora module (#2797)
Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:42:12 +08:00
wangxiyuan
85d989a3b9 [Misc] Remove pangu model file (#2798)
vllm-ascend won't contain model file anymore. Now pangu model file has
been moved to torchair module. The origin one can be removed.

Note: After this PR, pangu only works with torchair mode then.
- vLLM version: v0.10.1.1
- vLLM main:
8c892b1831

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:30:37 +08:00
weichen
a041d4f328 [main] [refactor] refactor common_fused_moe.py (#2706)
### What this PR does / why we need it?
1. Move prepare/finalize operation from moe_comm_method to
/ops/moe/fused_moe_prepare_and_finalize
2. Adapt to token_dispatcher in moe_comm_method
3. Move
moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize
to /ops/moe
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-08 20:09:50 +08:00
machenglong2025
1a82b16355 Remove unused code in fused_moe.py (#2805)
### What this PR does / why we need it?
line 408 already declared mc2_mask ,  remove duplicated unused code

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
60f0843ef8

Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
2025-09-08 20:05:19 +08:00
22dimensions
d51694a77b [2/N][Refactor][Quantization] clean quantization patch (#2785)
### What this PR does / why we need it?
quantization patch is unused code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested by CI

- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-08 17:31:53 +08:00
realliujiaxu
d3c3538ddc [Bugfix]fix bug when graph_size is not divisible by tp_size (#2719)
### What this PR does / why we need it?
fix https://github.com/vllm-project/vllm-ascend/issues/2702
- A2: skip graph_size update that makes it to tp_size because
dispatch/combine op support different batch size across EP ranks
- A3: add `max_num_reqs = max(new_graph_batch_sizes)` to fix graph_size
and max_num_reqs mismatch

### Does this PR introduce _any_ user-facing change?
Nope
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-08 14:52:33 +08:00
TaoYu Chen
dd087effcc Refector prepare_inputs in model_runner_v1.py (#2750)
### What this PR does / why we need it?
Refector prepare_inputs in model_runner_v1.py for more easy read.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
PASS CI

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
2025-09-08 10:45:23 +08:00
yiz-liu
c735bb0941 [Fix] Ensure metadata sync across DP ranks in eager mode (#2766)
### What this PR does / why we need it?
Removes the condition that skips metadata synchronization when
`enforce_eager` is enabled.

This change is necessary to correctly sync the `with_prefill` and
`enable_dbo` flags across all data parallel ranks, which is not required
in the base implementation. Forcing the sync operation prevents
potential inconsistencies, albeit with a minor performance impact.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Add a E2E online test case?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-08 09:55:16 +08:00
sherie
2693196ef8 add gatherep select. (#2740)
### What this PR does / why we need it?
add gatherep select.

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-08 09:15:50 +08:00
Marco Barletta
6666e5265d Added support for KV connector v1 (#2039)
### What this PR does / why we need it?
- This PR adds the support for the KV connector interface in the V1
architecture, in the same way as vllm. Vllm-ascend currently lacks of
this support, required to support also layerwise management of KV
caches.

- The connector interface allows using external tools and integrate them
with vllm

### Notes:
We are aware of Issue #684 , however that issue does not modify the
attention classes as necessary to perform layerwise management of KV
caches required for connectors like LMCache.

The implementation of this PR ported the necessary code from the vanilla
vllm. The KV connector API is the same as vanilla vllm, supporting the
standard KV connector API.

EDIT: this PR was re-implementing part of the changes merged one hour
before this PR was made on the file model_runner_v1.py. I solved the
conflicts by removing any modification to the model_runner_v1 file,
which now are largely already merged in main. Now this PR is left for
the modifications to the attention_v1 file.

### Does this PR introduce _any_ user-facing change?
The PR does not modify current APIs, but it extends the behavior of
current worker runner and attention classes to save and load KV caches.
In absence of connectors, the behavior should stay untouched.

### How was this patch tested?
- No unit test implemented yet for the worker.

- Tested together with LMCache using
https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/local_backends/offload.py
with the following models:
1 Deepseek-R1-Distill-Qwen-1.5B
2 Qwen3-30B-A3B
3 Deepseek-v2-lite
4 Llama-3.1-8B
LMCache used in both layerwise and non-layerwise mode.

- Performed LMEval on LMCache integrated with vllm-ascend.

Results without LMCache on Qwen3-8B:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8400|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8355|± |0.0102|
 
Results with LMCache Layerwise:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8385|± |0.0101|
| | |strict-match | 5|exact_match|↑ |0.8332|± |0.0103|


- vLLM version: v0.10.1.1
- vLLM main:
50fede6634

---------

Signed-off-by: marcobarlo <barlettamarco8@gmail.com>
Signed-off-by: marcobarlo <65128997+marcobarlo@users.noreply.github.com>
2025-09-08 09:04:22 +08:00
yeyifan
b2f77d3aa8 [fix] prefill unsupport sliding window attention (#2758)
### What this PR does / why we need it?
fix prefill attention bug,not support sliding window.
npu_fused_infer_attention_score head_dim only equal 128, not support
other number.
### Does this PR introduce _any_ user-facing change?
remove prefill phase npu_fused_infer_attention_score
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
e599e2c65e

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-09-07 10:34:38 +08:00
lidenghui1110
5a7181569c [feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167)
### What this PR does / why we need it?
This PR introduces Oproj matrix tensor model parallel to achieve
decreasing of memory consumption. It only support graph mode in pure DP
scenario.

In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with
oproj_tensor_parallel_size = 8, we have 1 ms TPOT increasing, saved 5.8
GB NPU memory per RANK. We got best performance when
oproj_tensor_parallel_size=4 without TPOT increasing.

performance data:
<img width="1442" height="442" alt="image"
src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d"
/>

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.
| Name | Effect | Required | Type | Constraints |
| :---------------------------- |
:--------------------------------------- | :------- | :--- |
:----------------- |
| oproj_tensor_parallel_size | Split the o_proj matrix along the row
dimension (head num * head dim) into oproj_tensor_parallel_size pieces.
| No | int | default value is None, once this value is set, the feature
will be enabled, head num * head dim must be divisible by this value. |

example

`--additional_config={"oproj_tensor_parallel_size": 8}`

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
eddaafc1c7

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zzh <zzh_201018@outlook.com>
2025-09-07 10:31:32 +08:00
henryxuxu0716
51a2aec115 Delete redundant codes related to communication (#2717)
### What this PR does / why we need it?
Delete redundant codes related to communication

### Does this PR introduce _any_ user-facing change?
not involve

### How was this patch tested?
not involve

- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
2025-09-05 09:39:39 +08:00
1092626063
5b3646ab21 [FEATURE][MTP] Support MTP > 1 (#2708)
### What this PR does / why we need it?
[RFC:Support MTP > 1 for
DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745)

- [x] dp1 tp16
- [x] dp4 tp4
- [x] dp2 tp 8
- [x] torchair graph

- vLLM version: v0.10.1.1
- vLLM main:
c9f7081f9c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-09-05 09:11:22 +08:00
yiz-liu
83eb40a51c [Fix][MoE] Refine MoE communication strategy (#2734)
### What this PR does / why we need it?
Refactors the Mixture-of-Experts (MoE) communication method selection
logic. The choice between all-gather, all-to-all, and mc2 is now
determined by expert parallel configuration, SoC version (A2/A3), and
token count for better performance.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Added.


- vLLM version: v0.10.1.1
- vLLM main:
eafa8dcde6

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-05 09:04:04 +08:00
liziyu
4c90fa79ca [Misc] Remove useless PD check in deepseek (#2739)
### What this PR does / why we need it?
Remove useless PD check in deepseek


- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-04 22:22:19 +08:00
sherie
f86596a66c allgather use fusedop. (#2689)
### What this PR does / why we need it?
Use 'npu_moe_init_routing_v2' &'npu_moe_token_unpermute' repalce
'npu_moe_init_routing' &‘npu_moe_compute_expert_tokens’&
'npu_moe_finalize_routing' to optimize performance
### Does this PR introduce _any_ user-facing change?
| branch| tps| TTFT |TPOT |
| --- | --- | --- |--- |
|main  |733.98  | 280.05 |34.30 |
|main+fusedop  | 740.33 | 273.34 |33.99 |
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-04 11:56:29 +08:00
无脸男
7d47d8f4f6 [Fix] fix resources limit error when apply speculative decoding and aclgraph (#2472)
### What this PR does / why we need it?
When both speculative decoding and aclgraph are applied, and
cudagraph_capture_sizes uses the default value, it will report that the
stream resources are insufficient.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
9c99e4871f

Signed-off-by: withHades <244036962@qq.com>
2025-09-04 11:50:43 +08:00
无脸男
0c0789be74 [Feat] allow using aclgraph in ray backend (#2589)
### What this PR does / why we need it?

Allow using aclgraph in ray backend, for tp + pp + aclgraph in multi
machine

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
4ba0c587ba

Signed-off-by: withHades <244036962@qq.com>
2025-09-04 11:45:56 +08:00
Ruri
aff5189c87 [main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers (#2275)
### What this PR does / why we need it?

Fuse `GroupedMatmul`, `Swiglu` and `DynamicQuant` into one fusion
operation `GroupedMatmulSwigluQuant`.

1. extract common functions in `w4a8_dynamic.py` and `w8a8_dynamic.py`
2. if in supported occasion, use fusion operation
`npu_grouped_matmul_swiglu_quant`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Tested on W8A8 quantized Qwen3-235B-A22B model with `bs=16`

1. `tp=8`, `dp=1`, `moe_tp=8`, `moe_ep=1`, TPOP increased 21.54%, Output
Token Throughput increased 27.35%
<img width="3443" height="211" alt="image"
src="https://github.com/user-attachments/assets/a1a9c14d-2310-41be-9a03-36125dabae6e"
/>

3. `tp=8`, `dp=1`, `moe_tp=1`, `moe_ep=8`, TPOP increased 17.38%, Output
Token Throughput increased 6.86%
<img width="3443" height="211" alt="image"
src="https://github.com/user-attachments/assets/1ce92e92-720d-40c0-8b4d-c493e5cb10a6"
/>


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-09-04 11:37:32 +08:00
22dimensions
37f5a29cd4 [1/N][Refactor][Quantization] remove redundant quantizer class (#2680)
### What this PR does / why we need it?

AscendQuantizer/LLMQuantizer class is used to select quant method based
on quant config and some other arguments,
but it is more simple and clean replacing these classes with map. So i
remove them.

### Does this PR introduce _any_ user-facing change?
No 

### How was this patch tested?

ut and e2e test


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-04 11:35:14 +08:00
Icey
d4370ebc42 [Refactor] Refactor Spec Decode (#2668)
### What this PR does / why we need it?
Refactor spec decode

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-04 11:34:47 +08:00
Angazenn
e7409e95ee [1/N][Draft][Refactor]torchair pangu_moe modeling refactor (#2437)
### What this PR does / why we need it?

1. Similar to #2384 , this PR add a torchair-specific modeling for
pangu.
2. Fixes a bug introduced by routed_scaling_factor in #2675 .
3. remove eager test case for pangu since there has already been a
torchair test case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
2025-09-04 10:39:21 +08:00
whx
a58013440a [BugFix][MLA] Fix attn_mask bug for ring mla (#2704)
This PR fix a bug related to attention mask used in ring mla. Current
ring mla has supported compressed mask, so we can directly use a 512 *
512 attention mask.

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-04 10:22:46 +08:00