### What this PR does / why we need it?
Fix accuracy for DeepSeek-V2-Lite
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
66072b36db
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR fixes a few issues with prefill disaggregation:
1. Fix the kvcache address alignment issue: llmdatadist requires tensor
addresses to be aligned to 2M.
2. Fix the kvcache shape error: llmdatadist requires k/v tensors with
shape [num_blocks, ...], but the implementation before this PR used
[2, num_blocks, ...], which breaks prefill disaggregation.
3. Use the hybrid kv cache only when running qwen3_next, to fix an
accuracy issue with prefill disaggregation.
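The alignment and shape requirements in items 1 and 2 can be illustrated with a minimal sketch. It assumes that "2M" means 2 MiB (2 * 1024 * 1024 bytes); `align_up` and `split_kv_shape` are hypothetical helpers written for illustration, not functions from vllm-ascend or llmdatadist.

```python
ALIGNMENT = 2 * 1024 * 1024  # assumption: "2M" means 2 MiB


def align_up(addr, alignment=ALIGNMENT):
    """Round addr up to the next multiple of alignment (must be a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def split_kv_shape(fused_shape):
    """Turn a fused [2, num_blocks, ...] shape into separate K and V shapes."""
    if fused_shape[0] != 2:
        raise ValueError("expected a leading dimension of 2 (K and V)")
    per_tensor = tuple(fused_shape[1:])
    return per_tensor, per_tensor
```

For example, `align_up(1)` returns 2097152, and `split_kv_shape((2, 8, 128, 4, 64))` returns two `(8, 128, 4, 64)` shapes, one each for K and V.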
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally by @liziyu179
- vLLM version: v0.10.2
- vLLM main:
4f02b77de4
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Fix VocabParallelEmbedding UT
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
f592b3174b
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
For sleep mode level 2, we discard both the model weights and the
kv_cache. The problem is that discarding the weights also discards some
tensors that represent the model state, i.e. those returned by
`model.named_buffers()`, such as `running_mean` / `running_var` in
BatchNorm and the rope cos-sin cache. If the weights are updated but the
buffers are not, this can lead to unexpected issues.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5963b98b46
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Remove chunked prefill for the mla branch in mla, and change the dtype
of prefill_mask to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
This PR fixes two problems that occur when `multistream_moe` is enabled
in torchair graph mode:
1. Check `TorchairAscendW8A8DynamicFusedMoEMethod` instead of the
incorrect `AscendW8A8DynamicFusedMoEMethod`
2. mc2_mask should be chunked regardless of whether `replace_allreduce`
is True or False in the forward function of `TorchairAscendFusedMoE`
- vLLM version: v0.10.2
- vLLM main:
0fb2551c23
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Add multi-node ray backend tutorial for Qwen235B-A3B
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
f4cd80f944
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
The current linear.py has the following issues:
- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.
- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.
To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the class selection logic into
common functions. To add custom communication groups or forward methods
in the future, it is only necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection
logic.
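A minimal sketch of this pattern follows. Only the name `CustomColumnParallelOp` comes from this PR; the subclass `MlpTpColumnOp`, the group names, the `mlp_tp_enabled` flag, and the prefix check are hypothetical and do not reflect the actual selection rules in linear.py.

```python
class CustomColumnParallelOp:
    """Bundles a communication group with a forward implementation."""

    comm_group = "tp"  # placeholder group name

    def forward(self, x):
        # Stand-in for the real computation: return the input unchanged.
        return x


class MlpTpColumnOp(CustomColumnParallelOp):  # hypothetical extension
    comm_group = "mlp_tp"


def select_column_parallel_op(prefix, mlp_tp_enabled):
    """Single place where the op (and therefore the comm group) is chosen."""
    if mlp_tp_enabled and "gate_up_proj" in prefix:
        return MlpTpColumnOp()
    return CustomColumnParallelOp()
```

With this shape, supporting a new communication group means adding one subclass and one branch in the selection function, without touching the linear layer classes themselves.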
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
dd39baf717
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: weijinqian0 <weijinqian@huawei.com>
### What this PR does / why we need it?
[Bugfix]: Replace npu_incre_flash_attention with
npu_fused_infer_attention_score so that tiling updates are possible
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
2b85697031
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
2b85697031
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable kvcache
reuse in scenarios with system prompts or multi-turn dialogues.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5963b98b46
---------
Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
https://github.com/vllm-project/vllm-ascend/pull/2849 moves the
implementation of `shared_expert_dp` to torchair deepseek_modeling.
However, calling `set_forward_context` with `enforce_eager` and
`shared_expert_dp` falls back to the implementation in
model_runner_v1.py and sets the global attn_metadata to a dictionary.
This leads to a RuntimeError when attn_metadata is retrieved from the
forward context and used in torchair_deepseek_v2.py. This PR fixes the
problem by introducing a transformation of attn_metadata in this file.
Note that current E2E testing lacks a deepseek case with
`shared_expert_dp`. We need to add an ST with `shared_expert_dp` to the
testing workflow.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
e2e vllm serving with `enable_shared_expert_dp: true` passed.
- vLLM version: v0.10.2
- vLLM main:
de3e53a75b
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Add an option to enable frozen parameter
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb
Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
### Motivation
Currently, dynamic expert load balancing is stop-the-world.
Asynchronous expert load balancing would avoid the following
problems:
Host-bound latency:
EPLB involves many CPU operations, such as the eplb algorithm, creating
p2p ops, and log2phy expert conversion, which take a long time on the
CPU, around 1s.
Communication latency: Transfer time is significant without nvlink.
Because the weights of one expert may be transferred to multiple new
positions, one expert can require N send/recv operations, which results
in long latency. In our tests, batch_isend_irecv took more than 100ms to
transfer the weights of 16 experts on an Ascend A2 server.
SwiftBalancer no longer stops the world. In our tests on NPU it costs
1~2ms per layer while gaining 5ms-8ms in decode latency with ep_size =
64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Async CPU computation for the eplb algorithm and other Python
operators.
3. A new eplb algorithm with less expert rebalancing and almost the same
effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record the expert distribution during forward. We use
expert_token_num after dispatch instead of topk_ids, which gives a much
smaller tensor shape and reduces the cost of HBM recording and the add
operator.
2. All-gather the expert distribution. All-gather is used instead of
all-reduce because it has less traffic volume.
3. Wake up the eplb worker process with the expert distribution when
num_iterations is reached. Run the eplb algorithm in the eplb worker.
4. Generate the p2p send/recv ops and other operators such as log2phy,
which take a long time on the CPU.
5. Launch ibatch_send_recv in async_stream before forward.
6. After forward, wait for ibatch_send_recv to finish, then update the
expert map and expert weights.
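Steps 1 to 3 can be sketched in plain Python. This is a simplified illustration under stated assumptions: the all-gather is simulated with nested lists, a thread stands in for the eplb worker process, the token counts are made up, and `greedy_placement` is a generic greedy heuristic, not the eplb algorithm from this PR (it also ignores the constraint that each rank usually hosts the same number of experts).

```python
from concurrent.futures import ThreadPoolExecutor


def gather_expert_load(per_rank_counts):
    """Simulate all-gather plus a sum: total tokens routed to each expert."""
    num_experts = len(per_rank_counts[0])
    return [sum(rank[e] for rank in per_rank_counts) for e in range(num_experts)]


def greedy_placement(expert_load, num_ranks):
    """Place experts heaviest first, always onto the least loaded rank."""
    placement = [[] for _ in range(num_ranks)]
    rank_load = [0] * num_ranks
    for expert in sorted(range(len(expert_load)), key=lambda e: -expert_load[e]):
        target = min(range(num_ranks), key=lambda r: rank_load[r])
        placement[target].append(expert)
        rank_load[target] += expert_load[expert]
    return placement, rank_load


# Per-expert token counts recorded after dispatch on two ranks (made-up numbers).
per_rank_counts = [[90, 10, 5, 5], [70, 20, 5, 15]]
expert_load = gather_expert_load(per_rank_counts)

with ThreadPoolExecutor(max_workers=1) as worker:
    # Hand the CPU-heavy planning step to the worker so the caller is not blocked.
    future = worker.submit(greedy_placement, expert_load, 2)
    # ... forward passes would keep running here ...
    placement, rank_load = future.result()  # collected once the plan is needed
```

The point of the sketch is the control flow: the planning step runs off the critical path, and the main loop only synchronizes when it is ready to apply the new expert map.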
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com
- vLLM version: v0.10.2
- vLLM main:
567939953b
---------
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
### What this PR does / why we need it?
1. In memory of #2509, fix mtp torchair in the PD disaggregation scenario
2. Fix an mla bug in the speculative decoding scenario, where num_decodes !=
num_decode_tokens
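To see why the two counts diverge under speculative decoding, here is a minimal sketch. It assumes that every decode request carries one regular token plus a fixed number of draft tokens to verify; `count_decode_tokens` and `num_spec_tokens` are hypothetical names used for illustration.

```python
def count_decode_tokens(num_decodes, num_spec_tokens):
    """Tokens processed in the decode part of a batch.

    Assumption: each decode request contributes 1 regular token plus
    num_spec_tokens draft tokens.
    """
    return num_decodes * (1 + num_spec_tokens)
```

Without speculation, 4 decode requests yield 4 decode tokens. With one draft token per request they yield 8, so code that treats the two counts as interchangeable sizes its tensors incorrectly.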
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5206ab20ba
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
This PR fuses the addrmsnorm op and the w8a8 quant op for better performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.2
- vLLM main:
0faf3cc3e8
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should
be OK with ACL Graph.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Working on it.
- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR deletes ~2K lines of deepseek modeling code. It falls back from
the CustomDeepseekV2 modules to the original vllm implementations and
adapts to some changes in vllm related to deepseek and moe.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with torchair graph mode and eager mode.
- vLLM version: v0.10.2
- vLLM main:
759ef49b15
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Replace prepare/finalize operation in fused_moe.py by
moe_comm_method.prepare()/finalize()
2. Replace unified_fused_experts by moe_comm_method.fused_experts() in
fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py
3. Call _select_moe_comm_method in the spec-decode proposers.
4. w4a8_dynamic does not currently support gatherep, so use all2allv
instead.
5. Remove redundant code.
### Does this PR introduce _any_ user-facing change?
The AllgatherEP switch is disabled in aclgraph/eager mode; selection
simply follows the rules in modelrunner_v1._select_moe_comm_method()
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.2
- vLLM main:
7f6f2c1182
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
Update max_tokens and prompt in qwen3 online doc
Before:
```
"'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None"
```
After:
```
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct",
"messages": [
{"role": "user", "content": "Who are you?"}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32
}'
.{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
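The "Before" error comes from a simple budget check, which the message itself spells out as `4096 > 4096 - 18`. A minimal sketch of that arithmetic follows; the function names are hypothetical and this is not vLLM's validation code.

```python
def max_completion_budget(context_len, prompt_tokens):
    """Largest max_tokens value that still fits in the context window."""
    return context_len - prompt_tokens


def request_fits(context_len, prompt_tokens, max_tokens):
    """True if the requested completion length fits after the prompt."""
    return max_tokens <= max_completion_budget(context_len, prompt_tokens)
```

With a 4096-token context and 18 input tokens the budget is 4078, so `max_tokens=4096` is rejected, while the updated request (12 prompt tokens, `max_tokens=32`) fits easily.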
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Manually test on my local env
- CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
On main, AscendScheduler does not support multimodal models because it
lacks scheduled_encoder_inputs, which is needed for multimodal inference.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155
- vLLM version: v0.10.2rc2
- vLLM main:
15b8fef453
---------
Signed-off-by: fan2956 <zhoufan53@huawei.com>
Co-authored-by: zhoufan2956 <zhoufan2956@163.com>
### What this PR does / why we need it?
Bump vLLM version to v0.10.2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc3
- vLLM main:
15b8fef453
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This reverts commit 339fceb89c.
### Does this PR introduce _any_ user-facing change?
Yes, use 8.2rc1 image by default
### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc2
- vLLM main:
cfa3234a5b
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR is used to adapt the hostname format for Mooncake when using
adxl. When Mooncake uses adxl, it is necessary to set
```USE_ASCEND_DIRECT``` to True in the file
```/Mooncake/mooncake-common/common.cmake``` during compilation. The
mooncake_connector obtains this config by calling
```vllm_config.kv_transfer_config.get_from_extra_config```, determines
whether Mooncake is using adxl, and selects the corresponding hostname
format.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: main
- vLLM main:
d21a36f5f9
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Enable push trigger for image job
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Followup on https://github.com/vllm-project/vllm-ascend/pull/2864
- vLLM version: v0.10.2rc2
- vLLM main:
89e08d6d18
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Upgrade CANN version to 8.3.rc1.alpha001
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2rc2
- vLLM main:
89e08d6d18
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
**Background:**
There are two principles for operator registration in PyTorch:
- A given namespace can be registered only once by `TORCH_LIBRARY`
- A given operator signature can be registered only once by `def`
The custom operators defined in the current repo are used only by
Ascend. This differs from a model in which vLLM defines a common
operator schema and every accelerator follows it and completes the
implementation for its own hardware, which would be conducive to
functional abstraction.
Therefore, we can rename the operator registration namespace to an
Ascend-specific namespace (**_C_ascend**).
Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742
- vLLM version: main
- vLLM main:
f592b3174b
Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
### What this PR does / why we need it?
Upgrade vLLM version to 0.10.2rc2
### Does this PR introduce _any_ user-facing change?
Yes, image will use 0.10.2rc2 vLLM
### How was this patch tested?
- vLLM version: main
- vLLM main:
f17c075884
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR forcibly disables the chunked prefill feature in non-MLA
models, as the performance of the operators supporting this
functionality is currently suboptimal. The feature is allowed only if
the user has explicitly enabled chunked prefill in the
ascend_scheduler_config.
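As we read the rule above, the decision can be sketched as follows. The function and parameter names are hypothetical, and the behavior for MLA models (left unchanged) is an assumption, since this description only covers non-MLA models.

```python
def effective_chunked_prefill(current, is_mla_model, enabled_in_ascend_scheduler_config):
    """Return whether chunked prefill ends up enabled.

    current: the setting before this rule is applied.
    Assumption: MLA models keep their current setting.
    """
    if is_mla_model:
        return current
    # Non-MLA models: off unless the user opted in explicitly.
    return enabled_in_ascend_scheduler_config
```

For a non-MLA model this yields `False` even when `current` is `True`, unless the user opted in through the ascend_scheduler_config.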
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
Related: https://github.com/vllm-project/vllm-ascend/pull/2659
- vLLM version: main
- vLLM main:
d21a36f5f9
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR fixes a bug in register_custom_ops when there is no
forward_context. We add a try-except to handle this situation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
7920de0a2a
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
- Enable label-based image test and use free runner to run lint
- soft revert
26f388ba08
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: main
- vLLM main:
404c85ca72
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
`torch_npu.npu_apply_rotary_pos_emb` only supports head_size and
rotary_dim equal to 128. An error occurs when running GLM.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: main
- vLLM main:
404c85ca72
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
modelslim generates self.bias for rms norm during quantization. Since
RMSNorm in vllm has no such parameter, it is necessary
to create an AscendQuantRmsNorm.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested by deepseek-v3.1-w8a8
<img width="2496" height="592" alt="image"
src="https://github.com/user-attachments/assets/004c6e76-3d7a-4a1f-b59f-a14304012663"
/>
- vLLM version: main
- vLLM main:
d6249d0699
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
This PR sets the default format of GMM w2_weight in w8a8_dynamic to be
NZ to improve performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: main
- vLLM main:
e40827280b
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
1. Move ops/comm_utils to ops/moe/comm_utils
2. Move distributed/tensor_parallel/gather_from_sequence_parallel_region
to ops/moe/comm_utils
3. Delete distributed/tensor_parallel
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut
- vLLM version: main
- vLLM main:
a1213fae5f
---------
Signed-off-by: wuweiqiang24 <1005334931@qq.com>
Signed-off-by: wuweiqiang24 <wuweiqiang11@huawei.com>
### What this PR does / why we need it?
This PR prefetches the weights of the mlp layers in Qwen dense models,
mainly to optimize performance in the decode phase.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
a1213fae5f
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Shuming19 <313093131@qq.com>
### What this PR does / why we need it?
[Feat]support dynamic quantization in allgather
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: main
- vLLM main:
5931b7e5d9
Signed-off-by: withHades <244036962@qq.com>
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
support qwen25 vl w8a8 quantization
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
62f66be1f7
---------
Signed-off-by: lijiaojiao <lijiaojiao990304@163.com>
Co-authored-by: lijiaojiao <lijiaojiao990304@163.com>
This PR is based on top of
[#23569](https://github.com/vllm-project/vllm/pull/23569) and
[#24219](https://github.com/vllm-project/vllm/pull/24219).
### What this PR does / why we need it?
This PR allows the model runner to function asynchronously when using
async scheduling. This allows full overlap of the cpu operations
(including prepare_inputs) and the model forward pass. This diff is
functional and does not support speculative decoding, PP, or guided
decoding.
Expected speedup is 5-10% over the current async scheduling.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
server
```
python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B \
--trust-remote-code --enforce-eager \
--distributed-executor-backend=mp \
-tp=4 \
--port 8006 \
--max-model-len 32000 \
--block-size 128 \
--gpu-memory-utilization 0.99
```
client
```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
--dataset-name random --random-input-len 2048 --random-output-len 2048 \
--ignore-eos \
--num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \
--metric-percentiles 90 --base-url http://localhost:8006 --save-result \
--result-dir $PROFILER_DIR
```
Benchmark TPOT results for Qwen3-32B:
||forward async|scheduler async|sync|
|-|-|-|-|
|avg|41.73|41.86|44.20|
|improvement vs. scheduler async|0.3%|0|0|
|improvement vs. sync|5.58%|0|0|
Benchmark TPOT results for Qwen2___5-VL-7B-Instruct:
||forward async|sync|
|-|-|-|
|avg|23.22|29.16|
|improve|20.3%|0|
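The improvement rows appear to be the relative reduction in average TPOT against each baseline. A minimal sketch that recomputes them from the averages in the tables (the function name is hypothetical):

```python
def tpot_improvement(baseline, candidate):
    """Relative TPOT reduction in percent (lower TPOT is better)."""
    return (baseline - candidate) / baseline * 100.0


# Qwen3-32B: forward async vs. scheduler async, and vs. sync
vs_scheduler_async = tpot_improvement(41.86, 41.73)
vs_sync = tpot_improvement(44.20, 41.73)

# Qwen2___5-VL-7B-Instruct: forward async vs. sync
vl_vs_sync = tpot_improvement(29.16, 23.22)
```

These come out to roughly 0.31%, 5.59%, and 20.37%, which agrees with the table values once those are read as truncated.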
- vLLM version: main
- vLLM main:
e93f4cc9e3
Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com>
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>