### What this PR does / why we need it?
While running quantized deepseek models with an unquantized MTP layer, free
NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group argument.
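For reference, a minimal sketch of the idea (the `ep_group` argument below is illustrative, not the actual vllm-ascend code): passing the device process group explicitly keeps HCCL from lazily creating an extra communicator buffer on the default group.
```python
import torch
import torch.distributed as dist


def dispatch_tokens(tokens: torch.Tensor, ep_group) -> torch.Tensor:
    output = torch.empty_like(tokens)
    # Passing group= explicitly avoids allocating a separate HCCL buffer
    # (2*HCCL_BUFFSIZE) for the default/world process group.
    dist.all_to_all_single(output, tokens, group=ep_group)
    return output
```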
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We ran vLLM online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free memory increased without a redundant VRAM
buffer for the HCCL communication op (`all_to_all_single`).
- vLLM version: v0.10.2
- vLLM main:
6d8246aaff
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Followup on https://github.com/vllm-project/vllm-ascend/pull/3064
1. Limit the vLLM version to the same hash for mypy.
2. Fix the vLLM version bug for the e2e light test.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
c60e6137f0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Refactor CI to reuse a base workflow and enable the 2-hour trigger job on
main:
- Extract the e2e test into `_e2e_test.yaml`
- Reuse `_e2e_test` in the light / full jobs
- Enable the 2-hour trigger job on main
2. Rename the e2e test to ascend test so the action display label is clear
3. Re-enable UT coverage, which had been failing since
5bcb4c1528
and was disabled in
6d8bc38c7b
### Does this PR introduce _any_ user-facing change?
Only developer behavior changes:
- Every job triggers the full test with the vLLM release and main hash
- The full job runs every 2 hours against vLLM main
- e2e light test (30 mins): `lint` (6 mins) ---> ut (10 mins) --->
`v0.10.2 + main / 4 jobs` (15 mins)
- e2e full test (1.5 h): `ready label` ---> `v0.10.2 + main / 4 jobs`,
about 1.5 h
- schedule test: every 2 hours ---> `v0.10.2 + main / 4 jobs`, about 1.5 h
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
c60e6137f0
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Bump main to
c60e6137f0
- Updated imports from `vllm.config` to `vllm.config.model` (aed16879a9,
https://github.com/vllm-project/vllm/pull/25252)
- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9, https://github.com/vllm-project/vllm/pull/25252)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
6d8246aaff
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR prepares for deleting the environment variable
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vLLM requires `fullgraph=True`
to run.
- Fixes https://github.com/vllm-project/vllm/issues/21834
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See CI
- vLLM version: v0.10.2
- vLLM main:
99cc41ad50
---------
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
### What this PR does / why we need it?
1. Update the expected accuracy for DeepSeek-V2-Lite
2. Add batch size
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Accuracy CI passed
- vLLM version: v0.10.2
- vLLM main:
838d7116ba
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Fix shape mismatch when testing LLM-Research/Phi-4-mini-instruct accuracy.
### Does this PR introduce _any_ user-facing change?
Users can't set a dynamic batch_size or use lm_eval to test accuracy when
using models with sliding_window.
### How was this patch tested?
Accuracy of LLM-Research/Phi-4-mini-instruct is OK:
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8105|± |0.0108|
| | |strict-match | 5|exact_match|↑ |0.8097|± |0.0108|
```
- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Some custom models in vllm-ascend define packed_modules_mapping, which
prevents keeping the same model classes as the vLLM community. So move these
custom packed_modules_mapping entries to the quant utils.py. After this PR,
some custom models can be removed.
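As a rough illustration of the pattern (the mapping content below is the common qkv/gate_up example, not the exact table moved in this PR), the mapping can live in quant utils keyed by model type instead of on a custom model class:
```python
# Hypothetical lookup in quant utils.py; the real table and key names may differ.
PACKED_MODULES_MAPPING: dict[str, dict[str, list[str]]] = {
    "deepseek_v2": {
        "gate_up_proj": ["gate_proj", "up_proj"],
    },
    "qwen2": {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    },
}


def get_packed_modules_mapping(model_type: str) -> dict[str, list[str]]:
    return PACKED_MODULES_MAPPING.get(model_type, {})
```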
### Does this PR introduce _any_ user-facing change?
tested by CI
### How was this patch tested?
tested by CI
- vLLM version: v0.10.2
- vLLM main:
5089fd749c
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Increase the doctest timeout to 300s and print elapsed time. According to the
time print in https://github.com/vllm-project/vllm-ascend/pull/3045, most of
the time is consumed in `Graph capturing`, so it's fine to increase the
doctest timeout.
This PR also adds a time log for each task.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`
- CI passed
- vLLM version: v0.10.2
- vLLM main:
a684c0124c
Closes: https://github.com/vllm-project/vllm-ascend/issues/3045
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Add a missing barrier when no implicit synchronization by `repeat_interleave`
is available. Otherwise, the `non_blocking=True` copy of `output_splits`
and `input_splits` from the NPU may fail to complete before the later
`async_all_to_all` uses them.
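A minimal sketch of the hazard, assuming the torch_npu stream API (names here are illustrative, not the actual vllm-ascend code):
```python
import torch
import torch_npu  # noqa: F401  # assumed: provides the torch.npu backend


def copy_splits_to_host(output_splits_npu: torch.Tensor,
                        input_splits_npu: torch.Tensor):
    # Device-to-host copies issued asynchronously on the current stream.
    output_splits = output_splits_npu.to("cpu", non_blocking=True)
    input_splits = input_splits_npu.to("cpu", non_blocking=True)
    # Barrier: without an implicit sync (e.g. from repeat_interleave), wait
    # for the copies to land before the host values feed async_all_to_all.
    torch.npu.current_stream().synchronize()
    return output_splits.tolist(), input_splits.tolist()
```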
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
The speculative decode phase of chunked prefill took an incorrect
path; speculative decoding should always use the TND layout.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
6d8246aaff
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
Based on https://github.com/vllm-project/vllm/pull/23770, this fixes async
scheduling and PP compatibility with DP. It also fixes an issue with
finished requests not being processed in async scheduling and PP cases,
and possible worker race conditions.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
544fe76b95
---------
Signed-off-by: jesse <szxfml@gmail.com>
This PR puts the calculation of shared experts into a separate stream,
overlapping with the routed experts.
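A minimal sketch of the overlap, assuming the torch_npu stream API (the module names and the final combine are illustrative, not the actual vllm-ascend code):
```python
import torch
import torch_npu  # noqa: F401  # assumed: provides the torch.npu backend

shared_stream = torch.npu.Stream()


def moe_forward(hidden_states, routed_experts, shared_experts):
    # Side stream starts once the inputs are ready on the default stream.
    shared_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(shared_stream):
        shared_out = shared_experts(hidden_states)
    # Routed experts run concurrently on the default stream.
    routed_out = routed_experts(hidden_states)
    # Re-join before combining the two partial results.
    torch.npu.current_stream().wait_stream(shared_stream)
    return routed_out + shared_out
```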
- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Fix accuracy for DeepSeek-V2-Lite
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
66072b36db
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR fixes a few issues in prefill disaggregation:
1. Fix a kvcache address alignment issue: llmdatadist needs the addresses of
the tensors to be aligned to 2 MB.
2. Fix a kvcache shape error: llmdatadist requires k/v tensors with shape
[num_blocks, ...], but the implementation before this PR used
[2, num_blocks, ...], which breaks prefill disaggregation (see the sketch
below).
3. Use the hybrid kv cache only when running qwen3_next, to fix an accuracy
issue in prefill disaggregation.
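An illustrative sketch of the two constraints above (not the actual connector code; shapes and names are assumptions):
```python
import torch

ALIGNMENT = 2 * 1024 * 1024  # llmdatadist expects 2 MB-aligned tensor addresses


def is_aligned(t: torch.Tensor) -> bool:
    # Check used when registering tensors with llmdatadist.
    return t.data_ptr() % ALIGNMENT == 0


def allocate_kv_cache(num_blocks: int, block_shape: tuple, dtype, device):
    # Allocate k and v as separate [num_blocks, ...] tensors rather than one
    # [2, num_blocks, ...] tensor, so each can be registered on its own.
    k_cache = torch.zeros((num_blocks, *block_shape), dtype=dtype, device=device)
    v_cache = torch.zeros((num_blocks, *block_shape), dtype=dtype, device=device)
    return k_cache, v_cache
```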
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally by @liziyu179
- vLLM version: v0.10.2
- vLLM main:
4f02b77de4
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Fix VocabParallelEmbedding UT
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
f592b3174b
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
For sleep mode level 2, we discard both the model weights and the kv_cache.
The problem is: when we discard the weights, we also discard the tensors
representing model state exposed by `model.named_buffers()`, such as
`running_mean` / `running_var` in BatchNorm and the RoPE cos/sin cache. If we
update the weights but forget to update the buffers as well, this leads to
unknown issues.
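A minimal sketch of the idea in plain PyTorch (helper names are illustrative, not the actual vllm-ascend sleep-mode code):
```python
import torch


def snapshot_buffers(model: torch.nn.Module) -> dict:
    # Buffers (BatchNorm running stats, RoPE cos/sin caches, ...) are not in
    # named_parameters(), so they need their own save/restore path.
    return {name: buf.detach().cpu().clone()
            for name, buf in model.named_buffers()}


def restore_buffers(model: torch.nn.Module, snapshot: dict) -> None:
    for name, buf in model.named_buffers():
        if name in snapshot:
            buf.copy_(snapshot[name].to(buf.device))
```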
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5963b98b46
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Remove the chunked-prefill-for-MLA branch in MLA, and change the dtype of
prefill_mask to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
This PR fixes two problems when `multistream_moe` is enabled in torchair
graph mode:
1. Check `TorchairAscendW8A8DynamicFusedMoEMethod` instead of the incorrect
`AscendW8A8DynamicFusedMoEMethod`.
2. mc2_mask should be chunked regardless of whether `replace_allreduce` is
True or False in the forward function of `TorchairAscendFusedMoE`.
- vLLM version: v0.10.2
- vLLM main:
0fb2551c23
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Add multi-node ray backend tutorial for Qwen235B-A3B
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
f4cd80f944
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
The current linear.py has the following issues:
- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.
- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.
To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the classes selection logic into
common functions. For future additions of custom communication groups or
forward methods, it will only be necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection
logic.
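A rough sketch of the pattern described above (everything except the `CustomColumnParallelOp` / `CustomRowParallelOp` names is an assumption, not the actual vllm-ascend code):
```python
class CustomColumnParallelOp:
    """Bundles the comm group and forward behaviour for one linear layer."""

    def __init__(self, layer):
        self.layer = layer

    def comm_group(self) -> str:
        return "tp"  # default: tensor-parallel group

    def forward(self, x):
        return self.layer.quant_method.apply(self.layer, x)


class MLPColumnParallelOp(CustomColumnParallelOp):
    """Example variant bound to a different communication group."""

    def comm_group(self) -> str:
        return "mlp_tp"


def select_column_parallel_op(layer, prefix: str) -> CustomColumnParallelOp:
    # Shared selection logic, replacing per-class conditionals.
    if "mlp" in prefix:
        return MLPColumnParallelOp(layer)
    return CustomColumnParallelOp(layer)
```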
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
dd39baf717
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: weijinqian0 <weijinqian@huawei.com>
### What this PR does / why we need it?
[Bugfix]: replace npu_incre_flash_attention with
npu_fused_infer_attention_score so that the tiling can be updated.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
2b85697031
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
### What this PR does / why we need it?
This PR depends on the merge of #2707 and adapts the aclgraph
functionality to support MTP.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
2b85697031
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable kvcache
reuse in scenarios with system prompts or multi-turn dialogues.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5963b98b46
---------
Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
https://github.com/vllm-project/vllm-ascend/pull/2849 moved the
implementation of `shared_expert_dp` to torchair deepseek_modeling.
However, calling `set_forward_context` with `enforce_eager` and
`shared_expert_dp` falls back to the implementation in
model_runner_v1.py and sets the global attn_metadata as a dictionary. This
leads to a RuntimeError when attn_metadata is fetched from the forward
context and used in torchair_deepseek_v2.py. This PR fixes the problem
by introducing the transformation of attn_metadata in this file.
Note that the current E2E testing lacks a case of deepseek with
`shared_expert_dp`. We need to add an ST with `shared_expert_dp` to the
testing workflow.
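A minimal sketch of the transformation, assuming vLLM's forward-context API (the layer-name handling is illustrative, not the exact torchair_deepseek_v2.py code):
```python
from vllm.forward_context import get_forward_context


def resolve_attn_metadata(layer_name: str):
    attn_metadata = get_forward_context().attn_metadata
    # model_runner_v1.py may store attn_metadata as a per-layer dict; pick
    # the entry for this layer before it is used by the torchair model.
    if isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata[layer_name]
    return attn_metadata
```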
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
e2e vllm serving with `enable_shared_expert_dp: true` passed.
- vLLM version: v0.10.2
- vLLM main:
de3e53a75b
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Add an option to enable frozen parameters.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb
Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
### Motivation
Currently, dynamic expert balancing stops the world. Asynchronous expert load
balancing would be better and avoids the following problems:
Host-bound latency: there are many CPU operations during EPLB, such as the
EPLB algorithm, creating p2p ops, and log2phy expert conversion, which take a
long CPU time (~1s).
Communication latency: the transfer time is costly without NVLink. Because the
weights of an expert may be transferred to multiple new positions, one expert
needs N send/recv operations, resulting in long latency. We measured that
batch_isend_irecv costs more than 100ms for transmitting 16 experts' weights
on an Ascend A2 server.
SwiftBalancer no longer stops the world: in our tests on NPU it costs 1~2ms
per layer while saving 5ms-8ms of decode latency with ep_size = 64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Async CPU computing for the EPLB algorithm and other Python operators.
3. A new EPLB algorithm that does less expert rebalancing with almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record the expert distribution during forward. We use expert_token_num
after dispatch instead of topk_ids, so the recorded tensor is much smaller,
reducing the cost of HBM recording and the add operator.
2. Do an all-gather of the expert distribution. All-gather is used instead of
all-reduce because it has lower traffic volume (see the sketch after this
list).
3. Wake up the EPLB worker process with the expert distribution when
num_iterations is reached. Run the EPLB algorithm in the EPLB worker.
4. Generate p2p send/recv ops and other operations such as log2phy that would
otherwise cost long CPU time.
5. Launch ibatch_send_recv on an async stream before forward.
6. After forward, wait for ibatch_send_recv to finish, then update the expert
map and expert weights.
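An illustrative sketch of steps 2 and 5 (function names are assumptions, not the actual SwiftBalancer code; the stream API assumes torch_npu):
```python
import torch
import torch_npu  # noqa: F401  # assumed: provides the torch.npu backend
import torch.distributed as dist


def gather_expert_distribution(local_counts: torch.Tensor, ep_group) -> torch.Tensor:
    # Step 2: all-gather per-rank expert token counts (cheaper than all-reduce
    # here, since every rank needs the full distribution anyway).
    world = dist.get_world_size(ep_group)
    gathered = torch.empty(world * local_counts.numel(),
                           dtype=local_counts.dtype, device=local_counts.device)
    dist.all_gather_into_tensor(gathered, local_counts, group=ep_group)
    return gathered.view(world, -1)


def launch_weight_exchange(p2p_ops: list, stream) -> list:
    # Step 5: p2p_ops is a list of dist.P2POp built by the EPLB worker
    # (step 4); launch the batched send/recv on a side stream so it overlaps
    # with the next forward, and wait on the returned requests afterwards.
    with torch.npu.stream(stream):
        return dist.batch_isend_irecv(p2p_ops)
```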
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com
- vLLM version: v0.10.2
- vLLM main:
567939953b
---------
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
### What this PR does / why we need it?
1. In memory of #2509, fix MTP torchair in the PD disaggregation scenario.
2. Fix an MLA bug in the spec decoding scenario, since num_decodes !=
num_decode_tokens.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
5206ab20ba
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
This PR fuses the add-RMSNorm op and the W8A8 quant op for better performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.2
- vLLM main:
0faf3cc3e8
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
In memory of #677, a long overdue milestone. Now DeepSeek V3/R1 should
be OK with ACL Graph.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Working on it.
- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR deletes ~2K lines of deepseek modeling code. It falls back from the
CustomDeepseekV2 modules to the original vLLM implementations and adapts
some vLLM modifications around deepseek and MoE.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with torchair graph mode and eager mode.
- vLLM version: v0.10.2
- vLLM main:
759ef49b15
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Replace the prepare/finalize operations in fused_moe.py with
moe_comm_method.prepare()/finalize().
2. Replace unified_fused_experts with moe_comm_method.fused_experts() in
fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py (call pattern sketched below).
3. Call _select_moe_comm_method in the spec-decode proposers.
4. Currently, w4a8_dynamic does not support gatherep; use all2allv
instead.
5. Remove redundant code.
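A rough sketch of the delegation described in items 1-2 (the signatures are assumptions, not the actual vllm-ascend API):
```python
def moe_forward(moe_comm_method, hidden_states, topk_ids, topk_weights, w1, w2):
    # Dispatch tokens to experts via the selected communication method.
    dispatched, context = moe_comm_method.prepare(hidden_states, topk_ids)
    # Run the expert computation provided by the same method.
    expert_out = moe_comm_method.fused_experts(dispatched, w1, w2, topk_weights)
    # Combine results back into the original token order.
    return moe_comm_method.finalize(expert_out, context)
```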
### Does this PR introduce _any_ user-facing change?
The AllgatherEP switch is disabled in aclgraph/eager mode; it just follows the
rules in modelrunner_v1._select_moe_comm_method().
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.2
- vLLM main:
7f6f2c1182
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
Update max_tokens and prompt in qwen3 online doc
Before:
```
"'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None"
```
After:
```
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct",
"messages": [
{"role": "user", "content": "Who are you?"}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32
}'
.{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Manually test on my local env
- CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
On main, AscendScheduler does not support multimodal models because it lacks
scheduled_encoder_inputs, which is needed for multimodal inference.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155
- vLLM version: v0.10.2rc2
- vLLM main:
15b8fef453
---------
Signed-off-by: fan2956 <zhoufan53@huawei.com>
Co-authored-by: zhoufan2956 <zhoufan2956@163.com>
### What this PR does / why we need it?
Bump vLLM version to v0.10.2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc3
- vLLM main:
15b8fef453
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This reverts commit 339fceb89c.
### Does this PR introduce _any_ user-facing change?
Yes, use 8.2rc1 image by default
### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc2
- vLLM main:
cfa3234a5b
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR adapts the hostname format for Mooncake when using adxl. When
Mooncake uses adxl, `USE_ASCEND_DIRECT` must be set to True in
`/Mooncake/mooncake-common/common.cmake` during compilation. The
mooncake_connector obtains this config by calling
`vllm_config.kv_transfer_config.get_from_extra_config`, determines
whether Mooncake is using adxl, and selects the corresponding hostname
format.
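A minimal sketch of the lookup, assuming vLLM's `get_from_extra_config` API (the key name and the two hostname formats below are placeholders, not the actual values used by the connector):
```python
def resolve_hostname(vllm_config, ip: str, port: int) -> str:
    # Hypothetical extra-config key; the real key may differ.
    use_adxl = vllm_config.kv_transfer_config.get_from_extra_config(
        "use_ascend_direct", False)
    # Placeholder formats: adxl-style "ip:port" vs. the default bare ip.
    return f"{ip}:{port}" if use_adxl else ip
```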
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: main
- vLLM main:
d21a36f5f9
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Enable push trigger for image job
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Followup on https://github.com/vllm-project/vllm-ascend/pull/2864
- vLLM version: v0.10.2rc2
- vLLM main:
89e08d6d18
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>