Commit Graph

1130 Commits

Author SHA1 Message Date
linfeng-yuan
ffdd1a36e2 [bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068)
### What this PR does / why we need it?
While running quantized DeepSeek models with an unquantized MTP layer, free
NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group argument.
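
A minimal sketch of the fixed call pattern (the `ep_group` name and the helper around it are illustrative, not the actual vllm-ascend code):

```python
import torch
import torch.distributed as dist


def exchange_token_splits(input_splits: torch.Tensor,
                          ep_group: dist.ProcessGroup) -> torch.Tensor:
    # Passing the device (HCCL) process group explicitly keeps the collective
    # on the group that already owns a communication buffer; omitting `group`
    # falls back to the default group and allocates an extra HCCL buffer.
    output_splits = torch.empty_like(input_splits)
    dist.all_to_all_single(output_splits, input_splits, group=ep_group)
    return output_splits
```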

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We ran vLLM online serving with quantized DeepSeek-R1 and an unquantized
MTP layer, and observed that free memory increased without the redundant VRAM
buffer for the HCCL communication op (all_to_all_single).

- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-22 14:06:43 +08:00
Icey
14b39d3c70 [1/N][Refactor][Qwen3-Next] remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention (#3019)
### What this PR does / why we need it?
remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        # model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-30B-A3B",
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```

- vLLM version: v0.10.2
- vLLM main:
9d1c50a5ac

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-22 11:24:08 +08:00
wangxiyuan
88d24cce8b [CI] Enable main based lint check and light ci matrix (#3079)
### What this PR does / why we need it?
Follow-up on https://github.com/vllm-project/vllm-ascend/pull/3064:
1. Limit the vLLM version to the same hash for the mypy check.
2. Fix the vLLM version bug for the e2e light test.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-22 10:37:53 +08:00
Yikun Jiang
693f547ccf Refactor ci to reuse base workflow and re-enable ut coverage (#3064)
### What this PR does / why we need it?
1. Refactor CI to reuse a base workflow and enable a 2-hourly trigger job
on main:
- Extract the e2e test into _e2e_test.yaml
- Reuse _e2e_test in the light / full jobs
- Enable the 2-hourly trigger job on main

2. Rename the e2e test to ascend test so the action label displays correctly
3. Re-enable ut coverage, which had been failing since
5bcb4c1528
and was disabled in
6d8bc38c7b

### Does this PR introduce _any_ user-facing change?
Only developer behavior changes:
- Every job triggers the full test with the vLLM release and hash
- Run the full job every 2 hours against vLLM main
- e2e light test (30 mins): `lint` (6 mins) ---> ut (10 mins) --->
`v0.10.2 + main / 4 jobs` (15 mins)
- e2e full test (1.5 h): `ready label` ---> `v0.10.2 + main / 4 jobs`,
about 1.5 h
- schedule test: every 2 hours ---> `v0.10.2 + main / 4 jobs`, about 1.5 h

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 13:27:08 +08:00
Yikun Jiang
b8b68b3dfe [CI] Upgrade vLLM to 20250920 (c60e613) and address config break (#3067)
### What this PR does / why we need it?
Bump main to
c60e6137f0

- Updated imports in `vllm.config` to
`vllm.config.model`(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252
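
A minimal sketch of the kind of string-based handling this enables (names are illustrative, assuming `logprobs_mode` may arrive either as a plain string or as an older enum member):

```python
def resolve_logprobs_mode(logprobs_mode) -> str:
    # Newer vLLM passes plain strings such as "raw_logprobs"; older code may
    # still hand over a LogprobsMode enum member whose .value is that string.
    if isinstance(logprobs_mode, str):
        return logprobs_mode
    return logprobs_mode.value
```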

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 09:49:17 +08:00
Li Wang
12bcbd02bb [CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907)
### What this PR does / why we need it?
1. This PR bumps the vLLM commit to
6d8246aaff
2. Fix for the upstream change https://github.com/vllm-project/vllm/pull/24548
regarding multi-modal kwargs, making both vLLM main and `v0.10.2` adaptable
3. Fix the metadata_builder changes introduced by
https://github.com/vllm-project/vllm/pull/23693
4. Fix the `structured_outputs_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22772
5. Fix the `moe_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-20 17:37:57 +08:00
Lucas Kabela
53ecd89e8f [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE (#2969)
### What this PR does / why we need it?
This PR prepares for deleting the environment variable
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vLLM requires `fullgraph=True`
to run.

- Fixes https://github.com/vllm-project/vllm/issues/21834

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
See CI 

- vLLM version: v0.10.2
- vLLM main:
99cc41ad50

---------

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2025-09-20 08:22:30 +08:00
zhangxinyuehfad
e26fe1caf1 [TEST] Speed up DS V2 accuracy test and turn up accuracy baseline (#3047)
### What this PR does / why we need it?
1. update expected accuracy for DeepSeek-V2-Lite
2. add batch size 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Accuracy CI passed

- vLLM version: v0.10.2
- vLLM main:
838d7116ba

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-20 00:40:33 +08:00
zhangxinyuehfad
a22b532d38 [Fixbug] Fix shape not match when sliding_window and dynamic batch_size (#2830)
### What this PR does / why we need it?
Fix a shape mismatch when testing LLM-Research/Phi-4-mini-instruct accuracy.

### Does this PR introduce _any_ user-facing change?

Previously, users could not set a dynamic batch_size or use lm_eval to test
accuracy for models that use sliding_window.

### How was this patch tested?
Accuracy of LLM-Research/Phi-4-mini-instruct is OK:
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```


- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-19 22:35:14 +08:00
zhanghw0354
cf549b976d [Test]Add unit test for compilation/acl_graph.py (#3039)
### What this PR does / why we need it?
According to issue
[#1298](https://github.com/vllm-project/vllm-ascend/issues/1298), this pull
request adds unit test code for compilation/acl_graph.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
f2718d2948

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-09-19 21:31:17 +08:00
22dimensions
0942d9aaab [3/N][Refactor][Quantization]remove packed_modules_mapping from models (#3021)
### What this PR does / why we need it?

Some custom models in vllm-ascend define packed_modules_mapping, which
prevents reusing the same model classes as the vLLM community. This PR moves
these custom packed_modules_mapping entries to the quantization utils.py.
After this PR, some custom models can be removed.

### Does this PR introduce _any_ user-facing change?

tested by CI

### How was this patch tested?

tested by CI

- vLLM version: v0.10.2
- vLLM main:
5089fd749c

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-19 20:50:14 +08:00
Yikun Jiang
4ba56716f9 Increase doctest timeout to 300s and time print (#3041)
### What this PR does / why we need it?
Increase the doctest timeout to 300s and print timings. According to the time
print in https://github.com/vllm-project/vllm-ascend/pull/3045, most of the
time is consumed in `Graph capturing`, so it is fine to increase the doctest
timeout.

This PR also adds a time log for each task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`
- CI passed

- vLLM version: v0.10.2
- vLLM main:
a684c0124c

Closes: https://github.com/vllm-project/vllm-ascend/issues/3045

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-19 20:26:00 +08:00
Shanshan Shen
8326f15ecf [CustomOp] Register AscendSharedFusedMoE custom op (#2980)
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DeepSeek-V2-Lite` is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```


- vLLM version: v0.10.2
- vLLM main:
486c5599e3

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-09-19 19:05:01 +08:00
sdmyzlp
05a700d370 [Bugfix] Fix async copy bug under single expert scenario (#3005)
Add the missing barrier for when no implicit synchronization by
`repeat_interleave` is available. Otherwise, the `non_blocking=True` copy of
`output_splits` and `input_splits` from the NPU may fail to complete before the
later `async_all_to_all` uses them.
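
A minimal sketch of the failure mode and the explicit synchronization, assuming the torch_npu stream API mirrors torch.cuda (function and variable names are illustrative, not the actual vllm-ascend code):

```python
import torch
import torch_npu  # noqa: F401  # provides the torch.npu namespace


def fetch_splits_to_host(output_splits_npu: torch.Tensor,
                         input_splits_npu: torch.Tensor):
    # Non-blocking device-to-host copies return immediately; the host tensors
    # are only guaranteed to be filled once the copy stream has synchronized.
    output_splits = output_splits_npu.to("cpu", non_blocking=True)
    input_splits = input_splits_npu.to("cpu", non_blocking=True)
    # With a single expert there is no repeat_interleave on the host to force
    # an implicit sync, so an explicit barrier is needed before the values are
    # consumed by async_all_to_all.
    torch.npu.current_stream().synchronize()
    return output_splits, input_splits
```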

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-09-19 14:05:36 +08:00
xuyexiong
2a87b4cecb [Bugfix] Fix specdecoding in chunkedprefill scenario (#3025)
### What this PR does / why we need it?

The speculative decoding phase under chunked prefill took an incorrect
path; it should always use the TND layout for speculative decoding.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-19 14:05:08 +08:00
Song Zhixin
833cd1b698 [BugFix] Async scheduling and PP compatibility with DP (#2796)
### What this PR does / why we need it?
Based on https://github.com/vllm-project/vllm/pull/23770, this fixes async
scheduling and PP compatibility with DP. It also fixes an issue with finished
requests not being processed in async scheduling and PP cases, and possible
worker race conditions.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
544fe76b95

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-09-19 11:29:50 +08:00
whx
0a526768f5 [Feature] Support moe multi-stream for aclgraph. (#2946)
This PR puts the calculation of shared experts into a separate stream,
overlapping it with the routed experts.
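
A minimal sketch of the overlap pattern, assuming the torch_npu stream API mirrors torch.cuda (the expert callables are illustrative placeholders, not the actual vllm-ascend code):

```python
import torch
import torch_npu  # noqa: F401  # provides the torch.npu namespace

_shared_expert_stream = torch.npu.Stream()


def moe_forward(hidden_states, routed_experts_fn, shared_experts_fn):
    # Run the shared experts on a side stream so their compute overlaps with
    # the routed experts on the default stream.
    _shared_expert_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(_shared_expert_stream):
        shared_out = shared_experts_fn(hidden_states)
    routed_out = routed_experts_fn(hidden_states)
    # Join the streams before the two partial results are combined.
    torch.npu.current_stream().wait_stream(_shared_expert_stream)
    return routed_out + shared_out
```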

- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-19 11:06:45 +08:00
zhangxinyuehfad
0c04bf1e36 [Fixbug] Fix accuracy for DeepSeek-V2-Lite (#3016)
### What this PR does / why we need it?

Fix accuracy for DeepSeek-V2-Lite

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
66072b36db

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-18 23:58:23 +08:00
Mengqing Cao
367edff5af [HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007)
### What this PR does / why we need it?
This pr fixes a few issues on prefill disaggregation:
1. Fix prefill disaggregation kvcache addr alignment issue, llmdatadist
needs the addr of tensors to be aligned with 2M
2. Fix prefill disaggregation kvcache shape error, llmdatadist requires
k/v tensors with shape [num_blocks, ...], however the implentment before
this pr is [2, num_blocks, ...], which will break prefill disaggregation
3. Use hybrid kv cache only when running qwen3_next to fix accuracy
issue on prefill disaggregation.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tested locally by @liziyu179 

- vLLM version: v0.10.2
- vLLM main:
4f02b77de4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-18 21:43:22 +08:00
Icey
acb46f303f Fix VocabParallelEmbedding UT (#2722)
### What this PR does / why we need it?
Fix VocabParallelEmbedding UT

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
f592b3174b

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-18 19:54:01 +08:00
Li Wang
01592515b8 [Bugfix] Fix sleep mode level 2 (#1376)
### What this PR does / why we need it?
For sleep mode level 2, we discarded model both weights and kv_cache,
but the problems is: When we discard weights, we also discard some
tensors representing the model state which we called
`model.named_buffers()`, such as: `running_mean / running_var` in
BatchNorm、rope cos-sin cache ... when we update weights, but forgot to
update buffers as well, this will lead to some unknown issue
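
An illustrative sketch of the point being made (plain PyTorch, not vllm-ascend's actual sleep-mode implementation): restoring parameters alone is not enough, buffers have to be saved and restored too.

```python
import torch
from torch import nn


def snapshot_model_state(model: nn.Module) -> dict:
    # Capture parameters *and* buffers; buffers such as BatchNorm
    # running_mean/running_var or a rope cos-sin cache are model state even
    # though they are not trainable parameters.
    state = {n: p.detach().cpu().clone() for n, p in model.named_parameters()}
    state.update({n: b.detach().cpu().clone() for n, b in model.named_buffers()})
    return state


def restore_model_state(model: nn.Module, state: dict) -> None:
    # Copying back only named_parameters() would leave the buffers stale
    # after a level-2 sleep, so copy the buffers back as well.
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.copy_(state[name])
        for name, buf in model.named_buffers():
            buf.copy_(state[name])
```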
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
5963b98b46

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-18 19:51:52 +08:00
LeeWenquan
f4e3d22432 Remove chunked_prefill_for_mla and fix ring_mla bug (#2781)
### What this PR does / why we need it?
Remove the chunked-prefill-for-MLA branch in MLA, and change the dtype of
prefill_mask to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-18 19:43:26 +08:00
linfeng-yuan
79a910ef47 [bugfix][torchair] fix multistream_moe problems in torchair graph mode (#2681)
This PR fixes two problems when `multistream_moe` is enabled in torchair
graph mode:
1. Check `TorchairAscendW8A8DynamicFusedMoEMethod` instead of the incorrect
`AscendW8A8DynamicFusedMoEMethod`.
2. mc2_mask should be chunked whether `replace_allreduce` is True or
False in the forward function of `TorchairAscendFusedMoE`.

- vLLM version: v0.10.2
- vLLM main:
0fb2551c23

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-18 17:35:04 +08:00
Li Wang
4267f5d55f [Doc] Add multi-node ray backend tutorial (#2376)
### What this PR does / why we need it?
Add multi-node ray backend tutorial for Qwen235B-A3B

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
f4cd80f944

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-18 15:30:18 +08:00
realliujiaxu
af2a886814 refactor linear (#2867)
### What this PR does / why we need it?
The current linear.py has the following issues:

- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.

- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.

To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the class-selection logic into
common functions. For future additions of custom communication groups or
forward methods, it will only be necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add the new selection
logic.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
dd39baf717

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: weijinqian0 <weijinqian@huawei.com>
2025-09-18 14:09:19 +08:00
panchao-hub
a7f8ed38ed [Bugfix]:replace npu_incre_flash_attention with npu_fused_infer_atten… (#2901)
### What this PR does / why we need it?
[Bugfix]: replace npu_incre_flash_attention with
npu_fused_infer_attention_score in order to support tiling updates.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
2025-09-18 14:06:08 +08:00
xuyexiong
6681dde902 [Feat][Graph] Support MTP for ACL Graph (#2932)
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
2b85697031

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-18 14:05:33 +08:00
Chao Lei
cef43b524e [Feat] A Connector that supports Mooncake store (#2913)
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable kvcache
reuse in scenarios with system prompts or multi-turn dialogues.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
5963b98b46

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: lizy124 <1950471827@qq.com>
Co-authored-by: zouyida2052 <zouyida2002@gmail.com>
2025-09-18 14:04:45 +08:00
realliujiaxu
723d460894 [Bugfix] fix kv nz accuracy bug (#2988)
When `enable_kv_nz` is true, the output of DeepSeek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
2b85697031

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-09-17 21:10:25 +08:00
linfeng-yuan
8bcc0ccd57 [bugfix] fix shared expert dp with hybrid kvcache (#2964)
### What this PR does / why we need it?
https://github.com/vllm-project/vllm-ascend/pull/2849 moved the
implementation of `shared_expert_dp` to torchair deepseek_modeling.
However, calling `set_forward_context` with `enforce_eager` and
`shared_expert_dp` falls back to the implementation in
model_runner_v1.py and sets the global attn_metadata as a dictionary. This
leads to a RuntimeError when attn_metadata is retrieved from the forward
context and used in torchair_deepseek_v2.py. This PR fixes the problem
by transforming attn_metadata in that file.
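
A minimal sketch of the transformation described above (the per-layer dict handling is an assumption for illustration, not the exact vllm-ascend code):

```python
from vllm.forward_context import get_forward_context


def get_layer_attn_metadata(layer_name: str):
    # model_runner_v1 may store attn_metadata as a dict keyed by layer name;
    # the torchair DeepSeek modeling code expects a single metadata object,
    # so unwrap the dict before use.
    attn_metadata = get_forward_context().attn_metadata
    if isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata[layer_name]
    return attn_metadata
```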

Note that current E2E testing lacks a case for DeepSeek with
`shared_expert_dp`. We need to add an ST with `shared_expert_dp` to the
testing workflow.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
e2e vllm serving with `enable_shared_expert_dp: true` passed.

- vLLM version: v0.10.2
- vLLM main:
de3e53a75b

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-17 20:01:47 +08:00
1Fire4
1f6465c399 Add an option of enable frozen parameter (#2869)
### What this PR does / why we need it?
Add an option to enable frozen parameters.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb

Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
2025-09-17 12:00:44 +08:00
offline893
76844eec78 Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently, dynamic expert balancing stops the world.
Asynchronous expert load balancing would be better and avoids the following
problems:

Host-bound latency:
There are many CPU operations during EPLB, such as the
eplb algorithm, creating p2p ops, and log2phy expert conversion, which
take a long CPU time, around ~1s.
Communication latency: The transfer time is costly in
situations without nvlink. Since the weights of an expert may be transferred to
multiple new positions, there are N send/recv calls per expert,
resulting in long latency. We tested that batch_isend_irecv costs more than
100ms for the weight transmission of 16 experts on an Ascend A2 server.

SwiftBalancer no longer stops the world; in our tests on NPU it costs 1~2ms
per layer while saving 5ms-8ms of decode latency with ep_size =
64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Async CPU computing for the eplb algorithm and other Python operators.
3. A new eplb algorithm with less expert rebalancing while keeping almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record the expert distribution during forward. We use expert_token_num
after dispatch instead of topk_ids, giving a much smaller tensor
shape and reducing the cost of HBM recording and the add operator.
2. Do an all-gather of the expert distribution. All-gather is used instead of
all-reduce since it has less traffic volume.
3. Wake up the eplb worker process with the expert distribution when
num_iterations arrives. Run the eplb algorithm in the eplb worker.
4. Generate p2p send/recv ops and other operators such as log2phy, which
take a long CPU time.
5. Launch the batched isend/irecv on an async stream before forward.
6. After forward, wait for the batched isend/irecv to finish, then update the
expert map and expert weights (see the sketch after this list).
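
A minimal sketch of steps 5-6 using torch.distributed's batched P2P API (the plan structures and function names are illustrative, not the actual SwiftBalancer code):

```python
import torch.distributed as dist


def launch_expert_weight_transfer(send_plan, recv_plan, group):
    # send_plan / recv_plan: lists of (tensor, peer_rank) pairs produced by
    # the eplb worker.
    ops = [dist.P2POp(dist.isend, t, peer, group=group) for t, peer in send_plan]
    ops += [dist.P2POp(dist.irecv, t, peer, group=group) for t, peer in recv_plan]
    # Kick off all transfers in one batch before the forward pass starts.
    return dist.batch_isend_irecv(ops) if ops else []


def finish_expert_weight_transfer(reqs):
    # Only wait after the forward pass, so the weight transmission overlaps
    # with computation; then the expert map and weights can be updated.
    for req in reqs:
        req.wait()
```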
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com


- vLLM version: v0.10.2
- vLLM main:
567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
xuyexiong
ae758dda05 [Bugfix] Fix mtp torchair in pd Disaggregation scenario (#2951)
### What this PR does / why we need it?
1. In memory of #2509, fix MTP torchair in the PD disaggregation scenario.
2. Fix an MLA bug in the spec decoding scenario, since num_decodes !=
num_decode_tokens.


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
5206ab20ba

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-17 09:07:58 +08:00
rjg-lyh
6b7117dbb7 [main] addrmsnorm + quant fusion optim in Dense Models (#2772)
### What this PR does / why we need it?
This PR fuses the addrmsnorm op and the w8a8 quant op to get better performance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
0faf3cc3e8

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-16 22:31:38 +08:00
yiz-liu
88ca8a051c [Feat][Graph] Support DeepSeek with ACL Graph (#2707)
### What this PR does / why we need it?
In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should
be OK with ACL Graph.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Working on it.

- vLLM version: v0.10.2
- vLLM main:
68dbde5dbb

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-16 17:50:17 +08:00
dependabot[bot]
3e60aa5483 Bump actions/setup-python from 5.4.0 to 6.0.0 (#2926)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.4.0 to 6.0.0.

- vLLM version: v0.10.2
- vLLM main:
3f3313981c

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-16 14:15:10 +08:00
linfeng-yuan
1c5900327b [refactor] refactor deepseek-related files (#2849)
### What this PR does / why we need it?
This PR deletes ~2K lines of code related to DeepSeek modeling. It falls back
the CustomDeepseekV2 modules to the original vLLM implementations and adapts
to some vLLM modifications related to DeepSeek and MoE.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vLLM serving with torchair graph mode and eager mode.

- vLLM version: v0.10.2
- vLLM main:
759ef49b15

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-16 14:13:07 +08:00
weichen
18ca7861f6 [Main] [Refactor] Enable MoECommMethod in Eager Mode (#2791)
### What this PR does / why we need it?
1. Replace the prepare/finalize operations in fused_moe.py with
moe_comm_method.prepare()/finalize() (see the sketch after this list).
2. Replace unified_fused_experts with moe_comm_method.fused_experts() in
fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py.
3. Add a call to _select_moe_comm_method in the spec-decode proposers.
4. Currently, w4a8_dynamic does not support gatherep; use all2allv
instead.
5. Remove redundant code.
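
A hypothetical sketch of the resulting call pattern in fused_moe.py (the method names follow the PR description; the signatures are illustrative only, not the actual vllm-ascend API):

```python
def moe_forward(moe_comm_method, layer, hidden_states, router_logits):
    # prepare() handles the communication-specific dispatch (e.g. allgather
    # or all2allv), fused_experts() runs the expert computation, and
    # finalize() combines/reduces the results back to the original layout.
    hidden_states, router_logits = moe_comm_method.prepare(hidden_states, router_logits)
    expert_output = moe_comm_method.fused_experts(hidden_states, router_logits, layer)
    return moe_comm_method.finalize(expert_output)
```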
### Does this PR introduce _any_ user-facing change?
The AllgatherEP switch is disabled in aclgraph/eager mode; it just follows the
rules in modelrunner_v1._select_moe_comm_method().
### How was this patch tested?
e2e & ut


- vLLM version: v0.10.2
- vLLM main:
7f6f2c1182

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-16 11:06:00 +08:00
Yikun Jiang
0aba644633 Update max_tokens and prompt in qwen3 online doc (#2945)
### What this PR does / why we need it?
Update max_tokens and prompt in qwen3 online doc
Before:
```
"'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None"
```

After:
```
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct",
  "messages": [
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32
}'
.{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Manually test on my local env
- CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-16 09:27:50 +08:00
wangxiyuan
048bfd5553 [Release] Add release note for v0.10.2rc1 (#2921)
Add release note for v0.10.2rc1

- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-16 01:20:05 +08:00
wangxiyuan
c556038ef0 [New model] Qwen3-next support (#2917)
### What this PR does / why we need it?
Add Qwen3-next support.

### Does this PR introduce _any_ user-facing change?
Yes, users can use Qwen3 Next.
Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916; the
tutorial will be available
[here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html)

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884

Co-Authored-By: Angazenn <supperccell@163.com>
Co-Authored-By: zzzzwwjj <1183291235@qq.com>
Co-Authored-By: MengqingCao <cmq0113@163.com>
Co-Authored-By: linfeng-yuan <1102311262@qq.com>
Co-Authored-By: hust17yixuan <303660421@qq.com>
Co-Authored-By: SunnyLee219 <3294305115@qq.com>
Co-Authored-By: maoxx241 <maoxx241@umn.edu>


- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: hust17yixuan <303660421@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: hust17yixuan <303660421@qq.com>
2025-09-16 01:17:42 +08:00
Yikun Jiang
b5ccef6115 [Doc] Add doc for Qwen3 Next (#2916)
### What this PR does / why we need it?
Add doc for Qwen3 Next

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884


- vLLM version: v0.10.2
- vLLM main:
01413e0cf5

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-16 01:16:06 +08:00
liziyu
aa3c4563ce fix all cards super_pod_id same on A3 & proxy support min_tokens (#2939)
### What this PR does / why we need it?
Fix all cards having the same super_pod_id on A3, and add proxy support for min_tokens.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
2*A3 generated ranktable.
Before:
```
"prefill_device_list": [
        {
            "server_id": "xxx",
            "device_id": "0",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "106758159",
            "cluster_id": "1"
        },
        {
            "server_id": "xxx",
            "device_id": "1",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "106758159",
            "cluster_id": "2"
        }...
```
After:
```
"prefill_device_list": [
        {
            "server_id": "xxx",
            "device_id": "0",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "104857600",
            "cluster_id": "1"
        },
        {
            "server_id": "xxx",
            "device_id": "1",
            "device_ip": "xxx",
            "super_pod_id": "0",
            "super_device_id": "104923137",
            "cluster_id": "2"
        }...
```

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-16 01:09:18 +08:00
wangxiyuan
382c29f3e1 [BugFix] Fix world size bug in model_runner (#2915)
- Fix a world size bug in model_runner to make sure ep>16 runs with MC2
- Enable the e2e test for VL

Co-Authored-By: whx-sjtu <2952154980@qq.com>
Co-Authored-By: Icey <1790571317@qq.com>
- vLLM version: v0.10.2
- vLLM main:
3e903b6cb4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-14 12:20:25 +08:00
fan2956
c5a502fd2e main add ascend scheduler support multimodal (#2844)
### What this PR does / why we need it?
On main, AscendScheduler does not support multimodal models because it lacks
scheduled_encoder_inputs, which is needed for multimodal inference.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155

- vLLM version: v0.10.2rc2
- vLLM main:
15b8fef453

---------

Signed-off-by: fan2956 <zhoufan53@huawei.com>
Co-authored-by: zhoufan2956 <zhoufan2956@163.com>
2025-09-14 09:38:51 +08:00
Yikun Jiang
0747a6e68c Bump vLLM version to v0.10.2 (#2914)
### What this PR does / why we need it?
Bump vLLM version to v0.10.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc3
- vLLM main:
15b8fef453

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-14 06:57:59 +08:00
Yikun Jiang
f97a64ba7f Bump vLLM version to v0.10.2rc3 (#2911)
### What this PR does / why we need it?
Bump vLLM version to v0.10.2rc3
https://github.com/vllm-project/vllm/compare/v0.10.2rc2...v0.10.2rc3

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.10.2rc2
- vLLM main:
15b8fef453

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-13 19:15:48 +08:00
Yikun Jiang
8ece6956e7 Revert "Upgrade CANN version to 8.3.rc1.alpha001 (#2903)" (#2909)
### What this PR does / why we need it?
This reverts commit 339fceb89c.

### Does this PR introduce _any_ user-facing change?
Yes, use 8.2rc1 image by default

### How was this patch tested?
CI passed

- vLLM version: v0.10.2rc2
- vLLM main:
cfa3234a5b

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-13 16:21:54 +08:00
zxr2333
0a27705917 fix mooncake connector adxl hostname usage (#2824)
### What this PR does / why we need it?
This PR adapts the hostname format for Mooncake when using
adxl. When Mooncake uses adxl, it is necessary to set
```USE_ASCEND_DIRECT``` to True in the file
```/Mooncake/mooncake-common/common.cmake``` during compilation. The
mooncake_connector obtains this config by calling
```vllm_config.kv_transfer_config.get_from_extra_config```, determines
whether Mooncake is using adxl, and selects the corresponding hostname
format.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.


- vLLM version: main
- vLLM main:
d21a36f5f9

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-09-13 14:38:48 +08:00
Yikun Jiang
d2250c80b5 Enable push trigger for image job (#2906)
### What this PR does / why we need it?
Enable push trigger for image job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Followup on https://github.com/vllm-project/vllm-ascend/pull/2864
- vLLM version: v0.10.2rc2
- vLLM main:
89e08d6d18

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-13 12:31:36 +08:00