Commit Graph

811 Commits

Author SHA1 Message Date
baxingpiaochong
df88a2ecc8 [P/D]mooncake_connector adapted to 0.10.1 (#2664)
### What this PR does / why we need it?
In vLLM v0.10.1, a new KVOutputAggregator was added to the
executor, moving KV aggregation into the
executor (https://github.com/vllm-project/vllm/pull/19555). This broke
mooncake_connector. This change fixes that breakage and also
adds a policy to forcibly release the KV cache when the prefill node
times out.

This PR is linked to a vLLM PR
(https://github.com/vllm-project/vllm/pull/23917), which modifies the
finish and send-count confirmation in heterogeneous-TP scenarios.

Many UTs were deleted because a lot of the communication code was
removed, so the remaining UTs are more concise overall.
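
As a rough illustration of the forced-release policy, here is a minimal sketch; the names `pending_transfers`, `release_kv_cache`, and `TRANSFER_TIMEOUT_S` are illustrative, not the connector's actual API:

```python
import time

TRANSFER_TIMEOUT_S = 30.0  # assumed threshold; the real value would be configurable

def reap_timed_out_transfers(pending_transfers: dict, release_kv_cache) -> None:
    """Forcibly free KV blocks whose prefill-side transfer never completed."""
    now = time.monotonic()
    for req_id, started_at in list(pending_transfers.items()):
        if now - started_at > TRANSFER_TIMEOUT_S:
            release_kv_cache(req_id)       # free the blocks held for this request
            del pending_transfers[req_id]  # stop tracking the stale transfer
```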

- vLLM version: v0.10.1.1
- vLLM main:
fa4311d85f

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2025-09-04 08:22:10 +08:00
zhiyuanzhang
07d44ade19 bugfix: fix initialization error for mooncake in k8s (#2541)
### What this PR does / why we need it?
The details are described in this issue:
https://github.com/vllm-project/vllm-ascend/issues/2557

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Easy to test, because we just need to echo the variable.


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: LCAIZJ <leichao139636@163.com>
2025-09-03 22:25:08 +08:00
wangxiyuan
41b028aa5f [Doc] add v0.9.1 release note (#2646)
Add release note for 0.9.1

- vLLM version: v0.10.1.1
- vLLM main:
8bd5844989

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-03 18:04:27 +08:00
linfeng-yuan
90a75a90a9 [bugfix] fix torchair runtime error caused by configuration mismatches and file missing (#2532)
### What this PR does / why we need it?
This PR ports #2312 #2506 #2531 to main branch.

The original torchair caching implementation forced users to prepare
everything up front, freeze all the configuration, and enable
`use_cached_npu_graph`, which could cause problems that are confusing
for users to understand and resolve. It is better to compile the graph
twice instead of reusing the old KV caches and cached torchair graph,
and the extra compilation time is acceptable. Additionally, this PR
fixes a recompilation problem of torchair graph mode caused by the
`running_in_graph` variable in `AscendMLATorchairImpl`.

### Does this PR introduce _any_ user-facing change?
If users want to enable torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config`. Without
`use_cached_kv_cache_bytes`, we'll compile the torchair computation graph
twice to avoid runtime errors caused by configuration mismatches (the
second compilation will be much faster). Additionally, we've changed how
the TORCHAIR_CACHE_HOME environment variable is used, adding a suffix
directory to enhance safety and prevent accidental file deletion.
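
For illustration, a minimal sketch of enabling both flags from Python, assuming `additional_config` forwards `torchair_graph_config` with the two flags named above (the exact key layout may differ):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-v3",  # illustrative model name
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "use_cached_graph": True,
            "use_cached_kv_cache_bytes": True,  # skip the second compilation pass
        }
    },
)
```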

### How was this patch tested?
CI and e2e vllm serving pass.


- vLLM version: v0.10.1.1
- vLLM main:
70549c1245

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-03 17:56:12 +08:00
liziyu
5889fa1b1c [bugfix] ascend schedule encountered an incorrect req block length in the check_watermark_for_prefill function (#2508)
### What this PR does / why we need it?
Fix a bug where the Ascend scheduler encountered an incorrect request
block length in the check_watermark_for_prefill function.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
426cc8629f

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-03 16:54:39 +08:00
whx
59d23c39eb [DP] External dp server starter (#2685)
This PR re-implements external-dp starter based on vllm's support for
external dp.

- vLLM version: v0.10.1.1
- vLLM main:
f38035c123

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-03 16:30:26 +08:00
wangxiyuan
c03321781a [CI] skip unstable UT (#2716)
See #2687: we noticed that test_platform and test_vocab_parallel_embedding
are unstable, so let's skip them for now.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-03 15:53:50 +08:00
Li Wang
3584306387 [Bugfix] Fix qwen2.5-vl-without-padding (#2623)
### What this PR does / why we need it?
Correct `AscendQwen2_5_VLForConditionalGeneration_Without_Padding`
override methods
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
42dc59dbac

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-03 14:38:55 +08:00
Li Wang
bece793be6 [CI] Disable per-PR triggering for A3 (#2710)
### What this PR does / why we need it?
Disable per-PR triggering for A3 for now; the dist test is triggered via
the `dist-test` label instead of on every PR.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
136d853e65

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-03 11:52:34 +08:00
zhanghw0354
eaeb2efb20 [Main][Feat]Set the Profiler parameters through environment variables consistent with vLLM (#2608)
### What this PR does / why we need it?
Currently, when performing profiling in vLLM-Ascend, if you need to
obtain the Python call stack, you have to manually modify the code at
[worker_v1.py#L337](6c973361fc/vllm_ascend/worker/worker_v1.py (L337)),
setting with_stack to true.
In vLLM, you can now control whether the Python call stack is captured
through an environment variable. The relevant PR is
[#21803](https://github.com/vllm-project/vllm/pull/21803) and the
documentation is
[profiling](https://docs.vllm.ai/en/latest/contributing/profiling.html?h=vllm_torch_profiler_with_stack#profile-with-pytorch-profiler).
This PR sets the profiler initialization parameters via the same
environment variable as vLLM, eliminating the need for manual code
modification.
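
A quick sketch of the env-driven setup (variable names taken from the linked vLLM docs; treat the exact set as an assumption against your vLLM version):

```python
import os

# Enable the torch profiler and capture the Python call stack -- no edits
# to worker_v1.py needed anymore.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"
os.environ["VLLM_TORCH_PROFILER_WITH_STACK"] = "1"

# ... then construct the LLM / start the server as usual.
```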

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
0235103cbb

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-09-03 10:58:08 +08:00
Shanshan Shen
93754d8061 [Bugfix] Fix long context seq accuracy problem for GLM4.5 (#2601)
### What this PR does / why we need it?

Fix long context seq accuracy problem for `GLM4.5`.

When `max_tokens=1000`, there is cyclic output problem like:

```bash
00 00 00 00 00 00 00 00 00 00 00 00 00 00
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```python
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=1000, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="/root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5",
              tensor_parallel_size=8,
              enforce_eager=True,
              trust_remote_code=True,
              max_model_len=1024)

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
```

- vLLM version: v0.10.1.1
- vLLM main:
0235103cbb

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-09-03 09:18:44 +08:00
Angazenn
b84465c525 [Perf]Enable npu_moe_gating_top_k_softmax on quantized scenarios (#2633)
### What this PR does / why we need it?
This PR enables `npu_moe_gating_top_k_softmax` when running quantized
MoE (such as W8A8). This op in fact makes no distinction between
quantized and non-quantized scenarios; introducing it reduces TPOT by
3–4 ms.
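
A hedged usage sketch of the fused op (the signature and return order are assumed from the torch_npu docs):

```python
import torch
import torch_npu  # provides the fused gating op on Ascend NPUs

router_logits = torch.randn(16, 64).npu()  # [num_tokens, num_experts]
# One fused call replaces softmax + top-k; the same call serves both the
# quantized (e.g. W8A8) and non-quantized MoE paths.
topk_weights, topk_ids, row_idx = torch_npu.npu_moe_gating_top_k_softmax(
    router_logits, None, k=8)
```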

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
ce30dca5c4

Signed-off-by: Angazenn <supperccell@163.com>
2025-09-03 09:14:17 +08:00
wangxiyuan
24d4dad7b2 [CI] Enable MTP torchair e2e test (#2705)
enable MTP torchair e2e test

- vLLM version: v0.10.1.1
- vLLM main:
ce30dca5c4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-03 08:57:43 +08:00
Icey
af62af3cc5 [Image] Upgrade openEuler to 24.03 (#2631)
### What this PR does / why we need it?
Upgrade openEuler to 24.03

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-02 20:09:09 +08:00
wangxiyuan
0829b4873f [CI] recover e2e test (#2688)
1. Recover the skipped test.
2. Remove the pangu eager-mode test; it's already covered by the torchair-mode test.
3. Skip the pangu test until the bug is fixed.

- vLLM version: v0.10.1.1
- vLLM main:
56d04089ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 18:49:17 +08:00
wangxiyuan
f023bd52bf [CI] Make test_platform UT stable (#2696)
Make test_platform stable

- vLLM version: v0.10.1.1
- vLLM main:
56d04089ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 18:34:04 +08:00
wangxiyuan
c1e607b7b7 [Misc] Clean up useless code in rotary_embedding (#2663)
Clean up useless code which is only used for torchair in rotary_embedding

- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 17:25:33 +08:00
Wang Yixuan
253b01b9a5 [7/N][refactor]fix torchair rope ops (#2683)
### What this PR does / why we need it?
Due to the registration mechanism, torchair ops cannot take effect, so we
have to patch the Ascend ops to adapt them to torchair.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
7ea22e42d5


- vLLM version: main
- vLLM main:
7ea22e42d5

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-02 17:21:56 +08:00
yupeng
9f1e054fe3 [Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672)
### What this PR does / why we need it?
Fix the LoRA accuracy issue introduced by the custom AscendC operators
"bgmv_shrink, sgmv_shrink, bgmv_expand, sgmv_expand".

The bug details are:
- In the kernel function, if you want to call the GlobalTensor.GetSize
method, you have to pass bufferSize as the second parameter when you
first call GlobalTensor.SetGlobalBuffer.
- Otherwise, the GlobalTensor.GetSize method will return a random value.
- You can refer to [this
doc](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha002/apiref/ascendcopapi/atlasascendc_api_07_00024.html).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

---------

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
2025-09-02 11:46:59 +08:00
xuyexiong
214b32a346 [V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632)
### What this PR does / why we need it?
Fix MTP torchair bug caused by torchair refactor and moe refactor

Depends on PRs:
fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 
torchair multi DP fix:
https://github.com/vllm-project/vllm-ascend/pull/2626

### Does this PR introduce _any_ user-facing change?
When DP is enabled, running the MTP online server requires disabling
server log stats with `--disable-log-stats`, because the current metrics
do not support multi-DP.
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
7c8271cd1e

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-02 11:12:41 +08:00
wangxiyuan
fef18b60bc Refactor e2e CI (#2276)
Refactor E2E CI to make it clearer and faster:
1. Remove some useless e2e tests.
2. Remove some useless functions.
3. Make sure all tests run with VLLMRunner to avoid OOM errors.
4. Make sure all ops tests end with torch.empty_cache to avoid OOM errors.
5. Run the tests one by one to avoid resource-limit errors.


- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 09:02:22 +08:00
leo-pony
0df059f41a [CI] Fix CI Break: upstream adds routed_scaling_factor in forward_oot interface (#2675)
### What this PR does / why we need it?
Fix CI Break: upstream adds routed_scaling_factor in forward_oot
interface, vllm-ascend needs to adapt

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
E2E and UT

- vLLM version: v0.10.1.1
- vLLM main:
3e330fcb21

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-09-01 19:02:50 +08:00
panchao-hub
ea53f9076e support torchair mode (#2641)
### What this PR does / why we need it?
support torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
5438967fbc

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-09-01 15:49:07 +08:00
LeeWenquan
b72e34013f Add ut for mla (#2637)
### What this PR does / why we need it?
Update UT for MLA case

- vLLM version: v0.10.1.1
- vLLM main:
14b4326b94

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-01 14:07:57 +08:00
Wang Yixuan
ad13964c71 [6/N][refactor]delete torchair in rotary ops (#2581)
### What this PR does / why we need it?
After moving the torchair-related rope ops into torchair_ops, split
torchair from the original rope ops to keep the code clean.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-01 09:10:15 +08:00
Wang Yixuan
c2c97f3079 [5/N][refactor]add torchair rotary ops (#2559)
### What this PR does / why we need it?
Move the torchair-related rotary ops into the torchair dir to make the
code clearer. As a next step, we'll remove all torchair-related code
outside of the torchair rotary ops.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-01 09:09:21 +08:00
weichen
3a5fc5ee01 [Refactor][MoE] remove redundant code after refactoring fused_moe (#2612)
### What this PR does / why we need it?
There is a lot of redundant MoE-related code here, and the structure is
not very clear. We did the following:

- Placed the relatively independent apply_mlp code into a separate file.
- Removed the alltoall_buffer and alltoall_seq environment variables and
their related code, retaining the sole TokenDispatcher subclass.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut

- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-08-30 22:28:50 +08:00
panchao-hub
20ae71291d [torchair]remove aicpu op (#2640)
### What this PR does / why we need it?
remove aicpu op for torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.1.1
vLLM main:
05d839c19e
- vLLM version: v0.10.1.1
- vLLM main:
67c14906aa

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-08-30 15:51:12 +08:00
panchao-hub
7215454de6 bugfix for torchair graph (#2639)
### What this PR does / why we need it?
bugfix for torchair graph
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
67c14906aa

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-08-30 15:49:48 +08:00
weijinqian0
6f1047d5fd [CI] fix UT error. (#2644)
69f46359dd changed the VL input usage; this PR fixes the related UT failures.

- vLLM version: v0.10.1.1
- vLLM main:
d660c98c1b

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-08-30 12:04:01 +08:00
yiz-liu
d3c93fba5c [3/N][Feat][Graph] Support all-to-all and quantized models with ACL Graph (#2614)
### What this PR does / why we need it?
* **Unify execution paths:** Consolidates the quantized and
non-quantized execution paths into a single `fused_experts` function,
removing duplicated logic and making the control flow clearer and easier
to maintain.
* **W8A8 dynamic quantization:** Adds support for W8A8 dynamic
quantization inside the unified MoE kernel. Communication routines are
updated to correctly handle dynamic quantization scales for activations.
* **Weight pre-processing:** Pre-transpose the `w13` and `w2` weight
matrices (as implemented in PR #2025) so that quantized and
non-quantized models follow the same code path for the MoE gating,
up-projection, and down-projection operations.
* **All-to-all communication:** Adds an `all-to-all` collective
communication pattern. For large token counts on modern hardware,
`all-to-all` is more efficient than the previous `all-gather` strategy.
However, `all-to-all` cannot actually be captured and replayed, because
its multiple D2H operations trigger synchronization and thus raise
errors during graph capture; we only use `all-to-all` when falling back
to `compiled_graph_for_general_shape`.
* **Dynamic communication selection:** The model runner now selects the
optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at
runtime based on token count and the Ascend SoC version (see the sketch
after this list).
* **Limitation:** `all-gather` is not yet supported for quantized
models, which means there is still something left to do on A2.
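
An illustrative sketch of that runtime dispatch; the helper name and thresholds below are assumptions, not the merged code:

```python
def select_moe_comm_method(num_tokens: int, is_a3_soc: bool) -> str:
    """Pick an MoE communication pattern from token count and SoC version."""
    if is_a3_soc and num_tokens <= 512:  # small batches on newer SoCs: mc2
        return "mc2"
    if num_tokens <= 1024:               # mid-sized batches: all-gather
        return "allgather"
    return "alltoall"                    # large token counts: all-to-all wins
```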

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No further test cases needed.

- vLLM version: v0.10.1.1
- vLLM main:
d660c98c1b

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-30 11:00:35 +08:00
Mengqing Cao
91c35d765a [Bugfix] Fix mc2 operator error in aclgraph + ep<16 scenario (#2609)
### What this PR does / why we need it?
1. Quick-fix the mc2 operator error in the aclgraph + ep<16 scenario to
recover CI; this will be refactored in the future.
2. Disable aclgraph when testing w8a8.

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
95089607fa

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-29 21:59:16 +08:00
wangxiaoteng666
ee6d141dd4 [MAIN][BUGFIX] BugFix: Resolve the issue of waiting queue accumulation when requests are canceled. (#2426)
### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are
canceled.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci


- vLLM version: v0.10.1.1
- vLLM main:
006477e60b

---------

Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
2025-08-29 17:19:23 +08:00
weichen
52aff9e229 [main] [bugfix] Fix misjudging quantized/unquantized scenarios (#2627)
### What this PR does / why we need it?
In a mixed-precision scenario, quant_config is not None, but MoE needs
to perform unquantized computation; however, quantized computation was
being used. Therefore, we move the with_quant logic into forward to
avoid misjudging mixed-precision scenarios.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main:
98ac0cb32d

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-08-29 16:20:22 +08:00
yiz-liu
aadc75c247 [Fix] Resolve data-parallel (DP) assertion errors in TorchAir (#2626)
### What this PR does / why we need it?
It is confirmed that `num_input_tokens` must be assigned the value of
`maybe_padded_num_tokens` under all circumstances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Waiting for daily test for TorchAir.
- vLLM version: v0.10.1.1
- vLLM main:
006477e60b

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-29 16:06:49 +08:00
lidenghui1110
600b08f754 [Feat]: Add custom lmhead tensor model parallel (#2309)
### What this PR does / why we need it?
This PR introduces lm_head tensor model parallelism to reduce memory
consumption and improve TPOT performance. It supports both eager mode
and graph mode.

On a DeepSeek R1 W8A8 PD-disaggregated decode instance using pure DP
with lmhead_tensor_parallel_size = 8, we observed a 1 ms TPOT
improvement and saved 1.48 GB of NPU memory per rank.

performance data:
<img width="1444" height="438" alt="image"
src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0"
/>

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.

| Name | Effect | Required | Type | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces | No | int | Default is None; once set, the feature is enabled. vocab_size must be divisible by this value. |

example

`--additional_config={"lmhead_tensor_parallel_size": 8}`
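
Equivalently from Python (a minimal sketch; the model path is illustrative):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-r1-w8a8",  # illustrative; vocab_size must be divisible by 8
    additional_config={"lmhead_tensor_parallel_size": 8},
)
```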

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
de533ab2a1

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zhangzihang <zzh_201018@outlook.com>
2025-08-29 11:41:21 +08:00
zhangxinyuehfad
e7ad4a64f4 [CI] Add e2e ci test for A3 (#2573)
### What this PR does / why we need it?
Add e2e ci test for A3

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
11a7fafaa8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-08-29 09:33:42 +08:00
yiz-liu
dfc7eb39ad [Fix] Fix DP-related padding logic (#2582)
### What this PR does / why we need it?
The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage within the input preparation
process. This change enables us to utilize a single all-reduce
operation, maximizing synchronization efficiency as early as possible.

The logic for synchronizing metadata—such as the number of tokens,
prefill status, and DBO status—across data parallel (DP) ranks has now
been unified and simplified.

For performance improvements, the all-reduce operation has been switched
from the `gloo` backend to the `npu` backend, which results in a
reduction of several milliseconds per step (**approximately 10%
performance gain for TPOT!**).

Additionally, the multi-DP server hang issue has been resolved, ensuring
no more hangs occur when `num_requests < dp_size`. Alas, a relief.

Finally, the miscalculated memory usage issue has been addressed by
removing the unnecessary `DummyCommImpl`, allowing the system to use the
real communication method when determining available memory.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Maybe we should add an test case for multi-DP online server?
@MengqingCao


- vLLM version: v0.10.1.1
- vLLM main:
c5d004aaaf

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-28 19:39:58 +08:00
Yikun Jiang
175f6bc445 Support v0.10.1 (#2584)
### What this PR does / why we need it?
This patch also supports v0.10.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- test 0.10.1: https://github.com/vllm-project/vllm-ascend/pull/2583
- vLLM version: v0.10.1.1
- vLLM main:
321938e9ac

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-28 18:47:53 +08:00
Mengqing Cao
6c973361fc [Bugfix] Fix aclgraph not enabled by default (#2590)
### What this PR does / why we need it?
As vLLM sets `cudagraph_mode` to `NONE` before
`check_and_update_config` in the post-init of `VllmConfig`
(5da4f5d857/vllm/config/__init__.py (L3630)),
`cudagraph_mode` is never `None`; thus we must remove this check and
re-add it when the related adaptation in vLLM is done.

Part of https://github.com/vllm-project/vllm-ascend/pull/2577; the e2e
test for apply/replay will be added after the CI refactor is done.

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
f48a9af892

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-28 14:08:31 +08:00
yupeng
cf96366a39 [Bugfix][LoRA][Patch] Fix the LoRA inference bug after upstream vLLM codebase changed (#2560)
### What this PR does / why we need it?
Merging the upstream
https://github.com/vllm-project/vllm/pull/22592 caused a vllm-ascend
LoRA inference bug. The details are as follows:

According to
[torch_npu/npu/_stream_check.py](863b9071cb/torch_npu/npu/_stream_check.py (L74)),
NPU device-type tensors have the attributes is_cuda=True and is_npu=True.
As a result, vLLM's apply_repetition_penalties function runs into the
`if logits.is_cuda and logits.is_contiguous()` branch and calls the
custom op implemented in CUDA, which is not compatible with
NPU.
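
A minimal repro sketch of the attribute behavior (assumes a torch_npu install; the prints mirror the `_stream_check.py` patch linked above):

```python
import torch
import torch_npu  # noqa: F401 -- patches device checks for NPU tensors

logits = torch.randn(4, 32000, device="npu")
print(logits.is_npu)   # True
print(logits.is_cuda)  # also True -- which is why vLLM's
                       # apply_repetition_penalties takes the CUDA-op branch
```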

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main:
fe8d7b6f03

---------

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
2025-08-28 10:40:51 +08:00
yeyifan
1191a64ae5 [Feat]attention add sliding windows size (#2528)
### What this PR does / why we need it?
Add a sliding window size parameter to attention
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
For the `Gemma3` model, set
additional_config={"ascend_scheduler_config": {"enabled":True}}; only
AscendScheduler is supported.
Test command: `python3 -m vllm.entrypoints.openai.api_server --model
gemma3 --additional-config
'{"ascend_scheduler_config":{"enabled":true}}'`


- vLLM version: v0.10.1.1
- vLLM main:
6578e87365

---------

Signed-off-by: nsdie <yeyifan@huawei.com>
2025-08-28 10:37:19 +08:00
LeeWenquan
c8d1df3a3f [Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465)
### What this PR does / why we need it?
In order to support fused kernels, multi-stream execution, communication
optimization, etc., it's better to aggregate all operations in the
Attention layer together. This PR refactors mla_v1 by moving all MLA
preprocessing ops into the mla_v1 attention impl.
Note that the new mla_v1 doesn't take torchair into consideration, so
this PR can only be merged after the torchair-related mla_v1 is isolated
into a new file.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

### Features Test

<img width="506" height="141" alt="image"
src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466"
/>

### Performance After Refactor
<img width="648" height="486" alt="image"
src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb"
/>

### Performance Before Refactor
<img width="618" height="494" alt="image"
src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1"
/>


- vLLM version: v0.10.1.1
- vLLM main:
e03940762b

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-08-28 10:35:57 +08:00
weichen
320edde2df [main] [refactor] refactor fused_moe.py to enable token_dispatchers (#2570)
### What this PR does / why we need it?
Enable token_dispatcher to replace fused_experts_with_xxx in eager mode
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut


- vLLM version: v0.10.1.1
- vLLM main:
704432af3c

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: sherie <963372609@qq.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: shiyuan680 <72335504+shiyuan680@users.noreply.github.com>
2025-08-28 10:13:35 +08:00
Wang Yixuan
936c102105 [bugfix][refactor]fix torchair_w8a8 (#2569)
### What this PR does / why we need it?
Separate torchair w8a8 and w4a8 from fused_moe, due to the refactoring
of and changes to fused_moe.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-28 09:10:31 +08:00
Wang Yixuan
a955e5d404 [4/N][refactor]delete torchair from quantization (#2535)
### What this PR does / why we need it?
After moving the torchair-related quantization section into
torchair_quantization, split torchair from the original quantization code.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-08-28 09:10:03 +08:00
Icey
c578f817ca [CustomOp] Register VocabParallelEmbedding instead of overwrite forward (#2515)
### What this PR does / why we need it?
Register VocabParallelEmbedding instead of overwriting forward.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
644d57d531

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-28 08:57:34 +08:00
Li Wang
516e14ae6a [Doc] Upgrade the multi-node tutorial model to deepseek-v3.1-w8a8 (#2553)
### What this PR does / why we need it?
Upgrade the multi-node tutorial model to deepseek-v3.1-w8a8.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
de02b07db4

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-27 14:16:44 +08:00
rjg-lyh
2bfbf9b9b3 [main][bugfix] Fix bugs and refactor cached mask generation logic (#2442)
### What this PR does / why we need it?
This PR fixes bugs and refactors the cached-mask generation logic. We now
pre-construct and use the cached mask on the CPU instead of on the NPU device.
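
An illustrative sketch of the cached-mask idea (the sizes and names here are assumptions, not the patch itself):

```python
import torch

_MASK_LEN = 2048  # assumed pre-built size
# Build the causal mask once on CPU; slices are views, so no per-step
# allocation happens on the NPU device.
_CACHED_MASK = torch.triu(
    torch.full((_MASK_LEN, _MASK_LEN), float("-inf")), diagonal=1)

def get_mask(seq_len: int) -> torch.Tensor:
    return _CACHED_MASK[:seq_len, :seq_len]
```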

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
9b5f64238f

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-08-27 12:07:29 +08:00
huangxialu
6881c19458 [main] convert the format of gmm to nz (#2474)
### What this PR does / why we need it?
convert the format of gmm to nz
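
A hedged sketch of the cast (29 is assumed here to be ACL_FORMAT_FRACTAL_NZ in torch_npu's format enum):

```python
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumed enum value for the NZ layout

w = torch.randn(128, 256).npu()
# Casting the grouped-matmul weight to the NZ (fractal) layout lets the
# GMM kernel skip an on-the-fly layout conversion.
w_nz = torch_npu.npu_format_cast(w, ACL_FORMAT_FRACTAL_NZ)
```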

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut: test_fused_ops.py and e2e: test_fused_moe.py

**Performance** (Qwen3 30B, 2k → 20k):

| Variant | Total token throughput (tok/s) |
| :------ | -----------------------------: |
| base    | 719.93 |
| gmm nz  | 728.52 |


- vLLM version: v0.10.1.1
- vLLM main:
bfc1edc9f5

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-27 11:25:02 +08:00