Commit Graph

828 Commits

Author SHA1 Message Date
Yikun Jiang
a58b43b72c Remove git .extraheader and fetch all commits in /vllm-workspace/vllm-ascend (#2746)
### What this PR does / why we need it?
Remove git .extraheader and fetch all commits in
/vllm-workspace/vllm-ascend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/2735
- vLLM version: v0.10.1.1
- vLLM main:
51d5e9be7d

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-05 09:45:11 +08:00
henryxuxu0716
51a2aec115 Delete redundant codes related to communication (#2717)
### What this PR does / why we need it?
Delete redundant codes related to communication

### Does this PR introduce _any_ user-facing change?
Not applicable.

### How was this patch tested?
Not applicable.

- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
2025-09-05 09:39:39 +08:00
1092626063
5b3646ab21 [FEATURE][MTP] Support MTP > 1 (#2708)
### What this PR does / why we need it?
[RFC:Support MTP > 1 for
DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745)

- [x] dp1 tp16
- [x] dp4 tp4
- [x] dp2 tp8
- [x] torchair graph

- vLLM version: v0.10.1.1
- vLLM main:
c9f7081f9c

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-09-05 09:11:22 +08:00
yiz-liu
83eb40a51c [Fix][MoE] Refine MoE communication strategy (#2734)
### What this PR does / why we need it?
Refactors the Mixture-of-Experts (MoE) communication method selection
logic. The choice between all-gather, all-to-all, and mc2 is now
determined by expert parallel configuration, SoC version (A2/A3), and
token count for better performance.
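
The kind of selection described above could be sketched roughly as follows. This is an illustration only, not the PR's code: the function name `select_moe_comm_method`, the `Soc` enum, and the token threshold are hypothetical placeholders, and the actual rules and cutoffs used in the PR are not stated in this message.

```python
from enum import Enum


class Soc(Enum):
    """SoC generations named in the PR text."""
    A2 = "A2"
    A3 = "A3"


def select_moe_comm_method(ep_size: int, soc: Soc, num_tokens: int,
                           token_threshold: int = 512) -> str:
    """Pick a MoE communication method from EP size, SoC and token count.

    All rules and the default threshold here are assumptions for illustration.
    """
    # Assumed: without expert parallelism, a plain all-gather is enough.
    if ep_size <= 1:
        return "allgather"
    if soc is Soc.A3:
        # Assumed: small batches use mc2, large batches use all-to-all.
        return "mc2" if num_tokens <= token_threshold else "alltoall"
    # Assumed: on A2, small batches use all-gather, large ones all-to-all.
    return "allgather" if num_tokens <= token_threshold else "alltoall"
```

Keeping the decision in one pure function makes it easy to unit-test each combination of inputs without touching hardware.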

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Added.


- vLLM version: v0.10.1.1
- vLLM main:
eafa8dcde6

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-05 09:04:04 +08:00
liziyu
4c90fa79ca [Misc] Remove useless PD check in deepseek (#2739)
### What this PR does / why we need it?
Remove useless PD check in deepseek


- vLLM version: v0.10.1.1
- vLLM main:
6c7af8110a

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-04 22:22:19 +08:00
vllm-ascend-ci
3a2a7d88db [Doc] Update accuracy reports for v0.10.1rc1 (#2755)
The accuracy results running on NPU Atlas A2 have changed, updating
reports for: All models (Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct,
Qwen3-8B-Base, DeepSeek-V2-Lite)

  - [Workflow run][1]
  
[1]:
https://github.com/vllm-project/vllm-ascend/actions/runs/17459225764
- vLLM version: v0.10.1.1
- vLLM main:
2b30afa442

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
2025-09-04 22:17:17 +08:00
sherie
f86596a66c allgather use fusedop. (#2689)
### What this PR does / why we need it?
Use `npu_moe_init_routing_v2` and `npu_moe_token_unpermute` to replace
`npu_moe_init_routing`, `npu_moe_compute_expert_tokens`, and
`npu_moe_finalize_routing` to optimize performance
### Does this PR introduce _any_ user-facing change?
| branch | tps | TTFT | TPOT |
| --- | --- | --- | --- |
| main | 733.98 | 280.05 | 34.30 |
| main+fusedop | 740.33 | 273.34 | 33.99 |
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-09-04 11:56:29 +08:00
无脸男
7d47d8f4f6 [Fix] fix resources limit error when apply speculative decoding and aclgraph (#2472)
### What this PR does / why we need it?
When both speculative decoding and aclgraph are applied, and
cudagraph_capture_sizes uses the default value, it will report that the
stream resources are insufficient.
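
One way to stay within a stream budget is to cap how many graph sizes get captured. The sketch below is a hypothetical illustration of that idea, not the fix made in this PR: the helper name `limit_capture_sizes` and the `max_graphs` budget are assumptions.

```python
def limit_capture_sizes(capture_sizes: list[int], max_graphs: int) -> list[int]:
    """Keep at most max_graphs sizes, evenly sampled, always keeping the largest.

    Hypothetical helper: max_graphs stands in for whatever budget the
    available stream resources allow.
    """
    sizes = sorted(set(capture_sizes))
    if max_graphs <= 0:
        return []
    if len(sizes) <= max_graphs:
        return sizes
    if max_graphs == 1:
        return [sizes[-1]]
    # Sample indices evenly from the smallest size to the largest size.
    step = (len(sizes) - 1) / (max_graphs - 1)
    return [sizes[round(i * step)] for i in range(max_graphs)]
```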

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
9c99e4871f

Signed-off-by: withHades <244036962@qq.com>
2025-09-04 11:50:43 +08:00
无脸男
0c0789be74 [Feat] allow using aclgraph in ray backend (#2589)
### What this PR does / why we need it?

Allow using aclgraph in the ray backend, for tp + pp + aclgraph in
multi-machine deployments

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
4ba0c587ba

Signed-off-by: withHades <244036962@qq.com>
2025-09-04 11:45:56 +08:00
Ruri
aff5189c87 [main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers (#2275)
### What this PR does / why we need it?

Fuse `GroupedMatmul`, `Swiglu` and `DynamicQuant` into one fusion
operation `GroupedMatmulSwigluQuant`.

1. Extract common functions in `w4a8_dynamic.py` and `w8a8_dynamic.py`.
2. In supported scenarios, use the fusion operation
`npu_grouped_matmul_swiglu_quant`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Tested on W8A8 quantized Qwen3-235B-A22B model with `bs=16`

1. `tp=8`, `dp=1`, `moe_tp=8`, `moe_ep=1`, TPOP increased 21.54%, Output
Token Throughput increased 27.35%
<img width="3443" height="211" alt="image"
src="https://github.com/user-attachments/assets/a1a9c14d-2310-41be-9a03-36125dabae6e"
/>

2. `tp=8`, `dp=1`, `moe_tp=1`, `moe_ep=8`, TPOP increased 17.38%, Output
Token Throughput increased 6.86%
<img width="3443" height="211" alt="image"
src="https://github.com/user-attachments/assets/1ce92e92-720d-40c0-8b4d-c493e5cb10a6"
/>


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-09-04 11:37:32 +08:00
22dimensions
37f5a29cd4 [1/N][Refactor][Quantization] remove redundant quantizer class (#2680)
### What this PR does / why we need it?

The AscendQuantizer/LLMQuantizer classes are used to select the quant method
based on the quant config and some other arguments,
but replacing these classes with a map is simpler and cleaner, so this PR
removes them.
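
Replacing a selector class with a map can be sketched like this. The class names, the map keys, and `get_quant_method` are hypothetical placeholders chosen for illustration; they are not the identifiers used in the PR.

```python
class W8A8LinearMethod:
    """Placeholder standing in for a real quantization method class."""


class W4A8DynamicLinearMethod:
    """Placeholder standing in for a real quantization method class."""


# (quant type, layer type) -> method class; the keys are assumptions.
QUANT_METHOD_MAP = {
    ("W8A8", "linear"): W8A8LinearMethod,
    ("W4A8_DYNAMIC", "linear"): W4A8DynamicLinearMethod,
}


def get_quant_method(quant_type: str, layer_type: str):
    """Look up and instantiate the method for a quant/layer type pair."""
    try:
        method_cls = QUANT_METHOD_MAP[(quant_type, layer_type)]
    except KeyError:
        raise NotImplementedError(
            f"No quant method for {quant_type!r} on {layer_type!r}") from None
    return method_cls()
```

Supporting a new scheme then means adding one entry to the map rather than another branch in a selector class.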

### Does this PR introduce _any_ user-facing change?
No 

### How was this patch tested?

ut and e2e test


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-04 11:35:14 +08:00
Icey
d4370ebc42 [Refactor] Refactor Spec Decode (#2668)
### What this PR does / why we need it?
Refactor spec decode

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-04 11:34:47 +08:00
Mengqing Cao
7e16b4a7cd [ReleaseNote] Add Release Note for v0.10.1rc1 (#2635)
Add Release Note for v0.10.1rc1

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-04 11:26:47 +08:00
Angazenn
e7409e95ee [1/N][Draft][Refactor]torchair pangu_moe modeling refactor (#2437)
### What this PR does / why we need it?

1. Similar to #2384, this PR adds a torchair-specific modeling for
pangu.
2. Fixes a bug introduced by routed_scaling_factor in #2675.
3. Removes the eager test case for pangu, since a torchair test case
already exists.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
2025-09-04 10:39:21 +08:00
whx
a58013440a [BugFix][MLA] Fix attn_mask bug for ring mla (#2704)
This PR fixes a bug related to the attention mask used in ring MLA. Ring
MLA now supports a compressed mask, so we can directly use a 512 *
512 attention mask.

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-04 10:22:46 +08:00
wangxiyuan
e11a1bbfc1 [Doc] Update news (#2736)
Refresh the news. Add meetup and official release info

- vLLM version: v0.10.1.1
- vLLM main:
b5ee1e3261

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-04 10:10:24 +08:00
Mengqing Cao
984bd7c13a [Bugfix][APC] Fix accuracy issue on prefix caching with AscendScheduler (#2714)
### What this PR does / why we need it?
Fix accuracy issue on prefix caching with AscendScheduler

### How was this patch tested?
CI passed with `test_prefix_cache_with_ascend_scheduler`

- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-04 08:22:46 +08:00
baxingpiaochong
df88a2ecc8 [P/D]mooncake_connector adapted to 0.10.1 (#2664)
### What this PR does / why we need it?
In vllm version 0.10.1, a new KVOutputAggregator was added to the
executor, moving aggregation to the
executor(https://github.com/vllm-project/vllm/pull/19555). This caused
mooncake_connector to break. This change aims to fix this bug and also
adds a policy to forcibly release the KV cache when the prefill node
times out.
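
A timeout-based forced release could look roughly like the sketch below. It is a hypothetical illustration of the policy, not the connector's actual code: `PendingKVReleases`, its method names, and the timeout value are assumptions.

```python
import time
from typing import Optional


class PendingKVReleases:
    """Track KV blocks held for remote reads and expire them after a timeout."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._deadlines = {}  # request id -> deadline (seconds)

    def hold(self, request_id: str, now: Optional[float] = None) -> None:
        """Start holding the KV cache of a finished prefill request."""
        now = time.monotonic() if now is None else now
        self._deadlines[request_id] = now + self.timeout_s

    def confirm(self, request_id: str) -> None:
        """The decode side finished reading; nothing left to expire."""
        self._deadlines.pop(request_id, None)

    def pop_expired(self, now: Optional[float] = None) -> list:
        """Return request ids whose KV cache should be forcibly released."""
        now = time.monotonic() if now is None else now
        expired = [rid for rid, d in self._deadlines.items() if d <= now]
        for rid in expired:
            del self._deadlines[rid]
        return expired
```

Passing `now` explicitly keeps the policy deterministic in tests; in production the monotonic clock default would be used.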

This PR is currently linked to a PR in vllm
(https://github.com/vllm-project/vllm/pull/23917). The vllm PR aims to
modify the finish and send count confirmation in heterogeneous TP
situations.

Many UTs are deleted because a lot of communication code has been
removed, so the UTs as a whole are more concise.

- vLLM version: v0.10.1.1
- vLLM main:
fa4311d85f

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2025-09-04 08:22:10 +08:00
zhiyuanzhang
07d44ade19 bugfix: fix initialization error for mooncake in k8s (#2541)
### What this PR does / why we need it?
The details are described in this issue:
https://github.com/vllm-project/vllm-ascend/issues/2557

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Easy to test because we only need to echo the variable


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

---------

Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: LCAIZJ <leichao139636@163.com>
2025-09-03 22:25:08 +08:00
wangxiyuan
41b028aa5f [Doc] add v0.9.1 release note (#2646)
Add release note for 0.9.1

- vLLM version: v0.10.1.1
- vLLM main:
8bd5844989

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-03 18:04:27 +08:00
linfeng-yuan
90a75a90a9 [bugfix] fix torchair runtime error caused by configuration mismatches and missing files (#2532)
### What this PR does / why we need it?
This PR ports #2312 #2506 #2531 to main branch.

The original implementation of torchair caching forces users to prepare
everything, fix all the configuration, and enable
`use_cached_npu_graph`, which can cause problems that are confusing for
users to understand and tackle. It is better to compile the graph twice
instead of reusing the old kvcaches and cached torchair graph, and the
extra time is acceptable. Additionally, this PR fixes a
recompilation problem of torchair graph mode caused by the
`running_in_graph` variable in `AscendMLATorchairImpl`.

### Does this PR introduce _any_ user-facing change?
If users want to enable torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config`. Without
`use_cached_kv_cache_bytes`, we'll compile the torchair computation graph
twice to avoid runtime errors caused by configuration mismatches (the
second compilation will be much faster). Additionally, we've changed
how the TORCHAIR_CACHE_HOME environment variable is used, adding a suffix
directory to enhance safety and prevent accidental file deletion.
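
As a sketch, the recommended settings might be written as the following configuration fragment. The keys `torchair_graph_config`, `use_cached_graph`, and `use_cached_kv_cache_bytes` come from the text above; the `enabled` key and the idea of passing this dict as an `additional_config` argument are assumptions for illustration.

```python
# Configuration fragment only; nothing here calls into vLLM.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,                    # assumed switch for torchair graph mode
        "use_cached_graph": True,           # reuse the cached torchair graph
        "use_cached_kv_cache_bytes": True,  # avoid the second compilation
    }
}
```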

### How was this patch tested?
CI and e2e vllm serving pass.


- vLLM version: v0.10.1.1
- vLLM main:
70549c1245

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-03 17:56:12 +08:00
liziyu
5889fa1b1c [bugfix] ascend schedule encountered an incorrect req block length in the check_watermark_for_prefill function (#2508)
### What this PR does / why we need it?
Fix a bug where the Ascend scheduler encountered an incorrect req block
length in the check_watermark_for_prefill function
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
426cc8629f

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-09-03 16:54:39 +08:00
whx
59d23c39eb [DP] External dp server starter (#2685)
This PR re-implements external-dp starter based on vllm's support for
external dp.

- vLLM version: v0.10.1.1
- vLLM main:
f38035c123

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-03 16:30:26 +08:00
wangxiyuan
c03321781a [CI] skip unstable UT (#2716)
See #2687. We noticed that test_platform and test_vocab_parallel_embedding
are unstable, so let's skip them for now.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-03 15:53:50 +08:00
Li Wang
3584306387 [Bugfix] Fix qwen2.5-vl-without-padding (#2623)
### What this PR does / why we need it?
Correct `AscendQwen2_5_VLForConditionalGeneration_Without_Padding`
override methods
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
42dc59dbac

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-03 14:38:55 +08:00
Li Wang
bece793be6 [CI] Disable per-PR triggering for A3 (#2710)
### What this PR does / why we need it?
Disable per-PR triggering for A3 for now, we trigger the dist test in
the label `dist-test` rather than
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
136d853e65

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-03 11:52:34 +08:00
zhanghw0354
eaeb2efb20 [Main][Feat]Set the Profiler parameters through environment variables consistent with vLLM (#2608)
### What this PR does / why we need it?
Currently, when performing profiling in vLLM-Ascend, if you need to
obtain the Python call stack, you have to manually modify the code. The
code location is:
[worker_v1.py#L337](6c973361fc/vllm_ascend/worker/worker_v1.py (L337))
where you set with_stack to true.
Now, in vLLM, you can set whether to obtain the Python call stack
through an environment variable. The relevant PR is:
[#21803](https://github.com/vllm-project/vllm/pull/21803) and the
documentation is:
[profiling](https://docs.vllm.ai/en/latest/contributing/profiling.html?h=vllm_torch_profiler_with_stack#profile-with-pytorch-profiler)
This PR sets the profiler initialization parameters by using the same
environment variable as vLLM, eliminating the need for manual code
modification.
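
Reading such a flag from the environment could be sketched as below. The variable name `VLLM_TORCH_PROFILER_WITH_STACK` follows the linked vLLM documentation; the helper names `env_flag` and `profiler_kwargs` and the accepted truthy strings are assumptions, not the PR's implementation.

```python
import os


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Interpret an environment variable as a boolean flag."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    # Assumed set of truthy spellings.
    return raw.strip().lower() in ("1", "true", "yes", "on")


def profiler_kwargs(environ=None) -> dict:
    """Build profiler init kwargs from the environment (hypothetical helper)."""
    return {
        "with_stack": env_flag("VLLM_TORCH_PROFILER_WITH_STACK",
                               environ=environ),
    }
```

With this shape, exporting `VLLM_TORCH_PROFILER_WITH_STACK=1` before starting the server would enable the Python call stack without editing code.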

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
0235103cbb

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-09-03 10:58:08 +08:00
Shanshan Shen
93754d8061 [Bugfix] Fix long context seq accuracy problem for GLM4.5 (#2601)
### What this PR does / why we need it?

Fix long context seq accuracy problem for `GLM4.5`.

When `max_tokens=1000`, there is a cyclic output problem like:

```bash
00 00 00 00 00 00 00 00 00 00 00 00 00 00
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```python
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=1000, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="/root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5",
              tensor_parallel_size=8,
              enforce_eager=True,
              trust_remote_code=True,
              max_model_len=1024)

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
```

- vLLM version: v0.10.1.1
- vLLM main:
0235103cbb

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-09-03 09:18:44 +08:00
Angazenn
b84465c525 [Perf]Enable npu_moe_gating_top_k_softmax on quantized scenarios (#2633)
### What this PR does / why we need it?
This PR enables `npu_moe_gating_top_k_softmax` when running quantized
MoE (such as W8A8). This op in fact makes no distinction between
quantized and non-quantized scenarios. Introducing this op reduces TPOT by
3~4 ms.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
ce30dca5c4

Signed-off-by: Angazenn <supperccell@163.com>
2025-09-03 09:14:17 +08:00
wangxiyuan
24d4dad7b2 [CI] Enable MTP torchair e2e test (#2705)
enable MTP torchair e2e test

- vLLM version: v0.10.1.1
- vLLM main:
ce30dca5c4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-03 08:57:43 +08:00
Icey
af62af3cc5 [Image] Upgrade openEuler to 24.03 (#2631)
### What this PR does / why we need it?
Upgrade openEuler to 24.03

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-02 20:09:09 +08:00
wangxiyuan
0829b4873f [CI] recover e2e test (#2688)
1. recover the skipped test.
2. remove pangu eager mode test, it's tested by torchair mode already.
3. skip pangu test until the bug is fixed.

- vLLM version: v0.10.1.1
- vLLM main:
56d04089ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 18:49:17 +08:00
wangxiyuan
f023bd52bf [CI] Make test_platform UT stable (#2696)
Make test_platform stable

- vLLM version: v0.10.1.1
- vLLM main:
56d04089ef

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 18:34:04 +08:00
wangxiyuan
c1e607b7b7 [Misc] Clean up useless code in rotary_embedding (#2663)
Clean up useless code which is only used for torchair in rotary_embedding

- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 17:25:33 +08:00
Wang Yixuan
253b01b9a5 [7/N][refactor]fix torchair rope ops (#2683)
### What this PR does / why we need it?
Due to the registration mechanism, torchair ops cannot take effect, so
we have to patch the Ascend ops to adapt to torchair

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
7ea22e42d5


- vLLM version: main
- vLLM main:
7ea22e42d5

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-02 17:21:56 +08:00
yupeng
9f1e054fe3 [Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672)
### What this PR does / why we need it?
Fix the LoRA accuracy issue introduced by the custom AscendC operators
"bgmv_shrink, sgmv_shrink, bgmv_expand, sgmv_expand".

The bug details are: 
- In the kernel function, if you want to call the GlobalTensor.GetSize
method, you have to pass the second parameter, bufferSize, when you
first call GlobalTensor.SetGlobalBuffer.
- Otherwise, the GlobalTensor.GetSize method will return a random value.
- You can refer to [this
doc](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha002/apiref/ascendcopapi/atlasascendc_api_07_00024.html).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

---------

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
2025-09-02 11:46:59 +08:00
xuyexiong
214b32a346 [V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632)
### What this PR does / why we need it?
Fix MTP torchair bug caused by torchair refactor and moe refactor

Depends on PRs:
fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 
torchair multi DP fix:
https://github.com/vllm-project/vllm-ascend/pull/2626

### Does this PR introduce _any_ user-facing change?
When DP is enabled, running the MTP online server requires disabling server
stats logging with `--disable-log-stats`, because the current metrics do not
support multi-DP.
### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
7c8271cd1e

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-02 11:12:41 +08:00
wangxiyuan
fef18b60bc Refactor e2e CI (#2276)
Refactor E2E CI to make it clearer and faster
1. remove some useless e2e tests
2. remove some useless functions
3. Make sure all tests run with VLLMRunner to avoid oom errors
4. Make sure all ops tests end with torch.empty_cache to avoid oom errors
5. run the tests one by one to avoid resource limit errors


- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-02 09:02:22 +08:00
leo-pony
0df059f41a [CI] Fix CI Break: upstream adds routed_scaling_factor in forward_oot interface (#2675)
### What this PR does / why we need it?
Fix CI Break: upstream adds routed_scaling_factor in forward_oot
interface, vllm-ascend needs to adapt

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
E2E and UT

- vLLM version: v0.10.1.1
- vLLM main:
3e330fcb21

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-09-01 19:02:50 +08:00
panchao-hub
ea53f9076e support torchair mode (#2641)
### What this PR does / why we need it?
support torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
5438967fbc

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-09-01 15:49:07 +08:00
LeeWenquan
b72e34013f Add ut for mla (#2637)
### What this PR does / why we need it?
Update UT for MLA case

- vLLM version: v0.10.1.1
- vLLM main:
14b4326b94

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-01 14:07:57 +08:00
Wang Yixuan
ad13964c71 [6/N][refactor]delete torchair in rotary ops (#2581)
### What this PR does / why we need it?
After moving the torchair-related rope ops into torchair_ops, split the
torchair code out of the original rope ops to keep the code clean.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-01 09:10:15 +08:00
Wang Yixuan
c2c97f3079 [5/N][refactor]add torchair rotary ops (#2559)
### What this PR does / why we need it?
Move torchair related rotary ops into torchair dir to make the code
clear. Next step we'll remove all torchair related code outside of
torchair rotary ops.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19


- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-09-01 09:09:21 +08:00
weichen
3a5fc5ee01 [Refactor][MoE] remove redundant code after refactoring fused_moe (#2612)
### What this PR does / why we need it?
There is a lot of redundant MoE-related code here, and the
structure is not very clear.
We did the following things:

- Placed the relatively independent code related to apply_mlp into
a separate file.
- Removed the environment variables of alltoall_buffer and alltoall_seq.
- Removed the code related to alltoall_buffer and alltoall_seq, and retained
the sole TokenDispatcher inheritance class.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut

- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3

---------

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-08-30 22:28:50 +08:00
panchao-hub
20ae71291d [torchair]remove aicpu op (#2640)
### What this PR does / why we need it?
remove aicpu op for torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.1.1
vLLM main:
05d839c19e
- vLLM version: v0.10.1.1
- vLLM main:
67c14906aa

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-08-30 15:51:12 +08:00
panchao-hub
7215454de6 bugfix for torchair graph (#2639)
### What this PR does / why we need it?
bugfix for torchair graph
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
67c14906aa

Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
2025-08-30 15:49:48 +08:00
weijinqian0
6f1047d5fd [CI] fix UT error. (#2644)
69f46359dd changed the vl input usage; this PR fixes the related UT failure.

- vLLM version: v0.10.1.1
- vLLM main:
d660c98c1b

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-08-30 12:04:01 +08:00
yiz-liu
d3c93fba5c [3/N][Feat][Graph] Support all-to-all and quantized models with ACL Graph (#2614)
### What this PR does / why we need it?
* **Unify execution paths:** Consolidates the quantized and
non-quantized execution paths into a single `fused_experts` function,
removing duplicated logic and making the control flow clearer and easier
to maintain.
* **W8A8 dynamic quantization:** Adds support for W8A8 dynamic
quantization inside the unified MoE kernel. Communication routines are
updated to correctly handle dynamic quantization scales for activations.
* **Weight pre-processing:** Pre-transposes the `w13` and `w2` weight
matrices (as implemented in PR #2025) so that quantized and
non-quantized models follow the same code path for the MoE gating,
up-projection, and down-projection operations.
* **All-to-all communication:** Adds an `all-to-all` collective
communication pattern. For large token counts on modern hardware,
`all-to-all` is more efficient than the previous `all-gather` strategy.
However, `all-to-all` is not actually captured and replayed, because its
multiple D2H operations trigger synchronization and thus
raise errors during graph capture. We only use `all-to-all` when falling back
to `compiled_graph_for_general_shape`.
* **Dynamic communication selection:** The model runner now selects the
optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at
runtime based on token count and the Ascend SoC version.
* **Limitation:** `all-gather` is not yet supported for quantized
models, which means there is still something left to do on A2.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No further test cases needed.

- vLLM version: v0.10.1.1
- vLLM main:
d660c98c1b

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-30 11:00:35 +08:00
Mengqing Cao
91c35d765a [Bugfix] Fix mc2 operator error in aclgraph + ep<16 scenario (#2609)
### What this PR does / why we need it?
1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover
CI, will be refactored in the future
2. disable aclgraph when testing w8a8

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
95089607fa

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-29 21:59:16 +08:00
wangxiaoteng666
ee6d141dd4 [MAIN][BUGFIX] BugFix: Resolve the issue of waiting queue accumulation when requests are canceled. (#2426)
### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are
canceled.
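
The general idea of keeping canceled requests from piling up in a waiting queue can be sketched as follows. This is a generic illustration with hypothetical names (`drop_canceled`, plain request ids), not the scheduler code changed in this PR.

```python
from collections import deque


def drop_canceled(waiting: deque, canceled_ids: set) -> deque:
    """Return a new waiting queue without canceled requests, order preserved."""
    return deque(req_id for req_id in waiting if req_id not in canceled_ids)
```

For example, `drop_canceled(deque(["r1", "r2", "r3"]), {"r2"})` yields a queue holding `r1` and `r3`.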

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By CI


- vLLM version: v0.10.1.1
- vLLM main:
006477e60b

---------

Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
2025-08-29 17:19:23 +08:00