Commit Graph

211 Commits

Author SHA1 Message Date
Shanshan Shen
83e0f41408 [3/N][Refactor] Move torchair_attention to torchair dir (#2017)
### What this PR does / why we need it?

1. Move `torchair_attention` to `torchair` dir.
2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend`
to reduce duplicate methods.
3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce
duplicate properties.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
0933f9d518

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-19 10:25:22 +08:00
xleoken
2a763b8326 [Bug] Fix bug in test_chunked.py (#1992)
### What this PR does / why we need it?

1. Remove the return statement, it will always skip following logic.

2. Update `deepseek` to `Qwen2.5-Instruct` for OOM in github e2e test
env.

3. Fix the comparison logic

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Local Test.


- vLLM version: v0.10.0
- vLLM main:
0933f9d518

Signed-off-by: xleoken <xleoken@163.com>
2025-08-19 10:23:47 +08:00
linfeng-yuan
3fc31ee1cb [1/N][refactor] torchair deepseek modeling refactor (#2384)
### What this PR does / why we need it?

Move torchair related model arch into torchair moduel to make the code
clear. Next step we'll remove all torchair related code outside of
torchair moduel.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.10.0
- vLLM main:
08d5f7113a

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-08-18 15:00:37 +08:00
Chao Lei
03ca2b26ca [P/D] Mooncake Connector for v1 distributed (#1568)
### What this PR does / why we need it?
This PR adopt Mooncake TransferEngine for kv cache register and
pull_blocks style disaggregate prefill implementation.

### Does this PR introduce any user-facing change?
No

### Dependencies
1. Cann Dependencies
Using Mooncake TransferEngine with Ascend Transport requires CANN
version 8.2.RC1 or higher.(see detail
Mooncake[#502](https://github.com/kvcache-ai/Mooncake/pull/502))

2. vllm-ascend
This PR depends on changes introduced by #950 (modifications to
`model_runner_v1`) and #1361 (updates to `schedule`), both of which have
been merged into the `v0.9.1-dev` branch and are expected to land in
`main` shortly.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
1c859a1387

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
2025-08-18 14:30:07 +08:00
Mengqing Cao
61866b8ac6 [Quickfix] update CachedRequestState as NewRequestData changed (#2367)
### What this PR does / why we need it?
1. update `CachedRequestState` as `NewRequestData` changed in
https://github.com/vllm-project/vllm/pull/22570
2. drop maintenance of vllm v0.10.0 in the branch main

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
92ff41abea

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-15 07:35:27 +08:00
Icey
c721ae6042 [CustomOp] Register RMSNorm instead of overwrite forward_oot (#2284)
### What this PR does / why we need it?
Use function CustomOp.register_oot to achieve the customop registery
```
from vllm.model_executor.custom_op import CustomOp
CustomOp.register_oot(_decorated_op_cls=AscendRMSNorm, name="RMSNorm")
```

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-14 17:18:30 +08:00
shiyuan680
e14f2ef669 refactor select_experts of moe module (#2150)
### What this PR does / why we need it?
this pr refactor select_experts of moe module
i merge implementations of quantitative and non-quantitative method in a
new class
use such as vllm like ExpertsSelector.select_experts
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in qwen3-moe and all ut.

- vLLM version: v0.10.0
- vLLM main:
e18859298d

Signed-off-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>
2025-08-14 11:50:53 +08:00
Shanshan Shen
103654ccd6 [Misc] Remove redundant imported envs, using envs_ascend instead (#2193)
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main:
71683ca6f6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:33:39 +08:00
yiz-liu
992271b027 [1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (#2125)
### What this PR does / why we need it?
This PR refactors the MoE (Mixture of Experts) communication logic by
introducing a strategy pattern. It defines an abstract base class,
`MoECommMethod`, which encapsulates different communication strategies
for MoE layers. By decoupling the MoE implementation from any single
communication method, this change makes it simpler to add, replace, or
optimize communication strategies in the future.

Plan / Roadmap

1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL
Graph handling to cover all scenarios (this PR).
2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize
performance in specific scenarios.
3. Enable W8A8 / Int8 models to use `unified_fused_experts`.

Other notes

* Data-parallel (DP) communication currently does not work with vLLM's
dispatch/combine mechanisms; an alternative approach is required to
resolve this incompatibility.

- vLLM version: v0.10.0
- vLLM main:
f7ad6a1eb3

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-08-12 21:10:20 +08:00
Wang Kunpeng
dc585f148a [main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198)
### What this PR does / why we need it?
1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to
pure DP for shared experts, enabling more efficient execution.
2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by
using ReduceScatter, made possible by pure DP sharding.
3.AllGather Postponed: Delayed to after QKV down projection to reduce
synchronization impact during prefill.

### How was this patch tested?
Adding ut case in `tests/ut/attention/test_mla_v1.py`

#### How to run

use parameter `--additional_config='{"enable_shared_expert_dp": true}'`

##### a.How to run eager mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true}'

##### b.How to run graph mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2025-08-12 14:12:12 +08:00
zhenghaojiang
eb43a475f4 [Feat] chunkprefill mla support torchair graph (#1772)
chunkprefill mla only support eager mode now,we want to optimaze it by
support torchair graph, the idea is simple, when all the request is
running in decode, use torchair graph to deal with it, else when
chunkprefill or prefill only, use the eager mode

- vLLM version: v0.10.0
- vLLM main:
ebf7605b0d

Signed-off-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>
2025-08-11 19:58:59 +08:00
whx
29aaba5f84 [Perf][MTP] Optimize reject sampler in greedy situation. (#2137)
This PR port optimization in PR #2002 to main and makes it cleaner.

- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-08-11 17:37:49 +08:00
Pleaplusone
c0f0b70813 [core] Support capture custom ops into aclgraph (#2113)
### What this PR does / why we need it?
Thanks to the PR https://github.com/vllm-project/vllm-ascend/pull/426
make vllm-ascend support the aclgraph inference to reduce the host
overhead. However, the capability of aclgraph strongly relies on the
functionality provided by `torch.compile`, which is the key feature
supported in torch 2.x . Therefore, capture custom op into aclgraph is
only possible when it can be recognize and captured by `torch.compile`.

In this PR, we register the meta implementation of current custom ops to
enable the fx graph capture. And by doing that, insert those custom ops
into aclgraph become a natural thing to the ascend runtime.

### Does this PR introduce _any_ user-facing change?
No user face change.

### How was this patch tested?
Tested in unittest, we will integrate the `rotary_embedding` op into a
small custom model and use `torch.compile` and aclgraph to capture and
replay it to verify its functionality.

- vLLM version: v0.10.0
- vLLM main:
1b99028069

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-11 15:59:42 +08:00
wangxiyuan
9260910c8d [CI] Fix broken CI (#2302)
1. disable test_eagle_ccorrectness test, we'll reopen it once oom error
fixed.
2. drop transformers version limit for main, since vLLM rely on
>=4.55.0, see:
65552b476b
3. fix kv_connector_output bug, see:
796bae07c5

- vLLM version: v0.10.0
- vLLM main:
d1af8b7be9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-11 11:22:32 +08:00
yangqinghao-cmss
ee6f79c44a Add ut for test_communicator.py (#2293)
### What this PR does / why we need it?

Add ut for test_communicator.py 

- vLLM version: v0.10.0
- vLLM main:
e5ebeeba53

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-09 08:26:04 +08:00
Icey
0bd5ff5299 Fix accuracy test config and add DeepSeek-V2-Lite test (#2261)
### What this PR does / why we need it?
This PR fix accuracy test related to
https://github.com/vllm-project/vllm-ascend/pull/2073, users can now
perform accuracy tests on multiple models simultaneously and generate
different report files by running:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/models/configs/accuracy.txt
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
<img width="1648" height="511" alt="image"
src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd"
/>


- vLLM version: v0.10.0
- vLLM main:
766bc8162c

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-08-08 11:09:16 +08:00
Mengqing Cao
ad1083761f [CI][Quickfix] Fix AscendFusedMoE init error (#2268)
### What this PR does / why we need it?
Fix AscendFusedMoE init error. Use `super().__init__()` instead of
`super(FusedMoE, self).__init__()` to ensure the member variables in
base class could be called by the children class

### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new existing test.


- vLLM version: v0.10.0
- vLLM main:
766bc8162c

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-08 10:20:23 +08:00
huangxialu
dceef080b1 [main] remove torch.cat and replace it by List[0] (#2153)
### What this PR does / why we need it?
torch_npu.npu_grouped_matmul:

https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md

According to the document, when `split_item` is 2 or 3,
`torch_npu.npu_grouped_matmul` will return a list which has one element.
Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is
unnecessary.

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut and e2e covered: `tests/ut/ops/test_fused_ops.py`,
`tests/e2e/singlecard/ops/test_fused_moe.py`

**performance**:
(qwen3 30B, 2k->20k)

base:
Total Token throughput (tok/s):          667.76 

remove cat:
Total Token throughput (tok/s):          680.82 


- vLLM version: v0.10.0
- vLLM main:
fa00c5d75b

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-07 17:20:19 +08:00
Ronald1995
b2598c3271 enable mm allreduce test (#2192)
### What this PR does / why we need it?
This PR is to add e2e test for using npu_mm_all_reduce_base fusion
kernel.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
5d5d419ca6

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-08-07 17:19:23 +08:00
Yikun Jiang
58c8d4fdcd Remove transformer pins for v0.9.1-dev (#2234)
### What this PR does / why we need it?
Remove transformer pins for v0.9.1-dev, because we already release the
v0.9.1rc2 with right transformer version

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest CI passed

- vLLM version: v0.10.0
- vLLM main:
7e6544c797

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-07 14:41:10 +08:00
lbk-sys
c611291661 【main】SP For Qwen3 MoE (#2209)
### What this PR does / why we need it?
Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2,
replacing AllReduce with Reduce-Scatter and AllGather achieves
computational benefits in norm operations while saving one AllGather
communication. This feature is enabled during the P-phase and delivers
notable gains in long-sequence scenarios (e.g., 16k–25k), with
performance improvements reaching 5%–10%.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
``` 
compilation_config={
    "pass_config":{
        "enable_sequence_parallelism": True
    }
},
enable_expert_parallel=True,
```

- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: libaokui <libaokui@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-08-07 09:15:49 +08:00
xuyexiong
26fc36b0e0 [V1] MTP supports torchair (#2145)
### What this PR does / why we need it?
Support MTP  with:

- [x]  V0 Scheduler
- [x]  TorchAir
- [x]  Single DP
- [x]  Multi DP
- [x]  Disaggregate PD

Known issues:
- [ ] Not support V1 Scheduler (chunked prefill), will be supported in a
few weeks
- [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now,
need to comment out the line 171-175 in file
`vllm/vllm/v1/metrics/loggers.py`
```
            if (len(self.engine_indexes) > 1
                and vllm_config.speculative_config is not None):
            raise NotImplementedError("Prometheus metrics with Spec Decoding "
                                      "with >1 EngineCore per AsyncLLM is not "
                                      "supported yet.")
```

To start an online server with torchair enabled, here is an example:
```
python -m vllm.entrypoints.openai.api_server \
 --model="/weights/DeepSeek-R1_w8a8/" \
 --trust-remote-code \
 --max-model-len 40000 \
 --tensor-parallel-size 4 \
 --data_parallel_size 4 \
 --max-num-seqs 16 \
 --no-enable-prefix-caching \
 --enable_expert_parallel \
 --served-model-name deepseekr1 \
 --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
 --quantization ascend \
 --host 0.0.0.0 \
 --port 1234 \
 --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \
 --gpu_memory_utilization 0.9 
``` 

offline example with torchair enabled
```
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=16, temperature=0)
# Create an LLM.
llm = LLM(
    model="/home/data/DeepSeek-R1_w8a8/",
    tensor_parallel_size=16,
    max_num_seqs=16,
    gpu_memory_utilization=0.9,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    max_model_len=2000,
    additional_config = {
       'torchair_graph_config': {
            'enabled': True,
            "graph_batch_sizes": [16],
            'enable_multistream_shared_expert': False,
        },
       "ascend_scheduler_config": {
            "enabled": True
        },
        # 'expert_tensor_parallel_size': 16,
    }
)

# Generate texts from the prompts.
# llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- vLLM version: v0.10.0
- vLLM main:
302962e806

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-08-06 19:37:43 +08:00
huangxialu
875a86cbe9 ut: add example and e2e test for sleepmode in external_launcher (#2152)
### What this PR does / why we need it?
This pr add e2e testcase to make sure sleep mode in external_launcher is
ok.

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
not involved


- vLLM version: v0.10.0
- vLLM main:
74333ae2f6

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-06 11:11:53 +08:00
Wang Kunpeng
8a59367d0c [main][Feature] Support deepseek w4a8 quantization (#2172)
### What this PR does / why we need it?
Supports Deepseek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization, only the MOE layer uses
w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which
includes the AscendW4A8DynamicFusedMoEMethod class.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and
`tests/ut/quantization/test_quantizer.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC`
to test deepseek w4a8_dynamic quantized model

#### 1.How to get weights using Modelslim
##### Installation steps
Use the branch master, the commit id is:
298e175d69b3b855111a1e09bbe2fcd12fdb4e24
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### The required transformers environment
transformers>=4.48.2

##### Generate w4a8 weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md Execute the
[pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapter
Reference command:python3 quant_deepseek_w4a8.py --model_path {Original
weight path} --save_path {Generate weight path} --mindie_format

##### Adapt to vllm-ascend
Since mindie_format generates mindie format, some adaptation
modifications are needed for vllm-ascend to use it:
`quant_model_description_w8a8_dynamic.json` rename to
`quant_model_description.json`, and add `"group_size": 256`
Modification in `config.json`:`"model_type":deepseekv2` is changed to
`"model_type":deepseek_v3`; `quantization_config` is removed;
tips:The group_size and weights match. If the w4a8 weights are not
generated using msmodelslim, you can check the group_size in
quantization_config in config.json.

#### 2.How to run w4a8
##### a.How to run eager mode
export VLLM_USE_V1=1 # v1

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6
--enforce-eager
eg: python -m vllm.entrypoints.openai.api_server
--model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120 --max-num-seqs 128 --enforce-eager

##### b.How to run graph mode
export VLLM_USE_V1=1 # v1
export HCCL_BUFFSIZE=1024

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
eg: python -m vllm.entrypoints.openai.api_server
--model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
c494f96fbc

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-08-06 10:17:44 +08:00
Ruri
e31b31f9c3 [main][Bugfix] Fix unable to load qwen3_moe quantized weights (#2219)
### What this PR does / why we need it?

Fixes unable to load `qwen3_moe` quantized weights issue due to #1994

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Add a `qwen3_moe` W8A8 quantized model in
`tests/e2e/multicard/test_qwen3_moe.py`

- vLLM version: v0.10.0
- vLLM main:
c494f96fbc

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-08-06 09:08:36 +08:00
sherie
126cdfc92b [Test] add rejection sampler ut (#2084)
### What this PR does / why we need it?
add rejection sampler ut.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT passed

- vLLM version: v0.10.0
- vLLM main:
586f286789

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-08-05 19:03:36 +08:00
Slightwind
f3b50c54e8 [main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (#2195)
This PR significantly optimizes performance for quantized Mixture of
Experts (MoE) layers by changing the order of quantization and
communication operations.

In the previous implementation, the `all2all` operation was performed on
unquantized `hidden_states` (in FP16/BF16) *before* quantization,
resulting in substantial communication overhead. By performing
quantization on each EP rank **first** and then sending the much smaller
quantized data, we reduce the communication volume by nearly 50%.

Additionally, this PR includes a minor optimization to cast `int` inputs
to `float` for the `argsort` operation, forcing it to run on a faster
NPU core instead of the AICPU.

These changes lead to a clear and significant performance gain in MoE
quantization scenarios.

- vLLM version: v0.10.0
- vLLM main:
7175817637

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-08-05 18:47:13 +08:00
wangxiyuan
458ab2db12 [BugFix] Fix the bug that qwen3 moe doesn't work with aclgraph (#2183)
What's the PR does:
1. Move AscendSparseMoeBlock to qwen3 model, since it's only used by
qwen3 model.
2. Disable AscendSparseMoeBlock if aclgraph is enabled,
AscendSparseMoeBlock doesn't work with aclgraph currently.

- vLLM version: v0.10.0
- vLLM main:
cdfd6871a5

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 17:42:52 +08:00
leo-pony
807f0895b2 Bump torch version to 2.7.1 (#1562)
### What this PR does / why we need it?
Bump torch version to 2.7.1, and cleanup infer schema patch
https://github.com/vllm-project/vllm-ascend/commit/857f489
(https://github.com/vllm-project/vllm-ascend/pull/837), this patch
depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974

### Does this PR introduce any user-facing change?
No

#### How was this patch tested?
CI passed

torch-npu 2.7.1rc1 install guide:
https://gitee.com/ascend/pytorch/tree/v2.7.1/
install depending:
```
pip3 install pyyaml
pip3 install setuptools
```
install torch-npu:

Closes: https://github.com/vllm-project/vllm-ascend/issues/1866
Closes: https://github.com/vllm-project/vllm-ascend/issues/1390


- vLLM version: v0.10.0
- vLLM main:
9af654cc38

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-05 08:43:24 +08:00
wangxiyuan
36e450eb0f [Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097)
we recently added disaggregated_prefill and ascend_forward_context
feature by
ba3dfbd59e
and
df0ec55162.
This PR fix some nit introduced by them to make the code clear.
1. drop `current_platform` usage. It'll lead unknown circular import
error in some case
2. update `set_ascend_forward_context` function to make the logic clear.
for example, remove V0 support in this function.
3. Remove useless `self.local_rank_across_dp` in worker
4. Remove `soc_info.py` to use `get_ascend_soc_version` instead.
 

- vLLM version: v0.10.0
- vLLM main:
02f82fe438

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 08:39:02 +08:00
Li Wang
ad366bf908 [Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181)
### What this PR does / why we need it?
This pr fix broken CI:
1. Fix the
ee2eb6ecd8
changes, in this commit, they fused the gate and up projections in the
vision MLP, This can improve performance by reducing one matrix
multiplication. so, this pr do the following things:
- Specify that the two linear layers are fused as `mlp.gate_up_proj`
when loading the weights.
    - Use a SiluAndMul activation function.
2. Fix
aefeea0fde,
Update ModelRunnerOutput parameters to adapt to its changes
3. Fix
[vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356),
fix qwen moe
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
fed5849d3f

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-04 21:37:50 +08:00
CaveNightingale
957c7f108d [Bugfix][PD] Make multiple Ps and Ds work on a single machine (#2080)
(cherry picked from commit 816375e0c1071d0696dfab1a1ce35674f9f37aa0)

### What this PR does / why we need it?

Suppose that you want to start a prefiller instance with npus `2,3`
only. So you start the instance with `ASCEND_RT_VISIBLE_DEVICES=2,3`.
The current programming will start two workers, whose ranks are `0` and
`1` respectedly. And they will pick the first and second ip addresses of
npus in the ranktable instead of the thirdth and forth ones. But
actually they are using card `2,3` and therefore they can not link with
remote instances when they attempt to transfer the KVCache.

Hence, at most 1 prefiller instance and at most 1 decoder instance can
work on a single machine since they always pick the first npu ip address
in the ranktable currently.

This pull request is proposed to fix the problem. This fix pick ips of
only those devices that are in `ASCEND_RT_VISIBLE_DEVICES` from the
ranktable.

### Does this PR introduce _any_ user-facing change?

If the user use ranktable generated by `gen_ranktable.sh`, they should
not face any change.

### How was this patch tested?
Qwen-0.6B 1P 1D, dp=2, `ASCEND_RT_VISIBLE_DEVICES=2,3` for prefiller and
`ASCEND_RT_VISIBLE_DEVICES=4,5` for decoder.


- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>
2025-08-04 17:22:18 +08:00
Mengqing Cao
af04ee9e7a [MoE][Dist] Fix Qwen MoE accuracy bug in DP scenario (#1856)
### What this PR does / why we need it?
Fix Qwen MoE accuracy bug in DP scenario.

Now the implentment of `FusedMoE` in vLLM use `All2AllManager` to
manager different all2all algorithm branch. And the default branch use
`Multicast` in `dispatch` phase and `all_reduce` in `combine` phase,
which are not implented in vLLM-Ascend. This leading to invoking into a
default implentment in `base_communicator`, with empty `dispatch` and
`combine` operations, thus causing the accuracy issue on it.

This pr is a temporary workaround, refacting all2all in vLLM-Ascend
could be a better way.


- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-04 10:24:18 +08:00
YuanCheng-coder
ddaded1537 Add ut for envs.py (#2131)
What this PR does / why we need it?
test vllm_ascend/envs.py contains environment variables defination

Does this PR introduce any user-facing change?
N/A

How was this patch tested?
CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
9532a6d563

- vLLM version: v0.10.0
- vLLM main:
b4e081cb15

---------

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-08-02 16:53:44 +08:00
leo-pony
e467fe1b77 Add qwen-vl model and sampling feature UT for 310I series (#2168)
### What this PR does / why we need it?
Add qwen-vl model and sampling feature UT for  310I series

- vLLM version: v0.10.0
- vLLM main:
e0f63e4a35

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-08-02 11:26:12 +08:00
weijinqian0
6e00aed4d5 [main][Feature]Moe alltoallv communication optimization for unquantized RL training sence (#2088)
It comes from 0.9.1dev
[0.9.1][Feature]Moe alltoallv communication optimization for unquantized
RL training sence & alltoallv support dpo (#1547)

- vLLM version: v0.10.0
- vLLM main:
97608dc276

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
Signed-off-by: taoxudonghaha <justsheldon@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: TaoYu Chen <ctynb@qq.com>
Co-authored-by: taoxudonghaha <justsheldon@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-08-02 09:49:10 +08:00
22dimensions
8cf97d8310 [Misc] Add extra checking to torchair_graph_config. (#1939)
### What this PR does / why we need it?

cherry-pick #1675  to main
This PR adds validation checking to torchair_graph_config for better
reliability.

Co-authored-by: whx-sjtu <2952154980@qq.com>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:24:11 +08:00
22dimensions
9e65da990e [Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132)
### What this PR does / why we need it?

cherry-pick #1501 from 0.9.1-dev to main

Currently, Ray is not compatible with ACL Graph, so we need to fall back
to eager mode when using the Ray backend.

co-authored: Yizhou Liu <liu_yizhou@outlook.com>

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:06:09 +08:00
Icey
86bdde1ca8 Enable pytest and yaml style accuracy test (#2073)
### What this PR does / why we need it?

This PR enabled pytest and yaml style accuracy test, users now can
enable accuracy test by running:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \
          --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md

pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1970

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-07-31 21:39:13 +08:00
huangxialu
9c9a7cd90b [main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112)
backport of v0.9.1-dev:
https://github.com/vllm-project/vllm-ascend/pull/1902

origin main npu_moe_gating_top_k_softmax:
https://github.com/vllm-project/vllm-ascend/pull/1355

- vLLM version: v0.10.0
- vLLM main:
055bd3978e

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-31 21:05:56 +08:00
Ronald1995
e8660d7978 ut:add ut for qwen2_5_vl (#2143)
### What this PR does / why we need it?
add ut for qwen2_5_vl

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:46:17 +08:00
Ronald1995
cb0a303080 ut:add e2e test for external launcher (#2091)
### What this PR does / why we need it?
This pr add e2e testcase to make sure initialize LLM by
external_launcher method is ok.

### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:37:42 +08:00
Mengqing Cao
4c8842da65 [BugFix] Fix a bug of running chunked-prefill with torchair. (#1378) (#1844)
This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced
before assignment` when running chunked-prefill with torchair. We should
calculate `decode_hs_or_q_c` whether or not torchair graphics mode is
enabled.

backport of #1378
fix https://github.com/vllm-project/vllm-ascend/issues/1369


- vLLM version: v0.10.0
- vLLM main:
0e36abf993

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-07-31 20:08:45 +08:00
daniel
db310c6ec9 add ut for device allocator/camem and mutistream/layers (#2037)
What this PR does / why we need it?

test device allocator/camem and mutistream/layers contains resource
allocation and stream ops
Does this PR introduce any user-facing change?

N/A
How was this patch tested?

CI passed with new added test.


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 1024daniel <xxltju324@gmail.com>
2025-07-31 19:17:27 +08:00
CaranLic
7c90ba5fe8 [Test] add ut for decorator.py/deepseek_mtp.py (#2127)
### What this PR does / why we need it?
add ut for decorator.py/deepseek_mtp.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new tests
- vLLM version: v0.10.0
- vLLM main:
055bd3978e

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-07-31 15:21:15 +08:00
Joey Gao
6192bc95c0 [Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051)
bugfix cherry-pick from v0.9.1-dev
https://github.com/vllm-project/vllm-ascend/pull/2007
### What this PR does / why we need it?
Minimum reproducing code:
```python
# test.py
from vllm import LLM, SamplingParams
 
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240)
 
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
```
```bash
export USE_OPTIMIZED_MODEL=0
python test.py
```
exception as follow:
```
[rank0]:   File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward
[rank0]:     q = torch_npu.npu_rotary_mul(q, cos, sin)
[rank0]:   File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul)
```

In `AscendQwen2_5_VisionAttention_Without_Padding`,
`torch_npu.npu_rotary_mul(q, cos, sin)`, `cos`/`sin` on cpu, but `q` on
npu, so there will be an error.

`qwen2_5_vl_without_padding.py` need this bugfix, because
`AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in
wen2_5_vl_without_padding.py is from vllm and `inv_freq` will create on
cpu.

40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim))
```
`qwen2_5_vl.py` do not need, because
`AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py rewrite
`AscendQwen2_5_VisionRotaryEmbedding` and `inv_freq` will create on
device.
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim))
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: pjgao <gaopengju3@huawei.com>
Co-authored-by: pjgao <gaopengju3@huawei.com>
2025-07-31 15:18:54 +08:00
Ronald1995
3386e09a40 ut:add ut for qwen2_vl.py (#2096)
### What this PR does / why we need it?
add ut for qwen2_vl.py

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
555e7225bc

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-30 22:31:47 +08:00
Ruri
4fcca137a7 [main][Feature] Support Qwen3 W4A8 quantization (#2060)
### What this PR does / why we need it?

Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model

Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
2025-07-30 14:57:14 +08:00
zhangxinyuehfad
6874d666fa [CI]Add e2e test for 310p (#1879)
### What this PR does / why we need it?
Add e2e test for 310p:
trigger conditions:tag, labels(ready-for-test, e2e-310p-test), schedule
image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10
runner: linux-aarch64-310p-1, linux-aarch64-310p-4
model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base,
Qwen/Qwen2.5-7B-Instruct

- vLLM version: v0.10.0
- vLLM main:
b917da442b

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-30 14:52:16 +08:00
YuanCheng-coder
34dd24adf2 add ut for vocab_parallel_embedding (#2067)
### What this PR does / why we need it?

test vllm_ascend/ops/vocab_parallel_embedding.py contains vocab parallel
embedding forward

CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
2cc571199b


- vLLM version: v0.10.0
- vLLM main:
05cbbe20c5

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-07-30 14:35:45 +08:00