Commit Graph

73 Commits

Author SHA1 Message Date
huangxialu
dceef080b1 [main] remove torch.cat and replace it by List[0] (#2153)
### What this PR does / why we need it?
torch_npu.npu_grouped_matmul:

https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md

According to the document, when `split_item` is 2 or 3,
`torch_npu.npu_grouped_matmul` will return a list which has one element.
Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is
unnecessary.

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut and e2e covered: `tests/ut/ops/test_fused_ops.py`,
`tests/e2e/singlecard/ops/test_fused_moe.py`

**performance**:
(qwen3 30B, 2k->20k)

base:
Total Token throughput (tok/s):          667.76 

remove cat:
Total Token throughput (tok/s):          680.82 


- vLLM version: v0.10.0
- vLLM main:
fa00c5d75b

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-07 17:20:19 +08:00
lbk-sys
c611291661 【main】SP For Qwen3 MoE (#2209)
### What this PR does / why we need it?
Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2,
replacing AllReduce with Reduce-Scatter and AllGather achieves
computational benefits in norm operations while saving one AllGather
communication. This feature is enabled during the P-phase and delivers
notable gains in long-sequence scenarios (e.g., 16k–25k), with
performance improvements reaching 5%–10%.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
``` 
compilation_config={
    "pass_config":{
        "enable_sequence_parallelism": True
    }
},
enable_expert_parallel=True,
```

- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: libaokui <libaokui@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-08-07 09:15:49 +08:00
xuyexiong
26fc36b0e0 [V1] MTP supports torchair (#2145)
### What this PR does / why we need it?
Support MTP  with:

- [x]  V0 Scheduler
- [x]  TorchAir
- [x]  Single DP
- [x]  Multi DP
- [x]  Disaggregate PD

Known issues:
- [ ] Not support V1 Scheduler (chunked prefill), will be supported in a
few weeks
- [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now,
need to comment out the line 171-175 in file
`vllm/vllm/v1/metrics/loggers.py`
```
            if (len(self.engine_indexes) > 1
                and vllm_config.speculative_config is not None):
            raise NotImplementedError("Prometheus metrics with Spec Decoding "
                                      "with >1 EngineCore per AsyncLLM is not "
                                      "supported yet.")
```

To start an online server with torchair enabled, here is an example:
```
python -m vllm.entrypoints.openai.api_server \
 --model="/weights/DeepSeek-R1_w8a8/" \
 --trust-remote-code \
 --max-model-len 40000 \
 --tensor-parallel-size 4 \
 --data_parallel_size 4 \
 --max-num-seqs 16 \
 --no-enable-prefix-caching \
 --enable_expert_parallel \
 --served-model-name deepseekr1 \
 --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
 --quantization ascend \
 --host 0.0.0.0 \
 --port 1234 \
 --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \
 --gpu_memory_utilization 0.9 
``` 

offline example with torchair enabled
```
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=16, temperature=0)
# Create an LLM.
llm = LLM(
    model="/home/data/DeepSeek-R1_w8a8/",
    tensor_parallel_size=16,
    max_num_seqs=16,
    gpu_memory_utilization=0.9,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    max_model_len=2000,
    additional_config = {
       'torchair_graph_config': {
            'enabled': True,
            "graph_batch_sizes": [16],
            'enable_multistream_shared_expert': False,
        },
       "ascend_scheduler_config": {
            "enabled": True
        },
        # 'expert_tensor_parallel_size': 16,
    }
)

# Generate texts from the prompts.
# llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- vLLM version: v0.10.0
- vLLM main:
302962e806

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-08-06 19:37:43 +08:00
Wang Kunpeng
8a59367d0c [main][Feature] Support deepseek w4a8 quantization (#2172)
### What this PR does / why we need it?
Supports Deepseek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization, only the MOE layer uses
w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which
includes the AscendW4A8DynamicFusedMoEMethod class.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and
`tests/ut/quantization/test_quantizer.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC`
to test deepseek w4a8_dynamic quantized model

#### 1.How to get weights using Modelslim
##### Installation steps
Use the branch master, the commit id is:
298e175d69b3b855111a1e09bbe2fcd12fdb4e24
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### The required transformers environment
transformers>=4.48.2

##### Generate w4a8 weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md Execute the
[pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapter
Reference command:python3 quant_deepseek_w4a8.py --model_path {Original
weight path} --save_path {Generate weight path} --mindie_format

##### Adapt to vllm-ascend
Since mindie_format generates mindie format, some adaptation
modifications are needed for vllm-ascend to use it:
`quant_model_description_w8a8_dynamic.json` rename to
`quant_model_description.json`, and add `"group_size": 256`
Modification in `config.json`:`"model_type":deepseekv2` is changed to
`"model_type":deepseek_v3`; `quantization_config` is removed;
tips:The group_size and weights match. If the w4a8 weights are not
generated using msmodelslim, you can check the group_size in
quantization_config in config.json.

#### 2.How to run w4a8
##### a.How to run eager mode
export VLLM_USE_V1=1 # v1

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6
--enforce-eager
eg: python -m vllm.entrypoints.openai.api_server
--model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120 --max-num-seqs 128 --enforce-eager

##### b.How to run graph mode
export VLLM_USE_V1=1 # v1
export HCCL_BUFFSIZE=1024

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
eg: python -m vllm.entrypoints.openai.api_server
--model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
c494f96fbc

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-08-06 10:17:44 +08:00
sherie
126cdfc92b [Test] add rejection sampler ut (#2084)
### What this PR does / why we need it?
add rejection sampler ut.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT passed

- vLLM version: v0.10.0
- vLLM main:
586f286789

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-08-05 19:03:36 +08:00
Slightwind
f3b50c54e8 [main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (#2195)
This PR significantly optimizes performance for quantized Mixture of
Experts (MoE) layers by changing the order of quantization and
communication operations.

In the previous implementation, the `all2all` operation was performed on
unquantized `hidden_states` (in FP16/BF16) *before* quantization,
resulting in substantial communication overhead. By performing
quantization on each EP rank **first** and then sending the much smaller
quantized data, we reduce the communication volume by nearly 50%.

Additionally, this PR includes a minor optimization to cast `int` inputs
to `float` for the `argsort` operation, forcing it to run on a faster
NPU core instead of the AICPU.

These changes lead to a clear and significant performance gain in MoE
quantization scenarios.

- vLLM version: v0.10.0
- vLLM main:
7175817637

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-08-05 18:47:13 +08:00
leo-pony
807f0895b2 Bump torch version to 2.7.1 (#1562)
### What this PR does / why we need it?
Bump torch version to 2.7.1, and cleanup infer schema patch
https://github.com/vllm-project/vllm-ascend/commit/857f489
(https://github.com/vllm-project/vllm-ascend/pull/837), this patch
depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974

### Does this PR introduce any user-facing change?
No

#### How was this patch tested?
CI passed

torch-npu 2.7.1rc1 install guide:
https://gitee.com/ascend/pytorch/tree/v2.7.1/
install depending:
```
pip3 install pyyaml
pip3 install setuptools
```
install torch-npu:

Closes: https://github.com/vllm-project/vllm-ascend/issues/1866
Closes: https://github.com/vllm-project/vllm-ascend/issues/1390


- vLLM version: v0.10.0
- vLLM main:
9af654cc38

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-05 08:43:24 +08:00
wangxiyuan
36e450eb0f [Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097)
we recently added disaggregated_prefill and ascend_forward_context
feature by
ba3dfbd59e
and
df0ec55162.
This PR fix some nit introduced by them to make the code clear.
1. drop `current_platform` usage. It'll lead unknown circular import
error in some case
2. update `set_ascend_forward_context` function to make the logic clear.
for example, remove V0 support in this function.
3. Remove useless `self.local_rank_across_dp` in worker
4. Remove `soc_info.py` to use `get_ascend_soc_version` instead.
 

- vLLM version: v0.10.0
- vLLM main:
02f82fe438

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 08:39:02 +08:00
Li Wang
ad366bf908 [Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181)
### What this PR does / why we need it?
This pr fix broken CI:
1. Fix the
ee2eb6ecd8
changes, in this commit, they fused the gate and up projections in the
vision MLP, This can improve performance by reducing one matrix
multiplication. so, this pr do the following things:
- Specify that the two linear layers are fused as `mlp.gate_up_proj`
when loading the weights.
    - Use a SiluAndMul activation function.
2. Fix
aefeea0fde,
Update ModelRunnerOutput parameters to adapt to its changes
3. Fix
[vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356),
fix qwen moe
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
fed5849d3f

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-04 21:37:50 +08:00
CaveNightingale
957c7f108d [Bugfix][PD] Make multiple Ps and Ds work on a single machine (#2080)
(cherry picked from commit 816375e0c1071d0696dfab1a1ce35674f9f37aa0)

### What this PR does / why we need it?

Suppose that you want to start a prefiller instance with npus `2,3`
only. So you start the instance with `ASCEND_RT_VISIBLE_DEVICES=2,3`.
The current programming will start two workers, whose ranks are `0` and
`1` respectedly. And they will pick the first and second ip addresses of
npus in the ranktable instead of the thirdth and forth ones. But
actually they are using card `2,3` and therefore they can not link with
remote instances when they attempt to transfer the KVCache.

Hence, at most 1 prefiller instance and at most 1 decoder instance can
work on a single machine since they always pick the first npu ip address
in the ranktable currently.

This pull request is proposed to fix the problem. This fix pick ips of
only those devices that are in `ASCEND_RT_VISIBLE_DEVICES` from the
ranktable.

### Does this PR introduce _any_ user-facing change?

If the user use ranktable generated by `gen_ranktable.sh`, they should
not face any change.

### How was this patch tested?
Qwen-0.6B 1P 1D, dp=2, `ASCEND_RT_VISIBLE_DEVICES=2,3` for prefiller and
`ASCEND_RT_VISIBLE_DEVICES=4,5` for decoder.


- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>
2025-08-04 17:22:18 +08:00
YuanCheng-coder
ddaded1537 Add ut for envs.py (#2131)
What this PR does / why we need it?
test vllm_ascend/envs.py contains environment variables defination

Does this PR introduce any user-facing change?
N/A

How was this patch tested?
CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
9532a6d563

- vLLM version: v0.10.0
- vLLM main:
b4e081cb15

---------

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-08-02 16:53:44 +08:00
weijinqian0
6e00aed4d5 [main][Feature]Moe alltoallv communication optimization for unquantized RL training sence (#2088)
It comes from 0.9.1dev
[0.9.1][Feature]Moe alltoallv communication optimization for unquantized
RL training sence & alltoallv support dpo (#1547)

- vLLM version: v0.10.0
- vLLM main:
97608dc276

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
Signed-off-by: taoxudonghaha <justsheldon@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: TaoYu Chen <ctynb@qq.com>
Co-authored-by: taoxudonghaha <justsheldon@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-08-02 09:49:10 +08:00
22dimensions
8cf97d8310 [Misc] Add extra checking to torchair_graph_config. (#1939)
### What this PR does / why we need it?

cherry-pick #1675  to main
This PR adds validation checking to torchair_graph_config for better
reliability.

Co-authored-by: whx-sjtu <2952154980@qq.com>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:24:11 +08:00
huangxialu
9c9a7cd90b [main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112)
backport of v0.9.1-dev:
https://github.com/vllm-project/vllm-ascend/pull/1902

origin main npu_moe_gating_top_k_softmax:
https://github.com/vllm-project/vllm-ascend/pull/1355

- vLLM version: v0.10.0
- vLLM main:
055bd3978e

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-31 21:05:56 +08:00
Ronald1995
e8660d7978 ut:add ut for qwen2_5_vl (#2143)
### What this PR does / why we need it?
add ut for qwen2_5_vl

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:46:17 +08:00
daniel
db310c6ec9 add ut for device allocator/camem and mutistream/layers (#2037)
What this PR does / why we need it?

test device allocator/camem and mutistream/layers contains resource
allocation and stream ops
Does this PR introduce any user-facing change?

N/A
How was this patch tested?

CI passed with new added test.


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 1024daniel <xxltju324@gmail.com>
2025-07-31 19:17:27 +08:00
CaranLic
7c90ba5fe8 [Test] add ut for decorator.py/deepseek_mtp.py (#2127)
### What this PR does / why we need it?
add ut for decorator.py/deepseek_mtp.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new tests
- vLLM version: v0.10.0
- vLLM main:
055bd3978e

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-07-31 15:21:15 +08:00
Joey Gao
6192bc95c0 [Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051)
bugfix cherry-pick from v0.9.1-dev
https://github.com/vllm-project/vllm-ascend/pull/2007
### What this PR does / why we need it?
Minimum reproducing code:
```python
# test.py
from vllm import LLM, SamplingParams
 
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240)
 
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
```
```bash
export USE_OPTIMIZED_MODEL=0
python test.py
```
exception as follow:
```
[rank0]:   File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward
[rank0]:     q = torch_npu.npu_rotary_mul(q, cos, sin)
[rank0]:   File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul)
```

In `AscendQwen2_5_VisionAttention_Without_Padding`,
`torch_npu.npu_rotary_mul(q, cos, sin)`, `cos`/`sin` on cpu, but `q` on
npu, so there will be an error.

`qwen2_5_vl_without_padding.py` need this bugfix, because
`AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in
wen2_5_vl_without_padding.py is from vllm and `inv_freq` will create on
cpu.

40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim))
```
`qwen2_5_vl.py` do not need, because
`AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py rewrite
`AscendQwen2_5_VisionRotaryEmbedding` and `inv_freq` will create on
device.
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim))
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: pjgao <gaopengju3@huawei.com>
Co-authored-by: pjgao <gaopengju3@huawei.com>
2025-07-31 15:18:54 +08:00
Ronald1995
3386e09a40 ut:add ut for qwen2_vl.py (#2096)
### What this PR does / why we need it?
add ut for qwen2_vl.py

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
555e7225bc

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-30 22:31:47 +08:00
Ruri
4fcca137a7 [main][Feature] Support Qwen3 W4A8 quantization (#2060)
### What this PR does / why we need it?

Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model

Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
2025-07-30 14:57:14 +08:00
YuanCheng-coder
34dd24adf2 add ut for vocab_parallel_embedding (#2067)
### What this PR does / why we need it?

test vllm_ascend/ops/vocab_parallel_embedding.py contains vocab parallel
embedding forward

CI passed with new added test.

vLLM version: v0.10.0
vLLM main:
2cc571199b


- vLLM version: v0.10.0
- vLLM main:
05cbbe20c5

Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
2025-07-30 14:35:45 +08:00
wangxiyuan
9b67c87b14 [Refactor]Refactor sampler (#2050)
Refactor Sampler implementation from patch way to inherit from vLLM
Sampler interface.

Next step: Make the op `TopKTopPSampler` in vLLM support custom ops
register mechanism

- vLLM version: v0.10.0
- vLLM main:
61a6905ab0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-30 08:47:22 +08:00
whx
98cadc2146 [Perf] Avoid performing index selection of sin/cos cache every layer (#1890)
Optimize number of index selections of sin/cos cache.

- vLLM version: v0.10.0
- vLLM main:
656c24f1b5

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 18:06:45 +08:00
wangxiyuan
0190b68f51 [Misc]Remove PD v0 code (#2047)
Cleanup V0 disaggregated prefill code for V0 Engine.

part of https://github.com/vllm-project/vllm-ascend/issues/1620

TODO: enable v1 e2e test.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 19:09:22 +08:00
huangxialu
1a25b0a2dd [Test] add ut for qwen3_moe.py (#2055)
### What this PR does / why we need it?
Add ut for qwen3_moe.py

### Does this PR introduce _any_ user-facing change?
No.


- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-07-28 17:37:13 +08:00
LeeWenquan
3ad582c9a9 [Test] Add ut for files in /attention (#1944)
### What this PR does / why we need it?
Add ut for files in folder /attention
### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
139a7f07bd

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-28 15:54:40 +08:00
Ronald1995
32a9c5f694 [Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)
### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.

### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut


- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-28 15:13:37 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

A refactoring of forward_context and model_runner_v1, add some context
which is necessary in model inference into forward_context, and refactor
dummy_run logic, make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update mc2_v2 op, and support `active_mask` param;
Update scripts in examples dir;
refactor `dummy_run` logic;
Add soc_version for A2 and A3;

### Does this PR introduce _any_ user-facing change?

No change at user-facing.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
zhangxinyuehfad
d1c640841b [Bugfix] Fix num_hidden_layers when Qwen2-Audio 7B (#1803)
### What this PR does / why we need it?
Fix num_hidden_layers when Qwen2-Audio 7B and #1760 :
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
  File "/workspace/test1.py", line 58, in <module>
    main(audio_count)
  File "/workspace/test1.py", line 38, in main
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
  File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
    config = VllmConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
    current_platform.check_and_update_config(self)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
    update_aclgraph_sizes(vllm_config)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
    num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes: https://github.com/vllm-project/vllm-ascend/issues/1780
https://github.com/vllm-project/vllm-ascend/issues/1760
https://github.com/vllm-project/vllm-ascend/issues/1276
https://github.com/vllm-project/vllm-ascend/issues/359

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-26 20:13:00 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopt `LLMDataDist` for kv cache register and `pull_blocks`
style disaggregate prefill implementation. The interface implementation
mainly follows the design of NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be test with the following step:
- Generate the rank table for all machine.
- execute`toy_proxy.py` to launch the disaggregate prefill proxy server,
specify the prefill ip, port and the decode ip, port
- Run the prefill server and decode server.
- send the request to the disaggregate prefill proxy

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
Ronald1995
e561a2c6ec ut:add ut for qwen2_5_vl_without_padding.py (#1988)
### What this PR does / why we need it?
this pr is to add ut for qwen2_5_vl_without_padding.py

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
this is only a ut test


- vLLM version: v0.9.2
- vLLM main:
9c8b2c2a8a

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-25 14:12:44 +08:00
SunnyLee151064
ae560f7131 [Test] Add uts for files in /core (#1957)
### What this PR does / why we need it?

Add uts for files in folder /core

### Does this PR introduce _any_ user-facing change?

No

- vLLM version: v0.9.2
- vLLM main:
5a19a6c670

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-25 09:48:19 +08:00
SunnyLee151064
ab7d5aca5d [Test] Add ut for files in /multistream (#1947)
### What this PR does / why we need it?
Add some uts for files in folder /multistream

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
b77c7d327f

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-24 10:42:49 +08:00
SunnyLee151064
34571ea5ae [Test] Add ut for files in /distributed (#1951)
### What this PR does / why we need it?
Add some ut for files in folder /distributed

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
107111a859

Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
2025-07-24 10:36:11 +08:00
Zac
2ffe051859 [Test]add ut for deepseek_v2. (#1964)
What this PR does / why we need it?
Add uts for deepseek_v2

Does this PR introduce any user-facing change?
No

How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
f3137cdd81

---------

Signed-off-by: 张帮政 <zhangbangzheng@huawei.com>
2025-07-24 10:27:50 +08:00
wangxiyuan
846555cdb5 [Misc] Clean up uesless code in attention (#1933)
Before do attention module refactor, we can do some code cleanup to make
the next step easier.

What this PR does:

1. remove uesless `common_prefix_len` for attention builder
2. remove uesless `is_only_prefill` and `num_input_tokens` in attention
metadata.
3. remove `CommonAttentionMetadata` and ues `query_start_loc` instead,
`CommonAttentionMetadata` is over designed and uesless
4. update the attention backend input parameters to keep the same as
vLLM.
5. Rename attention name to the same style with `ASCEND` prefix

- vLLM version: v0.9.2
- vLLM main:
107111a859

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-24 10:23:34 +08:00
shiyuan680
ac0bf133f4 add ut of fused_moe.py (#1930)
### What this PR does / why we need it?
add unit test for fused_moe.py

- vLLM version: v0.9.2
- vLLM main:
2dec7c1a5d

Signed-off-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: yangcheng <yangcheng104@huawei.com>
2025-07-23 16:24:09 +08:00
weichen
ac773aca43 Add UT for Patches (#1766)
### What this PR does / why we need it?
Add UT for patches in vLLM Ascend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Irrelevant

- vLLM version: v0.9.2
- vLLM main:
107111a859

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-23 16:07:20 +08:00
Mengqing Cao
3aa3b46bfe [V1][PP] Support pp with ray backend in V1 (#1800)
### What this PR does / why we need it?
Support pipeline parallel with ray backend in V1Engine.

Fixes #1751

### Does this PR introduce _any_ user-facing change?
Users could specify ray as distributed backend when inferencing with pp

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.9.2
- vLLM main:
32142b3c62

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-23 14:52:52 +08:00
JohnJan
ce4970eee0 [Test] Add unit test for schedule_config.py (#1590)
What this PR does / why we need it?
According to issue
https://github.com/vllm-project/vllm-ascend/issues/1298 , this pull
request adds unit test code for schedule_config.py.

Does this PR introduce any user-facing change?
No

How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2
2025-07-22 11:43:25 +08:00
Yikun Jiang
5f0b42e414 [FOLLOWUP] Use base test to avoid patch everwhere (#1634)
### What this PR does / why we need it?
Use base test to avoid patch everwhere.

Followup here: https://github.com/vllm-project/vllm-ascend/pull/1566

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
ut ci passed

- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-22 09:03:40 +08:00
wangxiyuan
7265dc090d [2/4][Refactor] Refactor torchair utils (#1892)
There is a lot torchair specified logic in common code. It results hard
code maintenance. We will create a new torchair module to launch
torchair related logic there. I plan to add 4 PR.

1. Refactor worker
2. Refactor utils (this PR)
- simple change that move all torchair related util function to torchair
module
3. Refactor model_runner
4. Refactor attention

- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-21 19:43:30 +08:00
wangxiyuan
af56ae3ed1 [1/4][Refactor] Refactor torchair worker (#1885)
There is a lot torchair specified logic in common code. It results hard
code maintenance. We will create a new torchair module to launch
torchair related logic there. I plan to add 4 PR.

1. Refactor worker (this PR)
- create torchair module and move torchair related code in worker to the
new module
3. Refactor utils
4. Refactor model_runner
5. Refactor attention


- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-21 11:50:46 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there is no
relevant scenarios to use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert is needed to be sliced

This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
wangxiyuan
a8b316ac5b [CI] Make AttentionBackend interface compatible to fix broken CI (#1893)
vLLM commit
752c6ade2e
removed `blocksparse_params` for attention backend. This PR does the
same change to make CI happy.


- vLLM version: v0.9.2
- vLLM main:
9499e26e2a

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-21 08:21:06 +08:00
lianyibo
53d2ea3789 [Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811)
### What this PR does / why we need it?

maybe fixes
[#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Test Qwen3-32B tp=4 with: 

```bash
vllm serve --port 1234 Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests \
```

Request batch_size=128 input/output token=1024

**In 0.9.2rc1**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 785.1395
Total TPS without prefill                   : 846.6809
Mean TPS with    prefill                    : 6.1339
Mean TPS without prefill                    : 6.6147
=====================================================
Mean TTFT(ms)                               : 10307.8123
Max  TTFT(ms)                               : 21423.0733
Min  TTFT(ms)                               : 362.3602
=====================================================
Mean TPOT(ms)                               : 151.3051
Max  TPOT(ms)                               : 159.4649
Min  TPOT(ms)                               : 140.899
=====================================================
Total Time(s)                               : 175.6032
Request Throughput(requests/s)              : 0.7289
=====================================================
```

**Apply this PR**

```text
=====================================================
Total TPS with    prefill(tokens/s)         : 811.0014
Total TPS without prefill                   : 876.4423
Mean TPS with    prefill                    : 6.3359
Mean TPS without prefill                    : 6.8472
=====================================================
Mean TTFT(ms)                               : 10263.8382
Max  TTFT(ms)                               : 21151.2547
Min  TTFT(ms)                               : 375.9136
=====================================================
Mean TPOT(ms)                               : 146.1686
Max  TPOT(ms)                               : 154.0957
Min  TPOT(ms)                               : 136.8879
=====================================================
Total Time(s)                               : 169.8579
Request Throughput(requests/s)              : 0.7536
=====================================================
```

The TPOT performance gap between these two sets of data is about 3%.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
2025-07-18 23:09:54 +08:00
Mengqing Cao
574fe407eb [1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841)
### What this PR does / why we need it?
We'll refator `CustomOp` in vllm-ascend from this pr on. 

Use function `CustomOp.register_oot` to achieve the customop registery,
taking `AscendQuickGELU` as an example:
```python
from vllm_ascend.ops.activation import AscendQuickGELU
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```

This is a quick adapt for `CustomOp.register_oot` mechanism from vllm
0.9.2. For further step, we can remove inherit from `QuickGELU` can
write our own `QuickGELU` at all.

Part of https://github.com/vllm-project/vllm-ascend/pull/1647



- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-18 23:07:14 +08:00
xudongLi-cmss
33ef5dc813 add unit test for func wrapper (#1863)
### What this PR does / why we need it?
test func wrapper file

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

- vLLM version: v0.9.2
- vLLM main:
8dfb45ca33

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
2025-07-18 11:05:17 +08:00
Shanshan Shen
f96100fad5 [Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00