### What this PR does / why we need it?
torchair w8a8 and w4a8 Separate from fused_moe due to the refactor and
change for fused_moe
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
After moved torchair related quantization section into
torchair_quantization, split the torchair from the origin quantization
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Register VocabParallelEmbedding instead of overwrite forward
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
644d57d531
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
de02b07db4
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR fix bugs and refactor cached mask generation logic. Now just
pre-construct and use the cached mask on cpu instead of device on npu.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
9b5f64238f
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
convert the format of gmm to nz
### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
ut: test_fused_ops.py and e2e: test_fused_moe.py
**performance**:
(qwen3 30B, 2k->20k)
base:
Total Token throughput (tok/s): 719.93
gmm nz:
Total Token throughput (tok/s): 728.52
- vLLM version: v0.10.1.1
- vLLM main:
bfc1edc9f5
Signed-off-by: huangxialu <huangxialu1@huawei.com>
### What this PR does / why we need it?
Move torchair related qunatization section into torchair dir to make the
code clear. Next step we'll remove all torchair related code outside of
torchair quantization.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
959783fb99
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Fix the bug of cos invalid shape when dp
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
1fdc732419
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This pr updates compilation config in `check_and_update_config`, we use
`compilation_config.level` to update `compilation_config.cudagraph_mode`
to ensure the config is correct.
Add `compilation_config.cudagraph_num_of_warmups = 1` when V1 is
enabled, cause this is also used in torchair graph mode. and this fixes
https://github.com/vllm-project/vllm-ascend/issues/2523
fix the bug that the `aclgraphmode` always be `NONE` while running
forward in aclgraph mode
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
f58675bfb3
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Integrate the arange operator to reduce the time spent and improve
performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
56dcf4e7e9
---------
Signed-off-by: s30076806 <songjiayang2@h-partners.com>
### What this PR does / why we need it?
This PR fixes the bug on some machines where quantmatmul failed to run
with the NZ format. The change ensures proper execution under the
expected data layout.
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.1.1
- vLLM main:
b5d34af328
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is commercial
delivery version of modelslim in Q3, and has been verified available
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
7d67a9d9f9
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This method replaces the previous all-gather approach for small numbers
of tokens.
The key changes include:
- A new `AscendFusedMoE` layer that handles token splitting, local
computation, and final aggregation via all-gather.
- Logic in the model runner to dynamically select between the new MC2
method and the existing all-gather method based on the number of input
tokens.
- Sharding the MoE communication mask across tensor-parallel ranks.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Test case fixed.
- vLLM version: v0.10.1.1
- vLLM main:
b00e69f8ca
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
After moved torchair related fused_moe section into torchair_fused_moe,
split the torchair from the origin fused_moe
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
2a97ffc33d
Signed-off-by: hust17yixuan <303660421@qq.com>
[Bugfix]Support Qwen3-MOE on aclgraph mode in sizes capture and add new
ut
What this PR does / why we need it?
This PR solves the problem of sizes capture and stream error caused by
using ACLgraph on the Qwen3-30B MOE model.
Add new ut.
Does this PR introduce any user-facing change?
no
How was this patch tested?
ut
- vLLM version: v0.10.1.1
- vLLM main:
6fad29b11b
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
The constant ASCEND_QUATIZATION_METHOD in vllm_ascend/utils.py is
misspelled and should be corrected to ASCEND_QUANTIZATION_METHOD.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
c9abb10489
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
### What this PR does / why we need it?
Upgrade vllm in accuracy and performance CI
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.1.1
- vLLM main:
5c4b6e66fe
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Fixes hang when batch size < DP size.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
After this change, the function in DP case will work now.
- vLLM version: v0.10.1.1
- vLLM main:
d9a55204ba
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Move torchair related fused_moe section into torchair_fused_moe to make
the code clear. Next step we'll remove all torchair related code outside
of torchair_fused_moe .
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.0
vLLM main:
08d5f7113a
- vLLM version: v0.10.1.1
- vLLM main:
170e8ea9ea
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Update release version info.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
712d0f88d8
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?
update vllm version in ci
- vLLM version: v0.10.0
- vLLM main:
170e8ea9ea
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add test for `outlines` backend for structured output in CI.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests have all passed with:
```bash
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
```
- vLLM version: v0.10.0
- vLLM main:
53415653ff
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Accuracy report formatting
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
53415653ff
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
Register RotaryEmbedding instead of overwrite forward
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.0
- vLLM main:
808d2e9aa0
---------
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Refactor all2all-related fused_experts (both quantized/unquantized) into
TokenDispatcherWithAll2AllV, including dispatch & combine calculation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E & UT
- vLLM version: v0.10.0
- vLLM main:
65197a5fb3
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Add configuration check logic for ascend scheduler: if chunked_prefill
is disabled, max_num_batched_tokens couldn't be less than max_model_len,
following vLLM;
### Does this PR introduce _any_ user-facing change?
users cannot set max_num_batched_tokens smaller than max_model_len with
ascend scheduler
### How was this patch tested?
CI and vllm serving passed
- vLLM version: v0.10.0
- vLLM main:
f77a0802b7
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Fix mtp mode ut
### Does this PR introduce _any_ user-facing change?
Nothing
### How was this patch tested?
This can be tested in the same way as a unit test.
- vLLM version: v0.10.0
- vLLM main:
53415653ff
Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com>
Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>
### What this PR does / why we need it?
Fix the bug of incorrect precision
- vLLM version: v0.10.0
- vLLM main:
53415653ff
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Add release note for `v0.9.1rc3`.
- vLLM version: v0.10.0
- vLLM main:
53415653ff
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Skip failed ut to recover CI quickly
related ut:
- `test_embed_models_correctness`: revert me when pooler is adapted with
the latest vllm main
- `test_check_and_update_config_enforce_eager_mode`: revert me when the
occasional failed is fixed
- vLLM version: v0.10.0
- vLLM main:
8896eb72eb
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add cp/sp feature branch
- vLLM version: v0.10.0
- vLLM main:
0c6e40bbaa
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
1. use action/checkout@v5 instead of v4
2. remove dbo test case because there is issue with it and will be
refactored later
3. make vllm-ascend compatible with vllm v0.10.1.1 and add CI for it
4. fix sampler api changes introduced by
https://github.com/vllm-project/vllm/pull/22387
6. fix qwen3 moe config changes intruoduced by
https://github.com/vllm-project/vllm/pull/20562
7. fix kvcache block changes introduced by
https://github.com/vllm-project/vllm/pull/23262
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0c6e40bbaa
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This PR move current unified mla backend to torchair folder and remove
torchair-related code in attention/mla_v1.py (1.3k -> 0.9k).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Running eager mode with mla backend, and torchair mode with code before
[2445](https://github.com/vllm-project/vllm-ascend/pull/2445)
- vLLM version: v0.10.0
- vLLM main:
f571ff8eb6
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This patch add the feature branch policy.
After this patch: maintainers are allowed to create a feature branch.
Feature branches are used for collaboration and must include an RFC
link, merge plan and mentor info.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
7be5d113d8
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
For dense models, by not applying tensor parallelism (TP) to the
attention module and applying TP to the MLP module, the allreduce
operations in the attention module can be eliminated, thereby reducing
computational overhead. However, this approach increases memory usage,
so the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMZE is used to
control this optimization.
- vLLM main:
b17109beea
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Update DOC. Guide users to run LoRA with ACLGraph.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.10.0
- vLLM main:
de7b67a023
---------
Signed-off-by: paulyu12 <507435917@qq.com>
refact model runner v1
### What this PR does / why we need it?
1. Separate the execute model logic from the prepare input logic
2. Disassemble the torchchair in model runner v1
- vLLM version: v0.10.0
- vLLM main:
68fcd3fa73
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
enable_shared_pert_dp is currently on by default. This optimization is
currently only valid for deepseek series models. The default opening
affects the accuracy of the qwen series models.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
use parameter --additional_config='{"enable_shared_expert_dp": true}'
- vLLM version: v0.10.0
- vLLM main:
d983769c41
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
The deepseek w4a8 weights we supported before were in mindie-format
format. It uses int8 to represent int4, so the weight size is similar to
w8a8, and we need to do a few extra steps to make vllm-ascend load it
normally.
Now we can directly use the new weight format, which uses two int4 packs
to save the weight, the weight size is reduced, and there is no need to
do many extra operations to directly use it on vllm-ascend, but we are
also compatible with the weights of the previous mindie format.
The weight changes in the new version:
1. The weight is packed (2 int4 pack to int8)
2. The bias required in the apply method is directly generated by
modelslim
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
#### 1.How to get weights using Modelslim
##### Installation steps
we can use the branch br_release_MindStudio_8.1.RC2_TR5_20260624
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624
https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
##### Generate w4a8 weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md Execute the
[pre-check](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapter
Reference command:python3 quant_deepseek_w4a8.py --model_path {Original
weight path} --save_path {Generate weight path}
##### Adapt to vllm-ascend
Modification in `config.json`:`"model_type":deepseekv2` is changed to
`"model_type":deepseek_v3`;
#### 2.How to run w4a8
##### a.How to run eager mode
export VLLM_ASCEND_MLA_PA=1
python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6
--enforce-eager
eg: python -m vllm.entrypoints.openai.api_server
--model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120 --max-num-seqs 128 --enforce-eager
##### b.How to run graph mode
export HCCL_BUFFSIZE=1024
python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
eg: python -m vllm.entrypoints.openai.api_server
--model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
- vLLM version: v0.10.0
- vLLM main:
103f1ec8d3
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
1. rename `VLLM_LLMDD_RPC_PORT` to `VLLM_ASCEND_LLMDD_RPC_PORT` to make
the prefix the same in vllm-ascend
2. enable `VLLM_ASCEND_LLMDD_RPC_IP` env for PD feature.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>