### What this PR does / why we need it?
- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.
### Does this PR introduce any user-facing change?
- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.
### How was this patch tested?
- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.
- vLLM version: v0.11.0
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
### What this PR does / why we need it?
Add quantization param for `deepseek-w8a8` multi-node test
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Enable tests/e2e/multicard/test_external_launcher.py
2. Add e2e test for sleep mode in level2
### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: huangxialu <huangxialu1@huawei.com>
Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>
### What this PR does / why we need it?
This pr purpose to add multi-node test, on the first step, add
`deepseek-v3` dp+tp+ep test
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Adds support for capturing the Multi-Layer Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.
Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.
Known issues:
- Currently, MTP is not supported in FULL_DECEDE_ONLY mode -- we're
working on a fix
- We are preparing to remove update_mla_attn_params with
auto_dispatch_capture
### Does this PR introduce _any_ user-facing change?
compilation_config={
"cudagraph_mode": "FULL_DECODE_ONLY",
},
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
we notice that torch npu 0919 doesn't work. This PR revert related
change which rely on 0919 version.
Revert PR: #3295#3205#3102
Related: #3353
- vLLM version: v0.11.0
### What this PR does / why we need it?
Fix empty lines between lm_eval command lines for accuarcy template
- vLLM version: v0.10.2
- vLLM main:
9607d5eb44
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
1. clean up v0.10.2 support in ut and e2e test
2. remove v0.11.0 period job, we're at v0.11.0 now.
3. remove uesless patch for deepseek v3.2. They have been done in vLLM
already.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix CI by addressing max_split_size_mb config
### Does this PR introduce _any_ user-facing change?
No, test onyl
### How was this patch tested?
Full CI passed, espcially eagle one
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
we add a patch for model weight loader to avoid using vLLM weight loader
v2, since v2 will lead unknown issue for torchair. While this patch make
some unknown memory usage problem. To quick fix the problem, let's
expend the `max_split_size_mb` to a larger value to avoid weight load
oom issue.
Further solution is to remove the patch and address weight loader v2
from vLLM.
Closes: https://github.com/vllm-project/vllm-ascend/issues/3251
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1.Support deepseek w4a8 per-channel quantization
2.The eager mode supports converting weights to the NZ format
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim
##### Installation steps
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh
##### Generate w4a8 per-channel weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Add e2e test related to weight updates in RL scenarios.
Due to CI issues, the newly added Python test files cannot locate the
correct path. As a temporary solution, use absolute paths to add test
cases.
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>
### What this PR does / why we need it?
Correct the vllm interface e2e test Base container image name
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tests in vllm ci pipeline
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add OOT platform E2E test case to be run in the vllm buildkite pipeline.
Note: added test case is not run in vllm-ascend CI.
### Does this PR introduce _any_ user-facing change?
NA
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This PR implements the renaming of the environment variable
VLLM_LLMDD_RPC_PORT to VLLM_ASCEND_LLMDD_RPC_PORT, as proposed and
tracked in
[#2450](https://github.com/vllm-project/vllm-ascend/pull/2450). The
renaming is intended to align the variable naming convention with other
Ascend-specific environment variables in the vllm-ascend codebase,
enhancing consistency and clarity for developers and users working with
Ascend-based deployments.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.2
- vLLM main:
9607d5eb44
Signed-off-by: wyu0-0 <woshilynn@163.com>
### What this PR does / why we need it?
Update the format of the accuracy report
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
c60e6137f0
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Bump main to
c60e6137f0
- Updated imports in `vllm.config` to
`vllm.config.model`(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252
- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
6d8246aaff
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1. update expected accuracy for DeepSeek-V2-Lite
2. add batch size
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Accuracy CI passed
- vLLM version: v0.10.2
- vLLM main:
838d7116ba
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Increase doctest timeout to 300s and time print, according to time print
in https://github.com/vllm-project/vllm-ascend/pull/3045 , most of time
consumed in `Graph capturing`, so I think it's fine to increase doctest
timeout
This PR also add time log for each task.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`
- CI passed
- vLLM version: v0.10.2
- vLLM main:
a684c0124c
Closes: https://github.com/vllm-project/vllm-ascend/issues/3045
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
This PR puts the calculation of shared experts into a separate stream,
overlaping with routing experts.
- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Fix accuracy for DeepSeek-V2-Lite
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
66072b36db
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR depends on the merge of #2707 and has adapted the aclgraph
functionality to support MTP.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
2b85697031
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
This PR deletes ~2K lines of code about deepseek modeling. It falls back
CustomDeepseekV2 modules to original vllm implementations and adapts
some modifications in vllm about deepseek and moe.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with torchair graph mode and eager mode.
- vLLM version: v0.10.2
- vLLM main:
759ef49b15
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
**Background:**
There are two principles about operator registration in PyTorch
- The same namespace can be only registered once by `TORCH_LIBRARY`
- The operator signatures can be only registered once by `def`
Considering that all custom operators defined in the current repo are
only used by Ascend, instead of defining a common operator schema by
vLLM, all accelerators then follow this operator schema and complete the
implementation based on their respective hardware, which is conducive to
functional abstraction.
Therefore, we can rename the operator registration namespace to an
Ascend-specific namespace(**_C_ascend**).
Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742
- vLLM version: main
- vLLM main:
f592b3174b
Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
### What this PR does / why we need it?
This PR prefetchs the weight of mlp layers in Qwen Dense Models to
optimize the performance in Decode phase mainly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
a1213fae5f
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Shuming19 <313093131@qq.com>
### What this PR does / why we need it?
[Feat]support dynamic quantization in allgather
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: main
- vLLM main:
5931b7e5d9
Signed-off-by: withHades <244036962@qq.com>
Signed-off-by: WithHades <244036962@qq.com>
This PR is based on top of
[#23569](https://github.com/vllm-project/vllm/pull/23569) and
[#24219](https://github.com/vllm-project/vllm/pull/24219).
### What this PR does / why we need it?
This PR allows the model runner to function asynchronously when using
async scheduling. This allows full overlap of the cpu operations
(including prepare_inputs) and the model forward pass. This diff is
functional and does not support speculative decoding, PP, or guided
decoding.
Expected speedup is 5-10% over the current async scheduling.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
server
```
python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\
--trust-remote-code --enforce-eager \
--distributed-executor-backend=mp \
-tp=4 \
--port 8006 \
--max-model-len 32000 \
--block-size 128 \
--gpu-memory-utilization 0.99
```
client
```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
--dataset-name random --random-input-len 2048 --random-output-len 2048 \
--ignore-eos\
--num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \
--metric-percentiles 90 --base-url http://localhost:8006 --save-result \
--result-dir $PROFILER_DIR
```
benchmark test based on Qwen3-32B TPOT result:
||forward async| scheduler async |sync|
|-|-|-|-|
|avg|41.73|41.86|44.20|
|improve0|0.3%|0|0|
|improve1|5.58%|0|0|
benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result:
||forward async|sync|
|-|-|-|
|avg|23.22|29.16|
|improve|20.3%|0|
- vLLM version: main
- vLLM main:
e93f4cc9e3
Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com>
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Flashcomm_v1 optim in Qwen Dense Models.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
5e537f45b4
Co-authored-by: 1024daniel <xxltju324@gmail.com>
### What this PR does / why we need it?
1. Move prepare/finalize operation from moe_comm_method to
/ops/moe/fused_moe_prepare_and_finalize
2. Adapt to token_dispatcher in moe_comm_method
3. Move
moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize
to /ops/moe
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
1. Similar to #2384 , this PR add a torchair-specific modeling for
pangu.
2. Fixes a bug introduced by routed_scaling_factor in #2675 .
3. remove eager test case for pangu since there has already been a
torchair test case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6
---------
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
### What this PR does / why we need it?
Fix MTP torchair bug caused by torchair refactor and moe refactor
Depends on PRs:
fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627
torchair multi DP fix:
https://github.com/vllm-project/vllm-ascend/pull/2626
### Does this PR introduce _any_ user-facing change?
when dp is enabled, to run mtp online server, need to disable server log
due to the current metrics does not support multi dp
`--disable-log-stats`
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
7c8271cd1e
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Refactor E2E CI to make it clear and faster
1. remove some uesless e2e test
2. remove some uesless function
3. Make sure all test runs with VLLMRunner to avoid oom error
4. Make sure all ops test end with torch.empty_cache to avoid oom error
5. run the test one by one to avoid resource limit error
- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
There are a lot of redundant codes related to moe here, and the
structure is not very clear.
We did the following things:
we have placed the relatively independent code related to apply_mlp into
a separate file;
removed the environment variables of alltoall_buffer and alltoall_seq.
Remove the code related to alltoall_buffer and alltoall_seq, and retain
the sole TokenDispatcher inheritance class.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut
- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3
---------
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
* **Unify execution paths:** Consolidates the quantized and
non-quantized execution paths into a single `fused_experts` function,
removing duplicated logic and making the control flow clearer and easier
to maintain.
* **W8A8 dynamic quantization:** Adds support for W8A8 dynamic
quantization inside the unified MoE kernel. Communication routines are
updated to correctly handle dynamic quantization scales for activations.
* **Weight pre-processing:** Prae-transpose the `w13` and `w2` weight
matrices (as implemented in PR #2025) so that quantized and
non-quantized models follow the same code path for the MoE gating,
up-projection, and down-projection operations.
* **All-to-all communication:** Adds an `all-to-all` collective
communication pattern. For large token counts on modern hardware,
`all-to-all` is more efficient than the previous `all-gather` strategy.
However, `all-to-all` is not really captured and replayed due to
multiple D2H operations which will trigger synchronization, and thus
raise error when capture graphs. We only use `all-to-all` when fallback
to `compiled_graph_for_general_shape`.
* **Dynamic communication selection:** The model runner now selects the
optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at
runtime based on token count and the Ascend SoC version.
* **Limitation:** `all-gather` is not yet supported for quantized
models, which means there is still something left to do on A2.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
No further test cases needed.
- vLLM version: v0.10.1.1
- vLLM main:
d660c98c1b
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover
CI, will be refactorred in the future
2. disable aclgraph when testing w8a8
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.1.1
- vLLM main:
95089607fa
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Integrate the arange operator to reduce the time spent and improve
performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
56dcf4e7e9
---------
Signed-off-by: s30076806 <songjiayang2@h-partners.com>