### What this PR does / why we need it?
Avoid OOM during CI by using `with VllmRunner` instead of `LLM()`, and
enable `test_ngram_correctness`.
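For reference, a minimal sketch of the pattern, assuming the e2e suite's `VllmRunner` context manager (the model and the `generate_greedy` helper are placeholders from the test conftest):

```python
from tests.e2e.conftest import VllmRunner

# The context manager tears the engine down and frees NPU memory when the
# block exits, unlike a bare LLM() instance that lingers until GC.
with VllmRunner("Qwen/Qwen2.5-0.5B-Instruct") as runner:
    outputs = runner.generate_greedy(["Hello, my name is"], max_tokens=32)
print(outputs)
```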
### How was this patch tested?
CI passed.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:
## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6)
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph
### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting `xlite_graph_config`:
- Decode only: `--additional-config='{"xlite_graph_config": {"enabled": true}}'`
- Prefill and decode: `--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR supports async_scheduling for MTP, following vLLM PR
https://github.com/vllm-project/vllm/pull/24799.
It also fixes some synchronization problems in vllm-ascend.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR adds qdown output support to the mlapo operation.
### Does this PR introduce _any_ user-facing change?
The mlapo operation adds an `enable_inner_out` input option.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: h1074112368 <h1074112368@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR introduces the Ascend implementation of the
`dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime,
together with follow‑up fixes to ensure the kernel builds and runs
correctly in CI.
- Add full host and device implementation of the `dispatch_ffn_combine`
kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE
routing helpers, and kernel utilities for quantized FFN dispatch.
- Integrate the new kernel with the PyTorch binding
(csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend
runtime (vllm_ascend/ascend_forward_context.py,
vllm_ascend/worker/model_runner_v1.py).
- Extend fused MoE communication and token dispatch support in
`vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new
dispatch path.
- Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py
to support the new FFN dispatch flow.
- Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake
configuration, and include/namespace usage in the new kernel files.
- Add an end‑to‑end nightly test
`tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper
utilities in `vllm_ascend/utils.py` to validate the new kernel.
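For reference, the nightly test added above can be run directly; a small sketch (`pytest.main` is just a convenience wrapper around the CLI):

```python
import pytest

# Exercises the new dispatch_ffn_combine kernel end to end on Ascend hardware.
pytest.main(["-s", "tests/e2e/nightly/ops/test_dispatch_ffn_combine.py"])
```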
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
---------
Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Optimize the multi-node waiting logic
2. Remove the `tee` pipeline for logs, which could lead to a hang issue
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm-ascend/pull/2958 generalized the
gatingtopk operator but caused a nightly CI error.
This PR adds a logits check for the gatingtopk op and fixes the nightly CI.
- vLLM version: v0.12.0
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
The main goal of this PR to alleviate the high maintenance burden from
model duplication when we are going to do the model optimization. Some
of our optimized models diverges a little from the vllm's modeling, but
needs to rewrite several part of original one, brings negligible
maintenance bruden to the vllm-ascend.In order to solve that, we propose
to leverage `torch.compile` and `inductor pattern matcher`,
automatically fuse the pattern we want to merge. For more details can
refer to the RFC https://github.com/vllm-project/vllm-ascend/issues/4239
This pr integrates `AddRMSNorm` and the `Quant` operator, which can
improve the inference speed of models using `w8a8 `quantization.
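A minimal sketch of the general mechanism (not this PR's actual pass), assuming a toy RMSNorm formula and shapes for illustration:

```python
import torch
from torch._inductor.pattern_matcher import (
    PatternMatcherPass, fwd_only, register_replacement,
)

patterns = PatternMatcherPass()

def search(x, residual, weight):
    # Unfused pattern: residual add followed by an RMSNorm-style normalize.
    h = x + residual
    return h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + 1e-6) * weight

def replace(x, residual, weight):
    # The real pass would call the fused AddRMSNorm+Quant kernel here; the
    # sketch repeats the math so it stays self-contained and traceable.
    h = x + residual
    return h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + 1e-6) * weight

example_inputs = [torch.randn(4, 8), torch.randn(4, 8), torch.randn(8)]
register_replacement(search, replace, example_inputs, fwd_only, patterns)
```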
### Does this PR introduce _any_ user-facing change?
Yes, a new `additional_config` option is added.
### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The president of the United States is Mr.",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="/root/.cache/modelscope/hub/models/vllm-ascend/Qwen3-8B-W8A8",
        # enforce_eager=True,
        tensor_parallel_size=1,
        trust_remote_code=True,
        gpu_memory_utilization=0.7,
        quantization="ascend",
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
```text
Prompt: 'The president of the United States is Mr.', Generated text: ' Trump. The president of the United States is Mr. Biden. Which of the following statements is correct? \n\nA. Mr. Trump is Mr. Biden. \nB. Mr. Trump is not Mr. Biden. \nC. The president of the United States is not Mr. Trump. \nD. The president of the United States is not Mr. Biden.\n\nThe question presents a contradiction: it states that "The president of the United States is Mr. Trump" and "The president of'
```
- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4
---------
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Add Qwen3Next support in main
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR adds a Triton rope kernel which supports scenarios where
`rope_dim != head_dim`. This saves the split op before rope and the concat
op after rope. Profiling shows an improvement.
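For context, a rough sketch of the unfused pattern the kernel replaces when `rope_dim != head_dim` (neox-style rotation; shapes and naming are assumptions):

```python
import torch

def rope_with_split(q, cos, sin, rope_dim):
    # Split before rope: only the first rope_dim channels are rotated.
    q_rot, q_pass = q[..., :rope_dim], q[..., rope_dim:]
    half = rope_dim // 2
    q1, q2 = q_rot[..., :half], q_rot[..., half:]
    rotated = torch.cat((q1 * cos - q2 * sin, q2 * cos + q1 * sin), dim=-1)
    # Concat after rope: the fused Triton kernel removes this and the split.
    return torch.cat((rotated, q_pass), dim=-1)
```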
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Related UTs will be added once CI is integrated with Triton.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
Set `enable_chunked_prefill` to True for e2e tests by default to keep the
same behavior as vLLM.
- vLLM version: v0.11.2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix the issue where the qwen3 moe service cannot start after upgrading the
vLLM version.
Error info:
`AttributeError: 'AscendFusedMoE' object has no attribute 'use_dp_chunking'`
### Does this PR introduce _any_ user-facing change?
no
- vLLM version: v0.11.2
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Add accuracy nightly test for new models:
PaddlePaddle/ERNIE-4.5-21B-A3B-PT
LLM-Research/Molmo-7B-D-0924
LLM-Research/gemma-2-9b-it
LLM-Research/gemma-3-4b-it
Shanghai_AI_Laboratory/internlm-7b
llava-hf/llava-1.5-7b-hf
- vLLM version: v0.11.2
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Support shared expert DP for the deepseek_mtp feature.
`shared_expert_dp` requires `SP==True`, with corresponding parameter
restrictions.
Previously, due to the coupling between `shared_expert_dp` and torchair,
and the removal of `deepseek_mtp` in vllm_ascend, shared expert DP for
deepseek_mtp was temporarily removed.
Now, by performing a `reduce_scatter` on the input of deepseek_mtp in
`mtp_proposer.py`, we ensure that it matches the dimensions of
`input_embedding`, and then perform an `all_gather` on the output of MTP,
as sketched below.
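A conceptual sketch of that flow (not the PR's code; helper names and the scatter dimension are assumptions):

```python
import torch
import torch.distributed as dist

def mtp_with_shared_expert_dp(hidden_full, mtp_forward, group):
    # Reduce-scatter the MTP input so it matches the per-rank
    # input_embedding shape, run MTP, then all-gather the output.
    world = dist.get_world_size(group)
    local = torch.empty(
        hidden_full.shape[0] // world, *hidden_full.shape[1:],
        dtype=hidden_full.dtype, device=hidden_full.device,
    )
    dist.reduce_scatter_tensor(local, hidden_full, group=group)
    out_local = mtp_forward(local)
    out_full = torch.empty_like(hidden_full)
    dist.all_gather_into_tensor(out_full, out_local, group=group)
    return out_full
```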
### How was this patch tested?
baseline:
![baseline](https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3)
enable shared_expert_dp and multistream_overlap_shared_expert:
![shared_expert_dp and multistream_overlap_shared_expert](https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc)
TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Signed-off-by: zengran <zengran2@huawei.com>
Co-authored-by: zengran <zengran2@huawei.com>
### What this PR does / why we need it?
This PR integrates suffix decoding (https://arxiv.org/abs/2411.04975)
from vLLM (https://github.com/vllm-project/vllm/pull/25784).
Suffix Decoding is a dynamic n-gram matching method that:
1. Uses suffix trees to generate speculative tokens quickly using branch
frequency counts.
2. Can keep a history of prior model responses, which tends to work very
well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, with FIFO
eviction of older requests.
### Does this PR introduce _any_ user-facing change?
This feature should be implemented as opt-in and remain seamless for
users who do not require suffix speculative decoding.
For users who wish to enable it, they must first install
arctic-inference:
`pip install arctic-inference`
After installation, the suffix speculative decoding feature can be
enabled using the following speculative config:
`--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'`
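The offline API accepts the same config; a minimal sketch, assuming `speculative_config` is passed as a dict to `LLM` (the model is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    speculative_config={"method": "suffix", "num_speculative_tokens": 5},
)
```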
### How was this patch tested?
This PR is currently being tested on vLLM main (83f478bb19) together with
PR https://github.com/vllm-project/vllm/pull/25784.
In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.
- vLLM version: v0.11.2
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
### What this PR does / why we need it?
Add a new fused op to custom_op, which combines `torch.bmm()` and
transpose to achieve better performance. This op is used in mla_v1 to
replace the bmm and transpose ops.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.2
---------
Signed-off-by: hust17yixuan <303660421@qq.com>
The Ascend scheduler was added for the non-chunked-prefill case back when
the NPU ops didn't work well with chunked prefill.
Now that the ops work better with chunked prefill, it's time to remove
the Ascend scheduler and use vLLM's default scheduler.
- vLLM version: v0.11.2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Qwen3-Next: support the Triton `chunk_gated_delta_rule` op
### co-owners
@OsirisDuan
- vLLM version: v0.11.2
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.
Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`
This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).
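A quick way to check the binding from Python; a sketch only (the registered Python-side op name is an assumption; see `csrc/torch_binding.cpp` for the authoritative name and signature):

```python
import torch
import vllm_ascend  # importing registers the custom aclnn ops

# List what is bound under torch.ops._C_ascend to locate the new op.
print([name for name in dir(torch.ops._C_ascend) if "grouped_matmul" in name])
```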
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.11.2
---------
Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the vLLM community
to generate quantized weights, the vLLM Ascend engine needs to be
adapted to support the compressed-tensors quantization format.
1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vLLM.
2. Support CompressedTensorsW8A8 static weights.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
3. Support CompressedTensorsW8A8Dynamic weights.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
4. Modify override_quantization_method in AscendQuantConfig.
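A minimal sketch of the resulting user flow, assuming a checkpoint produced by LLM Compressor (the path is a placeholder):

```python
from vllm import LLM

# vLLM detects the compressed-tensors format from the checkpoint config;
# with this PR the format is handled by AscendCompressedTensorsConfig.
llm = LLM(
    model="/path/to/llm-compressor-w8a8-checkpoint",  # placeholder
    quantization="compressed-tensors",
)
```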
Co-authored-by: taoqun110 <taoqun@huawei.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
- vLLM version: v0.11.2
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
### What this PR does / why we need it?
Upgrade CANN to 8.3.RC2.
### Does this PR introduce _any_ user-facing change?
Yes, the Docker image will use CANN 8.3.RC2.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.
**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.
---
- vLLM version: v0.11.2
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This patch mainly does the following:
1. Makes k8s/lws optional for multi-node testing, allowing developers to
run multi-node tests locally by explicitly passing in the IP addresses of
all nodes.
2. Allows passing a custom proxy script path in the config file to load
the proxy.
- vLLM version: v0.11.2
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Due to the inconsistency between the attention operators used in eager
mode and graph mode, the accumulation order of the operator cannot be
guaranteed to be deterministic. Therefore, we modify the test to compare
with given outputs.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
There is a lot of hack code for v0.11.0, which makes the codebase hard to
upgrade to newer vLLM versions. Since v0.11.2 will be released soon, let's
drop v0.11.0 support first; then we'll upgrade to v0.11.2.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR updates the accuracy standard for the deepseek mtpx cases,
according to the internal standard.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
In [#26016](https://github.com/vllm-project/vllm/pull/26016), vLLM
changed `cudagraph_capture_sizes` to be in ascending order. This PR
fixes related issues caused by that change.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Support the Qwen3-Next-80B-A3B-Instruct quantization model and fix the
NZ issue. The Triton kernel doesn't support the NZ data format, so we
skip converting the weight to NZ for the `conv1d` layer.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: IncSec <1790766300@qq.com>
### What this PR does / why we need it?
Add an ACL graph capture/replay DP test; this is an improved version of
#3886. It restructures the multi-card ACL graph test for improved
clarity, robustness, and accuracy.
Key improvements include:
- Replaces fragile `sys.settrace` and manual patching with a clean,
reusable spy installer using `unittest.mock.patch` (see the sketch after
this list).
- Introduces more precise metrics by tracking
`NPUModelRunner.execute_model` and `_dummy_run` calls directly.
- Rewrites assertions to be more accurate and provides clear
explanations for the expected counts of graph captures, replays, model
executions, and dummy runs.
- Simplifies the overall test structure by separating the worker logic
into a dedicated function.
- Removes a long, unnecessary sleep at the end of the test.
- Expands test coverage by adding a larger `max_tokens` parameter.
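A minimal sketch of such a spy installer (names like `NPUModelRunner._dummy_run` are taken from the description; the helper itself is illustrative):

```python
from unittest import mock

def spy_on(cls, method_name):
    # Patch the method with a mock whose side_effect calls the real
    # implementation: behavior is unchanged, call counts become observable.
    original = getattr(cls, method_name)
    return mock.patch.object(cls, method_name, autospec=True,
                             side_effect=original)

# Usage sketch:
#   with spy_on(NPUModelRunner, "_dummy_run") as dummy_run_spy:
#       run_workload()
#   assert dummy_run_spy.call_count == expected_dummy_runs
```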
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Add tests for the multi-node DeepSeek-V2-Lite network in GE Graph mode,
and supplement the end-to-end (e2e) tests for the MLA and NZ features of
this network.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
### What this PR does / why we need it?
This PR updates the prefix-cache threshold for qwen3-32b-int from 0.4 to
0.8, as the baseline has been improved.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
The current library only supports the FullDecodeOnly graph mode, which
enables full graph execution during decode. This PR extends support to
allow full graph execution in both prefill and decode, referred to as
FULL graph mode.
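A sketch of what enabling it could look like from the offline API, assuming the mode is selected through vLLM's `compilation_config` knob (the knob value and model are assumptions):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder
    compilation_config={"cudagraph_mode": "FULL"},  # vs. "FULL_DECODE_ONLY"
)
```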
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
- Skip the nightly image build when the GitHub event is `pull_request`.
- Set `imagePullPolicy` to `Always` for the multi-node test.
- Move the multi-node tests ahead so some resources get cleaned up first.
- Decouple the nightly image build from the nightly tests for fault
tolerance.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs for the multi-node and
EPLB scenario.
### Does this PR introduce _any_ user-facing change?
no
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
### What this PR does / why we need it?
Given the current excessively long build time of our nightly CI, I
recommend installing the necessary, pinned versions of packages in the
Docker image to reduce the time required for integration testing,
including Mooncake and vLLM at fixed tags. This is expected to reduce
the nightly-CI duration by 2 hours.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Since https://github.com/vllm-project/vllm-ascend/pull/3967, chunked
prefill and splitfuse are enabled by default.
The e2e test for MTP now breaks. After locating the bug, we found that a
Triton operator does not support chunked prefill.
Simply skipping the e2e test would be bad, so we changed it to only
cover the case in which chunked prefill is off.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Since we only modified
`test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY`, we ran `pytest -s
tests/e2e/multicard/test_qwen3_next.py::test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY`
locally to test it.
Below is the result:
```text
==================================================================================================================== warnings summary ====================================================================================================================
usr/local/python3.11.10/lib/python3.11/site-packages/torch_npu/dynamo/torchair/__init__.py:8
/usr/local/python3.11.10/lib/python3.11/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
tests/e2e/multicard/test_qwen3_next.py::test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY
tests/e2e/multicard/test_qwen3_next.py::test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY
/usr/local/python3.11.10/lib/python3.11/site-packages/pydantic/_internal/_dataclasses.py:121: DeprecationWarning: The 'task' option has been deprecated and will be removed in v0.13.0 or v1.0, whichever comes first. Please remove this option.
s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================= 1 passed, 5 warnings in 314.52s (0:05:14) ========================================================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
```
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: drslark <slarksblood@qq.com>