### What this PR does / why we need it?
This PR adds mlapo operation support qdown of output.
### Does this PR introduce _any_ user-facing change?
mlapo operation add enable_inner_out of input
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: h1074112368 <h1074112368@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR introduces the Ascend implementation of the
`dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime,
together with follow‑up fixes to ensure the kernel builds and runs
correctly in CI.
- Add full host and device implementation of the `dispatch_ffn_combine`
kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE
routing helpers, and kernel utilities for quantized FFN dispatch.
- Integrate the new kernel with the PyTorch binding
(csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend
runtime (vllm_ascend/ascend_forward_context.py,
vllm_ascend/worker/model_runner_v1.py).
- Extend fused MoE communication and token dispatch support in
`vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new
dispatch path.
- Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py
to support the new FFN dispatch flow.
- Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake
configuration, and include/namespace usage in the new kernel files.
- Add an end‑to‑end nightly test
`tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper
utilities in `vllm_ascend/utils.py` to validate the new kernel.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
---------
Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Optimize multi-node waiting logic
2. Remove the `tee` pipeline for logs, which will lead to hang issue
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This pr https://github.com/vllm-project/vllm-ascend/pull/2958 is
supporting gatingtopk operator generalization, but caused nightly ci
error.
Now we add check logits for ops gatingtopk, and fix nightly ci.
- vLLM version: v0.12.0
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
This PR adds a triton rope kernel witch supports scenarios of `rope_dim
!= head_dim`. This can save the split op before rope and the concat op
after rope. Profiling shows improvement.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
I will add related ut after ci integrated with triton.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Add a new fusion ops to custom_op, which can cobime the torch.bmm() and
transpsose to achieve better peformance. This ops is used in mla_v1 to
replace the bmm and transpose
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.2
---------
Signed-off-by: hust17yixuan <303660421@qq.com>
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.
Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.
- vLLM version: v0.11.2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.
Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`
This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.11.2
---------
Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.
**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.
---
- vLLM version: v0.11.2
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This patch mainly doing the following things:
1. Make k8s/lws optional for multi-node testing, allowing developers to
run multi-node tests locally by actively passing in the IP addresses of
all nodes.
2. Allows passing a custom proxy script path in the config file to load
the proxy.
- vLLM version: v0.11.2
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
There is a lot hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR updates the acc standard for deepseek mtpx cases, according to
inner standard
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This PR update the prefixcache threshold for qwen3-32b-int from 0.4 to
0.8, as the baseline has been improved.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
- skip the nightly image build when the github event is pull_request
- set imagepullpolicy as alway for multi_node test
- move multi_node tests ahead to have some resource clean first
- do not relevant nightly image build with nightly tests for tolerance
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and EPLB
scenario
### Does this PR introduce _any_ user-facing change?
no
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
### What this PR does / why we need it?
Given the current excessively long build time of our nightly-ci, I
recommend installing necessary, confirmed versions of packages in the
Docker image to reduce the time required for integration testing.
Including Mooncake vllm with fixed tags, This is expected to reduce
nightly-ci duration by 2 hours.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Explicit specification `NUMEXPR_MAX_THREADS` to avoid `Error. nthreads
cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)`
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR updates some nightly test cases and adds mtpx cases, we need to
test them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Drop VLLM_USE_V1 usage. This env has been removed from vLLM already.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR adds some qwen3-235b-w8a8 cases qwen3-30b-w8a8 cases, we need
test them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This pr fixes ci on eplb
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Since we have upgraded to CANN 8.3rc1, we will no longer use the
privately maintained Mooncake repository, but instead use the official
release released by Mooncake:
https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2 .
Next step: this is only a temporary solution. We will integrate mooncake
into the vllm-ascend base image later for easier use. see
https://github.com/vllm-project/vllm-ascend/pull/3989
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR updates the acc test standard for some cases, we need it to
better maintain acc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This PR adds full graph for multimodal nightly test, we need to maintain
this senario
### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.
TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Because the previous commit hash was accidentally deleted or
overwritten. This patch correct the commit hash available for
https://github.com/AscendTransport/Mooncake to make nightly ci happy
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This patch mainly fix the the problem of not being able to determine the
exit status of the pod's entrypoint script and some other tiny
optimizations:
1. Shorten wait for server timeout
2. fix typo
3. fix the issue of ais_bench failing to correctly access the proxy URL
in a PD separation scenario.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds MALPO for deepseek aclgraph, we need to test it nightly
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
Fix oom of deepseek-eplb nigtly test
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Fix eplb nightly tests.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
This patch optimize nightly CI:
1. Bug fixes ais_bench get None repo_type error
2. Fix A2 install kubectl error with arm arch
3. Fix the multi_node CI unable to determine whether the job was
successful error
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds 2 more A2 caces which we need to test daily. It also
enhances the logging for aisbench test failures to improve issues
identification
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This patch add multi-node test case for a2
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds the 2P1D multi node func/acc/perf test cases, we need test
them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds a qwq case for nightly test for qwen-qwq on A3 ,we need
test them daily
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: ckhw <cuikai1@huawei.com>
### What this PR does / why we need it?
This PR adds a prefix cache case for nightly test for
DeepSeek-r1-0528-W8A8 on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
### What this PR does / why we need it?
This pull request mainly do the following things:
1. Add a doc for multi-node CI, The main content is the mechanism
principle and how to contribute
2. Simplify the config yaml for more developer-friendly
3. Optimized the mooncake installation script to prevent accidental
failures during installation
4. Fix the workflow to ensure the kubernetes can be apply correctly
5. Add Qwen3-235B-W8A8 disaggregated_prefill test
6. Add GLM-4.5 multi dp test
7. Add 2p1d 4nodes disaggregated_prefill test
8. Refactor nightly tests
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds the initial prefix cache case for nightly test for
Qwen3-32b-int8 on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>