Compare commits

136 Commits

Author SHA1 Message Date
starkwj
389030a8f8 add env vars & misc 2026-02-11 06:27:58 +00:00
starkwj
739d074b0c update other platforms' Dockerfile 2026-01-23 03:24:25 +00:00
starkwj
2a571d8bc8 support multi npu partially 2026-01-09 04:36:39 +00:00
starkwj
fa0fb46853 fix reload return value 2026-01-07 07:42:30 +00:00
074ae28d6e Update README.md 2026-01-05 20:33:31 +08:00
starkwj
caf0289e1a add Dockerfile and readme 2026-01-05 11:31:07 +00:00
starkwj
135cc0a505 vllm-ascend vnpu v1 2025-12-26 07:37:35 +00:00
zhangyiming
2f1aed98cc [Doc] Update version policy to the latest. (#5071)
### What this PR does / why we need it?
[Doc] Update version policy to the latest.

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-16 15:24:46 +08:00
zzzzwwjj
8c41770f1f [bugfix] fix fp32 trans nz (#5068)
### What this PR does / why we need it?
Fix the fp32 trans-NZ error by disabling trans-NZ for the fp32 dtype.

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-16 15:04:31 +08:00
wangxiyuan
11e6d6c291 [doc] update developer guide (#5060)
Update the developer doc for v0.11.0-dev. This PR mainly picks the developer doc
from main to v0.11.0-dev. All related features already work with 0.11.0.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-16 14:09:52 +08:00
zhangyiming
e07abfaa75 [Doc] Add new contributors. (#5066)
### What this PR does / why we need it?
[Doc] Add new contributors.

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-16 12:47:40 +08:00
zhangxinyuehfad
ca0823f238 [0.11.0][Bugfix] fix fastapi version (#5052)
### What this PR does / why we need it?
fix fastapi version

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-16 11:34:11 +08:00
Shanshan Shen
303c08aec9 [Doc] Update structured output doc with upstream link (#5058)
### What this PR does / why we need it?

Cherry-pick from main
https://github.com/vllm-project/vllm-ascend/pull/4015.

Currently, the structured output feature is used in vllm-ascend exactly the same
way as in vLLM.

Thus, IMO, it's better to remove this doc and point to the upstream doc directly,
to avoid the case where the upstream doc changes and we don't update our
doc in time, which can be misleading to users.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-16 11:32:53 +08:00
Clorist33
2b5b309133 [Bugfix]Fix precision issues in moe_mlp (vllm-ascend v0.11.0-dev) (#5023)
### What this PR does / why we need it?
Replace `group_diff[0]` with `group_list[0]` in the `cumsum_group_list` function
(moe_mlp.py), correcting the logic that converts a cumulative sum into per-group
counts.

### Does this PR introduce _any_ user-facing change?
No

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
2025-12-16 08:40:03 +08:00
zhangxinyuehfad
87c0cfafa3 [0.11.0][Bugfix] fix fastapi version (#5048)
### What this PR does / why we need it?
fix fastapi version <0.124.0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-15 23:51:38 +08:00
wangxiyuan
01a13a9b77 fix nz for quantization (#4943)
Quantization ops rely on the NZ format by force, so the NZ check should be removed for them.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-12 14:54:41 +08:00
sunchendd
5932abc446 [Bugfix] Fix the Eagle3 inference failure issue. (#4721)
### What this PR does / why we need it?
Fix the Eagle3 inference failure issue.
error message: "EngineCore encountered an issue. See stack trace (above)
for the root cause."

Fixes https://github.com/vllm-project/vllm-ascend/issues/4323

### How was this patch tested?
```
vllm serve /nfs/1_AscendPackage/05_weights_public/Qwen3-32B \
    --served-model-name Qwen3-32B \
    -tp 4 \
    --host "0.0.0.0" \
    --port "8000" \
    --trust-remote-code \
    --speculative-config '{"method":"eagle3","model":"/home/scd/qwen3_32b_eagle3/","num_speculative_tokens":4,"draft_tensor_parallel_size":1}' \
    --max-num-batched-tokens 4096 \
    --max-model-len 4096
```

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen3-32B",
        "prompt": "hi, where is the capital of France?",
        "max_tokens": 10,
        "temperature": 0
    }' | python3 -m json.tool
```

vLLM version: v0.11.0
vLLM-ascend version: v0.11.0rc2

Signed-off-by: 17764591921 <sunchend@outlook.com>
2025-12-12 14:52:29 +08:00
Clorist33
4f0dddc9ee [Bugfix] bugfix for moe_mlp in vllm-ascend/v0.11.0-dev (#4885)
### What this PR does / why we need it?
This PR fixes a bug in the moe_mlp module by correcting the arguments
passed to the torch_npu.npu_dequant_swiglu_quant function. It properly
converts group_list from a cumulative sum to per-group counts for the
group_index parameter.
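
As a rough illustration of that conversion (a minimal sketch, not the actual vllm-ascend code; names are assumptions), a cumulative-sum group list can be turned into per-group counts by differencing adjacent entries:

```python
import torch

def cumsum_to_counts(group_list: torch.Tensor) -> torch.Tensor:
    # e.g. cumsum [3, 7, 12] -> counts [3, 4, 5]
    counts = group_list.clone()
    counts[1:] = group_list[1:] - group_list[:-1]  # difference of adjacent cumsum entries
    return counts

print(cumsum_to_counts(torch.tensor([3, 7, 12])))  # tensor([3, 4, 5])
```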

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/main

---------

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: Mercykid-bash <ruanche0218@gmail.com>
2025-12-12 14:51:47 +08:00
Slightwind
9c0ad46c1a [0.11.0][Bugfix] Remove the ZMQ communication setup on the D node (#4916)
In the PD separation scenario, the D node does not need to perform get
operations, and therefore does not need to create ZeroMQ (ZMQ)
communication.
---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-12 14:37:49 +08:00
1092626063
ceadc2788d Revert "[refactor]support gatingtopk operator generalization (#4356)" (#4873)
This reverts commit c4a11a745a.

The npu_gating_top_k op caused a precision problem with Qwen3-30B, so it is reverted.

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-10 15:45:20 +08:00
linfeng-yuan
9a144bc7be [Docs][0.11.0] delete AIV env variables in DSV32 documentation (#4833)
### What this PR does / why we need it?
Delete the incorrect AIV environment variable configuration from the DeepSeek V3.2 documentation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
NA.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-12-09 15:53:53 +08:00
Mercykid-bash
8f45f9ce29 BugFix: Resolve shape mismatch in eplb update and calculation issues in quant_apply_mlp (#4777)
## Description
This PR addresses two key issues in the MoE module when redundant
experts are enabled, and fixes a calculation precision bug in the
forward inference of quantized MLP:

### 1. Shape Mismatch in EPLB Expert Map Update
- **Root Cause**: 
When redundant experts are turned on, a shape inconsistency occurs
during the expert map update in `vllm_adaptor`:
- The shape of `self.expert_map_per_layer[layer_id]` is
`[num_physical_experts,]` (aligned with physical expert count).
- The shape of `updated_expert_map` is `[num_logical_experts,]` (aligned
with logical expert count).
- Indices in `self.expert_map_per_layer[layer_id]` that exceed the
logical expert count cannot be properly mapped, leading to tensor shape
mismatch errors.
- The same shape mismatch exists in the `log2phy` map update (between
`self.log2phy_map_per_layer[layer_id]` and `updated_log2phy_map`).

- **Fix**:
- Fix the shape initialization of `expert_map_per_layer` and
`log2phy_map_per_layer` to be consistently set to
`[num_physical_experts,]` across the module lifecycle.
- Align the shape of `updated_expert_map` and `updated_log2phy_map` with
the pre-initialized physical-expert-sized tensors during update
operations, ensuring shape consistency for index mapping (see the sketch below).
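
A minimal sketch of that shape alignment (sizes and the -1 filler are assumptions for illustration, not the actual vllm-ascend code):

```python
import torch

num_logical_experts, num_physical_experts = 4, 6

# The per-layer map is sized by the physical expert count across its lifecycle.
expert_map_per_layer = torch.full((num_physical_experts,), -1, dtype=torch.long)

# The incoming update is sized by the logical expert count and must be padded first.
updated_expert_map = torch.tensor([2, 0, 3, 1])  # shape [num_logical_experts]
aligned = torch.full((num_physical_experts,), -1, dtype=torch.long)
aligned[:num_logical_experts] = updated_expert_map

expert_map_per_layer.copy_(aligned)  # shapes now match, no mismatch error
print(expert_map_per_layer)          # tensor([ 2,  0,  3,  1, -1, -1])
```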

### 2. Calculation Precision Issue in Quantized MoE MLP Forward
Inference
- **Root Cause**:
In the forward pass of `moe_mlp`, the
`torch_npu.npu_dequant_swiglu_quant` operator only accepts group lists
in **Count format** as input. However, the group list provided by
`quant_apply_mlp` was in **Cumsum format**, which caused operator input
format mismatch and degraded calculation precision.

- **Fix**:
- Convert the cumsum-formatted group list from `quant_apply_mlp` to
Count format before passing it to `torch_npu.npu_dequant_swiglu_quant`.
- Ensure the input format of the dequantization operator meets its
requirements, restoring the expected calculation precision for quantized
MoE MLP layers.

## Impact
- Resolves shape mismatch errors in EPLB expert/log2phy map updates when
redundant experts are enabled, ensuring stable expert routing.
- Fixes quantized MoE MLP forward precision issues on NPU, aligning
operator input formats with NPU kernel requirements.
- No breaking changes to existing interfaces; the fixes are
backward-compatible for scenarios without redundant experts enabled.

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-09 15:46:58 +08:00
linfeng-yuan
695e5c9ebc [0.11.0][ops] npu_top_k_top_p supports k and p only (#4153)
### What this PR does / why we need it?
With CANN 8.3 and corresponding PTA 2.7.1, `npu_top_k_top_p` supports
passing only k (1<=k<=1024) and p separately.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E performance test with only `top_k` and `p` separately. This PR gains a
0.2 ms improvement in TPOT with `batch_size=16`.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-12-09 15:45:40 +08:00
Li Wang
4588d1f215 [CI] Use arm node for unit tests (#4819)
### What this PR does / why we need it?
Use arm node for unit tests

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-09 15:45:14 +08:00
linfeng-yuan
e0757dc376 [0.11.0]fix the configuration conflicts in documentation (#4824)
### What this PR does / why we need it?
Fix configuration errors in our documentation.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
NA.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-12-09 15:37:06 +08:00
zhangxinyuehfad
033e3557cc [cherry-pick]fix qwen3vl mrope op (#4484) (#4811)
### What this PR does / why we need it?
The Qwen2.5-VL mrope precision problem will be solved once this PR is
merged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test on G8600 with textVQA dataset

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: shaopeng-666 <lishaopeng21@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-09 11:07:32 +08:00
Levi
9862a23985 【0.11.0-dev】optimization of kimi-k2 in cann8.3 (#4555)
### What this PR does / why we need it?
In CANN 8.3, the npu_moe_gating_top_k operator supports 384 experts, so
Kimi can use the operator to get better performance.
---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-09 08:49:15 +08:00
zhangxinyuehfad
0d094531b4 [bugfix] Fixed the bug in retrieving the quantization method for mlp.… (#4797)
When retrieving the quantization method for MoE (e.g., the quantization
file of DeepSeek V3.2-Exp does not match the model's naming convention in
eager mode), a KeyError is raised: "model.layers.3.mlp.experts.weight
not in self.quant_description". However, the quantization file looks like:
```bash
  "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.gate_proj.weight_scale": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.gate_proj.weight_offset": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.down_proj.weight_scale": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.down_proj.weight_offset": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.up_proj.weight_scale": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.up_proj.weight_offset": "W8A8_DYNAMIC",
```
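
A hypothetical sketch of the mismatch (the quant_description contents come from the snippet above; the lookup helper and its prefix fallback are assumptions, not the actual fix):

```python
quant_description = {
    "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
}

def get_quant_type(name):
    if name in quant_description:              # exact hit works for dense layers
        return quant_description[name]
    prefix = name.removesuffix(".weight") + "."
    for key, quant_type in quant_description.items():
        if key.startswith(prefix):             # any per-expert key under this MoE layer
            return quant_type
    return None                                # previously this case raised a KeyError

print(get_quant_type("model.layers.3.mlp.experts.weight"))  # W8A8_DYNAMIC
```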

Co-Authored-By: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-12-09 08:47:19 +08:00
Levi
4e728f1f40 [Bugfix] fix qwen3-vl-moe shape ERROR during the _prepare_inputs phase under high concurrency. (#4658)
### What this PR does / why we need it?
Earlier we fixed a similar issue for qwen2.5-vl
(https://github.com/vllm-project/vllm-ascend/issues/4430), and all the
multimodal models in vLLM v0.11.0 likely have this problem. Here, we
propose a fix specifically for qwen3-vl-moe.

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-08 19:30:16 +08:00
Wang Yixuan
d412565ec9 [Cherry-pick]bmm_transpose to v011dev (#3995)
### What this PR does / why we need it?
Add a custom op to accelerate the deepseek model. The fused op combines
bmm and transpose together and is applied to the MLA module.
Cherry-picked from commit c68ddc11ce.

### Does this PR introduce _any_ user-facing change?
No

---------

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-08 19:22:14 +08:00
Angazenn
6391f0625f [v0.11.0-dev][bugfix] Add branch for stream up-lifting in update_attn_params (#4437)
### What this PR does / why we need it?
#3985 moved the stream context initialization before the for-loops to improve
performance. However, we found that this might cause a potential accuracy
drop when used with PD disaggregation. Thus we partly revert this change
when using PD disaggregation, and we shall fix this bug in the future.

### Does this PR introduce _any_ user-facing change?
No.


---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-12-08 08:54:46 +08:00
Li Wang
2598124e67 [Image] Correcting the vllm tag of the openeuler image on the A2 device. (#4745)
### What this PR does / why we need it?
Corrected the vllm tag, which should have been v0.11.0.


Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-06 10:55:22 +08:00
offline893
350999c4ef [Bugfix]Fix eplb enable when using mtp float weights. (#4576)
### What this PR does / why we need it?
Fix enabling EPLB when using MTP float weights. This workaround will be removed
once EPLB supports MTP and float weights.

### How was this patch tested?
Deepseek-V3 + MTP + EPLB in A3.
---------

Signed-off-by: offline0806 <3337230449@qq.com>
Signed-off-by: offline893 <158537145+offline893@users.noreply.github.com>
Co-authored-by: offline0806 <3337230449@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-05 21:15:32 +08:00
1092626063
c4a11a745a [refactor]support gatingtopk operator generalization (#4356)
### What this PR does / why we need it?
This pr is cherry-pick from :
https://github.com/vllm-project/vllm-ascend/pull/2958 and
https://github.com/vllm-project/vllm-ascend/pull/4340

Past:
npu_moe_gating_top_k could only support the 'group_count=256' pattern

Now:
1. npu_moe_gating_top_k supports all sizes of group_count
2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8, TPS improve 6%
2. Qwen3, the same as before

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-04 20:10:13 +08:00
LI SHENGYONG
593a96056c 【EPLB】Eplb Redundant Experts Bugfix (#4232)
### What this PR does / why we need it?
Redundant experts bugfix
The calculation logic for redundant experts has been fixed, allowing the
correct number of redundant experts to be calculated using the map.
Therefore, there is no longer a need to set the redundant expert
parameter when passing the map.

### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure init_redundancy_expert.

### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-03 12:00:05 +08:00
Mengqing Cao
b6d63bbd52 [v0.11.0-dev][CI] Fix ngram lacking of input arg dummy_compute_logits error (#4648)
### What this PR does / why we need it?
Fix ngram lacking of input arg `dummy_compute_logits` error

### How was this patch tested?
CI passed with existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-03 09:22:07 +08:00
Levi
865f1f7fc8 [Bugfix] Resolve the interface compatibility issue of get_input_embeddings in MM (#4638)
### What this PR does / why we need it?
Resolve the interface compatibility issue of get_input_embeddings in MM,
because the get_input_embeddings function of other models does not have the
is_multimodal parameter.

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-02 22:21:47 +08:00
Levi
3b4cb23616 [Bugfix] fix qwen2.5-vl-72b shape ERROR during the _prepare_inputs phase under high concurrency. (#4553)
### What this PR does / why we need it?
qwen2.5-vl-72b reports a shape ERROR during the _prepare_inputs phase
under high concurrency (issue
https://github.com/vllm-project/vllm-ascend/issues/4430).

This PR fixes it.

The related PR in the main branch:
https://github.com/vllm-project/vllm-ascend/pull/3612

The related commit in vllm:
17c540a993/vllm/model_executor/models/interfaces.py

(The _get_text_embeddings function has been refactored into
interfaces.py in vllm.)

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-02 14:20:45 +08:00
Zetong Li
52abd47f8c [Bugfix][SHM] Use writer lock by default and remove redundant env (#4117)
### What this PR does / why we need it?
This PR aims to remove env introduced by #3988 and use lock by default.
As described in https://github.com/vllm-project/vllm/issues/27858, we
have tested the writer lock method in various scenarios and the
performance is almost unaffected. Therefore, we believe that it would be
safe to enable the lock by default and remove the redundant env
`SHM_BARRIER` now.

After discussion, we decided to preserve the env and set it to true by
default.

### Does this PR introduce _any_ user-facing change?
`SHM_BARRIER` is set as true by default.

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-01 22:27:01 +08:00
Li Wang
76d0ba4342 [Image][Build] Cherry pick #4062 from main (#4506)
### What this PR does / why we need it?
This patch aims to integrate the mooncake
[v0.3.7.2.post2](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2)
to vllm-ascend images

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-01 11:39:40 +08:00
zouyida2052
2b4f7a5016 [cherry-pick pr-4254] bugfix for mtp>1 when lm_head_tp>1 (#4360)
### What this PR does / why we need it?
Previously, the dummy run executed compute_logits only once, regardless
of num_speculative_tokens. This caused execute_model to hang on
compute_logits when lm head tensor parallelism exceeded 1. The fix
ensures compute_logits executes correctly during dummy run, matching
num_speculative_tokens.
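
A schematic of that behavior (a sketch under assumptions; names do not come from the actual code): the dummy run should issue one compute_logits call per speculative position so that lm-head tensor-parallel ranks perform the same collectives as a real step.

```python
def dummy_compute_logits(model, hidden_states, num_speculative_tokens):
    # One call per speculative position, matching num_speculative_tokens,
    # so ranks with lm_head_tp > 1 stay in lockstep instead of hanging.
    for _ in range(num_speculative_tokens):
        model.compute_logits(hidden_states)
```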

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-12-01 11:11:15 +08:00
LI SHENGYONG
cd9f5c0611 [bugfix] dep ineffective (#4416)
### What this PR does / why we need it?
The expert mapping table and weights of dynamic EPLB were not being
updated, so although accuracy was correct, the rebalancing never took
effect. This bug has now been fixed.

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-29 15:19:11 +08:00
henryxuxu0716
71acc8ddeb For nz unset in bf16&fp16 (#4495)
### What this PR does / why we need it?
Disable NZ for the float-weight case. This is only a quick fix for the dev
branch.

For the main branch, we'll consider more cases to make it more general.


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
qwen2.5 32B
<img width="441" height="221" alt="image"
src="https://github.com/user-attachments/assets/7ae18ffd-1ce2-43d9-9960-be45250ad0da"
/>

---------

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
2025-11-28 17:32:25 +08:00
Zhu Yi Lin
96c362361e [0.11.0][TEST] Delete Comment (#4428)
### What this PR does / why we need it?
Delete a Chinese comment.
Picked from https://github.com/vllm-project/vllm-ascend/pull/4427

### Does this PR introduce _any_ user-facing change?
no

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-25 21:39:36 +08:00
zhangxinyuehfad
a686f2962a [0.11.0][Bugfix] fix e2e full test (#4424)
### What this PR does / why we need it?
Pin the transformers version to 4.57.1 to fix "'dict' object has no attribute
'model_type'".

https://github.com/vllm-project/vllm-ascend/actions/runs/19660859460/job/56306822464

picked from https://github.com/vllm-project/vllm-ascend/pull/4423


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-25 21:21:42 +08:00
Shanshan Shen
cdaf7f4a51 [MM][Bugfix] Minor fix for VL model verification (#4385)
### What this PR does / why we need it?

To fix the ops test, where `model_config` has been set to `None` and doesn't
have the `hf_config` attribute, we have added a check for `model_config` to
guarantee it is not `NoneType`.

cherry-pick from main:
https://github.com/vllm-project/vllm-ascend/pull/4384.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-25 20:36:32 +08:00
wujinyuan1
386a85eccc [Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4393)
### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run
process is triggered. When calling the update_attn_params function,
the num_tokens parameter needs to be passed, and this value was obtained
from positions.shape[0]. However, the multimodal model uses mRoPE
(multi-dimensional rotary positional embeddings), which makes the positions
tensor two-dimensional, so the value obtained from positions.shape[0] is
incorrect. We solve this problem by replacing positions.shape[0] with
num_tokens.
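
A small illustration of the shape issue (values are made up, not from the PR): under mRoPE the positions tensor is two-dimensional, one row per rope section, so its first dimension is not the token count.

```python
import torch

num_tokens = 8
positions_text = torch.arange(num_tokens)                       # 1-D, shape [8]
positions_mrope = torch.zeros(3, num_tokens, dtype=torch.long)  # 2-D, shape [3, 8]

print(positions_text.shape[0])   # 8 -> equals the token count
print(positions_mrope.shape[0])  # 3 -> number of rope sections, not tokens
```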

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
2025-11-25 09:32:22 +08:00
weichen
a3164ac372 [v0.11.0][Bugfix][MoE] enable force_load_balance in aclgraph (#4367)
### What this PR does / why we need it?
Enable force_load_balance in aclgraph, solving OOM issues.
pick from https://github.com/vllm-project/vllm-ascend/pull/4366
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-25 09:16:57 +08:00
mazhixin000
75452abe1e [Doc][v11.0-dev][cherry-pick]Add single node PD disaggregation instructions (#4370)
### What this PR does / why we need it?

add single node PD disaggregation instructions for Qwen 2.5VL model.


### Does this PR introduce _any_ user-facing change?
no


---------

Signed-off-by: mazhixin <mazhixin7@huawei.com>
Signed-off-by: mazhixin000 <mazhixinkorea@163.com>
Co-authored-by: mazhixin <mazhixin7@huawei.com>
2025-11-24 17:23:11 +08:00
wangxiyuan
a2e4c3fe78 Revert "[cherry-pick][refactor]support gatingtopk operator generalization (#4050)" (#4352)
This reverts commit c87a77e8b4.

It breaks the ops e2e test.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-21 23:03:20 +08:00
SILONG ZENG
5ad0ccdc31 [v0.11.0]Upgrade cann to 8.3.rc2 (#4332)
### What this PR does / why we need it?
Upgrade CANN to 8.3.rc2

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-21 22:48:57 +08:00
LI SHENGYONG
0f9025cceb [EPLB] Eplb Verify Fix (#4334)
### What this PR does / why we need it?
Eplb Verify Fix
---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-21 18:18:15 +08:00
Ting FU
97ffb9120f [CI] Defaultly compile vllm with multimodal audio feature in dockerfile (#4324) (#4341)
### What this PR does / why we need it?
For better usability, compile vLLM with the multimodal audio feature in the
Dockerfile by default.

The image size will increase by only 2.x MB.

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-21 17:53:00 +08:00
Li Wang
218bc70f6f [CI] Remove redundant workflows (#4335)
### What this PR does / why we need it?
Remove redundant workflows; maintain a single workflow, set up on the main
branch, that controls execution for each branch instead of each branch running
its own, thus reducing resource waste.


Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-21 16:48:35 +08:00
Shanshan Shen
70f076331f [MM][Bugfix] Add error log for VL models when enabling FLASHCOMM (#4222)
### What this PR does / why we need it?

Add error log for VL models when enabling
`VLLM_ASCEND_ENABLE_FLASHCOMM1=1` or `VLLM_ASCEND_ENABLE_FLASHCOMM=1`
(for backward compatibility).

This is a temporary fix for
https://github.com/vllm-project/vllm-ascend/issues/4132.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-21 15:04:35 +08:00
LI SHENGYONG
c94b38c82e [Readme] EPLB Support Scenarios (#4315)
### What this PR does / why we need it?
Add information on the scope of EPLB support.

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:25:39 +08:00
Angazenn
9c6d0b422c [v0.11.0-dev][misc]change default capture size for Qwen3-MoE when using full dp (#4205)
### What this PR does / why we need it?
This is the dev-branch version of #4199.
Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4, 8,
16, 24, ..., max_capture_size]`. However, this is not always the best
choice in different situations. This PR changes the default
setting when running Qwen3-MoE in the full-dp (`dp_size > 1` && `tp_size ==
1`) setting, which is usually applied in large-scale EP.
old:
`[1, 2, 4, 8, 16, 24, ..., max_capture_size]`
new:
`[1, 2, 5, 10, 15, 16, 24, ..., max_capture_size]`
This is mainly because the performance of the `_npu_paged_attention` op
degrades dramatically with the old settings. We hope to provide better
performance when users do not set a specific `cudagraph_capture_size`.
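
For reference, a sketch of how a user would pin capture sizes explicitly (assuming vLLM's `compilation_config` accepts a `cudagraph_capture_sizes` list; the model name is a placeholder), in which case the new defaults are not applied:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",  # placeholder model name
    compilation_config={"cudagraph_capture_sizes": [1, 2, 5, 10, 15, 16, 24]},
)
```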
### Does this PR introduce _any_ user-facing change?
The default `cudagraph_capture_size` is modified in the above cases.
However, if `cudagraph_capture_size` has already been set by the user, this PR
won't have any influence on it.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-21 11:19:11 +08:00
shaopeng-666
b6d59bdea2 cherry pick from pr 4270 (#4285)
### What this PR does / why we need it?
avoid mrope fusion op when running qwen25vl on x86 machine

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-11-19 22:32:02 +08:00
MengLong Chen
277670730c [Bugfix][Aclgraph] failed to update graph task (#4282)
### What this PR does / why we need it?
Fix the error of the full-graph aclgraph failing to update the graph task.


Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-11-19 21:30:48 +08:00
1092626063
c87a77e8b4 [cherry-pick][refactor]support gatingtopk operator generalization (#4050)
### What this PR does / why we need it?
pick from : https://github.com/vllm-project/vllm-ascend/pull/2958
Past:
npu_moe_gating_top_k could only support the 'group_count=256' pattern

Now:
1. npu_moe_gating_top_k supports all sizes of group_count
2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`

CANN: depends on 8.3.RC1

Performance:
1. GLM4.5-w8a8, TPS improve 6%
2. Qwen3, the same as before


Signed-off-by: 1092626063 <1092626063@qq.com>
2025-11-19 10:39:28 +08:00
liziyu
ddf3e75800 [Cherry-pick] [0.11.0] pd proxy support ipv6 and fix proxy (#4242)
### What this PR does / why we need it?
The PD proxy supports IPv6; the mooncake connector checks whether an IPv6 address
is used and notifies the user.

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-18 16:33:00 +08:00
Icey
378e92a2a2 [Cherry-pick][0.11.0] Adapted to torch_npu.npu_fused_infer_attention_score (#4202)
### What this PR does / why we need it?
Fixes a compatibility bug with torch_npu.npu_fused_infer_attention_score
which is described in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 told us this solution.
cherry-pick: https://github.com/vllm-project/vllm-ascend/pull/4025

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: Icey <1790571317@qq.com>
2025-11-17 10:56:23 +08:00
zhangyiming
a7eb42cf0a [v0.11.0-dev][Bugfix][cherry-pick]bugfix for weight load of kimi-k2 (#4190)
### What this PR does / why we need it?
This is a cherry-pick from #3798.

Fix the kimi-k2 startup bug (weight load
error): https://github.com/vllm-project/vllm-ascend/issues/3785

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: Levi <54832289+Levi-JQ@users.noreply.github.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
2025-11-14 15:43:22 +08:00
weichen
51e5806d76 [0.11.0-dev][Bugfix][EPLB] Quick fix for missing log2phy conversion (#4150)
### What this PR does / why we need it?
Quick fix for missing log2phy conversion in MC2 token_dispatcher, which
has been already fixed in main branch
https://github.com/vllm-project/vllm-ascend/pull/3512.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-11-13 14:32:40 +08:00
zhaozx-cn
cd652acb65 [BugFix] Fix kv_no_split not contiguous (#3711)
allgather needs contiguous data, but the split operation returns non-contiguous
data.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zhaozx-cn <zhaozx2116@163.com>
2025-11-13 11:29:37 +08:00
Angazenn
28a15299ea [cherry-pick][v0.11.0-dev][bugfix] Change seq_lens in dummy attn_metadata to max_query_len (#4099)
### What this PR does / why we need it?
This is a cherry-pick from #4097.
Currently, we set `seq_lens` in the dummy attn_metadata to
`max_model_len` to get the max workspace for attention during capturing.
However, setting it consistently to `max_model_len` causes dummy_run
to execute a long attention when running actual inference. For example,
if there is a single req with `seq_lens` of [8] but `max_model_len` is
131072, the whole process is slowed down by dummy_run as it executes a
fake long-seq attention. Therefore, we instead set it to max_query_len,
which is also consistent with the vLLM GPU implementation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-12 20:32:50 +08:00
zhangxinyuehfad
7732a89fd9 [v0.11.0][UT][Fixbug] Fix UT test (#4151)
### What this PR does / why we need it?
Fix UT test
Backport: https://github.com/vllm-project/vllm-ascend/pull/4116

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-12 16:55:18 +08:00
zhaomingyu13
650ce8ad19 [0.11.0][Bugfix] Fix ngram precision issue and open e2e ngram test (#4092)
### What this PR does / why we need it?
Fix ngram precision issue and open e2e ngram test
---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-11-11 09:58:03 +08:00
Angazenn
2069bef449 [v0.11.0-dev][bugfix] Fix a bug in wrongly set npu_stream (#4106)
### What this PR does / why we need it?
This PR fixes a bug introduced in #3985, which set the wrong npu_stream
(possibly by mistake during the cherry-pick). I correct it and make
`update_attn_params` consistent with the main branch.

### Does this PR introduce _any_ user-facing change?
No.

Signed-off-by: Angazenn <supperccell@163.com>
2025-11-11 09:16:41 +08:00
Icey
c5fe179cef [0.11.0] [Cherry-pick #4058] Fixes Qwen3-Next enable nz accuracy problem (#4056)
### What this PR does / why we need it?
- Fixes Qwen3-Next enable nz accuracy problem

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
2025-11-10 20:56:39 +08:00
rjg-lyh
ebd45b6596 [V0.11.0][Core] Restore scheduling logic under default configuration (#4094)
### What this PR does / why we need it?
Cherry-pick #3967 from main branch. This PR reverts the changes
introduced in PR #2894. Initially, due to performance issues with the
older version of the chunked prefill ops, the default behavior was to
use the Ascend scheduler to disable the chunked prefill feature.
However, with the improvements in the performance of the new chunked
prefill ops, this interception strategy has been removed. This change
also aligns with the community's default configuration behavior.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-11-10 20:02:23 +08:00
XiaoxinWang
c3c9138719 [Perf] Move attention update stream out of loop to optimize performance (#3985)
### What this PR does / why we need it?
In the `update_*attn_params` functions, the
`torch.npu.stream(update_stream)` context manager was previously located
inside the for-loop that updates parameters for each layer. This
resulted in redundant stream initiations for every layer, adding
unnecessary overhead.

This commit refactors the code by moving the stream context manager to
wrap the entire for-loop. This ensures that the update stream is
initiated only once per function call, rather than for each layer. This
change reduces latency by about 90 us in each decode.
Update stream in every layer: [profiling screenshot attached to the PR]

Update stream moved out of the per-layer loop: [profiling screenshot attached to the PR]

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-11-10 17:18:45 +08:00
zhangxinyuehfad
d913f9474b [0.11.0][Fix] Fix Qwen2-Audio-7B-Instruct accuracy test (#4018)
### What this PR does / why we need it?

Fix Qwen2-Audio-7B-Instruct accuracy test

Backport:https://github.com/vllm-project/vllm-ascend/pull/4017

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-10 11:54:30 +08:00
hucong
7ea17fbee3 [0.11.0][BugFix] Improve the performance of prefixcache features (#4021)
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/4022

The code bug caused an idle bubble. When the npu_paged_cache_load
operator was called, it forcibly transferred seq_len2 to the device,
which triggered synchronization and interrupted the CPU operator
launch stream.


---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-11-10 11:51:34 +08:00
wangxiaoteng888
c2d58c0655 [P/D][BugFix][v0.11.0-dev]Fix proxy format processing errors & Layerwise connector performance optimization (#4069)
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-11-09 09:55:10 +08:00
wangx700
55e37f5041 [v0.11.0][Bugfix] fix sleepmode level2 e2e test (#4023)
### What this PR does / why we need it?
fix sleepmode level2 e2e test

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
use e2e tests

Signed-off-by: wangx700 <wangxin700@huawei.com>
2025-11-08 14:11:15 +08:00
tingfu
f9842560cb [0.11.0][Perf] Add padding vision tower for Qwen2_5_Omni (#4041)
### What this PR does / why we need it?
This PR replaces the vision tower in the Qwen2.5-Omni-Thinker model,
Qwen2_5_VisionTransformer, with AscendQwen2_5_VisionTransformer, which
uses QKV padding for better performance.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-08 13:56:05 +08:00
zxr2333
d4e2a44307 [Cherry Pick from pr#3981][0.11.0][P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3983)
### What this PR does / why we need it?
Make kv-transfer env variable take effect & Fix load-balance proxy.
Cherry Pick from #3981

---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-11-08 13:52:33 +08:00
offline893
8e72758645 [BugFix]Fix grouplist type of mc2. (#4049)
### What this PR does / why we need it?
Fix the accuracy problem of EPLB caused by the PTA upgrade. This is a backport
of #4047.

### How was this patch tested?
Main:
baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |

   EPLB:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 87.50 |
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-11-07 17:43:23 +08:00
lilinsiman
016337eaec [v0.11.0][UT] Add new ut case for aclgraph enable (#4038)
### What this PR does / why we need it?
add new ut case for aclgraph enable

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-11-07 11:35:24 +08:00
Angazenn
f9494d978a [cherry-pick][v0.11.0-dev][bugfix] Fix a rare bug triggered by _npu_paged_attention in FULL_DECODE_ONLY mode (#3987)
### What this PR does / why we need it?
This is a cherry-pick from #3986.

This PR fixes a bug where the workspace of `_npu_paged_attention` at
setup time is smaller than at execution time. In the current implementation of
FULL_DECODE_ONLY with `_npu_paged_attention`, we use
`_npu_paged_attention_get_workspace` when capturing, with `max_model_len`
as `seq_lens`. This assumes that PA with larger `seq_lens` inputs should
need a larger workspace than with smaller `seq_lens`. However, there are rare
cases where PA with smaller `seq_lens` needs more space. So I add
`get_workspace` directly into `update_attn_params`.
This change might introduce a slight (≈1%) performance degradation for
small num_tokens (such as 1) in the decode phase, and there are no other known
memory issues, so I think this change is acceptable. We can remove this
if a new attention op (such as `npu_fused_infer_attention_score`) does not
have such problems.


Signed-off-by: Angazenn <supperccell@163.com>
2025-11-06 23:08:57 +08:00
Shanshan Shen
27547a10e6 [MM][Bugfix] Add MoE verification for multi-modal models (#3897) (#4027)
### What this PR does / why we need it?

Fix #3891.

The empty `moe_comm_method` in the above issue is due to the wrong
check for MoE models. To be specific, the method `is_moe_model` only
checks whether a text-only model is a MoE model, without considering
multi-modal models, e.g., `VL` and `Omni`.

Check the config dict recursively to find whether it has a key that contains
"expert", without checking the model architecture.

It is worth noting that, we can't verify a model by if it contains
`FusedMoE` module because `is_moe_model` is called somewhere before the
model loading, e.g., it's called when updating the ACLGraph config in
platform initialization.
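
A minimal sketch of such a recursive check (assumed helper name, not the actual `is_moe_model` implementation):

```python
def has_expert_key(config):
    # Recursively look for any key that contains "expert" in the HF config dict.
    for key, value in config.items():
        if "expert" in key:
            return True
        if isinstance(value, dict) and has_expert_key(value):
            return True
    return False

print(has_expert_key({"text_config": {"num_experts": 64}}))  # True (multi-modal MoE)
print(has_expert_key({"hidden_size": 4096}))                 # False (dense model)
```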

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-06 20:30:40 +08:00
zzzzwwjj
3db53d117e [0.11.0][doc] add aclgraph developer guide (#3947)
### What this PR does / why we need it?
Add aclgraph developer guide.

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-06 09:54:38 +08:00
wangxiyuan
7ee0b0b5d8 [cherry-pick]Upgrade CANN to 8.3.rc1 (#3945) (#3962)
This PR upgrades CANN from 8.2.RC1 to 8.3.RC1 and removes the CANN version
check logic.

TODO: we noticed that UT runs fail with the CANN 8.3 image, so the base
image for UT is still 8.2. We'll fix it later.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-06 09:05:08 +08:00
Zetong Li
66b67f9cf2 [Bugfix][SHM] Fix weak memory ordering problem in share memory (#3988)
### What this PR does / why we need it?
This PR aims to fix a weak-memory-ordering problem in shared memory by
patching the message queue with an additional lock. The detailed issue can
be found at https://github.com/vllm-project/vllm/issues/27858. The key
point is to use the writer lock to enforce a memory fence before the ready
flag `metadata_buffer[0] = 1` is set.
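
A schematic of that pattern (buffer layout and names are assumptions, not the actual message-queue code): the writer takes a lock around the payload write, so the lock release acts as a fence before the ready flag is raised.

```python
import multiprocessing as mp

writer_lock = mp.Lock()

def publish(metadata_buffer, payload_buffer, data):
    with writer_lock:                      # acquire/release provides the memory fence
        payload_buffer[:len(data)] = data  # write the payload first
        metadata_buffer[0] = 1             # ready flag set only after the payload write
```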

This is a temporary solution, and you can use it by setting env
`SHM_BARRIER=true`. By default, we disable this modification.

### Does this PR introduce _any_ user-facing change?
`SHM_BARRIER=true` enables this change while `SHM_BARRIER=false`
disables this change. The latter is the default choice.

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-11-04 23:07:23 +08:00
zxr2333
954dab64fb [v0.11.0][P/D]Set adxl as default backend and update readme (#3771)
### What this PR does / why we need it?
Set adxl engine as the default Mooncake backend, because Ascend
Transport is no longer maintained.
Update the README to include instructions for installing Mooncake with the adxl
backend.

### Does this PR introduce _any_ user-facing change?
Users need to compile and install the mooncake backend for adxl
according to the revised README instructions.

### How was this patch tested?
By CI.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-11-04 16:06:58 +08:00
leo-pony
0cead5c1ee Quality enhancement: Immediately interrupt execution when allocate NPU memory OOM (#3944)
### What this PR does / why we need it?
Guard against the point where the problem first occurs: execution should
be interrupted when the NPU memory allocation fails, rather than
waiting until an illegal address is accessed.


### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-04 08:55:22 +08:00
Mengqing Cao
7cc6208029 [0.11.0][MTP][Aclgraph] Fix the support aclgraph with MTP (#3912)
### What this PR does / why we need it?
Fix 2 breakages of aclgraph with MTP:
1. deepseekmtp in vllm 0.11.0 does not support aclgraph and lacks the
`support_torch_compile` decorator
2. There is a d2h synchronization in the original forward of the MTP
predictor. The fix PR in vllm:
https://github.com/vllm-project/vllm/pull/27643

As we'll fix it in vllm main, this fix pr is only needed in branch
v0.11.0-dev

The profiling shows that MTP replays in aclgraph now: [profiling screenshot attached to the PR]

### How was this patch tested?

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-11-03 14:25:37 +08:00
wangxiyuan
8a7154001e [0.11.0]Chery pick pta upgrade change (#3940)
This PR cherry-picks two commits from main to upgrade torch-npu to the 2.7.1
official release.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-31 22:14:26 +08:00
rjg-lyh
3d81ea03ed [v0.11.0-dev][bugfix] fix valueError in static_forward_context when prefix is empty (#3929)
### What this PR does / why we need it?
This PR temporarily bypasses the scenario where some models in vLLM
trigger a `ValueError` during the process of storing values in
`static_forward_context` when no `prefix` is specified for the linear
layers, which is a bug in some models in vLLM. The official fix will be
addressed by submitting a PR to the vLLM community that specifies a
prefix for the linear layers in each model.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-31 15:45:06 +08:00
Nagisa125
9f7de45b75 [Bugfix] fix MTP support for lmhead_tensor_parallel_size (#3921)
### What this PR does / why we need it?
Fix the issue where enabling MTP and setting
lmhead_tensor_parallel_size=16 causes the inference to hang.


Signed-off-by: wyh145 <1987244901@qq.com>
2025-10-31 14:34:28 +08:00
lilinsiman
ee2e55e602 [v0.11.0][Test] Add new test model for aclgraph single_request v0.11.0 (#3889)
### What this PR does / why we need it?
add new test model for aclgraph single_request v0.11.0

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 11:23:55 +08:00
zouyida2052
90aca84e60 fix bug when max_seqs=14 in mtp=2 scenario and raise error when cudagraph_capture_sizes can't be an integer multiple of uniform_decode_query_len (#3909)
### What this PR does / why we need it?
1. Revert [bugfix for mtp in
fullgraph](0948483642)
and re-support it once vllm supports it
2. Raise an error when cudagraph_capture_sizes is not an integer multiple
of uniform_decode_query_len
3. Bugfix for max_num_seqs=14 in the mtp=2 scenario

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-31 09:25:06 +08:00
lilinsiman
387ce1cc5b add new e2e tests case for aclgraph memory to v0.11.0 (#3880)
### What this PR does / why we need it?
add new e2e tests case for aclgraph memory to v0.11.0

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-31 09:17:09 +08:00
wangxiaoteng888
38afd2c9cb [bugfix_v0.11.0]cancel tokenize for layerwise_proxy (#3913)
### What this PR does / why we need it?
cancel tokenize for layerwise_proxy
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-10-30 23:55:04 +08:00
wangxiaoteng888
af7a56550b [bugfix_v0.11.0-dev] layerwise D first plan (#3907)
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-10-30 22:21:11 +08:00
offline893
d5a9aba03f [BugFix]Fix group list type of mc2. (#3890)
### What this PR does / why we need it?
Fix the precision issue caused by the inconsistency between the group
list type used by mc2 and that of eplb.

---------

Signed-off-by: offline0806 <3337230449@qq.com>
2025-10-30 21:44:14 +08:00
weichen
c506ba60fb [v0.11.0] [Bugfix] [MoE]fix error in deepseek when using allgather (#3827)
### What this PR does / why we need it?
After refactoring vllm_ascend/models and FusedMoE, we are unable to pass
`gate` from deepseekv2.py to `AscendFusedMoE.forward`, which results
in an error when running deepseek v3/r1 with allgather.
Hence, this PR removes `gate`-related computations from the FusedMoE module
in eager/aclgraph mode.
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph.
### How was this patch tested?
e2e & ut

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-30 14:59:46 +08:00
whx
211d4b9da4 [BugFix] Fix mlapo accuracy problem related with weight processing. (#3857)
This PR fixes an mlapo accuracy problem related to weight processing.
Furthermore, it modifies the mlapo-related e2e test to use a quantized deepseek
model to make it effective.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-30 00:35:50 +08:00
zouyida2052
d9249c968e bugfix for mtp in fullgraph (#3878)
### What this PR does / why we need it?
bugfix for mtp in fullgraph

### Does this PR introduce _any_ user-facing change?
no

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-29 23:52:20 +08:00
fems14
19f49ecb5f [0.11.0][Bugfix]fix_mulit_connector_bug (#3332) (#3882)
### What this PR does / why we need it?
When using the multi connector, the multi connector does not define
get_finished_count, which will cause the kv cache to be released.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19


Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: baxingpiaochong <771405853@qq.com>
2025-10-29 23:44:52 +08:00
liziyu
e5b938c5fe [v0.11.0] [P/D] force with_prefill true after allreduce in kv producer (#3835)
### What this PR does / why we need it?
force with_prefill true after allreduce in kv producer. This is a backport of #3768 and #3849

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-10-29 23:14:00 +08:00
Wang Yixuan
b323be9fe4 deepseek torchair adapt for torch_npu version (#3876)
### What this PR does / why we need it?
To adapt the torch_npu version to avoid the precision problem of
torchair deepseek. The torch_npu version may result in the different
branches in the ops register, the rms_norm ops has two branches
according to the verson_check, this pr unify the rms_norm in torchair by
patch method. #3862

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-29 22:44:44 +08:00
realliujiaxu
29bd9235ed [v0.11.0][Perf] Delete redundant operations in model_runner and forward_context (#3775)

cherry pick https://github.com/vllm-project/vllm-ascend/pull/3677

Remove redundant operations from `model_runner` and `forward_context`.
This optimization can significantly reduce the idle time (bubble) before
decoding when running models with small parameter counts (e.g.,
Qwen/Qwen2.5-0.5B).

Testing on 800I A2, the bubble is reduced from 3.8 ms to 2.8 ms:
Before: [profiling screenshot attached to the PR]

After: [profiling screenshot attached to the PR]

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-29 15:58:53 +08:00
zhangxinyuehfad
75de3fa172 [v0.11.0][Doc] Update doc (#3852)
### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-29 11:32:12 +08:00
ZYang6263
6188450269 [v0.11.0][Bugfix]Avoid using the fusion operator in the MOE model (#3837)
### What this PR does / why we need it?
The current MatmulReduceScatter operator experiences performance
degradation in small-shape scenarios, so it determines whether to use
this operator by judging the size of the shape.


---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-28 23:31:19 +08:00
Shirley125
e48ca0b6ec [bugfix][0.11]fix proxy decode bug (#3751)
### What this PR does / why we need it?
fix proxy decode bug while parsing non-UTF-8 characters.

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
2025-10-27 16:56:50 +08:00
Yizhou
43276fd822 [v0.11.0][Fix] Prevent memory leak in MLA decode graph (#3743) (#3774)
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.

This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
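
An illustrative sketch of caching through weak references (cache structure and names are assumptions, not the actual MLA decode-graph code):

```python
import weakref
import torch

cache = {}

def remember(key, tensor):
    cache[key] = weakref.ref(tensor)  # the cache holds no strong reference

def lookup(key):
    ref = cache.get(key)
    return ref() if ref is not None else None  # None once the tensor was freed

t = torch.zeros(4)
remember("decode_params", t)
print(lookup("decode_params") is t)  # True while a strong reference exists
del t
print(lookup("decode_params"))       # typically None after the last strong reference is gone
```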

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-27 16:00:20 +08:00
Ruri
825fdfb197 [v0.11.0][Feat] Prefetching Attention QKV Linear Weight With AddRmsNormQuant Custom Op (#3649)
### What this PR does / why we need it?

- `qkv_proj.weight` prefetching was implemented with the `Quant` op;
when `AddRmsNormQuant` is enabled (#3465), `qkv_proj.weight` prefetching
won't work
- Implement `qkv_proj.weight` prefetching with `AddRmsNormQuant`, which
has been merged on the `main` branch (#3517)

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Tested on `Qwen3-235B-A22B-W8A8`
<img width="1868" height="109" alt="image"

src="https://github.com/user-attachments/assets/0bc28082-0287-4d5c-b8f6-f907c3134d36"
/>


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------


Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-10-27 09:42:09 +08:00
Mengqing Cao
1b16c01afd [v0.11.0-dev][Installation] limit opencv-python-headless version to resolve numpy version conflict (#3767)
### What this PR does / why we need it?
vllm requires opencv-python-headless >= 4.11.0, which in turn requires
numpy >= 2 and < 2.3.0, but vllm-ascend's numpy version must be less than
2.0.0, so limiting opencv-python-headless to below 4.11.0.86 fixes this
conflict.

backport of
afc58184ec

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
2025-10-25 18:18:28 +08:00
whx
a58ff9e92f [Cherry-pick] Port MoE multi-stream fix to v0.11.0-dev (#3753)
This PR moves the communication operation of shared experts out of the
extra stream, because it might cause rtMemcpy-related errors when running
shared-expert multi-stream with aclgraph.

Furthermore, a global variable is used as the extra stream object to avoid
allocating a stream for each layer in full-graph mode.
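
A minimal sketch of the global-stream idea (hypothetical names, assuming torch_npu mirrors torch.cuda's stream API): keep one module-level extra stream and reuse it instead of allocating a new stream per layer.

```python
import torch
import torch_npu  # noqa: F401  -- registers the torch.npu device namespace

_EXTRA_STREAM = None

def get_extra_stream() -> "torch.npu.Stream":
    # lazily create a single shared stream and reuse it for every layer
    global _EXTRA_STREAM
    if _EXTRA_STREAM is None:
        _EXTRA_STREAM = torch.npu.Stream()
    return _EXTRA_STREAM
```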

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 15:51:43 +08:00
Yizhou
1bc61031e5 [v0.11.0][Fix] Cap max tokens to prevent potential OOM (#3720) (#3744)
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.

This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 15:46:56 +08:00
fems14
99e154dc84 [0.11.0] cherry-pick from #3747 (#3746)
cherry-pick from #3747

correct _register function place for mooncacke

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-25 14:21:30 +08:00
shaopeng-666
fed8145aea [cherry-pick][Feat] Add mrope fusion op#3708 (#3735)
### What this PR does / why we need it?
Add mrope fusion op for Qwen2.5-VL. This mrope operator doesn't support
Qwen3-VL currently, so it only takes effect in Qwen2.5-VL.
Cherry-picked from 39b994a987

CI passed with existing test

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-25 11:41:23 +08:00
whx
0644113c35 [BugFix] cherry-pick PR 3736 to v0.11.0-dev (#3737)
This PR comments out the newly added VLM e2e test for the Ascend scheduler
scenario, because it gets stuck when running multi-batch. It needs to be
added back after this issue is resolved.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 10:35:14 +08:00
whx
5a2c5be229 [BugFix][Cherry-pick] Cherry-pick PR 3675 to v0.11.0-dev (#3732)
This PR cherry-picks the bugfix related to running multi-modal models
with AscendScheduler to v0.11.0-dev

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-10-25 09:41:51 +08:00
hucong
12bc78d252 [v0.11.0][BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3686)
### What this PR does / why we need it?
Modify the recalculation logic to prevent waiting requests from filling
up the D node KVCache

Signed-off-by: underfituu <hzhucong@163.com>
2025-10-25 09:15:42 +08:00
ZYang6263
5c0a23f98b [0.11.0][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. (#3725)
### What this PR does / why we need it?
This PR boosts performance by introducing a fused kernel for the matrix
matmul and reduce scatter operations. It supports both unquantized
(e.g., BFloat16) and W8A8 quantized models.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-25 08:20:43 +08:00
fems14
17dd9ae42c [0.11.0][bugfix]look up multi_tp key (#3699) (#3723)
### What this PR does / why we need it?
In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the
first GPU card. When keys on other cards are released, the query result
still returns as successful, introducing accuracy issues. This PR
modifies the KV pool's query logic to check all cards, resolving this
problem.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 18:22:45 +08:00
fems14
f0eb3e1d97 [v0.11.0][bugfix]kvpool sync load (#3698) (#3722)
### What this PR does / why we need it?
In certain scenarios, the performance of synchronously loading data from
the pool is better than that of asynchronously loading data. Therefore,
a control logic (or switch) for asynchronous loading from the pool has
been added.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 18:21:46 +08:00
何必问
33514a4cc2 [Bugfix] The server fails to locate the request, leading to the server hanging. (#3721)
### What this PR does / why we need it?
fix bug: in the mooncake pooling scenario, when the client closes the
request, the server fails to locate the request, leading to the server
hanging.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Pull up the PD-separated pooling service, send requests using aisbench,
press CTRL+C twice, and check whether the vllm_ascend service exits.

---------

Signed-off-by: linhebiwen <linhebiwen@gmail.com>
2025-10-24 17:41:29 +08:00
offline893
4e21b1537e [BugFix] Check all expert maps when using multi instance. (#3662)
### What this PR does / why we need it?
Check all expert maps when using multiple instances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Qwen 235B on double A3.
Case 1: master has an expert map, slave has no expert map.
Case 2: master has an expert map, slave has an incorrect expert map.
Case 3: master has an expert map, slave has the correct expert map.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-24 17:10:31 +08:00
wangxiyuan
b321e3846a [cherry-pick]【main】patch sched_yield (#3648) (#3687)
### What this PR does / why we need it?
On Arm systems, os.sched_yield() does not take effect, causing the GIL
(Global Interpreter Lock) to remain unrelinquished and resulting in
CPU-bound issues. This PR applies a patch to sched_yield in vLLM, making
the process execute time.sleep(0) instead to release the GIL.

### Does this PR introduce _any_ user-facing change?

Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
2025-10-24 00:24:58 +08:00
Wang Yixuan
d0086d432a fix deepseek torchair recompile (#3679)
### What this PR does / why we need it?
PR #3624 fixed the precision of deepseek torchair, but didn't consider
the limitation of torch compile, which results in recompilation. This PR
fixes this problem. PR to main: #3678


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-23 22:53:13 +08:00
Slightwind
d2d19a4c3c [v0.11.0][bugfix] Add 'layer_type' param to get_pergroup_param() for compatibility (#3684)
Resolves a `TypeError: got an unexpected keyword argument 'layer_type'`.

A recent change (PR #3311) started passing the `layer_type` argument
when calling `get_pergroup_param()`. This specific implementation does
not use this parameter, causing the error.

This patch adds `layer_type=None` to the method signature to maintain
API compatibility and ignore the unused argument.
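
A minimal sketch of this compatibility pattern (the class name and the other parameter are placeholders, not the actual vllm-ascend signature):

```python
class SomeAscendQuantScheme:
    def get_pergroup_param(self, prefix: str, layer_type=None) -> dict:
        # layer_type is accepted only so newer callers (PR #3311) don't raise
        # TypeError; this implementation intentionally ignores it.
        return {}
```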

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-10-23 21:26:50 +08:00
liziyu
f3ea657e93 [0.11.0][Bugfix] fix delay free prefill req & D node support prefix cache (#3609)
### What this PR does / why we need it?
Fix mooncake connector. In scenarios where TP sizes are not equal, when the
prefill TP size is less than the number of key-value heads,
_get_remote_tp_ranks_for_req returns a list of np.arrays. Performing a
membership check such as `int in <list of np.arrays>` raises an error.
Converting the list of np.arrays into a single np.array resolves this
issue.
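
An illustrative numpy sketch of the failure mode described above (values are made up):

```python
import numpy as np

remote_ranks = [np.array([0, 1]), np.array([2, 3])]  # list of np.arrays
# `2 in remote_ranks` raises "ValueError: The truth value of an array with
# more than one element is ambiguous", because list membership compares the
# int against each whole array.
flat_ranks = np.concatenate(remote_ranks)  # single np.array
assert 2 in flat_ranks                     # membership check now works
```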

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
qwen235B
P tp16, D tp1
P tp8, D tp1
P tp4, D tp1
P tp8, D tp2


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-10-23 20:39:35 +08:00
ZYang6263
6975d46627 [v0.11.0][Perf] Eliminating the zerolike operator through patch (#3632)
### What this PR does / why we need it?
There is a zero-like operator before the attention operation in each
decoding stage. After analysis, this operator can be eliminated. The
purpose of this PR is to remove this operator and improve performance.

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-23 14:49:28 +08:00
rjg-lyh
74903af460 [v0.11.0][refactor] refactor SequenceRowParallelOp forward (#3654)
### What this PR does / why we need it?
This PR refactors SequenceRowParallelOp forward. In order to further
expand the operator inclusion scope in dynamic judgment scenarios, this
PR customizes the entire matmul computation and communication as a
custom operator masking. With this refactor, it will support directly
writing code such as common operation fusion into the
SequenceRowParallelOp class's member function matmul_and_reduce, without
the need to register more redundant custom masking operators.

### How was this patch tested?
CI passed with new added/existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-23 14:45:49 +08:00
Yizhou
54bd531db8 [v0.11.0][Fix] Fix attention metadata handling for profiling and MLA (#3636) (#3643)
### What this PR does / why we need it?
This is a port PR of #3636 .

Move the creation of dummy attention metadata to occur after the ACL
graph runtime mode is determined. This ensures the metadata is
initialized with the correct configuration during a profile run.

Additionally, remove the `attn_metadata` existence check before updating
MLA attention parameters. This change prevents the update from being
skipped when metadata is not yet available, ensuring parameters are set
correctly.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-23 10:29:30 +08:00
whx
6464c97ff9 [BugFix][v0.11.0] Fix quantization related mtp bug with patch (#3619)
vLLM 0.11.0 didn't include PR
(https://github.com/vllm-project/vllm/pull/25805), thus missing the
prefix of MTP's SharedHead. This PR fixes this bug with a patch to
vLLM's deepseek_mtp.

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-22 23:06:09 +08:00
Zetong Li
6e72bfdc50 [v0.11.0] cherry-pick Fix performance degradation when mtp>1 (#3597) (#3630)
### What this PR does / why we need it?
cherry-pick Fix performance degradation when mtp>1 (#3597)

This PR aims to fix performance degradation when mtp>1. Since mtp>1 may
result in more tokens (i.e. a larger batch size) than the ACL graph maximum
batch size, the draft model would otherwise run in eager mode.

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-10-22 22:07:39 +08:00
zouyida2052
a989fef5de unify logic between aclgraph and torchair (#3602)
### What this PR does / why we need it?
unify logic between aclgraph and torchair. This is a cherry-pick of https://github.com/vllm-project/vllm-ascend/pull/3560

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-22 21:55:06 +08:00
Wang Yixuan
edccd46d74 fix deepseek torchair precision (#3635)
### What this PR does / why we need it?
The precision of deepseek torchair was broken by #3465, due to the original patch of rmsnorm in torchair. This PR fixes the precision of deepseek torchair.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-22 20:20:32 +08:00
Yizhou
984efdc0d0 [v0.11.0][Fix] Fixes attribute error in MLA implementation (#3617)
### What this PR does / why we need it?
Corrects the attribute access for retrieving the device from `q_a_proj`
to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not
exist on the class instance.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Need MLAPO tests.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-22 15:49:18 +08:00
wangxiyuan
a0c3b8dd2d [v0.11.0]cherry-pick fix ut (#3608) (#3614)
cherry-pick fix ut (#3608)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-22 14:14:15 +08:00
offline893
726bc8aa2a [CI]fix test nightly workflow. (#3604)
Add the nightly test back; it was deleted by mistake.

Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-22 10:34:03 +08:00
1511 changed files with 69835 additions and 226881 deletions


@@ -1,115 +0,0 @@
# vLLM Ascend skills
This directory contains the skills for vLLM Ascend.
Note: Please copy the skills directory `.agents/skills` to `.claude/skills` if you want to use the skills in this repo with Claude code.
## Table of Contents
- [vLLM Ascend Model Adapter Skill](#vllm-ascend-model-adapter-skill)
- [vLLM Ascend main2main Skill](#vllm-ascend-main2main-skill)
- [vLLM Ascend Release Note Writer Skill](#vllm-ascend-release-note-writer-skill)
## vLLM Ascend Model Adapter Skill
Adapt and debug models for vLLM on Ascend NPU — covering both already-supported
architectures and new models not yet registered in vLLM.
### What it does
This skill guides an AI agent through a deterministic workflow to:
1. Triage a model checkpoint (architecture, quant type, multimodal capability).
2. Implement minimal code changes in `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`.
3. Validate via a two-stage gate (dummy fast gate + real-weight mandatory gate).
4. Deliver one signed commit with code, test config, and tutorial doc.
### File layout
| File | Purpose |
| ---- | ------- |
| `SKILL.md` | Skill definition, constraints, and execution playbook |
| `references/workflow-checklist.md` | Step-by-step commands and templates |
| `references/troubleshooting.md` | Symptom-action pairs for common failures |
| `references/fp8-on-npu-lessons.md` | FP8 checkpoint handling on Ascend |
| `references/multimodal-ep-aclgraph-lessons.md` | VL, EP, and ACLGraph patterns |
| `references/deliverables.md` | Required outputs and commit discipline |
### Quick start
1. Open a conversation with the AI agent inside the vllm-ascend dev container.
2. Invoke the skill (e.g. `/vllm-ascend-model-adapter`).
3. Provide the model path (default `/models/<model-name>`) and the originating issue number.
4. The agent follows the playbook in `SKILL.md` and produces a ready-to-merge commit.
### Key constraints
- Never upgrade `transformers`.
- Start `vllm serve` from `/workspace` (direct command, port 8000).
- Dummy-only evidence is not sufficient — real-weight validation is mandatory.
- Final delivery is exactly one signed commit in the current repo.
### Two-stage validation
- **Stage A (dummy)**: fast architecture / operator / API path check with `--load-format dummy`.
- **Stage B (real)**: real-weight loading, fp8/quant path, KV sharding, runtime stability.
Both stages require request-level verification (`/v1/models` + at least one chat request),
not just startup success.
## vLLM Ascend main2main Skill
Migrate changes from the main vLLM repository to the vLLM Ascend repository, ensuring compatibility and performance optimizations for Ascend NPUs.
### What it does
This skill facilitates the process of:
1. Identifying changes in the main vLLM repository.
2. Applying necessary modifications for Ascend support.
3. Validating the changes in an Ascend environment.
4. Delivering a ready-to-merge commit with optimized code and configurations.
### Quick start
1. Open a conversation with the AI agent inside the vllm-ascend dev container.
2. Invoke the skill (e.g. `/main2main`).
3. The agent follows the playbook and produces a ready-to-merge commit.
## vLLM Ascend Release Note Writer Skill
You just need to say: `Please help me write a 0.13.0 release note based on commits from v0.11.0 and releases/v0.13.0`
### What it does
This skill guides you through a structured workflow to:
1. Fetch commits between two versions using the provided script.
2. Analyze and categorize each commit in a CSV workspace.
3. Draft highlights and write polished release notes.
4. Generate release notes organized by category (Features, Hardware Support, Performance, Dependencies, etc.).
### File layout
| File | Purpose |
| ---- | ------- |
| `SKILL.md` | Skill definition, workflow, and writing guidelines |
| `references/ref-past-release-notes-highlight.md` | Style and category reference for release notes |
| `scripts/fetch_commits-optimize.py` | Script to fetch commits between versions |
### Quick start
1. Open a conversation with the AI agent.
2. Invoke the skill (e.g. `/vllm-ascend-release-note-writer`).
3. Follow the workflow steps:
- Fetch commits between versions
- Analyze commits in CSV format
- Draft and edit highlights
4. Output files are saved to `vllm-ascend-release-note/output/$version`
### Key guidelines
- Use one-level headings (###) for sections in a specific order: Highlights, Features, Hardware and Operator Support, Performance, Dependencies, Deprecation & Breaking Changes, Documentation, Others.
- Focus on user-facing impact and include context for practical usage.
- Verify details by checking linked PRs (use GitHub API for descriptions if needed).
- Keep notes concise and avoid unnecessary technical details.


@@ -1,277 +0,0 @@
---
name: main2main
description: "The main2main skill guides an AI agent to adapt the latest vLLM main branch code for vLLM Ascend project."
---
# main2main Skill
This skill guides AI agents to adapt the latest vLLM main branch code for the vLLM Ascend project.
## Workflow
### 1. Get Current vLLM Version Information for vLLM Ascend
Find the vLLM version information for the **main branch** in `docs/source/community/versioning_policy.md` under the `Release compatibility matrix` section:
- **Current adapted vLLM commit**: Format like `83b47f67b1dfad505606070ae4d9f83e50ad4ebd, v0.15.0 tag`
- **Compatible vLLM version**: From the table, e.g., `v0.15.0`
### 2. Get the Latest vLLM Code
Retrieve the latest commit from the local vLLM git repository:
```bash
# The vLLM git repository is typically located in the parent directory
cd ../vllm
git log -1 --format="%H %s"
```
If the vLLM repository is not found at the default location, prompt the user to specify the exact path to the vLLM git repository.
### 3. Compare vLLM Changes
Compare the differences between the vLLM commit currently adapted by vLLM Ascend and the latest commit:
```bash
# View file changes between two commits
git diff <old_commit> <new_commit> --name-only
# View detailed code changes
git log --oneline <old_commit>..<new_commit>
```
### 4. Analyze vLLM Changes and Generate Change Report
Create a file named `vllm_changes.md` to save the list of changes in vLLM that are relevant to vLLM Ascend. This file will be used to guide the adaptation process and should be removed after all work is done.
#### 4.1 Identify Key vLLM Source Files
Focus on vLLM source files under `vllm/vllm/` directory, especially:
```bash
# Get changed files in vLLM source code
git diff <old_commit> <new_commit> --name-only | grep -E "^vllm/" | head -200
# Count total changes
git diff <old_commit> <new_commit> --name-only | wc -l
```
#### 4.2 Categorize Changes by Priority
When analyzing changes, categorize them into the following priority levels:
| Priority | Category | Description |
|----------|----------|-------------|
| **P0** | Breaking Changes | API changes that will cause runtime errors if not adapted |
| **P1** | Important Changes | Changes that affect functionality or performance |
| **P2** | Moderate Changes | Changes that may need review for compatibility |
| **P3** | Model Changes | New models or model updates |
| **P4** | Minor Changes | Configuration, documentation, or minor refactoring |
#### 4.3 Key Areas to Focus On
When analyzing vLLM changes, pay special attention to these areas that typically require vLLM Ascend adaptation:
1. **Platform Interface** (`vllm/platforms/`)
- New abstract methods that must be implemented
- Method signature changes
- New platform features
2. **MoE (Mixture of Experts)** (`vllm/model_executor/layers/fused_moe/`)
- FusedMoE layer changes
- Activation function changes
- Router changes
3. **Attention** (`vllm/model_executor/layers/attention/`)
- Attention backend changes
- New parameters or interfaces
- MLA (Multi-Head Latent Attention) updates
4. **Speculative Decoding** (`vllm/v1/worker/gpu/spec_decode/`, `vllm/config/speculative.py`)
- Import path changes
- Config field changes
- New speculative methods
5. **Distributed** (`vllm/distributed/`)
- Parallel state changes
- KV transfer changes
- Device communicator updates
6. **Models** (`vllm/model_executor/models/`)
- New model architectures
- Model interface changes
7. **Worker/Model Runner** (`vllm/v1/worker/gpu/model_runner.py`)
- New worker methods
- Model runner changes
8. **Quantization** (`vllm/model_executor/layers/quantization/`)
- Quantization config changes
   - compressed-tensors method changes
#### 4.4 vllm_changes.md Template
Use the following template structure for `vllm_changes.md`:
```markdown
# vLLM Changes Relevant to vLLM Ascend
# Generated: <DATE>
# Old commit: <OLD_COMMIT_HASH> (<OLD_VERSION>)
# New commit: <NEW_COMMIT_HASH>
# Total commits: <COUNT>
================================================================================
## P0 - Breaking Changes (Must Adapt)
================================================================================
### <INDEX>. <CHANGE_TITLE>
FILE: <VLLM_FILE_PATH>
CHANGE: <DESCRIPTION_OF_CHANGE>
IMPACT: <WHAT_BREAKS_IF_NOT_ADAPTED>
VLLM_ASCEND_FILES:
- <PATH_TO_ASCEND_FILE_1>
- <PATH_TO_ASCEND_FILE_2>
================================================================================
## P1 - Important Changes (Should Adapt)
================================================================================
...
================================================================================
## P2 - Moderate Changes (Review Needed)
================================================================================
...
================================================================================
## P3 - Model Changes
================================================================================
...
================================================================================
## P4 - Configuration/Minor Changes
================================================================================
...
================================================================================
## Files/Directories Renamed
================================================================================
<LIST_OF_RENAMED_FILES>
================================================================================
## END OF CHANGES
================================================================================
```
#### 4.5 Commands to Analyze Specific Changes
```bash
# Check for breaking changes in commit messages
git log --oneline <old_commit>..<new_commit> | grep -iE "(refactor|breaking|api|rename|remove|deprecate)"
# View specific file changes
git diff <old_commit> <new_commit> -- <FILE_PATH>
# Check for renamed/moved files
git diff <old_commit> <new_commit> --name-status | grep -E "^R"
# Check platform interface changes
git diff <old_commit> <new_commit> -- vllm/platforms/
# Check MoE changes
git diff <old_commit> <new_commit> -- vllm/model_executor/layers/fused_moe/
# Check attention changes
git diff <old_commit> <new_commit> -- vllm/model_executor/layers/attention/
# Check speculative decoding changes
git diff <old_commit> <new_commit> -- vllm/v1/worker/gpu/spec_decode/ vllm/config/speculative.py
```
### 5. Adapt vLLM Ascend Project
For each related change in vLLM from the file `vllm_changes.md`, evaluate whether adaptation in vLLM Ascend is needed:
#### 5.1 Internal Architecture Changes
- Check internal interfaces of vLLM core modules (scheduler, executor, model runner, etc.)
- Update vLLM Ascend's Ascend-specific implementations (e.g., NPU worker/model runner, custom attention, custom ops)
- Preserve vLLM Ascend specific modifications (e.g., code under `vllm_ascend/`)
#### 5.2 Dependency Changes
- Check for dependency version changes in `pyproject.toml` or `setup.py`
- Update dependency declarations in vLLM Ascend
### 6. Test and Verify
- Run vLLM Ascend's CI/CD pipeline
- Verify core functionality (text generation, batching, NPU memory management)
- Ensure backward compatibility: test compatibility with older vLLM versions
## Key File Locations
| Project | Path |
|---------|------|
| vLLM Ascend version compatibility | `docs/source/community/versioning_policy.md` |
| vLLM Ascend source code | `vllm_ascend/` |
| **Core Modules** | |
| Ascend-specific attention | `vllm_ascend/attention/` |
| Ascend-specific executor | `vllm_ascend/worker/` |
| Ascend-specific ops | `vllm_ascend/ops/` |
| **Specialized Implementations** | |
| Ascend 310P specific | `vllm_ascend/_310p/` |
| EPLB load balancing | `vllm_ascend/eplb/` |
| XLite compiler | `vllm_ascend/xlite/` |
| **Compilation & Fusion** | |
| Graph fusion pass manager | `vllm_ascend/compilation/` |
| Compilation passes | `vllm_ascend/compilation/passes/` |
| **Quantization** | |
| Quantization methods | `vllm_ascend/quantization/` |
| ModelSlim integration | `vllm_ascend/quantization/methods/modelslim/` |
| **Distributed & KV Cache** | |
| KV transfer | `vllm_ascend/distributed/kv_transfer/` |
| Device communicators | `vllm_ascend/distributed/device_communicators/` |
| **Speculative Decoding** | |
| MTP proposer | `vllm_ascend/spec_decode/mtp_proposer.py` |
| Eagle proposer | `vllm_ascend/spec_decode/eagle_proposer.py` |
| **Utility Modules** | |
| Common utilities | `vllm_ascend/utils.py` |
| Ascend config | `vllm_ascend/ascend_config.py` |
| Platform detection | `vllm_ascend/platform.py` |
| Environment variables | `vllm_ascend/envs.py` |
## Important Notes
1. **Version Checking**: vLLM Ascend uses version checking to maintain compatibility with multiple vLLM versions. Preserve or update related logic when adapting.
2. **Test Verification**: After adaptation, tests must verify:
- Compatibility with the latest vLLM version
- Backward compatibility with older vLLM versions
- Ascend NPU functionality works correctly
3. **Documentation Sync**: If vLLM documentation has significant changes, update vLLM Ascend's documentation accordingly.
4. **Backward Compatibility**:
- Maintain compatibility from the version currently adapted by vLLM Ascend to the latest version
- Use version checking to handle code branches for different versions:
```python
from vllm_ascend.utils import vllm_version_is
if vllm_version_is("0.15.0"):
    # Use the API for v0.15.0
    ...
else:
    # Use the API for other versions
    ...
```
5. Do not forget to update the vLLM version in the `.github` CI files.
6. **Change Logging**: After adaptation, clearly document in the commit message:
- The range of adapted vLLM commits
- Main changes made
- Test results
7. The vLLM Python code is under the `vllm/vllm` folder.
## Reference
- [Versioning Policy](../../../docs/source/community/versioning_policy.md) - vLLM Ascend versioning strategy


@@ -1,140 +0,0 @@
---
name: vllm-ascend-model-adapter
description: "Adapt and debug existing or new models for vLLM on Ascend NPU. Implement in /vllm-workspace/vllm and /vllm-workspace/vllm-ascend, validate via direct vllm serve from /workspace, and deliver one signed commit in the current repo."
---
# vLLM Ascend Model Adapter
## Overview
Adapt Hugging Face or local models to run on `vllm-ascend` with minimal changes, deterministic validation, and single-commit delivery. This skill is for both already-supported models and new architectures not yet registered in vLLM.
## Read order
1. Start with `references/workflow-checklist.md`.
2. Read `references/multimodal-ep-aclgraph-lessons.md` (feature-first checklist).
3. If startup/inference fails, read `references/troubleshooting.md`.
4. If checkpoint is fp8-on-NPU, read `references/fp8-on-npu-lessons.md`.
5. Before handoff, read `references/deliverables.md`.
## Hard constraints
- Never upgrade `transformers`.
- Primary implementation roots are fixed by Dockerfile:
- `/vllm-workspace/vllm`
- `/vllm-workspace/vllm-ascend`
- Start `vllm serve` from `/workspace` with direct command by default.
- Default API port is `8000` unless user explicitly asks otherwise.
- Feature-first default: try best to validate ACLGraph / EP / flashcomm1 / MTP / multimodal out-of-box.
- `--enable-expert-parallel` and flashcomm1 checks are MoE-only; for non-MoE models mark as not-applicable with evidence.
- If any feature cannot be enabled, keep evidence and explain reason in final report.
- Do not rely on `PYTHONPATH=<modified-src>:$PYTHONPATH` unless debugging fallback is strictly needed.
- Keep code changes minimal and focused on the target model.
- Final deliverable commit must be one single signed commit in the current working repo (`git commit -sm ...`).
- Keep final docs in Chinese and compact.
- **Dummy-first is encouraged for speed, but dummy is NOT fully equivalent to real weights.**
- **Never sign off adaptation using dummy-only evidence; real-weight gate is mandatory.**
## Execution playbook
### 1) Collect context
- Confirm model path (default `/models/<model-name>`; if environment differs, confirm with user explicitly).
- Confirm implementation roots (`/vllm-workspace/vllm`, `/vllm-workspace/vllm-ascend`).
- Confirm delivery root (the current git repo where the final commit is expected).
- Confirm runtime import path points to `/vllm-workspace/*` install.
- Use default expected feature set: ACLGraph + EP + flashcomm1 + MTP + multimodal (if model has VL capability).
- User requirements extend this baseline, not replace it.
### 2) Analyze model first
- Inspect `config.json`, processor files, modeling files, tokenizer files.
- Identify architecture class, attention variant, quantization type, and multimodal requirements.
- Check state-dict key prefixes (and safetensors index) to infer mapping needs (see the sketch after this list).
- Decide whether support already exists in `vllm/model_executor/models/registry.py`.
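A minimal sketch of the key-prefix check mentioned above (the model path is a placeholder; adjust it to your checkpoint): peek at the safetensors index to see how checkpoint keys are prefixed before writing any weight-remap rules.
```python
import collections
import json
import pathlib

index_path = pathlib.Path("/models/<model-name>/model.safetensors.index.json")
weight_map = json.loads(index_path.read_text())["weight_map"]
# count the top-level prefixes of all checkpoint keys
prefixes = collections.Counter(key.split(".")[0] for key in weight_map)
print(prefixes.most_common(10))
```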
### 3) Choose adaptation strategy (new-model capable)
- Reuse existing vLLM architecture if compatible.
- If architecture is missing or incompatible, implement native support:
- add model adapter under `vllm/model_executor/models/`;
- add processor under `vllm/transformers_utils/processors/` when needed;
- register architecture in `vllm/model_executor/models/registry.py`;
- implement explicit weight loading/remap rules (including fp8 scale pairing, KV/QK norm sharding, rope variants).
- If remote code needs newer transformers symbols, do not upgrade dependency.
- If unavoidable, copy required modeling files from sibling transformers source and keep scope explicit.
- If failure is backend-specific (kernel/op/platform), patch minimal required code in `/vllm-workspace/vllm-ascend`.
### 4) Implement minimal code changes (in implementation roots)
- Touch only files required for this model adaptation.
- Keep weight mapping explicit and auditable.
- Avoid unrelated refactors.
### 5) Two-stage validation on Ascend (direct run)
#### Stage A: dummy fast gate (recommended first)
- Run from `/workspace` with `--load-format dummy`.
- Goal: fast validate architecture path / operator path / API path.
- Do not treat `Application startup complete` as pass by itself; request smoke is mandatory.
- Require at least:
- startup readiness (`/v1/models` 200),
- one text request 200,
- if VL model, one text+image request 200,
- ACLGraph evidence where expected.
#### Stage B: real-weight mandatory gate (must pass before sign-off)
- Remove `--load-format dummy` and validate with real checkpoint.
- Goal: validate real-only risks:
- weight key mapping,
- fp8/fp4 dequantization path,
- KV/QK norm sharding with real tensor shapes,
- load-time/runtime stability.
- Require HTTP 200 and non-empty output before declaring success.
- Do not pass Stage B on startup-only evidence.
### 6) Validate inference and features
- Send `GET /v1/models` first.
- Send at least one OpenAI-compatible text request.
- For multimodal models, require at least one text+image request.
- Validate architecture registration and loader path with logs (no unresolved architecture, no fatal missing-key errors).
- Try feature-first validation: EP + ACLGraph path first; eager path as fallback/isolation.
- If startup succeeds but first request crashes (false-ready), treat as runtime failure and continue root-cause isolation.
- For `torch._dynamo` + `interpolate` + `NPU contiguous` failures on VL paths, try `TORCHDYNAMO_DISABLE=1` as diagnostic/stability fallback.
- For multimodal processor API mismatch (for example `skip_tensor_conversion` signature mismatch), use text-only isolation (`--limit-mm-per-prompt` set image/video/audio to 0) to separate processor issues from core weight loading issues.
- Capacity baseline by default (single machine): `max-model-len=128k` + `max-num-seqs=16`.
- Then expand concurrency (e.g., 32/64) if requested or feasible.
### 7) Backport, generate artifacts, and commit in delivery repo
- If implementation happened in `/vllm-workspace/*`, backport minimal final diff to current working repo.
- Generate test config YAML at `tests/e2e/models/configs/<ModelName>.yaml` following the schema of existing configs (must include `model_name`, `hardware`, `tasks` with accuracy metrics, and `num_fewshot`). Use accuracy results from evaluation to populate metric values.
- Generate tutorial markdown at `docs/source/tutorials/models/<ModelName>.md` following the standard template (Introduction, Supported Features, Environment Preparation with docker tabs, Deployment with serve script, Functional Verification with curl example, Accuracy Evaluation, Performance). Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, and accuracy table.
- Update `docs/source/tutorials/models/index.md` to include the new tutorial.
- Confirm test config YAML and tutorial doc are included in the staged files.
- Commit code changes once (single signed commit).
### 8) Prepare handoff artifacts
- Write comprehensive Chinese analysis report.
- Write compact Chinese runbook for server startup and validation commands.
- Include feature status matrix (supported / unsupported / checkpoint-missing / not-applicable).
- Include dummy-vs-real validation matrix and explicit non-equivalence notes.
- Include changed-file list, key logs, and final commit hash.
- Post the SKILL.md content (or a link to it) as a comment on the originating GitHub issue to document the AI-assisted workflow.
## Quality gate before final answer
- Service starts successfully from `/workspace` with direct command.
- OpenAI-compatible inference request succeeds (not startup-only).
- Key feature set is attempted and reported: ACLGraph / EP / flashcomm1 / MTP / multimodal.
- Capacity baseline (`128k + bs16`) result is reported, or explicit reason why not feasible.
- **Dummy stage evidence is present (if used), and real-weight stage evidence is present (mandatory).**
- Test config YAML exists at `tests/e2e/models/configs/<ModelName>.yaml` and follows the established schema (`model_name`, `hardware`, `tasks`, `num_fewshot`).
- Tutorial doc exists at `docs/source/tutorials/models/<ModelName>.md` and follows the standard template (Introduction, Supported Features, Environment Preparation, Deployment, Functional Verification, Accuracy Evaluation, Performance).
- Tutorial index at `docs/source/tutorials/models/index.md` includes the new model entry.
- Exactly one signed commit contains all code changes in current working repo.
- Final response includes commit hash, file paths, key commands, known limits, and failure reasons where applicable.


@@ -1,47 +0,0 @@
# Deliverables
## Required outputs in current repo
1. One final signed commit (`git commit -sm ...`) containing the adaptation changes.
2. Chinese analysis report (concise but complete):
- model architecture summary
- incompatibility root causes
- code changes and rationale
- startup and inference verification evidence
- feature status matrix (supported / unsupported / checkpoint-missing / not-applicable)
- max model len: config theoretical vs runtime practical
- dummy-vs-real validation matrix (what dummy proved / what only real proved)
- false-ready cases and final resolution path (if any)
- fallback ladder evidence (which fallback was tried, what changed)
3. Chinese compact runbook:
- how to start server in `/workspace` (direct command, default `:8000`)
- how to run OpenAI-compatible validation
- optional eager fallback command
- optional `TORCHDYNAMO_DISABLE=1` fallback command (if relevant)
4. Test config YAML at `tests/e2e/models/configs/<ModelName>.yaml` — must include `model_name`, `hardware`, `tasks` with accuracy metrics (name + value), and `num_fewshot`. Use accuracy results from evaluation to populate metric values. Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
5. Tutorial doc at `docs/source/tutorials/models/<ModelName>.md` — must follow the standard template: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance. Fill in model-specific details (HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, accuracy table).
6. Post SKILL.md content or AI-assisted workflow summary as a comment on the originating GitHub issue.
## Commit discipline
- Keep one signed commit for code changes in the current working repo.
- If implementation occurred in `/vllm-workspace/*`, backport minimal final diff to current repo before commit.
- Keep diff scoped to target model adaptation.
## Validation discipline
- Always provide log file paths for key claims.
- Keep docs synchronized with latest successful test mode (do not leave stale command variants as default).
- Final report must include pass/fail reason for each key feature attempt: ACLGraph / EP / flashcomm1 / MTP / multimodal.
- EP and flashcomm1 are MoE-only checks; for non-MoE models mark as not-applicable with evidence.
- Final report should include baseline capacity result (`128k + bs16`) or explicit reason if not feasible.
- Dummy-first can be used to speed up iterations, but real-weight gate is mandatory before final sign-off.
- Startup-only evidence is insufficient; include first-request smoke results.
## Suggested final response structure
- What changed
- What went well / what went wrong
- Validation performed
- Commit hash and changed files
- Optional next step


@@ -1,57 +0,0 @@
# FP8-on-NPU Lessons
## 1) Recommended debug order
1. Start with `--load-format dummy` to quickly verify architecture path.
2. Run with real weights to validate weight mapping and load-time stability.
3. If blocked by fp8 execution limits on NPU, use fp8->bf16 dequantization loading path.
4. Validate `/v1/models`, then one text request, then one VL request (if multimodal).
## 2) FP8 checkpoint on NPU
Common symptom:
- `fp8 quantization is currently not supported in npu`.
Recommended pattern:
- do not force fp8 execution kernels on NPU;
- dequantize fp8 weights to bf16 during loading using paired tensors:
- `*.weight`
- `*.weight_scale_inv`
- keep strict unpaired scale/weight checks to avoid silent corruption.
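A minimal load-time sketch of the pairing above (function name is an assumption, and block-wise scale tiling is ignored for brevity; it assumes the checkpoint's convention is dequant = weight * weight_scale_inv):
```python
import torch

def dequant_fp8_state_dict(sd: dict) -> dict:
    out = {}
    for name, tensor in sd.items():
        if name.endswith(".weight_scale_inv"):
            continue  # consumed together with its paired weight below
        if name.endswith(".weight") and tensor.dtype == torch.float8_e4m3fn:
            scale_name = name[: -len(".weight")] + ".weight_scale_inv"
            if scale_name not in sd:
                # strict pairing check: fail loudly instead of silently corrupting
                raise ValueError(f"unpaired fp8 weight without scale: {name}")
            scale = sd[scale_name].to(torch.float32)
            out[name] = (tensor.to(torch.float32) * scale).to(torch.bfloat16)
        else:
            out[name] = tensor
    return out
```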
## 3) Typical real-only risks (dummy may not expose)
- missing fp8 scale keys during real shard loading;
- wrong weight remap path only triggered by real checkpoints;
- KV/QK norm sharding mismatch under TP + replicated KV heads.
## 4) KV replication + TP pitfalls
Typical symptom:
- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
Recommended pattern:
- detect KV-head replication explicitly;
- use local norm/shard loader path for replicated KV heads;
- avoid assuming uniform divisibility for all head dimensions.
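A minimal sketch of detecting KV-head replication explicitly (variable names are assumptions, not the actual vllm-ascend code):
```python
def source_kv_head(tp_rank: int, tp_size: int, num_key_value_heads: int) -> int:
    if tp_size > num_key_value_heads:
        # Replicated case: several TP ranks share a copy of the same KV head.
        replication = tp_size // num_key_value_heads
        return tp_rank // replication
    # Sharded case: each rank owns a contiguous slice of KV heads.
    heads_per_rank = num_key_value_heads // tp_size
    return tp_rank * heads_per_rank
```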
## 5) ACLGraph stability for fp8-origin checkpoints
Recommended pattern:
- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode;
- keep practical capture sizes and re-test from small, stable shapes;
- use `--enforce-eager` only as temporary isolation fallback.
## 6) Reporting discipline
Always report both:
- what dummy validated (fast gate), and
- what only real weights validated (mandatory gate).
Do not sign off fp8-on-NPU adaptation with dummy-only evidence.


@@ -1,64 +0,0 @@
# Multimodal + EP + ACLGraph Lessons
This note captures practical patterns that repeatedly matter for VL checkpoints on Ascend.
## 1) Out-of-box feature expectation
Try best to validate key features by default:
- ACLGraph
- MTP
- multimodal (if model supports VL)
- EP (MoE models only)
- flashcomm1 (MoE models only)
If any feature fails, keep logs and explain the reason in the final report.
For non-MoE models, EP/flashcomm1 should be marked not-applicable.
## 2) Validate in this order
1. Single text request success (`/v1/models` + `/v1/chat/completions`).
2. Single text+image request success.
3. Graph evidence (`Replaying aclgraph`) when graph mode is expected.
4. Capacity baseline: `128k + bs16`.
5. Concurrency expansion if needed (`32/64` suggested).
## 3) EP + graph startup expectations
- Startup latency is much higher than eager due to:
- compile warmup
- graph capture rounds
- multimodal encoder profiling
- Do not treat slow startup as failure unless logs show hard errors.
## 4) Always distinguish two max lengths
- **Theoretical max**: from model config (`max_position_embeddings`).
- **Practical max**: largest value that actually starts and serves on current hardware + TP/EP settings.
Report both values explicitly.
## 5) Multimodal testing with temporary layer reduction
- Reducing `num_hidden_layers` can speed smoke tests.
- This does **not** remove ViT structure itself.
- Still require one full-layer validation before final sign-off.
## 6) Feature-status semantics
Use four categories:
- ✅ supported and verified
- ❌ framework-level unsupported
- ⚠️ checkpoint missing (weights/config do not provide feature)
- N/A not-applicable (for example EP/flashcomm1 on non-MoE models)
Typical examples:
- flashcomm1 on non-MoE VL models is often N/A or ❌ depending on framework gate.
- MTP may be ⚠️ checkpoint missing even if framework has code paths.
## 7) Keep docs and defaults aligned with latest success path
- If EP+graph is validated and requested/expected, it should be the default runbook path.
- Eager mode should be documented as fallback/troubleshooting only.


@@ -1,229 +0,0 @@
# Troubleshooting
## Direct run doesn't pick your code changes
Symptoms:
- `vllm serve` behavior still old after code edits.
Actions:
1. Check runtime import path:
```bash
python - <<'PY'
import vllm
print(vllm.__file__)
PY
```
2. Ensure edits were made under `/vllm-workspace/vllm` and/or `/vllm-workspace/vllm-ascend`.
3. Avoid PYTHONPATH-overlay workflow unless as temporary debugging fallback.
## Server fails to bind on `:8000` or fails with HCCL bind errors
Symptoms:
- Port bind fail on startup.
- HCCL error like `Communication_Error_Bind_IP_Port(EJ0003)`.
Actions:
1. Kill stale `vllm serve` processes.
2. Ensure `:8000` is free.
3. Retry clean startup before changing code.
## Startup appears "stuck" in graph mode
Symptoms:
- Process alive, but `curl /v1/models` not ready yet.
- Logs show compile/graph capture messages for a long time.
Actions:
1. Keep waiting until graph capture completes.
2. Look for `Capturing CUDA graphs ...` and `Graph capturing finished`.
3. Only declare failure after an explicit error or timeout window.
## False-ready: startup succeeds but first request crashes
Symptoms:
- `Application startup complete` exists.
- `GET /v1/models` may return 200.
- First text or VL request crashes workers/engine.
Actions:
1. Always run at least one text smoke request immediately after ready.
2. For VL models, always run one text+image smoke request as well.
3. Treat first-request crash as runtime failure (do not mark as success).
4. Capture first runtime error signature and branch to targeted fallback.
## Architecture not recognized
Symptoms:
- `ValueError` or log shows unresolved architecture.
Actions:
1. Verify `architectures` in model `config.json`.
2. Add mapping to `vllm/model_executor/models/registry.py`.
3. Ensure module and class names exactly match.
## Remote code import fails on transformers symbols
Symptoms:
- Missing class/function in current `transformers`.
Actions:
1. Do not upgrade `transformers`.
2. Prefer native vLLM implementation.
3. If unavoidable, copy required modeling files from sibling transformers source.
## Weight loading key mismatch
Symptoms:
- Missing/unexpected key warnings during load.
Actions:
1. Inspect checkpoint key prefixes.
2. Add explicit mapping logic.
3. Keep mapping minimal and auditable.
4. Re-test with full shards, not only tiny-layer smoke runs.
## FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)
Symptoms:
- fp8 kernels unsupported or unstable on Ascend A2/A3.
Actions:
1. Do not force fp8 quantization kernels on Ascend.
2. Use load-time fp8->bf16 dequantization path (weight + scale pairing).
3. Add strict unpaired scale/weight checks to avoid silent corruption.
## QK norm mismatch (KV heads / TP / head divisibility)
Symptoms:
- Shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
- Similar mismatch when head topology is not cleanly divisible.
Actions:
1. Detect KV-head replication case.
2. Use local `k_norm` shard path for replicated KV heads.
3. Avoid assumptions that all head dimensions split evenly under current TP.
4. Validate both normal and edge topology cases explicitly.
## MLA attention runtime failures after ready
Symptoms:
- First request fails with signatures like `AtbRingMLAGetWorkspaceSize` / `AtbRingMLA`.
- May also show `aclnnFusedInferAttentionScoreV3 ... error code 561002`.
Actions:
1. Reproduce with one minimal text request (deterministic payload).
2. Try eager isolation (`--enforce-eager`) once to verify whether issue is graph-only.
3. If eager still fails, prioritize model/backend code fix path (not runtime flags only).
4. Check `vllm-ascend` MLA/rope/platform implementation used by known-good runs.
## VL + TorchDynamo interpolate contiguous failure
Symptoms:
- `torch._dynamo.exc.TorchRuntimeError`.
- Stack contains `torch.nn.functional.interpolate`.
- Error contains `NPU contiguous operator only supported contiguous memory format`.
Actions:
1. Add `TORCHDYNAMO_DISABLE=1` and retry with same serve args.
2. Validate both text and text+image after startup.
3. If this stabilizes startup and inference, record it as current fallback path.
4. Keep code-level fix exploration as next step, but do not block delivery if fallback is accepted.
## Multimodal processor signature mismatch (`skip_tensor_conversion`)
Symptoms:
- Early failure before engine ready.
- `convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'`.
Actions:
1. Identify processor compatibility mismatch (HF remote processor vs current transformers API).
2. Use text-only isolation (`--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`) only to separate layers, not as final fix.
3. Expect potential follow-up core failures after bypassing processor path; keep logs for both layers.
4. Align to known-good model dispatch and processor compatibility implementation.
## Text-only isolation triggers meta tensor load errors
Symptoms:
- `NotImplementedError: Cannot copy out of meta tensor; no data!`
- May occur after disabling multimodal prompt items.
Actions:
1. Treat as secondary failure signature (after bypassing earlier MM-processor failure).
2. Do not assume text-only isolation is universally safe for all VL models.
3. Return to model-specific code-fix path with captured signatures.
## Config max length works on paper but not in runtime
Symptoms:
- `max_position_embeddings` is large, but service fails or OOM with that value.
Actions:
1. Record config max (theoretical).
2. Find practical max by successful startup + serving under target TP/EP setup.
3. Report both values explicitly in docs.
## flashcomm1 / MTP confusion on VL checkpoints
Symptoms:
- flashcomm1 enabled but startup fails.
- MTP expected but no effect.
Actions:
1. Only validate flashcomm1 for MoE models; non-MoE mark as not-applicable.
2. Verify MTP from both config and weight index (`mtp/nextn` keys).
3. Mark unsupported vs checkpoint-missing clearly.
## ACL graph capture fails (507903)
Symptoms:
- `AclmdlRICaptureEnd ... 507903`
- `rtStreamEndCapture ... invalidated stream capture sequence`
Actions:
1. Prefer `HCCL_OP_EXPANSION_MODE=AIV` for graph capture stability.
2. Reduce shape pressure (`--max-model-len`) and retry.
3. Temporarily fallback `--enforce-eager` for isolation.
## API reachable but output quality odd
Symptoms:
- `/v1/models` works but output has template artifacts.
Actions:
1. Use deterministic request (`temperature=0`, bounded `max_tokens`).
2. Verify endpoint (`/v1/chat/completions` vs `/v1/completions`) matches model template.
3. Confirm non-empty output and HTTP 200 before success declaration.


@@ -1,255 +0,0 @@
# Workflow Checklist
## 0) Environment prerequisites
Set these once per session. Defaults match the official vllm-ascend Docker image.
```bash
# --- configurable paths (adjust if your layout differs) ---
VLLM_SRC=/vllm-workspace/vllm # vLLM source root
VLLM_ASCEND_SRC=/vllm-workspace/vllm-ascend # vllm-ascend source root
WORK_DIR=/workspace # directory to run vllm serve from
MODEL_ROOT=/models # parent directory of model checkpoints
```
Expected environment:
- Hardware: Ascend A2 or A3 server
- Software: official vllm-ascend Docker image (see `./Dockerfile` for full contents)
- TP=16 typical for A3 (16-NPU), TP=8 typical for A2 (8-NPU)
## 1) Fast triage commands
```bash
MODEL_PATH=${MODEL_ROOT}/<model-name>
echo "MODEL_PATH=$MODEL_PATH"
# model inventory
ls -la "$MODEL_PATH"
# architecture + quant hints
rg -n "architectures|model_type|quantization_config|torch_dtype|max_position_embeddings|num_nextn_predict_layers|version|num_attention_heads|num_key_value_heads|num_experts" "$MODEL_PATH/config.json"
# state-dict key layout hints (if index exists)
ls -la "$MODEL_PATH"/*index*.json 2>/dev/null || true
# model custom code (if exists)
ls -la "$MODEL_PATH"/*.py 2>/dev/null || true
```
## 2) Confirm implementation and delivery roots
```bash
# implementation roots (fixed by Dockerfile)
cd "$VLLM_SRC" && git status -s
cd "$VLLM_ASCEND_SRC" && git status -s
# runtime import source check (expect vllm-workspace path)
python - <<'PY'
import vllm
print(vllm.__file__)
PY
# direct-run working directory
cd "$WORK_DIR" && pwd
# delivery root (current repo)
cd <current-repo>
git status -s
```
## 3) Session hygiene (before rerun)
```bash
# stop stale servers
pkill -f "vllm serve|api_server|EngineCore" || true
# confirm port 8000 is free
netstat -ltnp 2>/dev/null | rg ':8000' || true
```
When user explicitly requests reset:
```bash
cd "$VLLM_SRC" && git reset --hard && git clean -fd
cd "$VLLM_ASCEND_SRC" && git reset --hard && git clean -fd
```
## 4) New model onboarding checklist
```bash
# architecture mapping check in vLLM
rg -n "<ArchitectureClass>|registry" "$VLLM_SRC"/vllm/model_executor/models/registry.py
# optional: inspect model config and weight index quickly
cat "$MODEL_PATH/config.json"
cat "$MODEL_PATH"/*index*.json 2>/dev/null || true
```
If architecture is missing/incompatible, minimally do:
1. Add model adapter under `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`.
2. Add processor under `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py` when needed.
3. Register architecture in `$VLLM_SRC/vllm/model_executor/models/registry.py`.
4. Add explicit loader/remap rules for checkpoint key patterns (qkv/norm/rope/fp8 scales).
5. Touch `$VLLM_ASCEND_SRC` only when backend-specific errors are confirmed.
## 5) Typical implementation touch points
- `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`
- `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py`
- `$VLLM_SRC/vllm/model_executor/models/registry.py`
- `$VLLM_ASCEND_SRC/vllm_ascend/...` (only if backend behavior requires it)
## 6) Syntax sanity checks
```bash
python -m py_compile \
"$VLLM_SRC"/vllm/model_executor/models/<new_model>.py
python -m py_compile \
"$VLLM_SRC"/vllm/transformers_utils/processors/<new_model>.py 2>/dev/null || true
```
## 7) Two-stage serve templates (direct run, default `:8000`)
### Stage A: dummy fast gate (first try)
```bash
cd "$WORK_DIR"
MODEL_PATH=${MODEL_ROOT}/<model-name>
HCCL_OP_EXPANSION_MODE=AIV \
VLLM_ASCEND_ENABLE_FLASHCOMM1=0 \
vllm serve "$MODEL_PATH" \
--served-model-name <served-name> \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len <practical-max-len-or-131072> \
--tensor-parallel-size <TP-size> \
--max-num-seqs 16 \
--load-format dummy \
--port 8000
```
### Stage B: real-weight mandatory gate
```bash
# remove this from Stage A:
--load-format dummy
```
> Note: dummy is not equivalent to real weights. Real gate is mandatory before sign-off.
### EP + ACLGraph (feature-first, MoE only)
```bash
# add to Stage B when model is MoE and validating EP:
--enable-expert-parallel
```
### flashcomm1 check (MoE only)
```bash
# only evaluate flashcomm1 when model is MoE
VLLM_ASCEND_ENABLE_FLASHCOMM1=1
```
### Eager fallback (isolation)
```bash
# add to command for isolation only:
--enforce-eager
```
### TorchDynamo fallback (for VL interpolate-contiguous failures)
```bash
# add env var when logs contain:
# torch._dynamo.exc.TorchRuntimeError + interpolate +
# "NPU contiguous operator only supported contiguous memory format"
TORCHDYNAMO_DISABLE=1
```
## 8) Readiness + smoke checks (must verify true-ready)
```bash
# readiness
for i in $(seq 1 200); do
curl -sf http://127.0.0.1:8000/v1/models >/tmp/models.json && break
sleep 3
done
# text smoke (required)
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"<served-name>","messages":[{"role":"user","content":"say hi"}],"temperature":0,"max_tokens":16}'
# VL smoke (required for multimodal models)
# send one text+image OpenAI-compatible request and require non-empty choices (sketch after this block).
```
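A hedged VL smoke sketch using the standard OpenAI-compatible multimodal message format; the image URL is a placeholder, substitute any reachable image or a base64 data URL:
```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"<served-name>",
"messages":[{"role":"user","content":[
{"type":"text","text":"describe this image in one sentence"},
{"type":"image_url","image_url":{"url":"<http-or-data-url-of-test-image>"}}]}],
"temperature":0,"max_tokens":32}'
# require HTTP 200 and non-empty choices[0].message.content
```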
> `Application startup complete` alone is not success. If the first request crashes, treat it as a runtime failure (false-ready).
## 9) Feature validation checklist (default out-of-box)
1. `GET /v1/models` returns 200.
2. Text request returns 200 and non-empty output.
3. If VL model: text+image request returns 200.
4. ACLGraph evidence exists (`Replaying aclgraph`) where expected (log-check sketch after this list).
5. EP path is validated only for MoE models; non-MoE must be marked not-applicable.
6. flashcomm1 is validated only for MoE models; non-MoE must be marked not-applicable.
7. MTP status verified from config + weight index (enabled vs checkpoint-missing).
8. Dummy-vs-real differences are explicitly reported (if any).
9. Any false-ready case is explicitly marked as failure (with log signature).
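A minimal log-check sketch for item 4, assuming the serve output was captured to a file (the `serve.log` path is an assumption about how the server was launched):
```bash
# expect at least one replay line when ACLGraph is active
rg -c "Replaying aclgraph" serve.log || echo "no ACLGraph replay evidence found"
```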
## 10) Fallback ladder (recommended order)
1. Keep same params and reproduce once to ensure deterministic failure signature.
2. Add `--enforce-eager` to isolate graph-capture influence.
3. For VL + dynamo/interpolate/contiguous failures, add `TORCHDYNAMO_DISABLE=1`.
4. For multimodal-processor suspicion, isolate text-only by:
- `--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`
- then check whether failure moves from processor layer to model core.
5. If issue persists, map failure signature to known-good implementation and patch minimal code.
## 11) Capacity baseline + sweep
- Baseline (single machine): **`max-model-len=128k` + `max-num-seqs=16`**.
- If baseline passes, expand to `max-num-seqs=32/64` when requested.
- If the baseline cannot pass due to hardware/runtime limits, report the explicit root cause.
## 12) Delivery checklist
```bash
# in current working repo (delivery root)
git add <changed-files>
git commit -sm "<message>"
```
Confirm:
- one signed commit only
- Chinese analysis + Chinese runbook present
- feature status matrix included with pass/fail reason
- dummy stage and real stage validation evidence included
- false-ready cases (if any) documented with final fallback status
### Test config generation
- Generate `tests/e2e/models/configs/<ModelName>.yaml` using accuracy results from evaluation.
- Must include: `model_name` (HF path), `hardware` (e.g. "Atlas A2 Series"), `tasks` (list with `name` and `metrics` containing `name` + `value`), `num_fewshot`.
- Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`); a hedged skeleton is sketched below.
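A hedged skeleton following the field list above; every value is a placeholder to replace with the actual evaluation results:
```bash
cat > tests/e2e/models/configs/<ModelName>.yaml <<'YAML'
model_name: <hf-org>/<ModelName>
hardware: "Atlas A2 Series"
tasks:
  - name: <lm-eval-task-name>
    metrics:
      - name: <metric-name>
        value: <measured-value>
num_fewshot: <n>
YAML
```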
### Tutorial doc generation
- Generate `docs/source/tutorials/models/<ModelName>.md` from the standard template.
- Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl request, accuracy table.
- Must include sections: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance.
- Update `docs/source/tutorials/models/index.md` to include the new tutorial entry.
### GitHub issue comment
- Post SKILL.md content or AI-assisted workflow summary as a comment on the originating GitHub issue.
Confirm both test config YAML and tutorial doc are included in the signed commit.

View File

@@ -1,79 +0,0 @@
---
name: vLLM Ascend Release Note Writer
description: You are a release note writer for vLLM Ascend project (vllm-project/vllm-ascend). You are responsible for writing release notes for vLLM Ascend.
---
# vLLM Ascend Release Note Writer Skill
## Overview
You should use `ref-past-release-notes-highlight.md` as the style and category reference. Always read it first.
## When to use this skill
When a new version of vLLM Ascend is released, you should use this skill to write the release notes.
## How to use it
0. All output files should be saved under the `vllm-ascend-release-note/output/$version` folder.
1. Use the `fetch_commits-optimize.py` script to fetch the commits between the previous and current version.
```bash
uv run python fetch_commits-optimize.py --base-tag $LAST_TAG --head-tag $NEW_TAG --output 0-current-raw-commits.md
```
`0-current-raw-commits.md` is your raw data input.
2. Use the `commit-analysis-draft.csv` tool to analyze the commits and put them into the correct section.
`1-commit-analysis-draft.csv` is your workspace for the commit-by-commit analysis of which commit goes into which section, whether it can be ignored, and why. You can create auxiliary files in the `tmp` folder.
* You should check each commit. They are put into rows in the CSV file.
* The CSV should have headers `title`, `pr number`, `user facing impact/summary`, `category`, `decision`, `reason`. Please brainstorm other fields as you see fit.
3. Draft the highlights note, and save it to `2-highlights-note-draft.md`.
4. Edit the draft highlights note in `2-highlights-note-draft.md`, and save it to `3-highlights-note-edit.md`. You should double- and triple-check against the raw commits and the analysis. You can leave any uncertainties and doubts in the file, and we will discuss them together.
5. Use the format `This is the $NUMBER release candidate of $VERSION for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.`.
## Writing style
1. To keep it simple, you should use only one level of headings, starting with ###, which may include the following categories in the order below:
### Highlights
### Features
### Hardware and Operator Support
### Performance
### Dependencies
### Deprecation & Breaking Changes
### Documentation
### Others
2. Additional Inclusion Criteria
* User experience improvements (CLI enhancements, better error messages, configuration flexibility)
* Core features (PD Disaggregation, KV cache, Graph mode, CP/SP, quantization)
* Breaking changes and deprecations (always include with clear impact description)
* Significant infrastructure changes (elastic scaling, distributed serving, hardware support)
* Major dependency updates (CANN/torch_npu/triton-ascend/MoonCake/Ray/transformers versions, critical library updates)
* Binary/deployment improvements (size reductions, Docker enhancements)
* Default behavior changes (default models, configuration changes that affect all users)
* Hardware compatibility expansions (310P, A2, A3, A5 support)
In the end, we don't want to miss any important changes, but we also don't want to spam the notes with unnecessary details.
3. Section Organization Guidelines
* **Model Support first**: Most immediately visible to users, should lead the highlights
* **Group by user impact**: Hardware/performance should focus on what users experience, not internal optimizations
* **Provide usage context**: Include relevant flags, configuration options, and practical usage information
* **Technical detail level**: Explain what features enable rather than just listing technical changes
4. Writing Tips
* Look up the PR if you are not sure about the details. The PR number at the end (#12345) can be looked up via vllm-project/vllm#12345. To get the description, you just need to call <https://api.github.com/repos/vllm-project/vllm/pulls/12345> and look at the body field.
* When writing the highlights, don't be too verbose. Focus exclusively on what users should know.

View File

@@ -1,198 +0,0 @@
## v0.14.0rc1 - 2026.01.26
This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started. This release includes all the changes in v0.13.0rc2, so we just list the differences from v0.13.0rc2. If you are upgrading from v0.13.0rc1, please read both the v0.14.0rc1 and v0.13.0rc2 release notes.
### Highlights
- 310P support is back now. In this release, only basic dense and vl models are supported with eager mode. We'll keep improving and maintaining the support for 310P. [#5776](https://github.com/vllm-project/vllm-ascend/pull/5776)
- Support compressed tensors moe w8a8-int8 quantization. [#5718](https://github.com/vllm-project/vllm-ascend/pull/5718)
- Support Medusa speculative decoding. [#5668](https://github.com/vllm-project/vllm-ascend/pull/5668)
- Support Eagle3 speculative decoding for Qwen3vl. [#4848](https://github.com/vllm-project/vllm-ascend/pull/4848)
### Features
- Xlite Backend supports Qwen3 MoE now. [#5951](https://github.com/vllm-project/vllm-ascend/pull/5951)
- Support DSA-CP for PD-mix deployment case. [#5702](https://github.com/vllm-project/vllm-ascend/pull/5702)
- Add support of new W4A4_LAOS_DYNAMIC quantization method. [#5143](https://github.com/vllm-project/vllm-ascend/pull/5143)
### Performance
- The performance of Qwen3-next has been improved. [#5664](https://github.com/vllm-project/vllm-ascend/pull/5664) [#5984](https://github.com/vllm-project/vllm-ascend/pull/5984) [#5765](https://github.com/vllm-project/vllm-ascend/pull/5765)
- The CPU bind logic and performance has been improved. [#5555](https://github.com/vllm-project/vllm-ascend/pull/5555)
- Merge Q/K split to simplify AscendApplyRotaryEmb for better performance. [#5799](https://github.com/vllm-project/vllm-ascend/pull/5799)
- Add Matmul Allreduce Rmsnorm fusion Pass. It's disabled by default. Set `fuse_allreduce_rms=True` in `--additional_config` to enable it. [#5034](https://github.com/vllm-project/vllm-ascend/pull/5034)
- Optimize rope embedding with triton kernel for huge performance gain. [#5918](https://github.com/vllm-project/vllm-ascend/pull/5918)
- support advanced apply_top_k_top_p without top_k constraint. [#6098](https://github.com/vllm-project/vllm-ascend/pull/6098)
- Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance. [#6204](https://github.com/vllm-project/vllm-ascend/pull/6204)
### Others
- Model runner v2 supports the triton kernel for penalty. [#5854](https://github.com/vllm-project/vllm-ascend/pull/5854)
- Model runner v2 supports eagle spec decoding. [#5840](https://github.com/vllm-project/vllm-ascend/pull/5840)
- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
### Dependencies
- torch-npu is upgraded to 2.9.0 [#6112](https://github.com/vllm-project/vllm-ascend/pull/6112)
### Deprecation & Breaking Changes
- EPLB config options are moved to `eplb_config` in [additional config](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/configuration/additional_config.html). The old ones are removed in this release.
- The profiler envs, such as `VLLM_TORCH_PROFILER_DIR` and `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY` do not work with vLLM Ascend now. Please use vLLM `--profiler-config` parameters instead. [#5928](https://github.com/vllm-project/vllm-ascend/pull/5928)
### Known Issues
- If you hit the pickle error from `EngineCore` process sometimes, please cherry-pick the [PR](https://github.com/vllm-project/vllm/pull/32022) into your local vLLM code. This known issue will be fixed in vLLM in the next release.
## v0.13.0rc2 - 2026.01.24
This is the second release candidate of v0.13.0 for vLLM Ascend. In this rc release, we fixed lots of bugs and improved the performance of many models. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.13.0/) to get started. Any feedback is welcome to help us to improve the final version of v0.13.0.
### Highlights
We mainly focus on quality and performance improvement in this release. The spec decode, graph mode, context parallel and EPLB have been improved significantly. A lot of bugs have been fixed and the performance has been improved for DeepSeek3.1/3.2, Qwen3 Dense/MOE models.
### Features
- implement basic framework for batch invariant [#5517](https://github.com/vllm-project/vllm-ascend/pull/5517)
- Eagle spec decode feature now works with full graph mode. [#5118](https://github.com/vllm-project/vllm-ascend/pull/5118)
- The Context Parallel (PCP & DCP) feature is more stable now, and it works for most cases. Please try it out.
- The MTP and Eagle spec decode features now work in most cases, and using them is suggested.
- The EPLB feature is more stable now. Many bugs have been fixed, and mix placement works now [#6086](https://github.com/vllm-project/vllm-ascend/pull/6086)
- Support kv nz feature for DeepSeek decode node in disagg-prefill scenario [#3072](https://github.com/vllm-project/vllm-ascend/pull/3072)
### Model Support
- LongCat-Flash is supported now.[#3833](https://github.com/vllm-project/vllm-ascend/pull/3833)
- minimax_m2 is supported now. [#5624](https://github.com/vllm-project/vllm-ascend/pull/5624)
- Support for cross-attention and whisper models [#5592](https://github.com/vllm-project/vllm-ascend/pull/5592)
### Performance
- Many custom ops and triton kernels are added in this release to speed up model performance, such as `RejectSampler`, `MoeInitRoutingCustom`, `DispatchFFNCombine` and so on.
- Improved the performance of Layerwise Connector [#5303](https://github.com/vllm-project/vllm-ascend/pull/5303)
### Others
- Basic support for Model Runner v2, the next-generation model runner of vLLM. It will be used by default in a future release. [#5210](https://github.com/vllm-project/vllm-ascend/pull/5210)
- Fixed a bug where zmq send/receive may fail [#5503](https://github.com/vllm-project/vllm-ascend/pull/5503)
- Supported to use full-graph with Qwen3-Next-MTP [#5477](https://github.com/vllm-project/vllm-ascend/pull/5477)
- Fix weight transpose in RL scenarios [#5567](https://github.com/vllm-project/vllm-ascend/pull/5567)
- Adapted SP to eagle3 [#5562](https://github.com/vllm-project/vllm-ascend/pull/5562)
- Context Parallel(PCP&DCP) support mlapo [#5672](https://github.com/vllm-project/vllm-ascend/pull/5672)
- GLM4.6 support mtp with fullgraph [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460)
- Flashcomm2 now works with oshard generalized feature [#4723](https://github.com/vllm-project/vllm-ascend/pull/4723)
- Support setting tp=1 for the Eagle draft model [#5804](https://github.com/vllm-project/vllm-ascend/pull/5804)
- Flashcomm1 feature now works with qwen3-vl [#5848](https://github.com/vllm-project/vllm-ascend/pull/5848)
- Support fine-grained shared expert overlap [#5962](https://github.com/vllm-project/vllm-ascend/pull/5962)
### Dependencies
- CANN is upgraded to 8.5.0
- torch-npu is upgraded to 2.8.0.post1. Please note that the post version will not be installed by default. Please install it by hand from [pypi mirror](https://mirrors.huaweicloud.com/ascend/repos/pypi/torch-npu/).
- triton-ascend is upgraded to 3.2.0
### Deprecation & Breaking Changes
- `CPUOffloadingConnector` is deprecated. We'll remove it in the next release. It'll be replaced by CPUOffload feature from vLLM in the future.
- EPLB config options are moved to `eplb_config` in [additional config](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/configuration/additional_config.html). The old ones will be removed in the next release.
- `ProfileExecuteDuration` [feature](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/profile_execute_duration.html) is deprecated. It's replaced by `ObservabilityConfig` from vLLM.
- The value of `VLLM_ASCEND_ENABLE_MLAPO` env will be set to True by default in the next release. It'll be enabled in decode node by default. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False.
## v0.13.0rc1 - 2025.12.27
This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots of bug fix, performance improvement and feature support in this release. Any feedback is welcome to help us to improve vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest) to get started.
### Highlights
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html)
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen3-Next.html)
- [Experimental] Prefill Context Parallel and Decode Context Parallel are supported, but notice that it is an experimental feature now, welcome any feedback. please refer to [context parallel feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/context_parallel.html)
### Features
- Support openPangu Ultra MoE [4615](https://github.com/vllm-project/vllm-ascend/pull/4615)
- A new quantization method W8A16 is supported now. [#4541](https://github.com/vllm-project/vllm-ascend/pull/4541)
- Cross-machine Disaggregated Prefill is supported now. [#5008](https://github.com/vllm-project/vllm-ascend/pull/5008)
- Add UCMConnector for KV Cache Offloading. [#4411](https://github.com/vllm-project/vllm-ascend/pull/4411)
- Support async_scheduler and disable_padded_drafter_batch in eagle. [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893)
- Support pcp + mtp in full graph mode. [#4572](https://github.com/vllm-project/vllm-ascend/pull/4572)
- Enhance all-reduce skipping logic for MoE models in NPUModelRunner [#5329](https://github.com/vllm-project/vllm-ascend/pull/5329)
### Performance
Some general performance improvements:
- Add l2norm triton kernel [#4595](https://github.com/vllm-project/vllm-ascend/pull/4595)
- Add new pattern for AddRmsnormQuant with SP, which could only take effect in graph mode. [#5077](https://github.com/vllm-project/vllm-ascend/pull/5077)
- Add async exponential while model executing. [#4501](https://github.com/vllm-project/vllm-ascend/pull/4501)
- Remove the transpose step after attention and switch to transpose_batchmatmul [#5390](https://github.com/vllm-project/vllm-ascend/pull/5390)
- To optimize the performance in small batch size scenario, an attention operator with flash decoding function is offered, please refer to item 22 in [FAQs](https://docs.vllm.ai/projects/ascend/en/latest/faqs.html) to enable it.
### Other
- The OOM error on VL models is fixed now. We'll keep observing it; if you hit the OOM problem again, please submit an issue. [#5136](https://github.com/vllm-project/vllm-ascend/pull/5136)
- Fixed an accuracy bug of Qwen3-Next-MTP when batched inferring. [#4932](https://github.com/vllm-project/vllm-ascend/pull/4932)
- Fix npu-cpu offloading interface change bug. [#5290](https://github.com/vllm-project/vllm-ascend/pull/5290)
- Fix MHA model runtime error in aclgraph mode [#5397](https://github.com/vllm-project/vllm-ascend/pull/5397)
- Fix unsuitable moe_comm_type under ep=1 scenario [#5388](https://github.com/vllm-project/vllm-ascend/pull/5388)
### Deprecation & Breaking Changes
- `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is removed; `VLLM_ASCEND_ENABLE_PREFETCH_MLP` is recommended as its replacement, since the two were always enabled together. [#5272](https://github.com/vllm-project/vllm-ascend/pull/5272)
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is dropped now. [#5270](https://github.com/vllm-project/vllm-ascend/pull/5270)
- `VLLM_ASCEND_ENABLE_NZ` is disabled for float weight case, since we notice that the performance is not good in some float case. Feel free to set it to 2 if you make sure it works for your case. [#4878](https://github.com/vllm-project/vllm-ascend/pull/4878)
- `chunked_prefill_for_mla` in `additional_config` is dropped now. [#5296](https://github.com/vllm-project/vllm-ascend/pull/5296)
- `dump_config` in `additional_config` is renamed to `dump_config_path` and its type is changed from `dict` to `string`. [#5296](https://github.com/vllm-project/vllm-ascend/pull/5296)
### Dependencies
- vLLM version has been upgraded to 0.13.0 and drop 0.12.0 support. [#5146](https://github.com/vllm-project/vllm-ascend/pull/5146)
- Transformers version has been upgraded to >= 4.57.3 [#5250](https://github.com/vllm-project/vllm-ascend/pull/5250)
### Known Issues
- Qwen3-Next doesn't support long sequence scenarios, and `gpu-memory-utilization` should be limited according to the doc to run Qwen3-Next. We'll improve this in the next release.
- The functional break on Qwen3-Next when the input/output is around 3.5k/1.5k is fixed, but it introduces a performance regression. We'll fix it in the next release. [#5357](https://github.com/vllm-project/vllm-ascend/issues/5357)
- There is a precision issue with curl on ultra-short sequences in DeepSeek-V3.2. We'll fix it in the next release. [#5370](https://github.com/vllm-project/vllm-ascend/issues/5370)
## v0.11.0 - 2025.12.16
We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the official release for v0.11.0. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.11.0) to get started. We'll consider releasing a post version in the future if needed. This release note only contains the important changes and notes since v0.11.0rc3.
### Highlights
- Improved the performance for deepseek 3/3.1. [#3995](https://github.com/vllm-project/vllm-ascend/pull/3995)
- Fixed the accuracy bug for qwen3-vl. [#4811](https://github.com/vllm-project/vllm-ascend/pull/4811)
- Improved the performance of sample. [#4153](https://github.com/vllm-project/vllm-ascend/pull/4153)
- Eagle3 is back now. [#4721](https://github.com/vllm-project/vllm-ascend/pull/4721)
### Other
- Improved the performance for kimi-k2. [#4555](https://github.com/vllm-project/vllm-ascend/pull/4555)
- Fixed a quantization bug for deepseek3.2-exp. [#4797](https://github.com/vllm-project/vllm-ascend/pull/4797)
- Fixed qwen3-vl-moe bug under high concurrency. [#4658](https://github.com/vllm-project/vllm-ascend/pull/4658)
- Fixed an accuracy bug for Prefill Decode disaggregation case. [#4437](https://github.com/vllm-project/vllm-ascend/pull/4437)
- Fixed some bugs for EPLB [#4576](https://github.com/vllm-project/vllm-ascend/pull/4576) [#4777](https://github.com/vllm-project/vllm-ascend/pull/4777)
- Fixed the version incompatibility issue for openEuler docker image. [#4745](https://github.com/vllm-project/vllm-ascend/pull/4745)
### Deprecation announcement
- The LLMdatadist connector has been deprecated; it'll be removed in v0.12.0rc1
- Torchair graph has been deprecated; it'll be removed in v0.12.0rc1
- The Ascend scheduler has been deprecated; it'll be removed in v0.12.0rc1
### Upgrade notice
- torch-npu is upgraded to 2.7.1.post1. Please note that the package is pushed to the [pypi mirror](https://mirrors.huaweicloud.com/ascend/repos/pypi/torch-npu/), so it cannot be resolved as an automatic dependency. Please install it yourself.
- CANN is upgraded to 8.3.rc2.
### Known Issues
- Qwen3-Next doesn't support the expert parallel and MTP features in this release, and it will OOM if the input is too long. We'll improve this in the next release.
- DeepSeek 3.2 only works with torchair graph mode in this release. We'll make it work with aclgraph mode in the next release.
- Qwen2-audio doesn't work by default. Temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- CPU bind feature doesn't work if more than one vLLM instance is running on the same node.

View File

@@ -1,26 +0,0 @@
BasedOnStyle: Google
UseTab: Never
IndentWidth: 2
ColumnLimit: 120
# Force pointers to the type for C++.
DerivePointerAlignment: false
PointerAlignment: Left
# Reordering #include statements can (and currently will) introduce errors
SortIncludes: false
# Style choices
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
IndentPPDirectives: BeforeHash
IncludeCategories:
- Regex: '^<'
Priority: 4
- Regex: '^"(llvm|llvm-c|clang|clang-c|mlir|mlir-c)/'
Priority: 3
- Regex: '^"(qoda|\.\.)/'
Priority: 2
- Regex: '.*'
Priority: 1

View File

@@ -1 +0,0 @@
If you want to use the skills in this repo with Claude code, please copy the skills directory `.agents/skills` to this directory.

View File

@@ -1,9 +1,6 @@
# https://developers.google.com/gemini-code-assist/docs/customize-gemini-behavior-github
have_fun: false # Just review the code
memory_config:
disabled: false
code_review:
comment_severity_threshold: HIGH # Reduce quantity of comments
pull_request_opened:
help: true # Add a help comment to the PR
summary: true # Summarize the PR in a separate comment
summary: false # Don't summarize the PR in a separate comment

View File

@@ -1,90 +0,0 @@
# Pull Request Summary Style Guide
## Output Instructions
**IMPORTANT**: When doing PR review, you MUST output them in markdown code blocks so users can easily copy them:
1. **PR Title**: Output the generated title in a code block with triple backticks
2. **PR Summary**: Output the generated summary in a markdown code block with triple backticks
This allows users to directly copy the content without manual formatting.
## Pull Request Summary Format
The summary should follow the format:
```markdown
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
```
## Pull Request Title Format
The summary should also refresh the Pull Request Title to follow the format:
```txt
[Branch][Module][Action] Pull Request Title
```
- Branch: The branch name where the PR is based. If the base branch is main, this prefix can be omitted.
- Module: The module or component being changed. It includes but is not limited to the following:
- [Attention]
- [Ops]
- [Doc]
- [Test]
- [CI]
- [Benchmark]
- Action: The action being performed. It includes but is not limited to the following:
- [BugFix]
- [Feature]
- [Misc]
## Example Output Format
When providing a PR review, format your response like this:
**Suggested PR Title:**
```markdown
[Branch][Module][Action] Your generated title here
```
**Suggested PR Summary:**
```markdown
### What this PR does / why we need it?
Your analysis of what the PR does and why it's needed.
Fixes #issue_number
### Does this PR introduce _any_ user-facing change?
Your assessment of user-facing changes.
### How was this patch tested?
Your description of testing approach.
```
And please print your review suggestion in markdown format no matter the pull request description is empty or not.

View File

@@ -15,13 +15,13 @@
# This file is a part of the vllm-ascend project.
#
ARG PY_VERSION=3.11
FROM quay.io/ascend/manylinux:8.5.1-910b-manylinux_2_28-py${PY_VERSION}
FROM quay.io/ascend/manylinux:8.2.rc1-910b-manylinux_2_28-py${PY_VERSION}
ARG SOC_VERSION="ascend910b1"
ARG COMPILE_CUSTOM_KERNELS=1
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV SOC_VERSION=$SOC_VERSION
ENV COMPILE_CUSTOM_KERNELS=${COMPILE_CUSTOM_KERNELS}
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
@@ -32,7 +32,7 @@ COPY . /workspace/vllm-ascend/
# Install req
RUN python3 -m pip install -r vllm-ascend/requirements.txt --extra-index https://download.pytorch.org/whl/cpu/ && \
python3 -m pip install twine attrs psutil
python3 -m pip install twine
# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \

View File

@@ -1,5 +1,5 @@
name: 📚 User Story
description: Apply for an user story to be displayed on https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/index.html
description: Apply for an user story to be displayed on https://vllm-ascend.readthedocs.io/en/latest/community/user_stories/index.html
title: "[User Story]: "
labels: ["user-story"]
@@ -18,7 +18,7 @@ body:
A brief introduction about the background of your use case, like your scenario, hardware size etc.
- type: textarea
attributes:
label: Business Challenges
label: Bussiness Challenges
description: >
Tell us how what kind of challenge you faced in this user story.
- type: textarea
@@ -30,7 +30,7 @@ body:
attributes:
label: Extra Info
description: >
Any extra information you want to include in this story
Any extra infomation you want to include in this story
- type: markdown
attributes:
value: >

View File

@@ -9,7 +9,7 @@ body:
value: >
#### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+sort%3Acreated-desc+).
#### We also highly recommend you read https://docs.vllm.ai/projects/ascend/en/latest/user_guide/supported_models.html first to know which model already supported.
#### We also highly recommend you read https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html first to know which model already supported.
- type: textarea
attributes:
label: The model to consider.
@@ -21,7 +21,7 @@ body:
attributes:
label: The closest model vllm already supports.
description: >
Here is the list of models already supported by vllm: https://docs.vllm.ai/projects/ascend/en/latest/user_guide/supported_models.html . Which model is the most similar to the model you want to add support for?
Here is the list of models already supported by vllm: https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html . Which model is the most similar to the model you want to add support for?
- type: textarea
attributes:
label: What's your difficulty of supporting the model you want?

View File

@@ -32,9 +32,9 @@ body:
- [ ] Add release note to docs/source/user_guide/release_notes.md
- [ ] Update release version in README.md and README.zh.md (Getting Started and Branch section)
- [ ] Update release version in README.md and README.zh.md
- [ ] Update version info in docs/source/community/versioning_policy.md(Release compatibility matrix, Release window and Branch states section)
- [ ] Update version info in docs/source/community/versioning_policy.md
- [ ] Update contributor info in docs/source/community/contributors.md

View File

@@ -1,18 +1,21 @@
self-hosted-runner:
# Labels of self-hosted runner in array of strings.
labels:
- linux-aarch64-a2-0
- linux-aarch64-a2-1
- linux-aarch64-a2-2
- linux-aarch64-a2-4
- linux-aarch64-a2-8
- linux-arm64-npu-static-8
- linux-aarch64-310p-1
- linux-aarch64-310p-2
- linux-aarch64-310p-4
- ubuntu-24.04-arm
- linux-aarch64-a3-1
- linux-aarch64-a3-2
- linux-aarch64-a3-4
- linux-aarch64-a3-8
- linux-amd64-cpu-0
- linux-amd64-cpu-8
- linux-amd64-cpu-16
- linux-aarch64-a3-0
- linux-amd64-cpu-8-hk
- linux-amd64-cpu-16-hk
- linux-aarch64-a2b3-0
- linux-aarch64-a2b3-1
- linux-aarch64-a2b3-2
- linux-aarch64-a2b3-4

View File

@@ -0,0 +1,59 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/.github/scripts/cleanup_pr_body.sh
#!/bin/bash
set -eux
# ensure 3 arguments are passed
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <pr_number> <vllm_version> <vllm_commit>"
exit 1
fi
PR_NUMBER=$1
VLLM_VERSION=$2
VLLM_COMMIT=$3
OLD=/tmp/orig_pr_body.txt
NEW=/tmp/new_pr_body.txt
FINAL=/tmp/final_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove notes in pr description and add vLLM version and commit
sed -i '/<!--/,/-->/d' "${NEW}"
sed -i '/- vLLM .*$/d' "${NEW}"
{
echo ""
echo "- vLLM version: $VLLM_VERSION"
echo "- vLLM main: $VLLM_COMMIT"
} >> "${NEW}"
# Remove redundant empty lines
uniq "${NEW}" > "${FINAL}"
# Run this only if ${NEW} is different than ${OLD}
if ! cmp -s "${OLD}" "${FINAL}"; then
echo
echo "Updating PR body:"
echo
cat "${NEW}"
gh pr edit --body-file "${FINAL}" "${PR_NUMBER}"
else
echo "No changes needed"
fi

View File

@@ -8,8 +8,8 @@ documentation:
ci/build:
- changed-files:
- any-glob-to-any-file:
- '.github/actions/*.yaml'
- '.github/workflows/*.yaml'
- '.github/actions/*.yml'
- '.github/workflows/*.yml'
'module:tests':
- changed-files:

View File

@@ -0,0 +1,175 @@
name: 'accuracy test'
on:
workflow_call:
inputs:
vllm:
required: true
type: string
vllm-ascend:
required: false
type: string
default: main
runner:
required: true
type: string
image:
required: true
type: string
model_name:
required: true
type: string
upload:
required: false
type: boolean
default: false
jobs:
accuracy_tests:
runs-on: ${{ inputs.runner }}
name: ${{ inputs.model_name }} accuracy
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
env:
VLLM_USE_MODELSCOPE: True
# 1. If version specified (work_dispatch), do specified branch accuracy test
# 2. If no version (labeled PR), do accuracy test by default ref:
# The branch, tag or SHA to checkout. When checking out the repository that
# triggered a workflow, this defaults to the reference or SHA for that event.
# Otherwise, uses the default branch.
GHA_VLLM_ASCEND_VERSION: ${{ inputs.vllm-ascend }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set model name as output
id: set_output
run: |
echo "model_name=${{ inputs.model_name }}" >> $GITHUB_OUTPUT
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Resolve vllm-ascend version
run: |
VERSION_INPUT="${{ inputs.vllm-ascend }}"
if [[ "$VERSION_INPUT" == "latest" ]]; then
TAGS=$(git ls-remote --tags --sort=-v:refname https://github.com/vllm-project/vllm-ascend "v*" | cut -f2 | sed 's|refs/tags/||')
LATEST_TAG=$(echo "$TAGS" | head -n1)
if [[ -z "$LATEST_TAG" ]]; then
RESOLVED_VERSION="main"
else
RESOLVED_VERSION="$LATEST_TAG"
fi
else
RESOLVED_VERSION="$VERSION_INPUT"
fi
echo "GHA_VLLM_ASCEND_VERSION=$RESOLVED_VERSION" >> $GITHUB_ENV
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm-ascend
path: ./vllm-ascend
ref: ${{ env.GHA_VLLM_ASCEND_VERSION }}
- name: Install vllm-project/vllm-ascend
working-directory: ./vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Get vLLM commit hash and URL
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_COMMIT=$VLLM_COMMIT" >> $GITHUB_ENV
- name: Get vLLM-Ascend commit hash and URL
working-directory: ./vllm-ascend
run: |
VLLM_ASCEND_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_ASCEND_COMMIT=$VLLM_ASCEND_COMMIT" >> $GITHUB_ENV
- name: Collect version info
run: |
for dir in /usr/local/Ascend/ascend-toolkit/*; do
dname=$(basename "$dir")
if [ "$dname" != "latest" ]; then
TOOLKIT_DIR="$dname"
break
fi
done
INFO_FILE="/usr/local/Ascend/ascend-toolkit/${TOOLKIT_DIR}/$(uname -i)-linux/ascend_toolkit_install.info"
GHA_CANN_VERSION=$(grep "version=" "$INFO_FILE" \
| head -n1 \
| cut -d'=' -f2 \
| tr -d '"')
{
echo "GHA_CANN_VERSION=$GHA_CANN_VERSION"
pip show torch | grep "Version:" | awk '{print "GHA_TORCH_VERSION="$2}'
pip show torch_npu | grep "Version:" | awk '{print "GHA_TORCH_NPU_VERSION="$2}'
pip show vllm | grep "Version:" | awk '{print "GHA_VLLM_VERSION="$2}' | sed 's/+.*//'
} >> "$GITHUB_ENV"
- name: Run accuracy test
id: report
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
VLLM_VERSION: ${{ env.GHA_VLLM_VERSION }}
VLLM_COMMIT: ${{ env.VLLM_COMMIT }}
VLLM_ASCEND_VERSION: ${{ env.GHA_VLLM_ASCEND_VERSION || github.ref }}
VLLM_ASCEND_COMMIT: ${{ env.VLLM_ASCEND_COMMIT }}
CANN_VERSION: ${{ env.GHA_CANN_VERSION }}
TORCH_VERSION: ${{ env.GHA_TORCH_VERSION }}
TORCH_NPU_VERSION: ${{ env.GHA_TORCH_NPU_VERSION }}
run: |
model_base_name=$(basename ${{ inputs.model_name }})
markdown_name="${model_base_name}"
echo "markdown_name=$markdown_name" >> $GITHUB_OUTPUT
mkdir -p ./benchmarks/accuracy
pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \
--config ./tests/e2e/models/configs/${{ inputs.model_name }}.yaml
- name: Generate step summary
if: ${{ always() }}
run: |
cat ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md >> $GITHUB_STEP_SUMMARY
- name: Upload Report
if: ${{ inputs.upload == true }}
uses: actions/upload-artifact@v4
with:
name: "report-${{ env.GHA_VLLM_ASCEND_VERSION }}-${{ steps.report.outputs.markdown_name }}"
path: ./benchmarks/accuracy/${{ steps.report.outputs.markdown_name }}.md
if-no-files-found: warn
retention-days: 90
overwrite: true

View File

@@ -0,0 +1,199 @@
name: 'e2e test'
on:
workflow_call:
inputs:
vllm:
required: true
type: string
runner:
required: true
type: string
image:
required: true
type: string
type:
required: true
type: string
jobs:
e2e:
name: singlecard
runs-on: ${{ inputs.runner }}-1
container:
image: ${{ inputs.image }}
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
if: ${{ inputs.type == 'light' }}
run: |
pytest -sv tests/e2e/singlecard/test_aclgraph.py
pytest -sv tests/e2e/singlecard/test_quantization.py
pytest -sv tests/e2e/singlecard/test_vlm.py::test_multimodal_vl
- name: Run e2e test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
if: ${{ inputs.type == 'full' }}
run: |
# We found that if running aclgraph tests in batch, it will cause AclmdlRICaptureBegin error. So we run
# the test separately.
pytest -sv tests/e2e/singlecard/test_aclgraph.py
pytest -sv tests/e2e/singlecard/test_aclgraph_mem.py
pytest -sv tests/e2e/singlecard/test_ascend_scheduler.py
pytest -sv tests/e2e/singlecard/test_bge_model.py
pytest -sv tests/e2e/singlecard/test_camem.py
pytest -sv tests/e2e/singlecard/test_chunked.py
pytest -sv tests/e2e/singlecard/test_embedding.py
pytest -sv tests/e2e/singlecard/test_embedding_aclgraph.py
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/singlecard/test_profile_execute_duration.py
pytest -sv tests/e2e/singlecard/test_quantization.py
pytest -sv tests/e2e/singlecard/test_sampler.py
pytest -sv tests/e2e/singlecard/test_vlm.py
# ------------------------------------ v1 spec decode test ------------------------------------ #
pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py
pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_torchair_correctness.py
# Fix me: test_eagle_correctness OOM error
pytest -sv tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py
pytest -sv tests/e2e/singlecard/ops/
e2e-2-cards:
name: multicard
runs-on: ${{ inputs.runner }}-2
container:
image: ${{ inputs.image }}
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test (light)
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
if: ${{ inputs.type == 'light' }}
run: |
pytest -sv tests/e2e/multicard/test_qwen3_moe.py::test_models_distributed_Qwen3_MOE_TP2_WITH_EP
- name: Run vllm-project/vllm-ascend test (full)
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
if: ${{ inputs.type == 'full' }}
run: |
pytest -sv tests/e2e/multicard/test_data_parallel.py
pytest -sv tests/e2e/multicard/test_expert_parallel.py
pytest -sv tests/e2e/multicard/test_external_launcher.py
pytest -sv tests/e2e/multicard/test_single_request_aclgraph.py
pytest -sv tests/e2e/multicard/test_fused_moe_allgather_ep.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
# To avoid oom, we need to run the test in a single process.
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W8A8
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_sp_for_qwen3_moe
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen_Dense_with_flashcomm_v1
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen_Dense_with_prefetch_mlp_weight
pytest -sv tests/e2e/multicard/test_pipeline_parallel.py
pytest -sv tests/e2e/multicard/test_prefix_caching.py
pytest -sv tests/e2e/multicard/test_qwen3_moe.py
pytest -sv tests/e2e/multicard/test_torchair_graph_mode.py

View File

@@ -0,0 +1,72 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# This test will be triggered:
# - PR labeled with: 'accuracy-test' & 'ready-for-test'
name: ascend test / accuracy
on:
pull_request:
branches:
- 'main'
- '*-dev'
types: [ labeled, synchronize ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
run:
name: ""
strategy:
matrix:
# Only top series models should be listed in here
include:
- runner: a2-1
model_name: Qwen3-8B
- runner: a2-1
model_name: Qwen2.5-VL-7B-Instruct
- runner: a2-1
model_name: Qwen2-Audio-7B-Instruct
- runner: a2-2
model_name: Qwen3-30B-A3B
- runner: a2-2
model_name: Qwen3-VL-30B-A3B-Instruct
- runner: a2-2
model_name: DeepSeek-V2-Lite
fail-fast: false
# test will be triggered when tag 'accuracy-test' & 'ready-for-test'
if: >-
${{
contains(github.event.pull_request.labels.*.name, 'accuracy-test') &&
contains(github.event.pull_request.labels.*.name, 'ready-for-test')
}}
uses: ./.github/workflows/_accuracy_test.yaml
with:
vllm: v0.11.0
runner: linux-aarch64-${{ matrix.runner }}
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
model_name: ${{ matrix.model_name }}

View File

@@ -0,0 +1,57 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: format / pr body
on:
# The PR updated when PR opened and push new commits
pull_request_target:
types: [opened, synchronize]
branches:
- 'main'
permissions:
pull-requests: write
jobs:
update-description:
name: update vLLM version
runs-on: ubuntu-latest
steps:
- name: Get vLLM version
run: |
VLLM_COMMIT=v0.11.0
echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> $GITHUB_ENV
- name: Checkout repository
uses: actions/checkout@ff7abcd0c3c05ccf6adc123a8cd1fd4fb30fb493 # v4.2.2
- name: Set up Python
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
- name: Get vLLM release version
run: |
VLLM_VERSION=$(python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"')
echo "VLLM_VERSION=$VLLM_VERSION" >> $GITHUB_ENV
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
bash .github/format_pr_body.sh "${{ github.event.number }}" "${{ env.VLLM_VERSION }}" "${{ env.VLLM_COMMIT }}"

View File

@@ -0,0 +1,135 @@
name: 'image / openEuler / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p-openeuler / vllm-ascend:*-dev-310p-openeuler
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p-openeuler / vllm-ascend:v1.2.3rc1-310p-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
types: [ labeled ]
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_openeuler.yml'
- 'Dockerfile.310p.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend image build
# Only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64
# Push event or PR with both 'ready' and 'ready-for-test' labels
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
if: ${{ github.event_name == 'push' || (contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-310p-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p-openeuler
# - pre/post/dev: v0.7.1rc1-310p-openeuler/v0.7.1rc1-310p-openeuler/v0.7.1rc1.dev1-310p-openeuler/v0.7.1.post1-310p-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p-openeuler
type=ref,event=pr,suffix=-310p-openeuler
type=pep440,pattern={{raw}},suffix=-310p-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.310p.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false

View File

@@ -0,0 +1,131 @@
name: 'image / Ubuntu / 310p'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merge into main/*-dev ==> vllm-ascend:main-310p / vllm-ascend:*-dev-310p
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tag with v* (pep440 version) ===> vllm-ascend:v1.2.3-310p / vllm-ascend:v1.2.3rc1-310p
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
types: [ labeled ]
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_310p_ubuntu.yml'
- 'Dockerfile.310p'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend image build
# Only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64
# Push event or PR with both 'ready' and 'ready-for-test' labels
runs-on: ubuntu-latest
if: ${{ github.event_name == 'push' || (contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-310p
# - pre/post/dev: v0.7.1rc1-310p/v0.7.1rc1-310p/v0.7.1rc1.dev1-310p/v0.7.1.post1-310p, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-310p
type=ref,event=pr,suffix=-310p
type=pep440,pattern={{raw}},suffix=-310p
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 310p
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.310p
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@@ -0,0 +1,135 @@
name: 'image / openEuler / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tagged with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3-openeuler / vllm-ascend:v1.2.3rc1-a3-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
types: [ labeled ]
push:
# Publish image when tagging, the Dockerfile in the tag will be built as the tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_openeuler.yml'
- 'Dockerfile.a3.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend image build
# Only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64
# Push event or PR with both 'ready' and 'ready-for-test' labels
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
if: ${{ github.event_name == 'push' || (contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3-openeuler
# - pre/post/dev: v0.7.1rc1-a3-openeuler/v0.7.1rc1-a3-openeuler/v0.7.1rc1.dev1-a3-openeuler/v0.7.1.post1-a3-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3-openeuler
type=ref,event=pr,suffix=-a3-openeuler
type=pep440,pattern={{raw}},suffix=-a3-openeuler
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.a3.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@@ -0,0 +1,131 @@
name: 'image / Ubuntu / a3'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tagged with v* (pep440 version) ===> vllm-ascend:v1.2.3-a3|vllm-ascend:v1.2.3rc1-a3
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
types: [ labeled ]
push:
# Publish image when tagging, the Dockerfile in the tag will be built as the tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_a3_ubuntu.yml'
- 'Dockerfile.a3'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend image build
# Only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64
# Push event or PR with both 'ready' and 'ready-for-test' labels
runs-on: ubuntu-latest
if: ${{ github.event_name == 'push' || (contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-a3 is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-a3
# - pre/post/dev: v0.7.1rc1-a3/v0.7.1rc1-a3/v0.7.1rc1.dev1-a3/v0.7.1.post1-a3, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-a3
type=ref,event=pr,suffix=-a3
type=pep440,pattern={{raw}},suffix=-a3
flavor:
latest=false
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push a3
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile.a3
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@@ -0,0 +1,134 @@
name: 'image / openEuler'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merged into main/*-dev ==> vllm-ascend:main-openeuler / vllm-ascend:*-dev-openeuler
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tagged with v* (pep440 version) ===> vllm-ascend:v1.2.3-openeuler / vllm-ascend:v1.2.3rc1-openeuler
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
types: [ labeled ]
push:
# Publish image when tagging, the Dockerfile in the tag will be built as the tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend image build
# Only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64
# Push event or PR with both 'ready' and 'ready-for-test' labels
runs-on: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'ubuntu-latest' ||
'ubuntu-24.04-arm'
}}
if: ${{ github.event_name == 'push' || (contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N-openeuler is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1-openeuler
# - pre/post/dev: v0.7.1rc1-openeuler/v0.7.1rc1-openeuler/v0.7.1rc1.dev1-openeuler/v0.7.1.post1-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,suffix=-openeuler
type=ref,event=pr,suffix=-openeuler
type=pep440,pattern={{raw}},suffix=-openeuler
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/arm64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
file: Dockerfile.openEuler
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@@ -0,0 +1,131 @@
name: 'image / Ubuntu'
# This is a docker build check and publish job:
# 1. PR Triggered docker image build check
# - is for image build check
# - Enable on main/*-dev branch
# - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branches push trigger image publish
# - is for branch/dev/nightly image
# - commits are merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tags push trigger image publish
# - is for final release image
# - Publish when tagged with v* (pep440 version) ===> vllm-ascend:v1.2.3 / vllm-ascend:v1.2.3rc1
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
types: [ labeled ]
push:
# Publish image when tagging, the Dockerfile in the tag will be built as the tag image
branches:
- 'main'
- '*-dev'
tags:
- 'v*'
paths:
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend image build
# Only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64
# Push event or PR with both 'ready' and 'ready-for-test' labels
runs-on: ubuntu-latest
if: ${{ github.event_name == 'push' || (contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test')) }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
# TODO(yikun): add more hub image and a note on release policy for container image
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch job publishes per main/*-dev branch commits
# 2. main and dev pull_request is build only, so the tag pr-N is fine
# 3. only pep440 matched tag will be published:
# - v0.7.1 --> v0.7.1, latest
# - pre/post/dev: v0.7.1rc1/v0.7.1rc1/v0.7.1rc1.dev1/v0.7.1.post1, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch
type=ref,event=pr
type=pep440,pattern={{raw}}
flavor:
latest=true
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Publish - Login to Quay Container Registry
if: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ vars.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Build and push 910b
uses: docker/build-push-action@v6
with:
platforms: >-
${{
github.event_name == 'push' && github.repository_owner == 'vllm-project' &&
'linux/amd64,linux/arm64' ||
'linux/amd64'
}}
# use the current repo path as the build context, ensure .git is contained
context: .
file: Dockerfile
# only trigger when tag, branch/main push
push: ${{ github.event_name == 'push' && github.repository_owner == 'vllm-project' }}
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false


@@ -1,4 +1,4 @@
name: Merge Conflict Labeler
name: "Merge Conflict Labeler"
on:
# So that PRs touching the same files as the push are updated
push:


@@ -0,0 +1,18 @@
name: Pull Request Labeler
on: pull_request_target
jobs:
label:
name: Label
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- name: Label the PR
uses: actions/labeler@v6
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
configuration-path: .github/labeler.yml
sync-labels: true
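The referenced `.github/labeler.yml` maps label names to matching rules. A minimal sketch of one entry, written in the regex-list style that the labeler configuration shown later in this comparison uses (the label name and pattern below are only illustrative), would be:

```yaml
# Hypothetical labeler.yml entry: PRs matching the pattern receive the "aclgraph" label.
aclgraph:
  - '/(aclgraph)/i'
```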


@@ -0,0 +1,17 @@
{
"problemMatcher": [
{
"owner": "ruff",
"pattern": [
{
"regexp": "^(.+?):(\\d+):(\\d+): (\\w+): (.+)$",
"file": 1,
"line": 2,
"column": 3,
"code": 4,
"message": 5
}
]
}
]
}
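For reference, the regexp above is shaped to capture diagnostics of the following hypothetical form (whether ruff emits exactly this layout depends on its output settings):

```text
vllm_ascend/platform.py:42:7: F401: 'os' imported but unused
```

Here group 1 is the file (`vllm_ascend/platform.py`), groups 2 and 3 are the line (42) and column (7), group 4 is the code (`F401`), and the remainder is the message.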


@@ -0,0 +1,118 @@
name: 'e2e test / multi-dp'
on:
schedule:
- cron: "0 */4 * * *"
workflow_dispatch:
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 8 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e:
# This is a runner with no NPU for k8s controller
runs-on: linux-aarch64-a3-0
container:
image: m.daocloud.io/quay.io/ascend/cann:8.3.rc2-a3-ubuntu22.04-py3.11
env:
KUBECONFIG: /tmp/kubeconfig
KUBECTL: /root/.cache/.kube/kubectl
NAMESPACE: vllm-project
LEADER_POD: vllm-0
steps:
- name: Install system dependencies
run: |
# configure apt and pip source
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y && apt-get install -y git curl
TOKEN=`echo -n "x-access-token:${{ secrets.ADMIN_PTA }}" | base64`
git config --global http.https://gh-proxy.test.osinfra.cn/.extraheader "AUTHORIZATION: basic $TOKEN"
- name: Install kubectl
run: |
install -o root -g root -m 0755 $KUBECTL /usr/local/bin/kubectl
# get kubeconfig from secret
echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > $KUBECONFIG
- name: Checkout code
uses: actions/checkout@v4
- name: Prepare scripts
run: |
# prepare for lws entrypoint scripts
install -D tests/e2e/multi_node/scripts/run.sh /root/.cache/tests/run.sh
- name: Launch cluster
run: |
kubectl apply -f tests/e2e/multi_node/scripts/lws.yaml
- name: Waiting for pod ready
run: |
echo "waiting for Pod [$LEADER_POD] in namespace [$NAMESPACE] to Ready..."
while true; do
# get pod status
READY_STATUS=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}')
if [[ "$READY_STATUS" == "true" ]]; then
echo "✅ Pod [$LEADER_POD] is Ready!"
break
else
echo "Pod [$LEADER_POD] not ready, waiting..."
sleep 3
fi
done
- name: Stream logs and monitor pod health
run: |
set -euo pipefail
echo "🚀 Start streaming logs for Pod [$LEADER_POD] ..."
kubectl logs -f "$LEADER_POD" -n "$NAMESPACE" &
LOG_PID=$!
echo "Start monitoring Pod [$LEADER_POD] status ..."
while true; do
STATUS=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
if [[ "$STATUS" != "Running" && "$STATUS" != "Succeeded" ]]; then
echo "❌ Pod [$LEADER_POD] exited abnormally with status: $STATUS"
kubectl describe pod "$LEADER_POD" -n "$NAMESPACE" || true
kubectl logs "$LEADER_POD" -n "$NAMESPACE" --previous --all-containers || true
kill $LOG_PID || true
exit 1
fi
sleep 5
done &
MONITOR_PID=$!
wait $LOG_PID || true
kill $MONITOR_PID || true
- name: Generate summary
if: always()
run: |
if [ -f "/root/.cache/test_summary.md" ]; then
cat /root/.cache/test_summary.md >> "$GITHUB_STEP_SUMMARY"
else
echo "No summary file found." >> "$GITHUB_STEP_SUMMARY"
fi
- name: Post process
if: always()
run: |
kubectl get pods -n $NAMESPACE
kubectl delete -f tests/e2e/multi_node/scripts/lws.yaml


@@ -15,7 +15,7 @@
# limitations under the License.
#
name: Performance Schedule Test
name: 'ascend test / performance'
# This workflow runs nightly benchmarks for vllm-ascend.
on:
@@ -46,16 +46,17 @@ jobs:
test:
if: ${{ contains(github.event.pull_request.labels.*.name, 'performance-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}
name: Benchmarks/vLLM=${{ matrix.vllm_branch }}, vLLM-Ascend=${{ matrix.vllm_ascend_branch }}, use_v1=${{ matrix.vllm_use_v1 }}
runs-on: 'linux-arm64-npu-static-8'
strategy:
matrix:
include:
- vllm_branch: v0.18.0
- vllm_branch: v0.11.0
vllm_ascend_branch: main
vllm_use_v1: 1
max-parallel: 1
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
volumes:
- /usr/local/dcmi:/usr/local/dcmi
- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
@@ -72,6 +73,7 @@ jobs:
VLLM_USE_MODELSCOPE: True
ES_OM_DOMAIN: ${{ secrets.ES_OM_DOMAIN }}
ES_OM_AUTHORIZATION: ${{ secrets.ES_OM_AUTHORIZATION }}
VLLM_USE_V1: ${{ matrix.vllm_use_v1 }}
steps:
- name: Check npu and CANN info
run: |
@@ -95,12 +97,12 @@ jobs:
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
@@ -130,11 +132,11 @@ jobs:
- name: Generate step summary
if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
run: |
cat ./benchmarks/results/benchmark_results.md >> "$GITHUB_STEP_SUMMARY"
cat ./benchmarks/results/benchmark_results.md >> $GITHUB_STEP_SUMMARY
- name: Upload benchmark artifacts
if: github.event_name != 'schedule' && github.event_name != 'workflow_dispatch'
uses: actions/upload-artifact@v7
uses: actions/upload-artifact@v4
with:
name: "benchmark-performance-${{ matrix.vllm_branch }}-${{ matrix.vllm_ascend_branch }}-report"
path: ./benchmarks/results/benchmark_results.md
@@ -172,9 +174,9 @@ jobs:
commit_id=${line%% *}
commit_title=${line#* }
git checkout "$commit_id"
commit_time=$(git show -s --format=%cd "$commit_id" --date=iso-strict)
commit_time_no_tz="${commit_time::19}"
git checkout $commit_id
commit_time=$(git show -s --format=%cd $commit_hash --date=iso-strict)
commit_time_no_tz=${commit_time::19}
pip install -e .
echo "------------------------"
@@ -191,13 +193,14 @@ jobs:
ERROR_MSG="Benchmark failed to run"
fi
# send the result to es
escli add --vllm_branch "${{ matrix.vllm_branch }}" \
--vllm_ascend_branch "${{ matrix.vllm_ascend_branch }}" \
--commit_id "$commit_id" \
escli add --vllm_branch ${{ matrix.vllm_branch }} \
--vllm_ascend_branch ${{ matrix.vllm_ascend_branch }} \
--commit_id $commit_id \
--commit_title "$commit_title" \
--created_at "$commit_time_no_tz" \
--res_dir ./benchmarks/results \
--error "$ERROR_MSG" \
--extra_feat '{"VLLM_USE_V1": "${{ matrix.vllm_use_v1 }}"}'
rm -rf ./benchmarks/results
cd -
done < commit_log.txt


@@ -0,0 +1,43 @@
name: pre-commit
on:
workflow_call:
inputs:
vllm:
required: true
type: string
permissions:
contents: read
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: "3.11"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
path: ./vllm-empty
ref: ${{ inputs.vllm }}
- name: Install vllm
working-directory: vllm-empty
run: |
pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=empty pip install .
- name: Install vllm-ascend dev
run: |
pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
env:
SHELLCHECK_OPTS: "--exclude=SC2046,SC2006,SC2086" # Exclude SC2046, SC2006, SC2086 for actionlint
with:
extra_args: --all-files --hook-stage manual
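Assuming the repository's `.pre-commit-config.yaml` is unchanged, contributors can roughly reproduce this job locally with the standard pre-commit CLI; this is a sketch, not part of the workflow:

```bash
pip install pre-commit
# Mirror the workflow's shellcheck exclusions, then run all hooks including manual-stage ones.
SHELLCHECK_OPTS="--exclude=SC2046,SC2006,SC2086" \
  pre-commit run --all-files --hook-stage manual
```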


@@ -0,0 +1,75 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: build / sdist
on:
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/release_code.yml'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
tags:
- 'v*'
jobs:
build:
name: release code
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@ff7abcd0c3c05ccf6adc123a8cd1fd4fb30fb493 # v4.2.2
- name: Print
run: |
lscpu
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python3 -m pip install twine setuptools_scm
- name: Generate tar.gz
run: |
python3 setup.py sdist
ls dist
- name: Archive tar.gz
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-src
path: dist/*
- name: Release
if: startsWith(github.ref, 'refs/tags/')
run: |
python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}


@@ -0,0 +1,125 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: build / wheel
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/release_whl.yml'
- '.github/Dockerfile.buildwheel'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
tags:
- 'v*'
jobs:
build:
name: build and release wheel
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
# PR only trigger latest version
python-version: ${{ fromJSON(
(github.event_name == 'pull_request' && '["3.11"]') ||
'["3.9", "3.10", "3.11"]'
) }}
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@ff7abcd0c3c05ccf6adc123a8cd1fd4fb30fb493 # v4.2.2
- name: Print
run: |
lscpu
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build wheel
run: |
ls
docker build -f ./.github/Dockerfile.buildwheel \
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel:v1 .
docker run --rm \
-u $(id -u):$(id -g) \
-v $(pwd):/outpwd \
wheel:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Set up Python ${{ matrix.python-version }}
if: startsWith(github.ref, 'refs/tags/')
uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so" \
--exclude "liberror_manager.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Archive wheel
uses: actions/upload-artifact@v4
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/*
- name: Release
if: startsWith(github.ref, 'refs/tags/')
run: |
python3 -m pip install twine
python3 -m twine upload --verbose dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}


@@ -0,0 +1,26 @@
name: PR Reminder Comment Bot
permissions:
pull-requests: write
on:
pull_request_target:
types: [opened]
jobs:
pr_reminder:
runs-on: ubuntu-latest
steps:
- name: Remind to run full CI on PR
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body: '👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:\n\n' +
'- A PR should do only one thing, smaller PRs enable faster reviews.\n' +
'- Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.\n' +
'- Write the commit message by fulfilling the PR description to help reviewer and future developers understand.\n\n' +
'If CI fails, you can run linting and testing checks locally according to [Contributing](https://vllm-ascend.readthedocs.io/zh-cn/latest/developer_guide/contribution/index.html) and [Testing](https://vllm-ascend.readthedocs.io/zh-cn/latest/developer_guide/contribution/testing.html).'
})
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


@@ -0,0 +1,100 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'e2e test / a3-test'
on:
workflow_call:
pull_request:
types: [ labeled ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 8 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e:
# only trigger e2e test after lint passed and the change is e2e related with pull request.
if: ${{ contains(github.event.pull_request.labels.*.name, 'dist-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'workflow_dispatch' }}
strategy:
matrix:
os: [linux-aarch64-a3-8]
vllm_version: [v0.11.0]
name: vLLM Ascend test
runs-on: ${{ matrix.os }}
container:
image: m.daocloud.io/quay.io/ascend/cann:8.3.rc2-a3-ubuntu22.04-py3.11
env:
DEBIAN_FRONTEND: noninteractive
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
run: |
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend test for V1 Engine
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
run: |
# TODO: enable more tests
pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe


@@ -15,7 +15,7 @@
# This file is a part of the vllm-ascend project.
#
name: Doc Test
name: 'ascend test / doctest'
on:
workflow_dispatch:
@@ -23,13 +23,15 @@ on:
branches:
- 'main'
- '*-dev'
- 'releases/v*'
paths:
# If we are changing the doctest we should do a PR test
- '.github/workflows/labled_doctest.yaml'
- '.github/workflows/vllm_ascend_doctest.yaml'
- 'tests/e2e/doctests/**'
- 'tests/e2e/common.sh'
- 'tests/e2e/run_doctests.sh'
schedule:
# Runs every 12 hours
- cron: '0 */12 * * *'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
@@ -44,11 +46,11 @@ jobs:
# Each version should be tested
fail-fast: false
matrix:
vllm_version: [releases-v0.13.0, releases-v0.13.0-openeuler, main, main-openeuler]
vllm_verison: [v0.9.1-dev, v0.9.1-dev-openeuler, main, main-openeuler]
name: vLLM Ascend test
runs-on: linux-aarch64-a2b3-1
runs-on: linux-aarch64-a2-1
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:${{ matrix.vllm_version }}
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:${{ matrix.vllm_verison }}
steps:
- name: Check NPU/CANN and git info
run: |
@@ -64,7 +66,7 @@ jobs:
git --no-pager log -1 || true
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
uses: actions/checkout@v4
- name: Run vllm-ascend/tests/e2e/run_doctests.sh
run: |


@@ -15,15 +15,16 @@
# This file is a part of the vllm-ascend project.
#
name: E2E-Light
name: 'ascend test'
on:
push:
branches:
- 'main'
pull_request:
branches:
- 'main'
- '*-dev'
- 'releases/v*'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
@@ -39,29 +40,23 @@ concurrency:
jobs:
lint:
uses: ./.github/workflows/_pre_commit.yml
uses: ./.github/workflows/pre-commit.yml
with:
vllm: v0.18.0
vllm: v0.11.0
changes:
runs-on: linux-aarch64-a2b3-0
runs-on: ubuntu-latest
outputs:
e2e_tracker: ${{ steps.filter.outputs.e2e_tracker }}
ut_tracker: ${{ steps.filter.outputs.ut_tracker }}
_310_tracker: ${{ steps.filter.outputs._310_tracker }}
steps:
- name: Setup git proxy
run: |
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
# NOTE: Do not update the version of checkout; there are some issues on self-hosted runners with higher versions
- uses: actions/checkout@v6
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
e2e_tracker:
- '.github/workflows/pr_test_light.yaml'
- '.github/workflows/_e2e_test.yaml'
- '.github/workflows/vllm_ascend_test.yaml'
- 'vllm_ascend/**'
- 'csrc/**'
- 'cmake/**'
@@ -74,35 +69,74 @@ jobs:
- 'packages.txt'
ut_tracker:
- 'tests/ut/**'
- '.github/workflows/pr_test_light.yaml'
_310_tracker:
- 'vllm_ascend/_310p/**'
- 'tests/e2e/310p/**'
- 'vllm_ascend/worker/model_runner_v1.py'
- 'vllm_ascend/attention/attention_v1.py'
- 'vllm_ascend/ops/fused_moe/**'
- 'CMakeLists.txt'
ut:
needs: [lint, changes]
name: unit test
# only trigger unit test after lint passed and the change is e2e and ut related.
if: ${{ needs.lint.result == 'success' && (needs.changes.outputs.e2e_tracker == 'true' || needs.changes.outputs.ut_tracker == 'true') }}
runs-on: ubuntu-22.04-arm
container:
image: quay.io/ascend/cann:8.2.rc1-910b-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
strategy:
matrix:
vllm_version: [v0.18.0]
uses: ./.github/workflows/_unit_test.yaml
with:
vllm: ${{ matrix.vllm_version }}
runner: linux-amd64-cpu-8-hk
image: quay.nju.edu.cn/ascend/cann:8.5.1-910b-ubuntu22.04-py3.11
type: pr
vllm_version: [v0.11.0]
steps:
- name: Install packages
run: |
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty python3 -m pip install .
python3 -m pip uninstall -y triton
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install vllm-project/vllm-ascend
run: |
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/arm64-linux/devlib
python3 -m pip install -r requirements-dev.txt
python3 -m pip install -v .
- name: Run unit test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
TORCH_DEVICE_BACKEND_AUTOLOAD: 0
run: |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/arm64-linux/devlib
pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut \
--ignore tests/ut/attention/test_attention_v1.py
- name: Upload coverage to Codecov
# only upload coverage when commits merged
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
with:
flags: unittests
name: vllm-ascend
verbose: true
e2e-light:
name: e2e-light
strategy:
matrix:
vllm_version: [v0.18.0]
vllm_version: [v0.11.0]
# Note (yikun): If CI resource are limited we can split job into two chain jobs
needs: [lint, changes]
# only trigger e2e test after lint passed and the change is e2e related with pull request.
@@ -110,6 +144,6 @@ jobs:
uses: ./.github/workflows/_e2e_test.yaml
with:
vllm: ${{ matrix.vllm_version }}
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11
contains_310: ${{ needs.changes.outputs._310_tracker == 'true' }}
runner: linux-aarch64-a2
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
type: light


@@ -0,0 +1,117 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'e2e test / 310p-test'
on:
push:
tags:
- 'v*'
schedule:
# Runs every 6 hours
- cron: '0 */6 * * *'
pull_request:
types: [ labeled ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 4 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e:
# e2e-310p-test will be triggered when tag 'e2e-310p-test' & 'ready-for-test' or schedule job
if: >-
${{
(contains(github.event.pull_request.labels.*.name, 'e2e-310p-test')) &&
contains(github.event.pull_request.labels.*.name, 'ready-for-test') ||
github.event_name == 'schedule' || github.event_name == 'push'
}}
strategy:
max-parallel: 2
matrix:
os: [linux-aarch64-310p-1, linux-aarch64-310p-4]
vllm_version: [v0.11.0]
name: 310p e2e test
runs-on: ${{ matrix.os }}
container:
# TODO(yikun): Remove m.daocloud.io prefix when infra proxy ready
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-310p-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_version }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
run: |
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
export SOC_VERSION=ASCEND310P3
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run e2e test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
run: |
if [[ "${{ matrix.os }}" == "linux-aarch64-310p-1" ]]; then
pytest -sv tests/e2e/310p/test_offline_inference_310p.py
else
pytest -sv tests/e2e/310p/test_offline_inference_parallel_310p.py
fi


@@ -14,14 +14,13 @@
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: E2E-Full
name: 'ascend test / full'
on:
pull_request:
branches:
- 'main'
- '*-dev'
- 'releases/v*'
types: [ labeled, synchronize ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
@@ -39,24 +38,19 @@ concurrency:
jobs:
changes:
runs-on: linux-aarch64-a2b3-0
runs-on: ubuntu-latest
if: ${{ contains(github.event.pull_request.labels.*.name, 'ready') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') }}
outputs:
e2e_tracker: ${{ steps.filter.outputs.e2e_tracker }}
ut_tracker: ${{ steps.filter.outputs.ut_tracker }}
steps:
- name: Setup git proxy
run: |
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
# NOTE: Do not update the version of checkout; there are some issues on self-hosted runners with higher versions
- uses: actions/checkout@v6
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
e2e_tracker:
- '.github/workflows/pr_test_full.yaml'
- '.github/workflows/vllm_ascend_test.yaml'
- '.github/workflows/_e2e_test.yaml'
- 'vllm_ascend/**'
- 'csrc/**'
@@ -75,12 +69,12 @@ jobs:
name: e2e-full
strategy:
matrix:
vllm_version: [v0.18.0]
vllm_version: [v0.11.0]
needs: [changes]
if: ${{ needs.changes.outputs.e2e_tracker == 'true' || needs.changes.outputs.e2e_tracker == true }}
if: ${{ needs.changes.outputs.e2e_tracker == 'true' }}
uses: ./.github/workflows/_e2e_test.yaml
with:
vllm: ${{ matrix.vllm_version }}
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11
contains_310: false
runner: linux-aarch64-a2
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
type: full


@@ -14,12 +14,12 @@
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: vLLM Main Schedule Test
name: 'ascend test / vllm main'
on:
# Run full e2e tests UTC+8: 10am, 16pm, 22pm, 4am
# Run 1-card and 2-cards e2e tests per 2h
schedule:
- cron: '0 2,8,14,20 * * *'
- cron: '0 */2 * * *'
workflow_dispatch:
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
@@ -29,11 +29,17 @@ defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 4 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e-test:
uses: ./.github/workflows/_e2e_test.yaml
with:
vllm: main
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11
contains_310: false
runner: linux-aarch64-a2
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
type: full


@@ -0,0 +1,177 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# This test will be triggered:
# 1. schedule
# 2. pull_request change the related files
# 3. workflow_dispatch with models input
name: ascend test / models
on:
schedule:
# Runs every 6 hours
- cron: '0 */6 * * *'
pull_request:
branches:
- 'main'
- '*-dev'
paths:
- '.github/workflows/vllm_ascend_test_models.yaml'
- 'tests/e2e/models/test_lm_eval_correctness.py'
workflow_dispatch:
inputs:
vllm-ascend-version:
description: 'vllm-ascend:'
required: true
type: choice
# Current supported vLLM versions
options:
- latest
- main
default: main
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
run:
strategy:
matrix:
include:
- model_name: Qwen3-8B
runner: a2-1
- model_name: Qwen2.5-VL-7B-Instruct
runner: a2-1
- model_name: Qwen2-Audio-7B-Instruct
runner: a2-1
- model_name: Qwen3-30B-A3B
runner: a2-2
- model_name: Qwen3-VL-30B-A3B-Instruct
runner: a2-2
- model_name: DeepSeek-V2-Lite
runner: a2-2
fail-fast: false
uses: ./.github/workflows/_accuracy_test.yaml
with:
vllm: v0.11.0
runner: linux-aarch64-${{ matrix.runner }}
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
model_name: ${{ matrix.model_name }}
upload: ${{ github.event_name == 'workflow_dispatch' && github.event.inputs.vllm-ascend-version == 'latest' }}
create_pr:
runs-on: ubuntu-latest
needs: run
if: ${{ github.event_name == 'workflow_dispatch' && github.event.inputs.vllm-ascend-version == 'latest' }}
env:
UPSTREAM_REPO: vllm-project/vllm-ascend
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: vllm-ascend-ci/vllm-ascend
token: ${{ secrets.PAT_TOKEN }}
ref: main
- name: Add upstream remote
run: |
git remote add upstream https://github.com/${{ env.UPSTREAM_REPO }}.git
git fetch upstream
git remote -v
- name: Set Git user info dynamically
run: |
git config user.name "${{ github.actor }}"
git config user.email "${{ github.actor }}@users.noreply.github.com"
- name: Create or switch to branch
run: |
TIMESTAMP=$(date +%Y%m%d%H%M%S)
BRANCH_NAME="auto-pr/accuracy-report-${TIMESTAMP}"
echo "BRANCH_NAME=${BRANCH_NAME}" >> $GITHUB_ENV
git checkout -B "${BRANCH_NAME}" upstream/main
- name: Download only current run reports
uses: actions/download-artifact@v5
with:
path: ./docs/source/developer_guide/evaluation/accuracy_report
pattern: report-*
github-token: ${{ secrets.GITHUB_TOKEN }}
run-id: ${{ github.run_id }}
- name: Delete old report
run: |
find ./docs/source/developer_guide/evaluation/accuracy_report -maxdepth 1 -type f -name '*.md' ! -name 'index.md' -delete
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 2 -type f -name '*.md' -exec mv -f {} ./docs/source/developer_guide/evaluation/accuracy_report \;
find ./docs/source/developer_guide/evaluation/accuracy_report -mindepth 1 -type d -empty -delete
- name: Update accuracy_report/index.md
run: |
REPORT_DIR="./docs/source/developer_guide/evaluation/accuracy_report"
INDEX_MD="$REPORT_DIR/index.md"
{
echo "# Accuracy Report"
echo ""
echo ":::{toctree}"
echo ":caption: Accuracy Report"
echo ":maxdepth: 1"
for report in "$REPORT_DIR"/*.md; do
filename="$(basename "$report" .md)"
if [ "$filename" != "index" ]; then
echo "$filename"
fi
done
echo ":::"
} > "$INDEX_MD"
- name: push accuracy report
env:
GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
run: |
git add ./docs/source/developer_guide/evaluation/accuracy_report/*.md
git commit -s -m "[Doc] Update accuracy reports for ${{ env.BRANCH_NAME }}"
git push -f origin "${{ env.BRANCH_NAME }}"
- name: Create PR in upstream via API
uses: actions/github-script@v8
with:
github-token: ${{ secrets.PAT_TOKEN }}
script: |
const pr = await github.rest.pulls.create({
owner: 'vllm-project',
repo: 'vllm-ascend',
head: `vllm-ascend-ci:${{ env.BRANCH_NAME }}`,
base: 'main',
title: `[Doc] Update accuracy reports for ${{ env.BRANCH_NAME }}`,
body: `The accuracy results running on NPU Atlas A2 have changed, updating reports for: All models
- [Workflow run][1]
[1]: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}`
});
core.info(`Created PR #${pr.data.number}`);
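As an illustration of the "Update accuracy_report/index.md" step above: for a run that downloaded reports for, say, `DeepSeek-V2-Lite` and `Qwen3-8B` (hypothetical file names for this sketch), the regenerated `index.md` would look roughly like:

```markdown
# Accuracy Report

:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
DeepSeek-V2-Lite
Qwen3-8B
:::
```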


@@ -0,0 +1,112 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
name: 'e2e test / pd-disaggregation'
on:
schedule:
# Runs at 23:00 UTC (7:00 AM Beijing) every day
- cron: '0 23 * * *'
pull_request:
types: [ labeled ]
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only 1 job can runs on static-8-01-cards
concurrency:
group: static-8-01-cards
cancel-in-progress: false
jobs:
prefilling-decoding-disaggregation:
# pd-test will be triggered when tag 'pd-test' & 'ready-for-test' or schedule job
if: ${{ contains(github.event.pull_request.labels.*.name, 'pd-test') && contains(github.event.pull_request.labels.*.name, 'ready-for-test') || github.event_name == 'schedule' }}
strategy:
matrix:
vllm_verison: [
main,
v0.9.1
]
name: vLLM Ascend prefilling decoding disaggregation test
runs-on: linux-arm64-npu-static-8
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-910b-ubuntu22.04-py3.11
volumes:
- /usr/local/dcmi:/usr/local/dcmi
- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
- /usr/local/Ascend/driver/:/usr/local/Ascend/driver/
# Use self-host cache speed up pip and model download
- /home/action/.cache:/github/home/.cache/
options: >-
--device /dev/davinci0
--device /dev/davinci1
--device /dev/davinci_manager
--device /dev/devmm_svm
--device /dev/hisi_hdc
env:
VLLM_USE_MODELSCOPE: True
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
# keep using tuna's proxy since linux-arm64-npu-static-8 is in another region
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y
apt install git -y
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v4
with:
repository: vllm-project/vllm
ref: ${{ matrix.vllm_verison }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
- name: Run vllm-project/vllm-ascend PD Disaggregation edge test
run: |
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
bash tests/e2e/pd_disaggreate/run_edge_case_test.sh

.github/CODEOWNERS

@@ -1,65 +0,0 @@
# See https://help.github.com/articles/about-codeowners/
# for more info about CODEOWNERS file
# Infra, CI
/.gemini @wangxiyuan @Yikun
/.github @wangxiyuan @Yikun
/tools @wangxiyuan @Yikun
/.gitignore @wangxiyuan
/.gitmodules @wangxiyuan @zzzzwwjj
/.pre-commit-config.yaml @wangxiyuan
/codecov.yml @wangxiyuan
/Dockerfile* @wangxiyuan
/format.sh @wangxiyuan
/mypy.ini @wangxiyuan
/requirements* @wangxiyuan
/setup.py @wangxiyuan
/typos.toml @wangxiyuan
# benchmark
/benchmarks @wangxiyuan
# docs
/docs @wangxiyuan @Yikun @LCAIZJ
/.readthedocs.yaml @wangxiyuan @Yikun
/README* @wangxiyuan @Yikun
# example
/examples @wangxiyuan
# tests
/tests @wangxiyuan
# c++ source code
/cmake @zzzzwwjj
/csrc @zzzzwwjj
/CMakeLists.txt @zzzzwwjj
# python source code
/vllm_ascend/attention @weijinqian0 @whx-sjtu
/vllm_ascend/compilation @yiz-liu
/vllm_ascend/core @wangxiyuan @MengqingCao
/vllm_ascend/device @weijinqian0 @zzzzwwjj
/vllm_ascend/device_allocator @wangxiyuan @weijinqian0
/vllm_ascend/distributed @MengqingCao @LCAIZJ
/vllm_ascend/eplb @wangxiyuan
/vllm_ascend/kv_offload @nalinaly
/vllm_ascend/lora @paulyu12
/vllm_ascend/model_loader @wangxiyuan
/vllm_ascend/ops @zzzzwwjj @realliujiaxu @whx-sjtu
/vllm_ascend/patch @wangxiyuan
/vllm_ascend/quantization @wangxiyuan
/vllm_ascend/sample @realliujiaxu @whx-sjtu
/vllm_ascend/spec_decode @wangxiyuan
/vllm_ascend/worker @MengqingCao
/vllm_ascend/xlite @wangxiyuan
/vllm_ascend/ascend_config.py @wangxiyuan
/vllm_ascend/ascend_forward_context.py @wangxiyuan
/vllm_ascend/batch_invariant.py @wangxiyuan
/vllm_ascend/cpu_binding.py @wangxiyuan
/vllm_ascend/envs.py @wangxiyuan
/vllm_ascend/flash_common3_context.py @wangxiyuan
/vllm_ascend/meta_registration.py @wangxiyuan
/vllm_ascend/platform.py @wangxiyuan
/vllm_ascend/profiling_config.py @wangxiyuan
/vllm_ascend/utils.py @wangxiyuan


@@ -1,74 +0,0 @@
core-features:
- '/((pd|(prefill[- ]?decode))\s+disaggregation|kv cache pool|aclgraph|async scheduler|cpu binding|quantization)/i'
pd-disaggregation:
- '/((pd|(prefill[- ]?decode))\s+disaggregation)/i'
kv-cache-pool:
- '/(kv cache pool)/i'
aclgraph:
- '/(aclgraph)/i'
async-scheduler:
- '/(async scheduler)/i'
cpu-binding:
- '/(cpu binding)/i'
quantization:
- '/(quantization)/i'
advanced_features:
- '/(long sequence|dpc|pcp|mtp|speculative decode)/i'
long-seq:
- '/(long sequence|dpc|pcp)/i'
mtp/speculative-decode:
- '/(mtp|speculative decode)/i'
eplb:
- '/(eplb)/i'
llm-model:
- '/(deepseek[- ]*(r1|v3(\.2)?)\S*|(kimi k2|kimik2|kimi-k2)(?!\.5)|glm5|qwen3-(?:235b|480b)\S*|Qwen3-(?:32B|8B|30B)\S*|qwen3 next|glm\s*4\.(?![^v\s]*v)\S*)/i'
deepseek:
- '/(deepseek[- ]*(r1|v3(\.2)?)\S*)/i'
kimi-k2:
- '/((kimi k2|kimik2|kimi-k2)(?!\.5))/i'
kimi-k2.5:
- '/((kimi k2\.5|kimik2\.5|kimi-k2\.5))/i'
glm5:
- '/(glm5)/i'
qwen3-moe:
- '/(Qwen3-(?:235B|480B)\S*)/i'
qwen3-dense:
- '/(Qwen3-(?:32B|8B|30B)\S*)/i'
qwen3-next:
- '/(qwen3-next)/i'
glm-4:
- '/(glm\s*4\.(?![^v\s]*v)\S*)/i'
multi-modality-generate:
- '/(seedance\S*|seedream\S*|wan\d[\d.]*|hunyuan\S*|fLux\S*|kimi k2\.5|kimi-k2\.5|kimik2\.5|minimax\S*|qwen-image\S*)/i'
seedance:
- '/(seedance\S*)/i'
seedream:
- '/(seedream\S*)/i'
wan:
- '/(wan\d[\d.]*)/i'
hunyuan:
- '/(hunyuan\S*)/i'
fLux:
- '/(fLux\S*)/i'
qwen-image:
- '/(qwen-image\S*)/i'
minimax:
- '/(minimax\S*)/i'
multimodal_understanding:
- '/(glm-?4\.\S*v\b|qwen3\.5\S*|deepseek-ocr\S*)/i'
glm-4v:
- '/(glm-?4\.\S*v\b)/i'
qwen-3.5:
- '/(qwen3\.5\S*)/i'
deepseek-ocr:
- '/(deepseek-ocr\S*)/i'
audio-model:
- '/(qwen3-tts\S*)/i'
omni-model:
- '/(qwen3-Omni\S*)/i'
multimodal-unified-autoregress:
- '/(hunyuan\S*|emu\S*)/i'
paddle:
- '/(paddle\S*)/i'
310p:
- '/(310p\S*)/i'


@@ -1,85 +0,0 @@
# E2E Test Workflow Guide
This document provides a guide on how to manage and extend the E2E test suite for `vllm-ascend`. It covers how to add new test cases and understand the automatic partitioning mechanism.
## 1. Adding a New Test Case
All E2E test cases are defined and managed in the `.github/workflows/scripts/config.yaml` file.
### Steps
1. **Prepare the Test Script**: Ensure your test script (`.py` file) is placed in the appropriate location under the `tests/e2e/` directory (e.g., `tests/e2e/singlecard/` or `tests/e2e/multicard/`).
2. **Modify `config.yaml`**:
Open `.github/workflows/scripts/config.yaml` and locate the corresponding test suite (e.g., `e2e-singlecard` or `e2e-multicard-2-cards`).
3. **Add Configuration Entry**:
Add a new entry under the corresponding list. Each entry contains the following fields:
* `name`: The relative path to the test file. If you only need to run a specific test function within the file, use `::` as a separator, e.g., `path/to/test.py::test_func`.
* `estimated_time`: The estimated time (in seconds) required to run the test. **This field is crucial** as it is used for automatic load balancing (partitioning).
* `is_skipped` (Optional): If set to `true`, the test will be skipped.
### Example
Suppose you want to add a new test named `tests/e2e/singlecard/test_new_feature.py` with an estimated runtime of 120 seconds:
```yaml
suites:
e2e-singlecard:
# ... other existing tests ...
- name: tests/e2e/singlecard/test_new_feature.py
estimated_time: 120
```
To add a specific test function:
```yaml
- name: tests/e2e/singlecard/test_new_feature.py::test_specific_case
estimated_time: 60
```
## 2. Automatic Partitioning Mechanism
To speed up CI execution, we support splitting large test suites into multiple parallel Jobs (partitions). The partitioning logic is primarily implemented in the `auto_partition` function in `.github/workflows/scripts/run_suite.py`.
### Principle
The partitioning algorithm uses a greedy approach to achieve load balancing, aiming to make the total estimated runtime of each partition as equal as possible (a minimal sketch follows the steps below).
1. **Read Configuration**: The script reads all non-skipped test cases and their `estimated_time` from `config.yaml`.
2. **Sort (Balanced Assignment)**: Test cases are sorted by `estimated_time` in descending order. This ensures that the heaviest tasks are distributed first to achieve optimal load balancing across partitions.
3. **Assign**: Iterating through the sorted test cases, each case is assigned to the partition (Bucket) with the current minimum total time.
4. **Re-sort (Fast Feedback)**: Within each partition, tests are re-sorted by `estimated_time` in ascending order. This allows the CI to cover as many test cases as possible in the early stages.
> TIP: If you need to prioritize a new test case, you can temporarily set its estimated_time to 0 to ensure it runs first, then update it to the actual value later.
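For illustration, here is a minimal, self-contained sketch of the greedy assignment described above. It is not the actual `auto_partition` implementation from `run_suite.py`; only the `estimated_time` field and the rank/size semantics come from this guide, and the function and variable names are assumed for the example.
```python
# Minimal sketch of the greedy partitioning described above (not the real
# implementation in run_suite.py; names other than estimated_time are assumed).
from typing import Any


def auto_partition_sketch(files: list[dict[str, Any]], rank: int, size: int) -> list[dict[str, Any]]:
    """Assign test cases to `size` buckets and return the bucket for `rank`."""
    # 1. Heaviest tests first, so the largest items are spread across buckets.
    ordered = sorted(files, key=lambda f: f.get("estimated_time", 0), reverse=True)

    buckets: list[list[dict[str, Any]]] = [[] for _ in range(size)]
    totals = [0.0] * size
    for case in ordered:
        # 2. Always put the next test into the currently lightest bucket.
        idx = totals.index(min(totals))
        buckets[idx].append(case)
        totals[idx] += case.get("estimated_time", 0)

    # 3. Inside the chosen bucket, run the quickest tests first for fast feedback.
    return sorted(buckets[rank], key=lambda f: f.get("estimated_time", 0))


if __name__ == "__main__":
    cases = [
        {"name": "tests/e2e/singlecard/test_a.py", "estimated_time": 300},
        {"name": "tests/e2e/singlecard/test_b.py", "estimated_time": 120},
        {"name": "tests/e2e/singlecard/test_c.py", "estimated_time": 90},
        {"name": "tests/e2e/singlecard/test_d.py", "estimated_time": 60},
    ]
    for rank in range(2):
        print(rank, [c["name"] for c in auto_partition_sketch(cases, rank, size=2)])
```
With two partitions, the 300-second test lands alone in one bucket while the three lighter tests share the other, which is exactly the load-balancing behavior the steps above describe.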
### How to Modify Partitioning Logic
If you need to adjust the partitioning strategy, please modify the `.github/workflows/scripts/run_suite.py` file.
* **Algorithm Location**: `auto_partition` function.
* **Input Parameters**:
* `files`: List of test files (including `estimated_time`).
* `rank`: Index of the current partition (0 to size-1).
* `size`: Total number of partitions.
* **Invocation**:
CI workflows (e.g., `.github/workflows/_e2e_test.yaml`) call the script via command-line arguments:
```bash
python3 .github/workflows/scripts/run_suite.py --suite <suite_name> --auto-partition-id <index> --auto-partition-size <total_count>
```
### Notes
* **Accurate Estimated Time**: To achieve the best load balancing, please provide an accurate `estimated_time` in `config.yaml`. If a new test is very time-consuming but the estimated time is set too low, it may cause a specific partition to timeout.
* **Number of Partitions**: The number of partitions (`auto-partition-size`) is typically defined in the `strategy.matrix` of the GitHub Actions workflow definition file (e.g., `_e2e_test.yaml`).
## 3. Running Tests Locally
You can use the `run_suite.py` script to run test suites locally:
```bash
# Run the full e2e-singlecard suite
python3 .github/workflows/scripts/run_suite.py --suite e2e-singlecard
# Simulate partitioned execution (e.g., partition 0 of 2)
python3 .github/workflows/scripts/run_suite.py --suite e2e-singlecard --auto-partition-id 0 --auto-partition-size 2
```
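If you want to preview how a suite would be split before changing `auto-partition-size`, a small helper like the one below can be handy. It is a hypothetical script, not part of the repository; it assumes `config.yaml` follows the `suites:` / `name:` / `estimated_time:` layout shown in section 1 and reuses the greedy assignment sketched in section 2.
```python
# Hypothetical helper (not shipped with vllm-ascend): preview how a suite
# would be split across N partitions, assuming the config.yaml layout above.
import yaml  # pip install pyyaml


def preview_partitions(config_path: str, suite: str, size: int) -> None:
    with open(config_path, encoding="utf-8") as f:
        config = yaml.safe_load(f)

    # Drop skipped cases and distribute the heaviest ones first.
    cases = [c for c in config["suites"][suite] if not c.get("is_skipped")]
    cases.sort(key=lambda c: c.get("estimated_time", 0), reverse=True)

    buckets = [[] for _ in range(size)]
    totals = [0.0] * size
    for case in cases:
        idx = totals.index(min(totals))
        buckets[idx].append(case)
        totals[idx] += case.get("estimated_time", 0)

    for rank, bucket in enumerate(buckets):
        print(f"partition {rank}: ~{totals[rank]:.0f}s")
        for case in sorted(bucket, key=lambda c: c.get("estimated_time", 0)):
            print(f"  {case['name']} ({case.get('estimated_time', 0)}s)")


if __name__ == "__main__":
    preview_partitions(".github/workflows/scripts/config.yaml", "e2e-singlecard", size=2)
```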


@@ -1,331 +0,0 @@
name: 'e2e nightly test multi_node'
on:
workflow_call:
inputs:
soc_version:
required: true
type: string
description: use a2 or a3
runner:
required: false
type: string
default: linux-aarch64-a3-0
image:
required: false
type: string
description: base image for pods
default: "swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11"
config_file_path:
required: true
type: string
description: the model config for multi_node test
replicas:
required: false
default: "1"
type: string
description: replicas of the k8s cluster
size:
required: false
default: "2"
type: string
description: how many pods will be pulled up via lws.yaml; this indicates the number of nodes we need
vllm_version:
required: false
default: "v0.18.0"
type: string
description: vllm version to use
vllm_ascend_remote_url:
required: false
default: https://github.com/vllm-project/vllm-ascend.git
type: string
description: used for pr level tests
vllm_ascend_ref:
required: false
default: main
type: string
description: used for pr level tests
should_run:
required: true
type: boolean
secrets:
KUBECONFIG_B64:
required: true
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 8 cards test type
concurrency:
group: ascend-nightly-${{ github.workflow_ref }}-${{ github.ref }}-${{ inputs.soc_version }}-${{ inputs.config_file_path }}
cancel-in-progress: true
jobs:
e2e:
name: ${{ inputs.config_file_path }}
# This is the runner with no NPU for k8s controller
runs-on: ${{ inputs.runner }}
if: ${{ inputs.should_run }}
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-cpu
env:
KUBECONFIG: /tmp/kubeconfig
NAMESPACE: vllm-project
steps:
- name: Decode kubeconfig from secrets
run: |
# Decode and save kubeconfig
if [ "${{ github.event_name }}" = "pull_request" ]; then
echo "PR test mode"
if [ "${{ inputs.soc_version }}" = "a3" ]; then
echo "Using A3 cached kubeconfig"
cp /root/.cache/.kube/kubeconfig.yaml "$KUBECONFIG"
else
echo "Using A2 cached kubeconfig"
cp /root/.cache/.kube/hk_001_kb.yaml "$KUBECONFIG"
fi
else
echo "Decoding kubeconfig from secrets"
echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > "$KUBECONFIG"
fi
- name: Checkout code
uses: actions/checkout@v6
- name: Set job variables
run: |
# Derive a unique, valid k8s resource name from config_file_path.
# Strip .yaml extension, lowercase, replace dots/underscores with hyphens, cap at 50 chars.
config_file="${{ inputs.config_file_path }}"
lws_suffix=$(echo "$config_file" | sed 's/\.yaml$//' | tr '[:upper:]' '[:lower:]' | tr '._' '-' | cut -c1-50)
LWS_NAME="vllm-${lws_suffix}"
echo "LWS_NAME=${LWS_NAME}" >> $GITHUB_ENV
echo "LEADER_POD=${LWS_NAME}-0" >> $GITHUB_ENV
echo "Computed LWS_NAME=${LWS_NAME}"
- name: Prepare scripts
run: |
# prepare for lws entrypoint scripts
install -D tests/e2e/nightly/multi_node/scripts/run.sh /root/.cache/tests/run.sh
- name: Clear resources
run: |
set -euo pipefail
TIMEOUT=${TIMEOUT:-120}
SLEEP_INTERVAL=2
echo "Deleting leaderworkerset [$LWS_NAME] in namespace [$NAMESPACE]..."
kubectl delete leaderworkerset "$LWS_NAME" -n "$NAMESPACE" --ignore-not-found
kubectl delete service "${LWS_NAME}-leader" -n "$NAMESPACE" --ignore-not-found
echo "Waiting for pods of leaderworkerset [$LWS_NAME] to be deleted..."
START_TIME=$(date +%s)
while true; do
NOW=$(date +%s)
ELAPSED=$((NOW - START_TIME))
if [[ $ELAPSED -ge $TIMEOUT ]]; then
echo "Timeout reached ($TIMEOUT seconds), some pods still exist:"
kubectl get pods -n "$NAMESPACE" | grep "^${LWS_NAME}-" || true
exit 1
fi
PODS_EXIST=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep "^${LWS_NAME}-" || true)
if [[ -z "$PODS_EXIST" ]]; then
echo "All pods for [$LWS_NAME] deleted."
break
else
echo "Waiting for pods to be deleted: $PODS_EXIST"
sleep $SLEEP_INTERVAL
fi
done
- name: Launch cluster
id: launcher
run: |
set -e
size="${{ inputs.size }}"
replicas="${{ inputs.replicas }}"
image="${{ inputs.image }}"
config_file_path="${{ inputs.config_file_path }}"
fail_tag=FAIL_TAG_"${{ inputs.config_file_path }}"
is_pr_test="${{ github.event_name == 'pull_request' }}"
vllm_version="${{ inputs.vllm_version }}"
vllm_ascend_ref="${{ inputs.vllm_ascend_ref }}"
vllm_ascend_remote_url="${{ inputs.vllm_ascend_remote_url }}"
echo "FAIL_TAG=${fail_tag}" >> $GITHUB_ENV
required_params=("size" "replicas" "image" "config_file_path" "is_pr_test" "vllm_version" "vllm_ascend_ref" "vllm_ascend_remote_url")
for param in "${required_params[@]}"; do
if [ -z "${!param}" ]; then
echo "Error: Parameter '$param' is required but empty"
exit 1
fi
done
if [ "${{ inputs.soc_version }}" = "a3" ]; then
npu_per_node=16
TEMPLATE_FILE="tests/e2e/nightly/multi_node/scripts/lws.yaml.jinja2"
else
npu_per_node=8
TEMPLATE_FILE="tests/e2e/nightly/multi_node/scripts/lws-a2.yaml.jinja2"
fi
jinja2 $TEMPLATE_FILE \
-D lws_name="$LWS_NAME" \
-D size="$size" \
-D replicas="$replicas" \
-D image="$image" \
-D config_file_path="$config_file_path" \
-D npu_per_node="$npu_per_node" \
-D fail_tag="$fail_tag" \
-D is_pr_test="$is_pr_test" \
-D vllm_version="$vllm_version" \
-D vllm_ascend_ref="$vllm_ascend_ref" \
-D vllm_ascend_remote_url="$vllm_ascend_remote_url" \
--outfile lws.yaml
kubectl apply -f ./lws.yaml
- name: Waiting for pod ready
run: |
POD_PREFIX="${LWS_NAME}-0"
SIZE="${{ inputs.size }}"
TIMEOUT=1200 # default timeout 20 minutes
echo "Waiting for Pods in namespace [$NAMESPACE] to become Running and Ready (timeout ${TIMEOUT}s)..."
START_TIME=$(date +%s)
while true; do
NOW=$(date +%s)
ELAPSED=$((NOW - START_TIME))
if [[ $ELAPSED -ge $TIMEOUT ]]; then
echo "Timeout reached after ${ELAPSED}s"
echo "Dumping pod status for debugging:"
kubectl get pods -n "$NAMESPACE"
kubectl describe pod "$LEADER_POD" -n "$NAMESPACE"
exit 1
fi
# 1) check follower pods
ALL_FOLLOWERS_READY=true
for ((i=1; i<SIZE; i++)); do
POD="${POD_PREFIX}-${i}"
PHASE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
READY=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
echo "Follower [$POD] phase=$PHASE ready=$READY"
if [[ "$PHASE" != "Running" || "$READY" != "true" ]]; then
echo "Follower [$POD] not Ready yet..."
ALL_FOLLOWERS_READY=false
break
fi
done
# 2) check leader pod
LEADER_PHASE=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
LEADER_READY=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
echo "Leader [$LEADER_POD] phase=$LEADER_PHASE ready=$LEADER_READY"
if [[ "$LEADER_PHASE" != "Running" || "$LEADER_READY" != "true" ]]; then
echo "Leader not Ready yet..."
ALL_FOLLOWERS_READY=false
fi
if [[ "$ALL_FOLLOWERS_READY" == "true" ]]; then
echo "All follower pods and leader pod are Running and Ready — continuing."
break
fi
sleep 2
done
- name: Stream logs
run: |
set -euo pipefail
size="${{ inputs.size }}"
pids=()
cleanup() {
echo "Cleaning up background log streams..."
for pid in "${pids[@]}"; do
kill "$pid" 2>/dev/null || true
done
}
trap cleanup EXIT
for i in $(seq 1 $((size - 1))); do
POD="${LWS_NAME}-0-${i}"
echo "==== Collecting logs from worker pod: $POD ===="
kubectl logs -f "$POD" -n "$NAMESPACE" \
> "/tmp/${POD}_logs.txt" 2>&1 &
pids+=($!)
done
echo "==== Streaming logs from leader pod: $LEADER_POD ===="
echo "Looking for logs containing: $FAIL_TAG"
kubectl logs -f "$LEADER_POD" -n "$NAMESPACE" | while IFS= read -r line; do
echo "$line"
if echo "$line" | grep -q "$FAIL_TAG"; then
exit 1
fi
done
- name: Upload logs
if: always()
uses: actions/upload-artifact@v7
with:
name: ${{ inputs.config_file_path }}-pod-logs
path: /tmp/vllm*_logs.txt
retention-days: 7
- name: Post process
if: always()
run: |
echo "Current pod status:"
kubectl get pods -n "$NAMESPACE" --ignore-not-found=true
echo "Deleting resources for [$LWS_NAME]..."
kubectl delete -f ./lws.yaml --ignore-not-found=true || true
echo "Waiting for pods of [$LWS_NAME] to fully terminate..."
TIMEOUT=300
SLEEP_INTERVAL=5
START_TIME=$(date +%s)
while true; do
NOW=$(date +%s)
ELAPSED=$((NOW - START_TIME))
if [[ $ELAPSED -ge $TIMEOUT ]]; then
echo "Timeout reached ($TIMEOUT seconds) waiting for termination, continuing anyway."
kubectl get pods -n "$NAMESPACE" | grep "^${LWS_NAME}-" || true
break
fi
PODS_EXIST=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep "^${LWS_NAME}-" || true)
if [[ -z "$PODS_EXIST" ]]; then
echo "All pods for [$LWS_NAME] have terminated."
break
else
echo "Waiting for pods to terminate: $PODS_EXIST"
sleep $SLEEP_INTERVAL
fi
done


@@ -1,224 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'e2e nightly test'
on:
workflow_call:
inputs:
runner:
required: true
type: string
image:
required: false
type: string
default: "swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11"
tests:
required: false
type: string
config_file_path:
required: false
type: string
name:
required: false
type: string
vllm_version:
required: false
type: string
default: "v0.18.0"
should_run:
required: true
type: boolean
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 4 cards test type
concurrency:
group: ascend-nightly-${{ github.workflow_ref }}-${{ github.ref }}-${{ inputs.config_file_path || inputs.tests }}
cancel-in-progress: true
jobs:
e2e-nightly:
name: ${{ inputs.name || inputs.config_file_path || inputs.tests }}
runs-on: ${{ inputs.runner }}
if: ${{ inputs.should_run }}
timeout-minutes: 600
container:
image: ${{ inputs.image }}
env:
HF_HUB_OFFLINE: 1
VLLM_USE_MODELSCOPE: True
UV_INDEX_URL: http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
UV_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
UV_INDEX_STRATEGY: unsafe-best-match
UV_NO_CACHE: 1
UV_SYSTEM_PYTHON: 1
VLLM_ENGINE_READY_TIMEOUT_S: 1800
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
pip install uv
- name: uninstall vllm vllm-ascend and remove code (if pr test)
if: ${{ github.event_name == 'pull_request' }}
run: |
pip uninstall -y vllm vllm-ascend || true
cp -r /vllm-workspace/vllm-ascend/benchmark /tmp/aisbench-backup || true
rm -rf /vllm-workspace/vllm /vllm-workspace/vllm-ascend
- name: Checkout vllm-project/vllm repo
if: ${{ github.event_name == 'pull_request' }}
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm_version }}
path: ./temp-vllm
fetch-depth: 1
- name: Checkout vllm-project/vllm-ascend repo
if: ${{ github.event_name == 'pull_request' }}
uses: actions/checkout@v6
with:
path: ./temp-vllm-ascend
fetch-depth: 1
- name: Move code to /vllm-workspace
if: ${{ github.event_name == 'pull_request' }}
run: |
mv ./temp-vllm /vllm-workspace/vllm
mv ./temp-vllm-ascend /vllm-workspace/vllm-ascend
ls -R /vllm-workspace
- name: Install vllm-project/vllm from source
if: ${{ github.event_name == 'pull_request' }}
working-directory: /vllm-workspace/vllm
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
- name: Install vllm-project/vllm-ascend
if: ${{ github.event_name == 'pull_request' }}
working-directory: /vllm-workspace/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
git config --global --add safe.directory /vllm-workspace/vllm-ascend
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
- name: Install aisbench
if: ${{ github.event_name == 'pull_request' }}
shell: bash -l {0}
run: |
cp -r /tmp/aisbench-backup /vllm-workspace/vllm-ascend/benchmark
cd /vllm-workspace/vllm-ascend/benchmark
pip install pytest asyncio pytest-asyncio
pip install -e . -r requirements/api.txt -r requirements/extra.txt
python3 -m pip cache purge
- name: Show vLLM and vLLM-Ascend version
working-directory: /vllm-workspace
run: |
echo "Installed vLLM-related Python packages:"
pip list | grep vllm || echo "No vllm packages found."
echo ""
echo "============================"
echo "vLLM Git information"
echo "============================"
cd vllm
if [ -d .git ]; then
echo "Branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Commit hash: $(git rev-parse HEAD)"
echo "Author: $(git log -1 --pretty=format:'%an <%ae>')"
echo "Date: $(git log -1 --pretty=format:'%ad' --date=iso)"
echo "Message: $(git log -1 --pretty=format:'%s')"
echo "Tags: $(git tag --points-at HEAD || echo 'None')"
echo "Remote: $(git remote -v | head -n1)"
echo ""
else
echo "No .git directory found in vllm"
fi
cd ..
echo ""
echo "============================"
echo "vLLM-Ascend Git information"
echo "============================"
cd vllm-ascend
if [ -d .git ]; then
echo "Branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Commit hash: $(git rev-parse HEAD)"
echo "Author: $(git log -1 --pretty=format:'%an <%ae>')"
echo "Date: $(git log -1 --pretty=format:'%ad' --date=iso)"
echo "Message: $(git log -1 --pretty=format:'%s')"
echo "Tags: $(git tag --points-at HEAD || echo 'None')"
echo "Remote: $(git remote -v | head -n1)"
echo ""
else
echo "No .git directory found in vllm-ascend"
fi
cd ..
- name: Install clang
shell: bash -l {0}
run: |
apt-get update && apt-get -y install clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
- name: Validate Inputs
run: |
if [[ -z "${{ inputs.tests }}" && -z "${{ inputs.config_file_path }}" ]]; then
echo "Error: Either 'tests' or 'config_file_path' must be provided."
exit 1
fi
- name: Run Pytest (py-driven)
if: ${{ inputs.tests != '' }}
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
VLLM_CI_RUNNER: ${{ inputs.runner }}
working-directory: /vllm-workspace/vllm-ascend
run: |
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
echo "Running pytest with tests path: ${{ inputs.tests }}"
pytest -sv "${{ inputs.tests }}" \
--ignore=tests/e2e/nightly/single_node/ops/singlecard_ops/test_fused_moe.py
- name: Run Pytest (YAML-driven)
if: ${{ always() && inputs.config_file_path != '' }}
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
VLLM_USE_MODELSCOPE: True
VLLM_CI_RUNNER: ${{ inputs.runner }}
CONFIG_YAML_PATH: ${{ inputs.config_file_path }}
working-directory: /vllm-workspace/vllm-ascend
run: |
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
echo "Running YAML-driven test with config: ${{ inputs.config_file_path }}"
pytest -sv tests/e2e/nightly/single_node/models/scripts/test_single_node.py


@@ -1,241 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'e2e nightly models test'
on:
workflow_call:
inputs:
vllm:
required: true
type: string
vllm-ascend:
required: false
type: string
default: main
runner:
required: true
type: string
image:
required: true
type: string
model_list:
required: true
type: string
upload:
required: false
type: boolean
default: false
is_run:
required: true
type: boolean
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
# and ignore the lint / 1 card / 2 cards / 4 cards test type
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ inputs.runner }}-${{inputs.model_list}}
cancel-in-progress: true
jobs:
e2e-nightly:
name: ${{inputs.model_list}} accuracy test
runs-on: ${{ inputs.runner }}
if: ${{ inputs.is_run }}
container:
image: "${{ inputs.image }}"
env:
VLLM_USE_MODELSCOPE: True
GHA_VLLM_ASCEND_VERSION: ${{ inputs.vllm-ascend }}
UV_INDEX_URL: http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
UV_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
UV_INDEX_STRATEGY: unsafe-best-match
UV_NO_CACHE: 1
UV_SYSTEM_PYTHON: 1
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
git config --global url."https://gh-proxy.test.osinfra.cn/https://github.com/".insteadOf https://github.com/
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
- name: Install tensorflow (for Molmo-7B-D-0924)
if: ${{ inputs.runner == 'linux-aarch64-a2b3-1' && contains(inputs.model_list, 'Molmo-7B-D-0924') }}
shell: bash -l {0}
run: |
pip install tensorflow==2.19.1 --no-cache-dir
- name: Resolve vllm-ascend version
run: |
VERSION_INPUT="${{ inputs.vllm-ascend }}"
if [[ "$VERSION_INPUT" == "latest" ]]; then
TAGS=$(git ls-remote --tags --sort=-v:refname https://github.com/vllm-project/vllm-ascend "v*" | cut -f2 | sed 's|refs/tags/||')
LATEST_TAG=$(echo "$TAGS" | head -n1)
if [[ -z "$LATEST_TAG" ]]; then
RESOLVED_VERSION="main"
else
RESOLVED_VERSION="$LATEST_TAG"
fi
else
RESOLVED_VERSION="$VERSION_INPUT"
fi
echo "GHA_VLLM_ASCEND_VERSION=$RESOLVED_VERSION" >> $GITHUB_ENV
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm-ascend
path: ./vllm-ascend
ref: ${{ env.GHA_VLLM_ASCEND_VERSION }}
- name: Get vLLM commit hash and URL
working-directory: ./vllm-empty
run: |
VLLM_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_COMMIT=$VLLM_COMMIT" >> $GITHUB_ENV
- name: Get vLLM-Ascend commit hash and URL
working-directory: ./vllm-ascend
run: |
VLLM_ASCEND_COMMIT=$(git rev-parse --short=7 HEAD)
echo "VLLM_ASCEND_COMMIT=$VLLM_ASCEND_COMMIT" >> $GITHUB_ENV
- name: Collect version info
run: |
for dir in /usr/local/Ascend/ascend-toolkit/*; do
dname=$(basename "$dir")
if [ "$dname" != "latest" ]; then
TOOLKIT_DIR="$dname"
break
fi
done
INFO_FILE="/usr/local/Ascend/ascend-toolkit/${TOOLKIT_DIR}/$(uname -i)-linux/ascend_toolkit_install.info"
GHA_CANN_VERSION=$(grep "version=" "$INFO_FILE" \
| head -n1 \
| cut -d'=' -f2 \
| tr -d '"')
{
echo "GHA_CANN_VERSION=$GHA_CANN_VERSION"
pip show torch | grep "Version:" | awk '{print "GHA_TORCH_VERSION="$2}'
pip show torch_npu | grep "Version:" | awk '{print "GHA_TORCH_NPU_VERSION="$2}'
pip show vllm | grep "Version:" | awk '{print "GHA_VLLM_VERSION="$2}' | sed 's/+.*//'
} >> "$GITHUB_ENV"
- name: Run vllm-project/vllm-ascend accuracy test
id: report
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
HF_DATASETS_OFFLINE: True
VLLM_USE_MODELSCOPE: True
VLLM_CI_RUNNER: ${{ inputs.runner }}
VLLM_VERSION: ${{ env.GHA_VLLM_VERSION }}
VLLM_COMMIT: ${{ env.VLLM_COMMIT }}
VLLM_ASCEND_VERSION: ${{ env.GHA_VLLM_ASCEND_VERSION || github.ref }}
VLLM_ASCEND_COMMIT: ${{ env.VLLM_ASCEND_COMMIT }}
CANN_VERSION: ${{ env.GHA_CANN_VERSION }}
TORCH_VERSION: ${{ env.GHA_TORCH_VERSION }}
TORCH_NPU_VERSION: ${{ env.GHA_TORCH_NPU_VERSION }}
run: |
mkdir -p ./benchmarks/accuracy
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
echo "Received model_list: ${{ inputs.model_list }}"
models=$(echo '${{ inputs.model_list }}' | jq -r '.[]')
any_failure=0
for model in $models; do
echo "Running test for model: $model"
pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \
--config "./tests/e2e/models/configs/${model}.yaml" || {
echo "Test failed for model: $model"
any_failure=1
}
done
if [ $any_failure -ne 0 ]; then
exit 1
fi
- name: Generate step summary
if: ${{ always() }}
run: |
models=$(echo '${{ inputs.model_list }}' | jq -r '.[]')
for model in $models; do
echo "Processing model: $model"
model_base_name=$(basename "$model")
cat ./benchmarks/accuracy/${model_base_name}.md >> $GITHUB_STEP_SUMMARY
done
- name: Set artifact timestamp
id: ts
run: |
echo "artifact_ts=$(date -u +%Y%m%dT%H%M%SZ)" >> $GITHUB_OUTPUT
- name: Upload Report
if: ${{ inputs.upload == true }}
uses: actions/upload-artifact@v7
with:
name: report-${{ env.GHA_VLLM_ASCEND_VERSION }}-${{ steps.ts.outputs.artifact_ts }}
path: ./benchmarks/accuracy/
if-no-files-found: warn
retention-days: 90
overwrite: true


@@ -1,731 +0,0 @@
name: 'e2e test'
on:
workflow_call:
inputs:
vllm:
required: true
type: string
image:
required: true
type: string
type:
required: true
type: string
contains_310:
required: true
type: boolean
continue_on_error:
required: false
type: boolean
default: false
env:
UV_INDEX_URL: http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
UV_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
UV_INDEX_STRATEGY: unsafe-best-match
UV_NO_CACHE: 1
UV_SYSTEM_PYTHON: 1
jobs:
e2e-light:
name: singlecard-light
if: ${{ inputs.type == 'light' }}
runs-on: linux-aarch64-a2b3-1
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: ${{ inputs.image }}
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HF_HUB_OFFLINE: 1
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run vllm-project/vllm-ascend test
env:
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
if [ "${{ inputs.continue_on_error }}" = "true" ]; then
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-singlecard-light \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
--auto-upgrade-estimated-times \
--continue-on-error \
2>&1 | tee /tmp/e2e-singlecard-light-part${{ matrix.part }}.log
else
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-singlecard-light \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
2>&1 | tee /tmp/e2e-singlecard-light-part${{ matrix.part }}.log
fi
exit ${PIPESTATUS[0]}
- name: Summarize singlecard-light failure
if: ${{ always() }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run singlecard-light test" \
--log-file /tmp/e2e-singlecard-light-part${{ matrix.part }}.log \
--output "$GITHUB_STEP_SUMMARY"
- name: Upload timing data
uses: actions/upload-artifact@v4
if: ${{ inputs.continue_on_error == true && github.event_name != 'pull_request' }}
with:
name: timing-data-singlecard-light-part${{ matrix.part }}
path: test_timing_data.json
if-no-files-found: warn
retention-days: 5
e2e-full:
name: singlecard-full
if: ${{ inputs.type == 'full' }}
runs-on: linux-aarch64-a2b3-1
strategy:
fail-fast: false
matrix:
part: [0, 1]
container:
image: ${{ inputs.image }}
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HF_HUB_OFFLINE: 1
MODELSCOPE_HUB_FILE_LOCK: False
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run e2e test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
shell: bash
run: |
set -o pipefail
if [ "${{ inputs.continue_on_error }}" = "true" ]; then
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-singlecard \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 2 \
--auto-upgrade-estimated-times \
--continue-on-error \
2>&1 | tee /tmp/e2e-singlecard-full-part${{ matrix.part }}.log
else
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-singlecard \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 2 \
2>&1 | tee /tmp/e2e-singlecard-full-part${{ matrix.part }}.log
fi
exit ${PIPESTATUS[0]}
- name: Summarize singlecard-full failure
if: ${{ always() }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run singlecard-full test" \
--log-file /tmp/e2e-singlecard-full-part${{ matrix.part }}.log \
--output "$GITHUB_STEP_SUMMARY"
- name: Upload timing data
uses: actions/upload-artifact@v4
if: ${{ inputs.continue_on_error == true && github.event_name != 'pull_request' }}
with:
name: timing-data-singlecard-full-part${{ matrix.part }}
path: test_timing_data.json
if-no-files-found: warn
retention-days: 5
e2e-2-cards-light:
name: multicard-2-light
if: ${{ inputs.type == 'light' }}
runs-on: linux-aarch64-a3-2
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-a3-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HCCL_BUFFSIZE: 1024
HF_HUB_OFFLINE: 1
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run vllm-project/vllm-ascend test (light)
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
if [ "${{ inputs.continue_on_error }}" = "true" ]; then
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-2card-light \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
--auto-upgrade-estimated-times \
--continue-on-error \
2>&1 | tee /tmp/e2e-2card-light-part${{ matrix.part }}.log
else
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-2card-light \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
2>&1 | tee /tmp/e2e-2card-light-part${{ matrix.part }}.log
fi
exit ${PIPESTATUS[0]}
- name: Summarize multicard-2-light failure
if: ${{ always() }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run multicard-2-light test" \
--log-file /tmp/e2e-2card-light-part${{ matrix.part }}.log \
--output "$GITHUB_STEP_SUMMARY"
- name: Upload timing data
uses: actions/upload-artifact@v4
if: ${{ inputs.continue_on_error == true && github.event_name != 'pull_request' }}
with:
name: timing-data-2card-light-part${{ matrix.part }}
path: test_timing_data.json
if-no-files-found: warn
retention-days: 5
e2e-2-cards-full:
name: multicard-2-full
if: ${{ inputs.type == 'full' }}
runs-on: linux-aarch64-a3-2
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-a3-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HCCL_BUFFSIZE: 1024
HF_HUB_OFFLINE: 1
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run vllm-project/vllm-ascend test (full)
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
if [ "${{ inputs.continue_on_error }}" = "true" ]; then
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-multicard-2-cards \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
--auto-upgrade-estimated-times \
--continue-on-error \
2>&1 | tee /tmp/e2e-2card-full-part${{ matrix.part }}.log
else
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-multicard-2-cards \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
2>&1 | tee /tmp/e2e-2card-full-part${{ matrix.part }}.log
fi
exit ${PIPESTATUS[0]}
- name: Summarize multicard-2-full failure
if: ${{ always() }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run multicard-2-full test " \
--log-file /tmp/e2e-2card-full-part${{ matrix.part }}.log \
--output "$GITHUB_STEP_SUMMARY"
- name: Upload timing data
uses: actions/upload-artifact@v4
if: ${{ inputs.continue_on_error == true && github.event_name != 'pull_request' }}
with:
name: timing-data-2card-full-part${{ matrix.part }}
path: test_timing_data.json
if-no-files-found: warn
retention-days: 5
- name: Run vllm-project/vllm-ascend test (non triton)
if: ${{ inputs.type == 'full' && matrix.part == 0 }}
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
python3 -m pip uninstall -y triton-ascend
pytest -sv --durations=0 tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py \
2>&1 | tee /tmp/e2e-non-triton.log
exit ${PIPESTATUS[0]}
- name: Summarize non-triton failure
if: ${{ always() && inputs.type == 'full' && matrix.part == 0 }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run multicard-2-full test (non triton)" \
--log-file /tmp/e2e-non-triton.log \
--output "$GITHUB_STEP_SUMMARY"
e2e-4-cards-full:
name: multicard-4-full
if: ${{ inputs.type == 'full' }}
runs-on: linux-aarch64-a3-4
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: m.daocloud.io/quay.io/ascend/cann:8.5.1-a3-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HF_HUB_OFFLINE: 1
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev clang-15
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-15 20
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-15 20
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run vllm-project/vllm-ascend test for V1 Engine
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
if [ "${{ inputs.continue_on_error }}" = "true" ]; then
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-multicard-4-cards \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
--auto-upgrade-estimated-times \
--continue-on-error \
2>&1 | tee /tmp/e2e-4card-full-part${{ matrix.part }}.log
else
python3 .github/workflows/scripts/run_suite.py \
--suite e2e-multicard-4-cards \
--auto-partition-id "${{ matrix.part }}" \
--auto-partition-size 1 \
2>&1 | tee /tmp/e2e-4card-full-part${{ matrix.part }}.log
fi
exit ${PIPESTATUS[0]}
- name: Summarize multicard-4-full failure
if: ${{ always() }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run vllm-project/vllm-ascend test for V1 Engine" \
--log-file /tmp/e2e-4card-full-part${{ matrix.part }}.log \
--output "$GITHUB_STEP_SUMMARY"
- name: Upload timing data
uses: actions/upload-artifact@v4
if: ${{ inputs.continue_on_error == true && github.event_name != 'pull_request' }}
with:
name: timing-data-4card-full-part${{ matrix.part }}
path: test_timing_data.json
if-no-files-found: warn
retention-days: 5
e2e_310p:
name: 310p singlecard
runs-on: linux-aarch64-310p-1
if: ${{ inputs.contains_310 }}
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-310p-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HF_HUB_OFFLINE: 1
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run vllm-project/vllm-ascend test
env:
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
pytest -sv --durations=0 tests/e2e/310p/singlecard/test_dense_model_singlecard.py \
tests/e2e/310p/singlecard/test_vl_model_singlecard.py \
2>&1 | tee /tmp/e2e-310p-singlecard.log
exit ${PIPESTATUS[0]}
- name: Summarize 310p singlecard failure
if: ${{ always() && inputs.contains_310 }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run vllm-project/vllm-ascend test" \
--log-file /tmp/e2e-310p-singlecard.log \
--output "$GITHUB_STEP_SUMMARY"
e2e_310p-4cards:
name: 310p multicards 4cards
runs-on: linux-aarch64-310p-4
if: ${{ inputs.contains_310 }}
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-310p-ubuntu22.04-py3.11
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
HF_HUB_OFFLINE: 1
steps:
- name: Check npu and CANN info
run: |
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
- name: Config mirrors
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt install git -y
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Install system dependencies
run: |
apt-get -y install `cat packages.txt`
apt-get -y install gcc g++ cmake libnuma-dev
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
fetch-depth: 1
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install -e .
uv pip uninstall triton
- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install uc-manager
uv pip install -r requirements-dev.txt
uv pip install -v -e .
uv pip install https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgitee7ba04-cp311-cp311-manylinux_2_28_aarch64.whl
uv pip install git+https://github.com/modelscope/modelscope.git@dbbcbf631fe6d10cc6446df2ad2fef24039fe7fe
- name: Run vllm-project/vllm-ascend test
env:
PYTORCH_NPU_ALLOC_CONF: max_split_size_mb:256
VLLM_WORKER_MULTIPROC_METHOD: spawn
shell: bash
run: |
set -o pipefail
pytest -sv --durations=0 \
tests/e2e/310p/multicard/test_dense_model_multicard.py \
tests/e2e/310p/multicard/test_moe_model_multicard.py \
tests/e2e/310p/multicard/test_vl_model_multicard.py \
2>&1 | tee /tmp/e2e-310p-4cards.log
exit ${PIPESTATUS[0]}
- name: Summarize 310p multicards failure
if: ${{ always() && inputs.contains_310 }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--step-name "Run vllm-project/vllm-ascend test" \
--log-file /tmp/e2e-310p-4cards.log \
--output "$GITHUB_STEP_SUMMARY"


@@ -1,71 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'Nightly image build'
on:
workflow_call:
inputs:
target:
required: true
type: string
description: "Build target: 'a2' or 'a3'"
secrets:
HW_USERNAME:
required: false
HW_TOKEN:
required: false
GITEE_TOKEN:
required: false
jobs:
build:
name: Build nightly-${{ inputs.target }} image
runs-on: ubuntu-22.04-arm
steps:
- uses: actions/checkout@v6
- name: Login to Huawei Cloud SWR
id: login-swr
if: ${{ env.HW_USERNAME != '' && env.HW_TOKEN != '' }}
env:
HW_USERNAME: ${{ secrets.HW_USERNAME }}
HW_TOKEN: ${{ secrets.HW_TOKEN }}
run: |
echo "$HW_TOKEN" | docker login -u "$HW_USERNAME" --password-stdin swr.cn-southwest-2.myhuaweicloud.com
- name: Build nightly-${{ inputs.target }} image
env:
GITEE_USERNAME: ${{ vars.GITEE_USERNAME }}
GITEE_TOKEN: ${{ secrets.GITEE_TOKEN }}
run: |
IMAGE="swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-${{ inputs.target }}"
docker build \
--network host \
--platform linux/arm64 \
-f .github/workflows/dockerfiles/Dockerfile.nightly.${{ inputs.target }} \
--build-arg CANN_VERSION="8.5.1" \
--build-arg UBUNTU_VERSION="22.04" \
--build-arg PYTHON_VERSION="3.11" \
--build-arg GITEE_USERNAME="${GITEE_USERNAME}" \
--build-arg GITEE_TOKEN="${GITEE_TOKEN}" \
-t "$IMAGE" .
- name: Push image to SWR
if: ${{ github.repository_owner == 'vllm-project' && steps.login-swr.conclusion == 'success' }}
run: |
docker push swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-${{ inputs.target }}


@@ -1,115 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: 'Parse nightly trigger'
on:
workflow_call:
inputs:
runner:
required: false
type: string
default: linux-aarch64-a2b3-0
outputs:
run:
description: "Whether nightly tests should run"
value: ${{ jobs.parse.outputs.run }}
filter:
description: "Comma-wrapped test name filter (e.g. ',name1,name2,'), or 'all'"
value: ${{ jobs.parse.outputs.filter }}
ref:
description: "The vllm-ascend ref (commit SHA for PRs, branch/tag name otherwise)"
value: ${{ jobs.parse.outputs.ref }}
jobs:
parse:
name: Parse trigger and determine test scope
runs-on: ${{ inputs.runner }}
outputs:
run: ${{ steps.parse.outputs.run }}
filter: ${{ steps.parse.outputs.filter }}
ref: ${{ steps.parse.outputs.ref }}
steps:
- name: Parse trigger
id: parse
uses: actions/github-script@v7
with:
script: |
const eventName = context.eventName;
function parseNightlyComment(body) {
if (!body) return null;
const match = body.trim().match(/^\/nightly(?:\s+(.+))?$/m);
if (!match) return null;
const args = (match[1] || '').trim();
if (!args || args === 'all') return 'all';
// Wrap with commas for exact-name matching: ",name1,name2,"
return ',' + args.split(/\s+/).join(',') + ',';
}
function getRef() {
if (eventName === 'pull_request') {
return context.payload.pull_request.head.sha;
}
return (context.ref || '').replace(/^refs\/(heads|tags)\//, '') || 'main';
}
core.setOutput('ref', getRef());
// 1. schedule / workflow_dispatch: run all tests with pre-built image
if (eventName === 'schedule' || eventName === 'workflow_dispatch') {
core.setOutput('run', 'true');
core.setOutput('filter', 'all');
return;
}
// 2. pull_request (labeled / synchronize)
if (eventName === 'pull_request') {
const labels = context.payload.pull_request.labels.map(l => l.name);
if (!labels.includes('nightly-test')) {
core.setOutput('run', 'false');
core.setOutput('filter', '');
return;
}
// Search comments for latest /nightly command
const prNumber = context.payload.pull_request.number;
const comments = await github.paginate(github.rest.issues.listComments, {
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: prNumber,
per_page: 100,
});
let filter = null;
for (let i = comments.length - 1; i >= 0; i--) {
const result = parseNightlyComment(comments[i].body);
if (result !== null) { filter = result; break; }
}
// No /nightly comment found: do not run any tests
if (filter === null) {
core.info('nightly-test label present but no /nightly comment found; skipping.');
core.setOutput('run', 'false');
core.setOutput('filter', '');
return;
}
core.setOutput('run', 'true');
core.setOutput('filter', filter);
return;
}
// Fallback
core.setOutput('run', 'false');
core.setOutput('filter', '');
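A minimal caller sketch (not part of the deleted file above), assuming the same comma-wrapped filter matching that the Nightly-A2/Nightly-A3 workflows further below use; the job name my-nightly-test, the runner, and the test name qwen3-32b are placeholders:
jobs:
  parse-trigger:
    uses: ./.github/workflows/_parse_trigger.yaml
  my-nightly-test:
    needs: [parse-trigger]
    runs-on: ubuntu-latest   # placeholder runner
    # Run only when parse-trigger says to run and this test name is selected
    if: >-
      needs.parse-trigger.outputs.run == 'true' && (
        needs.parse-trigger.outputs.filter == 'all' ||
        contains(needs.parse-trigger.outputs.filter, ',qwen3-32b,')
      )
    steps:
      - run: echo "testing ref ${{ needs.parse-trigger.outputs.ref }}"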

View File

@@ -1,86 +0,0 @@
name: pre-commit
on:
workflow_call:
inputs:
vllm:
required: true
type: string
permissions:
contents: read
jobs:
pre-commit:
runs-on: linux-amd64-cpu-8-hk
container:
# Build it from https://github.com/nv-action/vllm-benchmarks/blob/main/Dockerfile
image: quay.io/ascend-ci/vllm-ascend:lint
steps:
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
# With problem matchers in a container, $GITHUB_WORKSPACE and ${{ github.workspace }} resolve to different paths.
# So we just copy the matcher files into a temp path. See https://github.com/actions/runner/issues/2058
- name: cp problem matchers
run: |
cp .github/workflows/matchers/actionlint.json "$RUNNER_TEMP/actionlint.json"
cp .github/workflows/matchers/markdownlint.json "$RUNNER_TEMP/markdownlint.json"
cp .github/workflows/matchers/mypy.json "$RUNNER_TEMP/mypy.json"
- run: echo "::add-matcher::$RUNNER_TEMP/actionlint.json"
- run: echo "::add-matcher::$RUNNER_TEMP/markdownlint.json"
- run: echo "::add-matcher::$RUNNER_TEMP/mypy.json"
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
path: ./vllm-empty
ref: ${{ inputs.vllm }}
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
lint_tracker:
- 'requirements.txt'
- 'requirements-dev.txt'
- 'requirements-lint.txt'
- name: Install vllm-ascend dev (conditional)
if: steps.filter.outputs.lint_tracker == 'true'
env:
UV_INDEX_URL: http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
UV_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
UV_INDEX_STRATEGY: unsafe-best-match
UV_NO_CACHE: 1
UV_SYSTEM_PYTHON: 1
run: |
pip install uv
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
pip install uc-manager
uv pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu
- name: Run pre-commit
env:
PRE_COMMIT_COLOR: always
FORCE_COLOR: "1"
TERM: xterm-256color
SHELLCHECK_OPTS: "--exclude=SC2046,SC2006,SC2086" # Exclude SC2046, SC2006, SC2086 for actionlint
run: |
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
pre-commit run --all-files --hook-stage manual --show-diff-on-failure
- name: Run mypy
run: |
PYTHONPATH="$PYTHONPATH:$(pwd)/vllm-empty"
export PYTHONPATH
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
# Run mypy for Python 3.10, 3.11, 3.12 manually
# Note: We are now separating mypy from pre-commit hooks for performance reasons.
for python_version in "3.10" "3.11" "3.12"; do
echo "============================"
tools/mypy.sh 1 "$python_version"
echo "============================"
done

View File

@@ -1,192 +0,0 @@
name: Image_oncall
on:
workflow_call:
inputs:
suffix:
description: 'The tag suffix to use'
required: true
type: string
should_push:
description: 'Whether to push the image'
required: false
type: boolean
default: False
dockerfile:
description: 'The Dockerfile to use'
required: false
type: string
quay_username:
description: 'Quay username for pushing images'
required: false
type: string
workflow_dispatch_tag:
description: 'The tag to use for workflow dispatch'
required: false
type: string
secrets:
QUAY_PASSWORD:
description: 'Quay password for pushing images'
required: false
jobs:
build-push-digest:
name: build
runs-on: ${{ matrix.runner }}
strategy:
matrix:
include:
- arch: linux/amd64
runner: ubuntu-latest
tag: amd64
- arch: linux/arm64
runner: ubuntu-22.04-arm
tag: arm64
steps:
- uses: actions/checkout@v6
if: ${{ github.event_name != 'workflow_dispatch' }}
with:
fetch-depth: 0
persist-credentials: false
ref: ${{ github.ref }}
- uses: actions/checkout@v6
if: ${{ github.event_name == 'workflow_dispatch' }}
with:
fetch-depth: 0
persist-credentials: false
ref: ${{ inputs.workflow_dispatch_tag }}
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Publish - Login to Quay Container Registry
if: ${{ inputs.should_push }}
uses: docker/login-action@v4
with:
registry: quay.io
username: ${{ inputs.quay_username }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v4
with:
install: true
driver: docker-container
use: true
- name: Build and push
uses: docker/build-push-action@v7
id: build
with:
platforms: ${{ matrix.arch }}
# Use the current repo path as the build context, ensuring .git is included
context: .
file: ${{ inputs.dockerfile || 'Dockerfile' }}
# Only push for tag and branch (main/*-dev) pushes, controlled via should_push
push: ${{ inputs.should_push }}
outputs: type=image,name=quay.io/ascend/vllm-ascend,push-by-digest=true,name-canonical=true,push=${{ inputs.should_push }}
build-args: |
PIP_INDEX_URL=https://pypi.org/simple
provenance: false
- name: Export digest
run: |
mkdir -p ${{ runner.temp }}/digests
digest="${{ steps.build.outputs.digest }}"
touch "${{ runner.temp }}/digests/${digest#sha256:}"
- name: Upload digest
uses: actions/upload-artifact@v7
with:
name: digests-${{ inputs.suffix }}-${{ matrix.tag }}
path: ${{ runner.temp }}/digests/*
if-no-files-found: error
retention-days: 1
merge-image:
runs-on: ubuntu-latest
needs: build-push-digest
if: ${{ inputs.should_push }}
steps:
- name: Checkout
uses: actions/checkout@v6
with:
ref: ${{ github.ref }}
- name: Download arm64 digests
uses: actions/download-artifact@v8
with:
path: ${{ runner.temp }}/digests
pattern: digests-${{ inputs.suffix }}-arm64
merge-multiple: true
- name: Download amd64 digests
uses: actions/download-artifact@v8
with:
path: ${{ runner.temp }}/digests
pattern: digests-${{ inputs.suffix }}-amd64
merge-multiple: true
- name: Prepare suffix
id: suffix
run: |
if [ -n "${{ inputs.suffix }}" ]; then
echo "SUFFIX=-${{ inputs.suffix }}" >> $GITHUB_ENV
else
echo "SUFFIX=" >> $GITHUB_ENV
fi
- name: Docker meta
id: meta
uses: docker/metadata-action@v6
with:
# TODO(yikun): add more hub images and a note on the release policy for container images
images: |
quay.io/ascend/vllm-ascend
# Note for test case
# https://github.com/marketplace/actions/docker-metadata-action#typeref
# 1. branch jobs publish on each main/*-dev branch commit
# 2. main and dev pull_requests are build-only, so the tag pr-N-openeuler is fine
# 3. only PEP 440-matched tags will be published:
# - v0.7.1 --> v0.7.1-openeuler
# - pre/post/dev: v0.7.1rc1-openeuler/v0.7.1rc1-openeuler/v0.7.1rc1.dev1-openeuler/v0.7.1.post1-openeuler, no latest
# which follow the rule from vLLM with prefix v
# TODO(yikun): the post release might be considered as latest release
tags: |
type=ref,event=branch,prefix=nightly-,suffix=${{ env.SUFFIX }}
type=ref,event=pr,prefix=nightly-,suffix=${{ env.SUFFIX }}
type=pep440,pattern={{raw}},suffix=${{ env.SUFFIX }}
flavor:
latest=false
- name: Login to Quay
uses: docker/login-action@v4
with:
registry: quay.io
username: ${{ inputs.quay_username }}
password: ${{ secrets.QUAY_PASSWORD }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v4
- name: Merge and push multi-arch image
env:
IMAGE: quay.io/ascend/vllm-ascend
TAGS: ${{ steps.meta.outputs.tags }}
run: |
DIGESTS=$(printf "$IMAGE@sha256:%s " $(ls ${{ runner.temp }}/digests))
echo "Digests: $DIGESTS"
echo "Current tags:"
echo "$TAGS"
for tag in $TAGS; do
echo "Creating tag $tag"
docker buildx imagetools create \
-t "$tag" \
$DIGESTS
done

View File

@@ -1,109 +0,0 @@
name: 'unit test'
on:
workflow_call:
inputs:
vllm:
required: true
type: string
runner:
required: true
type: string
image:
required: true
type: string
type:
required: true
type: string
jobs:
unit-test:
name: unit test
runs-on: ${{ inputs.runner }}
container:
image: ${{ inputs.image }}
env:
VLLM_LOGGING_LEVEL: ERROR
VLLM_USE_MODELSCOPE: True
SOC_VERSION: ascend910b1
MAX_JOBS: 4
COMPILE_CUSTOM_KERNELS: 0
UV_INDEX_URL: http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
UV_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
UV_INDEX_STRATEGY: unsafe-best-match
UV_NO_CACHE: 1
UV_SYSTEM_PYTHON: 1
UV_PYTHON: python3
steps:
- name: Install packages
run: |
sed -Ei 's@(ports|archive).ubuntu.com@cache-service.nginx-pypi-cache.svc.cluster.local:8081@g' /etc/apt/sources.list
pip config set global.index-url http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple
pip config set global.trusted-host cache-service.nginx-pypi-cache.svc.cluster.local
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
pip install uv
- name: Checkout vllm-project/vllm repo
uses: actions/checkout@v6
with:
repository: vllm-project/vllm
ref: ${{ inputs.vllm }}
path: ./vllm-empty
- name: Install vllm-project/vllm from source
working-directory: ./vllm-empty
run: |
VLLM_TARGET_DEVICE=empty uv pip install . --extra-index-url https://download.pytorch.org/whl/cpu/
uv pip uninstall triton
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Install vllm-project/vllm-ascend
run: |
pip install uc-manager
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
uv pip install -v . --extra-index-url https://download.pytorch.org/whl/cpu/
uv pip install -r requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu/
- name: Run unit test
env:
VLLM_WORKER_MULTIPROC_METHOD: spawn
TORCH_DEVICE_BACKEND_AUTOLOAD: 0
shell: bash
run: |
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/devlib
set -o pipefail
pytest -sv --cov --cov-report=xml:unittests-coverage.xml tests/ut \
--ignore tests/ut/model_loader/netloader/test_netloader_elastic.py \
--ignore tests/ut/kv_connector/test_remote_prefill_lifecycle.py \
--ignore tests/ut/kv_connector/test_remote_decode_lifecycle.py \
--ignore tests/ut/core/test_scheduler_dynamic_batch.py \
--ignore tests/ut/kv_connector/test_mooncake_connector.py \
--ignore tests/ut/worker/test_worker_v1.py \
--ignore tests/ut/spec_decode/test_mtp_proposer.py \
--ignore tests/ut/kv_connector/test_mooncake_layerwise_connector.py \
2>&1 | tee /tmp/unit-test.log
exit ${PIPESTATUS[0]}
- name: Summarize unit test failure
if: ${{ always() }}
run: |
python3 .github/workflows/scripts/ci_log_summary.py \
--mode ut \
--step-name "Run unit test" \
--log-file /tmp/unit-test.log \
--output "$GITHUB_STEP_SUMMARY"
- name: Upload coverage to Codecov
# only upload coverage when commits merged
if: ${{ inputs.type == 'schedule' }}
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
with:
flags: unittests
name: vllm-ascend
verbose: true

View File

@@ -1,25 +0,0 @@
name: "Issue Create/Update Labeler"
on:
issues:
types: [opened, edited]
permissions:
issues: write
contents: read
jobs:
triage:
runs-on: ubuntu-latest
if: |
startsWith(github.event.issue.title, '[Bug]:') ||
startsWith(github.event.issue.title, '[Installation]:') ||
startsWith(github.event.issue.title, '[Usage]:') ||
startsWith(github.event.issue.title, '[Doc]:') ||
startsWith(github.event.issue.title, '[Misc]:')
steps:
- uses: github/issue-labeler@v3.4
with:
configuration-path: .github/issue-labeler.yml
enable-versioned-regex: 0
repo-token: ${{ secrets.GITHUB_TOKEN }}
include-title: 1

View File

@@ -1,113 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: PR Create
on:
# Runs when a PR against main is opened
pull_request_target:
types: [opened]
branches:
- 'main'
permissions:
pull-requests: write
jobs:
pr-create:
permissions:
contents: read
pull-requests: write
name: PR create action
runs-on: ubuntu-latest
steps:
- name: Get vLLM version
run: |
VLLM_COMMIT=v0.18.0
echo "VLLM_COMMIT=https://github.com/vllm-project/vllm/commit/$VLLM_COMMIT" >> "$GITHUB_ENV"
- name: Checkout repository
uses: actions/checkout@0c366fd6a839edf440554fa01a7085ccba70ac98 # v4.2.2
- name: Set up Python
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
- name: Get vLLM release version
run: |
VLLM_VERSION=$(python3 docs/source/conf.py | jq .ci_vllm_version | tr -d '"')
echo "VLLM_VERSION=$VLLM_VERSION" >> "$GITHUB_ENV"
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
PR_NUMBER=${{ github.event.number }}
VLLM_VERSION=${{ env.VLLM_VERSION }}
VLLM_COMMIT=${{ env.VLLM_COMMIT }}
OLD=/tmp/orig_pr_body.txt
NEW=/tmp/new_pr_body.txt
FINAL=/tmp/final_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove notes in pr description and add vLLM version and commit
sed -i '/<!--/,/-->/d' "${NEW}"
sed -i '/- vLLM .*$/d' "${NEW}"
{
echo ""
echo "- vLLM version: $VLLM_VERSION"
echo "- vLLM main: $VLLM_COMMIT"
} >> "${NEW}"
# Remove redundant empty lines
uniq "${NEW}" > "${FINAL}"
# Update the PR only if ${FINAL} differs from ${OLD}
if ! cmp -s "${OLD}" "${FINAL}"; then
echo
echo "Updating PR body:"
echo
cat "${NEW}"
gh pr edit --body-file "${FINAL}" "${PR_NUMBER}"
else
echo "No changes needed"
fi
- name: Label the PR
uses: actions/labeler@v6
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
configuration-path: .github/labeler.yml
sync-labels: true
- name: Remind to run full CI on PR
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body: '👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:\n\n' +
'- A PR should do only one thing; smaller PRs enable faster reviews.\n' +
'- Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.\n' +
'- Write the commit message by filling in the PR description, to help reviewers and future developers understand.\n\n' +
'If CI fails, you can run linting and testing checks locally according [Contributing](https://docs.vllm.ai/projects/ascend/zh-cn/latest/developer_guide/contribution/index.html) and [Testing](https://docs.vllm.ai/projects/ascend/zh-cn/latest/developer_guide/contribution/testing.html).'
})
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -1,45 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
ARG PY_VERSION=3.11
FROM quay.io/ascend/manylinux:8.5.1-310p-manylinux_2_28-py${PY_VERSION}
ARG SOC_VERSION="ascend310p1"
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV SOC_VERSION=$SOC_VERSION
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
WORKDIR /workspace
COPY . /workspace/vllm-ascend/
# Install req
RUN python3 -m pip install -r vllm-ascend/requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu/ && \
python3 -m pip install twine attrs psutil
# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
cd vllm-ascend && \
python3 setup.py bdist_wheel && \
ls -l dist
CMD ["/bin/bash"]

View File

@@ -1,45 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
ARG PY_VERSION=3.11
FROM quay.io/ascend/manylinux:8.5.1-a3-manylinux_2_28-py${PY_VERSION}
ARG SOC_VERSION="ascend910_9391"
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
ENV SOC_VERSION=$SOC_VERSION
RUN yum update -y && \
yum install -y python3-pip git vim wget net-tools gcc gcc-c++ make cmake numactl-devel && \
rm -rf /var/cache/yum
WORKDIR /workspace
COPY . /workspace/vllm-ascend/
# Install req
RUN python3 -m pip install -r vllm-ascend/requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu/ && \
python3 -m pip install twine attrs psutil
# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
cd vllm-ascend && \
python3 setup.py bdist_wheel && \
ls -l dist
CMD ["/bin/bash"]

View File

@@ -1,46 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM ascendai/python:3.11-ubuntu22.04
ARG TARGETARCH
RUN apt-get update -y && \
apt-get install -y curl git gcc g++ cmake libnuma-dev jq wget xz-utils shellcheck && \
rm -rf /var/cache/apt/* && \
rm -rf /var/lib/apt/lists/*
ARG VLLM_REPO=https://github.com/vllm-project/vllm.git
# For lint purposes only; ideally this should track vLLM main (main-to-main matching).
ARG VLLM_COMMIT=v0.18.0
RUN git clone $VLLM_REPO /vllm-workspace/vllm && \
cd /vllm-workspace/vllm && \
git checkout $VLLM_COMMIT
# Install vLLM common dependencies
RUN python3 -m pip install -r /vllm-workspace/vllm/requirements/common.txt --extra-index-url https://download.pytorch.org/whl/cpu/ && \
python3 -m pip uninstall -y triton && \
python3 -m pip cache purge
COPY . /vllm-workspace/vllm-ascend/
RUN pip install -r /vllm-workspace/vllm-ascend/requirements-dev.txt --extra-index-url https://download.pytorch.org/whl/cpu && \
pip cache purge && \
rm -fr /vllm-workspace/
CMD ["/bin/bash"]

View File

@@ -1,46 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/vllm-ascend:nightly-releases-v0.18.0
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG AIS_BENCH_TAG="v3.0-20250930-master"
ARG AIS_BENCH_URL="https://gitee.com/aisbench/benchmark.git"
ARG GITEE_USERNAME=""
ARG GITEE_TOKEN=""
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /workspace
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install requirements-dev.txt for tests
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
cd /vllm-workspace/vllm-ascend && \
python3 -m pip install -r requirements-dev.txt && \
python3 -m pip cache purge
# Install benchmark tools
RUN CLONE_URL=$(echo "${AIS_BENCH_URL}" | sed "s|https://|https://${GITEE_USERNAME}:${GITEE_TOKEN}@|") && \
git clone -b ${AIS_BENCH_TAG} --depth 1 "${CLONE_URL}" /vllm-workspace/vllm-ascend/benchmark && \
cd /vllm-workspace/vllm-ascend/benchmark && \
pip install -e . -r requirements/api.txt -r requirements/extra.txt && \
python3 -m pip cache purge
CMD ["/bin/bash"]

View File

@@ -1,46 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
FROM quay.io/ascend/vllm-ascend:nightly-releases-v0.18.0-a3
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG AIS_BENCH_TAG="v3.0-20250930-master"
ARG AIS_BENCH_URL="https://gitee.com/aisbench/benchmark.git"
ARG GITEE_USERNAME=""
ARG GITEE_TOKEN=""
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /workspace
RUN pip config set global.index-url ${PIP_INDEX_URL}
# Install requirements-dev.txt for tests
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
cd /vllm-workspace/vllm-ascend && \
python3 -m pip install -r requirements-dev.txt && \
python3 -m pip cache purge
# Install benchmark tools
RUN CLONE_URL=$(echo "${AIS_BENCH_URL}" | sed "s|https://|https://${GITEE_USERNAME}:${GITEE_TOKEN}@|") && \
git clone -b ${AIS_BENCH_TAG} --depth 1 "${CLONE_URL}" /vllm-workspace/vllm-ascend/benchmark && \
cd /vllm-workspace/vllm-ascend/benchmark && \
pip install -e . -r requirements/api.txt -r requirements/extra.txt && \
python3 -m pip cache purge
CMD ["/bin/bash"]

View File

@@ -1,87 +0,0 @@
name: 'model downloader'
on:
pull_request:
paths:
- '.github/workflows/misc/model_list.json'
- '.github/workflows/labled_download_model.yaml'
types: [labeled, synchronize]
defaults:
run:
shell: bash -el {0}
concurrency:
group: ascend-${{ github.workflow_ref }}
cancel-in-progress: true
jobs:
download-models:
if: contains(github.event.pull_request.labels.*.name, 'model-download')
name: Download models from ModelScope
runs-on: ${{ matrix.runner }}
strategy:
matrix:
runner: [linux-aarch64-a2b3-0, linux-aarch64-a3-0]
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-cpu
steps:
- name: Install dependencies
run: |
apt-get update -y && apt-get install git jq -y
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install modelscope
- name: Show Current Disk Usage
run: |
df -h /root/.cache | grep -v Filesystem | \
awk '{print "Mount point: "$6, "Total: "$2, "Used: "$3, "Available: "$4, "Usage: "$5}'
- name: Checkout PR branch
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Extract new models from PR
id: diff
run: |
set -euo pipefail
git config --global --add safe.directory /__w/vllm-ascend/vllm-ascend
JSON_PATH=".github/workflows/misc/model_list.json"
git fetch origin main
git show origin/main:$JSON_PATH > /tmp/models_main.json || \
echo '{"models":[]}' > /tmp/models_main.json
cp $JSON_PATH /tmp/models_pr.json
jq -r '
(.models // []) as $pr
| input
| (.models // []) as $main
| ($pr - $main)[]
' /tmp/models_pr.json /tmp/models_main.json > /tmp/new_models.txt
echo "New models:"
cat /tmp/new_models.txt || true
- name: Download new models (CLI)
run: |
set -euo pipefail
if [ ! -s /tmp/new_models.txt ]; then
echo "No new models to download."
exit 0
fi
while read -r model; do
[ -z "$model" ] && continue
echo "▶ Downloading $model"
modelscope download "$model"
done < /tmp/new_models.txt
- name: Summary
run: |
echo "Downloaded models:"
cat /tmp/new_models.txt || echo "No new models"

View File

@@ -1,17 +0,0 @@
{
"problemMatcher": [
{
"owner": "markdownlint",
"pattern": [
{
"regexp": "^([^:]*):(\\d+):?(\\d+)?\\s([\\w-\\/]*)\\s(.*)$",
"file": 1,
"line": 2,
"column": 3,
"code": 4,
"message": 5
}
]
}
]
}

View File

@@ -1,248 +0,0 @@
{
"models": [
"AngelSlim/Qwen3-32B_eagle3",
"AngelSlim/Qwen3-a3B_eagle3",
"Anionex/Qwen3-1.7B-W4A8-V1",
"ArthurZ/ilama-3.2-1B",
"BAAI/bge-base-en-v1.5",
"BAAI/bge-large-zh-v1.5",
"BAAI/bge-m3",
"BAAI/bge-multilingual-gemma2",
"BAAI/bge-reranker-large",
"BAAI/bge-reranker-v2-m3",
"BAAI/bge-small-en-v1.5",
"BAAI/kernel_meta",
"ByteDance-Seed/BAGEL-7B-MoT",
"DeepSeek-ai/DeepSeek-OCR",
"DevQuasar/deepseek-ai.DeepSeek-V3.2-BF16",
"Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot",
"Eco-Tech/Qwen3-30B-A3B-w8a8",
"Eco-Tech/Kimi-K2.5-W4A8",
"Howeee/Qwen2.5-1.5B-apeach",
"IntervitensInc/pangu-pro-moe-model",
"IntervitensInc/pangu-pro-moe-modelt",
"JackFram/llama-160m",
"JackFram/llama-68m",
"Kwai-Keye/Keye-VL-8B-Preview",
"LLM-Research/Llama-3.2-11B-Vision",
"LLM-Research/Llama-3.2-1B-Instruct",
"LLM-Research/Llama-3.2-3B-Instruct",
"LLM-Research/Meta-Llama-3-8B-Instruct",
"LLM-Research/Meta-Llama-3.1-8B-Instruct",
"LLM-Research/Molmo-7B-D-0924",
"LLM-Research/Phi-4-mini-instruct",
"LLM-Research/gemma-2-9b-it",
"LLM-Research/gemma-3-4b-it",
"LLM-Research/kernel_meta",
"OpenBMB/MiniCPM-2B-dpo-bf16",
"OpenBMB/MiniCPM-Llama3-V-2_5",
"OpenBMB/MiniCPM3-4B",
"OpenBMB/MiniCPM4-0.5B",
"OpenGVLab/InternVL2-8B",
"OpenGVLab/InternVL2_5-8B",
"OpenGVLab/InternVL3-78B",
"OpenGVLab/InternVL3-8B",
"OpenGVLab/InternVL3_5-8B",
"OpenGVLab/InternVL3_5-8B-hf",
"PaddlePaddle/ERNIE-4.5-21B-A3B-PT",
"PaddlePaddle/PaddleOCR-VL",
"QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ",
"Qwen/QwQ-32B",
"Qwen/QwQ-32B-AWQ",
"Qwen/Qwen",
"Qwen/Qwen-Image",
"Qwen/Qwen1.5-MoE-A2.7B",
"Qwen/Qwen2-1.5B-Instruct",
"Qwen/Qwen2-7B",
"Qwen/Qwen2-7B-Instruct",
"Qwen/Qwen2-7B-W8A8",
"Qwen/Qwen2-Audio-7B-Instruct",
"Qwen/Qwen2-VL-2B-Instruct",
"Qwen/Qwen2-VL-7B",
"Qwen/Qwen2-VL-7B-Instruct",
"Qwen/Qwen2.5-0.5B-Instruct",
"Qwen/Qwen2.5-0.5B-Instruct-AWQ",
"Qwen/Qwen2.5-1.5B-Instruct",
"Qwen/Qwen2.5-14B-Instruct",
"Qwen/Qwen2.5-32B-Instruct",
"Qwen/Qwen2.5-7B",
"Qwen/Qwen2.5-7B-Instruct",
"Qwen/Qwen2.5-7B-Instruct-1M",
"Qwen/Qwen2.5-7b-Instruct",
"Qwen/Qwen2.5-Math-PRM-7B",
"Qwen/Qwen2.5-Omni-3B",
"Qwen/Qwen2.5-Omni-7B",
"Qwen/Qwen2.5-VL-32B-Instruct",
"Qwen/Qwen2.5-VL-3B-Instruct",
"Qwen/Qwen2.5-VL-7B-Instruct",
"Qwen/Qwen2.7-7B",
"Qwen/Qwen3-0.6B",
"Qwen/Qwen3-0.6B-Base",
"Qwen/Qwen3-235B-A22B",
"Qwen/Qwen3-235B-A22B-Instruct-2507",
"Qwen/Qwen3-30B-A3B",
"Qwen/Qwen3-30B-A3B-Instruct-2507",
"Qwen/Qwen3-30B-A3B-W8A8",
"Qwen/Qwen3-32B",
"Qwen/Qwen3-32B-AWQ",
"Qwen/Qwen3-8B",
"Qwen/Qwen3-8B-A3B",
"Qwen/Qwen3-8B-Base",
"Qwen/Qwen3-8B-W8A8",
"Qwen/Qwen3-8B-w4a8",
"Qwen/Qwen3-8B-w8a8",
"Qwen/Qwen3-Base",
"Qwen/Qwen3-Coder-30B-A3B-Instruct",
"Qwen/Qwen3-Embedding-0.6B",
"Qwen/Qwen3-Embedding-8B",
"Qwen/Qwen3-Next-80B-A3B-Instruct",
"Qwen/Qwen3-Next-A3B-Instruct",
"Qwen/Qwen3-Omni-30B-A3B-Instruct",
"Qwen/Qwen3-Reranker-0.6B",
"Qwen/Qwen3-VL-235B-A22B-Instruct",
"Qwen/Qwen3-VL-2B-Instruct",
"Qwen/Qwen3-VL-30B-A3B-Instruct",
"Qwen/Qwen3-VL-32B-Instruct",
"Qwen/Qwen3-VL-8B-Instruct",
"Qwen/Qwen3.5-27B",
"Qwen/Qwen3.5-35B-A3B",
"RedHatAI/Qwen3-32B-speculator.eagle3",
"RedHatAI/Qwen3-8B-speculator.eagle3",
"Shanghai_AI_Laboratory/internlm--chat-7b",
"Shanghai_AI_Laboratory/internlm-7b",
"Shanghai_AI_Laboratory/internlm-7b-chat",
"Shanghai_AI_Laboratory/internlm-7bi-chat",
"Shanghai_AI_Laboratory/internlm-chat-7b",
"Tencent-Hunyuan/HunyuanOCR",
"Tengyunw/qwen3_8b_eagle3",
"Tongyi-MAI/Z-Image-Turbo",
"baichuan-inc/Baichuan2-7B-Chat",
"billy800/Qwen3-30B-A3B-Instruct-2507-AWQ",
"deepseek-ai/DeepSeek-OCR",
"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
"deepseek-ai/DeepSeek-V2",
"deepseek-ai/DeepSeek-V2-Lite",
"deepseek-ai/DeepSeek-V2-Lite-Chat",
"deepseek-ai/Deepseek-V2-Lite",
"dengcao/ms-marco-MiniLM-L6-v2",
"facebook/opt-125m",
"google/gemma-2-9b",
"google/gemma-3n-E2B-it",
"google/siglip2-base-patch16-224",
"hmellor/Ilama-3.2-1B",
"ibm-research/PowerMoE-3b",
"intfloat/multilingual-e5-small",
"jason9693/Qwen2.5-1.5B-apeach",
"jinaai/jina-embeddings-v3",
"jinaai/jina-embeddings-v4",
"jinaai/jina-embeddings-v4-vllm-code",
"jinaai/jina-embeddings-v4-vllm-retrieval",
"kernel_meta/kernel_meta_temp_2116872659434949099",
"llava-hf/LLaVA-NeXT-Video-7B-hf",
"llava-hf/llava-1.5-7b-hf",
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
"llava-hf/llava-v1.6-mistral-7b-hf",
"meta-llama/Llama-3.2-1B-Instruct",
"mistralai/Ministral-3-3B-Instruct-2512-BF16",
"mistralai/Ministral-3-8B-Instruct-2512-BF16",
"mistralai/Mistral-7B-Instruct-v0.1",
"mistralai/Mistral-Small-3.1-24B-Instruct-2503",
"mlx-community/DeepSeek-V3-3bit-bf16",
"moonshotai/Kimi-K2-Thinking",
"moonshotai/Kimi-Linear-48B-A3B-Instruct",
"neuralmagic/Qwen2.5-3B-quantized.w8a8",
"MNN/Qwen3-VL-8B-Instruct-Eagle3",
"nv-community/audio-flamingo-3",
"nv-community/audio-flamingo-3-hf",
"nvidia/audio-flamingo-3-hf",
"openbmb/MiniCPM-2B-sft-bf16",
"openbmb/MiniCPM-V-2_6",
"openbmb/MiniCPM-V-4_5",
"opendatalab/MinerU2.5-2509-1.2B",
"rhymes-ai/Aria",
"sentence-transformers/all-MiniLM-L12-v2",
"tencent/HunyuanOCR",
"unsloth/DeepSeek-V3.1-BF16",
"unsloth/Kimi-K2-Thinking-BF16",
"unsloth/gpt-oss-20b-BF16",
"vllm-ascend/DeepSeek-R1-0528-W8A8",
"vllm-ascend/DeepSeek-R1-W8A8",
"vllm-ascend/DeepSeek-R1-fa3-pruning",
"vllm-ascend/DeepSeek-R1-w4a8-pruning",
"vllm-ascend/DeepSeek-V2-Lite",
"vllm-ascend/DeepSeek-V2-Lite-W8A8",
"vllm-ascend/DeepSeek-V3-Pruning",
"vllm-ascend/DeepSeek-V3-W4A8-Pruing",
"vllm-ascend/DeepSeek-V3-W8A8",
"vllm-ascend/DeepSeek-V3.1",
"vllm-ascend/DeepSeek-V3.1-W4A8-puring",
"vllm-ascend/DeepSeek-V3.1-W8A8",
"vllm-ascend/DeepSeek-V3.2-W8A8",
"vllm-ascend/DeepSeek-V3.2-W8A8-Pruning",
"vllm-ascend/EAGLE-LLaMA3.1-Instruct-8B",
"vllm-ascend/EAGLE3-LLaMA3.1-Instruct-8B",
"vllm-ascend/Kimi-K2-Instruct-W8A8",
"vllm-ascend/Kimi-K2-Thinking-Pruning",
"vllm-ascend/Llama-2-7b-hf",
"vllm-ascend/Llama-3.2-3B-Instruct",
"vllm-ascend/Meta-Llama-3-8B-Instruct",
"vllm-ascend/QwQ-32B-W8A8",
"vllm-ascend/QwQ-32B-w8a8",
"vllm-ascend/Qwen2-7B-W8A8",
"vllm-ascend/Qwen2-VL-7B-W8A8",
"vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8",
"vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new",
"vllm-ascend/Qwen2.5-0.5B-Instruct-fa3",
"vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8",
"vllm-ascend/Qwen2.5-Omni-7B",
"vllm-ascend/Qwen3-0.6B",
"vllm-ascend/Qwen3-0.6B-Instruct-W8A8",
"vllm-ascend/Qwen3-0.6B-W8A16",
"vllm-ascend/Qwen3-0.6B-W8A8",
"vllm-ascend/Qwen3-1.7B-W4A8-V1",
"vllm-ascend/Qwen3-235B-A22B",
"vllm-ascend/Qwen3-235B-A22B-W4A8",
"vllm-ascend/Qwen3-235B-A22B-W8A8",
"vllm-ascend/Qwen3-235B-A22B-w8a8",
"vllm-ascend/Qwen3-30B-A3B",
"vllm-ascend/Qwen3-a3B_eagle3",
"vllm-ascend/Qwen3-30B-A3B-Puring",
"vllm-ascend/Qwen3-30B-A3B-W8A8",
"vllm-ascend/Qwen3-30B-A3B-W8A8-Pruning",
"vllm-ascend/Qwen3-30B-A3B-W8A8-QuaRot",
"vllm-ascend/Qwen3-30B-A3B-Instruct-2507-quantized.w8a8",
"vllm-ascend/Qwen3-30B-A3B-Instruct-2507-quantized.w4a8",
"vllm-ascend/Qwen3-32B-W4A4",
"vllm-ascend/Qwen3-32B-W8A8",
"vllm-ascend/Qwen3-32B-W8A8-QuaRot",
"vllm-ascend/Qwen3-8B",
"vllm-ascend/Qwen3-8B-W4A8",
"vllm-ascend/Qwen3-8B-W8A8",
"vllm-ascend/Qwen3-Next-80B-A3B-Instruct-W8A8",
"vllm-ascend/Qwen3-Next-80B-A3B-Instruct-W8A8-Pruning",
"vllm-ascend/Qwen3-Omni-30B-A3B-Thinking",
"vllm-ascend/Qwen3-VL-8B-Instruct",
"vllm-ascend/Qwen3-VL-8B-Instruct-W8A8",
"vllm-ascend/TinyLlama-1.1B-Chat-v0.3",
"vllm-ascend/benchmark",
"vllm-ascend/ilama-3.2-1B",
"vllm-ascend/ilama-text2sql-spider",
"vllm-ascend/kernel_meta",
"vllm-ascend/llama-160m",
"vllm-ascend/llama-160m-accelerator",
"vllm-ascend/llama-2-7b-sql-lora-test",
"vllm-ascend/llama-68m",
"vllm-ascend/llama32-3b-text2sql-spider",
"vllm-ascend/pangu-pro-moe-pruing",
"vllm-ascend/self_cognition_Alice",
"vllm-ascend/self_cognition_Bob",
"vllm-ascend/tinyllama-colorist-lora",
"vllm-ascend/vllm-eagle-llama-68m-random",
"wemaster/deepseek_mtp_main_random_bf16",
"wemaster/deepseek_mtp_main_random_w8a8_part",
"xlangai/OpenCUA-7B",
"Eco-Tech/GLM-5-w4a8",
"Eco-Tech/GLM-4.7-W8A8-floatmtp",
"MiniMax/MiniMax-M2.5"
]
}

View File

@@ -1,44 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# This workflow builds nightly images as a layer-cache warm-up at 20:00 Beijing time.
# The nightly test workflows (Nightly-A2, Nightly-A3) each rebuild the image fresh
# before running tests, so this schedule only serves to pre-populate the build cache.
name: Nightly Image Build Schedule
on:
workflow_dispatch:
# Next step: Add more inputs here if needed, e.g. vllm version, vllm-ascend version, image tag, etc.
jobs:
build-a2:
uses: ./.github/workflows/_nightly_image_build.yaml
with:
target: a2
secrets:
HW_USERNAME: ${{ secrets.HW_USERNAME }}
HW_TOKEN: ${{ secrets.HW_TOKEN }}
GITEE_TOKEN: ${{ secrets.GITEE_TOKEN }}
build-a3:
uses: ./.github/workflows/_nightly_image_build.yaml
with:
target: a3
secrets:
HW_USERNAME: ${{ secrets.HW_USERNAME }}
HW_TOKEN: ${{ secrets.HW_TOKEN }}
GITEE_TOKEN: ${{ secrets.GITEE_TOKEN }}
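The header comment above describes a 20:00 Beijing-time warm-up, but the trigger list only contains workflow_dispatch. A minimal sketch of the implied schedule trigger, assuming a daily warm-up is desired (20:00 UTC+8 is 12:00 UTC):
on:
  schedule:
    - cron: '0 12 * * *'   # 20:00 Beijing time (UTC+8)
  workflow_dispatch: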

View File

@@ -1,46 +0,0 @@
name: Cancel runs on PR close
on:
pull_request:
types: [closed]
permissions:
actions: write
contents: read
jobs:
cancel:
runs-on: ubuntu-latest
steps:
- uses: actions/github-script@v8
with:
github-token: ${{ github.token }}
script: |
const { owner, repo } = context.repo;
const branch = context.payload.pull_request.head.ref;
const statuses = ["in_progress", "queued", "waiting", "pending", "requested"];
for (const status of statuses) {
let page = 1;
while (true) {
const resp = await github.rest.actions.listWorkflowRunsForRepo({
owner, repo, branch, status, per_page: 100, page
});
const runs = resp.data.workflow_runs;
if (!runs.length) break;
for (const run of runs) {
if (run.id === context.runId) continue; // don't cancel this workflow
try {
await github.rest.actions.cancelWorkflowRun({ owner, repo, run_id: run.id });
core.info(`Cancel requested: ${run.html_url}`);
} catch (e) {
// common reasons: already completed (409) or insufficient permissions (403)
core.warning(`Failed to cancel ${run.html_url}: ${e.message}`);
}
}
if (runs.length < 100) break;
page++;
}
}

View File

@@ -1,42 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: Refresh codecov
on:
schedule:
# UTC+8: 08:00, 12:00, 16:00
- cron: '0 0,4,8 * * *'
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
jobs:
refresh-codecov:
name: refresh codecov
strategy:
matrix:
vllm_version: [v0.18.0]
uses: ./.github/workflows/_unit_test.yaml
with:
vllm: ${{ matrix.vllm_version }}
runner: linux-amd64-cpu-16-hk
image: quay.nju.edu.cn/ascend/cann:8.2.rc2-910b-ubuntu22.04-py3.11
type: schedule

View File

@@ -1,73 +0,0 @@
# This is a docker build check and publish job:
# 1. PR-triggered docker image build check
#    - is for image build check
#    - enabled on main/*-dev branches
#    - push: ${{ github.event_name != 'pull_request' }} ==> false
# 2. branch pushes trigger image publish
#    - is for branch/dev/nightly images
#    - commits merged into main/*-dev ==> vllm-ascend:main / vllm-ascend:*-dev
# 3. tag pushes trigger image publish
#    - is for the final release image
#    - published when tagged with v* (PEP 440 version) ==> vllm-ascend:v1.2.3 / vllm-ascend:v1.2.3rc1
name: Image Build and Push
on:
pull_request:
branches:
- 'releases/*'
paths:
- 'Dockerfile*'
- '.github/workflows/schedule_image_build_and_push.yaml'
types: [ labeled, synchronize ]
workflow_dispatch:
inputs:
tag:
description: 'Docker tag for build results'
default: main
required: true
type: choice
options:
- main
- v0.17.0rc1
- v0.16.0rc1
- v0.15.0rc1
- v0.14.0rc1
- v0.13.0rc3
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
image_build:
name: Image Build and Push
if: github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'image-build')
strategy:
matrix:
build_meta:
- name: A2 Ubuntu
dockerfile: Dockerfile
suffix: ''
- name: A2 openeuler
dockerfile: Dockerfile.openEuler
suffix: 'openeuler'
- name: A3 Ubuntu
dockerfile: Dockerfile.a3
suffix: 'a3'
- name: A3 openEuler
dockerfile: Dockerfile.a3.openEuler
suffix: 'a3-openeuler'
- name: 310P Ubuntu
dockerfile: Dockerfile.310p
suffix: '310p'
- name: 310P openEuler
dockerfile: Dockerfile.310p.openEuler
suffix: '310p-openeuler'
uses: ./.github/workflows/_schedule_image_build.yaml
with:
dockerfile: ${{ matrix.build_meta.dockerfile }}
suffix: ${{ matrix.build_meta.suffix }}
quay_username: ${{ vars.QUAY_USERNAME }}
should_push: ${{ github.repository_owner == 'vllm-project' && github.event_name != 'pull_request' }}
workflow_dispatch_tag: ${{ inputs.tag }}
secrets:
QUAY_PASSWORD: ${{ secrets.QUAY_PASSWORD }}

View File

@@ -1,89 +0,0 @@
name: 'Image build lint'
on:
schedule:
# Runs at 04:00 UTC+8 (20:00 UTC) every day
- cron: '0 20 * * *'
workflow_dispatch:
inputs:
vllm_hash:
description: 'vLLM base hash'
default: main
required: true
type: string
push:
paths:
- '.github/workflows/dockerfiles/Dockerfile.lint'
- 'requirements-lint.txt'
- 'requirements-dev.txt'
- 'requirements.txt'
# only cancel in-progress runs of the same workflow
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
name: vllm-ascend lint image build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
with:
fetch-depth: 0
persist-credentials: false
- name: Print
run: |
lscpu
- name: Docker meta
id: meta
uses: docker/metadata-action@v6
with:
images: |
quay.io/ascend-ci/vllm-ascend
tags: lint
flavor:
latest=false
- name: Build - Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Build - Set up Docker Buildx
uses: docker/setup-buildx-action@v4
- name: Publish - Login to Quay Container Registry
if: ${{ github.repository_owner == 'vllm-project' }}
uses: docker/login-action@v4
with:
registry: quay.io
username: ${{ vars.QUAY_CI_USERNAME }}
password: ${{ secrets.QUAY_CI_PASSWORD }}
- name: Build and push
if: ${{ github.event_name != 'workflow_dispatch' }}
uses: docker/build-push-action@v7
with:
# For now, we only build amd64 lint image
platforms: 'linux/amd64'
context: .
file: .github/workflows/dockerfiles/Dockerfile.lint
push: true
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
provenance: false
- name: Build and push
if: ${{ github.event_name == 'workflow_dispatch' }}
uses: docker/build-push-action@v7
with:
# For now, we only build amd64 lint image
platforms: 'linux/amd64'
context: .
file: .github/workflows/dockerfiles/Dockerfile.lint
push: true
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
provenance: false
build-args: |
VLLM_HASH=${{ inputs.vllm_hash }}

View File

@@ -1,246 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# This workflow related to the resources atlas 800 A2
# We will not limit the concurrency of jobs on A2
name: Nightly-A2
on:
schedule:
# Run test at 23:45 Beijing time (UTC+8)
- cron: "45 15 * * *"
workflow_dispatch:
pull_request:
branches:
- 'main'
- '*-dev'
- 'releases/v*'
types: [labeled, synchronize]
permissions:
contents: read
pull-requests: read
issues: read
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
# only cancel in-progress runs of the same workflow
concurrency:
group: ascend-nightly-${{ github.ref }}-a2
cancel-in-progress: true
jobs:
parse-trigger:
name: Parse trigger and determine test scope
if: >-
github.event_name == 'schedule' ||
github.event_name == 'workflow_dispatch' ||
contains(github.event.pull_request.labels.*.name, 'nightly-test')
uses: ./.github/workflows/_parse_trigger.yaml
build-image:
name: Build nightly-a2 image
if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
uses: ./.github/workflows/_nightly_image_build.yaml
with:
target: a2
secrets:
HW_USERNAME: ${{ secrets.HW_USERNAME }}
HW_TOKEN: ${{ secrets.HW_TOKEN }}
GITEE_TOKEN: ${{ secrets.GITEE_TOKEN }}
single-node-tests:
name: single-node
needs: [parse-trigger, build-image]
if: >-
always() &&
needs.parse-trigger.outputs.run == 'true' &&
(needs.build-image.result == 'success' || needs.build-image.result == 'skipped')
strategy:
fail-fast: false
matrix:
test_config:
# pytest-driven tests
- name: test_custom_op
os: linux-aarch64-a2b3-1
tests: tests/e2e/nightly/single_node/ops/singlecard_ops
- name: test_custom_op_multi_card
os: linux-aarch64-a2b3-4
tests: tests/e2e/nightly/single_node/ops/multicard_ops_a2/
# YAML-driven tests
- name: qwen3-32b
os: linux-aarch64-a2b3-4
config_file_path: Qwen3-32B.yaml
- name: qwen3-next-80b-a3b-instruct
os: linux-aarch64-a2b3-4
config_file_path: Qwen3-Next-80B-A3B-Instruct-A2.yaml
- name: qwen3-32b-int8
os: linux-aarch64-a2b3-4
config_file_path: Qwen3-32B-Int8-A2.yaml
uses: ./.github/workflows/_e2e_nightly_single_node.yaml
with:
runner: ${{ matrix.test_config.os }}
image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a2'
tests: ${{ matrix.test_config.tests }}
config_file_path: ${{ matrix.test_config.config_file_path }}
name: ${{ matrix.test_config.name }}
should_run: >-
${{
needs.parse-trigger.outputs.run == 'true' && (
needs.parse-trigger.outputs.filter == 'all' ||
contains(needs.parse-trigger.outputs.filter, format(',{0},', matrix.test_config.name))
)
}}
multi-node-tests:
name: multi-node
needs: [parse-trigger, build-image, single-node-tests]
if: >-
always() &&
needs.parse-trigger.outputs.run == 'true' &&
(needs.build-image.result == 'success' || needs.build-image.result == 'skipped')
strategy:
fail-fast: false
max-parallel: 2
matrix:
test_config:
- name: multi-node-deepseek-dp
config_file_path: DeepSeek-R1-W8A8-A2.yaml
size: 2
- name: multi-node-qwen3-235b-dp
config_file_path: Qwen3-235B-A22B-A2.yaml
size: 2
uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
with:
soc_version: a2
runner: linux-amd64-cpu-8-hk
image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a2'
replicas: 1
size: ${{ matrix.test_config.size }}
config_file_path: ${{ matrix.test_config.config_file_path }}
vllm_ascend_ref: ${{ needs.parse-trigger.outputs.ref }}
should_run: >-
${{
needs.parse-trigger.outputs.run == 'true' && (
needs.parse-trigger.outputs.filter == 'all' ||
contains(needs.parse-trigger.outputs.filter, format(',{0},', matrix.test_config.name))
)
}}
secrets:
KUBECONFIG_B64: ${{ secrets.KUBECONFIG_HK_001_INTERNAL_B64 }}
single-node-accuracy-tests:
needs: [parse-trigger]
if: always() && needs.parse-trigger.outputs.run == 'true'
strategy:
fail-fast: false
matrix:
test_config:
- name: accuracy-group-1
os: linux-aarch64-a2b3-1
model_list:
- Qwen3-VL-8B-Instruct-W8A8
- Qwen3-8B
- Qwen2-Audio-7B-Instruct
- Qwen3-8B-W8A8
- Qwen3-VL-8B-Instruct
- Qwen2.5-Omni-7B
- name: accuracy-group-2
os: linux-aarch64-a2b3-1
model_list:
- ERNIE-4.5-21B-A3B-PT
- InternVL3_5-8B-hf
- Molmo-7B-D-0924
- Llama-3.2-3B-Instruct
- llava-onevision-qwen2-0.5b-ov-hf
- name: accuracy-group-3
os: linux-aarch64-a2b3-2
model_list:
- Qwen3-30B-A3B
- Qwen3-VL-30B-A3B-Instruct
- Qwen3-30B-A3B-W8A8
- name: accuracy-group-4
os: linux-aarch64-a2b3-4
model_list:
- Qwen3-Next-80B-A3B-Instruct
- Qwen3-Omni-30B-A3B-Instruct
uses: ./.github/workflows/_e2e_nightly_single_node_models.yaml
with:
vllm: v0.18.0
runner: ${{ matrix.test_config.os }}
model_list: ${{ toJson(matrix.test_config.model_list) }}
image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.1-910b-ubuntu22.04-py3.11'
is_run: >-
${{
needs.parse-trigger.outputs.run == 'true' && (
needs.parse-trigger.outputs.filter == 'all' ||
contains(needs.parse-trigger.outputs.filter, format(',{0},', matrix.test_config.name))
)
}}
upload: false
doc-test:
name: doc-test
needs: [parse-trigger]
if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
strategy:
# Each version should be tested
fail-fast: false
matrix:
vllm_version: [releases-v0.13.0, releases-v0.13.0-openeuler, main, main-openeuler]
runs-on: linux-aarch64-a2b3-1
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:${{ matrix.vllm_version }}
steps:
- name: Check NPU/CANN and git info
run: |
echo "====> Print NPU/CANN info"
npu-smi info
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
echo "====> Print vllm-ascend git info"
cd /vllm-workspace/vllm-ascend
git --no-pager log -1 || true
echo "====> Print vllm git info"
cd /vllm-workspace/vllm
git --no-pager log -1 || true
- name: Checkout vllm-project/vllm-ascend repo
uses: actions/checkout@v6
- name: Run vllm-ascend/tests/e2e/run_doctests.sh
run: |
# PWD: /__w/vllm-ascend/vllm-ascend
# Make sure e2e tests are latest
echo "Replacing /vllm-workspace/vllm-ascend/tests/e2e ..."
rm -rf /vllm-workspace/vllm-ascend/tests/e2e
mkdir -p /vllm-workspace/vllm-ascend/tests
# Overwrite e2e and examples
cp -r tests/e2e /vllm-workspace/vllm-ascend/tests/
cp -r examples /vllm-workspace/vllm-ascend/
# Simulate entering the container's default working directory
cd /workspace
# Run real test
echo "Test:"
/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh

View File

@@ -1,236 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# This workflow uses the Atlas 800 A3 resource pool
# **Please note**: the A3 resource pool currently allows at most 5*16 NPUs in use concurrently
# We limit the concurrency of jobs on A3 to avoid the risk of running out of resources
name: Nightly-A3
on:
schedule:
# Run test at 23:45 Beijing time (UTC+8)
- cron: "45 15 * * *"
workflow_dispatch:
pull_request:
branches:
- 'main'
- '*-dev'
- 'releases/v*'
types: [ labeled, synchronize ]
permissions:
contents: read
pull-requests: read
issues: read
# Bash shells do not use ~/.profile or ~/.bashrc so these shells need to be explicitly
# declared as "shell: bash -el {0}" on steps that need to be properly activated.
# It's used to activate ascend-toolkit environment variables.
defaults:
run:
shell: bash -el {0}
concurrency:
group: ascend-nightly-${{ github.ref }}-a3
cancel-in-progress: true
jobs:
parse-trigger:
name: Parse trigger and determine test scope
if: >-
github.event_name == 'schedule' ||
github.event_name == 'workflow_dispatch' ||
contains(github.event.pull_request.labels.*.name, 'nightly-test')
uses: ./.github/workflows/_parse_trigger.yaml
build-image:
name: Build nightly-a3 image
if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
uses: ./.github/workflows/_nightly_image_build.yaml
with:
target: a3
secrets:
HW_USERNAME: ${{ secrets.HW_USERNAME }}
HW_TOKEN: ${{ secrets.HW_TOKEN }}
GITEE_TOKEN: ${{ secrets.GITEE_TOKEN }}
multi-node-tests:
name: multi-node
needs: [parse-trigger, build-image]
if: >-
always() &&
needs.parse-trigger.outputs.run == 'true' &&
(needs.build-image.result == 'success' || needs.build-image.result == 'skipped')
strategy:
fail-fast: false
max-parallel: 2
matrix:
test_config:
- name: multi-node-deepseek-pd
config_file_path: DeepSeek-V3.yaml
size: 2
- name: multi-node-qwen3-dp
config_file_path: Qwen3-235B-A22B.yaml
size: 2
- name: multi-node-qwenw8a8-2node
config_file_path: Qwen3-235B-W8A8.yaml
size: 2
- name: multi-node-qwenw8a8-2node-eplb
config_file_path: Qwen3-235B-W8A8-EPLB.yaml
size: 2
- name: multi-node-dpsk3.2-2node
config_file_path: DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml
size: 2
- name: multi-node-qwen3-dp-mooncake-layerwise
config_file_path: Qwen3-235B-A22B-Mooncake-Layerwise.yaml
size: 2
- name: multi-node-deepseek-r1-w8a8-longseq
config_file_path: DeepSeek-R1-W8A8-longseq.yaml
size: 2
- name: multi-node-qwenw8a8-2node-longseq
config_file_path: Qwen3-235B-W8A8-longseq.yaml
size: 2
- name: multi-node-qwen-disagg-pd
config_file_path: Qwen3-235B-disagg-pd.yaml
size: 2
- name: multi-node-qwen-vl-disagg-pd
config_file_path: Qwen3-VL-235B-disagg-pd.yaml
size: 2
- name: multi-node-kimi-k2-instruct-w8a8
config_file_path: Kimi-K2-Instruct-W8A8.yaml
size: 2
- name: multi-node-deepseek-v3.1
config_file_path: DeepSeek-V3.1-BF16.yaml
size: 2
- name: multi-node-deepseek-v3.2-W8A8-EP
config_file_path: DeepSeek-V3_2-W8A8-EP.yaml
size: 4
uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
with:
soc_version: a3
runner: linux-aarch64-a3-0
image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
replicas: 1
size: ${{ matrix.test_config.size }}
config_file_path: ${{ matrix.test_config.config_file_path }}
vllm_ascend_ref: ${{ needs.parse-trigger.outputs.ref }}
should_run: >-
${{
needs.parse-trigger.outputs.run == 'true' && (
needs.parse-trigger.outputs.filter == 'all' ||
contains(needs.parse-trigger.outputs.filter, format(',{0},', matrix.test_config.name))
)
}}
secrets:
KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
single-node-tests:
name: single-node
needs: [parse-trigger, build-image, multi-node-tests]
if: >-
always() &&
needs.parse-trigger.outputs.run == 'true' &&
(needs.build-image.result == 'success' || needs.build-image.result == 'skipped')
strategy:
fail-fast: false
matrix:
test_config:
# pytest-driven tests
- name: qwen3-30b-acc
os: linux-aarch64-a3-4
tests: tests/e2e/weekly/single_node/models/test_qwen3_30b_acc.py
- name: custom-multi-ops
os: linux-aarch64-a3-16
tests: tests/e2e/nightly/single_node/ops/multicard_ops_a3/
# YAML-driven tests
- name: deepseek-r1-0528-w8a8
os: linux-aarch64-a3-16
config_file_path: DeepSeek-R1-0528-W8A8.yaml
- name: deepseek-r1-w8a8-hbm
os: linux-aarch64-a3-16
config_file_path: DeepSeek-R1-W8A8-HBM.yaml
- name: deepseek-v3-2-w8a8
os: linux-aarch64-a3-16
config_file_path: DeepSeek-V3.2-W8A8.yaml
- name: glm-5-w4a8
os: linux-aarch64-a3-16
config_file_path: GLM-5.yaml
- name: glm-4.7-w8a8
os: linux-aarch64-a3-16
config_file_path: GLM-4.7.yaml
- name: kimi-k2-thinking
os: linux-aarch64-a3-16
config_file_path: Kimi-K2-Thinking.yaml
- name: kimi-k2.5
os: linux-aarch64-a3-16
config_file_path: Kimi-K2.5.yaml
- name: minimax-m2-5
os: linux-aarch64-a3-16
config_file_path: MiniMax-M2.5-A3.yaml
- name: mtpx-deepseek-r1-0528-w8a8
os: linux-aarch64-a3-16
config_file_path: MTPX-DeepSeek-R1-0528-W8A8.yaml
- name: qwen3-235b-a22b-w8a8
os: linux-aarch64-a3-16
config_file_path: Qwen3-235B-A22B-W8A8.yaml
- name: qwen3-30b-a3b-w8a8
os: linux-aarch64-a3-4
config_file_path: Qwen3-30B-A3B-W8A8.yaml
- name: qwen3-next-80b-a3b-instruct
os: linux-aarch64-a3-4
config_file_path: Qwen3-Next-80B-A3B-Instruct.yaml
- name: qwen3-next-80b-a3b-instruct-w8a8
os: linux-aarch64-a3-4
config_file_path: Qwen3-Next-80B-A3B-Instruct-W8A8.yaml
- name: qwq-32b
os: linux-aarch64-a3-4
config_file_path: QwQ-32B.yaml
- name: qwen3-32b-int8
os: linux-aarch64-a3-4
config_file_path: Qwen3-32B-Int8.yaml
- name: qwen2-5-vl-7b
os: linux-aarch64-a3-4
config_file_path: Qwen2.5-VL-7B-Instruct.yaml
- name: qwen2-5-vl-7b-epd
os: linux-aarch64-a3-4
config_file_path: Qwen2.5-VL-7B-Instruct-EPD.yaml
- name: qwen2-5-vl-32b
os: linux-aarch64-a3-4
config_file_path: Qwen2.5-VL-32B-Instruct.yaml
- name: qwen3-32b-int8-a3-feature-stack3
os: linux-aarch64-a3-4
config_file_path: Qwen3-32B-Int8-A3-Feature-Stack3.yaml
- name: qwen3-32b-int8-prefix-cache
os: linux-aarch64-a3-4
config_file_path: Prefix-Cache-Qwen3-32B-Int8.yaml
- name: deepseek-r1-0528-w8a8-prefix-cache
os: linux-aarch64-a3-16
config_file_path: Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml
uses: ./.github/workflows/_e2e_nightly_single_node.yaml
with:
runner: ${{ matrix.test_config.os }}
image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
tests: ${{ matrix.test_config.tests }}
config_file_path: ${{ matrix.test_config.config_file_path }}
name: ${{ matrix.test_config.name }}
should_run: >-
${{
needs.parse-trigger.outputs.run == 'true' && (
needs.parse-trigger.outputs.filter == 'all' ||
contains(needs.parse-trigger.outputs.filter, format(',{0},', matrix.test_config.name))
)
}}

View File

@@ -1,459 +0,0 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
name: Release Code and Wheel
on:
schedule:
# UTC+8: 10:00 and 16:00
- cron: '0 2,8 * * *'
push:
tags:
- 'v*'
workflow_dispatch:
inputs:
tag:
description: 'Docker tag for build results'
default: main
required: true
type: choice
options:
- main
- v0.17.0rc1
- v0.16.0rc1
- v0.15.0rc1
- v0.14.0rc1
- v0.13.0rc3
jobs:
build_and_release_code:
name: release code
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- name: checkout vllm-ascend
if: ${{ github.event_name != 'workflow_dispatch' }}
uses: actions/checkout@v6
- name: checkout vllm-ascend ${{ inputs.tag }}
if: ${{ github.event_name == 'workflow_dispatch' }}
uses: actions/checkout@v6
with:
ref: ${{ inputs.tag }}
- name: Print CPU info
run: |
lscpu
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python3 -m pip install twine setuptools_scm
- name: Generate tar.gz
env:
SOC_VERSION: ascend910b1
run: |
python3 setup.py sdist
ls dist
- name: Archive tar.gz
uses: actions/upload-artifact@v7
with:
name: vllm-ascend-src
path: dist/*
- name: Release
if: ${{ github.event_name == 'push' }}
run: |
python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
build_and_release_wheel:
name: build and release wheel
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
python-version: ["3.10", "3.11"]
runs-on: ${{ matrix.os }}
steps:
- name: checkout vllm-ascend
if: ${{ github.event_name != 'workflow_dispatch' }}
uses: actions/checkout@v6
- name: checkout vllm-ascend ${{ inputs.tag }}
if: ${{ github.event_name == 'workflow_dispatch' }}
uses: actions/checkout@v6
with:
ref: ${{ inputs.tag }}
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build wheel
run: |
ls
docker build -f ./.github/workflows/dockerfiles/Dockerfile.buildwheel.a2 \
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel:v1 .
docker run --rm \
-u "$(id -u):$(id -g)" \
-v "$(pwd):/outpwd" \
wheel:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude libnnopbase.so \
--exclude libprofapi.so \
--exclude libgraph_base.so \
--exclude libgraph.so \
--exclude libexe_graph.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so" \
--exclude "libopapi.so" \
--exclude "liberror_manager.so" \
--exclude "libruntime.so" \
--exclude "libmmpa.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Generate variant wheels
env:
WHEEL_FILE: dist
PROJECT_TOML: .github/workflows/scripts/wheel/pyproject.toml
OUTPUT_DIR: dist/variants
run: |
pip install build git+https://github.com/wheelnext/variantlib.git --quiet
mkdir -p dist/variants
python3 .github/workflows/scripts/wheel/make_variant.py \
-c .github/workflows/scripts/wheel/config.json \
-l a2
echo "Generated variant wheels:"
ls dist/variants/
- name: Archive wheel
uses: actions/upload-artifact@v7
with:
name: vllm-ascend-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/
- name: Release
if: ${{ github.event_name == 'push' || github.event_name == 'workflow_dispatch' }}
run: |
python3 -m pip install twine
python3 -m twine upload --verbose dist/*.whl -u __token__ -p ${{ secrets.PYPI_TOKEN }}
build_and_release_wheel_a3:
name: build and release wheel (A3)
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
python-version: ["3.10", "3.11"]
runs-on: ${{ matrix.os }}
steps:
- name: checkout vllm-ascend
if: ${{ github.event_name != 'workflow_dispatch' }}
uses: actions/checkout@v6
- name: checkout vllm-ascend ${{ inputs.tag }}
if: ${{ github.event_name == 'workflow_dispatch' }}
uses: actions/checkout@v6
with:
ref: ${{ inputs.tag }}
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build wheel
run: |
ls
docker build -f ./.github/workflows/dockerfiles/Dockerfile.buildwheel.a3 \
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel-a3:v1 .
docker run --rm \
-u "$(id -u):$(id -g)" \
-v "$(pwd):/outpwd" \
wheel-a3:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude libnnopbase.so \
--exclude libprofapi.so \
--exclude libgraph_base.so \
--exclude libgraph.so \
--exclude libexe_graph.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so" \
--exclude "libopapi.so" \
--exclude "liberror_manager.so" \
--exclude "libruntime.so" \
--exclude "libmmpa.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Generate variant wheels
env:
WHEEL_FILE: dist
PROJECT_TOML: .github/workflows/scripts/wheel/pyproject.toml
OUTPUT_DIR: dist/variants
run: |
pip install build git+https://github.com/wheelnext/variantlib.git --quiet
mkdir -p dist/variants
python3 .github/workflows/scripts/wheel/make_variant.py \
-c .github/workflows/scripts/wheel/config.json \
-l a3
echo "Generated variant wheels:"
ls dist/variants/
- name: Archive wheel
uses: actions/upload-artifact@v7
with:
name: vllm-ascend-a3-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/
build_and_release_wheel_310p:
name: build and release wheel (310P)
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
python-version: ["3.10", "3.11"]
runs-on: ${{ matrix.os }}
steps:
- name: checkout vllm-ascend
if: ${{ github.event_name != 'workflow_dispatch' }}
uses: actions/checkout@v6
- name: checkout vllm-ascend ${{ inputs.tag }}
if: ${{ github.event_name == 'workflow_dispatch' }}
uses: actions/checkout@v6
with:
ref: ${{ inputs.tag }}
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Build wheel
run: |
ls
docker build -f ./.github/workflows/dockerfiles/Dockerfile.buildwheel.310p \
--build-arg PY_VERSION=${{ matrix.python-version }} \
-t wheel-310p:v1 .
docker run --rm \
-u "$(id -u):$(id -g)" \
-v "$(pwd):/outpwd" \
wheel-310p:v1 \
bash -c "cp -r /workspace/vllm-ascend/dist /outpwd"
ls dist
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: ${{ matrix.python-version }}
- name: Repair wheels with auditwheel
run: |
python3 -m pip install auditwheel
python3 -m pip install patchelf
mkdir -p dist/repaired
for whl in dist/*.whl; do
auditwheel repair "$whl" -w dist/repaired/ \
--exclude libplatform.so \
--exclude libregister.so \
--exclude libge_common_base.so \
--exclude libc10.so \
--exclude libc_sec.so \
--exclude libnnopbase.so \
--exclude libprofapi.so \
--exclude libgraph_base.so \
--exclude libgraph.so \
--exclude libexe_graph.so \
--exclude "libascend*.so" \
--exclude "libtorch*.so" \
--exclude "libopapi.so" \
--exclude "liberror_manager.so" \
--exclude "libruntime.so" \
--exclude "libmmpa.so"
done
rm -f dist/*.whl
mv dist/repaired/*.whl dist/
rmdir dist/repaired
ls dist
- name: Verify automatic platform tags
run: |
cd dist
for wheel in *.whl; do
echo "verification file: $wheel"
auditwheel show "$wheel"
done
- name: Generate variant wheels
env:
WHEEL_FILE: dist
PROJECT_TOML: .github/workflows/scripts/wheel/pyproject.toml
OUTPUT_DIR: dist/variants
run: |
pip install build git+https://github.com/wheelnext/variantlib.git --quiet
mkdir -p dist/variants
python3 .github/workflows/scripts/wheel/make_variant.py \
-c .github/workflows/scripts/wheel/config.json \
-l 310p
echo "Generated variant wheels:"
ls dist/variants/
- name: Archive wheel
uses: actions/upload-artifact@v7
with:
name: vllm-ascend-310p-${{ matrix.os }}-py${{ matrix.python-version }}-wheel
path: dist/
generate_and_upload_variant_index:
name: generate and upload variant index
needs: [build_and_release_wheel, build_and_release_wheel_a3, build_and_release_wheel_310p]
if: ${{ github.event_name == 'push' || github.event_name == 'workflow_dispatch' }}
runs-on: ubuntu-24.04
steps:
- name: Download all variant wheels
uses: actions/download-artifact@v4
with:
pattern: '*-wheel'
path: all-wheels/
merge-multiple: true
- name: Collect variant wheels
run: |
mkdir -p combined-variants
find all-wheels/ -path '*/variants/*.whl' -exec cp {} combined-variants/ \;
echo "Combined variant wheels:"
ls combined-variants/
- name: Set up Python
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.11"
- name: Generate combined variant index
run: |
pip install git+https://github.com/wheelnext/variantlib.git --quiet
variantlib generate-index-json -d combined-variants/
echo "Generated index files:"
ls combined-variants/
- name: Upload wheels and variant index to OBS
env:
OBS_ACCESS_KEY: ${{ secrets.OBS_ACCESS_KEY_ID }}
OBS_SECRET_KEY: ${{ secrets.OBS_SECRET_ACCESS_KEY }}
run: |
pip install esdk-obs-python --quiet
python3 - <<'EOF'
import os, glob
from obs import ObsClient
OBS_BUCKET = 'ascend-artifcat-packages'
OBS_PATH = 'pypi/packages/ascend/repos/pypi/variant/vllm-ascend'
client = ObsClient(
access_key_id=os.environ['OBS_ACCESS_KEY'],
secret_access_key=os.environ['OBS_SECRET_KEY'],
server='https://obs.cn-north-4.myhuaweicloud.com'
)
files = glob.glob('combined-variants/*.whl') + glob.glob('combined-variants/*.json')
for file in files:
filename = os.path.basename(file)
resp = client.putFile(OBS_BUCKET, f'{OBS_PATH}/{filename}', file)
if resp.status < 300:
print(f'Uploaded: {filename}')
else:
raise Exception(f'Failed to upload {filename}: {resp.errorMessage}')
EOF


@@ -1,66 +0,0 @@
name: "Close stale resolved/awaiting-feedback issues"
on:
schedule:
- cron: '0 2 * * *'
jobs:
stale:
runs-on: ubuntu-latest
permissions:
actions: write
issues: write
steps:
- uses: actions/stale@v10
with:
# Process issues with the 'resolved' label
any-of-labels: 'resolved'
# Mark as stale after a period of inactivity
days-before-stale: 7
stale-issue-label: 'stale'
stale-issue-message: |
This issue has been marked as `resolved` but has not received any feedback for some time, so it is now labeled as `stale`.
If you feel this was a mistake, please leave a comment to have the `stale` label removed.
`Stale` issues will automatically be closed after 14 days of inactivity.
# Close stale issues after a period of inactivity
days-before-close: 14
close-issue-message: |
This issue is being closed due to a lack of recent activity.
If you have any further questions or requirements, please feel free to reopen this issue or create a new one.
# Automatically remove the 'stale' label when the issue is updated (default is true)
remove-stale-when-updated: true
# Also remove the 'resolved' label
labels-to-remove-when-unstale: 'resolved'
# Avoid accidental PR processing (PRs can be handled if needed; this is issue-only)
days-before-pr-stale: -1
days-before-pr-close: -1
- uses: actions/stale@v10
with:
# Process issues with the 'awaiting-feedback' label
any-of-labels: 'awaiting-feedback'
# Mark as stale after a period of inactivity
days-before-stale: 7
stale-issue-label: 'stale'
stale-issue-message: |
This issue has been marked as `awaiting-feedback` but has not received any feedback for some time, so it is now labeled as `stale`.
To help us locate and resolve the issue more accurately, please provide the relevant information mentioned above.
`Stale` issues will automatically be closed after 14 days of inactivity.
# Close stale issues after a period of inactivity
days-before-close: 14
close-issue-message: |
This issue is being closed due to a lack of recent activity.
If you have any further questions or requirements, please feel free to reopen this issue or create a new one.
# Automatically remove the 'stale' label when the issue is updated (default is true)
remove-stale-when-updated: true
# Also remove the 'awaiting-feedback' label
labels-to-remove-when-unstale: 'awaiting-feedback'
# Avoid accidental PR processing (PRs can be handled if needed; this is issue-only)
days-before-pr-stale: -1
days-before-pr-close: -1


@@ -1,120 +0,0 @@
name: Update estimated test times
on:
schedule:
- cron: '0 2 * * 1' # Every Monday at 02:00 UTC
workflow_dispatch:
permissions:
contents: write
pull-requests: write
env:
UPSTREAM_REPO: vllm-project/vllm-ascend
FORK_OWNER: vllm-ascend-ci
BRANCH_NAME: auto/update-estimated-times-${{ github.run_id }}
concurrency:
group: update-estimated-times-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e-test:
name: e2e-test
strategy:
matrix:
vllm_version: [v0.18.0]
type: [full, light]
uses: ./.github/workflows/_e2e_test.yaml
with:
vllm: ${{ matrix.vllm_version }}
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:main
contains_310: false
type: ${{ matrix.type }}
continue_on_error: true # Continue even if some tests fail; we want to collect as much timing data as possible
update-estimated-times:
name: Update estimated_time in config.yaml
needs: [e2e-test]
runs-on: ubuntu-latest
steps:
- name: Checkout fork repo
uses: actions/checkout@v6
with:
repository: ${{ env.FORK_OWNER }}/vllm-ascend
token: ${{ secrets.PAT_TOKEN }}
- name: Download all timing artifacts
uses: actions/download-artifact@v4
with:
pattern: timing-data-*
path: timing-artifacts/
merge-multiple: false
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.11'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pyyaml
- name: Config git
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git remote add upstream https://github.com/${{ env.UPSTREAM_REPO }}.git
git fetch upstream main && git checkout -b ${{ env.BRANCH_NAME }} upstream/main
- name: Update config.yaml from timing data
run: |
python3 .github/workflows/scripts/update_estimated_time.py \
--timing-dir timing-artifacts/ \
--config .github/workflows/scripts/config.yaml
- name: Check for changes
id: check_changes
run: |
if git diff --quiet .github/workflows/scripts/config.yaml; then
echo "changed=false" >> "$GITHUB_OUTPUT"
echo "No changes to config.yaml."
else
echo "changed=true" >> "$GITHUB_OUTPUT"
echo "config.yaml has been updated:"
git diff .github/workflows/scripts/config.yaml
fi
- name: Create pull request
env:
GITHUB_TOKEN: ${{ secrets.PAT_TOKEN }}
run: |
git add .github/workflows/scripts/config.yaml
git commit -sm "[CI] Auto-update estimated test times in config.yaml. Computed from timing-data artifacts on workflow run-${{ github.run_id }}"
git remote -v
git push -f origin ${{ env.BRANCH_NAME }}:${{ env.BRANCH_NAME }}
gh pr create \
--repo ${{ env.UPSTREAM_REPO }} \
--base main \
--head ${{ env.FORK_OWNER }}:${{ env.BRANCH_NAME }} \
--title "[CI]: Auto-update estimated test times in config.yaml" \
--body "## Summary
This PR was auto-generated by the **Update estimated test times** [workflow](https://github.com/${{ env.UPSTREAM_REPO }}/actions/runs/${{ github.run_id }}).
It updates the \`estimated_time\` values in \`.github/workflows/scripts/config.yaml\` based on actual elapsed times collected from CI workflow runs.
### Methodology
- Each e2e test job uploads its elapsed time as a \`timing-data-*\` artifact upon completion.
- The workflow aggregates all collected timing artifacts across jobs.
- For each test, the **median** elapsed time is computed to reduce outlier impact.
- A **10% safety buffer** is applied and the result is rounded to the nearest 10 seconds.
### Review Checklist
- [ ] Verify that updated \`estimated_time\` values are within a reasonable range.
- [ ] Confirm no test entries are missing or unexpectedly removed.
> If the new values look reasonable, feel free to merge. Otherwise, leave a comment describing the anomaly."
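For context, the aggregation described in the PR body above (median across runs, a 10% safety buffer, rounding to the nearest 10 seconds) reduces to a few lines. The sketch below is illustrative only: the actual update path goes through update_estimated_time.py shown later in this diff, and the function name here is hypothetical.

from statistics import median

def estimate_time(elapsed_samples: list[float]) -> int:
    # Median elapsed time, +10% safety buffer, rounded to the nearest 10 seconds.
    return int(round(median(elapsed_samples) * 1.10 / 10) * 10)

print(estimate_time([171.0, 183.0, 178.0]))  # median 178 * 1.1 = 195.8 -> 200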

File diff suppressed because it is too large

@@ -1,103 +0,0 @@
import subprocess
import time
from dataclasses import dataclass
class _Color:
HEADER = "\033[95m"
GREEN = "\033[92m"
RED = "\033[91m"
RESET = "\033[0m"
@dataclass
class TestFile:
name: str
estimated_time: float = 60
is_skipped: bool = False
@dataclass
class TestRecord:
name: str
passed: bool
elapsed: float
estimated: float
def to_dict(self) -> dict:
return {
"name": self.name,
"passed": self.passed,
"elapsed": self.elapsed,
"estimated": self.estimated,
}
def run_tests(
files: list[TestFile],
continue_on_error: bool = False,
) -> tuple[int, list[TestRecord]]:
"""
Run each TestFile with pytest and collect timing results.
NOTE:
The emitted START / PASSED / FAILED log lines are parsed by
ci_log_summary.py to recover per-test invocation boundaries.
Keep this output format stable, or update the corresponding
regexes in those CI log summarizers together.
Args:
files: Tests to run (skipped entries should already be filtered out).
continue_on_error: If True, keep running after a failure.
Returns:
(exit_code, records) — exit_code is 0 on full success, -1 otherwise.
"""
records: list[TestRecord] = []
all_passed = True
total_start = time.perf_counter()
for i, test in enumerate(files):
print(f"\n{'.' * 60}", flush=True)
# NOTE: ci_log_summary.py depends on this
# START line format when splitting suite-level logs into test runs.
print(
f"{_Color.HEADER}[{i + 1}/{len(files)}] START {test.name}{_Color.RESET}",
flush=True,
)
start = time.perf_counter()
result = subprocess.run(["pytest", "-sv", "--durations=0", "--color=yes", test.name])
elapsed = time.perf_counter() - start
passed = result.returncode == 0
records.append(TestRecord(name=test.name, passed=passed, elapsed=elapsed, estimated=test.estimated_time))
color = _Color.GREEN if passed else _Color.RED
status = "PASSED" if passed else f"FAILED (exit code {result.returncode})"
# NOTE: ci_log_summary.py depends on this
# PASSED / FAILED (exit code X) line format for suite end detection.
print(
f"{color}[{i + 1}/{len(files)}] {status} {test.name} ({elapsed:.0f}s){_Color.RESET}",
flush=True,
)
if not passed:
all_passed = False
if not continue_on_error:
break
total_elapsed = time.perf_counter() - total_start
passed_count = sum(1 for r in records if r.passed)
print(f"\n{'=' * 60}")
color = _Color.GREEN if all_passed else _Color.RED
print(f"{color}Summary: {passed_count}/{len(files)} passed ({total_elapsed:.2f}s total){_Color.RESET}")
print("=" * 60)
for r in records:
icon = f"{_Color.GREEN}{_Color.RESET}" if r.passed else f"{_Color.RED}{_Color.RESET}"
print(f" {icon} {r.name} ({r.elapsed:.0f}s)")
print(flush=True)
return (0 if all_passed else -1), records
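ci_log_summary.py itself is not part of this diff; the standalone sketch below only illustrates how the stable START / PASSED / FAILED lines emitted by run_tests() could be recovered from a suite log. The regexes are assumptions for illustration, not the summarizer's actual patterns.

import re

# Hypothetical patterns matching the log format produced by run_tests() above.
START_RE = re.compile(r"\[(\d+)/(\d+)\] START (\S+)")
END_RE = re.compile(r"\[(\d+)/(\d+)\] (PASSED|FAILED \(exit code \d+\)) (\S+) \((\d+)s\)")

line = "[3/12] PASSED tests/e2e/singlecard/test_sampler.py (258s)"
match = END_RE.search(line)  # search() also tolerates the ANSI color prefix
if match:
    print(match.group(3), match.group(4), match.group(5))  # PASSED <file> 258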


@@ -1,169 +0,0 @@
e2e-singlecard:
- name: tests/e2e/singlecard/compile/test_graphex_norm_quant_fusion.py
estimated_time: 83
- name: tests/e2e/singlecard/compile/test_graphex_qknorm_rope_fusion.py
estimated_time: 69
- name: tests/e2e/singlecard/test_auto_fit_max_mode_len.py
estimated_time: 70
- name: tests/e2e/singlecard/test_eager_mode_acc.py
estimated_time: 255
- name: tests/e2e/singlecard/test_aclgraph_accuracy.py
estimated_time: 839
- name: tests/e2e/singlecard/test_aclgraph_batch_invariant.py
estimated_time: 515
- name: tests/e2e/singlecard/test_aclgraph_mem.py
estimated_time: 187
- name: tests/e2e/singlecard/test_async_scheduling.py
estimated_time: 252
- name: tests/e2e/singlecard/test_batch_invariant.py
estimated_time: 506
- name: tests/e2e/singlecard/test_camem.py
estimated_time: 149
- name: tests/e2e/singlecard/test_completion_with_prompt_embeds.py
estimated_time: 136
- name: tests/e2e/singlecard/test_cpu_offloading.py
estimated_time: 166
- name: tests/e2e/singlecard/test_guided_decoding.py
estimated_time: 407
- name: tests/e2e/singlecard/test_ilama_lora.py
estimated_time: 112
- name: tests/e2e/singlecard/test_llama32_lora.py
estimated_time: 239
- name: tests/e2e/singlecard/test_qwen3_multi_loras.py
estimated_time: 140
- name: tests/e2e/singlecard/test_models.py
estimated_time: 320
- name: tests/e2e/singlecard/test_multistream_overlap_shared_expert.py
estimated_time: 292
- name: tests/e2e/singlecard/test_quantization.py
estimated_time: 284
- name: tests/e2e/singlecard/test_sampler.py
estimated_time: 258
- name: tests/e2e/singlecard/test_vlm.py
estimated_time: 495
- name: tests/e2e/singlecard/test_multi_instance.py
estimated_time: 120
- name: tests/e2e/singlecard/test_xlite.py
estimated_time: 135
- name: tests/e2e/singlecard/compile/test_norm_quant_fusion.py
estimated_time: 106
- name: tests/e2e/singlecard/pooling/test_classification.py
estimated_time: 148
- name: tests/e2e/singlecard/pooling/test_embedding.py
estimated_time: 324
- name: tests/e2e/singlecard/pooling/test_scoring.py
estimated_time: 553
- name: tests/e2e/singlecard/pooling/test_qwen3_reranker_lora.py
estimated_time: 280
- name: tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py
estimated_time: 6141
- name: tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py
estimated_time: 600
- name: tests/e2e/singlecard/model_runner_v2/test_basic.py
estimated_time: 80
is_skipped: true
e2e-singlecard-light:
- name: tests/e2e/singlecard/test_aclgraph_accuracy.py::test_piecewise_res_consistency
estimated_time: 229
- name: tests/e2e/singlecard/test_quantization.py::test_qwen3_w8a8_quant
estimated_time: 183
e2e-2card-light:
- name: tests/e2e/multicard/2-cards/test_qwen3_moe.py::test_qwen3_moe_distributed_mp_tp2_ep
estimated_time: 164
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek3_2_w8a8_pruning_mtp_tp2_ep
estimated_time: 90
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek3_2_w8a8c8_pruning_mtp_tp2_ep
estimated_time: 180
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_gpt_oss_distributed_tp2
estimated_time: 352
e2e-multicard-2-cards:
- name: tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py
estimated_time: 0
is_skipped: true
- name: tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py
estimated_time: 0
is_skipped: true
- name: tests/e2e/multicard/2-cards/test_offline_weight_load.py
estimated_time: 0
is_skipped: true
- name: tests/e2e/multicard/2-cards/test_shared_expert_dp.py
estimated_time: 0
is_skipped: true
- name: tests/e2e/multicard/2-cards/test_qwen3_performance.py
estimated_time: 194
- name: tests/e2e/multicard/2-cards/test_data_parallel.py
estimated_time: 454
- name: tests/e2e/multicard/2-cards/test_expert_parallel.py
estimated_time: 220
- name: tests/e2e/multicard/2-cards/test_external_launcher.py
estimated_time: 550
- name: tests/e2e/multicard/2-cards/test_full_graph_mode.py
estimated_time: 805
- name: tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py
estimated_time: 113
- name: tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py
estimated_time: 410
- name: tests/e2e/multicard/2-cards/spec_decode/test_quarot_eagle.py
estimated_time: 859
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek_multistream_moe_tp2
estimated_time: 112
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_w4a8_dynamic_tp2
estimated_time: 104
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_moe_sp_tp2
estimated_time: 176
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek_w4a8_accuracy_tp2
estimated_time: 125
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_moe_fc2_tp2
estimated_time: 173
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek_v2_lite_fc1_tp2
estimated_time: 124
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_dense_fc1_tp2
estimated_time: 99
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_dense_prefetch_mlp_weight_tp2
estimated_time: 110
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek3_2_w8a8_pruning_mtp_tp2_ep
estimated_time: 111
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_deepseek3_2_w8a8c8_pruning_mtp_tp2_ep
estimated_time: 180
- name: tests/e2e/multicard/2-cards/test_offline_inference_distributed.py::test_qwen3_w4a4_distributed_tp2
estimated_time: 202
- name: tests/e2e/multicard/2-cards/test_prefix_caching.py
estimated_time: 470
- name: tests/e2e/multicard/2-cards/test_quantization.py
estimated_time: 511
- name: tests/e2e/multicard/2-cards/test_qwen3_moe.py
estimated_time: 986
- name: tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py
estimated_time: 210
- name: tests/e2e/multicard/2-cards/test_single_request_aclgraph.py
estimated_time: 290
- name: tests/e2e/multicard/2-cards/test_disaggregated_encoder.py
estimated_time: 164
- name: tests/e2e/multicard/2-cards/test_sp_pass.py
estimated_time: 198
- name: tests/e2e/multicard/2-cards/test_sequence_parallelism_moe.py
estimated_time: 120
e2e-multicard-4-cards:
- name: tests/e2e/multicard/4-cards/test_qwen3_next.py
estimated_time: 1868
- name: tests/e2e/multicard/4-cards/test_qwen3_5.py
estimated_time: 1030
- name: tests/e2e/multicard/4-cards/test_data_parallel_tp2.py
estimated_time: 306
- name: tests/e2e/multicard/4-cards/test_kimi_k2.py
estimated_time: 19
- name: tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py
estimated_time: 1445
- name: tests/e2e/multicard/4-cards/long_sequence/test_basic.py
estimated_time: 2186
- name: tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill_cp.py
estimated_time: 1191
- name: tests/e2e/multicard/4-cards/long_sequence/test_prefix_caching_cp.py
estimated_time: 883
- name: tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
estimated_time: 60
is_skipped: true
- name: tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py
estimated_time: 1340
- name: tests/e2e/multicard/4-cards/test_pipeline_parallel.py
estimated_time: 357


@@ -1,240 +0,0 @@
import argparse
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
import tabulate
import yaml
from ci_utils import TestFile, TestRecord, run_tests
_CONFIG_PATH = Path(__file__).parent / "config.yaml"
def load_suites(config_path: Path = _CONFIG_PATH) -> dict[str, list[TestFile]]:
"""Load all test suites from config.yaml."""
data = yaml.safe_load(config_path.read_text())
return {
suite_name: [
TestFile(
name=entry["name"],
estimated_time=entry.get("estimated_time", 60),
is_skipped=entry.get("is_skipped", False),
)
for entry in entries
]
for suite_name, entries in data.items()
}
def partition(files: list[TestFile], rank: int, size: int) -> list[TestFile]:
"""
Split non-skipped files into `size` groups of approximately equal estimated
time using a greedy algorithm, and return the group at index `rank`.
Files within the returned group are sorted ascending by estimated_time.
"""
active = [f for f in files if not f.is_skipped]
if not active or size <= 0 or size > len(active):
return []
# Sort descending by weight; use original index as tiebreaker to be stable
indexed = sorted(enumerate(active), key=lambda x: (-x[1].estimated_time, x[0]))
buckets: list[list[int]] = [[] for _ in range(size)]
sums = [0.0] * size
for idx, test in indexed:
lightest = sums.index(min(sums))
buckets[lightest].append(idx)
sums[lightest] += test.estimated_time
return sorted([active[i] for i in buckets[rank]], key=lambda f: f.estimated_time)
def _find_project_root() -> Path:
root = Path.cwd()
if (root / "tests").exists():
return root
# Fall back: assume script lives at .github/workflows/scripts/
return Path(__file__).parents[3]
def _minimal_covered_dirs(file_paths: set[str], root: Path) -> set[Path]:
"""Return the minimal set of directories that covers all file_paths."""
dirs: set[Path] = set()
for fp in file_paths:
candidate = (root / fp).parent
if not candidate.exists():
continue
try:
rel = candidate.relative_to(root)
except ValueError:
continue
# Drop any existing entries that are subdirectories of rel
dirs = {d for d in dirs if rel not in d.parents}
# Only add rel if no ancestor already covers it
if not any(d == rel or d in rel.parents for d in dirs):
dirs.add(rel)
return dirs
def sanity_check(suites: dict[str, list[TestFile]]) -> None:
"""
Verify that:
1. Every test file in any suite exists on disk.
2. No test_*.py files exist on disk (in covered dirs) that are absent from all suites.
Raises SystemExit with a descriptive message on failure.
"""
suite_files = {f.name.split("::")[0] for tests in suites.values() for f in tests}
root = _find_project_root()
covered = _minimal_covered_dirs(suite_files, root)
disk_files = {str(p.relative_to(root)) for d in covered for p in (root / d).rglob("test_*.py")}
missing_from_suite = sorted(disk_files - suite_files)
if missing_from_suite:
entries = "\n".join(f' TestFile("{f}"),' for f in missing_from_suite)
raise SystemExit(f"Test files on disk are not in any suite (add them or mark is_skipped=True):\n{entries}")
missing_from_disk = sorted(suite_files - disk_files)
if missing_from_disk:
entries = "\n".join(f' TestFile("{f}"),' for f in missing_from_disk)
raise SystemExit(f"Test files listed in suite do not exist on disk:\n{entries}")
def _print_plan(
suite: str,
files: list[TestFile],
skipped: list[TestFile],
partition_info: str,
) -> None:
print(tabulate.tabulate([[suite, partition_info]], headers=["Suite", "Partition"], tablefmt="psql"))
total_est = sum(f.estimated_time for f in files)
print(f"✅ Enabled {len(files)} test(s) (est. total {total_est:.1f}s):")
for f in files:
print(f" - {f.name} (est={f.estimated_time}s)")
if skipped:
print(f"\n❌ Skipped {len(skipped)} test(s) (consider recovering):")
for f in skipped:
print(f" - {f.name}")
print(flush=True)
def _print_results(
suite: str,
records: list[TestRecord],
skipped: list[TestFile],
partition_info: str,
) -> None:
print(tabulate.tabulate([[suite, partition_info]], headers=["Suite", "Partition"], tablefmt="psql"))
total_elapsed = sum(r.elapsed for r in records)
passed_count = sum(1 for r in records if r.passed)
print(f"Results: {passed_count}/{len(records)} passed (actual total {total_elapsed:.1f}s):")
for r in records:
status = "✅ PASSED" if r.passed else "❌ FAILED"
print(f" {status} {r.name} (actual={r.elapsed:.0f}s est={r.estimated:.0f}s)")
if skipped:
print(f"\n❌ Skipped {len(skipped)} test(s) (consider recovering):")
for f in skipped:
print(f" - {f.name}")
print(flush=True)
def _save_timing_json(
records: list[TestRecord],
suite: str,
partition_id: int | None,
partition_size: int | None,
output_path: Path,
) -> None:
passed_suites = [r.to_dict() for r in records if r.passed]
payload = {
"suite": suite,
"partition_id": partition_id,
"partition_size": partition_size,
"commit_sha": os.environ.get("GITHUB_SHA", ""),
"github_run_id": os.environ.get("GITHUB_RUN_ID", ""),
"timestamp": datetime.now(timezone.utc).isoformat(),
"tests": passed_suites,
}
output_path.write_text(json.dumps(payload, indent=2))
print(
f"Timing data written to {output_path} ({len(passed_suites)}/{len(records)} passed)",
flush=True,
)
def main() -> None:
suites = load_suites()
parser = argparse.ArgumentParser(description="Run a named e2e test suite")
parser.add_argument(
"--suite",
required=True,
choices=list(suites.keys()),
help="Name of the test suite to run",
)
parser.add_argument(
"--auto-partition-id",
type=int,
default=None,
metavar="ID",
help="Zero-based partition index (requires --auto-partition-size)",
)
parser.add_argument(
"--auto-partition-size",
type=int,
default=None,
metavar="N",
help="Total number of partitions",
)
parser.add_argument(
"--auto-upgrade-estimated-times",
action="store_true",
help="Automatically update estimated times in config.yaml based on actual timings (default: False) \
If enabled, the script always exit with 0, even if some tests fail, since the primary purpose is to gather \
timing data to improve estimates.",
)
parser.add_argument(
"--continue-on-error",
action="store_true",
help="Continue running after a test failure (default: True)",
)
parser.add_argument(
"--timing-report-json",
type=Path,
default=Path("test_timing_data.json"),
help="Path to write the JSON timing data for CI aggregation",
)
args = parser.parse_args()
sanity_check(suites)
all_files = suites[args.suite]
skipped = [f for f in all_files if f.is_skipped]
if args.auto_partition_size is not None and args.auto_partition_id is not None:
files = partition(all_files, args.auto_partition_id, args.auto_partition_size)
partition_info = f"{args.auto_partition_id + 1}/{args.auto_partition_size}"
else:
files = [f for f in all_files if not f.is_skipped]
partition_info = "full"
_print_plan(args.suite, files, skipped, partition_info)
exit_code, records = run_tests(
files,
continue_on_error=args.continue_on_error,
)
_save_timing_json(records, args.suite, args.auto_partition_id, args.auto_partition_size, args.timing_report_json)
_print_results(args.suite, records, skipped, partition_info)
if args.auto_upgrade_estimated_times:
sys.exit(0)
sys.exit(exit_code)
if __name__ == "__main__":
main()
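A standalone sketch of the greedy split performed by partition() above: heaviest file first, always placed into the currently lightest bucket. The numbers are borrowed from the e2e-singlecard estimates; run_suite.py additionally filters out skipped files and returns only the requested rank, sorted ascending.

def greedy_partition(times: list[float], size: int) -> list[list[float]]:
    buckets: list[list[float]] = [[] for _ in range(size)]
    sums = [0.0] * size
    for t in sorted(times, reverse=True):
        lightest = sums.index(min(sums))  # always fill the lightest bucket
        buckets[lightest].append(t)
        sums[lightest] += t
    return buckets

print(greedy_partition([839, 506, 320, 255, 112], 2))
# [[839, 112], [506, 320, 255]] -> totals 951s vs 1081s, roughly balanced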


@@ -1,102 +0,0 @@
#!/usr/bin/env python3
"""
Update estimated_time in config.yaml from CI timing data.
Usage:
python3 update_estimated_time.py \
--timing-dir ./timing-artifacts \
--config .github/workflows/scripts/config.yaml
"""
import argparse
import json
from pathlib import Path
import yaml
def collect_timings(timing_dir: Path) -> dict[str, int]:
"""
Recursively scan timing_dir for JSON files produced by run_suite.py.
Returns {test_name: elapsed_seconds} for all passed tests.
Warns if the same test name appears in multiple files.
"""
json_files = list(timing_dir.rglob("*.json"))
print(f"Found {len(json_files)} timing file(s) in {timing_dir}")
timings: dict[str, int] = {}
for path in json_files:
try:
data = json.loads(path.read_text())
except (json.JSONDecodeError, OSError) as e:
print(f" Warning: skipping {path}: {e}")
continue
for test in data.get("tests", []):
if not test.get("passed", False):
continue
name: str = test.get("name", "")
elapsed: float = test.get("elapsed", 0.0)
if not name or elapsed <= 0:
continue
if name in timings:
print(f" Warning: duplicate entry for '{name}', overwriting {timings[name]}s with {int(elapsed)}s")
timings[name] = int(elapsed)
return timings
def update_config(config_path: Path, timings: dict[str, int]) -> int:
"""
Load config.yaml, update estimated_time for each test found in timings,
and write the result back. Returns the number of changed entries.
"""
configs: dict = yaml.safe_load(config_path.read_text())
changed = 0
for suite_tests in configs.values():
for test in suite_tests:
name: str = test.get("name", "")
if name not in timings:
continue
old_time: int = test.get("estimated_time", 0)
new_time: int = timings[name]
if old_time == new_time:
continue
test["estimated_time"] = new_time
print(f" {name}: {old_time}s -> {new_time}s")
changed += 1
config_path.write_text(yaml.dump(configs, default_flow_style=False, allow_unicode=True, sort_keys=False))
return changed
def main() -> None:
parser = argparse.ArgumentParser(description="Update estimated_time in config.yaml from CI timing data")
parser.add_argument(
"--timing-dir",
required=True,
type=Path,
help="Directory containing timing JSON files (searched recursively)",
)
parser.add_argument(
"--config",
default=".github/workflows/scripts/config.yaml",
type=Path,
help="Path to config.yaml (default: .github/workflows/scripts/config.yaml)",
)
args = parser.parse_args()
timings = collect_timings(args.timing_dir)
if not timings:
print("No timing data collected. Exiting without changes.")
return
print(f"\nCollected timing data for {len(timings)} test(s).")
print(f"Updating {args.config}...")
changed = update_config(args.config, timings)
print(f"\nDone. {changed} estimated_time value(s) changed.")
if __name__ == "__main__":
main()


@@ -1,33 +0,0 @@
{
"variables": {
"wheel_env": "WHEEL_FILE",
"pyproject_toml_env": "PROJECT_TOML",
"output_dir_env": "OUTPUT_DIR"
},
"jobs": [
{
"variant_label": "310p",
"properties": [
"ascend :: npu_type :: 310p",
"ascend :: cann_version :: 8.5.1"
],
"skip_plugin_validation": true
},
{
"variant_label": "a2",
"properties": [
"ascend :: npu_type :: a2",
"ascend :: cann_version :: 8.5.1"
],
"skip_plugin_validation": true
},
{
"variant_label": "a3",
"properties": [
"ascend :: npu_type :: a3",
"ascend :: cann_version :: 8.5.1"
],
"skip_plugin_validation": true
}
]
}
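make_variant.py is not included in this diff; the snippet below only sketches how the -l label passed by the release workflows above could select one job entry from this config file. The function and control flow are assumptions for illustration, not the script's actual implementation.

import json

def select_job(config_path: str, label: str) -> dict:
    with open(config_path) as f:
        config = json.load(f)
    for job in config["jobs"]:
        if job["variant_label"] == label:
            return job
    raise SystemExit(f"no variant job labelled {label!r} in {config_path}")

job = select_job(".github/workflows/scripts/wheel/config.json", "a3")
print(job["properties"])  # ['ascend :: npu_type :: a3', 'ascend :: cann_version :: 8.5.1']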

Some files were not shown because too many files have changed in this diff.