### What this PR does / why we need it?
The name of the smoke test file for DeepSeek EPLB has been changed, but
the name in the script hasn't been updated. Fix this bug.
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Upgrade vllm commit hash to `4429d934de3c5cc327b0d7aec8e473aeba38db90`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
[E2E] Collect test run time.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Delete deepseek3.2-exp nightly test firstly for replacing
deepseek3.2-exp with deepseek3.2 after nightly tests pass.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Delete accuracy tests for models that are no longer retained:
- Meta-Llama-3.1-8B-Instruct
- llava-1.5-7b-hf
- InternVL2-8B.yaml
- InternVL2_5-8B.yaml
- InternVL3-8B.yaml
Add accuracy tests for the new models:
- Llama-3.2-3B-Instruct
- llava-onevision-qwen2-0.5b-ov-hf
- Qwen3-VL-30B-A3B-Instruct
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
now vllm-ascend uses AsyncGPUModelRunnerOutput
,AsyncNPUModelRunnerOutput before is outdated, so we should fix it
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
This PR updates the CI configuration and adjusts a set of end-to-end
(e2e) tests under tests/e2e/multicard, in order to refactor the test
suite and ensure compatibility with current codebase and CI workflows.
1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B
and rename the test case
2. tests/e2e/multicard/test_quantization.py: rename the test case
3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and
rename test cases
4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change
the W8A8 pruning model to the W8A8 model and remove the eager parameter
5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and
remove the eager parameter
6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case
and change Qwen3-30B to Qwen3-0.6B
7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases
about torchair
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
From a resource-saving perspective, canceling old jobs when submitting
new commits can reduce github_hosted in queue
Signed-off-by: wangli <wangli858794774@gmail.com>
- merge image build workflow into one
- merge package build workflow into one
- merge community related workflow into one
This change makes the workflow more clear
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR standardizes the fusion naming, changing
`enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e
testing.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Refactor the e2e testcases.
- tests/e2e/multicard/test_weight_loader.py: Remove the unused code.
- tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy
test.
- tests/e2e/singlecard/test_aclgraph.py: Rename the file.
- tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with
tests/e2e/singlecard/test_bge_model.py
- tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete
eager mode and modify model to Qwen3-0.6B
- tests/e2e/singlecard/test_quantization.py: Modify model to
Qwen3-0.6B-W8A8
- tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Adds W4A16 quantization method for the Kimi-K2-Thinking model and
updates relevant modules to support the new quantization method.
- Implements complete W4A16 quantization method including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic and MoE method application.
- Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts
`with_quant` conditional logic to support W4A16 matrix multiplication.
- Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and
processing logic for `weight_packed` field.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
### What this PR does / why we need it?
Set the global env `TRANSFORMERS_OFFLINE: 1`, which will avoid
downloading the file and return the path to the
local cached file if it exists when using modelscope's
`snapshot_download` api
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Delete accuracy testing of some models:
- Qwen2-VL-7B-Instruct
- Qwen2.5-VL-7B-Instruct
- gemma-2-9b-it
- DeepSeek-V2-Lite
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this
pr covered the three model types of embed (cls_token, mean_token,
lasttoken).
After this
[commit](17373dcd93),
vllm has provided support for adapting pooling models on the v1 engine.
This PR includes corresponding adaptations on the vllm-ascend side.
Fixes#1960
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
aclgraph is stable and fast now. Let's drop torchair graph mode now.
TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
As support for the mooncake connector is now available, the llmdatadist
connector is no longer being maintained, so the llmdatadist-related
files need to be retired.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Considering that long queues severely impact the developer experience,
we have decided to make the following changes:
1. Changes will use the self_hosted runner
2. e2e-2card will use the A3 node.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
In reinforcement learning scenarios, the current inference applies a
transpose operation to the weights. For a cleaner architecture, the
weight transpose module was moved to wakeup.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Avoid oom during CI by using `with VllmRunner` instead of `LLM()`, and
enable `test_ngram_correctness`
### How was this patch tested?
CI passed.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:
## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6)
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph
### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' #
Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true,
"full_mode": true}}' # Enabled for prefill and decode
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Using an ARM-based github_hosted node to temporarily resolve `no space
left` issues when installing vllm in UT.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
It's safe to drop ascend scheduler now. The related test and doc has
been removed already
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
As title shows, upgrade vllm commit hash to `ad32e3e`
- vLLM version: v0.12.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Optimize multi-node waiting logic
2. Remove the `tee` pipeline for logs, which will lead to hang issue
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Remove ascend schuduler ut
2. Remove models ut
3. move mla to ops
4. skip the failed ut
- vLLM version: v0.12.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add Qwen3Next support in main
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
Add accuracy nightly test for new models:
PaddlePaddle/ERNIE-4.5-21B-A3B-PT
LLM-Research/Molmo-7B-D-0924
LLM-Research/gemma-2-9b-it
LLM-Research/gemma-3-4b-it
Shanghai_AI_Laboratory/internlm-7b
llava-hf/llava-1.5-7b-hf
- vLLM version: v0.11.2
Signed-off-by: hfadzxy <starmoon_zhang@163.com>