### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This pull request introduces an L2 normalization kernel implemented in
Triton, specifically optimized for Ascend NPUs.
### Does this PR introduce _any_ user-facing change?
No, this PR does not introduce any user-facing changes.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: Ascendyh <hw7osiris@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as
it causes deepseek v3.2 hang error
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
`VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is not used anywhere, let's
remove it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Some E2E testcases are not in our CI workflow, this PR add them back.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
efect e2e ci test:
1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager
parameter and rename test case
2. tests/e2e/singlecard/pooling/test_scoring.py: Rename test cases
3. tests/e2e/singlecard/pooling/test_classification.py: Rename test case
4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter
and chage model to vllm-ascend/Qwen2.5-0.6B-W8A8 and Rename test case
5. tests/e2e/multicard/test_shared_expert_dp.py: Rename test cases
6. tests/e2e/singlecard/test_sampler.py: Rename test cases
7. tests/e2e/singlecard/test_aclgraph_accuracy.py: Rename test cases
8. tests/e2e/multicard/test_offline_inference_distributed.py: Rename
test cases and remove the eager parameter
9. tests/e2e/multicard/long_sequence/test_accuracy.py: Rename test cases
and remove the eager parameter
10. tests/e2e/multicard/long_sequence/test_basic.py: Rename test cases
and remove the eager parameter
11.tests/e2e/multicard/test_expert_parallel.py:remove the eager
parameter
12.tests/e2e/multicard/test_full_graph_mode.py:remove the eager
parameter
13.tests/e2e/multicard/test_ilama_lora_tp2.py:remove the eager parameter
14.tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py:remove
the eager parameter
15.tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py:remove the
eager parameter
16.tests/e2e/singlecard/test_aclgraph_accuracy.py:remove the eager
parameter
17.tests/e2e/singlecard/test_camem.py:remove the eager parameter
18.tests/e2e/singlecard/test_ilama_lora.py:remove the eager parameter
19.tests/e2e/singlecard/test_multistream_overlap_shared_expert.py:remove
the eager parameter
20.tests/e2e/singlecard/test_vlm.py:remove the eager parameter
21.tests/e2e/singlecard/test_xli:remove the eager parameter
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Using `spawn` in continuous testing scenarios
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
- Standardize test case naming in `vllm-ascend/tests/e2e/multicard/` to
follow the `<model>_<feature>_<distributed>` convention.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
### What this PR does / why we need it?
Support KV-Sharing feature in CLA (cross layer attention) models, which
sharing kv cache in some layers.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
add pcp accuracy e2e test case
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
support basic long_seq feature st
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
This PR add w4a8 accuracy testcase for e2e test
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: cuikai (C) <c00827167@china.huawei.com>
Co-authored-by: cuikai (C) <c00827167@china.huawei.com>
### What this PR does / why we need it?
[E2E] Collect test run time.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
now vllm-ascend uses AsyncGPUModelRunnerOutput
,AsyncNPUModelRunnerOutput before is outdated, so we should fix it
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
This PR updates the CI configuration and adjusts a set of end-to-end
(e2e) tests under tests/e2e/multicard, in order to refactor the test
suite and ensure compatibility with current codebase and CI workflows.
1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B
and rename the test case
2. tests/e2e/multicard/test_quantization.py: rename the test case
3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and
rename test cases
4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change
the W8A8 pruning model to the W8A8 model and remove the eager parameter
5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and
remove the eager parameter
6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case
and change Qwen3-30B to Qwen3-0.6B
7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases
about torchair
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR standardizes the fusion naming, changing
`enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e
testing.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Refactor the e2e testcases.
- tests/e2e/multicard/test_weight_loader.py: Remove the unused code.
- tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy
test.
- tests/e2e/singlecard/test_aclgraph.py: Rename the file.
- tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with
tests/e2e/singlecard/test_bge_model.py
- tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete
eager mode and modify model to Qwen3-0.6B
- tests/e2e/singlecard/test_quantization.py: Modify model to
Qwen3-0.6B-W8A8
- tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Adds W4A16 quantization method for the Kimi-K2-Thinking model and
updates relevant modules to support the new quantization method.
- Implements complete W4A16 quantization method including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic and MoE method application.
- Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts
`with_quant` conditional logic to support W4A16 matrix multiplication.
- Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and
processing logic for `weight_packed` field.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
### What this PR does / why we need it?
Set the global env `TRANSFORMERS_OFFLINE: 1`, which will avoid
downloading the file and return the path to the
local cached file if it exists when using modelscope's
`snapshot_download` api
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this
pr covered the three model types of embed (cls_token, mean_token,
lasttoken).
After this
[commit](17373dcd93),
vllm has provided support for adapting pooling models on the v1 engine.
This PR includes corresponding adaptations on the vllm-ascend side.
Fixes#1960
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
aclgraph is stable and fast now. Let's drop torchair graph mode now.
TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Considering that long queues severely impact the developer experience,
we have decided to make the following changes:
1. Changes will use the self_hosted runner
2. e2e-2card will use the A3 node.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
In reinforcement learning scenarios, the current inference applies a
transpose operation to the weights. For a cleaner architecture, the
weight transpose module was moved to wakeup.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Avoid oom during CI by using `with VllmRunner` instead of `LLM()`, and
enable `test_ngram_correctness`
### How was this patch tested?
CI passed.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:
## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6)
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph
### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' #
Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true,
"full_mode": true}}' # Enabled for prefill and decode
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
As title shows, upgrade vllm commit hash to `ad32e3e`
- vLLM version: v0.12.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add Qwen3Next support in main
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.
Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.
- vLLM version: v0.11.2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
qwen3-next suppot triton chunk_gated_delta_rule ops
### co-owners
@OsirisDuan
- vLLM version: v0.11.2
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.
1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weight.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
4. Support CompressedTensorsW8A8Dynamic weight.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
5. Modify the override_quantization_method in AscendQuantConfig.
Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com
- vLLM version: v0.11.2
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
### What this PR does / why we need it?
Upgrade cann to 8.3rc2
### Does this PR introduce _any_ user-facing change?
Yes, docker image will use 8.3.RC2
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
1. Run 4-card test only when single and 2-card test passed
2. rename file to make it more clear
3. remove useless pd workflow, it has been managed by nightly test
already.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>