### What this PR does / why we need it?
Add a switch that overlaps the exponential distribution operator with
model execution. It is OFF by default because the feature may not
perform well on MoE models, e.g. Qwen3-30B.
Enabling async exponential overlapping provides a performance
improvement.
Overlapping the exponential operator with model execution can also
hide the performance drop introduced by the AI-CPU version of the
exponential operator.
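For context, a minimal pure-Python sketch of why a sampler needs exponential noise at all, assuming the operator plays its usual role in random sampling: dividing each probability by an independent Exp(1) draw and taking the argmax selects index `i` with probability proportional to `probs[i]`. The function name and the tiny floor on `q` are illustrative assumptions, not code from this PR.

```python
import random


def sample_with_exponential(probs, rng):
    """Pick an index i with probability probs[i] / sum(probs)."""
    best_index, best_score = -1, -1.0
    for i, p in enumerate(probs):
        # q ~ Exp(1). The tiny floor guards against the extremely rare q == 0.
        q = max(rng.expovariate(1.0), 1e-300)
        score = p / q
        if score > best_score:
            best_index, best_score = i, score
    return best_index


rng = random.Random(0)
draws = [sample_with_exponential([0.2, 0.8], rng) for _ in range(10_000)]
print(sum(draws) / len(draws))  # fraction of index 1, close to 0.8
```

Because the noise does not depend on the logits, it can in principle be generated ahead of time, which is what makes overlapping it with model execution possible.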
**UPDATE** (12/12):
The overlap now uses the same stream that was introduced in #4908.
`do_async_exponential` has moved from `model_runner_v1.py` to
`sampler.py`.
Async exponential is now enabled through `additional_config`:
add `"enable_async_exponential": 1` to `additional_config`.
**ONLY** the default exponential and the AI-CPU exponential are now
supported; the old `"enable_async_exponential": 2` option has been
dropped for consistency.
### Does this PR introduce _any_ user-facing change?
**YES**, it adds a new `additional_config` option:
`"enable_async_exponential": 1`.
When `enable_async_exponential` is set to 1, the async exponential is
enabled and overlaps with the model runner.
When `enable_async_exponential` is set to 0 (the default), the async
exponential is disabled, but the exponential operator still runs on a
separate stream, the one introduced in #4908.
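A small sketch of how the switch could be assembled before being handed to the engine. The helper `build_additional_config` is hypothetical and exists only for illustration; the key name and the allowed values 0 and 1 come from the description above.

```python
def build_additional_config(enable_async_exponential: int = 0) -> dict:
    """Return an `additional_config` dict for the switch described above."""
    if enable_async_exponential not in (0, 1):
        # Only 0 and 1 are described as supported; the old value 2 was dropped.
        raise ValueError("enable_async_exponential must be 0 or 1")
    return {"enable_async_exponential": enable_async_exponential}


config = build_additional_config(1)
print(config)

# Assumed usage, mirroring how `additional_config` is passed elsewhere in this log:
# llm = LLM(model=model, additional_config=config)
```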
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
Signed-off-by: YuhanBai yuhan.bai0830@gmail.com
### What this PR does / why we need it?
Add a PCP accuracy e2e test case.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
unblock CI on suffix spec decoding
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
In single-node tests, access to ModelScope has no retry mechanism,
which sometimes results in an HTTP 400 error. I recommend using a
local offline cache instead.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add the `fused_sigmoid_gating_delta_rule_update` op for qwen3_next, which
fuses `fused_gdn_gating` and `fused_recurrent_gated_delta_rule`.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
In this PR, DispatchGmmCombineDecode gains an optional input,
`x_active_mask`. When it is provided, only tokens whose mask value is
True are dispatched and handled.
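The masking semantics can be illustrated with a plain-Python sketch. `select_active_tokens` is a hypothetical helper, not the operator's real interface; it only shows which tokens would be kept for dispatch under the behaviour described above.

```python
from typing import List, Optional, Sequence


def select_active_tokens(
    tokens: Sequence, x_active_mask: Optional[Sequence[bool]] = None
) -> List:
    """Return the tokens that would be dispatched."""
    if x_active_mask is None:
        # The mask is optional: without it, every token is dispatched.
        return list(tokens)
    if len(x_active_mask) != len(tokens):
        raise ValueError("mask and tokens must have the same length")
    return [tok for tok, active in zip(tokens, x_active_mask) if active]


print(select_active_tokens(["t0", "t1", "t2"], [True, False, True]))
```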
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
### What this PR does / why we need it?
Support a basic ST for the long_seq feature.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
Add `top_p` and `top_k` to the EAGLE e2e test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
### What this PR does / why we need it?
The vLLM community has integrated its own MooncakeConnector. The
original scripts now resolve to that connector instead of the one from
vLLM-Ascend, so all scripts that use the vLLM-Ascend MooncakeConnector
must be updated to refer to it by a different name.
### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.
### How was this patch tested?
By CI.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
[Nightly] Avoid `max_model_len` being smaller than the decoder prompt to
prevent single-node-accuracy-tests from failing
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
### What this PR does / why we need it?
1. In addition to
[#4168](https://github.com/vllm-project/vllm-ascend/pull/4168) and
[#5011](https://github.com/vllm-project/vllm-ascend/pull/5011), this PR
adds two more patterns for AddRmsnormQuant with SP enabled. The key
difference is an additional `maybe_all_gather_and_maybe_unpad` inserted
between `addrmsnorm` and `quantize`.
2. This PR also introduces another API, `torch.ops.vllm.quantize`, which
takes `input_scale` and `input_scale_reciprocal` at the same time.
This is because `npu_add_rms_norm_quant` and `npu_quantize` require
different `div_mode` settings. To avoid an additional reciprocal
calculation at runtime, both values are passed to the quantize API.
3. Removes the redundant `AscendQuantRmsnorm`.
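The reason for passing both the scale and its reciprocal can be sketched in plain Python. `quantize_int8` and its `div_mode` flag are illustrative stand-ins, not the real NPU ops, and which op uses which mode is not stated here; the point is only that a divide-mode op fed the scale and a multiply-mode op fed the precomputed reciprocal produce the same result without a reciprocal at runtime.

```python
def quantize_int8(values, scale, div_mode):
    """Quantize to int8 by dividing by `scale` (div_mode) or multiplying by it."""
    out = []
    for v in values:
        scaled = v / scale if div_mode else v * scale
        out.append(max(-128, min(127, round(scaled))))
    return out


input_scale = 0.5
input_scale_reciprocal = 1.0 / input_scale  # computed once, ahead of time

x = [0.3, -1.2, 100.0]
by_division = quantize_int8(x, input_scale, div_mode=True)
by_multiplication = quantize_int8(x, input_scale_reciprocal, div_mode=False)
print(by_division, by_multiplication)
```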
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
This PR adds a W4A8 accuracy test case to the e2e tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: cuikai (C) <c00827167@china.huawei.com>
Co-authored-by: cuikai (C) <c00827167@china.huawei.com>
### What this PR does / why we need it?
We will expose the enabling switch for npugraph_ex to better facilitate
subsequent optimization.
### Does this PR introduce _any_ user-facing change?
Previously, the enable_npugraph_ex switch would trigger an error; now we
have removed the error reporting mechanism to better facilitate
subsequent optimization efforts.
Basic functionalities are available in CANN and torch_npu for Q3, while
advanced optimizations will depend on the Q4 release.
### How was this patch tested?
    from vllm import LLM

    # `model` is the name or path of the model under test.
    llm = LLM(
        model=model,
        enforce_eager=False,
        additional_config={
            "enable_npugraph_ex": True,
        },
        compilation_config={
            "cudagraph_mode": "FULL_DECODE_ONLY",
            "cudagraph_capture_sizes": [16],
        },
    )
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### What this PR does / why we need it?
Rename `_910B` to `A2`;
Rename `_910_93` to `A3`;
Rename `_910_95` to `A5`;
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds the `qkv_rmsnorm_rope` operator and introduces a graph fusion
pass for `qknorm_rope` operations. The implementation includes a new
configuration flag, a pattern matching pass using
`torch._inductor.pattern_matcher`, and a custom Triton kernel for the
fused operation.
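As a reference for what such a fusion replaces, here is an unfused pure-Python sketch for a single head: RMSNorm on q and k, followed by rotary position embedding. The NeoX-style half-rotation, the `eps` and `base` defaults, and all function names are assumptions made for illustration; the Triton kernel in this PR may differ in layout and details.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by `weight`."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """Rotate the pairs (x[i], x[i + half]) by a position-dependent angle."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def qk_norm_rope(q, k, q_weight, k_weight, position):
    """Unfused reference: normalize q and k, then apply RoPE to both."""
    return (
        rope_neox(rms_norm(q, q_weight), position),
        rope_neox(rms_norm(k, k_weight), position),
    )


q_out, k_out = qk_norm_rope([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5],
                            [1.0] * 4, [1.0] * 4, position=3)
print(q_out, k_out)
```

A fused kernel would perform the same arithmetic in one pass instead of materializing the normalized tensors between the two steps.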
Co-authored-by: Angazenn
[supperccell@163.com](mailto:supperccell@163.com)
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `additional_config` option.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Upstream vLLM PR
[#30212](https://github.com/vllm-project/vllm/pull/30212) refactored
the attention backend selection interface. This PR adapts
vllm-ascend's `get_attn_backend_cls` to align with the new upstream
standard, ensuring compatibility and reducing maintenance overhead.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Co-author: leo-pony [nengjunma@outlook.com](mailto:nengjunma@outlook.com)
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zxwang <1476209578@qq.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
We refactored `eagle_proposer.py` to follow the structure of `eagle.py`
in vLLM v0.12.0, in order to support the padded drafter batch logic and
the async scheduler.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Currently, spec decoding uses `AscendRejectionSampler`, which extends
`RejectionSampler`. `AscendRejectionSampler` overrides
`RejectionSampler.forward` only to replace the `rejection_sample`
function. As a result, much of the `RejectionSampler` code cannot be
reused, for example:
- https://github.com/vllm-project/vllm/pull/19482
- https://github.com/vllm-project/vllm/pull/26060
- https://github.com/vllm-project/vllm/pull/29223
#### Proposed Change:
- Delete `AscendRejectionSampler` and use `RejectionSampler` directly in
the model runner.
- Patch `RejectionSampler.expand_batch_to_tokens` and
`RejectionSampler.rejection_sample`; a better approach may be to make
them custom ops.
- Modify `NPUModelRunner` following
https://github.com/vllm-project/vllm/pull/26060
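The patching approach proposed above can be sketched with a stand-in class. `RejectionSamplerStub` and `npu_rejection_sample` are hypothetical and only mimic the shape of the idea: the upstream `forward` stays untouched, and only the step it delegates to is swapped, so upstream changes to `forward` are picked up without a subclass.

```python
class RejectionSamplerStub:
    """Hypothetical stand-in for the upstream sampler class."""

    def forward(self, draft_tokens):
        # Upstream logic is kept as-is and delegates to the patchable step.
        return self.rejection_sample(draft_tokens)

    def rejection_sample(self, draft_tokens):
        return f"upstream:{draft_tokens}"


def npu_rejection_sample(self, draft_tokens):
    """Device-specific replacement for the sampling step only."""
    return f"npu:{draft_tokens}"


# Patch the single method instead of subclassing and overriding `forward`.
RejectionSamplerStub.rejection_sample = npu_rejection_sample

sampler = RejectionSamplerStub()
print(sampler.forward("tokens"))  # prints "npu:tokens"
```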
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (test with
https://github.com/vllm-project/vllm-ascend/pull/4893/)
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
This Pull Request removes the @pytest.mark.skip decorators from
test_mtp1_correctness_piecewise_graph and
test_mtp2_correctness_piecewise_graph.
These tests were temporarily skipped because of an issue with the MTP
ACL Graph (as per the original TODO comment). Since the relevant
bug/issue has been resolved, these tests are now re-enabled to ensure
full correctness coverage for MTP functionality.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
AddRMSNorm(with bias) and Quant Fusion Pattern
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Delete accuracy tests for models that are no longer retained:
- Meta-Llama-3.1-8B-Instruct
- llava-1.5-7b-hf
- InternVL2-8B.yaml
- InternVL2_5-8B.yaml
- InternVL3-8B.yaml
Add accuracy tests for the new models:
- Llama-3.2-3B-Instruct
- llava-onevision-qwen2-0.5b-ov-hf
- Qwen3-VL-30B-A3B-Instruct
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
vllm-ascend now uses `AsyncGPUModelRunnerOutput`. The earlier
`AsyncNPUModelRunnerOutput` is outdated, so this PR fixes it.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
Since `llmdatadist` has been sunset, the `gen_ranktable` logic should
also be removed.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR updates the CI configuration and adjusts a set of end-to-end
(e2e) tests under tests/e2e/multicard, in order to refactor the test
suite and ensure compatibility with current codebase and CI workflows.
1. tests/e2e/multicard/test_prefix_caching.py: change model to Qwen3-8B
and rename the test case
2. tests/e2e/multicard/test_quantization.py: rename the test case
3. tests/e2e/multicard/test_qwen3_moe.py: remove duplicate test and
rename test cases
4. tests/e2e/multicard/test_qwen3_next.py: rename test cases and change
the W8A8 pruning model to the W8A8 model and remove the eager parameter
5. tests/e2e/multicard/test_shared_expert_dp.py: rename test case and
remove the eager parameter
6. tests/e2e/multicard/test_single_request_aclgraph.py: rename test case
and change Qwen3-30B to Qwen3-0.6B
7. tests/e2e/multicard/test_torchair_graph_mode.py: delete test cases
about torchair
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR standardizes the fusion naming, changing
`enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e
testing.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Support triton causal_conv1d_fn ops.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: QilaiZhang <245706640@qq.com>
### What this PR does / why we need it?
This PR adds mlapo operation support for bf16 no_quant mode.
### Does this PR introduce _any_ user-facing change?
This PR makes quant related parameters optional.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: chenjunyi <isjunyi.chen@gmail.com>
### What this PR does / why we need it?
Refactor the e2e testcases.
- tests/e2e/multicard/test_weight_loader.py: Remove the unused code.
- tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy
test.
- tests/e2e/singlecard/test_aclgraph.py: Rename the file.
- tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with
tests/e2e/singlecard/test_bge_model.py
- tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete
eager mode and modify model to Qwen3-0.6B
- tests/e2e/singlecard/test_quantization.py: Modify model to
Qwen3-0.6B-W8A8
- tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Remove unused PD-disaggregation scripts from the E2E tests.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Adds W4A16 quantization method for the Kimi-K2-Thinking model and
updates relevant modules to support the new quantization method.
- Implements complete W4A16 quantization method including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic and MoE method application.
- Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts
`with_quant` conditional logic to support W4A16 matrix multiplication.
- Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and
processing logic for `weight_packed` field.
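To make the packing and per-group steps concrete, here is a generic pure-Python sketch of 4-bit weight quantization. It assumes symmetric quantization, two codes per byte with the low nibble first, and hypothetical function names; the actual layout, offsets (`w1_offset`, `w2_offset`), and group size used in this PR are not shown here and may differ.

```python
def quantize_group(weights):
    """Symmetric int4 quantization of one group: (codes in [-8, 7], scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes((lo & 0xF) | ((hi & 0xF) << 4)
                 for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


def quantize_w4(weights, group_size):
    """Quantize group by group; returns packed bytes and one scale per group."""
    packed, scales = bytearray(), []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_group(weights[start:start + group_size])
        packed += pack_int4(codes)
        scales.append(scale)
    return bytes(packed), scales


def dequantize_w4(packed, scales, group_size):
    """Unpack the codes and rescale each one by its group's scale."""
    codes = unpack_int4(packed)
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [0.7, -0.3, 0.1, 0.0, 1.4, -1.4, 0.2, -0.6]
packed, scales = quantize_w4(weights, group_size=4)
print(len(packed), scales, dequantize_w4(packed, scales, group_size=4))
```

Each group keeps its own scale, so an outlier in one group does not reduce the resolution of the others.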
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
### What this PR does / why we need it?
Delete accuracy testing of some models:
- Qwen2-VL-7B-Instruct
- Qwen2.5-VL-7B-Instruct
- gemma-2-9b-it
- DeepSeek-V2-Lite
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
Support pooling models (such as `bge-reranker-v2-m3`) in vllm-ascend.
This PR covers the three types of embed models (cls_token, mean_token,
lasttoken).
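The three pooling types differ only in how per-token hidden states are reduced to one vector. A minimal pure-Python sketch, with a hypothetical `pool` helper and mode names chosen for illustration:

```python
def pool(hidden_states, mode):
    """Reduce a list of per-token vectors to a single embedding vector."""
    if mode == "cls":
        # Use the first token's hidden state.
        return list(hidden_states[0])
    if mode == "last":
        # Use the last token's hidden state.
        return list(hidden_states[-1])
    if mode == "mean":
        # Average every dimension over all tokens.
        n = len(hidden_states)
        return [sum(column) / n for column in zip(*hidden_states)]
    raise ValueError(f"unknown pooling mode: {mode}")


states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for mode in ("cls", "mean", "last"):
    print(mode, pool(states, mode))
```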
After this
[commit](17373dcd93),
vllm has provided support for adapting pooling models on the v1 engine.
This PR includes corresponding adaptations on the vllm-ascend side.
Fixes #1960
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add an e2e test for MTP async scheduling.
### Does this PR introduce _any_ user-facing change?
no
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This patch makes a few small optimizations to the nightly CI:
1. Adjust the frequency at which the service's startup logs are polled
and printed, so that useful information is available sooner.
2. Shorten the timeout for waiting on the server.
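The trade-off between polling interval and timeout can be sketched with a generic wait loop. `wait_until` and its parameters are hypothetical and are not the helper used in the nightly CI; the sketch only shows how a shorter interval surfaces readiness sooner while the timeout bounds the total wait.

```python
import time


def wait_until(predicate, timeout_s, interval_s):
    """Poll `predicate` every `interval_s` seconds until it is true or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example: a fake server that becomes "ready" on the third check.
checks = {"count": 0}


def fake_server_ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(fake_server_ready, timeout_s=1.0, interval_s=0.01))
```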
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
aclgraph is stable and fast now. Let's drop torchair graph mode now.
TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Add an accuracy test for the Qwen3-235B-A22B model, which previously
had none.
Test result:

| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8k | 7cd45e | accuracy | gen | 96.29 |

Test case running time: 30 minutes
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
As support for the mooncake connector is now available, the llmdatadist
connector is no longer being maintained, so the llmdatadist-related
files need to be retired.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
We didn’t account for this earlier because we didn’t have A3 in CI, but
now that we do, this test case needs a few extra tweaks — please take a
look at `profile_run`.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>