### What this PR does / why we need it?
[CI] Upgrade CANN to 8.5.1
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with new added test.
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Add nightly test for Qwen3-235B-A22B with mooncake layerwise connector.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR fixes a `Stale file handle` error that occurs during doctests in
the CI environment. The error appears when loading models from
ModelScope, likely due to issues with network file systems used in CI.
The fix involves setting the `MODELSCOPE_HUB_FILE_LOCK` environment
variable to `false` in the `run_doctests.sh` script. This disables file
locking in the ModelScope hub, which is a common workaround for this
type of file system error.
### Does this PR introduce _any_ user-facing change?
No, this change only affects the CI test execution environment and has
no impact on users.
### How was this patch tested?
This change is validated by the CI pipeline. A successful run of the
doctests indicates that the fix is effective.
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Update fused_moe to main branch.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3 dense mode & moe model e2e test
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
## Summary
- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.
## Details
**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior
**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.
## Files Changed
| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.
We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.
and we also add aime2025 test for dual-node deepseek 3.2 nightly test.
### How was this patch tested?
test at nightly environment.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
New FIA operator requires queryT equal to the last element of
actualSequenceLengthQ.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passed existing test (test_mtp_eagle_correctness.py).
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: Wangbingjie <w30061490@china.huawei.com>
Co-authored-by: Wangbingjie <w30061490@china.huawei.com>
### What this PR does / why we need it?
fix(mtp): resolve MTP core bugs and enhance eager mode test cases
1. Resolved critical issues in eager mode MTP core execution logic;
2. Fixed functional bugs in the _update_states_after_model_execute
function;
3. Updated and released test_mtp_qwen3_next.py to validate eager mode
acceptance rate.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>
### What this PR does / why we need it?
The ds3.2 model adaptation supports the PCP feature.
The solution is as follows: When saving the KV cache, first perform an
allgather operation on the KVs, and then each node saves its own copy.
When the attention or indexer performs calculations, they all gather the
KV cache and then perform the calculations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
02/12 23:05:10 - AISBench - INFO - Running 1-th replica of evaluation
02/12 23:05:10 - AISBench - INFO - Task [vllm-api-general-chat/gsm8k]:
{'accuracy': 96.35416666666667, 'type': 'GEN'}
02/12 23:05:10 - AISBench - INFO - time elapsed: 2.87s
02/12 23:05:12 - AISBench - INFO - Evaluation tasks completed.
02/12 23:05:12 - AISBench - INFO - Summarizing evaluation results...
dataset version metric mode vllm-api-general-chat
gsm8kdataset - accuracy gen 96.35
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` actually
ignores all words containing any of the following characters: `u, e, 4,
m, 3, |`
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
The basic configs are extracted and reused for eplb UT. This is done so
that if the basic configs are changed later, eplb UT does not need to be
modified repeatedly.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: bigsir007 <xujiacheng12@huawei.com>
Co-authored-by: bigsir007 <xujiacheng12@huawei.com>
### What this PR does / why we need it?
Integrating inductor pass and npugraph ex pass, see RFC:
https://github.com/vllm-project/vllm-ascend/issues/6347
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
all tests passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
1. support ND format gmm weight input.
Before this pr, gmm1_weight and gmm2_weight could only be passed as
input to the DispatchGmmCombineDecode operator in NZ data format. After
the modification, they are allowed to be passed in ND data format.
2. support bf16/float16 gmm weight
The current PR modification enables the DispatchGmmCombineDecode
operator to support non-W8A8 scenarios, allowing gmm1_weight and
gmm2_weight to be passed as float16/bfloat16 which is correspond with
input token data type.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: lih827 <383084552@qq.com>
### What this PR does / why we need it?
Fix the issue where, in graph mode, the fused `AddRMSNormQuant` operator
does not take effect when there is no bias.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
### What this PR does / why we need it?
This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).
### Does this PR introduce _any_ user-facing change?
Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.
### How was this patch tested?
CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.
Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
### What this PR does / why we need it?
This PR extends original `rope_triton_forward` and
`split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as
inputs. This fully aligns to vLLM RoPE api interface. Compared with
earlier implementation for RoPE, the benefits are:
1. avoiding pre-computation of `cos` `sin` before model execution, which
helps to remove redundant codes.
2. allowing eagle3 draft model to have different rope parameters with
main model (see #6612 ). This help to recover accept rate && accuracy in
that case.
In addition, this kernel change only introduces very small performance
degradation. Those `index_select` or `chunk` operations are now changed
into simple memory access in triton kernel (For example,
https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).
**Highlights**
- **RoPE Cache Unification**: Replaced separate _sin and _cos global
tensors with a unified cos_sin_cache and explicit positions tensor for
Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated Triton kernels
(split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the
cos_sin_cache and positions for more efficient and integrated RoPE
calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a
new custom operation, allowing its use in fused compilation passes and
providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified the rope_forward_oot
function to accept the new cos_sin_cache and positions arguments,
enabling a more flexible and integrated RoPE application within the
system.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
Additional test on Qwen3-235b accuracy:
| Aime2024 | GSM8K | Livecodebench |
| -------- | -------- | -------- |
| 83.33 | 96.26 | 70.23 |
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Fixes the issue where nightly multi-node tests hang during the "wait for
pod ready" stage due to strict shell mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
13397841ab
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This PR fixes a `TypeError` in
`tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py`
that was causing nightly test failures. The `torch.rand()` function was
being called with the `device` string as a positional argument, which is
incorrect. This has been corrected to use the `device` keyword argument.
Fixes #
### Does this PR introduce _any_ user-facing change?
No, this change only affects a test file.
### How was this patch tested?
CI is expected to pass with this fix.
- vLLM version: v0.15.0
- vLLM main:
13397841ab
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
This PR introduces support for W8A8 dynamic quantization for
Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved
by:
- Implementing a new quantization scheme
`AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P
that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly
route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8
quantization.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8
quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components,
including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`,
and the new `w8a8_dynamic` quantization method.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
This PR adds a case of qwen3-30b w8a8 with mooncake mempool, we need to
test it regual
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.14.1
- vLLM main:
d68209402d
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.
1.
ac32e66cf9
pass file is moved.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
This PR adds an end-to-end test case to verify the correctness of base
model inference when LoRA is enabled. This is to ensure that after a
LoRA base model request issue was fixed, the functionality remains
correct and does not regress. The new test case calls `do_sample` with
`lora_id=0` to target the base model and asserts the output against
expected SQL queries.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with the new test case. The test can be run with:
```bash
pytest -sv tests/e2e/singlecard/test_llama32_lora.py
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
This PR adds DCP support to the SFA backend.
Please note that due to operator constraints, the current implementation
has to all-gather the entire KV cache and modify the block table to
satisfy the operator input requirements. This results in significantly
increased communication overhead and peak memory usage. Therefore, this
is only a temporary workaround and will be refactored once the operator
provides proper support.
Additionally, because of the above limitations,
`cp_kv_cache_interleave_size` is currently required to be equal to
`block_size`. This restriction will also be removed after the refactor.
#### Test
accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8
| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.35 |
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
This PR fixes a bug in the `test_triton_fusion_ops` test case. The test
compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with
a split implementation. Both paths use a recurrent state tensor.
The bug was that the state tensor was being modified in-place by the
fused kernel call, and this modified tensor was then reused for the
split implementation path. This led to an incorrect comparison and test
failure.
This fix ensures that each path starts with an identical, clean initial
state by creating separate tensors. It also changes the state
initialization from `torch.randn` to `torch.ones` to make the test
deterministic.
### Does this PR introduce _any_ user-facing change?
No, this change only affects a test case and has no user-facing impact.
### How was this patch tested?
The fix is applied directly to the test case. The CI passing for
`test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is
working as expected.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
### What this PR does / why we need it?
This pull request significantly enhances the test suite by adding new
end-to-end test cases for Qwen3 models on the 310P hardware platform.
The primary goal is to ensure the stability and correctness of these
models under diverse operational conditions, including various
parallelism strategies, data types, and quantization methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.
The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.
### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
by running ci
- vLLM version: release/v0.12.0
---------
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
### What this PR does / why we need it?
This patch bump the mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
test is locally
>>> from mooncake.engine import TransferEngine
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR updates the CI runner from `linux-aarch64-a2-*` to
`linux-aarch64-a2b3-*` in various test configuration files. This change
is necessary to adapt to updates in the CI infrastructure.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The changes are configuration updates for CI tests. The correctness will
be verified by the CI pipeline.
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
When running the Qwen3-0.6B model using the npugraph_ex backend, the
last few characters of the generated results changed. We have modified
the relevant test cases to ensure the CI runs smoothly.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to
satisfy FIA/TND constraint #6459 (commit
5b0a6bcfe9)" and fixes a check in
`model_runner_v1`.
**A key change is that we remove the strict assertion in the latest
commit, as it turns out MLA + PIECEWISE will slice during computing,
leaving our assertion uncalled for and will only cause false alarm.**
This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, which prevents kernel
mismatches or failures and ensure correct shapes for FIA/TND execution
in full graph modes.
We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Test cases added.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test
by adding both short and long prompt test cases:
- Short test: Validates basic functionality with minimal input ("Hello
")
- Long test: Validates the model can handle prompts near its maximum
context length (~163K tokens, approaching the max_position_embeddings
limit of 163,840)
Additionally, explicitly sets max_model_len=163840 to ensure the test
properly exercises the model's full context window capability.
### Does this PR introduce _any_ user-facing change?
No. This change only affects internal E2E testing infrastructure.
### How was this patch tested?
The modified test case will be executed as part of the E2E test suite
and has been validated
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499).
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
Refactor MLP weight prefetch to consistency with MoE Model's prefetching
in terms of code and usage.
Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following:
--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is
added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has
not been tested on bf16 scenario. Results show that this is not
currently supported.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Add E2E for Prefix Caching cp & Chunked Prefill cp
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: F.Liu <liufeng248@huawei.com>
Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
### What this PR does / why we need it?
Fix a bug in the repo and add a test case for MTP + Full Decode Only +
Qwen3Next.
The _build_dummy_attn_metadata function in NPUModelRunner seems losed a
query_star_loc.copy_to_gpu operation, which will lead to difference
between query_start_loc and query_start_loc_cpu, and they are required
to be same in MTP + Full Decode Only + Qwen3Next case.
Before this pr:
`self.query_start_loc = [0, 0, 0, 0, ... , 0]
self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
Introduced 310P W8A8 Quantization Support: New modules and methods have
been added to enable W8A8 static quantization specifically for the
Ascend 310P platform.
Platform-Specific Quantization Configuration Loading: The system now
dynamically loads the appropriate quantization configurations
(AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether
the current hardware is an Ascend 310P device.
Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization
method for 310P is provided, handling the specifics of weight and
activation quantization, including input parameter broadcasting and
weight data manipulation.
Extended AscendModelSlimConfig for 310P: A specialized configuration
class for 310P integrates the new W8A8 linear method for both standard
linear layers and vocabulary parallel embeddings, ensuring proper
quantization application.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
### What this PR does / why we need it?
As #2947 describe, we need to transpose kv cache layout after GQA kv
transfer when prefill and decode tensor parallel size are heterogeneous,
in the previous implementation, we use `npu_paged_cache_load ` +
`tranpose` + `_npu_reshape_and_cache` to do this work.
But obviously, it is not an efficient plan, the ops above need to be
called for each layer, which introduces 3 * layer_num kernel launch, and
6 * layer_num data movement between L1 Cache and HBM for one request on
decode node. Usually, decode node uses graph mode, so these op kernels
will be called between decode forward launched by an async thread in
mooncacke connector, this kernels maybe last for several decode forward
and TTFT will increase by 3~4 decode forward time.
In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` to do this with only once kernel launch
and move data between L1 Cache and HBM only once.
After using this fused op, the time cost in transpose kv cacke layout
can be decreased to 0.24ms from 7ms in UT on 910C, and in PD
disaggregation scenario, TTFT can decrease about 90 ~ 110 ms in
qwen3-235B.
| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
| 1 | 643 ms | 578 ms |
| 128 | 1480 ms | 1368 ms |
### Does this PR introduce _any_ user-facing change?
Use fused op by default, incase the op has bug in any scenario, provide
fallback choice using env to disable it.
**DISABLE fused op by add following env**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
### What this PR does / why we need it?
Add New Output for Expert Token Count
An additional output tensor expert_token_nums is added to both operators
to meet the requirement of tracking token distribution among experts:
Tensor Name: expert_token_nums
Dimension: 1D tensor
Shape: (local_expert_num,)
Data Type: int32
Semantics: Represents the number of tokens actually received by each
expert on the current card.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>