### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736
**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`
**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.
**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Fixed the issue where the DCP overlaps the MTP scenario in the ds3.2
scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
In graph + RL scenario, we only capture the graph once, and the weight
address is expected to be the same across iterations. However, when
calling .contiguous() on weight tensors, a new memory address may be
allocated, causing the graph to capture incorrect weight addresses.
This PR modifies the weight update logic in AscendMLAImpl and
AscendSFAImpl to use copy_() instead of reassignment, ensuring the
weight addresses remain consistent across iterations.
detailed in #7473
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: Debonex <719893090@qq.com>
…DSV3.1 C8.
### What this PR does / why we need it?
DeepSeek v3.1 C8 had a hanging issue when overlaying MTP and full graph
modes; this pull request resolves that issue.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
This PR simplifies and hardens MLA prefill context merging in
`vllm_ascend/attention/mla_v1.py` after FIA migration by directly
building `out_list/lse_list` (without temporary chunk buffers or
`cat/stack/split`) and using `reshape` for safe flattening of
non-contiguous tensors.
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactor/stability improvement only; no
API/interface behavior changes.
### How was this patch tested?
- Verified tensor shape/data flow for `npu_attention_update` inputs
(`out_list/lse_list`) after refactor.
- Confirmed no lint errors in the modified file.
- CI UT coverage on attention/MLA paths is used for validation.
vLLM version: `v0.17.0`
vLLM main: `vllm-project/vllm@4034c3d`
---------
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode can be used and
MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path
<path/quant_weight>
--anti_dataset../common/deepseek_anti_prompt_50_v3_1.json
--calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot
--trust_remote_code True --fa_quant --dynamic --anti_method m6
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
**Refactor: Replace npu_ring_mla with FIA in MLA prefill**
This PR refactors the MLA (Multi-Layer Attention) prefill implementation
by replacing `npu_ring_mla` with `npu_fused_infer_attention_score` (FIA)
operator, unifying the attention backend with the standard attention
implementation.
**Key changes:**
1. **Core prefill refactoring (`mla_v1.py`)**
- Replace `npu_ring_mla` with `npu_fused_infer_attention_score` in
`_forward_prefill` and `_compute_prefill_context`
- Use TND layout with `softmax_lse_flag=True` for prefill attention
- Use `npu_attention_update` to merge multiple chunk outputs with LSE
(Log-Sum-Exp)
- Change `attn_mask` from `get_final_mla_mask()` to
`get_splitfuse_attn_mask()` for FIA compatibility
2. **Data type handling**
- Add automatic float16 → bfloat16 conversion (FIA with TND layout only
supports bfloat16)
- Convert output back to original dtype after FIA computation
3. **Metadata optimization**
- Pre-calculate `actual_seq_lengths_q` in `AscendMLAPrefillMetadata`
- Pre-calculate `chunk_actual_seq_lengths_kv_list` in
`ChunkedContextMetadata`
- Move `torch.cumsum` operations from forward pass to metadata building
phase
4. **CP compatibility (`mla_cp.py`)**
- Add `_ring_mla_mask_builder` to get `npu_ring_mla`-compatible masks
for Context Parallel scenarios
- Add `chunk_actual_seq_lengths_kv_list` field to
`CPChunkedContextMetadata`
**Why we need it:**
- **Backend unification**: Aligns MLA prefill with standard attention
implementation (`attention_v1.py`)
- **Better chunked context support**: FIA + `npu_attention_update`
provides native LSE-based output merging
- **Future compatibility**: Prepares for eventual `npu_ring_mla` removal
across the codebase
### Does this PR introduce _any_ user-facing change?
**No.** This is a pure refactoring with no functional changes - same
behavior, unified backend.
---
- Related issue: #5463 (item 7)
- vLLM version: v0.14.1
Signed-off-by: lico67373 <918688502@qq.com>
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2, please see RFC
#5208. The PR contains these modifications:
- adapt to newest commit of vllm main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for main model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Added a check in the may_reinitialize_input_batch method to verify
whether the backend implements the get_supported_block_size method
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
Only a few lines of code within the methods were modified, and the
format check test has been passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Debuuuuger <huangzr@cmbchina.com>
Signed-off-by: debuger <102402761+huangazazaz@users.noreply.github.com>
Signed-off-by: Debuuuuger <12110718@mail.sustech.edu.cn>
Co-authored-by: Debuuuuger <huangzr@cmbchina.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.
This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.
### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.
### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
co-authored-by: shen-shanshan <467638484@qq.com>
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
There is an issue with the current default logic for MLAPO (MLA Policy
Optimization). By design, MLAPO should only be enabled by default on
Decode (D) nodes. However, in hybrid (collocated prefill and decode)
scenarios, the strategy is erroneously activated during the Prefill
stage.
This PR corrects the default behavior to ensure that MLAPO is
exclusively enabled for the Decoding phase. This prevents unexpected
policy interference during Prefill and ensures optimal performance in
hybrid deployment environments.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Fix the MTP test failure caused by accessing non-existent attribute
`forward_context.draft_attn_metadatas`.
**Root cause:**
In `AscendAttentionBackendImpl.update_graph_params`, the code
incorrectly accessed `forward_context.draft_attn_metadatas`, but
`ForwardContext` class doesn't have this attribute. The original code
passed this value via function parameter.
**Fix:**
Add `draft_attn_metadatas` parameter to the entire call chain:
- `update_full_graph_params` function in `acl_graph.py`
- All `update_graph_params` methods in attention backends
- Pass the parameter correctly in `eagle_proposer.py`
Also applied Gemini's suggestion to make `vllm_config=None` in
`AscendAttentionCPImpl.update_graph_params` for API consistency.
Related to item 9 in #5463
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This fixes the CI test failure:
`test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`
Signed-off-by: lico67373 <918688502@qq.com>
This reverts commit 8966a99710.
It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`
- vLLM version: v0.14.0
- vLLM main:
d68209402d
### What this PR does / why we need it?
**Refactor: Unify full-graph parameter update logic**
This PR consolidates the scattered full-graph parameter update logic
into a unified approach, improving code architecture and eliminating
duplication.
**Key improvements:**
1. **Unified interface**
- Create `update_full_graph_params` as the single entry point for all
full-graph updates
- Replace multiple scattered update calls with one unified function
- Remove ~50 lines of duplicated if-else logic across
`model_runner_v1.py` and `eagle_proposer.py`
2. **Better architecture**
- Move update logic to respective Backend classes
(`AscendAttentionBackend`, `AscendMLABackend`)
- Each Backend manages its own parameter update logic internally
- Simplify caller code to just dispatch to the appropriate Backend
3. **Cleaner parameter handling**
- Remove unnecessary `pcp_size` and `dcp_size` parameter passing
- Get parallel configuration directly from distributed groups
- Consistent with how other parts of the codebase obtain these values
**Why we need it:**
- **Maintainability**: Future changes only need to be made in one place
per Backend
- **Code quality**: Follows DRY principle and Single Responsibility
Principle
- **Readability**: Cleaner, more intuitive code structure
### Does this PR introduce _any_ user-facing change?
**No.** This is a pure refactoring with no functional changes - same
behavior, cleaner code.
### How was this patch tested?
- All existing unit tests pass with updated mocks
- No new tests needed (pure refactoring, no behavior changes)
- CI validates correctness
---
- vLLM version: v0.13.0
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
PCP/DCP splits the kv-cache onto different cards. After introducing the
parameter cp-kv-cache-interleave-size, the first size tokens will be
cached at Card 0, and so on.
However, if there are too few tokens, some cards will not store the
key-value pairs, resulting in values of 0, corrupted values, and
precision issues. Currently, additional operations are introduced to
avoid this precision problem.
After we integrate FIA operator in mla_cp._forward_decode and CANN
updates to 8.5.0, we now can remove these additional operations.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
passed all CI by CANN 8.5.0
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: dsxsteven <dsxsteven@sina.com>
Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
1) Default enable MLAPO for deepseek MLA Attention W8A8 models on PD
disagregation D Instance, for example: DeepSeekV3-W8A8,
DeepSeek-R1-W8A8.
2) Default enable MLAPO for DeepSeek SFA Attention W8A8 models,
currently is DeepSeek-V3.2-W8A8.
### Does this PR introduce _any_ user-facing change?
Don't need use manully to VLLM_ASCEND_ENABLE_MLAPO=1, to enable MLAPO
feature for deepseek w8a8 model
The effect of enabling MLAPO SFA model deployed on a single A3 Node:
Test
with:tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py
dataset: gsm8k-lite,without set MTP, FULL GRAPH, has 19% promote:
未默认开启 MLAPO 时:
├─────────────────────────┤
│ TTFT │ 14055.8836 ms │
├─────────────────────────┤
│ ITL │ 66.8171 ms. │
├─────────────────────────┤
│ Output Token Throughput │ 104.9105 token/s │
├─────────────────────────┤
默认开启 MLAPO 时:
├─────────────────────────┤
│ TTFT │ 3753.1547 ms │
├─────────────────────────┤
│ ITL. │ 61.4236 ms. │
├─────────────────────────┤
│ Output Token Throughput │ 125.2075 token/s│
├─────────────────────────┤
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This PR makes `AscendMLAMetadataBuilder` and `AscendSFAMetadataBuilder`
properly inherit from the base class `MLACommonMetadataBuilder` in vllm
by adding `super().__init__()` calls.
**Changes:**
- Add `super().__init__()` call in `AscendMLAMetadataBuilder.__init__()`
- Add `super().__init__()` call in `AscendSFAMetadataBuilder.__init__()`
- Extract `ascend_chunked_prefill_workspace_size()` to
`vllm_ascend/attention/utils.py` to avoid code duplication
- Override `determine_chunked_prefill_workspace_size()` to support
Ascend-specific 128k tokens workspace size (vs 64k in parent class)
- Update unit tests to mock parent class `__init__` for proper isolation
**Why we need it:**
- Follow proper Python inheritance patterns by calling
`super().__init__()`
- Reduce code duplication by reusing parent class initialization logic
- Better maintainability as parent class changes will be automatically
inherited
Part of issue #5463 item 10
### Does this PR introduce _any_ user-facing change?
No, this is an internal refactoring that does not change any user-facing
behavior.
Signed-off-by: lico67373 <918688502@qq.com>
### What this PR does / why we need it?
This PR fix the input constraints checks for the mlapo and bmm_transpose
operators.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
### Perf
64K/3K,1P1D,bs=32
before this pr:
TPOT 29ms, TTFT 47s,TPS 606 token/s
after this pr:
TPOT 29ms, TTFT 48s,TPS 636 token/s
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Add docstrings for Metadata and MetadataBuilder classes in the attention
module to improve code readability.
Related to #5463 (Item 11: Add some comments for CommonMetadata and
others)
**Modified files:**
- `vllm_ascend/attention/context_parallel/common_cp.py`: Added comments
for `AscendPCPMetadata`, `CPChunkedContextMetadata`,
`AscendMetadataForPrefill`, `AscendMetadataForDecode`
- `vllm_ascend/attention/utils.py`: Added comments for
`AscendPrefillContextParallelMetadata`
- `vllm_ascend/attention/mla_v1.py`: Added comments for
`ChunkedContextMetadata`, `AscendMLADecodeMetadata`
- `vllm_ascend/attention/attention_v1.py`: Added comments for
`AscendMetadata`, `AscendAttentionMetadataBuilder`
- `vllm_ascend/attention/context_parallel/attention_cp.py`: Added
comments for `AscendAttentionCPMetadataBuilder`
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation only, no functional changes.
Signed-off-by: lico67373 <918688502@qq.com>
### What this PR does / why we need it?
mlapo in deepseek is a huge performance improvement in decode, this pr
support pcp & dcp with mlapo
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration : The feature is opt-in via the additional_config
argument:
```
--additional-config '{
"layer_sharding": ["o_proj", "q_b_proj"]
}'
```
This is orthogonal to standard tensor parallelism and weight replication
strategies. It is treated as a separate, explicit feature.It can be used
in any scenario, combined with the
flashcomm2https://github.com/vllm-project/vllm-ascend/pull/3232 feature
or the ShardedCP #4702 feature, to achieve significant performance.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
## What this PR does / why we need it?
This PR fixes the `AttentionMaskBuilder` singleton initialization issue
introduced in PR #4779 and removes the unused `pcp_prefill_mask` field.
### Background
After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton`
decorator, the class constructor now requires a `device` parameter.
However, two initialization sites were still using the old parameterless
constructor, causing failures.
### Changes
1. **Fix singleton initialization**
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendMLAMetadataBuilder.__init__()`
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendAttentionMetadataBuilder.__init__()`
2. **Remove unused field**
- Removed `pcp_prefill_mask` field from
`AscendPrefillContextParallelMetadata` (never used in codebase)
- Updated related test assertions
### Related
- Issue #5463
- PR #4779 (Unify all mask generation methods)
- PR #5389 (Make AttentionMaskBuilder singleton)
## Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring.
## How was this patch tested?
- ✅ Local testing: No linter errors
- ✅ Unit tests for attention modules verified
- ⏳ CI pipeline
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
- Problem: In MLA+MLAPO, KV-consumer deployments keep
fused_qkv_a_proj/q_proj weights and quant params even though MLAPO uses
the prepacked buffers, increasing memory footprint on decode nodes.
- Fix: Conditionally drop those tensors only when
`kv_transfer_config.is_kv_consumer` to reclaim memory (consistent with
the SFA behavior #4774 ).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
### What this PR does / why we need it?
Improve the performance of Layerwise Connector, mainly includes the
following points:
1. Use event synchronize to replace stream synchronize.
2. Access metaserver when scheduling.
3. Transfer kvcache each Chunk prefill segmentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
By converting the KV cache from ND to NZ format when the decode node
receives it, this PR ensures that the KV NZ feature works correctly
during the decoding phase in disagg-prefill scenario.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.
Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
#5051 only implement a basic framework for model runner v2, but there
are still some bugs for e2e functionality, this PR aim to enable basic
functionality.
model runner v2 plans:
https://github.com/vllm-project/vllm-ascend/issues/5208
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
We support to use full graph with eagle.
Change list:
1. Distinguish between processing graph_params and draft_graph_params in
attention_v1.
2. Adapt the full-graph mode in eagle_proposer, include:
1). If use full graph, make Fullgraph Wrapper when load model.
2). Build a new meatadata, set running mode in FULL and mark attention
update in dummy_run when in Fullgraph mode.
3). Fixed and fill any attn_metadata, such as
attn_metadata.slot_mapping.
4). Add a descriptor.
5). Set running mode and triggered update metadata.
3. Trans is_mtp_model to is_draft_model, and add the update of
workspace.
NOTE:
When set async_scheduling=True, the draft model will enforce execution
in eager mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
1. The `npu_fused_infer_attention_score` kernel supports specifying the
output layout. By selecting the appropriate layout, we can avoid the
transpose operation typically required after the attention.
2. The `transpose_batchmatmul` function allows us to control whether the
output tensor is transposed. If we configure `perm_y`, an additional
transpose after executing `v_up` becomes unnecessary.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The functions related to Cp differ significantly from those of normal
MLA-Attention, but the coupling is quite severe.
Steps:
1)Extract common code AscendMLAMetadataBuilder.build to 4 functions:
build_prefill_metadata, build_decode_metadata,build_cp_metadata,
build_chunked_metadata
todo:
1)refactor function _compute_prefill_context;
2)refactor function _mla_preprocess,_mla_decode_preprocess
3)Extract public data and processing functions from the attention_cp.py
and mla_cp.py files to the common_cp file.
vLLM version: 0.13.0rc3
vLLM main:
ad32e3e19c
- vLLM version: 0.13.0rc3
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The metadata data class contains an excessive number of variables. We
will inherit the metadata of the community and simultaneously remove
some variables that are no longer needed at present.
Todo:
1. remove attn_state partly.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
Now `VLLM_ASCEND_ENABLE_NZ` will have three options:
0: disable nz;
1: only quant case enable nz;
2: enable nz as long as possible;
And `VLLM_ASCEND_ENABLE_NZ`=1 by default.
All cases are shown in the table below:
| | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by
default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose
case | only support transpose case | linear: only support not transpose
case<br>gmm: only support transpose case | same to fp16/bf16 |
Some exceptional cases:
1. MLAPO op need to do some additional processing on the weights,
including trans nz. If use MLAPO op, some weight will be transformed to
nz forcely;
2. MLA/SFA's weight `W_UV` will be used by op
`torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support
nz currently;
### Does this PR introduce _any_ user-facing change?
Now fp16/bf16 weight will not trans nz by default.
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR fixes a bug in the `AscendMLAImpl._v_up_proj` method where the
optimized `batch_matmul_transpose` operator was not being utilized.
**Changes:**
- Modified `_v_up_proj` method to use
`torch.ops._C_ascend.batch_matmul_transpose` operator for FP16/BF16
dtypes when available
- Added fallback path using the original `torch.bmm` implementation for
other cases
- This avoids unnecessary transpose operations and improves performance
**Why needed:**
- The previous implementation only used `torch.bmm` with multiple
transpose operations, which is less efficient
- The Ascend backend provides an optimized `batch_matmul_transpose`
operator that can handle the computation more efficiently
- This fix improves inference performance for MLA (Multi-head Latent
Attention) models on Ascend NPU
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization that maintains the same
functionality and output. Users will experience faster inference for
MLA-based models, but no API or interface changes are introduced.
The changes maintain backward compatibility with the fallback path,
ensuring correct behavior when the operator is not available or for
unsupported dtypes.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
This PR add `qkv_rmsnorm_rope` operator and introduces a graph fusion
pass for `qknorm_rope` operations. The implementation includes a new
configuration flag, a pattern matching pass using
`torch._inductor.pattern_matcher`, and a custom Triton kernel for the
fused operation.
Co-authored-by: Angazenn
[supperccell@163.com](mailto:supperccell@163.com)
### Does this PR introduce _any_ user-facing change?
Yes, add new additional_config
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
The functions related to Cp differ significantly from those of normal
MLA-Attention, but the coupling is quite severe.
Steps:
Isolate PCP and DCP
(1) create a new python file: mla_cp.py
(2) add classes AscendMlaCPImpl and
AscendMlaCPMetadataBuilder,Inheritance AscendMLAImpl and
AscendMLAMetadataBuilder
(3) Remove PCP and DCP-related methods from mla_v1.py to mla_cp.py
vLLM version: v0.12.0
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>