### What this PR does / why we need it?
Refactor kv cache tensor initialization logic.
1. Unify the kvcache tensor initialization logic of deepseek and normal
models
2. spilt `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors`
and `_reshape_kv_cache_tensors`, following gpu modelrunner in vllm
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
1. prefill disaggregation scenario
4. deepseek + aclgraph/eager mode
5. qwen3 next
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
move quant before allgather in Allgather EP, rely on
https://github.com/vllm-project/vllm-ascend/pull/3334
Deepseek R1 W8A8 performance on A2 with
`HCCL_ALGO="level0:NA;level1:pipeline"`:
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|----------|----------|----------|
| 4k | 375.21 | 364.99 |
| 16k | 1465.23 | 1421.75 |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Set adxl engine as the default Mooncake backend, because Ascend
Transport is no longer maintained.
Update README to include instructions for installing the adxl backend
Mooncake.
### Does this PR introduce _any_ user-facing change?
Users need to compile and install the mooncake backend for adxl
according to the revised README instructions.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
This PR fixes deepseek v3.2 mtp bug.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
All existed ci tests should pass.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Fix#3891.
The empty of `moe_comm_method` in the above issue is due to the wrong
check for MoE models. To be specific, the method `is_moe_model` only
checks whether a text-only model is a MoE model, without considering
multi-modal models, e.g., `VL` and `Omni`.
Check the config dict recursively to find if it has a key contains
"expert", without checking the model architecture.
It is worth noting that, we can't verify a model by if it contains
`FusedMoE` module because `is_moe_model` is called somewhere before the model loading, e.g., it's called when updating the ACLGraph config in
platform initialization.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
1、revert TND modify when dcp pcp, which is introduced by
f57bdb09fc
2、deal aclgraph pad border issue
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.
TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
correct bug to fix the value of max_num_tokens
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
- Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode.
- Add unit test for sfa_v1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
### What this PR does / why we need it?
In the `update_*attn_params` functions, the
`torch.npu.stream(update_stream)` context manager was previously located
inside the for-loop that updates parameters for each layer. This
resulted in redundant stream initiations for every layer, adding
unnecessary overhead.
This commit refactors the code by moving the stream context manager to
wrap the entire for-loop. This ensures that the update stream is
initiated only once per function call, rather than for each layer. This
change reduces 90us in each decode model.
update stream in every layer:
<img width="1720" height="383" alt="image"
src="https://github.com/user-attachments/assets/70e4cb69-5bc1-4180-a67d-c99132134be6"
/>
remove update stream in every layer:
<img width="1269" height="175" alt="image"
src="https://github.com/user-attachments/assets/0e290edb-b0ce-48fe-b032-1b924ade6ae5"
/>
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
support pcp + mtp (with pd disaggregate, only pcp in P nodes)
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
### What this PR does / why we need it?
This PR temporarily bypasses the scenario where some models in vLLM
trigger a `ValueError` during the process of storing values in
`static_forward_context` when no `prefix` is specified for the linear
layers, which is a bug in some models in vLLM. The official fix will be
addressed by submitting a PR to the vLLM community that specifies a
prefix for the linear layers in each model.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
Fix the issue of MTP being enabled and setting
Imhead_tensor_parallel_size=16 causing the inference to hang.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
1. Revert [bugfix for mtp in
fullgraph](0948483642)
and support it when vllm supports
2. raise error when cudagraph_capture_sizes can't be an integer multiple
of uniform_decode_query_len
3. bugfix when max_num_seqs=14 in mtp=2 scenario
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
cancel tokenize for layerwise_proxy
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Refactored the layerwise code to send to the D node first, preventing
P-node hangs due to communication timeouts when DP > 1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix the precision issue caused by the inconsistency between the group
list type used by mc2 and that of eplb.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
This is the follow-up PR to PR #3189, which continues to refactor sfa
into mla and finally remove deepseek_v3_2.py. This is the last PR of
deepseek modeling refactoring. After this, all deepseek-related model
codes are removed from vllm_ascend.
FurtherMore, after this PR deepseek v3.2 can run chunk-prefill with
correct accuracy.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1. Refactor the file `mtp_proposer.py`, splits torchair related codes
into `mtp_torchair_proposer.py`
2. According to https://github.com/vllm-project/vllm/pull/24539,
implements padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984.
### Does this PR introduce _any_ user-facing change?
User can use `disable_padded_drafter_batch` to disable/enable padded
speculation, default is `False`.
offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```
### How was this patch tested?
- [x] egaer with pad/unpad:
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad
performance test of deepseek-r1 with tp16、dp1
aclgraph with pad ITL: 168ms
aclgraph with unpad ITL: 169ms
original: 178ms
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
Fix oom of deepseek-eplb nigtly test
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
This PR fixes a mlapo accuracy problem related with weight processing.
Furthermore, add back mlapo related e2e test with quantized deepseek
model.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
bugfix for mtp fullgraph
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
When using multi connector, the multi connector does not define
get_finished_count, which will cause the kv cache to be released
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: baxingpiaochong <771405853@qq.com>
### What this PR does / why we need it?
fix a typo in mooncake layerwise connector. There is only `requests`,
instead of `request` in `connector_metadata`. This pr fixes this typo
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
To adapt the torch_npu version to avoid the precision problem of
torchair deepseek. The torch_npu version may result in the different
branches in the ops register, the rms_norm ops has two branches
according to the verson_check, this pr unify the rms_norm in torchair by
patching quant_rms_norm to rms_norm to fix the accuracy issue in torchair scenario
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
After refactoring vllm_ascend/models and FusedMoE, we are unable to pass
`gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result
in error when running deepseek v3/r1 with allgather.
Hence, this pr removes `gate` related computations from FusedMoE module
in eager/aclgraph mode.
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph.
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Part of https://github.com/vllm-project/vllm-ascend/pull/3106
Fix Hybrid kvcache sharing bug in same attention type
Change the `shared_by` logic so that the same attention spec could share
the same buffer instead of allocating more hbm.
After this pr, kvcache memory saved 50% in qwen3-next compared with
before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` could increase to `0.8` on Qwen3-Next when
running on A2 64G/card with tp4
<img width="2833" height="1540" alt="image"
src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe"
/>
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Test pass with the latest e2e test case on qwen3-next
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
force with_prefill true after allreduce in kv producer
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
We have optimized the performance of long sequences:First,Modify the
input data format for attention calculation. Instead of using the
original BSND format, remove the logic for converting between TND and
BSND, and directly adopt the TND format.
The TND input format can be directly reused, which shortens the data
flow path. Converting to BSND is an unnecessary processing step.Second,
we switched the output update of the concatenated small operators to the
npu_attention_update fusion operator to improve performance.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
The current MatmulReduceScatter operator experiences performance
degradation in small-shape scenarios, so it determines whether to use
this operator by judging the size of the shape.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
### What this PR does / why we need it?
dcp pcp support full aclgraph, including mla attention_v1
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.
This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Fix Qwen3NextGatedDeltaNet, caused by
https://github.com/vllm-project/vllm/pull/26437
### How was this patch tested?
```
def main():
prompts = [
"窗前明月光,",
"The president of the United States is Mr.",
"The capital of France is",
"The future of AI is",
"感时花溅泪,",
"家书抵万金啥意思?",
"plz tell me a story: ",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
# Create an LLM.
llm = LLM(
model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,
trust_remote_code=True,
max_model_len=256,
gpu_memory_utilization=0.7,
block_size=64
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
Remove codes of dbo.
Currently, vLLM has supported dbo with pr:
https://github.com/vllm-project/vllm/pull/23693.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.
This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Rename common_fused_moe.py to fused_moe.py.
2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize
to prepare_finalize.py / PrepareAndFinalize.
3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe.
4. Move vllm_ascend/ops/fused_moe.py to
vllm_ascend/ops/fused_moe/fused_moe.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
We optimized the _prepare_input method in eagle_proposer and no longer
use the _prepare_eagle_input_sequential method, improving the
performance of eagle-3.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```
python3 -m vllm.entrypoints.openai.api_server
--host 0.0.0.0
--port 13963
--dtype bfloat16
--model meta-llama/Llama-3.1-8B-Instruct
--served-model-name Llama-3.1-8B-Instruct
--tensor-parallel-size 1
--gpu-memory-utilization 0.85
--max-model-len 32768
--trust-remote-code
--seed 42
--no-enable-prefix-caching
--speculative_config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":2,"draft_tensor_parallel_size":1}'
```
Co-authored-by: QilaiZhang (245706640@qq.com )
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: lio <1983142975@qq.com>
### What this PR does / why we need it?
Since Attention and LinearAttention share the same ```slot_mapping```,
and the ```slot_mapping``` for LinearAttention is all zeros, the
```slot_mapping``` for Attention gets overwritten, resulting in the
computed output being all zeros.
This PR removes the uniformly managed ```self.slot_mapping``` and
directly passes the ```slot_mapping``` from ```input_batch.blocktable```
to ```attn_metadata```, along with modifying the relevant references.
Due to hardware, the data type of ```block_table.slot_mapping``` needs
to be set to int32.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: QilaiZhang <245706640@qq.com>
This PR fix the bug related with running multi-modal models with
AscendScheduler. This bug was introduced by PR #2372 by using the same
parameter names as vLLM with different default values.
Currently I fix this bug by changing the default values of these two
parameters to align with vLLM.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Modify the recalculation logic to prevent waiting requests from filling
up the D node KVCache
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: underfituu <hzhucong@163.com>