xc-llm-ascend/vllm_ascend
Angazenn f9494d978a [cherry-pick][v0.11.0-dev][bugfix] Fix a rare bug triggered by _npu_paged_attention in FULL_DECODE_ONLY mode (#3987)
### What this PR does / why we need it?
This is a cherry-pick from #3986.

This PR fixes a bug where the workspace allocated for `_npu_paged_attention` at setup time can be smaller than what execution requires. In the current FULL_DECODE_ONLY implementation with `_npu_paged_attention`, we call `_npu_paged_attention_get_workspace` during graph capture with `max_model_len` as `seq_lens`. This assumes that PA with larger `seq_lens` inputs always needs a workspace at least as large as PA with smaller `seq_lens`. However, there are rare cases where PA with smaller `seq_lens` requires a larger workspace. So this PR adds a `get_workspace` call directly into `update_attn_params`.
This change might introduce a slight (≈1%) performance degradation for small `num_tokens` (such as 1) in the decode phase, and there are no other known memory issues, so I think this change is acceptable. We can remove this workaround once a newer attention op (such as `npu_fused_infer_attention_score`) no longer has this problem.
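
For illustration, here is a minimal sketch of the pattern this fix adopts, written in plain PyTorch rather than the actual vllm-ascend code. The names `Step` and `workspace_bytes_for` are hypothetical placeholders (the real sizing call is `torch_npu._npu_paged_attention_get_workspace`, whose actual signature takes the full attention inputs, not just `seq_lens`); the point is that the workspace size is re-queried from the current step's `seq_lens` inside `update_attn_params`, instead of being fixed once at capture time from `max_model_len`.

```python
# Sketch only: re-size the paged-attention workspace per decode step,
# because "larger seq_lens => larger workspace" does not always hold.
from dataclasses import dataclass

import torch


@dataclass
class Step:
    # Per-request sequence lengths for the current decode step.
    seq_lens: torch.Tensor


def workspace_bytes_for(seq_lens: torch.Tensor) -> int:
    """Illustrative stand-in for the real workspace query
    (torch_npu._npu_paged_attention_get_workspace); the sizing
    formula here is made up."""
    return int(seq_lens.sum().item()) * 64


def update_attn_params(params: dict, step: Step) -> None:
    # Query the required workspace for the *actual* seq_lens of this
    # step, so the buffer can never be undersized. The extra query is
    # the source of the ~1% overhead for tiny batches mentioned above.
    needed = workspace_bytes_for(step.seq_lens)
    buf = params.get("workspace")
    if buf is None or buf.numel() < needed:
        params["workspace"] = torch.empty(needed, dtype=torch.uint8)


# Usage: params carries the workspace across steps; it grows on demand.
params: dict = {}
update_attn_params(params, Step(seq_lens=torch.tensor([4, 7, 1])))
```

Growing the buffer only when the queried size exceeds the current one keeps the common case cheap while guaranteeing the workspace is always large enough.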


Signed-off-by: Angazenn <supperccell@163.com>
2025-11-06 23:08:57 +08:00