### What this PR does / why we need it?

Enables best performance for single-machine, 16-card DeepSeek-R1 serving, with attention running TP8/DP2 and MoE running expert tensor parallelism (ETP).

Best performance relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- https://github.com/vllm-project/vllm-ascend/pull/910
- [Reduce `_npu_flash_attention` mask to 128x128 for memory savings] https://github.com/vllm-project/vllm-ascend/pull/1100
- [Reduce memory usage by splitting tokens in `fused_experts`]

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
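
For reference, a launch sketch for the attention TP8/DP2 layout described above. This is an illustrative assumption, not a command taken from this PR: the model path, port, and exact flag set are hypothetical, and flag availability depends on the pinned vllm/vllm-ascend commits listed here.

```shell
# Hypothetical single-machine 16-card launch: 2 data-parallel replicas,
# each sharding attention across 8 NPUs (8 x 2 = 16 cards total).
# Model path and port are placeholders; adjust to your environment.
vllm serve /path/to/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --port 8000
```
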