### What this PR does / why we need it?
1. This PR introduces a native `all_to_all` communication operator to fix `allgather` bugs when `dp_size > 1`. It also adds a naive force-load-balance implementation for profile runs (illustrative sketches of both changes follow at the end of this description).
2. The operator `npu_dequant_swiglu_quant` only supports input hidden_states of dtype `torch.int32`. This tensor occupies `global_bs * seq_len * topk * hidden_size` elements, which can become very large as `ep_size` grows. We therefore disable this operator and fall back to the original `swiglu` and `quantize` path (see the sketch below).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By performing offline inference.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
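
For reviewers, here is a minimal sketch of the token-dispatch idea behind the `all_to_all` change. This is not the actual vllm-ascend implementation: the function name, the contiguous expert-to-rank partitioning, and the one-expert-per-token simplification are assumptions made for illustration. Unlike `allgather`, where every rank materializes all global tokens, each rank here receives only the tokens routed to its local experts:

```python
# Illustrative sketch only, not the code added by this PR.
import torch
import torch.distributed as dist

def dispatch_tokens(hidden_states: torch.Tensor,  # [num_tokens, hidden_size]
                    expert_ids: torch.Tensor,     # [num_tokens], one expert per token
                    num_experts: int) -> torch.Tensor:
    world_size = dist.get_world_size()
    experts_per_rank = num_experts // world_size
    # Assumes experts are partitioned contiguously across ranks.
    dest_rank = expert_ids // experts_per_rank

    # Group tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    send_buf = hidden_states[order].contiguous()
    send_splits = torch.bincount(dest_rank, minlength=world_size)

    # Exchange per-rank token counts first so the receive buffer can be sized.
    recv_splits = torch.empty_like(send_splits)
    dist.all_to_all_single(recv_splits, send_splits)

    recv_buf = send_buf.new_empty((int(recv_splits.sum()), hidden_states.shape[1]))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_splits.tolist(),
                           input_split_sizes=send_splits.tolist())
    return recv_buf  # only the tokens destined for this rank's local experts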
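
The naive force-load-balance for profile runs can be pictured roughly as below; `select_experts` and `is_profile_run` are hypothetical names, not the actual API. The idea is that during a profile run the router output is ignored and tokens are spread round-robin over experts, so every device profiles against a balanced workload:

```python
# Hypothetical sketch of force-load-balance during profile runs.
import torch

def select_experts(router_logits: torch.Tensor,  # [num_tokens, num_experts]
                   topk: int,
                   is_profile_run: bool = False):
    """Return (topk_weights, topk_ids) for each token."""
    if is_profile_run:
        # Force load balance: assign tokens to experts round-robin so every
        # expert receives roughly the same number of tokens, regardless of
        # what the (dummy) router would produce.
        num_tokens, num_experts = router_logits.shape
        flat = torch.arange(num_tokens * topk, device=router_logits.device)
        topk_ids = (flat % num_experts).view(num_tokens, topk)
        topk_weights = torch.full((num_tokens, topk), 1.0 / topk,
                                  dtype=torch.float32,
                                  device=router_logits.device)
        return topk_weights, topk_ids
    # Normal path: pick the top-k experts from the router distribution.
    return torch.topk(router_logits.softmax(dim=-1), topk, dim=-1)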
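
And a sketch of the fallback path from item 2: plain SwiGLU followed by quantization instead of the fused `npu_dequant_swiglu_quant`. The shapes and the symmetric per-token int8 scheme here are illustrative assumptions; the sketch also assumes the int32 matmul output has already been dequantized to floating point:

```python
# Illustrative fallback: unfused SwiGLU + quantize.
import torch

def swiglu_then_quant(gate_up: torch.Tensor):  # [num_tokens, 2 * intermediate]
    gate, up = gate_up.chunk(2, dim=-1)
    activated = torch.nn.functional.silu(gate) * up  # SwiGLU

    # Symmetric per-token int8 quantization (assumed scheme, for illustration).
    scale = activated.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    quantized = torch.round(activated / scale).to(torch.int8)
    return quantized, scale.squeeze(-1)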