xc-llm-ascend

Files

linfeng-yuan ffdd1a36e2 [bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068)

### What this PR does / why we need it?
When running quantized deepseek models with an unquantized MTP layer,
free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group argument.
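
The failure mode described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `all_to_all_single_on_group` and its `dist_module` parameter are hypothetical, and the only real API assumed is `torch.distributed.all_to_all_single`, whose `group` argument falls back to the default process group when left as `None`.

```python
from typing import Any, Optional


def all_to_all_single_on_group(
    output: Any,
    input_: Any,
    group: Any,
    dist_module: Optional[Any] = None,
) -> None:
    """Run all_to_all_single on an explicitly supplied process group.

    Hypothetical helper: it refuses to run when no group is given, so the
    collective can never silently fall back to the default process group.
    """
    if group is None:
        # In torch.distributed, group=None selects the default process group.
        raise ValueError("pass the device process group explicitly")
    if dist_module is None:
        # Imported lazily so the helper can be exercised without torch.
        import torch.distributed as dist_module
    dist_module.all_to_all_single(output, input_, group=group)
```

A caller would pass the process group that its other device collectives already use, so that no additional communication buffer has to be set up for a different group.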

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We ran vllm online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free_memory increased because the redundant
VRAM buffer for the HCCL communication op (all_to_all_single) is no longer
allocated.

- vLLM version: v0.10.2
- vLLM main: 6d8246aaff

Signed-off-by: linfeng-yuan <1102311262@qq.com>

2025-09-22 14:06:43 +08:00

__init__.py

[3/N][refactor] refactor quantization (#2504)

2025-08-27 10:45:50 +08:00

torchair_w4a8_dynamic.py

[bugfix][refactor]fix torchair_w8a8 (#2569)

2025-08-28 09:10:31 +08:00

torchair_w8a8_dynamic.py

[bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068)

2025-09-22 14:06:43 +08:00