xc-llm-ascend

Files

ApsarasX e3c7f71462 [Perf] Refactor tensor disposal logic to reduce memory usage (#966 )

### What this PR does / why we need it?
1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580
https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory
by promptly deleting unnecessary tensors. For tensors passed from
upper-layer functions, I used a list container to transfer the parameter
and then popped the tensor from the list within the inner function to
achieve deletion. Recently, I discovered a better implementation in
sglang—the `dispose_tensor` function and I recommend adopting this
approach.
2. Dispose `hidden_states` and `residual` from the previous layer once
they're no longer used.
3. Avoid to generate `self.inputs_embeds` in `ModelRunnerV1` in
non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model
under the conditions of `TP=16` and `max-model-len=32768`, we can save
1.3GB of npu memory.

**Reference**: https://github.com/sgl-project/sglang/pull/6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>

2025-05-29 11:48:26 +08:00

__init__.py

[Core] Cherry pick from 0.7.1 to keep the main code newest (#127 )

2025-02-21 17:07:37 +08:00

func_wrapper.py

[quantization] Support w8a8 quantization (#580 )

2025-04-20 18:14:05 +08:00

quant_config.py

[BugFix] Fix accuracy bugs for unquantized deepseekv3 models (#897 )

2025-05-24 14:29:36 +08:00

quantizer.py

[quantization] Support w8a8 quantization (#580 )