Files
xc-llm-ascend/vllm_ascend/quantization
ApsarasX e3c7f71462 [Perf] Refactor tensor disposal logic to reduce memory usage (#966)
### What this PR does / why we need it?
1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580
https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory
by promptly deleting unnecessary tensors. For tensors passed from
upper-layer functions, I used a list container to transfer the parameter
and then popped the tensor from the list within the inner function to
achieve deletion. Recently, I discovered a better implementation in
sglang—the `dispose_tensor` function and I recommend adopting this
approach.
2. Dispose `hidden_states` and `residual` from the previous layer once
they're no longer used.
3. Avoid to generate `self.inputs_embeds` in `ModelRunnerV1` in
non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model
under the conditions of `TP=16` and `max-model-len=32768`, we can save
1.3GB of npu memory.

**Reference**: https://github.com/sgl-project/sglang/pull/6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-29 11:48:26 +08:00
..