[Perf] Refactor tensor disposal logic to reduce memory usage (#966)
### What this PR does / why we need it?
1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580 and https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory by promptly deleting tensors that were no longer needed. For tensors passed in from upper-layer functions, I wrapped the argument in a list and popped the tensor from that list inside the inner function so it could be deleted there. I recently found a better implementation in sglang, the `dispose_tensor` function, and I recommend adopting that approach.
2. Dispose of `hidden_states` and `residual` from the previous layer once they are no longer used.
3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios.

With these optimizations, running the DeepSeek-R1-W8A8 model with `TP=16` and `max-model-len=32768` saves 1.3 GB of NPU memory.

**Reference**: https://github.com/sgl-project/sglang/pull/6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
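The list-container pattern described in item 1 can be illustrated with a minimal pure-Python sketch. `FakeTensor`, `inner`, and `outer` are hypothetical names used only for illustration, and the immediate release relies on CPython reference counting; it is not the code from the earlier PRs.

```python
import weakref


class FakeTensor:
    """Hypothetical stand-in for a large device tensor."""

    def __init__(self, nbytes: int) -> None:
        self.data = bytearray(nbytes)


def inner(holder: list) -> int:
    # Popping removes the caller's only reference, which lives in the list.
    tensor = holder.pop()
    size = len(tensor.data)  # last use of the tensor
    # Dropping the local name releases the object right away under CPython
    # reference counting, before the rest of the function runs.
    del tensor
    return size


def outer() -> tuple:
    # Pass the tensor inside a list instead of binding it to a local name,
    # so no reference in this frame keeps it alive during the call.
    holder = [FakeTensor(1024)]
    probe = weakref.ref(holder[0])  # observes the object without owning it
    size = inner(holder)
    return size, probe() is None


size, freed = outer()
```

If `outer` had bound the tensor to a local variable and passed it directly, that variable would keep the object alive until `outer` returned, which is the limitation the list wrapper works around.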
@@ -68,6 +68,7 @@ from vllm.sequence import IntermediateTensors
 import vllm_ascend.envs as envs_ascend
 from vllm_ascend.ops.fused_moe import AscendFusedMoE
 from vllm_ascend.quantization.w8a8_dynamic import AscendW8A8DynamicLinearMethod
+from vllm_ascend.utils import dispose_tensor
 
 VLLM_ENABLE_MC2: bool = envs_ascend.VLLM_ENABLE_MC2
 
@@ -518,8 +519,14 @@ class CustomDeepseekV2DecoderLayer(DeepseekV2DecoderLayer):
             residual = hidden_states
             hidden_states = self.input_layernorm(hidden_states)
         else:
+            previous_hidden_states, previous_residual = hidden_states, residual
             hidden_states, residual = self.input_layernorm(
                 hidden_states, residual)
+            # Dispose hidden_states and residual from the previous layer
+            # to save npu memory because they're no longer used.
+            dispose_tensor(previous_hidden_states)
+            dispose_tensor(previous_residual)
+
         hidden_states = self.self_attn(
             positions=positions,
             hidden_states=hidden_states,
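The body of `dispose_tensor` is not part of this diff. The sketch below assumes it follows the sglang implementation referenced in the commit message, which swaps the tensor's storage for an empty one via `Tensor.set_`. `layer_step` and the arithmetic inside it are hypothetical stand-ins for the decoder layer, run on CPU for illustration only.

```python
import torch


def dispose_tensor(x: torch.Tensor) -> None:
    # Assumed implementation: point x at a zero-element storage so the
    # original buffer can be freed even though callers still hold x.
    x.set_(torch.empty((0,), device=x.device, dtype=x.dtype))


def layer_step(hidden_states: torch.Tensor, residual: torch.Tensor):
    # Keep handles to the inputs from the previous layer.
    previous_hidden_states, previous_residual = hidden_states, residual
    # Hypothetical stand-in for the fused add + layernorm: produces NEW tensors.
    residual = hidden_states + residual
    hidden_states = residual * 2.0
    # The previous layer's outputs are no longer needed; release them now
    # rather than waiting for every caller frame to drop its reference.
    dispose_tensor(previous_hidden_states)
    dispose_tensor(previous_residual)
    return hidden_states, residual


caller_h = torch.ones(4, 8)
caller_r = torch.ones(4, 8)
out_h, out_r = layer_step(caller_h, caller_r)
# The caller's variables still exist but now refer to empty tensors.
print(caller_h.shape, out_h.shape)
```

Unlike the list-and-pop pattern, this works through any number of caller frames because it mutates the tensor object in place. Two caveats apply: the underlying buffer is only freed if no other tensor (for example a view) shares the same storage, and the new tensors must not alias the disposed ones.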