[Fix] Prevent memory leak in MLA decode graph (#3743)

### What this PR does / why we need it?
The cache for MLA decode graph parameters held strong references to
tensors, preventing them from being garbage collected and causing memory
usage to grow over time.

This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
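
As a minimal sketch of the pattern (not the actual implementation), assume a dict-based `graph_params` cache; the name and structure here are illustrative. The repository's `weak_ref_tensors` helper (see the diff below) may alias the tensor's buffer rather than use Python's `weakref`, but the effect on object lifetime is the same:

```python
import weakref

import torch

# Hypothetical cache of decode-graph parameters, for illustration only.
graph_params: dict[str, weakref.ref] = {}

def cache_tensor(key: str, t: torch.Tensor) -> None:
    # Before the fix: `graph_params[key] = t` keeps `t` alive indefinitely.
    # After the fix: a weak reference leaves ownership with the graph.
    graph_params[key] = weakref.ref(t)

t = torch.empty(1024)
cache_tensor("mla_decode", t)
assert graph_params["mla_decode"]() is t      # alive while `t` is referenced
del t
assert graph_params["mla_decode"]() is None   # collected once the last strong reference is gone
```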

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0rc3
- vLLM main: c9461e05a4

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Author: Yizhou
Committed: 2025-10-25 20:37:33 +08:00 (via GitHub)
Parent: afc58184ec · Commit: 8ab8111fde
4 changed files with 29 additions and 19 deletions

@@ -697,6 +697,13 @@ def weak_ref_tensors(
"""
Convenience function to create weak references to tensors, for a
single tensor, a list of tensors, or a tuple of tensors.

Use it in the following scenario: a tensor is created during graph
capture and is then held by a method that is not part of the graph.
We do not need to keep the tensor alive, only its buffer pointer; a
strong reference would prevent it from being garbage collected and
leak memory. Creating a weak reference to the tensor avoids this.
"""
if isinstance(tensors, torch.Tensor):
return weak_ref_tensor(tensors)
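
For context, a plausible sketch of how the rest of the function continues for the list and tuple cases; `weak_ref_tensor` (singular) is the repository's single-tensor helper referenced above, and the authoritative version is in the changed file itself:

```python
import torch

def weak_ref_tensors(tensors):
    # Sketch under the assumption that `weak_ref_tensor` wraps one tensor.
    if isinstance(tensors, torch.Tensor):
        return weak_ref_tensor(tensors)
    if isinstance(tensors, list):
        return [weak_ref_tensor(t) for t in tensors]
    if isinstance(tensors, tuple):
        return tuple(weak_ref_tensor(t) for t in tensors)
    raise ValueError("Invalid type for tensors")
```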