From 07f4710216da1d61c521fb3bdf4ba90ea1794474 Mon Sep 17 00:00:00 2001
From: MengLong Chen <71744434+dragondream-chen@users.noreply.github.com>
Date: Thu, 25 Sep 2025 16:09:44 +0800
Subject: [PATCH] [BugFix] Fix dummy_run memory explosion in eager mode (#3132)
### What this PR does / why we need it?
This is a quick bugfix for a memory-explosion issue; a fuller refactor will
follow.
`dummy_run` in eager mode can lead to OOM because the split
`hidden_states` tensors were not released in time.
This PR temporarily resolves the issue by manually clearing the cached
split tensors; further refactoring will be conducted subsequently.
Before this change, `dummy_run`'s memory usage accumulated over
iterations.
After this change, the memory is observed to be released promptly,
and the model was verified to respond normally after a single data
input.
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/b1068903fdca26cf6b4a1a51a32c3365ce3ac636
---------
Signed-off-by: chenmenglong
---
vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py b/vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py
index 1d6df2c..6ed9858 100644
--- a/vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py
+++ b/vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py
@@ -183,6 +183,11 @@ class FusedMoEPrepareAndFinalizeWithMC2(FusedMoEPrepareAndFinalize):
self.moe_config.tp_group.device_group)
hidden_states = torch.cat(self.split_hidden_states, dim=0)
+ # TODO: This is a quick fix for the eager-mode memory explosion.
+ # If `self.split_hidden_states` is not released after the concatenation
+ # above, the cached split tensors accumulate and can cause OOM.
+ del self.split_hidden_states
+
# Unpad if necessary
if self.num_tokens < hidden_states.shape[0]:
hidden_states = hidden_states[:self.num_tokens]
@@ -267,6 +272,11 @@ class FusedMoEPrepareAndFinalizeWithAll2All(FusedMoEPrepareAndFinalize):
self.moe_config.tp_group.device_group)
hidden_states = torch.cat(self.split_hidden_states, dim=0)
+ # TODO: This is a quick fix for the eager-mode memory explosion.
+ # If `self.split_hidden_states` is not released after the concatenation
+ # above, the cached split tensors accumulate and can cause OOM.
+ del self.split_hidden_states
+
if self.num_tokens < hidden_states.shape[0]:
hidden_states = hidden_states[:self.num_tokens]
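
The mechanism behind the fix can be illustrated outside vLLM. The sketch below is hypothetical and uses plain Python objects in place of NPU tensors (`Tensor`, `PrepareAndFinalize`, and `finalize` are illustrative names, not vLLM APIs): as long as the module instance keeps `self.split_hidden_states` as an attribute, the split tensors stay alive even after they have been concatenated, so each `dummy_run` accumulates memory until the attribute is overwritten. Deleting the attribute drops the last reference and lets them be freed immediately.

```python
# Hypothetical sketch: plain Python objects stand in for device tensors to
# show why `del self.split_hidden_states` releases memory promptly.
import weakref


class Tensor:
    """Stand-in for a device tensor."""

    def __init__(self, data):
        self.data = data


class PrepareAndFinalize:
    def finalize(self, split):
        # Keep the splits on `self`, mirroring the structure in the patch.
        self.split_hidden_states = split
        # Stand-in for `torch.cat(self.split_hidden_states, dim=0)`.
        hidden_states = Tensor([x for t in split for x in t.data])
        # Without the next line, `self.split_hidden_states` keeps every
        # split tensor alive until the next call overwrites the attribute.
        del self.split_hidden_states
        return hidden_states


module = PrepareAndFinalize()
splits = [Tensor([1, 2]), Tensor([3, 4])]
refs = [weakref.ref(t) for t in splits]

out = module.finalize(splits)
del splits  # drop the caller's references

# With the `del` inside finalize(), no live reference remains, so the
# split "tensors" are collected as soon as the caller drops them.
assert all(r() is None for r in refs)
assert out.data == [1, 2, 3, 4]
```

In real eager-mode execution the freed references let the caching allocator reuse the device memory right away instead of holding it across dummy runs.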