[BugFix] Fix dummy_run memory explosion in eager mode (#3132)

### What this PR does / why we need it?

This is a quick bugfix for a memory explosion issue; the affected code still requires further refactoring. In eager mode, dummy_run may lead to OOM because `hidden_states` were not released in time. This PR temporarily resolves the issue by manually clearing the cache, and further refactoring will follow.

Before the modification, memory usage accumulated during dummy_run:

<img width="1796" height="207" alt="image" src="https://github.com/user-attachments/assets/05e2b04c-2f99-4085-9eda-c78b7d9a57b0" />

After the modification, the memory is released promptly. It was also verified that the model responds normally after a single data input.

- vLLM version: v0.10.2
- vLLM main: b1068903fd

---------

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-09-25 16:09:44 +08:00
parent 72f64c10b7
commit 07f4710216
1 changed file with 10 additions and 0 deletions
--- a/vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py
+++ b/vllm_ascend/ops/moe/fused_moe_prepare_and_finalize.py
@@ -183,6 +183,11 @@ class FusedMoEPrepareAndFinalizeWithMC2(FusedMoEPrepareAndFinalize):
                                self.moe_config.tp_group.device_group)
                hidden_states = torch.cat(self.split_hidden_states, dim=0)

+                # TODO: Quick bugfix for the memory explosion issue in eager mode.
+                # Release `self.split_hidden_states` once it has been concatenated;
+                # otherwise the cached tensors stay alive and memory can explode.
+                del self.split_hidden_states
+
            # Unpad if necessary
            if self.num_tokens < hidden_states.shape[0]:
                hidden_states = hidden_states[:self.num_tokens]
@@ -267,6 +272,11 @@ class FusedMoEPrepareAndFinalizeWithAll2All(FusedMoEPrepareAndFinalize):
                                self.moe_config.tp_group.device_group)
                hidden_states = torch.cat(self.split_hidden_states, dim=0)

+                # TODO: Quick bugfix for the memory explosion issue in eager mode.
+                # Release `self.split_hidden_states` once it has been concatenated;
+                # otherwise the cached tensors stay alive and memory can explode.
+                del self.split_hidden_states
+
            if self.num_tokens < hidden_states.shape[0]:
                hidden_states = hidden_states[:self.num_tokens]
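
The effect of the added `del self.split_hidden_states` can be illustrated with a small, torch-free sketch. It is only a toy model of general Python reference behavior, not the vllm-ascend implementation: `FakeTensor`, `Finalizer`, and `chunks_alive_after_finalize` are hypothetical names, and the sum of sizes merely stands in for `torch.cat`. The point is that an attribute on `self` keeps the split buffers reachable after the call returns, while deleting the attribute lets them be reclaimed.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a device tensor (not a torch API)."""

    def __init__(self, nbytes: int) -> None:
        self.nbytes = nbytes


class Finalizer:
    """Toy finalize step that caches split buffers on the instance."""

    def __init__(self, release: bool) -> None:
        self.release = release
        self.probes = []

    def finalize(self) -> FakeTensor:
        chunks = [FakeTensor(1024) for _ in range(4)]
        # Cache the chunks on `self`, mirroring `self.split_hidden_states`.
        self.split_hidden_states = chunks
        # Weak references let us observe liveness without keeping chunks alive.
        self.probes = [weakref.ref(c) for c in chunks]
        # Stand-in for torch.cat: a new object built from the chunks.
        merged = FakeTensor(sum(t.nbytes for t in self.split_hidden_states))
        if self.release:
            # Drop the attribute so the chunks lose their last strong reference
            # once the local `chunks` goes out of scope.
            del self.split_hidden_states
        return merged


def chunks_alive_after_finalize(release: bool) -> int:
    """Count how many chunks are still reachable after finalize() returns."""
    fin = Finalizer(release)
    merged = fin.finalize()
    assert merged.nbytes == 4096
    gc.collect()
    return sum(1 for probe in fin.probes if probe() is not None)
```

One consequence of this style of fix is that, after `del`, the attribute no longer exists on the instance, so any code path that reads it before it is assigned again would raise `AttributeError`.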