[Bugfix][MM] Fix multi-modal inference OOM issues by setting expandable_segments:True (#5855)
### What this PR does / why we need it?
As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339,
multi-modal inference on vllm-ascend may lead to OOM issues in some
scenarios.
After our analysis, this is due to memory fragmentation caused by frequent
dynamic memory size adjustments at runtime. During inference, non-torch
memory gradually increases from around 1 GB to over 5 GB until the OOM
occurs.
We found that this problem can be resolved by simply setting
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at
https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue.
Thus, we decided to set this value by default, except in RL (sleep mode)
scenarios.
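Until this default is in place, the manual workaround from the FAQ can be sketched as follows (setting the variable in the shell before launching the server; the `echo` is only there to confirm the value):

```shell
# Manual workaround that this PR automates: configure the NPU allocator
# to use expandable segments before starting the vLLM process.
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_NPU_ALLOC_CONF"
```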
It's also worth noting that this environment variable may contain more
than one key-value pair, so we append `",expandable_segments:True"` to the
current config rather than overwriting it.
For example:
```python
PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True"
```
> [!NOTE]
> `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled
> together with `expandable_segments=True`.
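The append-with-guards logic can be sketched as a small standalone helper. Note that `merge_alloc_conf` is a hypothetical name for illustration; the PR inlines this logic in `NPUPlatform` rather than defining a function:

```python
from typing import Optional

# Options that must not coexist with (or duplicate) expandable_segments.
_GUARD_KEYS = ("expandable_segments", "max_split_size_mb",
               "garbage_collection_threshold")

def merge_alloc_conf(current: Optional[str]) -> str:
    """Return PYTORCH_NPU_ALLOC_CONF with expandable_segments:True appended,
    unless it is already set or conflicts with existing options."""
    if not current:
        return "expandable_segments:True"
    # Skip appending if the option is already present, or if incompatible
    # options (max_split_size_mb / garbage_collection_threshold) are set.
    for key in _GUARD_KEYS:
        if key in current:
            return current
    return current + ",expandable_segments:True"
```

For example, `merge_alloc_conf("page_size:1g")` yields `"page_size:1g,expandable_segments:True"`, while an existing `"max_split_size_mb:128"` config is left untouched.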
### Does this PR introduce _any_ user-facing change?
Users no longer need to set
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually.
### How was this patch tested?
I have built a dataset consisting of my own photographs, which can stably
reproduce this OOM issue on Qwen3-VL series models.
After applying this PR, the problem is resolved and non-torch memory stays
stable at around 1 GB throughout inference.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
```diff
@@ -372,6 +372,25 @@ class NPUPlatform(Platform):
                     "Please set VLLM_ASCEND_ENABLE_FLASHCOMM1=0."
                 )
 
+        # Set "PYTORCH_NPU_ALLOC_CONF=expandable_segments:True" by default to optimize NPU memory management.
+        # Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue
+        # NOTE: We should not set this environment variable in RL (sleep mode) scenarios.
+        # Find more details about how to configure this environment variable at https://www.hiascend.com/document/detail/zh/Pytorch/720/comref/Envvariables/Envir_012.html
+        if model_config and not model_config.enable_sleep_mode:
+            npu_alloc_configs = os.getenv("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")
+            # This environment variable may have more than one key-value pair.
+            # We should append ",expandable_segments:True" to the current configs.
+            # For example: "page_size:1g" + ",expandable_segments:True".
+            # NOTE: `max_split_size_mb` or `garbage_collection_threshold` cannot
+            # be enabled together with `expandable_segments=True`.
+            if "expandable_segments" not in npu_alloc_configs and \
+                    "max_split_size_mb" not in npu_alloc_configs and \
+                    "garbage_collection_threshold" not in npu_alloc_configs:
+                npu_alloc_configs += ",expandable_segments:True"
+            os.environ["PYTORCH_NPU_ALLOC_CONF"] = npu_alloc_configs
+            logger.info("Set PYTORCH_NPU_ALLOC_CONF=%s", npu_alloc_configs)
+
     @classmethod
     def import_kernels(cls) -> None:
         # Directly importing vllm_ascend_C prevents ASCEND_RT_VISIBLE_DEVICES
```