From ad3a1eaf70f5da50379cb9bfaa2e3595dd2b36f6 Mon Sep 17 00:00:00 2001
From: Shanshan Shen <467638484@qq.com>
Date: Mon, 19 Jan 2026 09:17:31 +0800
Subject: [PATCH] [Bugfix][MM] Fix multi-modal inference OOM issues by setting
 `expandable_segments:True` (#5855)

### What this PR does / why we need it?

As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339, multi-modal inference on vllm-ascend may lead to OOM issues in some scenarios. Our analysis shows that this is due to memory fragmentation caused by frequent dynamic memory size adjustments at runtime: during inference, non-torch memory grows gradually from around 1 GB to over 5 GB until the OOM occurs.

We found that this problem can be resolved by simply setting `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue. Thus, we decided to set this value by default, except in RL (sleep mode) scenarios.

It is also worth noting that this environment variable may contain more than one key-value pair, in which case we should append `",expandable_segments:True"` to the current configs. For example:

```python
PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True"
```

> [!NOTE]
> `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled together with `expandable_segments:True`.
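For illustration, here is a minimal standalone sketch of that merging rule. The helper name `merge_expandable_segments` is hypothetical; it only mirrors the logic this patch adds to `vllm_ascend/platform.py`:

```python
import os

def merge_expandable_segments() -> None:
    """Hypothetical helper mirroring the logic added by this patch."""
    # Default to the desired setting when the variable is unset.
    configs = os.getenv("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")
    # Append only if the user has not already configured expandable_segments
    # and none of the incompatible options are present.
    blockers = ("expandable_segments", "max_split_size_mb",
                "garbage_collection_threshold")
    if not any(key in configs for key in blockers):
        configs += ",expandable_segments:True"
    os.environ["PYTORCH_NPU_ALLOC_CONF"] = configs

# Example: with PYTORCH_NPU_ALLOC_CONF="page_size:1g", the resulting value is
# "page_size:1g,expandable_segments:True".
```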
+ if "expandable_segments" not in npu_alloc_configs and \ + "max_split_size_mb" not in npu_alloc_configs and \ + "garbage_collection_threshold" not in npu_alloc_configs: + npu_alloc_configs += ",expandable_segments:True" + os.environ["PYTORCH_NPU_ALLOC_CONF"] = npu_alloc_configs + logger.info("Set PYTORCH_NPU_ALLOC_CONF=%s", npu_alloc_configs) + + @classmethod def import_kernels(cls) -> None: # Directly importing vllm_ascend_C prevents ASCEND_RT_VISIBLE_DEVICES