Files
xc-llm-ascend/vllm_ascend
Shanshan Shen ad3a1eaf70 [Bugfix][MM] Fix multi-modal inference OOM issues by setting expandable_segments:True (#5855)
### What this PR does / why we need it?

As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339,
multi-modal inference on vllm-ascend may lead to OOM issues in some
scenarios.

After our analysis, this is due to the memory fragmentation caused by
frequent dynamic memory size adjustments during runtime. During the
inference, the figure for non-torch memory see a gradual increase from
around 1G to over 5G until the OOM issue occurs.

We find that this problem can be resolved by just directly setting
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at
https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue.
Thus, we decide to set this value by default, except RL (sleep mode)
scenarios.

It's also worthy to note that this environment variable may have more
than one key-value pairs. We should append `",expandable_segments:True"`
to the current configs.

For example:

```python
PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True".
```

> [!NOTE]
> `max_split_size_mb` or `garbage_collection_threshold` cannot be
enabled together with `expandable_segments=True`.

### Does this PR introduce _any_ user-facing change?
Users do not need to set
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually any more.

### How was this patch tested?

I have build a dataset consisting of my own photographs, which can
stably reproduce this OOM issue on Qwen3-VL serie models.

After apply this PR, this problem has been resolved and the amount of
non-torch memory will keep stable at around 1G throughout the whole
inference.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-19 09:17:31 +08:00
..
2025-12-20 17:03:25 +08:00
2026-01-15 10:26:44 +08:00