[Bugfix][MM] Fix multi-modal inference OOM issues by setting expandable_segments:True (#5855)
### What this PR does / why we need it?
As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339,
multi-modal inference on vllm-ascend may lead to OOM issues in some
scenarios.
After our analysis, this is due to memory fragmentation caused by frequent
dynamic memory size adjustments at runtime. During inference, non-torch
memory gradually increases from around 1 GB to over 5 GB until the OOM
occurs.
We found that this problem can be resolved by simply setting
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at
https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue.
Thus, we decided to set this value by default, except in RL (sleep mode)
scenarios.
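Until this default is in place, the manual workaround from the FAQ can be sketched as follows (setting the variable in the shell before launching the server; the `echo` is only there to confirm the value):

```shell
# Manual workaround that this PR automates: configure the NPU allocator
# to use expandable segments before starting the vLLM process.
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_NPU_ALLOC_CONF"
```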
It's also worth noting that this environment variable may contain more
than one key-value pair, so we append `",expandable_segments:True"` to the
current config rather than overwriting it.
For example:
```python
PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True"
```
> [!NOTE]
> `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled
> together with `expandable_segments=True`.
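The append-with-guards logic can be sketched as a small standalone helper. Note that `merge_alloc_conf` is a hypothetical name for illustration; the PR inlines this logic in `NPUPlatform` rather than defining a function:

```python
from typing import Optional

# Options that must not coexist with (or duplicate) expandable_segments.
_GUARD_KEYS = ("expandable_segments", "max_split_size_mb",
               "garbage_collection_threshold")

def merge_alloc_conf(current: Optional[str]) -> str:
    """Return PYTORCH_NPU_ALLOC_CONF with expandable_segments:True appended,
    unless it is already set or conflicts with existing options."""
    if not current:
        return "expandable_segments:True"
    # Skip appending if the option is already present, or if incompatible
    # options (max_split_size_mb / garbage_collection_threshold) are set.
    for key in _GUARD_KEYS:
        if key in current:
            return current
    return current + ",expandable_segments:True"
```

For example, `merge_alloc_conf("page_size:1g")` yields `"page_size:1g,expandable_segments:True"`, while an existing `"max_split_size_mb:128"` config is left untouched.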
### Does this PR introduce _any_ user-facing change?
Users no longer need to set
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually.
### How was this patch tested?
I have built a dataset consisting of my own photographs, which can stably
reproduce this OOM issue on Qwen3-VL series models.
After applying this PR, the problem is resolved and non-torch memory stays
stable at around 1 GB throughout inference.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
```diff
@@ -372,6 +372,25 @@ class NPUPlatform(Platform):
                     "Please set VLLM_ASCEND_ENABLE_FLASHCOMM1=0."
                 )
 
+        # Set "PYTORCH_NPU_ALLOC_CONF=expandable_segments:True" by default to optimize NPU memory management.
+        # Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue
+        # NOTE: We should not set this environment variable in RL (sleep mode) scenarios.
+        # Find more details about how to configure this environment variable at https://www.hiascend.com/document/detail/zh/Pytorch/720/comref/Envvariables/Envir_012.html
+        if model_config and not model_config.enable_sleep_mode:
+            npu_alloc_configs = os.getenv("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")
+            # This environment variable may have more than one key-value pair.
+            # We should append ",expandable_segments:True" to the current configs.
+            # For example: "page_size:1g" + ",expandable_segments:True".
+            # NOTE: `max_split_size_mb` or `garbage_collection_threshold` cannot
+            # be enabled together with `expandable_segments=True`.
+            if "expandable_segments" not in npu_alloc_configs and \
+                    "max_split_size_mb" not in npu_alloc_configs and \
+                    "garbage_collection_threshold" not in npu_alloc_configs:
+                npu_alloc_configs += ",expandable_segments:True"
+            os.environ["PYTORCH_NPU_ALLOC_CONF"] = npu_alloc_configs
+            logger.info("Set PYTORCH_NPU_ALLOC_CONF=%s", npu_alloc_configs)
+
     @classmethod
     def import_kernels(cls) -> None:
         # Directly importing vllm_ascend_C prevents ASCEND_RT_VISIBLE_DEVICES
```