xc-llm-ascend/vllm_ascend at ad3a1eaf70f5da50379cb9bfaa2e3595dd2b36f6 - xc-llm-ascend

EngineX/xc-llm-ascend

Shanshan Shen ad3a1eaf70 [Bugfix][MM] Fix multi-modal inference OOM issues by setting expandable_segments:True (#5855 )

### What this PR does / why we need it?

As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339,
multi-modal inference on vllm-ascend may lead to OOM issues in some
scenarios.

Our analysis shows that this is caused by memory fragmentation from
frequent dynamic memory size adjustments at runtime. During inference,
non-torch memory gradually increases from around 1G to over 5G until
the OOM issue occurs.

We found that this problem can be resolved by setting
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. More details are at
https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue.
Thus, we set this value by default, except in RL (sleep mode)
scenarios.

It is also worth noting that this environment variable may hold more
than one key-value pair, so we append `",expandable_segments:True"`
to the existing value rather than overwrite it.

For example:

```python
PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True"
```

> [!NOTE]
> `max_split_size_mb` or `garbage_collection_threshold` cannot be
> enabled together with `expandable_segments:True`.
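
The append rule and the note above can be sketched roughly as follows. This is a minimal illustration, not the code this PR adds to `platform.py`: the helper names `with_expandable_segments` and `apply_default` and the `sleep_mode_enabled` flag are hypothetical, and leaving the value untouched when a conflicting key is present is an assumed policy.

```python
import os

ENV_KEY = "PYTORCH_NPU_ALLOC_CONF"
# Keys the note says cannot be combined with expandable_segments:True.
CONFLICTING_KEYS = {"max_split_size_mb", "garbage_collection_threshold"}


def with_expandable_segments(current: str) -> str:
    """Return `current` with expandable_segments:True appended when safe.

    Hypothetical helper: the value is left as-is (apart from whitespace
    normalization) if the user already set expandable_segments or uses a
    key that conflicts with it.
    """
    entries = [e.strip() for e in current.split(",") if e.strip()]
    keys = {e.split(":", 1)[0].strip() for e in entries}
    if "expandable_segments" in keys or keys & CONFLICTING_KEYS:
        return ",".join(entries)
    entries.append("expandable_segments:True")
    return ",".join(entries)


def apply_default(sleep_mode_enabled: bool = False) -> None:
    """Set the default unless RL (sleep mode) is in use (assumed flag)."""
    if sleep_mode_enabled:
        return
    os.environ[ENV_KEY] = with_expandable_segments(os.environ.get(ENV_KEY, ""))
```

With this sketch, an existing value of `page_size:1g` becomes `page_size:1g,expandable_segments:True`, matching the example above, while a value that already contains `max_split_size_mb` is not modified.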

### Does this PR introduce _any_ user-facing change?
Users no longer need to set
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually.

### How was this patch tested?

I built a dataset consisting of my own photographs, which can
stably reproduce this OOM issue on Qwen3-VL series models.

After applying this PR, the problem is resolved and the amount of
non-torch memory stays stable at around 1G throughout the whole
inference.

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

2026-01-19 09:17:31 +08:00

[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 )

2026-01-17 11:49:18 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

Revert "[bugfix]limit graph replay sync (#5761 )" (#5965 )

2026-01-16 23:29:35 +08:00

[CI] fix lint (#5216 )

2025-12-20 17:03:25 +08:00

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

device_allocator

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

[P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (#5968 )

2026-01-17 18:49:27 +08:00

[EPLB]Eplb Config Renaming (#5533 )

2026-01-15 10:26:44 +08:00

[Main2Main] Upgrade vllm commit to 0113 (#5839 )

2026-01-15 09:48:53 +08:00

[BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394 )

2026-01-09 15:57:43 +08:00

[BugFix] NetLoader: No backend type associated with device type npu (#5700 )

2026-01-09 15:54:54 +08:00

[Feature] Support fine-grained shared expert overlap (#5482 )

2026-01-17 11:53:22 +08:00

[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 )

2026-01-17 11:49:18 +08:00

[Feature] Support fine-grained shared expert overlap (#5482 )

2026-01-17 11:53:22 +08:00

[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542 )

2026-01-08 09:15:55 +08:00

Eagle3 mm support, enablement on qwen3vl (#4848 )

2026-01-19 08:58:07 +08:00

Eagle3 mm support, enablement on qwen3vl (#4848 )

2026-01-19 08:58:07 +08:00

[BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711 )

2026-01-09 15:55:30 +08:00

__init__.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

ascend_config.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

ascend_forward_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

batch_invariant.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

cpu_binding.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

envs.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

flash_common3_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

meta_registration.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

platform.py

[Bugfix][MM] Fix multi-modal inference OOM issues by setting expandable_segments:True (#5855 )

2026-01-19 09:17:31 +08:00

profiling_config.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

utils.py

[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 )

2026-01-17 11:49:18 +08:00