[Refactor][Bugfix] Use upstream mem_utils for profiling and correct non-torch memory recorded during profiling (#6625)

### What this PR does / why we need it?

1. Following https://github.com/vllm-project/vllm/pull/32322, use the
`memory_profiling` context manager from vLLM for profiling.
2. Fix the incorrect non-torch memory value recorded during profiling,
which underestimates its peak during inference.
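
The shape of such a profiling context manager can be sketched as follows. This is an illustrative mock only, NOT the actual vLLM API: `fake_device` is a hypothetical stand-in for real NPU/GPU memory queries, and the field name `consumed_memory` mirrors the one used in the diff below.

```python
# Illustrative sketch only -- NOT the actual vLLM API. It mimics the shape of
# a memory_profiling context manager: snapshot device memory on entry, report
# the consumed delta on exit. `fake_device` is a hypothetical stand-in for
# real NPU/GPU memory queries.
from contextlib import contextmanager
from types import SimpleNamespace

fake_device = {"used": 0}  # hypothetical device memory counter (bytes)

@contextmanager
def memory_profiling():
    result = SimpleNamespace(consumed_memory=0)
    before = fake_device["used"]
    yield result
    # Everything allocated inside the `with` block counts as consumed.
    result.consumed_memory = fake_device["used"] - before

with memory_profiling() as m:
    fake_device["used"] += 5 * 2**30  # e.g. loading model weights (5 GiB)

print(m.consumed_memory / float(2**30))  # 5.0
```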

---
**More details about point 2:**

After profiling, the non-torch memory value we record is lower than the
value observed during real inference. This is mainly because of the
different memory management behaviour of `torch.cuda.empty_cache()` and
`torch.npu.empty_cache()`.

`torch.cuda.empty_cache()` only recycles unused memory in the PyTorch
memory pool (i.e., memory managed by the PyTorch caching allocator),
**with no effect on non-torch memory**. However, `torch.npu.empty_cache()`
has a totally different memory management mechanism: it may call
`aclrtSynchronize` and **allow the Ascend runtime to free non-torch
memory**.

Thus, the non-torch memory value recorded after
`torch.npu.empty_cache()` is much lower than its peak during profiling.
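
The effect can be illustrated with toy numbers (all values below are hypothetical, chosen only to show the relationship; the helper `non_torch_memory` is an illustration, not vLLM code):

```python
# Hedged sketch with hypothetical numbers: how non-torch memory is derived,
# and why measuring it after torch.npu.empty_cache() under-reports the peak.

def non_torch_memory(total_used: int, torch_reserved: int) -> int:
    """Non-torch memory = total device memory in use minus memory
    held by the PyTorch caching allocator."""
    return total_used - torch_reserved

# Before empty_cache(): runtime buffers are still resident (the true peak).
peak = non_torch_memory(total_used=10_000, torch_reserved=8_900)   # 1_100
# After torch.npu.empty_cache(): the Ascend runtime freed some of them too.
after = non_torch_memory(total_used=9_600, torch_reserved=8_700)   # 900

# The gap is the amount empty_cache() reclaimed from non-torch memory --
# memory that will be occupied again during real inference.
cleared_by_empty_cache = peak - after  # 200
```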

Resolution:

We record the peak non-torch memory value
(`non_torch_memory_before_empty_cache`) after profiling but before
`torch.npu.empty_cache()`. Then we add the diff
(`non_torch_memory_cleared_by_empty_cache =
non_torch_memory_before_empty_cache - self.non_torch_memory`) back to
non-torch memory when calculating the available KV cache memory. This
yields less KV cache memory, which is the safe direction for avoiding
OOM issues.
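
The correction above can be sketched as follows. The function and its parameters are hypothetical illustrations of the calculation, not the actual `NPUModelRunner` code; only the variable names `non_torch_memory` and `non_torch_memory_before_empty_cache` come from the PR.

```python
# Hedged sketch of the correction described above; the function signature is
# a hypothetical illustration, not the actual NPUModelRunner implementation.

def available_kv_cache_memory(total_memory: int,
                              weights_memory: int,
                              torch_peak_memory: int,
                              non_torch_memory: int,
                              non_torch_memory_before_empty_cache: int) -> int:
    # Memory that torch.npu.empty_cache() reclaimed from non-torch usage;
    # during real inference it is occupied again, so charge it back.
    non_torch_memory_cleared_by_empty_cache = (
        non_torch_memory_before_empty_cache - non_torch_memory)
    effective_non_torch = (non_torch_memory
                           + non_torch_memory_cleared_by_empty_cache)
    # Handing less memory to the KV cache is the safe direction (avoids OOM).
    return (total_memory - weights_memory
            - torch_peak_memory - effective_non_torch)

# Toy units: 100 total, 40 weights, 20 torch peak, 9 non-torch after
# empty_cache, 11 before it -> 100 - 40 - 20 - 11 = 29.
print(available_kv_cache_memory(100, 40, 20, 9, 11))  # 29
```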

---
> [!NOTE]
> This PR needs to wait for main2main aligning to the latest vLLM commit
> before merging.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Before this PR, the non-torch memory used to calculate the available KV
cache memory was **0.90 G**, whereas its peak during real inference was
**1.08 G**, a diff of **182.00 M**.

After this PR, this diff is added to the non-torch memory after
profiling, making the profiling results more accurate.
- vLLM version: v0.15.0
- vLLM main: d7e17aaacd

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Commit: 957804df56 (parent 812c722cfb)
Author: Shanshan Shen, 2026-02-25 14:28:08 +08:00, committed by GitHub
2 changed files with 57 additions and 31 deletions


```diff
@@ -2330,6 +2330,7 @@ class NPUModelRunner(GPUModelRunner):
         if self.lora_config:
             self.model = self.load_lora_model(self.model, self.vllm_config, self.device)
         self.model_memory_usage = m.consumed_memory
         logger.info("Loading model weights took %.4f GB", m.consumed_memory / float(2**30))
         # wrap the model with full graph wrapper if needed.
```