From 41eb71d665ab9f0b72b6d3bc15d41dee7fcc0f5f Mon Sep 17 00:00:00 2001 From: TMC <87188729+mengchengTang@users.noreply.github.com> Date: Tue, 27 Jan 2026 22:09:50 +0800 Subject: [PATCH] [Refactor] profiler config optimize (#6141) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### What this PR does / why we need it? This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include: **Enable Data Simplification**: Explicitly sets data_simplification=True in _ExperimentalConfig. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead. **Use Lightweight Stack Tracing**: Replaces with_stack with with_modules when torch_profiler_with_stack is enabled. In torch_npu, with_stack introduces heavy latency. with_modules provides equivalent semantic information with much lower overhead. **Code Simplification:** Removes redundant parameter configurations in _ExperimentalConfig by utilizing default values, making the codebase cleaner and easier to maintain. **Test setup:** max length = 50, profiler + stack enabled **Before optimization:** Profiler data size: 651 MB Generate time: 3 seconds **After optimization:** Profiler data size: 156 MB (≈76% reduction) Generate time: <1 second ### Does this PR introduce _any_ user-facing change? No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled. ### How was this patch tested? Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly containing necessary stack/module info, while showing the reported reduction in size and time. 
- vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60 Signed-off-by: mengchengTang <745274877@qq.com> --- vllm_ascend/worker/worker.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/vllm_ascend/worker/worker.py b/vllm_ascend/worker/worker.py index 15b45ee4..27cc3ca5 100644 --- a/vllm_ascend/worker/worker.py +++ b/vllm_ascend/worker/worker.py @@ -561,7 +561,7 @@ class NPUWorker(WorkerBase): aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone, l2_cache=False, op_attr=False, - data_simplification=False, + data_simplification=True, record_op_args=False, gc_detect_threshold=None, ) @@ -571,9 +571,11 @@ class NPUWorker(WorkerBase): torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU, ], - with_stack=profiler_config.torch_profiler_with_stack, + with_stack=False, profile_memory=profiler_config.torch_profiler_with_memory, - with_modules=False, + # NOTE: torch_npu.profiler.with_modules is equivalent to torch.profiler.with_stack. + # The with_stack option in torch_npu.profiler introduces significant time overhead. + with_modules=profiler_config.torch_profiler_with_stack, experimental_config=experimental_config, on_trace_ready=torch_npu.profiler.tensorboard_trace_handler( torch_profiler_trace_dir))