[Refactor] profiler config optimze (#6141)

### What this PR does / why we need it?
This PR optimizes the torch_npu profiler configuration to significantly
reduce overhead and trace file size. The key changes include:
**Enable Data Simplification**: Explicitly sets data_simplification=True
in _ExperimentalConfig. This filters out unnecessary intermediate data
during profiling, drastically reducing the memory footprint and I/O
overhead.
**Use Lightweight Stack Tracing**: Replaces with_stack with with_modules
when torch_profiler_with_stack is enabled. In torch_npu, with_stack
introduces heavy latency. with_modules provides equivalent semantic
information with much lower overhead.
**Code Simplification:** Removes redundant parameter configurations in
_ExperimentalConfig by utilizing default values, making the codebase
cleaner and easier to maintain.

**Test setup:**
 max length = 50, profiler + stack enabled

**Before optimization:**
Profiler data size: 651 MB
Generate time: 3 seconds

**After optimization:**
Profiler data size: 156 MB (≈76% reduction)
Generate time: <1 second

### Does this PR introduce _any_ user-facing change?
No API changes. Users profiling on Ascend will experience faster
profiling execution and smaller trace files when stack tracing is
enabled.
### How was this patch tested?
Manually verified on Ascend NPU by running vLLM with the profiler
enabled. Confirmed that trace files are generated correctly containing
necessary stack/module info, while showing the reported reduction in
size and time.

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: mengchengTang <745274877@qq.com>
This commit is contained in:
TMC
2026-01-27 22:09:50 +08:00
committed by GitHub
parent 54e8389f8e
commit 41eb71d665

View File

@@ -561,7 +561,7 @@ class NPUWorker(WorkerBase):
aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
l2_cache=False,
op_attr=False,
data_simplification=False,
data_simplification=True,
record_op_args=False,
gc_detect_threshold=None,
)
@@ -571,9 +571,11 @@ class NPUWorker(WorkerBase):
torch_npu.profiler.ProfilerActivity.CPU,
torch_npu.profiler.ProfilerActivity.NPU,
],
with_stack=profiler_config.torch_profiler_with_stack,
with_stack=False,
profile_memory=profiler_config.torch_profiler_with_memory,
with_modules=False,
# NOTE: torch_npu.profiler.with_modules is equivalent to torch.profiler.with_stack.
# The with_stack option in torch_npu.profiler introduces significant time overhead.
with_modules=profiler_config.torch_profiler_with_stack,
experimental_config=experimental_config,
on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
torch_profiler_trace_dir))