[Refactor] profiler config optimze (#6141)
### What this PR does / why we need it?
This PR optimizes the torch_npu profiler configuration to significantly
reduce overhead and trace file size. The key changes include:
**Enable Data Simplification**: Explicitly sets data_simplification=True
in _ExperimentalConfig. This filters out unnecessary intermediate data
during profiling, drastically reducing the memory footprint and I/O
overhead.
**Use Lightweight Stack Tracing**: Replaces with_stack with with_modules
when torch_profiler_with_stack is enabled. In torch_npu, with_stack
introduces heavy latency. with_modules provides equivalent semantic
information with much lower overhead.
**Code Simplification:** Removes redundant parameter configurations in
_ExperimentalConfig by utilizing default values, making the codebase
cleaner and easier to maintain.
**Test setup:**
max length = 50, profiler + stack enabled
**Before optimization:**
Profiler data size: 651 MB
Generate time: 3 seconds
**After optimization:**
Profiler data size: 156 MB (≈76% reduction)
Generate time: <1 second
### Does this PR introduce _any_ user-facing change?
No API changes. Users profiling on Ascend will experience faster
profiling execution and smaller trace files when stack tracing is
enabled.
### How was this patch tested?
Manually verified on Ascend NPU by running vLLM with the profiler
enabled. Confirmed that trace files are generated correctly containing
necessary stack/module info, while showing the reported reduction in
size and time.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: mengchengTang <745274877@qq.com>
This commit is contained in:
@@ -561,7 +561,7 @@ class NPUWorker(WorkerBase):
|
||||
aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
|
||||
l2_cache=False,
|
||||
op_attr=False,
|
||||
data_simplification=False,
|
||||
data_simplification=True,
|
||||
record_op_args=False,
|
||||
gc_detect_threshold=None,
|
||||
)
|
||||
@@ -571,9 +571,11 @@ class NPUWorker(WorkerBase):
|
||||
torch_npu.profiler.ProfilerActivity.CPU,
|
||||
torch_npu.profiler.ProfilerActivity.NPU,
|
||||
],
|
||||
with_stack=profiler_config.torch_profiler_with_stack,
|
||||
with_stack=False,
|
||||
profile_memory=profiler_config.torch_profiler_with_memory,
|
||||
with_modules=False,
|
||||
# NOTE: torch_npu.profiler.with_modules is equivalent to torch.profiler.with_stack.
|
||||
# The with_stack option in torch_npu.profiler introduces significant time overhead.
|
||||
with_modules=profiler_config.torch_profiler_with_stack,
|
||||
experimental_config=experimental_config,
|
||||
on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
|
||||
torch_profiler_trace_dir))
|
||||
|
||||
Reference in New Issue
Block a user