[Refactor] profiler config optimze (#6141)

### What this PR does / why we need it? This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include: **Enable Data Simplification**: Explicitly sets data_simplification=True in _ExperimentalConfig. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead. **Use Lightweight Stack Tracing**: Replaces with_stack with with_modules when torch_profiler_with_stack is enabled. In torch_npu, with_stack introduces heavy latency. with_modules provides equivalent semantic information with much lower overhead. **Code Simplification:** Removes redundant parameter configurations in _ExperimentalConfig by utilizing default values, making the codebase cleaner and easier to maintain. **Test setup:** max length = 50, profiler + stack enabled **Before optimization:** Profiler data size: 651 MB Generate time: 3 seconds **After optimization:** Profiler data size: 156 MB (≈76% reduction) Generate time: <1 second ### Does this PR introduce _any_ user-facing change? No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled. ### How was this patch tested? Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly containing necessary stack/module info, while showing the reported reduction in size and time. - vLLM version: v0.13.0 - vLLM main: d68209402d Signed-off-by: mengchengTang <745274877@qq.com>
2026-01-27 22:09:50 +08:00
parent 54e8389f8e
commit 41eb71d665
1 changed files with 5 additions and 3 deletions
--- a/vllm_ascend/worker/worker.py
+++ b/vllm_ascend/worker/worker.py
@@ -561,7 +561,7 @@ class NPUWorker(WorkerBase):
                aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
                l2_cache=False,
                op_attr=False,
-                data_simplification=False,
+                data_simplification=True,
                record_op_args=False,
                gc_detect_threshold=None,
            )
@@ -571,9 +571,11 @@ class NPUWorker(WorkerBase):
                    torch_npu.profiler.ProfilerActivity.CPU,
                    torch_npu.profiler.ProfilerActivity.NPU,
                ],
-                with_stack=profiler_config.torch_profiler_with_stack,
+                with_stack=False,
                profile_memory=profiler_config.torch_profiler_with_memory,
-                with_modules=False,
+                # NOTE: torch_npu.profiler.with_modules is equivalent to torch.profiler.with_stack.
+                # The with_stack option in torch_npu.profiler introduces significant time overhead.
+                with_modules=profiler_config.torch_profiler_with_stack,
                experimental_config=experimental_config,
                on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
                    torch_profiler_trace_dir))