[doc] Add Ascend PyTorch Profiler section (#7117)

### What this PR does / why we need it?
add Ascend PyTorch Profiler section

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
- Documentation format checks
- Technical content validation
- Build verification
- Version compatibility
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: herizhen <1270637059@qq.com>
This commit is contained in:
herizhen
2026-03-12 15:51:00 +08:00
committed by GitHub
parent 132f3c5d0a
commit e5024d0264


# Service Profiling Guide
In an inference service process, it is sometimes necessary to monitor the internal execution flow of the inference service framework to identify performance issues. By collecting start and end timestamps of key processes, identifying key functions or iterations, recording critical events, and gathering various types of information, performance bottlenecks can be quickly located.
This guide will walk you through the process of collecting performance data from the vLLM-Ascend service framework and operators. It covers the complete workflow from preparation, collection, analysis, to visualization, helping you quickly get started with performance collection tools.
## Quick Start
Two performance collection solutions are provided below: Ascend PyTorch Profiler and MS Service Profiler. You can choose the appropriate tool for performance analysis and troubleshooting based on your actual requirements.
## Solution Comparison
| Feature | Ascend PyTorch Profiler | MS Service Profiler |
|:-----|:------------------------|:------------------|
| Installation Method | Built-in, no additional installation required | Requires pip installation of msserviceprofiler |
| Collection Granularity | PyTorch operator level | Service framework function level |
| Control Method | API request control | Configuration file control |
| Applicable Scenarios | Model operator performance analysis | Service framework workflow analysis |
| Data Format | ascend_pt format | Chrome Tracing + CSV |
| Main Advantage | Operator-level performance analysis | Service framework workflow visualization |
## Quick Selection Guide
- [**Model Operator Performance** → Use Ascend PyTorch Profiler](#ascend-pytorch-profiler)
- [**Service Framework Workflow** → Use MS Service Profiler](#ms-service-profiler)
---
## Ascend PyTorch Profiler
### 0. Installation and Configuration
No additional packages need to be installed; profiling is enabled entirely through command-line configuration. Note that vLLM currently collects the **python stack** by default, which can significantly inflate the collected performance data. If you do not need the **python stack**, disable it with `torch_profiler_with_stack=false`.
### 1. Preparation for Collection
Start the online service and set the `--profiler-config` parameter to control the path for saving performance files. After the parameter is set, the collection function is enabled.
```bash
export VLLM_PROMPT_SEQ_BUCKET_MAX=128
export VLLM_PROMPT_SEQ_BUCKET_MIN=128
python3 -m vllm.entrypoints.openai.api_server \
--port 8080 \
--model "facebook/opt-125m" \
--tensor-parallel-size 1 \
--max-num-seqs 128 \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": false}' \
--dtype bfloat16 \
--max-model-len 256
```
> **Note:** As of January 19, 2026, the vLLM mainline has deprecated the `VLLM_TORCH_PROFILER_DIR` environment variable ([related PR](https://github.com/vllm-project/vllm-ascend/pull/5928)). When collecting profiler data with the vLLM Ascend mainline code, remember to use the `--profiler-config` (online) or `profiler_config` (offline) parameter.
### 2. Start Collection
Performance collection is controlled by sending API requests. You can either wait until real business traffic has reached a steady state, start collection, and stop it after a few seconds; or start collection first, then send the business requests, and stop afterwards.
Send the following request to start the profiling service:
```bash
curl -X POST http://localhost:8080/start_profile
```
Send the following request to stop the profiling service:
```bash
curl -X POST http://localhost:8080/stop_profile
```
### 3. Send Requests
Send requests according to your actual business data. After sending the requests, stop the profiling service, and the data will be automatically saved to the previously configured path:
```bash
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
curl -X POST http://localhost:8080/stop_profile
```
### 4. Analyze Data
Navigate to the `./vllm_profile` directory and locate the generated `*ascend_pt` folder. This folder needs to be analyzed before profiling data can be examined.
```python
from torch_npu.profiler.profiler import analyse
analyse("./vllm_profile/localhost.localdomain_XXXXXXXXXX_ascend_pt/")
```
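When several runs or ranks write into the same output directory, each produces its own `*_ascend_pt` folder. The snippet above can be extended to analyse all of them; the `find_ascend_pt_dirs` helper here is an illustrative sketch, not part of vLLM or torch_npu:

```python
from pathlib import Path

def find_ascend_pt_dirs(root: str) -> list:
    """Return every *_ascend_pt result folder under the profiling output dir."""
    return sorted(str(p) for p in Path(root).glob("*_ascend_pt") if p.is_dir())

if __name__ == "__main__":
    # torch_npu is only available on Ascend hosts, so import it lazily here.
    from torch_npu.profiler.profiler import analyse
    for result_dir in find_ascend_pt_dirs("./vllm_profile"):
        analyse(result_dir)
```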
### 5. View Results
After analysis, the `*ascend_pt` directory will contain many files, with the main analysis focus being the `ASCEND_PROFILER_OUTPUT` folder. This directory will include the following files:
- `analysis.db`: Performance data in database format
- `api_statistic.csv`: API call statistics
- `ascend_pytorch_profiler_0.db`: Performance data in database format
- `kernel_details.csv`: Kernel-level related data
- `operator_details.csv`: Operator-level related data
- `op_statistic.csv`: Operator utilization data
- `step_trace_time.csv`: Scheduling data
- `trace_view.json`: Chrome tracing format data, can be opened with [MindStudio Insight](https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html)
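The CSV files above lend themselves to quick ad-hoc analysis with the standard library. A hedged sketch for pulling out the most expensive rows follows; the `top_rows` helper is illustrative, and column names such as `Duration(us)` vary by CANN version, so check the actual CSV header first:

```python
import csv

def top_rows(csv_path: str, column: str, n: int = 5) -> list:
    """Return the n rows whose `column` value (parsed as float) is largest."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return sorted(rows, key=lambda r: float(r[column]), reverse=True)[:n]

# e.g. top_rows("ASCEND_PROFILER_OUTPUT/kernel_details.csv", "Duration(us)", 10)
```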
[↑ Back to Top](#service-profiling-guide)
---
## MS Service Profiler
### 0. Installation
Install the `msserviceprofiler` package using pip:
```bash
pip install msserviceprofiler==1.2.2
```
### 1. Preparation
Before starting the service, set the environment variable `SERVICE_PROF_CONFIG_PATH` to point to the profiling configuration file, and set the environment variable `PROFILING_SYMBOLS_PATH` to specify the YAML configuration file for the symbols that need to be imported. After that, start the vLLM service according to your deployment method.
The file `ms_service_profiler_config.json` is the profiling configuration. If it does not exist at the path you specified, the system will generate a default configuration file there for future configuration.
`service_profiling_symbols.yaml` is the configuration file containing the profiling points to be imported. You can choose **not** to set the `PROFILING_SYMBOLS_PATH` environment variable, in which case the default configuration file will be used. If the file does not exist at the path you specified, likewise, the system will generate a configuration file at your specified path for future configuration. You can customize it according to the instructions in the `Symbols Configuration File` section below.
### 2. Enable Profiling
To enable the performance data collection switch, change the `enable` field from `0` to `1` in the configuration file `ms_service_profiler_config.json`. This can be accomplished by executing the following sed command:
```bash
sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
```
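The `sed` one-liner works, but editing JSON with regular expressions is brittle. A JSON-aware alternative is sketched below; the `set_enable` helper is illustrative and assumes the file is valid JSON with a top-level `enable` key:

```python
import json

def set_enable(config_path: str, value: int = 1) -> None:
    """Flip the `enable` switch in the profiling configuration file."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["enable"] = value
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)

# set_enable("./ms_service_profiler_config.json", 1)
```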
### 3. Send Requests
Choose a request-sending method that suits your actual profiling needs:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a",
"max_tokens": 5,
"temperature": 0
}' | python3 -m json.tool
```
### 4. Analyze Data
```bash
# xxxx-xxxx is the directory automatically created based on vLLM startup time
cd /root/.ms_server_profiler/xxxx-xxxx
msserviceprofiler analyze --input-path=./ --output-path output
```
### 5. View Results
After analysis, the `output` directory will contain:
---
## Appendix
### 6. Appendix related to MS Service Profiler
(profiling-configuration-file)=
#### 6.1 Profiling Configuration File
The profiling configuration file controls profiling parameters and behavior.
##### File Format
The configuration is in JSON format. Main parameters:
| Parameter | Description | Required |
|:-----|:-----|:-----|
| timelimit | Profiling duration for the service. The process stops automatically after this time. Range: integer 0-7200, unit: seconds. Default 0 means unlimited. | No |
| domain | Limit profiling to the specified domains to reduce data volume. String, separated by semicolons, case-sensitive, e.g., "Request; KVCache".<br />Empty means all available domains.<br />Available domains: Request, KVCache, ModelExecute, BatchSchedule, Communication.<br />Note: If the selected domains are incomplete, analysis output may show warnings due to missing data. See [Reference Table 1](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/devaids/Profiling/mindieprofiling_0009.html#ZH-CN_TOPIC_0000002370256365__table1985410131831). | No |
##### Example Configuration
```json
{
    "enable": 1,
    "timelimit": 0,
    "domain": ""
}
```
(symbols-configuration-file)=
#### 6.2 Symbols Configuration File
The symbols configuration file defines which functions/methods to profile and supports flexible configuration with custom attribute collection.
##### File Name and Loading
- Default load path: `~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml` (matching the installed vLLM version)
If you need to customize the profiling points, it is highly recommended to copy the default configuration file into your working directory and point the `PROFILING_SYMBOLS_PATH` environment variable at the copy.
##### Field Descriptions
| Field | Description | Example |
|:-----:|:-----|:-----|
| max_version | Upper version constraint | `"0.11.0"` |
| attributes | Custom attribute collection | Only supported for `"timer"` handler. See the section below |
##### Examples
- Example 1: Custom handler
```yaml
# A handler entry points at a custom function (module:function path is illustrative)
- symbol: vllm.v1.engine.core:EngineCore.execute_model
  domain: ModelExecute
  handler: my_package.my_module:custom_handler
```
- Example 2: Default timer
```yaml
- symbol: vllm.v1.engine.core:EngineCore.execute_model
  domain: ModelExecute
  # No handler specified -> default timer
```
##### Custom Attribute Collection
The `attributes` field supports flexible custom attribute collection and allows operations and transformations on function arguments and return values.
###### Basic Syntax
- Argument access: use the parameter name directly, e.g., `input_ids`
- Return value access: use the `return` keyword
- Pipeline operations: use `|` to chain multiple operations
- Attribute access: use `attr` to access object attributes
###### Example
```yaml
- symbol: vllm_ascend.worker.model_runner_v1:NPUModelRunner.execute_model
  attributes:
    - expr: args[0] | attr input_batch | attr _req_ids | len
```
###### Expression Notes
1. `len(input_ids)`: get the length of parameter `input_ids`.
2. `len(return) | str`: get the length of the return value and convert to string (equivalent to `str(len(return))`).
3. `return[0] | attr input_ids | len`: get the length of the `input_ids` attribute of the first element in the return value.
###### Supported Expression Types
- Basic operations: `len()`, `str()`, `int()`, `float()`
- Index access: `return[0]`, `return['key']`
- Attribute access: `return | attr attr_name`
- Pipeline composition: chain operations with `|`
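To make the pipeline semantics concrete, here is a toy evaluator. This is not the msserviceprofiler implementation, only an illustration of how the operations above compose; index access (`return[0]`, `return['key']`) is deliberately not modeled:

```python
def eval_pipeline(expr: str, value):
    """Evaluate a pipe expression such as "attr input_ids | len | str"
    against an initial value, applying each stage left to right."""
    for stage in (s.strip() for s in expr.split("|")):
        if stage.startswith("attr "):
            # Attribute access: "attr name" -> getattr(value, "name")
            value = getattr(value, stage.split(None, 1)[1])
        elif stage == "len":
            value = len(value)
        elif stage in ("str", "int", "float"):
            value = {"str": str, "int": int, "float": float}[stage](value)
        else:
            raise ValueError(f"unsupported stage: {stage!r}")
    return value
```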
###### Advanced Examples
```yaml
attributes:
  - expr: data | attr items | len | str
```
##### Custom Handler
When `handler` specifies a custom function, it must match the following signature:
```python
def custom_handler(original_func, this, *args, **kwargs):
    ...
```
If the custom handler fails to import, the system will automatically fall back to the default timer mode.
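As an illustration of that contract, below is a hypothetical timing handler. It assumes `original_func` is the unwrapped function, so `this` is forwarded explicitly; a real handler would report to the profiler rather than print:

```python
import time

def custom_handler(original_func, this, *args, **kwargs):
    """Time the wrapped call, report its duration, and return its result."""
    start = time.perf_counter()
    try:
        return original_func(this, *args, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        name = getattr(original_func, "__qualname__", repr(original_func))
        print(f"{name}: {elapsed_ms:.3f} ms")
```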
[↑ Back to Top](#service-profiling-guide)