[feature] vllm-ascend support msprobe (eager mode dump) (#4241)
### What this PR does / why we need it?
vllm-ascend needs to dump data during model execution to debug precision problems. msprobe provides the corresponding dump abilities, so it is integrated into vllm-ascend to make debugging easier.
### Does this PR introduce _any_ user-facing change?
Yes. A new `dump_config` option can be passed through `--additional-config`:
```
'dump_config': '/path/to/config.json'
```
- vLLM version: v0.11.0
- vLLM main: 2918c1b49c
---------
Signed-off-by: Tjh-UKN <2559659915@qq.com>
docs/source/developer_guide/performance_and_debug/index.md (new file, 11 lines)
@@ -0,0 +1,11 @@

# Performance and Debug

::::{toctree}
:caption: Performance and Debug
:maxdepth: 1
performance_benchmark
profile_execute_duration
optimization_and_tuning
service_profiling_guide
msprobe_guide
::::
@@ -0,0 +1,516 @@

# MSProbe Debugging Guide

During inference or training runs we often encounter accuracy anomalies such as outputs drifting away from the expectation, unstable numerical behavior (NaN/Inf), or predictions that no longer match the labels. To pinpoint the root cause we have to monitor and capture intermediate data produced while the model executes: feature maps, weights, activations, and layer outputs. By capturing key tensors at specific stages, logging I/O pairs for the core layers, and retaining contextual metadata (prompts, tensor dtypes, hardware configuration, etc.), we can systematically trace where the accuracy degradation or numerical error started. This guide describes the end-to-end workflow for diagnosing accuracy issues for AI models (with a focus on vllm-ascend services): preparation, data capture, and analysis & verification.

## 0. Background Concepts

`msprobe` supports three accuracy levels:

- **L0**: dumps tensors at the module level and generates `construct.json` so that visualization tools can rebuild the network structure. A model or submodule handle must be passed in.
- **L1**: collects operator-level statistics only, which is suitable for lightweight troubleshooting.
- **mix**: captures both structural information and operator statistics, which is useful when you need both graph reconstruction and numerical comparisons.

## 1. Prerequisites

### 1.1 Install `msprobe`

Install msprobe with pip:

```bash
pip install mindstudio-probe==8.3.0
```

### 1.2 Visualization dependencies (optional)

Install additional dependencies if you need to visualize the captured data.

1. Install `tb_graph_ascend`:

```bash
pip install tb_graph_ascend
```

## 2. Collecting Data with `msprobe`

We generally follow a coarse-to-fine strategy when capturing data. First identify the token where the issue shows up, and then decide which range needs to be sampled around that token. The typical workflow is described below.

### 2.1 Prepare the dump configuration file

Create a `config.json` that can be parsed by `PrecisionDebugger` and place it in an accessible path. Common fields are:

| Field | Description | Required |
|:---:|:----|:---:|
| `task` | Type of dump task. Common PyTorch values include `"statistics"` and `"tensor"`. A statistics task collects tensor statistics (mean, variance, max, min, etc.) while a tensor task captures the full tensors. | Yes |
| `dump_path` | Directory where dump results are stored. When omitted, `msprobe` uses its default path. | No |
| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks you must set this field to `[]`. | No |
| `step` | Token iteration(s) to sample. An empty list means every iteration. | No |
| `level` | Dump level string (`"L0"`, `"L1"`, or `"mix"`). `L0` targets `nn.Module` instances, `L1` targets torch API calls, and `mix` collects both. | Yes |
| `async_dump` | Whether to enable asynchronous dump (supported for PyTorch `statistics`/`tensor` tasks). Defaults to `false`. | No |
| `scope` | Module range to sample. An empty list collects every module. | No |
| `list` | Operator range to sample. An empty list collects every operator. | No |

To restrict what is captured, configure the `scope` and `list` fields:

- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:

  ```
  "scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
  "scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.backward.0"]
  "scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
  ```

  The `level` setting determines what can be provided: modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`.

- `list` (list[str]): Custom operator list. Options include:
  - Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
  - When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
  - Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded.

Example configuration:

```bash
cat <<'JSON' > /data/msprobe_config.json
{
    "task": "statistics",
    "dump_path": "/home/data_dump",
    "rank": [],
    "step": [],
    "level": "L1",
    "async_dump": false,

    "statistics": {
        "scope": [],
        "list": [],
        "tensor_list": [],
        "data_mode": ["all"],
        "summary_mode": "statistics"
    }
}
JSON
```

### 2.2 Enable `msprobe` in vllm-ascend

1. Start vLLM in eager mode by adding `--enforce-eager` (static-graph scenarios are not supported yet) and pass the config path through `--additional-config` (an offline-inference equivalent is sketched after the command):

```bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --dtype float16 \
  --enforce-eager \
  --host 0.0.0.0 \
  --port 8000 \
  --additional-config '{"dump_config": "/data/msprobe_config.json"}' &
```
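For offline inference, the same option can be passed programmatically. A minimal sketch, assuming your vLLM build exposes the `additional_config` keyword on `LLM` (the model name and config path are placeholders):

```python
from vllm import LLM, SamplingParams

# Hypothetical offline equivalent of the serve command above:
# eager mode plus the msprobe dump configuration via additional_config.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    enforce_eager=True,
    additional_config={"dump_config": "/data/msprobe_config.json"},
)
outputs = llm.generate(["Explain gravity in one sentence."],
                       SamplingParams(max_tokens=32, temperature=0))
print(outputs[0].outputs[0].text)
```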

## 3. Send requests and collect dumps

1. Send inference requests as usual, for example:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Explain gravity in one sentence.",
    "max_tokens": 32,
    "temperature": 0
  }' | python -m json.tool
```

2. Each request drives the sequence `msprobe: start -> forward/backward -> stop -> step`. The runner invokes `step()` on every code path, so you always get a complete dataset even if inference returns early.

3. Dump files are written into `dump_path`. They usually contain:
   - Tensor files grouped by operator/module.
   - `dump.json`, which records metadata such as dtype, shape, min/max, and `requires_grad`.
   - `construct.json`, which is generated when `level` is `L0` or `mix` (required for visualization).

Example directory layout:

```text
├── dump_path
│   ├── step0
│   │   ├── rank0
│   │   │   ├── dump_tensor_data
│   │   │   │   ├── Tensor.permute.1.forward.pt
│   │   │   │   ├── Functional.linear.5.backward.output.pt  # Format: {api_type}.{api_name}.{call_count}.{forward/backward}.{input/output}.{arg_index}.
│   │   │   │   │     # arg_index is the nth input or output of the API. If an input is a list, numbering continues with decimals (e.g., 1.1 is the first element of the first argument).
│   │   │   │   ├── Module.conv1.Conv2d.forward.0.input.0.pt  # Format: {Module}.{module_name}.{class_name}.{forward/backward}.{call_count}.{input/output}.{arg_index}.
│   │   │   │   ├── Module.conv1.Conv2d.forward.0.parameters.bias.pt  # Module parameter data: {Module}.{module_name}.{class_name}.forward.{call_count}.parameters.{parameter_name}.
│   │   │   │   └── Module.conv1.Conv2d.parameters_grad.weight.pt  # Module parameter gradients: {Module}.{module_name}.{class_name}.parameters_grad.{parameter_name}. Gradients do not include call_count because the same gradient updates all invocations.
│   │   │   │         # When the `model` argument passed to dump is a List[torch.nn.Module] or Tuple[torch.nn.Module], module-level data names also include the index inside the list ({Module}.{index}.*), e.g., Module.0.conv1.Conv2d.forward.0.input.0.pt.
│   │   │   ├── dump.json
│   │   │   ├── stack.json
│   │   │   ├── dump_error_info.log
│   │   │   └── construct.json
│   │   ├── rank1
│   │   │   ├── dump_tensor_data
│   │   │   │   └── ...
│   │   │   ├── dump.json
│   │   │   ├── stack.json
│   │   │   ├── dump_error_info.log
│   │   │   └── construct.json
│   │   ├── ...
│   │   └── rank7
│   ├── step1
│   │   ├── ...
│   ├── step2
```

- `rank`: Device ID. Each card writes its data to the corresponding `rank{ID}` directory. In non-distributed scenarios the directory is simply named `rank`.
- `dump_tensor_data`: Tensor payloads that were collected (see the loading sketch below).
- `dump.json`: Statistics for the forward/backward data of each API or module, including names, dtype, shape, max, min, mean, L2 norm (Euclidean norm), and a CRC-32 value when `summary_mode="md5"`. See [dump.json file description](#dumpjson-file-description) for details.
- `dump_error_info.log`: Present only when the dump tool encountered an error; records the failure log.
- `stack.json`: Call stacks for APIs/modules.
- `construct.json`: Hierarchical structure description. Empty when `level=L1`.
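When a `tensor` task is used, each `.pt` payload can be inspected directly with PyTorch. A minimal sketch (the file name is just the example from the layout above):

```python
import torch

# Load one dumped tensor payload and print quick statistics.
path = "dump_path/step0/rank0/dump_tensor_data/Module.conv1.Conv2d.forward.0.input.0.pt"
t = torch.load(path, map_location="cpu")
print(path, tuple(t.shape), t.dtype)
print("max", t.max().item(), "min", t.min().item(),
      "mean", t.float().mean().item(),
      "nan", torch.isnan(t).any().item(), "inf", torch.isinf(t).any().item())
```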

## 4. Analyze the results

### 4.1 Prerequisites

You typically need two dump datasets: one from the "problem side" (the run that exposes the accuracy or numerical error) and another from the "benchmark side" (a good baseline). These datasets do not have to be identical; they can come from different branches, framework versions, or even alternative implementations (operator substitutions, different graph-optimization switches, etc.). As long as they use the same or similar inputs, hardware topology, and sampling points (step/token), `msprobe` can compare them and locate the divergent nodes. If you cannot find a perfectly clean benchmark, start by capturing the problem-side data, craft the smallest reproducible case by hand, and perform a self-comparison. Below we assume the problem dump is `problem_dump` and the benchmark dump is `bench_dump`.

### 4.2 Visualization

Use `msprobe graph_visualize` to generate results that can be opened inside `tb_graph_ascend`.

1. Ensure the dump contains `construct.json` (i.e., `level = L0` or `level = mix`).
2. Prepare a comparison file such as `compare.json`. Its format and generation flow are described in section 3.1.3 of `msprobe_visualization.md`. Example (minimal runnable snippet):

```json
{
  "npu_path": "./problem_dump",
  "bench_path": "./bench_dump",
  "is_print_compare_log": true
}
```

Replace the paths with your dump directories before invoking `msprobe graph_visualize`. **If you only need to build a single graph**, omit `bench_path` to visualize one dump.
Multi-rank scenarios (single rank, multi-rank, or multi-step multi-rank) are also supported. `npu_path` or `bench_path` must contain folders named `rank{N}`, and every rank folder must contain a non-empty `construct.json` together with `dump.json` and `stack.json`. If any `construct.json` is empty, verify that the dump level includes `L0` or `mix`. When comparing graphs, both `npu_path` and `bench_path` must contain the same set of rank folders so they can be paired one-to-one.

```
├── npu_path or bench_path
|   ├── rank0
|   |   ├── dump_tensor_data (only when the `tensor` task is enabled)
|   |   |   ├── Tensor.permute.1.forward.pt
|   |   |   ├── MyModule.0.forward.input.pt
|   |   |   ...
|   |   |   └── Function.linear.5.backward.output.pt
|   |   ├── dump.json       # Tensor metadata
|   |   ├── stack.json      # Operator call stack information
|   |   └── construct.json  # Hierarchical structure; empty when `level=L1`
|   ├── rank1
|   |   ├── dump_tensor_data
|   |   |   └── ...
|   |   ├── dump.json
|   |   ├── stack.json
|   |   └── construct.json
|   ├── ...
|   |
|   └── rankn
```

3. Run:

```bash
msprobe graph_visualize \
  --input_path ./compare.json \
  --output_path ./graph_output
```

After the comparison finishes, a `*.vis.db` file is created under `graph_output`.

- Graph build: `build_{timestamp}.vis.db`
- Graph comparison: `compare_{timestamp}.vis.db`

4. Launch `tensorboard` and load the output directory to inspect structural differences, numerical comparisons, overflow detection results, cross-device communication nodes, and filters/search. Pass the directory containing the `.vis.db` files to `--logdir`:

```bash
tensorboard --logdir ./graph_output --bind_all --port [optional_port]
```

5. Inspect the visualization. The UI usually displays the overall model structure with operators, parameters, and tensor I/O. Click any node to expand its children.
   - **Difference visualization**: Comparison results highlight divergent nodes with different colors (the larger the difference, the redder the node). Click a node to view its detailed information, including tensor inputs/outputs, parameters, and operator type. Analyze the data difference and the surrounding connections to pinpoint the exact divergence.
   - **Helper features**:
     - Switch rank/step: Quickly check difference nodes on different ranks and steps.
     - Search/filter: Use the search box to filter nodes by operator name, etc.
     - Manual mapping: Automatic mapping cannot cover every case, so the tool lets you manually map nodes between the problem and benchmark graphs before generating comparison results.

## 5. Troubleshooting

- `RuntimeError: Please enforce eager mode`: Restart vLLM and add the `--enforce-eager` flag.
- No dump files: Confirm that the JSON path is correct and every node has write permission. In distributed scenarios set `keep_all_ranks` so that every rank writes its own dump.
- Dumps are too large: Start with a `statistics` task to locate abnormal tensors, then narrow the scope with `scope`/`list`/`tensor_list`, `filters`, `token_range`, etc. (see the sketch below).
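For example, once a statistics run has pointed at a suspicious operator family, a follow-up `tensor` task can be narrowed to it. A minimal sketch of such a `config.json`; the `linear` substring and the `tensor` sub-block layout follow the field descriptions above and are illustrative only:

```json
{
    "task": "tensor",
    "dump_path": "/home/data_dump_tensor",
    "rank": [0],
    "step": [0],
    "level": "L1",
    "tensor": {
        "scope": [],
        "list": ["linear"],
        "data_mode": ["all"]
    }
}
```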
---

## Appendix

### dump.json file description

#### L0 level

An L0 `dump.json` contains forward/backward I/O for modules together with parameters and parameter gradients. Using PyTorch's `Conv2d` as an example, the network code looks like:

`output = self.conv2(input)  # self.conv2 = torch.nn.Conv2d(64, 128, 5, padding=2, bias=True)`

`dump.json` contains the following entries:

- `Module.conv2.Conv2d.forward.0`: Forward data of the module. `input_args` represents positional inputs, `input_kwargs` represents keyword inputs, `output` stores forward outputs, and `parameters` stores weights/biases.
- `Module.conv2.Conv2d.parameters_grad`: Parameter gradients (weight and bias).
- `Module.conv2.Conv2d.backward.0`: Backward data of the module. `input` represents gradients that flow into the module (gradients of the forward outputs) and `output` represents gradients that flow out (gradients of the module inputs).

**Note**: When the `model` parameter passed to the dump API is `List[torch.nn.Module]` or `Tuple[torch.nn.Module]`, module-level names include the index inside the list (`{Module}.{index}.*`). Example: `Module.0.conv1.Conv2d.forward.0`.

```json
{
  "task": "tensor",
  "level": "L0",
  "framework": "pytorch",
  "dump_data_dir": "/dump/path",
  "data": {
    "Module.conv2.Conv2d.forward.0": {
      "input_args": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [8, 16, 14, 14],
          "Max": 1.638758659362793,
          "Min": 0.0,
          "Mean": 0.2544615864753723,
          "Norm": 70.50277709960938,
          "requires_grad": true,
          "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt"
        }
      ],
      "input_kwargs": {},
      "output": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [8, 32, 10, 10],
          "Max": 1.6815717220306396,
          "Min": -1.5120246410369873,
          "Mean": -0.025344856083393097,
          "Norm": 149.65576171875,
          "requires_grad": true,
          "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt"
        }
      ],
      "parameters": {
        "weight": {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32, 16, 5, 5],
          "Max": 0.05992485210299492,
          "Min": -0.05999220535159111,
          "Mean": -0.0006165213999338448,
          "Norm": 3.421217441558838,
          "requires_grad": true,
          "data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt"
        },
        "bias": {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32],
          "Max": 0.05744686722755432,
          "Min": -0.04894155263900757,
          "Mean": 0.006410328671336174,
          "Norm": 0.17263513803482056,
          "requires_grad": true,
          "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
        }
      }
    },
    "Module.conv2.Conv2d.parameters_grad": {
      "weight": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32, 16, 5, 5],
          "Max": 0.018550323322415352,
          "Min": -0.008627401664853096,
          "Mean": 0.0006675920449197292,
          "Norm": 0.26084786653518677,
          "requires_grad": false,
          "data_name": "Module.conv2.Conv2d.parameters_grad.weight.pt"
        }
      ],
      "bias": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32],
          "Max": 0.014914230443537235,
          "Min": -0.006656786892563105,
          "Mean": 0.002657240955159068,
          "Norm": 0.029451673850417137,
          "requires_grad": false,
          "data_name": "Module.conv2.Conv2d.parameters_grad.bias.pt"
        }
      ]
    },
    "Module.conv2.Conv2d.backward.0": {
      "input": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [8, 32, 10, 10],
          "Max": 0.0015069986693561077,
          "Min": -0.001139344065450132,
          "Mean": 3.3215508210560074e-06,
          "Norm": 0.020567523315548897,
          "requires_grad": false,
          "data_name": "Module.conv2.Conv2d.backward.0.input.0.pt"
        }
      ],
      "output": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [8, 16, 14, 14],
          "Max": 0.0007466732058674097,
          "Min": -0.00044813455315306783,
          "Mean": 6.814070275140693e-06,
          "Norm": 0.01474067009985447,
          "requires_grad": false,
          "data_name": "Module.conv2.Conv2d.backward.0.output.0.pt"
        }
      ]
    }
  }
}
```
#### L1 level

An L1 `dump.json` records forward/backward I/O for APIs. Using PyTorch's `relu` function as an example (`output = torch.nn.functional.relu(input)`), the file contains:

- `Functional.relu.0.forward`: Forward data of the API. `input_args` are positional inputs, `input_kwargs` are keyword inputs, and `output` stores the forward outputs.
- `Functional.relu.0.backward`: Backward data of the API. `input` represents the gradients of the forward outputs, and `output` represents the gradients that flow back to the forward inputs.

```json
{
  "task": "tensor",
  "level": "L1",
  "framework": "pytorch",
  "dump_data_dir": "/dump/path",
  "data": {
    "Functional.relu.0.forward": {
      "input_args": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32, 16, 28, 28],
          "Max": 1.3864083290100098,
          "Min": -1.3364859819412231,
          "Mean": 0.03711778670549393,
          "Norm": 236.20692443847656,
          "requires_grad": true,
          "data_name": "Functional.relu.0.forward.input.0.pt"
        }
      ],
      "input_kwargs": {},
      "output": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32, 16, 28, 28],
          "Max": 1.3864083290100098,
          "Min": 0.0,
          "Mean": 0.16849493980407715,
          "Norm": 175.23345947265625,
          "requires_grad": true,
          "data_name": "Functional.relu.0.forward.output.0.pt"
        }
      ]
    },
    "Functional.relu.0.backward": {
      "input": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32, 16, 28, 28],
          "Max": 0.0001815402356442064,
          "Min": -0.00013352684618439525,
          "Mean": 0.00011915402356442064,
          "Norm": 0.007598237134516239,
          "requires_grad": false,
          "data_name": "Functional.relu.0.backward.input.0.pt"
        }
      ],
      "output": [
        {
          "type": "torch.Tensor",
          "dtype": "torch.float32",
          "shape": [32, 16, 28, 28],
          "Max": 0.0001815402356442064,
          "Min": -0.00012117840378778055,
          "Mean": 2.0098118724831693e-08,
          "Norm": 0.006532244384288788,
          "requires_grad": false,
          "data_name": "Functional.relu.0.backward.output.0.pt"
        }
      ]
    }
  }
}
```
#### mix level

A `mix` level `dump.json` contains both L0 and L1 data; the file format is the same as the examples above.
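Because `dump.json` is plain JSON with a uniform per-entry schema, first-pass triage can be scripted. A minimal sketch (not part of msprobe itself) that flags entries whose statistics contain NaN/Inf or unusually large magnitudes:

```python
import json
import math

# Scan one dump.json and print entries with suspicious statistics.
with open("dump_path/step0/rank0/dump.json") as f:
    dump = json.load(f)

def tensors(node):
    """Yield every statistics dict (one that carries 'Max'/'Min') nested under node."""
    if isinstance(node, dict):
        if "Max" in node and "Min" in node:
            yield node
        for v in node.values():
            yield from tensors(v)
    elif isinstance(node, list):
        for v in node:
            yield from tensors(v)

for name, entry in dump["data"].items():
    for t in tensors(entry):
        mx, mn = t.get("Max"), t.get("Min")
        vals = [v for v in (mx, mn) if isinstance(v, float)]
        if any(math.isnan(v) or math.isinf(v) or abs(v) > 1e4 for v in vals):
            print(name, t.get("data_name"), "Max:", mx, "Min:", mn)
```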
@@ -0,0 +1,184 @@

# Optimization and Tuning

This guide helps users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment, and more. Any feedback is welcome.

## Preparation

Run the container:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the cann base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
  --name performance-test \
  --shm-size=1g \
  --device $DEVICE \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -it $IMAGE bash
```

Configure your environment:

```{code-block} bash
:substitutions:
# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install os packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y
```

Install vLLM and vllm-ascend:

```{code-block} bash
:substitutions:
# Install necessary dependencies
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pytest

# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=true
```

Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vLLM, vllm-ascend, and MindIE Turbo are installed correctly.

:::{note}
Make sure vLLM and vllm-ascend are installed after your Python configuration is completed, because these packages build binary files using the Python in the current environment. If you install vLLM, vllm-ascend, and MindIE Turbo before completing section 1.1, the binary files will not use the optimized Python.
:::

## Optimizations

### 1. Compilation Optimization

#### 1.1. Install optimized `python`

Python supports **LTO** and **PGO** optimization starting from version `3.6`, which can be enabled at compile time. We offer optimized `python` packages directly to users for convenience. You can also reproduce the `python` build by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenario.

```{code-block} bash
:substitutions:
mkdir -p /workspace/tmp
cd /workspace/tmp

# Download prebuilt lib and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz

# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip

export PATH=/usr/bin:/usr/local/python/bin:$PATH
```

### 2. OS Optimization

#### 2.1. jemalloc

**jemalloc** is a memory allocator that improves performance in multi-thread scenarios and can reduce memory fragmentation. jemalloc uses per-thread memory managers to allocate variables, which avoids lock contention between threads and can greatly improve performance.

```{code-block} bash
:substitutions:
# Install jemalloc
sudo apt update
sudo apt install libjemalloc2

# Configure jemalloc
export LD_PRELOAD="/usr/lib/$(uname -i)-linux-gnu/libjemalloc.so.2:$LD_PRELOAD"
```

#### 2.2. Tcmalloc

**Tcmalloc (Thread-Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention, and optimizing the large-object handling flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).

```{code-block} bash
:substitutions:
# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so*
find /usr -name libtcmalloc.so*

# Make the priority of tcmalloc higher
# The <path> is the location of libtcmalloc.so found by the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"

# Verify your configuration
# The path of libtcmalloc.so will appear in the output if your configuration is valid
ldd `which python`
```

### 3. `torch_npu` Optimization

Some performance tuning features in `torch_npu` are controlled by environment variables. A few of these features and their related environment variables are shown below.

Memory optimization:

```{code-block} bash
:substitutions:
# Upper limit of memory block splitting allowed (MB): setting this parameter can prevent large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

# When operators on the communication stream have dependencies, they all need to finish before the memory can be released for reuse. Multi-stream reuse releases memory on the communication stream in advance so that the computing stream can reuse it.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
```

Scheduling optimization:

```{code-block} bash
:substitutions:
# Optimize the operator dispatch queue. This affects the memory peak and may degrade performance if memory is tight.
export TASK_QUEUE_ENABLE=2

# This greatly improves CPU-bound models while keeping the same performance for NPU-bound models.
export CPU_AFFINITY_CONF=1
```

### 4. CANN Optimization

#### 4.1. HCCL Optimization

There are some performance tuning features in HCCL, which are controlled by environment variables.

You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, the communication is scheduled by the AI vector cores directly over RoCE, instead of being scheduled by the AI CPU.

```{code-block} bash
:substitutions:
export HCCL_OP_EXPANSION_MODE="AIV"
```

In addition, more features for performance optimization in specific scenarios are listed below; a combined example follows the list.

- `HCCL_INTRA_ROCE_ENABLE`: Use the RDMA link instead of the SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure the traffic class of the RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure the service level of the RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the buffer size used for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
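The snippet below only illustrates how these variables might be combined; the values are placeholders, so consult the linked HCCL documentation for the ranges and defaults that match your cluster before applying them.

```bash
# Illustrative values only; tune per the linked HCCL documentation.
export HCCL_INTRA_ROCE_ENABLE=1   # prefer the RDMA link inside an 8P mesh
export HCCL_RDMA_TC=132           # RDMA NIC traffic class
export HCCL_RDMA_SL=4             # RDMA NIC service level
export HCCL_BUFFSIZE=512          # per-link shared buffer size (MB)
```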
@@ -0,0 +1,195 @@

# Performance Benchmark

This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vllm project.

**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

## 1. Run docker container

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
  --name vllm-ascend \
  --shm-size=1g \
  --device $DEVICE \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
  -it $IMAGE \
  /bin/bash
```

## 2. Install dependencies

```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional) Prepare model weights

For faster startup, we recommend downloading the model in advance:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:

```json
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "your local model path",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
```

## 4. Run benchmark script

Run the benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 minutes, the output looks like the following:

```bash
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  212.77
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              0.94
Output token throughput (tok/s):         204.66
Total Token throughput (tok/s):          405.16
---------------Time to First Token----------------
Mean TTFT (ms):                          104.14
Median TTFT (ms):                        102.22
P99 TTFT (ms):                           153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.78
Median TPOT (ms):                        38.70
P99 TPOT (ms):                           48.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.46
Median ITL (ms):                         36.96
P99 ITL (ms):                            75.03
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  72.55
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              2.76
Output token throughput (tok/s):         600.24
Total Token throughput (tok/s):          1188.27
---------------Time to First Token----------------
Mean TTFT (ms):                          115.62
Median TTFT (ms):                        109.39
P99 TTFT (ms):                           169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.48
Median TPOT (ms):                        52.40
P99 TPOT (ms):                           69.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.47
Median ITL (ms):                         43.95
P99 ITL (ms):                            130.29
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  47.82
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              4.18
Output token throughput (tok/s):         910.62
Total Token throughput (tok/s):          1802.70
---------------Time to First Token----------------
Mean TTFT (ms):                          128.50
Median TTFT (ms):                        128.36
P99 TTFT (ms):                           187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.60
Median TPOT (ms):                        77.85
P99 TPOT (ms):                           165.90
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.72
Median ITL (ms):                         54.84
P99 ITL (ms):                            289.63
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  41.26
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              4.85
Output token throughput (tok/s):         1055.44
Total Token throughput (tok/s):          2089.40
---------------Time to First Token----------------
Mean TTFT (ms):                          3394.37
Median TTFT (ms):                        3359.93
P99 TTFT (ms):                           3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.28
Median TPOT (ms):                        64.19
P99 TPOT (ms):                           97.66
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.62
Median ITL (ms):                         55.69
P99 ITL (ms):                            82.90
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```

The result json files are generated under the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.

```bash
.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
```
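The result files can also be aggregated programmatically. A minimal sketch, assuming each serving result JSON exposes top-level keys such as `request_throughput`, `mean_ttft_ms`, and `mean_tpot_ms` (check your files for the exact field names):

```python
import glob
import json

# Hypothetical post-processing: print a one-line summary per serving result file.
for path in sorted(glob.glob("benchmark/results/serving_*.json")):
    with open(path) as f:
        r = json.load(f)
    print(f"{path}: "
          f"req/s={r.get('request_throughput')}, "
          f"mean TTFT(ms)={r.get('mean_ttft_ms')}, "
          f"mean TPOT(ms)={r.get('mean_tpot_ms')}")
```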
@@ -0,0 +1,40 @@

# Profile Execute Duration

The execution duration of each stage (including pre/post-processing, model forward, etc.) usually needs to be captured during a complete inference process. Typically, this is done by calling `torch.npu.synchronize()` and taking CPU timestamps, which increases the host/device synchronization overhead.

**To reduce this overhead, we add this feature, which uses the NPU event timestamp mechanism to observe the device execution time asynchronously.**

## Usage
* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously wherever you need to observe the execution duration (see the sketch below).
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
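A minimal sketch of how the two APIs might be combined in custom runner code; it assumes `capture_async` works as a context manager labeled with a stage name, that `pop_captured_sync` returns the captured stage/duration pairs, and that the class lives in `vllm_ascend.utils`. Check the vllm-ascend source for the exact signatures:

```python
from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path

def execute_with_profiling(model, inputs):
    # Mark a stage asynchronously; no host/device sync happens here.
    with ProfileExecuteDuration().capture_async("forward"):
        output = model(**inputs)

    # Later (e.g., once per decode step), synchronize once and report all stages.
    durations = ProfileExecuteDuration().pop_captured_sync()
    for stage, duration_ms in durations.items():
        print(f"[{stage}]: {duration_ms:.2f}ms")
    return output
```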

**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**

```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```

## Example Output

```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
@@ -0,0 +1,250 @@

# Service Profiling Guide

In inference service processes, we sometimes need to monitor the internal execution flow of the inference service framework to identify performance issues. By collecting start and end timestamps of key processes, identifying critical functions or iterations, recording key events, and capturing diverse types of information, we can quickly pinpoint performance bottlenecks.

This guide walks you through collecting performance data for the vllm-ascend service framework and operators. It covers the full workflow from preparation and collection to analysis and visualization, helping you quickly get started with the profiling tool.

## Quick Start

### 0 Installation

Install the `msserviceprofiler` package using pip:

```bash
pip install msserviceprofiler==1.2.2
```

### 1 Preparation

Before starting the service, set the environment variable `SERVICE_PROF_CONFIG_PATH` to point to the profiling configuration file, and set the environment variable `PROFILING_SYMBOLS_PATH` to specify the YAML configuration file for the symbols that need to be imported. After that, start the vLLM service according to your deployment method.

```bash
cd ${path_to_store_profiling_files}
# Set environment variables
export SERVICE_PROF_CONFIG_PATH=ms_service_profiler_config.json
export PROFILING_SYMBOLS_PATH=service_profiling_symbols.yaml

# Start vLLM service
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
```

The file `ms_service_profiler_config.json` is the profiling configuration. If it does not exist at the specified path, a default configuration will be generated automatically. If needed, you can customize it in advance according to the instructions in the `Profiling Configuration File` section below.

`service_profiling_symbols.yaml` is the configuration file containing the profiling points to be imported. You can choose **not** to set the `PROFILING_SYMBOLS_PATH` environment variable, in which case the default configuration file will be used. If the file does not exist at the path you specified, the system will likewise generate a configuration file at that path for future customization. You can customize it according to the instructions in the `Symbols Configuration File` section below.

### 2 Enable Profiling

To enable the performance data collection switch, change the `enable` field from `0` to `1` in the configuration file `ms_service_profiler_config.json`. This can be done with the following sed command:

```bash
sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
```

### 3 Send Requests

Choose a request-sending method that suits your actual profiling needs:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Beijing is a",
    "max_tokens": 5,
    "temperature": 0
  }' | python3 -m json.tool
```

### 4 Analyze Data

```bash
# xxxx-xxxx is the directory automatically created based on vLLM startup time
cd /root/.ms_server_profiler/xxxx-xxxx

# Analyze data
msserviceprofiler analyze --input-path=./ --output-path output
```

### 5 View Results

After analysis, the `output` directory will contain:
- `chrome_tracing.json`: Chrome tracing format data, which can be opened in [MindStudio Insight](https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html).
- `profiler.db`: Performance data in database format.
- `request.csv`: Request-related data.
- `request_summary.csv`: Overall request metrics (see the loading sketch below).
- `kvcache.csv`: KV Cache-related data.
- `batch.csv`: Batch scheduling-related data.
- `batch_summary.csv`: Overall batch scheduling metrics.
- `service_summary.csv`: Overall service-level metrics.
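For a quick look at the request-level metrics, the CSV files can be loaded with pandas. A minimal sketch; column names vary between versions, so print them instead of hard-coding:

```python
import pandas as pd

# Load the per-request table and the aggregate summary produced by the analyze step.
requests = pd.read_csv("output/request.csv")
summary = pd.read_csv("output/request_summary.csv")

print(requests.columns.tolist())   # inspect available columns first
print(summary.to_string(index=False))
```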
---

## Appendix

(profiling-configuration-file)=
### 1 Profiling Configuration File

The profiling configuration file controls profiling parameters and behavior.

#### File Format

The configuration is in JSON format. Main parameters:

| Parameter | Description | Required |
|:------:|:----|:-----:|
| enable | Switch for profiling: <br />0: disable<br />1: enable<br />Default: 0 | Yes |
| prof_dir | Directory to store collected performance data. <br />Default: $HOME/.ms_service_profiler | No |
| profiler_level | Data collection level. Default is "INFO" (normal level). | No |
| host_system_usage_freq | Sampling frequency of host CPU and memory metrics. Disabled by default. Range: integer 1–50, unit: Hz (times per second). Set to -1 to disable. <br />Note: Enabling this may consume significant memory. | No |
| npu_memory_usage_freq | Sampling frequency of NPU memory utilization. Disabled by default. Range: integer 1–50, unit: Hz (times per second). Set to -1 to disable. <br />Note: Enabling this may consume significant memory. | No |
| acl_task_time | Switch to collect operator dispatch latency and execution latency: <br />0: disable (default; 0 or invalid values mean disabled).<br />1: enable; calls `aclprofCreateConfig` with `ACL_PROF_TASK_TIME_L0`.<br />2: enable MSPTI-based data dumping; uses MSPTI for profiling and requires: `export LD_PRELOAD=$ASCEND_TOOLKIT_HOME/lib64/libmspti.so` | No |
| acl_prof_task_time_level | Level and duration for profiling: <br />L0: collect operator dispatch and execution latency only; lower overhead (no operator basic info).<br />L1: collect AscendCL interface performance (host–device and inter-device sync/async memory copy latencies), plus operator dispatch, execution, and basic info for comprehensive analysis.<br />time: profiling duration, integer 1–999, in seconds.<br />If unset, defaults to L0 until program exit; invalid values fall back to defaults.<br />Level and duration can be combined, e.g., `"acl_prof_task_time_level": "L1,10"`. | No |
| api_filter | Filter to select API performance data to dump. For example, specifying "matmul" dumps all API data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all. <br />Effective only when `acl_task_time` is 2. | No |
| kernel_filter | Filter to select kernel performance data to dump. For example, specifying "matmul" dumps all kernel data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all. <br />Effective only when `acl_task_time` is 2. | No |
| timelimit | Profiling duration for the service. The process stops automatically after this time. Range: integer 0–7200, unit: seconds. Default 0 means unlimited. | No |
| domain | Limit profiling to the specified domains to reduce data volume. String, separated by semicolons, case-sensitive, e.g., "Request; KVCache".<br />Empty means all available domains.<br />Available domains: Request, KVCache, ModelExecute, BatchSchedule, Communication.<br />Note: If the selected domains are incomplete, analysis output may show warnings due to missing data. See [Reference Table 1](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/devaids/Profiling/mindieprofiling_0009.html#ZH-CN_TOPIC_0000002370256365__table1985410131831). | No |

#### Example Configuration

```json
{
    "enable": 1,
    "prof_dir": "vllm_prof",
    "profiler_level": "INFO",
    "acl_task_time": 0,
    "acl_prof_task_time_level": "",
    "timelimit": 0
}
```

---
(symbols-configuration-file)=
### 2 Symbols Configuration File

The symbols configuration file defines which functions/methods to profile and supports flexible configuration with custom attribute collection.

#### 2.1 File Name and Loading

- Default load path: `~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml` (according to the installed vLLM version)

If you need to customize the profiling points, it is highly recommended to copy the profiling configuration file to your working directory and point the `PROFILING_SYMBOLS_PATH` environment variable at it.

#### 2.2 Field Descriptions

| Field | Description | Example |
|:-----:|:-----|:-----|
| symbol | Python import path + attribute chain | `"vllm.v1.core.kv_cache_manager:KVCacheManager.free"` |
| handler | Handler type | `"timer"` (default) or `"pkg.mod:func"` (custom) |
| domain | Domain tag | `"KVCache"`, `"ModelExecute"` |
| name | Event name | `"EngineCoreExecute"` |
| min_version | Lower version constraint (minimum supported vLLM version) | `"0.9.1"` |
| max_version | Upper version constraint (maximum supported vLLM version) | `"0.11.0"` |
| attributes | Custom attribute collection | Only supported for the `"timer"` handler. See the section below |

#### 2.3 Examples

- Example 1: Custom handler

```yaml
- symbol: vllm.v1.core.kv_cache_manager:KVCacheManager.free
  handler: vllm_profiler.config.custom_handler_example:kvcache_manager_free_example_handler
  domain: Example
  name: example_custom
```

- Example 2: Default timer

```yaml
- symbol: vllm.v1.engine.core:EngineCore.execute_model
  domain: ModelExecute
  name: EngineCoreExecute
```

- Example 3: Version constraint

```yaml
- symbol: vllm.v1.executor.abstract:Executor.execute_model
  min_version: "0.9.1"
  # No handler specified -> default timer
```

#### 2.4 Custom Attribute Collection

The `attributes` field supports flexible custom attribute collection and allows operations and transformations on function arguments and return values.

##### Basic Syntax

- Argument access: use the parameter name directly, e.g., `input_ids`
- Return value access: use the `return` keyword
- Pipeline operations: use `|` to chain multiple operations
- Attribute access: use `attr` to access object attributes

##### Example

```yaml
- symbol: vllm_ascend.worker.model_runner_v1:NPUModelRunner.execute_model
  name: ModelRunnerExecuteModel
  domain: ModelExecute
  attributes:
    - name: device
      expr: args[0] | attr device | str
    - name: dp
      expr: args[0] | attr dp_rank | str
    - name: batch_size
      expr: args[0] | attr input_batch | attr _req_ids | len
```
##### Expression Notes

1. `len(input_ids)`: get the length of parameter `input_ids`.
2. `len(return) | str`: get the length of the return value and convert it to a string (equivalent to `str(len(return))`).
3. `return[0] | attr input_ids | len`: get the length of the `input_ids` attribute of the first element in the return value.

##### Supported Expression Types

- Basic operations: `len()`, `str()`, `int()`, `float()`
- Index access: `return[0]`, `return['key']`
- Attribute access: `return | attr attr_name`
- Pipeline composition: chain operations with `|`

##### Advanced Examples

```yaml
attributes:
  # Get tensor shape
  - name: tensor_shape
    expr: input_tensor | attr shape | str

  # Get specific value from a dict
  - name: batch_size
    expr: kwargs['batch_size']

  # Conditional expression (requires custom handler support)
  - name: is_training_mode
    expr: training | bool

  # Complex data processing
  - name: processed_data_len
    expr: data | attr items | len | str
```

#### 2.5 Custom Handler

When `handler` specifies a custom function, it must match the following signature:

```python
def custom_handler(original_func, this, *args, **kwargs):
    """
    Custom handler

    Args:
        original_func: the original function object
        this: the bound object (for methods)
        *args: positional arguments
        **kwargs: keyword arguments

    Returns:
        processing result
    """
    # Custom logic
    pass
```
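As an illustration only (not shipped with `msserviceprofiler`), a handler matching this signature could wrap the original call and log its wall-clock duration; whether `this` must be passed explicitly depends on how the profiler binds the original function, so verify against the tool's documentation:

```python
import time

def timed_free_handler(original_func, this, *args, **kwargs):
    """Example custom handler: call the original method and log its duration."""
    start = time.perf_counter()
    # Assumes original_func is unbound and expects the instance first; adjust if it is already bound.
    result = original_func(this, *args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"KVCacheManager.free took {elapsed_ms:.3f} ms")
    return result
```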

If the custom handler fails to import, the system will automatically fall back to the default timer mode.