[Lint]Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
Reformat markdown files via markdownlint.

- vLLM version: v0.13.0
- vLLM main: bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
SILONG ZENG
2026-01-15 09:06:01 +08:00
committed by GitHub
parent 96edd4673f
commit 4811ba62e0
75 changed files with 711 additions and 308 deletions

View File

@@ -53,7 +53,7 @@ To restrict the operators that are captured, configure the `list` block:
- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:
```
```json
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.backward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
@@ -62,9 +62,9 @@ To restrict the operators that are captured, configure the `list` block:
The `level` setting determines what can be provided—modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`.
- `list` (list[str]): Custom operator list. Options include:
- Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
- When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
- Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded.
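Putting the options above together, a minimal sketch of how these fields might be filled in (the file name and any enclosing keys are placeholders, not the tool's reference layout):
```shell
# A rough sketch only: the `level`, `scope`, and `list` values reuse the examples
# given above; the file name and surrounding configuration keys are assumptions.
cat > dump_config_fragment.json << 'EOF'
{
  "level": "mix",
  "scope": [],
  "list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0", "relu"]
}
EOF
```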
Example configuration:
@@ -188,7 +188,7 @@ Use `msprobe graph_visualize` to generate results that can be opened inside `tb_
Replace the paths with your dump directories before invoking `msprobe graph_visualize`. **If you only need to build a single graph**, omit `bench_path` to visualize one dump.
Single-rank, multi-rank, and multi-step multi-rank scenarios are also supported. `npu_path` or `bench_path` must contain folders named `rank<number>` (for example `rank0`), and every rank folder must contain a non-empty `construct.json` together with `dump.json` and `stack.json`. If any `construct.json` is empty, verify that the dump level includes `L0` or `mix`. When comparing graphs, both `npu_path` and `bench_path` must contain the same set of rank folders so they can be paired one-to-one.
```
```shell
├── npu_path or bench_path
| ├── rank0
| | ├── dump_tensor_data (only when the `tensor` option is enabled)

View File

@@ -200,10 +200,12 @@ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```
Purpose
- Forces all CPU cores to run under the `performance` governor
- Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)
Benefits
- Keeps CPU cores at maximum frequency
- Reduces latency jitter
- Improves predictability for inference workloads
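As a quick check after applying the command (not part of the original guide), the governor can be read back from sysfs:
```shell
# Every core should now report `performance`
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
```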
@@ -224,6 +226,7 @@ Benefits
- Improves stability for large in-memory models
Notes
- For inference workloads, swap can introduce second-level latency
- Recommended values are `0` or `1`
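One way to apply and persist the recommended value, assuming a standard sysctl setup:
```shell
# Apply the recommended swappiness immediately
sysctl -w vm.swappiness=1
# Persist it across reboots
echo "vm.swappiness=1" >> /etc/sysctl.conf
```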
@@ -244,6 +247,7 @@ Benefits
- Improves performance stability on NUMA systems
Recommended For
- Multi-socket servers
- Ascend / NPU deployments with explicit NUMA binding
- Systems with manually managed CPU and memory affinity
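An illustrative sketch of explicit NUMA binding with `numactl` (the node IDs and the serving command are placeholders, not taken from this guide):
```shell
# Pin both CPU threads and memory allocations to NUMA node 0
numactl --cpunodebind=0 --membind=0 vllm serve Qwen/Qwen2.5-7B-Instruct
```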
@@ -255,14 +259,17 @@ sysctl -w kernel.sched_migration_cost_ns=50000
```
Purpose
- Increases the cost for the scheduler to migrate tasks between CPU cores
Benefits
- Reduces frequent thread migration
- Improves CPU cache locality
- Lowers latency jitter for inference workloads
Parameter Details
- Unit: nanoseconds (ns)
- Typical recommended range: 50000 to 100000
- Higher values encourage threads to stay on the same CPU core
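To keep the setting across reboots, a common pattern (assuming the standard sysctl configuration file) is:
```shell
# Persist the scheduler migration cost and reload kernel parameters
echo "kernel.sched_migration_cost_ns=50000" >> /etc/sysctl.conf
sysctl -p
```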

View File

@@ -1,4 +1,5 @@
# Performance Benchmark
This document details the benchmark methodology for vllm-ascend, aimed at evaluating performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project.
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
@@ -38,10 +39,12 @@ pip install -r benchmarks/requirements-bench.txt
```
## 3. Run basic benchmarks
This section introduces how to run performance tests using the benchmark suite built into vLLM.
### 3.1 Dataset
VLLM supports a variety of (datasets)[https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py].
vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).
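For example, a rough sketch of a fixed-workload serving benchmark run (the model name, dataset path, and prompt count are placeholders; flag names may differ between vLLM versions):
```shell
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 200
```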
<style>
th {

View File

@@ -5,19 +5,20 @@ The execution duration of each stage (including pre/post-processing, model forwa
**To reduce this performance overhead, we added this feature, which uses the NPU event timestamp mechanism to observe device execution time asynchronously.**
## Usage
* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**
```
```shell
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
## Example Output
```
```shell
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms

View File

@@ -15,6 +15,7 @@ pip install msserviceprofiler==1.2.2
```
### 1 Preparation
Before starting the service, set the environment variable `SERVICE_PROF_CONFIG_PATH` to point to the profiling configuration file, and set the environment variable `PROFILING_SYMBOLS_PATH` to specify the YAML configuration file for the symbols that need to be imported. After that, start the vLLM service according to your deployment method.
```bash
@@ -32,6 +33,7 @@ The file `ms_service_profiler_config.json` is the profiling configuration. If it
`service_profiling_symbols.yaml` is the configuration file containing the profiling points to be imported. You can choose **not** to set the `PROFILING_SYMBOLS_PATH` environment variable, in which case the default configuration file will be used. If the file does not exist at the path you specified, the system will likewise generate a configuration file at that path for future configuration. You can customize it according to the instructions in the `Symbols Configuration File` section below.
### 2 Enable Profiling
To enable the performance data collection switch, change the `enable` field from `0` to `1` in the configuration file `ms_service_profiler_config.json`. This can be accomplished by executing the following sed command:
```bash
@@ -39,6 +41,7 @@ sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
```
### 3 Send Requests
Choose a request-sending method that suits your actual profiling needs:
```bash
@@ -65,6 +68,7 @@ msserviceprofiler analyze --input-path=./ --output-path output
### 5 View Results
After analysis, the `output` directory will contain:
- `chrome_tracing.json`: Chrome tracing format data, which can be opened in [MindStudio Insight](https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html).
- `profiler.db`: Performance data in database format.
- `request.csv`: Request-related data.
@@ -77,7 +81,9 @@ After analysis, the `output` directory will contain:
---
## Appendix
(profiling-configuration-file)=
### 1 Profiling Configuration File
The profiling configuration file controls profiling parameters and behavior.
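As a minimal sanity check (only the `enable` switch is described in this guide; the remaining keys depend on your msserviceprofiler version), the current parameters can be inspected with:
```shell
# Pretty-print the profiling configuration and show the collection switch
python3 -m json.tool ./ms_service_profiler_config.json
grep '"enable"' ./ms_service_profiler_config.json
```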
@@ -116,6 +122,7 @@ The configuration is in JSON format. Main parameters:
---
(symbols-configuration-file)=
### 2 Symbols Configuration File
The symbols configuration file defines which functions/methods to profile and supports flexible configuration with custom attribute collection.