[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -53,7 +53,7 @@ To restrict the operators that are captured, configure the `list` block:
- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:

```json
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
"scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.backward.0"]
"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]
@@ -62,9 +62,9 @@ To restrict the operators that are captured, configure the `list` block:
The `level` setting determines what can be provided: modules when `level=L0`, APIs when `level=L1`, and either modules or APIs when `level=mix`.

- `list` (list[str]): Custom operator list. Options include:
    - Supply the full names of specific APIs in PyTorch pynative scenarios to only dump those APIs. Example: `"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]`.
    - When `level=mix`, you can provide module names so that the dump expands to everything produced while the module is running. Example: `"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]`.
    - Provide a substring such as `"list": ["relu"]` to dump every API whose name contains the substring. When `level=mix`, modules whose names contain the substring are also expanded.
Example configuration:
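As an illustrative sketch only, combining the fields documented above (the values are borrowed from the earlier examples and are not a verbatim msprobe sample):

```json
{
    "level": "mix",
    "list": ["relu"],
    "scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
}
```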
@@ -188,7 +188,7 @@ Use `msprobe graph_visualize` to generate results that can be opened inside `tb_
Replace the paths with your dump directories before invoking `msprobe graph_visualize`. **If you only need to build a single graph**, omit `bench_path` to visualize one dump.

Multi-rank scenarios (single rank, multi-rank, or multi-step multi-rank) are also supported. `npu_path` or `bench_path` must contain folders named `rank<N>` (for example, `rank0`), and every rank folder must contain a non-empty `construct.json` together with `dump.json` and `stack.json`. If any `construct.json` is empty, verify that the dump level includes `L0` or `mix`. When comparing graphs, both `npu_path` and `bench_path` must contain the same set of rank folders so they can be paired one-to-one.

```shell
├── npu_path or bench_path
| ├── rank0
| | ├── dump_tensor_data (only when the `tensor` option is enabled)
@@ -200,10 +200,12 @@ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Purpose

- Forces all CPU cores to run under the `performance` governor
- Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)

Benefits

- Keeps CPU cores at maximum frequency
- Reduces latency jitter
- Improves predictability for inference workloads
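The command above can be wrapped in a small helper for scripted setup. This is a sketch, not part of the original document, and it assumes the standard cpufreq sysfs layout:

```shell
# Sketch: write a scaling governor for every CPU under a sysfs-style root.
# The root defaults to the real cpufreq location; pass another root to test.
set_governor() {
  gov="$1"
  root="${2:-/sys/devices/system/cpu}"
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -e "$f" ] || continue
    echo "$gov" > "$f"
  done
}
```

On a real host this would be run as root, e.g. `set_governor performance`.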
@@ -224,6 +226,7 @@ Benefits
- Improves stability for large in-memory models

Notes

- For inference workloads, swap can introduce latency spikes on the order of seconds
- Recommended values are `0` or `1`
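These values are applied through the `vm.swappiness` sysctl, assuming that is the knob this section refers to; a minimal sketch:

```shell
# Keep swapping minimal for latency-sensitive inference (run as root).
sysctl -w vm.swappiness=1

# Persist across reboots; the drop-in path is a common convention, not mandated.
echo 'vm.swappiness=1' > /etc/sysctl.d/99-inference.conf
```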
@@ -244,6 +247,7 @@ Benefits
- Improves performance stability on NUMA systems

Recommended For

- Multi-socket servers
- Ascend / NPU deployments with explicit NUMA binding
- Systems with manually managed CPU and memory affinity
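Explicit binding is commonly done with `numactl`; a sketch, where the node index and the launch command are illustrative placeholders:

```shell
# Inspect the NUMA topology first.
numactl --hardware

# Pin both CPU and memory allocation to NUMA node 0 (illustrative command).
numactl --cpunodebind=0 --membind=0 python3 run_inference.py
```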
@@ -255,14 +259,17 @@ sysctl -w kernel.sched_migration_cost_ns=50000
```

Purpose

- Increases the cost for the scheduler to migrate tasks between CPU cores

Benefits

- Reduces frequent thread migration
- Improves CPU cache locality
- Lowers latency jitter for inference workloads

Parameter Details

- Unit: nanoseconds (ns)
- Typical recommended range: 50000–100000
- Higher values encourage threads to stay on the same CPU core
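To verify and persist the value (note that on recent kernels this tunable may live under debugfs rather than sysctl, so confirm it exists on your system):

```shell
# Verify the current value.
sysctl kernel.sched_migration_cost_ns

# Persist across reboots; the drop-in path is a convention, not mandated.
echo 'kernel.sched_migration_cost_ns=50000' > /etc/sysctl.d/99-sched.conf
```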
@@ -1,4 +1,5 @@
# Performance Benchmark

This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project.

**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
@@ -38,10 +39,12 @@ pip install -r benchmarks/requirements-bench.txt
```

## 3. Run basic benchmarks

This section introduces how to perform performance testing using the benchmark suite built into vLLM.

### 3.1 Dataset

vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).
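As a starting point, a serving-benchmark run against an already-running server might look like the following; the flag names follow vLLM's `benchmark_serving.py` at the time of writing and the model name is a placeholder, so consult `--help` for your version:

```shell
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model <your-model> \
  --dataset-name random \
  --num-prompts 200 \
  --request-rate 4
```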
<style>
th {
@@ -5,19 +5,20 @@ The execution duration of each stage (including pre/post-processing, model forwa
**To reduce this performance overhead, we added this feature, which uses the NPU event timestamp mechanism to observe device execution time asynchronously.**

## Usage

* Use the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.

**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execution duration profiling. Execute the script as follows:**
```shell
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```

## Example Output

```shell
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
@@ -15,6 +15,7 @@ pip install msserviceprofiler==1.2.2
```

### 1 Preparation
Before starting the service, set the environment variable `SERVICE_PROF_CONFIG_PATH` to point to the profiling configuration file, and set the environment variable `PROFILING_SYMBOLS_PATH` to specify the YAML configuration file for the symbols that need to be imported. After that, start the vLLM service according to your deployment method.
```bash
@@ -32,6 +33,7 @@ The file `ms_service_profiler_config.json` is the profiling configuration. If it
`service_profiling_symbols.yaml` is the configuration file containing the profiling points to be imported. You can choose **not** to set the `PROFILING_SYMBOLS_PATH` environment variable, in which case the default configuration file will be used. If the file does not exist at the path you specified, the system will likewise generate a configuration file at that path for future configuration. You can customize it according to the instructions in the `Symbols Configuration File` section below.
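For concreteness, the two environment variables described above can be set like this before launching the service (the paths are hypothetical placeholders; substitute your own):

```shell
# Hypothetical paths -- point these at your actual configuration files.
export SERVICE_PROF_CONFIG_PATH=/path/to/ms_service_profiler_config.json
export PROFILING_SYMBOLS_PATH=/path/to/service_profiling_symbols.yaml
```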
### 2 Enable Profiling

To enable the performance data collection switch, change the `enable` field from `0` to `1` in the configuration file `ms_service_profiler_config.json`. This can be accomplished by executing the following `sed` command:

```bash
@@ -39,6 +41,7 @@ sed -i 's/"enable":\s*0/"enable": 1/' ./ms_service_profiler_config.json
```

### 3 Send Requests

Choose a request-sending method that suits your actual profiling needs:

```bash
@@ -65,6 +68,7 @@ msserviceprofiler analyze --input-path=./ --output-path output

### 5 View Results

After analysis, the `output` directory will contain:

- `chrome_tracing.json`: Chrome tracing format data, which can be opened in [MindStudio Insight](https://www.hiascend.com/document/detail/zh/mindstudio/81RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html).
- `profiler.db`: Performance data in database format.
- `request.csv`: Request-related data.
@@ -77,7 +81,9 @@ After analysis, the `output` directory will contain:
---

## Appendix

(profiling-configuration-file)=

### 1 Profiling Configuration File

The profiling configuration file controls profiling parameters and behavior.
@@ -116,6 +122,7 @@ The configuration is in JSON format. Main parameters:
---

(symbols-configuration-file)=

### 2 Symbols Configuration File

The symbols configuration file defines which functions/methods to profile and supports flexible configuration with custom attribute collection.