[main][Docs] Fix typos across documentation (#6728)

## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
This commit is contained in:
Cao Yi
2026-02-13 15:50:05 +08:00
committed by GitHub
parent b6bc3d2f9d
commit 6de207de88
50 changed files with 273 additions and 272 deletions


@@ -10,7 +10,7 @@ The current process for registering and obtaining quantization methods in vLLM A
![get_quant_method](../../assets/quantization/get_quant_method.png)
vLLM Ascend registers a custom ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
vLLM Ascend registers a custom Ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:


@@ -32,7 +32,7 @@ Install additional dependencies if you need to visualize the captured data.
## 2. Collecting Data with `msprobe`
We generally follow a coarse-to-fine strategy when capturing data. First identify the token where the issue shows up, and then decide which range needs to be sampled around that token. The typical workflow is described below.
We generally follow a coarse-to-fine strategy when capturing data. First, identify the token where the issue shows up, and then decide which range needs to be sampled around that token. The typical workflow is described below.
### 2.1 Prepare the dump configuration file
@@ -42,7 +42,7 @@ Create a `config.json` that can be parsed by `PrecisionDebugger` and place it in
|:---:|:----|:---:|
| `task` | Type of dump task. Common PyTorch values include `"statistics"` and `"tensor"`. A statistics task collects tensor statistics (mean, variance, max, min, etc.) while a tensor task captures arbitrary tensors. | Yes |
| `dump_path` | Directory where dump results are stored. When omitted, `msprobe` uses its default path. | No |
| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks you must set this field to `[]`. | No |
| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks, you must set this field to `[]`. | No |
| `step` | Token iteration(s) to sample. An empty list means every iteration. | No |
| `level` | Dump level string (`"L0"`, `"L1"`, or `"mix"`). `L0` targets `nn.Module`, `L1` targets `torch.api`, and `mix` collects both. | Yes |
| `async_dump` | Whether to enable asynchronous dump (supported for PyTorch `statistics`/`tensor` tasks). Defaults to `false`. | No |
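For reference, the fields in the table above can be assembled into a minimal `config.json` for a statistics dump. This is an illustrative sketch only — the concrete values (dump path, level) are placeholder assumptions; consult the `msprobe` documentation for the full schema.

```python
import json

# Minimal msprobe dump configuration built from the fields in the table
# above. Values here are illustrative placeholders, not recommendations.
config = {
    "task": "statistics",    # collect tensor statistics (mean, variance, max, min)
    "dump_path": "./msprobe_dump",
    "rank": [],              # empty list: collect every rank ([] is required for single-card)
    "step": [],              # empty list: sample every token iteration
    "level": "L1",           # "L1" targets torch.api
    "async_dump": False,
}

# Serialize to the config.json consumed by PrecisionDebugger.
config_json = json.dumps(config, indent=2)
print(config_json)
```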
@@ -51,7 +51,7 @@ Create a `config.json` that can be parsed by `PrecisionDebugger` and place it in
To restrict the operators that are captured, configure the `list` block:
- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:
- `scope` (list[str]): In PyTorch PyNative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:
```json
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
```


@@ -61,7 +61,7 @@ VLLM_USE_MODELSCOPE=true
Please follow the [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) to make sure vLLM and vllm-ascend are installed correctly.
:::{note}
Make sure your vLLM and vllm-ascend are installed after your python configuration is completed, because these packages will build binary files using python in current environment. If you install vLLM and vllm-ascend before completing section 1.1, the binary files will not use the optimized python.
Make sure your vLLM and vllm-ascend are installed after your Python configuration is completed, because these packages will build binary files using python in current environment. If you install vLLM and vllm-ascend before completing section 1.1, the binary files will not use the optimized python.
:::
## Optimizations
@@ -102,7 +102,7 @@ export PATH=/usr/bin:/usr/local/python/bin:$PATH
#### 2.1. jemalloc
**jemalloc** is a memory allocator that improves performance for multi-thread scenarios and can reduce memory fragmentation. jemalloc uses local thread memory manager to allocate variables, which can avoid lock competition between threads and can hugely optimize performance.
**jemalloc** is a memory allocator that improves performance for multi-threaded scenarios and can reduce memory fragmentation. jemalloc uses a local thread memory manager to allocate variables, which can avoid lock competition between threads and can hugely optimize performance.
```{code-block} bash
:substitutions:
@@ -116,7 +116,7 @@ export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
#### 2.2. Tcmalloc
**Tcmalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex competition and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
```{code-block} bash
:substitutions:
@@ -188,7 +188,7 @@ Plus, there are more features for performance optimization in specific scenarios
This section describes operating system-level optimizations applied on the host machine (bare metal or Kubernetes node) to improve performance stability, latency, and throughput for inference workloads.
:::{note}
These settings must be applied on the host OS and with root privileges. not inside containers.
These settings must be applied on the host OS and with root privileges. Not inside containers.
:::
#### 5.1


@@ -222,7 +222,7 @@ vllm bench serve \
--backend openai-embeddings \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--num-prompt 10 \
--num-prompts 10 \
--dataset-path <your dataset path>/datasets/ShareGPT_V3_unfiltered_cleaned_split.json
```


@@ -101,8 +101,8 @@ The configuration is in JSON format. Main parameters:
| npu_memory_usage_freq | Sampling frequency of NPU memory utilization. Disabled by default. Range: integer 1–50, unit: Hz (times per second). Set to -1 to disable. <br />Note: Enabling this may consume significant memory. | No |
| acl_task_time | Switch to collect operator dispatch latency and execution latency: <br />0: disable (default; 0 or invalid values mean disabled).<br />1: enable; calls `aclprofCreateConfig` with `ACL_PROF_TASK_TIME_L0`.<br />2: enable MSPTI-based data dumping; uses MSPTI for profiling and requires: `export LD_PRELOAD=$ASCEND_TOOLKIT_HOME/lib64/libmspti.so` | No |
| acl_prof_task_time_level | Level and duration for profiling: <br />L0: collect operator dispatch and execution latency only; lower overhead (no operator basic info).<br />L1: collect AscendCL interface performance (host–device and inter-device sync/async memory copy latencies), plus operator dispatch, execution, and basic info for comprehensive analysis.<br />time: profiling duration, integer 1–999, in seconds.<br />If unset, defaults to L0 until program exit; invalid values fall back to defaults.<br />Level and duration can be combined, e.g., `"acl_prof_task_time_level": "L1,10"`. | No |
| api_filter | Filter to select API performance data to dump. For example, specifying "matmul" dumps all API data whose `name` contains "matmul". String, case-sensitive; use "" to separate multiple targets. Empty means dump all. <br />Effective only when `acl_task_time` is 2. | No |
| kernel_filter | Filter to select kernel performance data to dump. For example, specifying "matmul" dumps all kernel data whose `name` contains "matmul". String, case-sensitive; use "" to separate multiple targets. Empty means dump all. <br />Effective only when `acl_task_time` is 2. | No |
| api_filter | Filter to select API performance data to dump. For example, specifying "matmul" dumps all API data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all. <br />Effective only when `acl_task_time` is 2. | No |
| kernel_filter | Filter to select kernel performance data to dump. For example, specifying "matmul" dumps all kernel data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all. <br />Effective only when `acl_task_time` is 2. | No |
| timelimit | Profiling duration for the service. The process stops automatically after this time. Range: integer 0–7200, unit: seconds. Default 0 means unlimited. | No |
| domain | Limit profiling to the specified domains to reduce data volume. String, separated by semicolons, case-sensitive, e.g., "Request; KVCache".<br />Empty means all available domains.<br />Available domains: Request, KVCache, ModelExecute, BatchSchedule, Communication.<br />Note: If the selected domains are incomplete, analysis output may show warnings due to missing data. See [Reference Table 1](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/devaids/Profiling/mindieprofiling_0009.html#ZH-CN_TOPIC_0000002370256365__table1985410131831). | No |
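As a sketch of how the `;`-separated filter strings above behave: the field names below come from the table, while the `matches` function is a hypothetical illustration of the described substring semantics, not the tool's actual implementation.

```python
# Profiling config fragment using the fields from the table above.
# acl_task_time must be 2 for api_filter / kernel_filter to take effect.
profiling_config = {
    "acl_task_time": 2,
    "api_filter": "matmul;flashattn",  # ';'-separated, case-sensitive targets
    "kernel_filter": "",               # empty string: dump all kernel data
    "domain": "Request;KVCache",
    "timelimit": 60,                   # stop profiling after 60 seconds
}

def matches(name: str, filter_str: str) -> bool:
    """Assumed filter semantics: an empty filter dumps everything;
    otherwise dump entries whose name contains any target substring."""
    targets = [t for t in filter_str.split(";") if t]
    return not targets or any(t in name for t in targets)

print(matches("aclnn_matmul_v2", profiling_config["api_filter"]))  # True
```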
@@ -129,7 +129,7 @@ The symbols configuration file defines which functions/methods to profile and su
#### 2.1 File Name and Loading
- Default load path:`~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml`( According to the installed version of vllm )
- Default load path:`~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml`(According to the installed version of vllm )
If you need to customize the profiling points, it is highly recommended to copy a profiling configuration file to your working directory using the `PROFILING_SYMBOLS_PATH` environment variable.
@@ -143,7 +143,7 @@ If you need to customize the profiling points, it is highly recommended to copy
| name | Event name | `"EngineCoreExecute"` |
| min_version | Upper version constraint | `"0.9.1"` |
| max_version | Lower version constraint | `"0.11.0"` |
| attributes | Custom attribute collection | Only support for `"timer"` handler. See the section below |
| attributes | Custom attribute collection | Only supported for `"timer"` handler. See the section below |
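Taking the `min_version`/`max_version` field names at face value as lower and upper bounds on the installed vllm version, a version gate could be sketched as below. The tuple-based comparison and the `version_in_range` helper are assumptions for illustration; the real loader may use a dedicated version parser.

```python
# Hypothetical version gate for a symbols entry, using the example values
# from the table above. Dotted-version tuple comparison is an assumption.
def parse_version(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

def version_in_range(installed: str, min_version: str, max_version: str) -> bool:
    return parse_version(min_version) <= parse_version(installed) <= parse_version(max_version)

entry = {
    "name": "EngineCoreExecute",
    "min_version": "0.9.1",
    "max_version": "0.11.0",
}

# "0.10.0" lies between 0.9.1 and 0.11.0 under tuple comparison.
print(version_in_range("0.10.0", entry["min_version"], entry["max_version"]))
```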
#### 2.3 Examples