diff --git a/benchmarks/README.md b/benchmarks/README.md
index 141baa7c..2a2fe8ec 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -145,7 +145,7 @@ These files contain detailed benchmarking results for further analysis.
#### Use benchmark cli
For more flexible and customized use, benchmark cli is also provided to run online/offline benchmarks
-Similarly, let’s take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
+Similarly, let's take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
##### Online serving
diff --git a/docs/source/community/user_stories/llamafactory.md b/docs/source/community/user_stories/llamafactory.md
index 6f3cc530..c96a798d 100644
--- a/docs/source/community/user_stories/llamafactory.md
+++ b/docs/source/community/user_stories/llamafactory.md
@@ -4,7 +4,7 @@
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
-LLaMA-Facotory users need to evaluate and inference the model after fine-tuning.
+LLaMA-Factory users need to evaluate and inference the model after fine-tuning.
**Business challenge**
diff --git a/docs/source/developer_guide/feature_guide/quantization.md b/docs/source/developer_guide/feature_guide/quantization.md
index cfade88d..569b5fb7 100644
--- a/docs/source/developer_guide/feature_guide/quantization.md
+++ b/docs/source/developer_guide/feature_guide/quantization.md
@@ -10,7 +10,7 @@ The current process for registering and obtaining quantization methods in vLLM A

-vLLM Ascend registers a custom ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
+vLLM Ascend registers a custom Ascend quantization method. By configuring the `--quantization ascend` parameter (or `quantization="ascend"` for offline), the quantization feature is enabled. When constructing the `quant_config`, the registered `AscendModelSlimConfig` is initialized and `get_quant_method` is called to obtain the quantization method corresponding to each weight part, stored in the `quant_method` attribute.
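+
+For example, a minimal sketch of enabling this method for online serving (the model path below is only a placeholder for an Ascend-quantized checkpoint):
+
+```bash
+vllm serve /path/to/your-w8a8-model --quantization ascend
+```
+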
Currently supported quantization methods include `AscendLinearMethod`, `AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding non-quantized methods:
diff --git a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md
index 30558f00..b6e8779e 100644
--- a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md
+++ b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md
@@ -32,7 +32,7 @@ Install additional dependencies if you need to visualize the captured data.
## 2. Collecting Data with `msprobe`
-We generally follow a coarse-to-fine strategy when capturing data. First identify the token where the issue shows up, and then decide which range needs to be sampled around that token. The typical workflow is described below.
+We generally follow a coarse-to-fine strategy when capturing data. First, identify the token where the issue shows up, and then decide which range needs to be sampled around that token. The typical workflow is described below.
### 2.1 Prepare the dump configuration file
@@ -42,7 +42,7 @@ Create a `config.json` that can be parsed by `PrecisionDebugger` and place it in
|:---:|:----|:---:|
| `task` | Type of dump task. Common PyTorch values include `"statistics"` and `"tensor"`. A statistics task collects tensor statistics (mean, variance, max, min, etc.) while a tensor task captures arbitrary tensors. | Yes |
| `dump_path` | Directory where dump results are stored. When omitted, `msprobe` uses its default path. | No |
-| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks you must set this field to `[]`. | No |
+| `rank` | Ranks to sample. An empty list collects every rank. For single-card tasks, you must set this field to `[]`. | No |
| `step` | Token iteration(s) to sample. An empty list means every iteration. | No |
| `level` | Dump level string (`"L0"`, `"L1"`, or `"mix"`). `L0` targets `nn.Module`, `L1` targets `torch.api`, and `mix` collects both. | Yes |
| `async_dump` | Whether to enable asynchronous dump (supported for PyTorch `statistics`/`tensor` tasks). Defaults to `false`. | No |
@@ -51,7 +51,7 @@ Create a `config.json` that can be parsed by `PrecisionDebugger` and place it in
To restrict the operators that are captured, configure the `list` block:
-- `scope` (list[str]): In PyTorch pynative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:
+- `scope` (list[str]): In PyTorch PyNative scenarios this field restricts the dump range. Provide two module or API names that follow the tool's naming convention to lock a range; only data between the two names will be dumped. Examples:
```json
"scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"]
diff --git a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md
index 0ba3acd7..3f531ab8 100644
--- a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md
+++ b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md
@@ -61,7 +61,7 @@ VLLM_USE_MODELSCOPE=true
Please follow the [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) to make sure vLLM and vllm-ascend are installed correctly.
:::{note}
-Make sure your vLLM and vllm-ascend are installed after your python configuration is completed, because these packages will build binary files using python in current environment. If you install vLLM and vllm-ascend before completing section 1.1, the binary files will not use the optimized python.
+Make sure vLLM and vllm-ascend are installed after your Python configuration is completed, because these packages build binary files using the Python in the current environment. If you install vLLM and vllm-ascend before completing section 1.1, the binary files will not use the optimized Python.
:::
## Optimizations
@@ -102,7 +102,7 @@ export PATH=/usr/bin:/usr/local/python/bin:$PATH
#### 2.1. jemalloc
-**jemalloc** is a memory allocator that improves performance for multi-thread scenarios and can reduce memory fragmentation. jemalloc uses local thread memory manager to allocate variables, which can avoid lock competition between threads and can hugely optimize performance.
+**jemalloc** is a memory allocator that improves performance for multi-threaded scenarios and can reduce memory fragmentation. jemalloc uses a local thread memory manager to allocate variables, which can avoid lock competition between threads and can hugely optimize performance.
```{code-block} bash
:substitutions:
@@ -116,7 +116,7 @@ export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
#### 2.2. Tcmalloc
-**Tcmalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex competition and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
+**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
```{code-block} bash
:substitutions:
@@ -188,7 +188,7 @@ Plus, there are more features for performance optimization in specific scenarios
This section describes operating system–level optimizations applied on the host machine (bare metal or Kubernetes node) to improve performance stability, latency, and throughput for inference workloads.
:::{note}
-These settings must be applied on the host OS and with root privileges. not inside containers.
+These settings must be applied on the host OS with root privileges, not inside containers.
:::
#### 5.1
diff --git a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
index 47aacb86..b2aecfdb 100644
--- a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
+++ b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
@@ -222,7 +222,7 @@ vllm bench serve \
--backend openai-embeddings \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
- --num-prompt 10 \
+ --num-prompts 10 \
--dataset-path /datasets/ShareGPT_V3_unfiltered_cleaned_split.json
```
diff --git a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md
index d8b91b7a..ee62b2c4 100644
--- a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md
+++ b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md
@@ -101,8 +101,8 @@ The configuration is in JSON format. Main parameters:
| npu_memory_usage_freq | Sampling frequency of NPU memory utilization. Disabled by default. Range: integer 1–50, unit: Hz (times per second). Set to -1 to disable.
Note: Enabling this may consume significant memory. | No |
| acl_task_time | Switch to collect operator dispatch latency and execution latency:
0: disable (default; 0 or invalid values mean disabled).
1: enable; calls `aclprofCreateConfig` with `ACL_PROF_TASK_TIME_L0`.
2: enable MSPTI-based data dumping; uses MSPTI for profiling and requires: `export LD_PRELOAD=$ASCEND_TOOLKIT_HOME/lib64/libmspti.so` | No |
| acl_prof_task_time_level | Level and duration for profiling:
L0: collect operator dispatch and execution latency only; lower overhead (no operator basic info).
L1: collect AscendCL interface performance (host–device and inter-device sync/async memory copy latencies), plus operator dispatch, execution, and basic info for comprehensive analysis.
time: profiling duration, integer 1–999, in seconds.
If unset, defaults to L0 until program exit; invalid values fall back to defaults.
Level and duration can be combined, e.g., `"acl_prof_task_time_level": "L1,10"`. | No |
-| api_filter | Filter to select API performance data to dump. For example, specifying "matmul" dumps all API data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all.
Effective only when `acl_task_time` is 2. | No |
-| kernel_filter | Filter to select kernel performance data to dump. For example, specifying "matmul" dumps all kernel data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all.
Effective only when `acl_task_time` is 2. | No |
+| api_filter | Filter to select API performance data to dump. For example, specifying "matmul" dumps all API data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all.
Effective only when `acl_task_time` is 2. | No |
+| kernel_filter | Filter to select kernel performance data to dump. For example, specifying "matmul" dumps all kernel data whose `name` contains "matmul". String, case-sensitive; use ";" to separate multiple targets. Empty means dump all.
Effective only when `acl_task_time` is 2. | No |
| timelimit | Profiling duration for the service. The process stops automatically after this time. Range: integer 0–7200, unit: seconds. Default 0 means unlimited. | No |
| domain | Limit profiling to the specified domains to reduce data volume. String, separated by semicolons, case-sensitive, e.g., "Request; KVCache".
Empty means all available domains.
Available domains: Request, KVCache, ModelExecute, BatchSchedule, Communication.
Note: If the selected domains are incomplete, analysis output may show warnings due to missing data. See [Reference Table 1](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/devaids/Profiling/mindieprofiling_0009.html#ZH-CN_TOPIC_0000002370256365__table1985410131831). | No |
@@ -129,7 +129,7 @@ The symbols configuration file defines which functions/methods to profile and su
#### 2.1 File Name and Loading
-- Default load path:`~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml`( According to the installed version of vllm )
+- Default load path: `~/.config/vllm_ascend/service_profiling_symbols.MAJOR.MINOR.PATCH.yaml` (according to the installed version of vllm)
If you need to customize the profiling points, it is highly recommended to copy a profiling configuration file to your working directory using the `PROFILING_SYMBOLS_PATH` environment variable.
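+
+For instance, a sketch of that workflow (file names below are illustrative; match them to your installed vLLM version):
+
+```bash
+cp ~/.config/vllm_ascend/service_profiling_symbols.*.yaml ./my_symbols.yaml
+export PROFILING_SYMBOLS_PATH=./my_symbols.yaml
+```
+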
@@ -143,7 +143,7 @@ If you need to customize the profiling points, it is highly recommended to copy
| name | Event name | `"EngineCoreExecute"` |
| min_version | Upper version constraint | `"0.9.1"` |
| max_version | Lower version constraint | `"0.11.0"` |
-| attributes | Custom attribute collection | Only support for `"timer"` handler. See the section below |
+| attributes | Custom attribute collection | Only supported for `"timer"` handler. See the section below |
#### 2.3 Examples
diff --git a/docs/source/faqs.md b/docs/source/faqs.md
index 11225dda..020af72a 100644
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@@ -2,14 +2,14 @@
## Version Specific FAQs
-- [[v0.14.0c1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6148)
+- [[v0.14.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6148)
- [[v0.13.0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6583)
## General FAQs
### 1. What devices are currently supported?
-Currently, **ONLY** Atlas A2 series(Ascend-cann-kernels-910b),Atlas A3 series(Atlas-A3-cann-kernels) and Atlas 300I(Ascend-cann-kernels-310p) series are supported:
+Currently, **ONLY** the Atlas A2 series (Ascend-cann-kernels-910b), Atlas A3 series (Atlas-A3-cann-kernels), and Atlas 300I series (Ascend-cann-kernels-310p) are supported:
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
@@ -23,7 +23,7 @@ Below series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
-From a technical view, vllm-ascend support would be possible if the torch-npu is supported. Otherwise, we have to implement it by using custom ops. We also welcome you to join us to improve together.
+From a technical view, vllm-ascend support would be possible if torch-npu is supported. Otherwise, we have to implement it by using custom ops. We also welcome you to join us to improve together.
### 2. How to get our docker containers?
@@ -92,7 +92,7 @@ Basically, the reason is that the NPU environment is not configured correctly. Y
2. try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable CANN package.
3. try `npu-smi info` to check whether the NPU is working.
-If all above steps are not working, you can try the following code with python to check whether there is any error:
+If the above steps are not working, you can try the following code in Python to check whether there are any errors:
```python
import torch
@@ -104,7 +104,7 @@ If all above steps are not working, feel free to submit a GitHub issue.
### 7. How vllm-ascend work with vLLM?
-vllm-ascend is a hardware plugin for vLLM. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
+`vllm-ascend` is a hardware plugin for vLLM. The version of `vllm-ascend` is the same as the version of `vllm`. For example, if you use `vllm` 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we ensure that `vllm-ascend` and `vllm` are compatible at every commit.
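+
+For example, to install a matching pair from pre-built wheels (the version below is only illustrative):
+
+```bash
+pip install vllm==0.9.1 vllm-ascend==0.9.1
+```
+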
### 8. Does vllm-ascend support Prefill Disaggregation feature?
@@ -112,7 +112,7 @@ Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend.
### 9. Does vllm-ascend support quantization method?
-Currently, w8a8, w4a8 and w4a4 quantization methods are already supported by vllm-ascend.
+Currently, w8a8, w4a8, and w4a4 quantization methods are already supported by vllm-ascend.
### 10. How to run a W8A8 DeepSeek model?
@@ -120,21 +120,21 @@ Follow the [inference tutorial](https://docs.vllm.ai/projects/ascend/en/latest/t
### 11. How is vllm-ascend tested?
-vllm-ascend is tested in three aspects, functions, performance, and accuracy.
+vllm-ascend is tested in three aspects: functions, performance, and accuracy.
-- **Functional test**: We added CI, including part of vllm's native unit tests and vllm-ascend's own unit tests. On vllm-ascend's test, we test basic functionalities, popular model availability, and [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html) through E2E test.
+- **Functional test**: We added CI, including part of vllm's native unit tests and vllm-ascend's own unit tests. In vllm-ascend's tests, we test basic functionalities, popular model availability, and [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html) through E2E test.
-- **Performance test**: We provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for E2E performance benchmark, which can be easily re-routed locally. We will publish a perf website to show the performance test results for each pull request.
+- **Performance test**: We provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for E2E performance benchmark, which can be easily re-run locally. We will publish a perf website to show the performance test results for each pull request.
- **Accuracy test**: We are working on adding accuracy test to the CI as well.
- **Nightly test**: we'll run full test every night to make sure the code is working.
-Finally, for each release, we'll publish the performance test and accuracy test report in the future.
+In the future, we'll publish the performance and accuracy test reports for each release.
### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
-The problem is usually caused by the installation of a dev or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
+The problem is usually caused by the installation of a development or editable version of the vLLM package. In this case, set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
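+
+For example (the version below is illustrative; use the version reported by `pip show vllm`):
+
+```bash
+export VLLM_VERSION=0.9.1
+```
+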
### 13. How to handle the out-of-memory issue?
@@ -142,17 +142,17 @@ OOM errors typically occur when the model exceeds the memory capacity of a singl
In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
-- **Limit `--max-model-len`**: It can save the HBM usage for kv cache initialization step.
+- **Limit `--max-model-len`**: This reduces HBM usage during the KV cache initialization step.
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
-- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
+- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
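+
+A combined sketch of these mitigations (the model path and values below are illustrative; tune them for your workload):
+
+```bash
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+vllm serve /path/to/your-model --max-model-len 8192 --gpu-memory-utilization 0.8
+```
+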
### 14. Failed to enable NPU graph mode when running DeepSeek
Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
-And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads/num_kv_heads is {32, 64, 128}.
+And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, `num_heads`/`num_kv_heads` is {32, 64, 128}.
```bash
[rank0]: RuntimeError: EZ9999: Inner Error!
@@ -161,13 +161,13 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
-You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
+You may encounter the problem of C/C++ compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
### 16. How to generate deterministic results when using vllm-ascend?
-There are several factors that affect output certainty:
+There are several factors that affect output determinism:
-1. Sampler method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
+1. Sampler method: using **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams
@@ -203,8 +203,8 @@ export ATB_LLM_LCOC_ENABLE=0
### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model?
-The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`.
-This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensure that the audio processing functionality works correctly.
+The `Qwen2.5-Omni` model requires the `librosa` package. To ensure all dependencies are met, install the `qwen-omni-utils` package by running `pip install qwen-omni-utils`.
+This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring that the audio processing functionality works correctly.
### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
@@ -217,10 +217,10 @@ ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient availab
Recommended mitigation strategies:
1. Manually configure the compilation_config parameter with a reduced size set: '{"cudagraph_capture_sizes":[size1, size2, size3, ...]}'.
-2. Employ ACLgraph's full graph mode as an alternative to the piece-wise approach.
+2. Employ ACLgraph's full graph mode as an alternative to the piecewise approach.
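+
+For instance, a sketch of the first mitigation (the model path and size list are illustrative):
+
+```bash
+vllm serve /path/to/your-model --compilation-config '{"cudagraph_capture_sizes":[1,2,4,8]}'
+```
+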
Root cause analysis:
-The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
+The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream-overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 19. How to install custom version of torch_npu?
@@ -228,7 +228,7 @@ torch-npu will be overridden when installing vllm-ascend. If you need to instal
### 20. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
-On certain operating systems, such as Kylin OS , you may encounter an `invalid tar header` error during the `docker pull` process:
+On certain operating systems, such as Kylin OS, you may encounter an `invalid tar header` error during the `docker pull` process:
```text
failed to register layer: ApplyLayer exit status 1 stdout: stderr: archive/tar: invalid tar header
@@ -259,7 +259,7 @@ When using `--shm-size`, you may need to add the `--privileged=true` flag to you
### 22. How to achieve low latency in a small batch scenario?
-The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenario is not satisfactory, mainly due to the lack of flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`, you can install it by the following instruction:
+The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenarios is not satisfactory, mainly due to the lack of a flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`; you can install it using the following instruction:
```bash
bash tools/install_flash_infer_attention_score_ops_a2.sh
@@ -267,5 +267,5 @@ bash tools/install_flash_infer_attention_score_ops_a2.sh
# bash tools/install_flash_infer_attention_score_ops_a3.sh
```
-**NOTE**: Don't set `additional_config.pa_shape_list` when using this method, otherwise it will lead to another attention operator.
-**Important**: Please make sure you're using the **official image** of vllm-ascend, otherwise you **must change** the directory `/vllm-workspace` in `tools/install_flash_infer_attention_score_ops_a2.sh` or `tools/install_flash_infer_attention_score_ops_a3.sh` to your own or create one. If you're not in root user, you need `sudo` permission to run this script.
+**NOTE**: Don't set `additional_config.pa_shape_list` when using this method; otherwise, it will lead to another attention operator.
+**Important**: Please make sure you're using the **official image** of `vllm-ascend`; otherwise, you **must change** the directory `/vllm-workspace` in `tools/install_flash_infer_attention_score_ops_a2.sh` or `tools/install_flash_infer_attention_score_ops_a3.sh` to your own, or create one. If you're not the root user, you need `sudo` **privileges** to run this script.
diff --git a/docs/source/index.md b/docs/source/index.md
index a8a2d28d..76f971c3 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -21,7 +21,7 @@
:::
-vLLM Ascend plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU.
+vLLM Ascend plugin (vllm-ascend) is a community-maintained hardware plugin for running vLLM on the Ascend NPU.
This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
diff --git a/docs/source/installation.md b/docs/source/installation.md
index d316608e..bff1cf31 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -6,25 +6,25 @@ This document describes how to install vllm-ascend manually.
- OS: Linux
- Python: >= 3.10, < 3.12
-- A hardware with Ascend NPU. It's usually the Atlas 800 A2 series.
+- Hardware with Ascend NPUs. It's usually the Atlas 800 A2 series.
- Software:
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
- | Ascend HDK | Refer to [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
- | CANN | == 8.5.0 | Required for vllm-ascend and torch-npu |
+ | Ascend HDK | Refer to the documentation [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
+ | CANN | == 8.5.0 | Required for vllm-ascend and torch-npu |
| torch-npu | == 2.9.0 | Required for vllm-ascend, No need to install manually, it will be auto installed in below steps |
| torch | == 2.9.0 | Required for torch-npu and vllm |
| NNAL | == 8.5.0 | Required for libatb.so, enables advanced tensor operations |
There are two installation methods:
-- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
+- **Using pip**: first prepare the environment manually or via a CANN image, then install `vllm-ascend` using pip.
- **Using docker**: use the `vllm-ascend` pre-built docker image directly.
## Configure Ascend CANN environment
-Before installation, you need to make sure firmware/driver and CANN are installed correctly, refer to [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.
+Before installation, make sure the firmware/driver and CANN are installed correctly. Refer to the [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.
### Configure hardware environment
@@ -48,7 +48,7 @@ Refer to [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/
The easiest way to prepare your software environment is using CANN image directly:
```{note}
-The CANN prebuilt image includes NNAL (Ascend Neural Network Acceleration Library) which provides libatb.so for advanced tensor operations. No additional installation is required when using the prebuilt image.
+The CANN prebuilt image includes NNAL (Ascend Neural Network Acceleration Library), which provides libatb.so for advanced tensor operations. No additional installation is required when using the prebuilt image.
```
```{code-block} bash
@@ -112,15 +112,15 @@ source /usr/local/Ascend/nnal/atb/set_env.sh
::::{tab-item} Before using docker
:sync: docker
-No more extra step if you are using `vllm-ascend` prebuilt Docker image.
+No extra steps are needed if you are using the `vllm-ascend` prebuilt Docker image.
::::
:::::
-Once it is done, you can start to set up `vllm` and `vllm-ascend`.
+Once this is done, you can start to set up `vllm` and `vllm-ascend`.
## Set up using Python
-First install system dependencies and configure pip mirror:
+First, install system dependencies and configure the pip mirror:
```bash
# Using apt-get with mirror
@@ -139,7 +139,7 @@ pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/si
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/"
```
-Then you can install `vllm` and `vllm-ascend` from **pre-built wheel**:
+Then you can install `vllm` and `vllm-ascend` from a **pre-built wheel**:
```{code-block} bash
:substitutions:
@@ -171,12 +171,12 @@ pip install -v -e .
cd ..
```
-If you are building custom operators for Atlas A3, you should run `git submodule update --init --recursive` manually, or ensure your environment has Internet access.
+If you are building custom operators for Atlas A3, you should run `git submodule update --init --recursive` manually, or ensure your environment has internet access.
:::
```{note}
-To build custom operators, gcc/g++ higher than 8 and c++ 17 or higher is required. If you're using `pip install -e .` and encounter a torch-npu version conflict, please install with `pip install --no-build-isolation -e .` to build on system env.
-If you encounter other problems during compiling, it is probably because unexpected compiler is being used, you may export `CXX_COMPILER` and `C_COMPILER` in environment to specify your g++ and gcc locations before compiling.
+To build custom operators, gcc/g++ higher than 8 and C++17 or higher are required. If you are using `pip install -e .` and encounter a torch-npu version conflict, please install with `pip install --no-build-isolation -e .` to build against the system environment.
+If you encounter other problems during compilation, it is probably because an unexpected compiler is being used; you may export `CXX_COMPILER` and `C_COMPILER` in the environment to specify your g++ and gcc locations before compiling, as sketched after this note.
```
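+
+A sketch of pointing the build at specific compilers (paths below are illustrative):
+
+```bash
+export C_COMPILER=/usr/bin/gcc-10
+export CXX_COMPILER=/usr/bin/g++-10
+pip install --no-build-isolation -e .
+```
+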
## Set up using Docker
diff --git a/docs/source/quick_start.md b/docs/source/quick_start.md
index 4553108a..e46ed1a9 100644
--- a/docs/source/quick_start.md
+++ b/docs/source/quick_start.md
@@ -99,7 +99,7 @@ There are two ways to start vLLM on Ascend NPU:
:::::{tab-set}
::::{tab-item} Offline Batched Inference
-With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing).
+With vLLM installed, you can start generating texts for a list of input prompts (i.e., offline batch inference).
Try to run below Python script directly or use `python3` shell to generate texts:
diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md
index eb179d8e..40efbe91 100644
--- a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md
+++ b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md
@@ -1,6 +1,6 @@
# Long-Sequence Context Parallel (Deepseek)
-## Getting Start
+## Getting Started
:::{note}
Context parallel feature currently is only supported on Atlas A3 device, and will be supported on Atlas A2 in the future.
@@ -8,13 +8,13 @@ Context parallel feature currently is only supported on Atlas A3 device, and wil
vLLM-Ascend now supports long sequence with context parallel options. This guide takes one-by-one steps to verify these features with constrained resources.
-Take the Deepseek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy the “1P1D” architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the ip of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1). On each server, use 8 NPUs 16 chips to deploy one service instance.In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small operator overhead. Therefore, we will not enable the DCP feature on node d.
+Take the Deepseek-V3.1-w8a8 model as an example and use 3 Atlas 800T A3 servers to deploy the "1P1D" architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder server is 192.0.0.3 (decoder 1). On each server, use 8 NPUs (16 chips) to deploy one service instance. In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small operator overhead. Therefore, we will not enable the DCP feature on node d.
## Environment Preparation
### Model Weight
-- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
+- `DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
@@ -24,9 +24,9 @@ Refer to [verify multi-node communication environment](../../installation.md#ver
### Installation
-You can use our official docker image to run `DeepSeek-V3.1` directly.
+You can use our official Docker image to run `DeepSeek-V3.1` directly.
-Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).
+Select an image based on your machine type and start the Docker image on your node, refer to [using Docker](../../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
@@ -35,7 +35,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running a bridge network with Docker, please expose the necessary ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -273,7 +273,7 @@ vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
:::::
-2. Prefill master node `proxy.sh` scripts
+2. Prefill master node `proxy.sh` script
```shell
python load_balance_proxy_server_example.py \
@@ -289,7 +289,7 @@ python load_balance_proxy_server_example.py \
8004
```
-3. run proxy
+3. Run proxy
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
@@ -318,12 +318,12 @@ The parameters are explained as follows:
"cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
- "cudagraph_capture_sizes": represents different levels of graph modes. The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. Currently, the default setting is recommended. Only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tensor-parallel-size > 1.
-- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the pd co-locate deployment scenario. It will be removed in the future.
+- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD architecture but not needed in the PD co-locate deployment scenario. It will be removed in the future.
**Notice:**
- tensor-parallel-size needs to be divisible by decode-context-parallel-size.
-- decode-context-parallel-size must less than or equal to tensor-parallel-size.
+- decode-context-parallel-size must be less than or equal to tensor-parallel-size.
## Accuracy Evaluation
@@ -361,7 +361,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompt 20 --request-rate 0 --save-result --result-dir ./
+vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md
index ef84706f..dc43a45a 100644
--- a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md
+++ b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md
@@ -1,16 +1,16 @@
# Long-Sequence Context Parallel (Qwen3-235B-A22B)
-## Getting Start
+## Getting Started
vLLM-Ascend now supports long-sequence context parallel. This guide takes one-by-one steps to verify these features with constrained resources.
-Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture.
+Using the `Qwen3-235B-A22B-w8a8` (Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture.
## Environment Preparation
### Model Weight
-- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
+- `Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
@@ -25,7 +25,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running a bridge network with Docker, please expose the necessary ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -65,7 +65,7 @@ docker run --rm \
### Single-node Deployment
`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16).
-Quantized version need to start with parameter `--quantization ascend`.
+The quantized version needs to be started with the parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
@@ -111,8 +111,8 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
The parameters are explained as follows:
- `--tensor-parallel-size` 8 are common settings for tensor parallelism (TP) sizes.
-- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism PCP) sizes.
-- `--decode-context-parallel-size` 2 are common settings for decode context parallelism DCP) sizes.
+- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism (PCP) sizes.
+- `--decode-context-parallel-size` 2 are common settings for decode context parallelism (DCP) sizes.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkPrefill/SplitFuse by default, which means:
@@ -131,7 +131,7 @@ The parameters are explained as follows:
**Notice:**
- tp_size needs to be divisible by dcp_size
-- decode context parallel size must less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
+- decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.
## Accuracy Evaluation
@@ -169,7 +169,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompt 1 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md b/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md
index 0ea3bb64..9f76687a 100644
--- a/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md
+++ b/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md
@@ -70,7 +70,7 @@ such as IP addresses according to your actual environment.
5. Check NPU TLS Configuration
```bash
- # The tls settings should be consistent across all nodes
+ # The tls settings should be consistent across all nodes.
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -267,7 +267,7 @@ configuration.
## Benchmark
-We recommend using the **aisbench** tool to assess performance. The test uses
+We recommend using the **AISBench** tool to assess performance. The test uses
**Dataset A**, consisting of fully random data, with the following
configuration:
@@ -328,7 +328,7 @@ models = [
]
```
-**Performance Benchmarking Commands**:
+**Performance Benchmarking Commands**:
```shell
ais_bench --models vllm_api_stream_chat \
diff --git a/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md b/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
index 94d4de47..2683d988 100644
--- a/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
+++ b/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
@@ -1,10 +1,10 @@
# Prefill-Decode Disaggregation (Deepseek)
-## Getting Start
+## Getting Started
-vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide take one-by-one steps to verify these features with constrained resources.
+vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide takes one-by-one steps to verify these features with constrained resources.
-Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the ip of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs 16 chips to deploy one service instance.
+Take the Deepseek-r1-w8a8 model as an example and use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
## Verify Multi-Node Communication Environment
@@ -48,7 +48,7 @@ cat /etc/hccn.conf
3. Get NPU IP Addresses
```bash
-# Get virtual npu ip
+# Get virtual NPU IP.
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
@@ -61,14 +61,14 @@ for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-i
5. Cross-Node PING Test
```bash
-# Execute on the target node (replace 'x.x.x.x' with virtual npu ip address)
+# Execute on the target node (replace 'x.x.x.x' with virtual NPU IP address).
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
6. Check NPU TLS Configuration
```bash
-# The tls settings should be consistent across all nodes
+# The TLS settings should be consistent across all nodes
for i in {0..15}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -117,7 +117,7 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
5. Check NPU TLS Configuration
```bash
-# The tls settings should be consistent across all nodes
+# The TLS settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
@@ -136,7 +136,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running a bridge network with Docker, please expose the necessary ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -173,7 +173,7 @@ docker run --rm \
## Install Mooncake
-Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: .
+Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide:
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
@@ -886,7 +886,7 @@ cd benchmark/
pip3 install -e ./
```
-You need to canncel the http proxy before assessing performance, as following
+You need to unset the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
@@ -895,7 +895,7 @@ unset https_proxy
```
- You can place your datasets in the dir: `benchmark/ais_bench/datasets`
-- You can change the configurationin the dir :`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take the ``vllm_api_stream_chat.py`` for examples
+- You can change the configuration in the dir: `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:
```python
models = [
@@ -943,7 +943,7 @@ Check service health using the proxy server endpoint.
curl http://192.0.0.1:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
- "model": "qwen3-moe",
+ "model": "ds_r1",
"prompt": "Who are you?",
"max_completion_tokens": 100,
"temperature": 0
diff --git a/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md b/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md
index 8de8425b..e8a49f6f 100644
--- a/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md
+++ b/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md
@@ -44,7 +44,7 @@ for i in {0..7}; do hccn_tool -i $i -ip -g;done
4. Cross-Node PING Test
```bash
-# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
+# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address).
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
@@ -94,21 +94,21 @@ docker run --rm \
## Install Mooncake
-Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.Installation and Compilation Guide: .
+Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and Compilation Guide: .
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```
-(Optional) Replace go install url if the network is poor
+(Optional) Replace the Go install URL if the network is poor.
```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```
-Install mpi
+Install MPI.
```shell
apt-get install mpich libmpich-dev -y
@@ -120,7 +120,7 @@ Install the relevant dependencies. The installation of Go is not required.
bash dependencies.sh -y
```
-Compile and install
+Compile and install.
```shell
mkdir build
@@ -130,7 +130,7 @@ make -j
make install
```
-Set environment variables
+Set environment variables.
**Note:**
@@ -268,7 +268,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
- {"type": "text", "text": "What is the text in the illustrate?"}
+ {"type": "text", "text": "What is the text in the illustration?"}
]}
],
"max_completion_tokens": 100,
diff --git a/docs/source/tutorials/features/ray.md b/docs/source/tutorials/features/ray.md
index af1c3fc0..137ac8a3 100644
--- a/docs/source/tutorials/features/ray.md
+++ b/docs/source/tutorials/features/ray.md
@@ -64,7 +64,7 @@ export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
+# Note: If you are running a bridge network with Docker, please expose the necessary ports for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
diff --git a/docs/source/tutorials/features/suffix_speculative_decoding.md b/docs/source/tutorials/features/suffix_speculative_decoding.md
index df537256..bad8ea61 100644
--- a/docs/source/tutorials/features/suffix_speculative_decoding.md
+++ b/docs/source/tutorials/features/suffix_speculative_decoding.md
@@ -15,7 +15,7 @@ This document provides step-by-step guidance on how to deploy and benchmark the
| Comprehensive Examination | agieval |
| Multi-turn Dialogue | sharegpt |
-The benchmarking tool used in this tutorial is AISbench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
+The benchmarking tool used in this tutorial is AISBench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO TPOT < 50ms across different datasets and concurrency levels. Validations demonstrate that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
## **Download vllm-ascend Image**
@@ -71,14 +71,14 @@ pip install arctic-inference
## **vLLM Instance Deployment**
-Use the following command to start the container service instance. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to`suffix`. For this test, `num_speculative_tokens` is uniformly set to`3`.
+Use the following command to start the container service instance. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to `suffix`. For this test, `num_speculative_tokens` is uniformly set to `3`.
```bash
-# set the NPU device number
+# Set the NPU device numbers.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
-# Enable the AIVector core to directly schedule ROCE communication
+# Enable the AIVector core to directly schedule RoCE communication.
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable MLP prefetch for better performance.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
diff --git a/docs/source/tutorials/hardwares/310p.md b/docs/source/tutorials/hardwares/310p.md
index c32347c2..2cbbc40b 100644
--- a/docs/source/tutorials/hardwares/310p.md
+++ b/docs/source/tutorials/hardwares/310p.md
@@ -69,7 +69,7 @@ vllm serve Qwen/Qwen3-0.6B \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
-Once your server is started, you can query the model with input prompts
+Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
@@ -99,7 +99,7 @@ vllm serve Qwen/Qwen2.5-7B-Instruct \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
-Once your server is started, you can query the model with input prompts
+Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
@@ -129,7 +129,7 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
-Once your server is started, you can query the model with input prompts
+Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/completions \
diff --git a/docs/source/tutorials/models/DeepSeek-R1.md b/docs/source/tutorials/models/DeepSeek-R1.md
index 34e9ee00..980f393b 100644
--- a/docs/source/tutorials/models/DeepSeek-R1.md
+++ b/docs/source/tutorials/models/DeepSeek-R1.md
@@ -39,7 +39,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -70,7 +70,7 @@ If you want to deploy multi-node environment, you need to set up environment on
## Deployment
-### Service-oriented Deployment
+### Service-oriented Deployment
- `DeepSeek-R1-W8A8`: require 1 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8).
@@ -303,7 +303,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/models/DeepSeek-V3.1.md b/docs/source/tutorials/models/DeepSeek-V3.1.md
index 3e067a66..9f796a1e 100644
--- a/docs/source/tutorials/models/DeepSeek-V3.1.md
+++ b/docs/source/tutorials/models/DeepSeek-V3.1.md
@@ -10,7 +10,7 @@ DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinkin
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.
-The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`
+The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
@@ -30,7 +30,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.
-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Verify Multi-node Communication(Optional)
@@ -52,7 +52,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -580,7 +580,7 @@ The parameters are explained as follows:
- `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
- `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which is used to reduce the computational load of each card on the LMHead embedding layer.
-6. run server for each node
+6. Run the server on each node:
```shell
# p0
@@ -593,7 +593,7 @@ python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
-7. Run proxy `proxy.sh` scripts on the prefill master node
+7. Run the `proxy.sh` script on the prefill master node
Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
@@ -716,7 +716,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/models/DeepSeek-V3.2.md b/docs/source/tutorials/models/DeepSeek-V3.2.md
index 9435aec4..dd20aa84 100644
--- a/docs/source/tutorials/models/DeepSeek-V3.2.md
+++ b/docs/source/tutorials/models/DeepSeek-V3.2.md
@@ -21,7 +21,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `DeepSeek-V3.2`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. Model weight in BF16 not found now.
- `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)
-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Verify Multi-node Communication(Optional)
@@ -390,7 +390,7 @@ We'd like to show the deployment guide of `DeepSeek-V3.2` on multi-node environm
Before you start, please
-1. prepare the script `launch_online_dp.py` on each node.
+1. Prepare the script `launch_online_dp.py` on each node:
```python
import argparse
@@ -905,7 +905,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
## Function Call
diff --git a/docs/source/tutorials/models/GLM4.x.md b/docs/source/tutorials/models/GLM4.x.md
index 6cb96c90..ed6dbb32 100644
--- a/docs/source/tutorials/models/GLM4.x.md
+++ b/docs/source/tutorials/models/GLM4.x.md
@@ -2,9 +2,9 @@
## Introduction
-GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundational models specifically designed for agent applications
+GLM-4.x series models use a Mixture-of-Experts (MoE) architecture and are foundational models specifically designed for agent applications.
-The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`
+The `GLM-4.5` model is first supported in `vllm-ascend:v0.10.0rc1`.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
@@ -25,7 +25,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm do not support GLM4.6 mtp in October, so we do not provide mtp version. And last month, it supported, you can use the following quantization scheme to add mtp weights to Quantized weights.
- `Method of Quantify`: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use these methods to quantify the model.
-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Installation
@@ -43,7 +43,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
-# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \
diff --git a/docs/source/tutorials/models/PaddleOCR-VL.md b/docs/source/tutorials/models/PaddleOCR-VL.md
index e73424bc..7f1d521a 100644
--- a/docs/source/tutorials/models/PaddleOCR-VL.md
+++ b/docs/source/tutorials/models/PaddleOCR-VL.md
@@ -78,7 +78,7 @@ Single-node deployment is recommended.
### Prefill-Decode Disaggregation
-Not supported yet
+Not supported yet.
## Functional Verification
@@ -190,7 +190,7 @@ pip install safetensors
```
:::{note}
-The OpenCV component may be missing:
+The OpenCV component may be missing:
```bash
apt-get update
diff --git a/docs/source/tutorials/models/Qwen-VL-Dense.md b/docs/source/tutorials/models/Qwen-VL-Dense.md
index 6426b796..fb55dd3a 100644
--- a/docs/source/tutorials/models/Qwen-VL-Dense.md
+++ b/docs/source/tutorials/models/Qwen-VL-Dense.md
@@ -563,7 +563,7 @@ The performance evaluation must be conducted in an online mode. Take the `serve`
:sync: single
```shell
-vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
::::
@@ -571,7 +571,7 @@ vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --rand
:sync: multi
```shell
-vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
::::
diff --git a/docs/source/tutorials/models/Qwen2.5-7B.md b/docs/source/tutorials/models/Qwen2.5-7B.md
index 1ecdc765..3ec2800f 100644
--- a/docs/source/tutorials/models/Qwen2.5-7B.md
+++ b/docs/source/tutorials/models/Qwen2.5-7B.md
@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
-- `Qwen2.5-7B-Instruct`(BF16 version): require 1 910B4 cards(32G × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
+- `Qwen2.5-7B-Instruct`(BF16 version): require 1 910B4 card (32G × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
@@ -171,7 +171,7 @@ vllm bench serve \
--model ./Qwen2.5-7B-Instruct/ \
--dataset-name random \
--random-input 200 \
- --num-prompt 200 \
+ --num-prompts 200 \
--request-rate 1 \
--save-result \
--result-dir ./perf_results/
diff --git a/docs/source/tutorials/models/Qwen2.5-Omni.md b/docs/source/tutorials/models/Qwen2.5-Omni.md
index 55436958..a869dd6f 100644
--- a/docs/source/tutorials/models/Qwen2.5-Omni.md
+++ b/docs/source/tutorials/models/Qwen2.5-Omni.md
@@ -95,7 +95,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
:::
-`--allowed-local-media-path` is optional, only set it if you need infer model with local media file
+`--allowed-local-media-path` is optional; only set it if you need to run inference with local media files.
`--gpu-memory-utilization` should not be set manually only if you know what this parameter aims to.
@@ -118,11 +118,11 @@ vllm serve ${MODEL_PATH}\
--no-enable-prefix-caching
```
-`--tensor_parallel_size` no need to set for this 7B model, but if you really need tensor parallel, tp size can be one of `1\2\4`
+`--tensor_parallel_size` does not need to be set for this 7B model, but if you really need tensor parallelism, the TP size can be one of `1/2/4`.
### Prefill-Decode Disaggregation
-Not supported yet
+Not supported yet.
## Functional Verification
@@ -145,7 +145,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j
"content": [
{
"type": "text",
- "text": "What is the text in the illustrate?"
+ "text": "What is the text in the illustration?"
},
{
"type": "image_url",
@@ -170,7 +170,7 @@ If you query the server successfully, you can see the info shown below (client):
## Accuracy Evaluation
-Qwen2.5-Omni on vllm-ascend has been test on AISBench.
+Qwen2.5-Omni on vllm-ascend has been tested on AISBench.
### Using AISBench
@@ -204,7 +204,7 @@ There are three `vllm bench` subcommands:
Take the `serve` as an example. Run the code as follows.
```shell
-vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/models/Qwen3-235B-A22B.md b/docs/source/tutorials/models/Qwen3-235B-A22B.md
index 5ab6f8e6..cf74f381 100644
--- a/docs/source/tutorials/models/Qwen3-235B-A22B.md
+++ b/docs/source/tutorials/models/Qwen3-235B-A22B.md
@@ -18,10 +18,10 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
-- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
-- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
+- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
+- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2 (32G × 8) nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
-It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
+It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Verify Multi-node Communication(Optional)
@@ -46,7 +46,7 @@ Select an image based on your machine type and start the docker image on your no
export NAME=vllm-ascend
# Run the container using the defined variables
- # Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
+ # Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance.
docker run --rm \
--name $NAME \
--net=host \
@@ -87,7 +87,7 @@ If you want to deploy multi-node environment, you need to set up environment on
### Single-node Deployment
-`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A3(64G*16)、 1 Atlas 800 A2(64G*8).
+`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 Atlas 800 A3 (64G*16) node or 1 Atlas 800 A2 (64G*8) node.
Quantized version need to start with parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
@@ -310,7 +310,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
@@ -328,7 +328,7 @@ In this section, we provide simple scripts to re-produce our latest performance.
- HDK/driver 25.3.RC1
- triton_ascend 3.2.0
-### Single Node A3 (64G*16)
+### Single Node A3 (64G*16)
Example server scripts:
@@ -394,7 +394,7 @@ Note:
### Three Node A3 -- PD disaggregation
-On three Atlas 800 A3(64G*16)server, we recommend to use one node as one prefill instance and two nodes as one decode instance. Example server scripts:
+On three Atlas 800 A3 (64G*16) servers, we recommend using one node as one prefill instance and two nodes as one decode instance. Example server scripts:
Prefill Node 1
```shell
diff --git a/docs/source/tutorials/models/Qwen3-Dense.md b/docs/source/tutorials/models/Qwen3-Dense.md
index 70814d7a..9f929a9a 100644
--- a/docs/source/tutorials/models/Qwen3-Dense.md
+++ b/docs/source/tutorials/models/Qwen3-Dense.md
@@ -304,7 +304,7 @@ Take the `serve` as an example. Run the code as follows.
- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
```shell
-vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
@@ -389,4 +389,4 @@ If this list is not manually specified, it will be filled with a series of evenl
Therefore, like the above real-world scenario, when adjusting the benchmark request concurrency, we always ensure that the concurrency is actually included in the cudagraph_capture_sizes list. This way, during the decode phase, padding operations are essentially avoided, ensuring the reliability of the experimental data.
-It’s important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, I recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.
+It's important to note that if you enable FlashComm_v1, the values in this list must be integer multiples of the TP size. Any values that do not meet this condition will be automatically filtered out. Therefore, I recommend incrementally adding concurrency based on the TP size after enabling FlashComm_v1.
diff --git a/docs/source/tutorials/models/Qwen3-Next.md b/docs/source/tutorials/models/Qwen3-Next.md
index d0f5becb..fa279af8 100644
--- a/docs/source/tutorials/models/Qwen3-Next.md
+++ b/docs/source/tutorials/models/Qwen3-Next.md
@@ -164,7 +164,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
index c2be1f3a..0a8b3704 100644
--- a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
+++ b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
@@ -285,7 +285,7 @@ python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-s
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt
-vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After execution, you can get the result, here is the result of `Qwen3-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only.
diff --git a/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md b/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md
index 8107315d..c02a501a 100644
--- a/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md
+++ b/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md
@@ -270,7 +270,7 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/user_guide/configuration/additional_config.md b/docs/source/user_guide/configuration/additional_config.md
index 51315d6f..b08c2015 100644
--- a/docs/source/user_guide/configuration/additional_config.md
+++ b/docs/source/user_guide/configuration/additional_config.md
@@ -1,6 +1,6 @@
# Additional Configuration
-Additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by themselves. VLLM Ascend uses this mechanism to make the project more flexible.
+Additional configuration is a mechanism provided by vLLM to allow plugins to control internal behavior by themselves. VLLM Ascend uses this mechanism to make the project more flexible.
## How to use
@@ -26,25 +26,25 @@ The following table lists additional configuration options available in vLLM Asc
| Name | Type | Default | Description |
|-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------|
-| `xlite_graph_config` | dict | `{}` | Configuration options for xlite graph mode |
+| `xlite_graph_config` | dict | `{}` | Configuration options for Xlite graph mode |
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
| `finegrained_tp_config` | dict | `{}` | Configuration options for module tensor parallelism |
| `ascend_compilation_config` | dict | `{}` | Configuration options for ascend compilation |
| `eplb_config` | dict | `{}` | Configuration options for ascend compilation |
-| `npugraph_ex_config` | dict | `{}` | Configuration options for npugraph_ex backend |
+| `npugraph_ex_config` | dict | `{}` | Configuration options for the npugraph_ex backend |
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. |
| `dump_config_path` | str | `None` | Configuration file path for msprobe dump(eager mode). |
-| `enable_async_exponential` | bool | `False` | Whether to enable async exponential overlap. To enable async exponential, set this config to True. |
+| `enable_async_exponential` | bool | `False` | Whether to enable asynchronous exponential overlap. To enable it, set this config to True. |
| `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
-| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
-| `multistream_overlap_gate` | bool | `False` | Whether to enable multistream overlap gate. This option only takes effect on MoE models with shared experts. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multi-stream shared expert. This option only takes effect on MoE models with shared experts. |
+| `multistream_overlap_gate` | bool | `False` | Whether to enable multi-stream overlap gate. This option only takes effect on MoE models with shared experts. |
| `recompute_scheduler_enable` | bool | `False` | Whether to enable recompute scheduler. |
-| `enable_cpu_binding` | bool | `False` | Whether to enable CPU binding. |
-| `SLO_limits_for_dynamic_batch` | int | `-1` | SLO limits for dynamic batch. This is new scheduler to support dynamic feature |
-| `enable_npugraph_ex` | bool | `False` | Whether to enable npugraph ex graph mode. |
+| `enable_cpu_binding` | bool | `False` | Whether to enable CPU binding. |
+| `SLO_limits_for_dynamic_batch` | int | `-1` | SLO limits for dynamic batch. This is a new scheduler option to support the dynamic batch feature. |
+| `enable_npugraph_ex` | bool | `False` | Whether to enable npugraph_ex graph mode. |
| `pa_shape_list` | list | `[]` | The custom shape list of page attention ops. |
-| `enable_kv_nz` | bool | `False` | Whether to enable kvcache NZ layout. This option only takes effects on models using MLA (e.g., DeepSeek). |
-| `layer_sharding` | dict | `{}` | Configuration options for layer sharding linear |
+| `enable_kv_nz` | bool | `False` | Whether to enable KV cache NZ layout. This option only takes effect on models using MLA (e.g., DeepSeek). |
+| `layer_sharding` | dict | `{}` | Configuration options for layer sharding of linear layers |
The details of each configuration option are as follows:
@@ -52,8 +52,8 @@ The details of each configuration option are as follows:
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
-| `enabled` | bool | `False` | Whether to enable xlite graph mode. Currently only Llama, Qwen dense series models, and Qwen3-vl are supported. |
-| `full_mode` | bool | `False` | Whether to enable xlite for both the prefill and decode stages. By default, xlite is only enabled for the decode stage. |
+| `enabled` | bool | `False` | Whether to enable Xlite graph mode. Currently only Llama, Qwen dense series models, and Qwen3-VL are supported. |
+| `full_mode` | bool | `False` | Whether to enable Xlite for both the prefill and decode stages. By default, Xlite is only enabled for the decode stage. |
**weight_prefetch_config**
@@ -66,8 +66,8 @@ The details of each configuration option are as follows:
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
-| `lmhead_tensor_parallel_size` | int | `0` | The custom tensor parallel size of lmhead. |
-| `oproj_tensor_parallel_size` | int | `0` | The custom tensor parallel size of oproj. |
+| `lmhead_tensor_parallel_size` | int | `0` | The custom tensor parallel size of lm_head. |
+| `oproj_tensor_parallel_size` | int | `0` | The custom tensor parallel size of o_proj. |
| `embedding_tensor_parallel_size` | int | `0` | The custom tensor parallel size of embedding. |
| `mlp_tensor_parallel_size` | int | `0` | The custom tensor parallel size of mlp. |
diff --git a/docs/source/user_guide/deployment_guide/using_volcano_kthena.md b/docs/source/user_guide/deployment_guide/using_volcano_kthena.md
index 213d226d..f2b51055 100644
--- a/docs/source/user_guide/deployment_guide/using_volcano_kthena.md
+++ b/docs/source/user_guide/deployment_guide/using_volcano_kthena.md
@@ -1,6 +1,6 @@
# Using Volcano Kthena
-This guide shows how to run **prefill–decode (PD) disaggregation** on Huawei Ascend NPUs using **vLLM-Ascend**, with [**Kthena**](https://kthena.volcano.sh/) handling orchestration on Kubernetes. About vLLM support with kthena, please refer to [Deploy vLLM with Kthena](https://docs.vllm.ai/en/latest/deployment/integrations/kthena/).
+This guide shows how to run **prefill–decode (PD) disaggregation** on Huawei Ascend NPUs using **vLLM-Ascend**, with [**Kthena**](https://kthena.volcano.sh/) handling orchestration on Kubernetes. For details about vLLM support with Kthena, please refer to [Deploy vLLM with Kthena](https://docs.vllm.ai/en/latest/deployment/integrations/kthena/).
---
@@ -10,18 +10,18 @@ Large language model inference naturally splits into two phases:
- **Prefill**
- Processes input tokens and builds the key–value (KV) cache.
- - Batch‑friendly, high throughput, well suited to parallel NPU execution.
+ - Batch-friendly, high-throughput, well-suited to parallel NPU execution.
- **Decode**
- Consumes the KV cache to generate output tokens.
- - Latency‑sensitive, memory‑intensive, more sequential.
+ - Latency-sensitive, memory-intensive, more sequential.
-From the client’s perspective, this still looks like a single Chat / Completions endpoint.
+From the client's perspective, this still looks like a single Chat / Completions endpoint.
---
## 2. Deploy on Kubernetes with Kthena
-[Kthena](https://kthena.volcano.sh/) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads. In this example, we use three key Custom Resource Definitions (CRDs):
+[Kthena](https://kthena.volcano.sh/) is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads. In this example, we use three key Custom Resource Definitions (CRDs):
- `ModelServing` — defines the workloads (prefill and decode roles).
- `ModelServer` — manages PD groupings and internal routing.
@@ -33,7 +33,7 @@ This section uses the `deepseek-ai/DeepSeek-V2-Lite` example, but you can swap i
- Kubernetes cluster with Ascend NPU nodes:
- The Resources corresponding to different NPU Drivers may vary slightly. For example:
+ The resources corresponding to different NPU Drivers may vary slightly. For example:
- If using [MindCluster](https://gitee.com/ascend/mind-cluster#https://gitee.com/link?target=https%3A%2F%2Fgitcode.com%2FAscend%2Fmind-cluster), please use `huawei.com/Ascend310P` or `huawei.com/Ascend910`.
@@ -45,7 +45,7 @@ This section uses the `deepseek-ai/DeepSeek-V2-Lite` example, but you can swap i
A concrete example is provided in Kthena as
-Deploy it with below command:
+Deploy it with the command below:
```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/prefill-decode-disaggregation.yaml
@@ -295,7 +295,7 @@ You should see Pods such as:
- `deepseek-v2-lite-0-prefill-0-0`
- `deepseek-v2-lite-0-decode-0-0`
-To enable the llm access, we still need to configure the routing layer with `ModelServer` and `ModelRoute`.
+To enable LLM access, we still need to configure the routing layer with `ModelServer` and `ModelRoute`.
### 2.3 ModelServer: PD Group Management
@@ -306,7 +306,7 @@ The `ModelServer` resource:
- Configures KV connector details and timeouts.
- Exposes an internal gRPC/HTTP interface.
-Create modelServer with below command:
+Create ModelServer with the command below:
```bash
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/kthena-router/ModelServer-prefill-decode-disaggregation.yaml
diff --git a/docs/source/user_guide/feature_guide/Fine_grained_TP.md b/docs/source/user_guide/feature_guide/Fine_grained_TP.md
index ef16a353..ae81f3db 100644
--- a/docs/source/user_guide/feature_guide/Fine_grained_TP.md
+++ b/docs/source/user_guide/feature_guide/Fine_grained_TP.md
@@ -2,7 +2,7 @@
## Overview
-Fine-Grained Tensor Parallelism (Finegrained TP) extends standard tensor parallelism by enabling **independent tensor parallel sizes for different model components**. Instead of applying a single global `tensor_parallel_size` to all layers, Finegrained TP allows users to configure separate TP size for key modules—such as embedding, language model head (lm_head), attention output projection (oproj), and MLP blocks—via the `finegrained_tp_config` parameter.
+Fine-Grained Tensor Parallelism (Fine-grained TP) extends standard tensor parallelism by enabling **independent tensor-parallel sizes for different model components**. Instead of applying a single global `tensor_parallel_size` to all layers, Fine-grained TP allows users to configure separate TP sizes for key modules—such as embedding, language model head (lm_head), attention output projection (o_proj), and MLP blocks—via the `finegrained_tp_config` parameter.
This capability supports heterogeneous parallelism strategies within a single model, providing finer control over weight distribution, memory layout, and communication patterns across devices. The feature is compatible with standard dense transformer architectures and integrates seamlessly into vLLM’s serving pipeline.
@@ -12,10 +12,10 @@ This capability supports heterogeneous parallelism strategies within a single mo
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
-- **Reduced Per-Device Memory Footprint**:
- Finegrained TP shards large weight matrices(如 LM Head、o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
+- **Reduced Per-Device Memory Footprint**:
+  Fine-grained TP shards large weight matrices (such as LM Head and o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
-- **Faster Memory Access in GEMMs**:
+- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
Together, these effects allow practitioners to better balance memory, communication, and compute—particularly in high-concurrency serving scenarios—while maintaining compatibility with standard dense transformer models.
@@ -26,7 +26,7 @@ Together, these effects allow practitioners to better balance memory, communicat
### Models
-Finegrained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
+Fine-grained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
### Component & Execution Mode Support
@@ -57,7 +57,7 @@ The Fine-Grained TP size for any component must:
### Configuration Format
-Finegrained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
+Fine-grained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
```bash
--additional-config '{
@@ -90,7 +90,7 @@ vllm serve deepseek-ai/DeepSeek-R1 \
## Experimental Results
-To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD-separated decode instances in the environment of 32 cards Ascend 910B*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8, the performance data is as follows.
+To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the **DeepSeek-R1-W8A8** model and deploy PD-disaggregated decode instances in an environment of 32 Ascend 910B*64G (A2) cards, with a parallel configuration of DP32+EP32 and a fine-grained TP size of 8; the performance data is as follows.
| Module | Memory Savings | TPOT Impact (batch=24) |
| ---------------- | -------------- | ------------------------- |
diff --git a/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md b/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md
index 7794fc7a..7052acc9 100644
--- a/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md
+++ b/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md
@@ -10,7 +10,7 @@ To enable MTP for DeepSeek-V3 models, add the following parameter when starting
--speculative_config ' {"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False} '
-- `num_speculative_tokens`: The number of speculative tokens which enable model to predict multiple tokens at once, if provided. It will default to the number in the draft model config if present, otherwise, it is required.
+- `num_speculative_tokens`: The number of speculative tokens that enables the model to predict multiple tokens at once, if provided. It will default to the number in the draft model config if present; otherwise, it is required.
- `disable_padded_drafter_batch`: Disable input padding for speculative decoding. If set to True, speculative input batches can contain sequences of different lengths, which may only be supported by certain attention backends. This currently only affects the MTP method of speculation, default is False.
## How It Works
@@ -28,7 +28,7 @@ vllm_ascend
**1. sample**
-- *rejection_sample.py*: During decoding, the main model processes the previous round’s output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction, thus We employ **Greedy Strategy** and **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
+- *rejection_sample.py*: During decoding, the main model processes the previous round’s output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction; thus, we employ **Greedy Strategy** and **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
```shell
rejection_sample.py
@@ -38,9 +38,9 @@ rejection_sample.py
**2. spec_decode**
-This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token ids. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
+This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token IDs. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
-- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by deepseek mtp layer.
+- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by the DeepSeek MTP layer.
```shell
mtp_proposer.py
@@ -54,7 +54,7 @@ mtp_proposer.py
### Algorithm
-**1. Reject_Sample**
+**1. Rejection Sampling**
- *Greedy Strategy*
@@ -68,17 +68,17 @@ For each draft token, acceptance is determined by verifying whether the inequali
The decision logic for each draft token is as follows: if the inequality `P_target / P_draft ≥ U` holds, the draft token is accepted as output; conversely, if `P_target / P_draft < U`, the draft token is rejected.
-When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U,` and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
+When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U` and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
**2. Performance**
-If the bonus token is accepted, the MTP model performs inference for (num_speculative +1) tokens, including original main model output token and bonus token. If rejected, inference is performed for less token, determining on how many tokens accepted.
+If the bonus token is accepted, the MTP model performs inference for (num_speculative + 1) tokens, including the original main model output token and the bonus token. If rejected, inference is performed for fewer tokens, depending on how many tokens are accepted.
## DFX
### Method Validation
-- Currently, the spec_decode scenario only supports methods such as ngram, eagle, eagle3, and mtp. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
+- Currently, the spec_decode scenario only supports methods such as n-gram, EAGLE, EAGLE3, and MTP. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
```python
def get_spec_decode_method(method,
@@ -112,4 +112,4 @@ if self.speculative_config:
## Limitations
- Due to the fact that only a single layer of weights is exposed in DeepSeek's MTP, the accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
-- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).
+- In the fullgraph mode with MTP > 1, the capture size of each ACLGraph must be an integer multiple of (num_speculative_tokens + 1).
diff --git a/docs/source/user_guide/feature_guide/context_parallel.md b/docs/source/user_guide/feature_guide/context_parallel.md
index c3de26cb..79d3eea7 100644
--- a/docs/source/user_guide/feature_guide/context_parallel.md
+++ b/docs/source/user_guide/feature_guide/context_parallel.md
@@ -9,7 +9,7 @@ This guide shows how to use Context Parallel, a long sequence inference optimiza
Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite different characteristics and have quite different SLO (service level objectives), we need to implement context parallel separately for them. The major considerations are:
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
-- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batchsize (and hence the throughput).
+- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batch size (and hence the throughput).
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
@@ -54,19 +54,19 @@ You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and `decode_co
--prefill-context-parallel-size 2 \
```
-The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`, so the examples above need 4 NPUs for each.
+The total world size is `tensor_parallel_size` * `prefill_context_parallel_size`, so the examples above need 4 NPUs for each.
## Constraints
- While using DCP, the following constraints must be met:
- - For MLA based model, such as Deepseek-R1:
+ - For MLA-based model, such as DeepSeek-R1:
- `tensor_parallel_size >= decode_context_parallel_size`
- `tensor_parallel_size % decode_context_parallel_size == 0`
- - For GQA based model, such as Qwen3-235B:
+ - For GQA-based model, such as Qwen3-235B:
- `(tensor_parallel_size // num_key_value_heads) >= decode_context_parallel_size`
- `(tensor_parallel_size // num_key_value_heads) % decode_context_parallel_size == 0`
-- While using Context Parallel in KV cache transfer needed scenario (e.g. KV pooling, PD-disaggregation), to simplify KV cache transmission, `cp_kv_cache_interleave_size` must be set to the same value of KV cache `block_size`(default: 128), which specify cp to split KV cache in a block-interleave style. For example:
+- While using Context Parallel in scenarios that require KV cache transfer (e.g. KV pooling, PD disaggregation), to simplify KV cache transmission, `cp_kv_cache_interleave_size` must be set to the same value as the KV cache `block_size` (default: 128), which makes CP split the KV cache in a block-interleaved style. For example:
```shell
vllm serve deepseek-ai/DeepSeek-V2-Lite \
@@ -79,7 +79,7 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
## Experimental Results
-To evaluate the effectiveness of Context Parallel in in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploy PD-disaggregate instances in the environment of 64 cards Ascend 910C*64G (A3), the configuration and performance data are as follows.
+To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B** and deploy PD-disaggregated instances in an environment of 64 Ascend 910C*64G (A3) cards; the configuration and performance data are as follows.
- DeepSeek-R1-W8A8:
diff --git a/docs/source/user_guide/feature_guide/dynamic_batch.md b/docs/source/user_guide/feature_guide/dynamic_batch.md
index cfb4fa13..b20a5756 100644
--- a/docs/source/user_guide/feature_guide/dynamic_batch.md
+++ b/docs/source/user_guide/feature_guide/dynamic_batch.md
@@ -3,7 +3,7 @@
Dynamic batch is a technique that dynamically adjusts the chunksize during each inference iteration within the chunked prefilling strategy according to the resources and SLO targets, thereby improving the effective throughput and decreasing the TBT.
Dynamic batch is controlled by the value of the `--SLO_limits_for_dynamic_batch`.
-Notably, only 910 B3 is supported with decode token numbers scales below 2048 so far.
+Notably, only 910B3 is supported so far, with decode token counts below 2048.
Especially, the improvements are quite obvious on Qwen, Llama models.
We are working on further improvements and this feature will support more XPUs in the future.
@@ -11,16 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
-1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should first be downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`.
-2. `Pandas` is needed to load the lookup table, in case `pandas` is not installed.
+2. `pandas` is needed to load the lookup table. Install it if it is not already available:
```bash
pip install pandas
```
-### Tuning Parameter
-`--SLO_limits_for_dynamic_batch` is the tuning parameters (integer type) for the dynamic batch feature, greater values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
+### Tuning Parameters
+`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
```python
--SLO_limits_for_dynamic_batch =-1 # default value, dynamic batch disabled.
@@ -45,7 +45,7 @@ vllm serve Qwen/Qwen2.5-14B-Instruct\
--tensor_parallel_size 8 \
--load_format dummy \
--max_num_batched_tokens 1024 \
- --max_model_len 9000 \
+ --max-model-len 9000 \
--host localhost \
--port 12091 \
--gpu-memory-utilization 0.9 \
diff --git a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
index 35f78931..a5982a8a 100644
--- a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
+++ b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
@@ -16,17 +16,17 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa
### Models
-DeepseekV3/V3.1/R1、Qwen3-MOE
+DeepSeek-V3/V3.1/R1, Qwen3-MoE
### MOE QuantType
-W8A8-dynamic
+W8A8-Dynamic
## How to Use EPLB
### Dynamic EPLB
-We need to add environment variable `export DYNAMIC_EPLB="true"` to enable vllm eplb. Enable dynamic balancing with auto-tuned parameters. Adjust expert_heat_collection_interval and algorithm_execution_interval based on workload patterns.
+We need to set the environment variable `export DYNAMIC_EPLB="true"` to enable vLLM EPLB. This enables dynamic balancing with auto-tuned parameters. Adjust `expert_heat_collection_interval` and `algorithm_execution_interval` based on workload patterns.
```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -87,7 +87,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
4. Monitoring & Validation:
- Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
- - Use vllm monitor to detect imbalances during runtime.
+ - Use vLLM monitor to detect imbalances during runtime.
- Always verify expert map JSON structure before loading (validate with jq or similar tools).
5. Startup Behavior:
diff --git a/docs/source/user_guide/feature_guide/external_dp.md b/docs/source/user_guide/feature_guide/external_dp.md
index 17bae552..6cdc8521 100644
--- a/docs/source/user_guide/feature_guide/external_dp.md
+++ b/docs/source/user_guide/feature_guide/external_dp.md
@@ -1,6 +1,6 @@
# External DP
-For larger scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
+For larger-scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
@@ -8,8 +8,8 @@ In this case, it's more convenient to treat each DP rank like a separate vLLM de
The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. In vllm-ascend we provide two enhanced functionalities:
-1. A launch script which helps to launch multi vllm instances in one command.
-2. A request-length-aware load balance proxy for external dp.
+1. A launch script that helps to launch multiple vLLM instances in one command.
+2. A request-length-aware load-balance proxy for external DP.
This tutorial will introduce the usage of them.
@@ -24,9 +24,9 @@ pip install fastapi httpx uvicorn
## Starting Exeternal DP Servers
-First you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but will fall back to direct request forwarding which is meaningless.
+First, you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but it will fall back to direct request forwarding, which defeats the purpose of load balancing.
-You can start external vLLM dp servers one-by-one manually or using the launch script in `examples/external_online_dp`. For scenarios of large dp size across multi nodes, we recommend using our launch script for convenience.
+You can start external vLLM DP servers one-by-one manually or using the launch script in `examples/external_online_dp`. For scenarios of large DP size across multiple nodes, we recommend using our launch script for convenience.
### Manually Launch
@@ -38,7 +38,7 @@ vllm serve --host 0.0.0.0 --port 8101 --data-parallel-size 2 --data-parallel-ran
### Use Launch Script
-Firstly, you need to modify the `examples/external_online_dp/run_dp_template.sh` according to your vLLM configuration. Then you can use `examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM instances in one command each node. It will internally call `examples/external_online_dp/run_dp_template.sh` for each DP rank with proper DP-related parameters.
+Firstly, you need to modify the `examples/external_online_dp/run_dp_template.sh` according to your vLLM configuration. Then you can use `examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM instances in one command on each node. It will internally call `examples/external_online_dp/run_dp_template.sh` for each DP rank with proper DP-related parameters.
An example of running external DP in one single node:
@@ -65,9 +65,9 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s
## Starting Load-balance Proxy Server
-After all vLLM DP instances are launched, you can now launch the load-balance proxy server which serves as entrypoint for coming requests and load balance them between vLLM DP instances.
+After all vLLM DP instances are launched, you can now launch the load-balance proxy server, which serves as an entrypoint for incoming requests and load-balances them between vLLM DP instances.
-The proxy server has following features:
+The proxy server has the following features:
- Load balances requests to multiple vLLM servers based on request length.
- Supports OpenAI-compatible `/v1/completions` and `/v1/chat/completions` endpoints.
@@ -88,4 +88,4 @@ python dp_load_balance_proxy_server.py \
--dp-ports 9000 9001 \
```
-After this, you can directly send requests to the proxy server and run DP with external load-balance.
+After this, you can directly send requests to the proxy server and run DP with external load balancing.
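+
+As a quick sanity check, the sketch below sends one completion request through the proxy. The proxy port (`8080`) and the model name used here are placeholders rather than values from this guide; replace them with the port you passed when launching the proxy and the model actually served by your DP instances.
+
+```bash
+# Hypothetical example: adjust the proxy port and model name to your deployment.
+curl http://localhost:8080/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "Hello, my name is",
+        "max_tokens": 32
+    }'
+```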
diff --git a/docs/source/user_guide/feature_guide/graph_mode.md b/docs/source/user_guide/feature_guide/graph_mode.md
index c5e208ab..c2e07956 100644
--- a/docs/source/user_guide/feature_guide/graph_mode.md
+++ b/docs/source/user_guide/feature_guide/graph_mode.md
@@ -8,16 +8,16 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
## Getting Started
-From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
+From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
-There are two kinds for graph mode supported by vLLM Ascend:
+There are two kinds of graph mode supported by vLLM Ascend:
-- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
-- **XliteGraph**: This is the openeuler xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-vl are supported.
+- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and DeepSeek series models are well tested.
+- **XliteGraph**: This is the openEuler Xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-VL are supported.
## Using ACLGraph
-ACLGraph is enabled by default. Take Qwen series models as an example, just set to use V1 Engine is enough.
+ACLGraph is enabled by default. Taking Qwen series models as an example, simply using the V1 Engine is enough.
Offline example:
@@ -38,7 +38,7 @@ vllm serve Qwen/Qwen2-7B-Instruct
## Using XliteGraph
-If you want to run Llama, Qwen dense series models, Qwen MoE series models, or Qwen3-vl with xlite graph mode, please install xlite, and set xlite_graph_config.
+If you want to run Llama, Qwen dense series models, Qwen MoE series models, or Qwen3-VL with Xlite graph mode, please install xlite and set `xlite_graph_config`.
```bash
pip install xlite
@@ -61,11 +61,11 @@ Online example:
vllm serve path/to/Qwen3-32B --tensor-parallel-size 8 --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'
```
-You can find more details about xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
+You can find more details about Xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md).
## Fallback to the Eager Mode
-If `ACLGraph` and `XliteGraph` all fail to run, you should fallback to the eager mode.
+If `ACLGraph` and `XliteGraph` both fail to run, you should fall back to eager mode.
Offline example:
diff --git a/docs/source/user_guide/feature_guide/large_scale_ep.md b/docs/source/user_guide/feature_guide/large_scale_ep.md
index 049bf3a5..21b788d1 100644
--- a/docs/source/user_guide/feature_guide/large_scale_ep.md
+++ b/docs/source/user_guide/feature_guide/large_scale_ep.md
@@ -1,9 +1,9 @@
-# Distributed DP Server With Large Scale Expert Parallelism
+# Distributed DP Server With Large-Scale Expert Parallelism
-## Getting Start
+## Getting Started
-vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large scale **Expert Parallelism (EP)** scenario. To achieve better performance,the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \
-Take the deepseek model as an example, use 8 Atlas 800T A3 servers to deploy the model. Assume the ip of the servers start from 192.0.0.1, and end by 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. And the prefiller nodes deployed as master node independently, the decoder nodes set 192.0.0.5 node to be the master node.
+vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \
+Take the DeepSeek model as an example and deploy it on 8 Atlas 800T A3 servers. Assume the server IPs range from 192.0.0.1 to 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. Each prefiller node is deployed as its own master node independently, while the decoder nodes use the 192.0.0.5 node as the master node.
## Verify Multi-Node Communication Environment
@@ -51,7 +51,7 @@ for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-i
4. Cross-Node PING Test
```bash
-# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
+# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
@@ -87,7 +87,7 @@ for i in {0..7}; do hccn_tool -i $i -ip -g;done
3. Cross-Node PING Test
```bash
-# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
+# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
@@ -95,11 +95,11 @@ for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
:::::
-## Large Scale EP model deployment
+## Large-Scale EP Model Deployment
### Generate script with configurations
-In the PD separation scenario, we provide a optimized configuration. You can use the following shell script for configuring the prefiller and decoder nodes respectively.
+In the PD separation scenario, we provide an optimized configuration. You can use the following shell script for configuring the prefiller and decoder nodes respectively.
:::::{tab-set}
@@ -140,7 +140,7 @@ export VLLM_USE_MODELSCOPE="True"
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
-# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
+# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
@@ -207,7 +207,7 @@ export VLLM_USE_MODELSCOPE="True"
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
-# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
+# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
@@ -254,7 +254,7 @@ dp_size = 2 # total number of DP engines for decode/prefill
dp_size_local = 2 # number of DP engines on the current node
dp_rank_start = 0 # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
-dp_ip = "192.0.0.1" # master node ip for DP communication
+dp_ip = "192.0.0.1" # master node IP for DP communication
dp_port = 13395 # port used for DP communication
engine_port = 9000 # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"
@@ -288,7 +288,7 @@ dp_size = 64 # total number of DP engines for decode/prefill
dp_size_local = 16 # number of DP engines on the current node
dp_rank_start = 0 # starting DP rank for the current node. e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
-dp_ip = "192.0.0.5" # master node ip for DP communication.
+dp_ip = "192.0.0.5" # master node IP for DP communication.
dp_port = 13395 # port used for DP communication
engine_port = 9000 # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"
@@ -314,7 +314,7 @@ for process in processes:
:::::
-Note that the prefiller nodes and the decoder nodes may have different configurations. In this example, each prefiller node deployed as master node independently, but all decoder nodes take the first node as the master node. So it leads to difference in 'dp_size_local' and 'dp_rank_start'
+Note that the prefiller nodes and the decoder nodes may have different configurations. In this example, each prefiller node is deployed as a master node independently, while the decoder nodes use the 192.0.0.5 node as the master node. This leads to differences in `dp_size_local` and `dp_rank_start`.
## Example proxy for Distributed DP Server
@@ -365,7 +365,7 @@ You can get the proxy program in the repository's examples, [load\_balance\_prox
## Benchmark
-We recommend use aisbench tool to assess performance. [aisbench](https://gitee.com/aisbench/benchmark) Execute the following commands to install aisbench
+We recommend using the [aisbench](https://gitee.com/aisbench/benchmark) tool to assess performance. Execute the following commands to install aisbench:
```shell
git clone https://gitee.com/aisbench/benchmark.git
@@ -373,7 +373,7 @@ cd benchmark/
pip3 install -e ./
```
-You need to canncel the http proxy before assessing performance, as following
+You need to unset the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
@@ -381,8 +381,8 @@ unset http_proxy
unset https_proxy
```
-- You can place your datasets in the dir: `benchmark/ais_bench/datasets`
-- You can change the configurationin the dir :`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take the ``vllm_api_stream_chat.py`` for examples
+- You can place your datasets in the directory: `benchmark/ais_bench/datasets`
+- You can change the configuration in the directory `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:
```python
models = [
@@ -408,23 +408,23 @@ models = [
]
```
-- Take gsm8k dataset for example, execute the following commands to assess performance.
+- Taking the gsm8k dataset as an example, execute the following commands to assess performance.
```shell
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
```
-- For more details for commands and parameters for aisbench, refer to [aisbench](https://gitee.com/aisbench/benchmark)
+- For more details on commands and parameters for aisbench, refer to [aisbench](https://gitee.com/aisbench/benchmark)
## Prefill & Decode Configuration Details
-In the PD separation scenario, we provide a optimized configuration.
+In the PD separation scenario, we provide an optimized configuration.
- **prefiller node**
1. set HCCL_BUFFSIZE=256
2. add '--enforce-eager' command to 'vllm serve'
-3. Take '--kv-transfer-config' as follow
+3. Set '--kv-transfer-config' as follows:
```shell
--kv-transfer-config \
@@ -437,7 +437,7 @@ In the PD separation scenario, we provide a optimized configuration.
}'
```
-4. Take '--additional-config' as follow
+4. Set '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
@@ -446,7 +446,7 @@ In the PD separation scenario, we provide a optimized configuration.
- **decoder node**
1. set HCCL_BUFFSIZE=1024
-2. Take '--kv-transfer-config' as follow
+2. Set '--kv-transfer-config' as follows:
```shell
--kv-transfer-config
@@ -459,7 +459,7 @@ In the PD separation scenario, we provide a optimized configuration.
}'
```
-3. Take '--additional-config' as follow
+3. Set '--additional-config' as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true}'
@@ -467,13 +467,13 @@ In the PD separation scenario, we provide a optimized configuration.
### Parameters Description
-1.'--additional-config' Parameter Introduction:
+1. '--additional-config' Parameter Introduction:
-- **"enable_weight_nz_layout":** Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
-- **"enable_prefill_optimizations":** Whether to enable DeepSeek models' prefill optimizations.
+- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
+- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
-3.enable MTP
+2. Enable MTP
Add the following command to your configurations.
```shell
@@ -482,7 +482,7 @@ Add the following command to your configurations.
### Recommended Configuration Example
-For example,if the average input length is 3.5k, and the output length is 1.1k, the context length is 16k, the max length of the input dataset is 7K. In this scenario, we give a recommended configuration for distributed DP server with high EP. Here we use 4 nodes for prefill and 4 nodes for decode.
+For example, suppose the average input length is 3.5k, the output length is 1.1k, the context length is 16k, and the max length of the input dataset is 7k. In this scenario, we give a recommended configuration for the distributed DP server with high EP. Here we use 4 nodes for prefill and 4 nodes for decode.
| node | DP | TP | EP | max-model-len | max-num-batched-tokens | max-num-seqs | gpu-memory-utilization |
|----------|----|----|----|---------------|------------------------|--------------|-----------|
@@ -495,6 +495,6 @@ Note that these configurations are not related to optimization. You need to adju
## FAQ
-### 1. Prefiller nodes need to warmup
+### 1. Prefiller nodes need to warm up
Since the computation of some NPU operators requires several rounds of warm-up to achieve best performance, we recommend preheating the service with some requests before conducting performance tests to achieve the best end-to-end throughput.
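+
+A minimal warm-up sketch is shown below. It assumes the entry point (the proxy or a single DP endpoint) is reachable at `http://localhost:8080`; the URL, model name, and number of requests are placeholders to adapt to your deployment.
+
+```shell
+# Send a few short requests so NPU operators are warmed up before benchmarking.
+for i in {1..8}; do
+  curl -s http://localhost:8080/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{"model": "vllm-ascend/DeepSeek-R1-W8A8", "prompt": "warm up", "max_tokens": 16}' > /dev/null
+done
+```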
diff --git a/docs/source/user_guide/feature_guide/layer_sharding.md b/docs/source/user_guide/feature_guide/layer_sharding.md
index 6abc4d4b..62fc3a2c 100644
--- a/docs/source/user_guide/feature_guide/layer_sharding.md
+++ b/docs/source/user_guide/feature_guide/layer_sharding.md
@@ -7,7 +7,7 @@
Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
-- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand via asynchronous broadcast** during the forward pass.
+- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand** via asynchronous broadcast during the forward pass.
As illustrated in the figure below, this design enables broadcast to reach weights: while the current layer (e.g., MLA or MOE) is being computed, the system **asynchronously broadcasts the next layer's weight** in the background. Because the attention computation in the MLA module is sufficiently latency-bound, the weight transfer for `o_proj` is **fully overlapped with computation**, making the communication **latency-free from the perspective of end-to-end inference**.
@@ -23,7 +23,7 @@ This approach **preserves exact computational semantics** while **significantly

-> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast pre-fetches the next layer's weight while the current layer computes—enabling zero-overhead weight loading.
+> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes—enabling **zero-overhead** weight loading.
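+
+For intuition only, the snippet below sketches the ownership rule and the asynchronous prefetch pattern with plain `torch.distributed`; it is an illustrative sketch, not the vllm-ascend implementation, and assumes an already initialized default process group.
+
+```python
+import torch
+import torch.distributed as dist
+
+
+def owner_rank(layer_idx: int, group_size: int) -> int:
+    # The i-th layer's real weight is stored only on device i % K.
+    return layer_idx % group_size
+
+
+def prefetch_layer_weight(layer_idx: int, weight_buffer: torch.Tensor):
+    # On the owning rank, weight_buffer holds the real weight; on the other
+    # ranks it is the placeholder buffer that the broadcast fills in.
+    # async_op=True returns a handle so the transfer overlaps with the
+    # current layer's computation; call handle.wait() right before use.
+    src = owner_rank(layer_idx, dist.get_world_size())
+    return dist.broadcast(weight_buffer, src=src, async_op=True)
+```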
---
diff --git a/docs/source/user_guide/feature_guide/netloader.md b/docs/source/user_guide/feature_guide/netloader.md
index 857f3473..3bb0384d 100644
--- a/docs/source/user_guide/feature_guide/netloader.md
+++ b/docs/source/user_guide/feature_guide/netloader.md
@@ -24,7 +24,7 @@ The server runs alongside normal inference tasks via sub-threads and via `statel
### Application Scenarios
-- **Reduce startup latency**: By reusing already loaded weights and transferring them directly between NPU cards, Netloader cuts down model loading time vs conventional remote/local pull strategies.
+- **Reduce startup latency**: By reusing already loaded weights and transferring them directly between NPU cards, Netloader cuts down model loading time versus conventional remote/local pull strategies.
- **Relieve network & storage load**: Avoid repeated downloads of weight files from remote repositories, thus reducing pressure on central storage and network traffic.
- **Improve resource utilization & lower cost**: Faster loading allows less reliance on standby compute nodes; resources can be scaled up/down more flexibly.
- **Enhance business continuity & high availability**: In failure recovery, new instances can quickly take over without long downtime, improving system reliability and user experience.
@@ -37,7 +37,7 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi
| Field Name | Type | Description | Allowed Values / Notes |
|--------------------|---------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
-| **SOURCE** | List | Weighted data sources. Each item is a map with `device_id` and `sources`, specifying the rank and its endpoints (IP:port). Example: `{"SOURCE": [{"device_id": 0, "sources": ["10.170.22.152:19374"]}, {"device_id": 1, "sources": ["10.170.22.152:11228"]}]}` If omitted or empty, fallback to default loader. The SOURCE here is second priority. | A list of objects with keys `device_id: int` and `sources: List[str]` |
+| **SOURCE** | List | Weight data sources. Each item is a map with `device_id` and `sources`, specifying the rank and its endpoints (IP:port). Example: `{"SOURCE": [{"device_id": 0, "sources": ["10.170.22.152:19374"]}, {"device_id": 1, "sources": ["10.170.22.152:11228"]}]}` If omitted or empty, falls back to the default loader. The SOURCE here is second priority. | A list of objects with keys `device_id: int` and `sources: List[str]` |
| **MODEL** | String | The model name, used to verify consistency between client and server. | Defaults to the `--model` argument if not specified. |
| **LISTEN_PORT** | Integer | Base port for the server listener. | The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port is chosen. Valid range: 1024–65535. If out of range, that server instance won’t open a listener. |
| **INT8_CACHE** | String | Behavior for handling int8 parameters in quantized models. | One of `["hbm", "dram", "no"]`. - `hbm`: copy original int8 parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). - `dram`: copy to DRAM. - `no`: no special handling (may lead to divergence or unpredictable behavior). Default: `"no"`. |
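+
+For illustration, a launch command could look like the sketch below. The model, source endpoint, and port are placeholders, and it assumes the JSON configuration is passed through vLLM's `--model-loader-extra-config` option; adapt it to how your deployment actually supplies the Netloader configuration.
+
+```bash
+# Hypothetical example: replace the model, source endpoint, and port with real values.
+vllm serve Qwen/Qwen2.5-7B-Instruct \
+    --load-format=netloader \
+    --model-loader-extra-config '{
+        "SOURCE": [{"device_id": 0, "sources": ["10.170.22.152:19374"]}],
+        "LISTEN_PORT": 19374,
+        "INT8_CACHE": "no"
+    }'
+```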
diff --git a/docs/source/user_guide/feature_guide/quantization.md b/docs/source/user_guide/feature_guide/quantization.md
index ecf66588..3fb4a264 100644
--- a/docs/source/user_guide/feature_guide/quantization.md
+++ b/docs/source/user_guide/feature_guide/quantization.md
@@ -8,7 +8,7 @@ Model quantization is a technique that reduces model size and computational over
>
> You can choose to convert the model yourself or use the quantized model we uploaded.
> See .
-> Before you quantize a model, ensure that the RAM size is enough.
+> Before you quantize a model, ensure sufficient RAM is available.
## Quantization Tools
diff --git a/docs/source/user_guide/feature_guide/sleep_mode.md b/docs/source/user_guide/feature_guide/sleep_mode.md
index 64a24576..ba7a0bb0 100644
--- a/docs/source/user_guide/feature_guide/sleep_mode.md
+++ b/docs/source/user_guide/feature_guide/sleep_mode.md
@@ -2,7 +2,7 @@
## Overview
-Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.
+Sleep Mode is an API designed to offload model weights and discard KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs autoregressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.
Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free KV cache and even offload model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.
@@ -71,7 +71,7 @@ The following is a simple example of how to use sleep mode.
- Online serving:
:::{note}
- Considering there may be a risk of malicious access, please make sure you are under a dev-mode, and explicit specify the dev environment `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
+    Considering there may be a risk of malicious access, please make sure you are in dev mode, and explicitly set the dev environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::
```bash
diff --git a/docs/source/user_guide/feature_guide/speculative_decoding.md b/docs/source/user_guide/feature_guide/speculative_decoding.md
index 168bcc24..d1609613 100644
--- a/docs/source/user_guide/feature_guide/speculative_decoding.md
+++ b/docs/source/user_guide/feature_guide/speculative_decoding.md
@@ -150,3 +150,4 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+ ```
diff --git a/docs/source/user_guide/feature_guide/ucm_deployment.md b/docs/source/user_guide/feature_guide/ucm_deployment.md
index a5596972..66d28a49 100644
--- a/docs/source/user_guide/feature_guide/ucm_deployment.md
+++ b/docs/source/user_guide/feature_guide/ucm_deployment.md
@@ -7,7 +7,7 @@ Unified Cache Management (UCM) provides an external KV-cache storage layer desig
## Prerequisites
* OS: Linux
-* A hardware with Ascend NPU. It’s usually the Atlas 800 A2 series.
+* Hardware with Ascend NPUs. It's usually the Atlas 800 A2 series.
* **vLLM: main branch**
* **vLLM Ascend: main branch**
@@ -17,7 +17,7 @@ Unified Cache Management (UCM) provides an external KV-cache storage layer desig
## Configure UCM for Prefix Caching
-Modify the UCM configuration file to specify which UCM connector to use and where KV blocks should be stored.
+Modify the UCM configuration file to specify which UCM connector to use and where KV blocks should be stored.
You may directly edit the example file at:
`unified-cache-management/examples/ucm_config_example.yaml`
@@ -78,7 +78,7 @@ vllm serve Qwen/Qwen2.5-14B-Instruct \
**⚠️ Make sure to replace `"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**
-If you see log as below:
+If you see the log below:
```bash
INFO: Started server process [1049932]
diff --git a/docs/source/user_guide/feature_guide/weight_prefetch.md b/docs/source/user_guide/feature_guide/weight_prefetch.md
index 46b447df..cd1ab841 100644
--- a/docs/source/user_guide/feature_guide/weight_prefetch.md
+++ b/docs/source/user_guide/feature_guide/weight_prefetch.md
@@ -2,7 +2,7 @@
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as quantize, MoE gating top_k, RMSNorm and SwiGlu. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
-Since we use vector computations to hide the weight prefetching pipeline, it has effect on computation, if you prioritize low latency over high throughput, then it it best not to enable prefetching.
+Since we use vector computations to hide the weight prefetching pipeline, it has an effect on computation. If you prioritize low latency over high throughput, it is best not to enable prefetching.
## Quick Start
@@ -10,25 +10,25 @@ With `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to op
## Fine-tune Prefetch Ratio
-Since weight prefetch use vector computations to hide the weight prefetching pipeline, the setting of the prefetch size is crucial. If the size is too small, the optimization benefits will not be fully realized, while a larger size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have add `prefetch_ratio` to allow for flexible size configuration based on the specific workload, detail as following:
+Since weight prefetch uses vector computations to hide the weight prefetching pipeline, the setting of the prefetch size is crucial. If the size is too small, the optimization benefits will not be fully realized, while a larger size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have added `prefetch_ratio` to allow for flexible size configuration based on the specific workload, details as follows:
-With `prefetch_ratio` in `"weight_prefetch_config"` to custom the weight prefetch ratio for specify linear layers.
+Use `prefetch_ratio` in `"weight_prefetch_config"` to customize the weight prefetch ratio for specific linear layers.
-The “attn” and “moe” configuration options are used for MoE model, detail as following:
+The `attn` and `moe` configuration options are used for MoE models, details as follows:
`"attn": { "qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}`
-The “mlp” configuration option is used to optimize the performance of the Dense model, detail as following:
+The `mlp` configuration option is used to optimize the performance of dense models, details as follows:
`"mlp": {"gate_up": 1.0, "down": 1.0}`
-Above value are the default config, the default value has a good performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs`is 144, for Qwen3-32B-W8A8 when `--max-num-seqs`is 72.
+The values above are the default config. The defaults give good performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs` is 144, and for Qwen3-32B-W8A8 when `--max-num-seqs` is 72.
However, this may not be the optimal configuration for your scenario. For higher concurrency, you can try increasing the prefetch size. For lower concurrency, prefetching may not offer any advantages, so you can decrease the size or disable prefetching. Determine if the prefetch size is appropriate by collecting profiling data. Specifically, check if the time required for the prefetch operation (e.g., MLP Down Proj weight prefetching) overlaps with the time required for parallel vector computation operators (e.g., SwiGlu computation), and whether the prefetch operation is no later than the completion time of the vector computation operator. In the profiling timeline, a prefetch operation appears as a CMO operation on a single stream; this CMO operation is the prefetch operation.
-Notices:
+Notes:
-1) Weight prefetch of MLP `down` project prefetch dependence sequence parallel, if you want open for mlp `down` please also enable sequence parallel.
+1) Weight prefetch of the MLP `down` projection depends on sequence parallelism. If you want to enable it for the MLP `down` projection, please also enable sequence parallelism.
2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18MB.
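+
+As a quick illustration of this cap (a sketch of the rule described above, not the backend's actual code), the effective prefetch size can be reasoned about as:
+
+```python
+L2_PREFETCH_CAP = 18 * 1024 * 1024  # 18MB upper bound imposed by the L2 cache
+
+
+def effective_prefetch_bytes(prefetch_ratio: float, weight_size_bytes: int) -> int:
+    # e.g. a ratio of 1.0 on a 25MB down_proj weight is clamped to 18MB.
+    return min(int(prefetch_ratio * weight_size_bytes), L2_PREFETCH_CAP)
+```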
## Example
@@ -55,7 +55,7 @@ Notices:
2) For dense model:
-Following is the default configuration that can get a good performance for `--max-num-seqs`is 72 for Qwen3-32B-W8A8
+The following is the default configuration, which gives good performance for Qwen3-32B-W8A8 when `--max-num-seqs` is 72:
```shell
--additional-config \
diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md
index 9b60e859..8084db65 100644
--- a/docs/source/user_guide/release_notes.md
+++ b/docs/source/user_guide/release_notes.md
@@ -108,7 +108,7 @@ Many custom ops and triton kernels were added in this release to speed up model
### Known Issue
-- Due the upgrade of `transformers` package, some models quantization weight, such as `qwen2.5vl`, `gemma3`, `minimax`, may not work. We'll fix it in the next post release. [#6302](https://github.com/vllm-project/vllm-ascend/pull/6302)
+- Due to the upgrade of the `transformers` package, the quantization weights of some models, such as `qwen2.5vl`, `gemma3`, and `minimax`, may not work. We'll fix it in the next post release. [#6302](https://github.com/vllm-project/vllm-ascend/pull/6302)
-- The performance of `Qwen3-32B` will not be good with 128K input case, it's suggested to enable pcp&dcp feature for this case. This will be improved in the next CANN release.
-- The performance of `Qwen3-235B`, `Qwen3-480B` under prefill-decode scenario and EP=32 scenario is not good as expect. We'll improve it in the next post release.
-- When deploy deepseek3.1 under prefill-decode scenario, please make sure the tp size for decode node is great than 1. `TP=1` doesn't work. This will be fixed in the next CANN release.
+- The performance of `Qwen3-32B` is not good in the 128K input case; it's suggested to enable the pcp&dcp feature for this case. This will be improved in the next CANN release.
+- The performance of `Qwen3-235B` and `Qwen3-480B` under the prefill-decode scenario and the EP=32 scenario is not as good as expected. We'll improve it in the next post release.
+- When deploying deepseek3.1 under the prefill-decode scenario, please make sure the TP size for the decode node is greater than 1. `TP=1` doesn't work. This will be fixed in the next CANN release.
@@ -144,7 +144,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- model runner v2 support triton of penalty. [#5854](https://github.com/vllm-project/vllm-ascend/pull/5854)
- model runner v2 support eagle spec decoding. [#5840](https://github.com/vllm-project/vllm-ascend/pull/5840)
-- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
+- Fix multimodal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
@@ -670,7 +670,7 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
### Others
- Bug fixes:
- - Fix functional problem of multi-modality models like Qwen2-audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
+ - Fix functional problem of multimodality models like Qwen2-audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
- Fix the process group creating error with external launch scenario. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
- Fix the functional problem with guided decoding. [#2022](https://github.com/vllm-project/vllm-ascend/pull/2022)
- Fix the accuracy issue with common MoE models in DP scenario. [#1856](https://github.com/vllm-project/vllm-ascend/pull/1856)
@@ -953,7 +953,7 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
### Models
- Qwen2.5 VL works with V1 Engine now. [#736](https://github.com/vllm-project/vllm-ascend/pull/736)
-- LLama4 works now. [#740](https://github.com/vllm-project/vllm-ascend/pull/740)
+- Llama4 works now. [#740](https://github.com/vllm-project/vllm-ascend/pull/740)
- A new kind of DeepSeek model called dual-batch overlap(DBO) is added. Please set `VLLM_ASCEND_ENABLE_DBO=1` to use it. [#941](https://github.com/vllm-project/vllm-ascend/pull/941)
### Others
@@ -1090,7 +1090,7 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
- A new communicator `pyhccl` is added. It's used for call CANN HCCL library directly instead of using `torch.distribute`. More usage of it will be added in the next release [#503](https://github.com/vllm-project/vllm-ascend/pull/503)
- The custom ops build is enabled by default. You should install the packages like `gcc`, `cmake` first to build `vllm-ascend` from source. Set `COMPILE_CUSTOM_KERNELS=0` environment to disable the compilation if you don't need it. [#466](https://github.com/vllm-project/vllm-ascend/pull/466)
-- The custom op `rotay embedding` is enabled by default now to improve the performance. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)
+- The custom op `rotary embedding` is enabled by default now to improve the performance. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)
## v0.7.3rc2 - 2025.03.29