[main][Docs] Fix typos across documentation (#6728)
## Summary
Fix typos and improve grammar consistency across 50 documentation files.
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
## Version Specific FAQs
- [[v0.14.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6148)
- [[v0.13.0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/6583)
## General FAQs
### 1. What devices are currently supported?
Currently, **ONLY** the Atlas A2 series (Ascend-cann-kernels-910b), Atlas A3 series (Atlas-A3-cann-kernels), and Atlas 300I series (Ascend-cann-kernels-310p) are supported:
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
The following series are NOT supported yet:
- Atlas 200I A2 (Ascend-cann-kernels-310b): not planned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910): not planned yet
From a technical point of view, vllm-ascend support would be possible as long as torch-npu supports the device. Otherwise, we would have to implement it using custom ops. We also welcome you to join us and improve it together.
### 2. How to get our Docker containers?
Basically, the reason is that the NPU environment is not configured correctly. You can:
2. Try `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable the CANN package.
3. Try `npu-smi info` to check whether the NPU is working.
If the above steps are not working, you can try the following code in Python to check whether there are any errors:
```python
import torch
import torch_npu  # Ascend backend plugin for PyTorch

# On a correctly configured environment, this prints True without raising errors.
print(torch.npu.is_available())
```
If all the above steps do not work, feel free to submit a GitHub issue.
### 7. How does vllm-ascend work with vLLM?
`vllm-ascend` is a hardware plugin for vLLM. The version of `vllm-ascend` is the same as the version of `vllm`. For example, if you use `vllm` 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we ensure that `vllm-ascend` and `vllm` are compatible at every commit.
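As an illustration, pinning a matching pair of releases could look like this (the versions are placeholders; pick the pair you actually need):

```bash
pip install vllm==0.9.1 vllm-ascend==0.9.1
```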
### 8. Does vllm-ascend support the Prefill Disaggregation feature?
Yes, vllm-ascend supports the Prefill Disaggregation feature with the Mooncake backend.
### 9. Does vllm-ascend support quantization methods?
Currently, w8a8, w4a8, and w4a4 quantization methods are already supported by vllm-ascend.
### 10. How to run a W8A8 DeepSeek model?
Follow the [inference tutorial](https://docs.vllm.ai/projects/ascend/en/latest/t
### 11. How is vllm-ascend tested?
vllm-ascend is tested in three aspects: functions, performance, and accuracy.
- **Functional test**: We added CI, including part of vllm's native unit tests and vllm-ascend's own unit tests. In vllm-ascend's tests, we test basic functionalities, popular model availability, and [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html) through E2E tests.
- **Performance test**: We provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for E2E performance benchmarking, which can be easily re-run locally. We will publish a perf website to show the performance test results for each pull request.
- **Accuracy test**: We are working on adding accuracy tests to the CI as well.
- **Nightly test**: We'll run a full test every night to make sure the code is working.
In the future, we'll publish the performance and accuracy test reports for each release.
### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
The problem is usually caused by installing a development or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of the vLLM package to use. Set `VLLM_VERSION` to the version of the vLLM package you have installed; the format should be `X.Y.Z`.
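For example, assuming the installed vLLM is 0.9.1:

```bash
export VLLM_VERSION=0.9.1  # must match the installed vLLM package; format X.Y.Z
```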
### 13. How to handle the out-of-memory issue?
OOM errors typically occur when the model exceeds the memory capacity of a singl
In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this (a combined example follows the list):
- **Limit `--max-model-len`**: It can reduce HBM usage during the KV cache initialization step.
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
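Putting the three mitigations together, a minimal launch sketch (the model name and values are illustrative; tune them for your model and device):

```bash
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8
```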
### 14. Failed to enable NPU graph mode when running DeepSeek
Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV head, a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, `num_heads`/`num_kv_heads` is in {32, 64, 128}.
```bash
[rank0]: RuntimeError: EZ9999: Inner Error!
```
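To sanity-check the constraint before launching, here is a rough sketch, under the assumption that MLA leaves a single KV head per rank; the head counts are hypothetical and should be read from the model's `config.json`:

```bash
# Hypothetical values for illustration; take them from the model's config.json.
num_attention_heads=128  # e.g. a DeepSeek-V3-sized model
num_kv_heads=1           # assumed: MLA keeps one KV head per rank
tp_size=4                # tensor parallel size

# Queries per KV head after the tensor parallel split; must print 32, 64, or 128.
echo $(( num_attention_heads / tp_size / num_kv_heads ))
```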
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
You may encounter the problem of C/C++ compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
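A recovery sequence based on that advice might look like:

```bash
python setup.py clean    # clear the stale build cache first
python setup.py install  # then install from source (recommended)
```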
### 16. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output determinism:
1. Sampling method: use **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams

# Greedy sampling: temperature=0 always picks the most likely token.
sampling_params = SamplingParams(temperature=0)
```
```bash
export ATB_LLM_LCOC_ENABLE=0
```
### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model?
The `Qwen2.5-Omni` model requires the `librosa` package to be installed. You need to install the `qwen-omni-utils` package to ensure all dependencies are met: run `pip install qwen-omni-utils`.
This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring that the audio processing functionality works correctly.
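For reference, the single command from the answer above:

```bash
pip install qwen-omni-utils  # also pulls in librosa and the other audio dependencies
```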
### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
```text
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient availab
```
Recommended mitigation strategies:
1. Manually configure the `compilation_config` parameter with a reduced size set: `'{"cudagraph_capture_sizes": [size1, size2, size3, ...]}'` (see the example after this list).
2. Employ ACLgraph's full graph mode as an alternative to the piecewise approach.
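As an illustration of the first mitigation, the reduced size set can be passed on the command line (the model name and sizes are placeholders):

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
```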
Root cause analysis:
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream-overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 19. How to install a custom version of torch_npu?
torch-npu will be overridden when installing vllm-ascend. If you need to instal
### 20. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
On certain operating systems, such as Kylin OS, you may encounter an `invalid tar header` error during the `docker pull` process:
```text
failed to register layer: ApplyLayer exit status 1 stdout: stderr: archive/tar: invalid tar header
```
When using `--shm-size`, you may need to add the `--privileged=true` flag to you
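For instance, a container launch combining the two flags might look like this (the image name and size are placeholders):

```bash
docker run -it --shm-size=16g --privileged=true quay.io/ascend/vllm-ascend:latest bash
```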
### 22. How to achieve low latency in a small batch scenario?
The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenarios is not satisfactory, mainly due to the lack of a flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`; you can install it using the following instruction:
```bash
bash tools/install_flash_infer_attention_score_ops_a2.sh
# For the Atlas A3 series, use: bash tools/install_flash_infer_attention_score_ops_a3.sh
```
**NOTE**: Don't set `additional_config.pa_shape_list` when using this method; otherwise, a different attention operator will be selected.
**Important**: Please make sure you're using the **official image** of `vllm-ascend`; otherwise, you **must change** the directory `/vllm-workspace` in `tools/install_flash_infer_attention_score_ops_a2.sh` or `tools/install_flash_infer_attention_score_ops_a3.sh` to your own, or create one. If you're not the root user, you need `sudo` **privileges** to run this script.
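If you are on a custom image, a minimal preparation step before running the script could be (assuming you keep the default path):

```bash
sudo mkdir -p /vllm-workspace  # create the directory the install scripts expect,
                               # or edit the path in the script to your own workspace
```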