diff --git a/benchmarks/README.md b/benchmarks/README.md index e1d8a018..141baa7c 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -29,7 +29,7 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B. - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99). -**Benchmarking Duration**: about 800 senond for single model. +**Benchmarking Duration**: about 800 seconds for single model. ## Quick Use diff --git a/docs/source/community/versioning_policy.md b/docs/source/community/versioning_policy.md index 24cb19d1..e6e6236b 100644 --- a/docs/source/community/versioning_policy.md +++ b/docs/source/community/versioning_policy.md @@ -141,7 +141,7 @@ Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend v ### Backward compatibility -For main branch, vLLM Ascend should works with vLLM main branch and latest 1 or 2 releases. To ensure backward compatibility, do as follows: +For main branch, vLLM Ascend should work with vLLM main branch and latest 1 or 2 releases. To ensure backward compatibility, do as follows: - Both main branch and target vLLM release, such as the vLLM main branch and vLLM 0.8.4, are tested by Ascend E2E CI. - To make sure that code changes are compatible with the latest 1 or 2 vLLM releases, vLLM Ascend introduces a version check mechanism inside the code. It checks the version of the installed vLLM package first to decide which code logic to use. If users hit the `InvalidVersion` error, it may indicate that they have installed a dev or editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use. diff --git a/docs/source/developer_guide/feature_guide/ACL_Graph.md b/docs/source/developer_guide/feature_guide/ACL_Graph.md index fab751cf..a2bd6b42 100644 --- a/docs/source/developer_guide/feature_guide/ACL_Graph.md +++ b/docs/source/developer_guide/feature_guide/ACL_Graph.md @@ -96,7 +96,7 @@ There are some other rules about ACL Graph and stream. Currently, we use func `u We will expand the stream resource limitation in the future. -## Limitation +## Limitations 1. `FULL` and `FULL_AND_PIECEWISE` are not supported now; 2. When use ACL Graph and MTP and `num_speculative_tokens > 1`, as vLLM don't support this case in v0.11.0, we need to set `cudagraph_capture_sizes` explicitly. diff --git a/docs/source/developer_guide/feature_guide/KV_Cache_Pool_Guide.md b/docs/source/developer_guide/feature_guide/KV_Cache_Pool_Guide.md index 843c7b56..de8797a9 100644 --- a/docs/source/developer_guide/feature_guide/KV_Cache_Pool_Guide.md +++ b/docs/source/developer_guide/feature_guide/KV_Cache_Pool_Guide.md @@ -12,7 +12,7 @@ vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Moo While one can utilize mooncake store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports mooncake store and can utilize the data transfer strategy to one that is best fit to Huawei NPU hardware. 
-Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largly inspired by **LMCacheConnectorV1** (see the `How is MooncakestoreConnectorV1 Implemented?` section).
+Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).

## Usage

@@ -47,7 +47,7 @@ By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Ca

When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.

-Currently, we only perform put and get operation of KV Pool for **Prefiil Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e. MooncakeConnector.
+Currently, we only perform put and get operations of the KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from the Mooncake P2P KV Connector, i.e. MooncakeConnector.

The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfer KV Caches between NPU devices directly.

@@ -55,11 +55,11 @@ To Enable this feature, we need to setup both Mooncake Connector and Mooncake St

For details, please also refer to the Mooncake Connector Store Deployment Guide.

-## How is MooncakestoreConnectorV1 Implemented?
+## How is MooncakeStoreConnectorV1 Implemented?

-**MooncakestoreConnectorV1** inhereits the KV Connector V1 class in vLLM V1: through implementing the required methods defined in the KV connector V1 base class, one can integrate a thrid-party KV cache transfer/storage backend into the vLLM framework.
+**MooncakeStoreConnectorV1** inherits from the KV Connector V1 class in vLLM V1: by implementing the required methods defined in the KV connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.

-MooncakeStoreConnectorV1 is also largly inspried by LMCacheConnectorV1 in term of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing related designs. On top of this, we have also added our own design including `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimization such as removing the `LocalBuffer` in LMCache to remove redundant data transfer.
+MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing-related designs. On top of this, we have also added our own designs, including `KVTransferThread`, which allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimizations such as removing the `LocalBuffer` in LMCache to avoid redundant data transfer.
The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in V1 scheduler and worker-side methods that are called in V1 worker, namely: @@ -77,14 +77,14 @@ The KV Connector methods that need to be implemented can be categorized into sch `wait_for_layer_load`: Optional; Wait for layer load in layerwise + async KV load scenario. `save_kv_layer`: Optional Do layerwise KV cache put into KV Pool. `wait_for_save`: Wait for KV Save to finish if async KV cache save/put. -`get_finished` Get request that finished KV transfer, `done_sending` if `put` finished, `done_reciving` if `get` finished. +`get_finished` Get request that finished KV transfer, `done_sending` if `put` finished, `done_receiving` if `get` finished. ## DFX 1. When looking up a key in KV Pool, if we cannot find the key, there is no Cache Hit for this specific block; we return no hit for this block and do not look up further blocks for current request. -2. Similaly, when we are trying to put a block into KV Pool and failed, we do not put further blocks (subject to change). +2. Similarly, when we are trying to put a block into KV Pool and failed, we do not put further blocks (subject to change). -## Limitation +## Limitations 1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool. diff --git a/docs/source/developer_guide/feature_guide/eplb_swift_balancer.md b/docs/source/developer_guide/feature_guide/eplb_swift_balancer.md index 4a165841..4d085604 100644 --- a/docs/source/developer_guide/feature_guide/eplb_swift_balancer.md +++ b/docs/source/developer_guide/feature_guide/eplb_swift_balancer.md @@ -243,7 +243,7 @@ During the update process, if only a few layers or the expert table of a certain When updating expert weights, ensure that the memory allocated for the expert weights has been released, or that the expert (referring to the old version) is no longer in use. -## Limitation +## Limitations Before using EPLB, start the script and add `export DYNAMIC_EPLB="true"`. Before performing load data collection (or performance data collection), start the script and add `export EXPERT_MAP_RECORD="true"`. diff --git a/docs/source/developer_guide/feature_guide/npugraph_ex.md b/docs/source/developer_guide/feature_guide/npugraph_ex.md index 1d446812..730fe0a1 100644 --- a/docs/source/developer_guide/feature_guide/npugraph_ex.md +++ b/docs/source/developer_guide/feature_guide/npugraph_ex.md @@ -94,7 +94,7 @@ with fake_mode: ) ``` -The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascned and nugraph_ex code repositories. +The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascend and npugraph_ex code repositories. ### DFX diff --git a/docs/source/developer_guide/feature_guide/patch.md b/docs/source/developer_guide/feature_guide/patch.md index 8a983217..c48fc7ee 100644 --- a/docs/source/developer_guide/feature_guide/patch.md +++ b/docs/source/developer_guide/feature_guide/patch.md @@ -70,7 +70,7 @@ Before writing a patch, following the principle above, we should patch the least 7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md) -## Limitation +## Limitations 1. 
In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only can patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely. 2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work. diff --git a/docs/source/developer_guide/feature_guide/quantization.md b/docs/source/developer_guide/feature_guide/quantization.md index 5784243e..cfade88d 100644 --- a/docs/source/developer_guide/feature_guide/quantization.md +++ b/docs/source/developer_guide/feature_guide/quantization.md @@ -26,7 +26,7 @@ The `create_weights` method is used for weight initialization; the `process_weig We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**). -**Supplemnet**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example: +**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example: ```json { diff --git a/docs/source/faqs.md b/docs/source/faqs.md index 4d476b2c..11225dda 100644 --- a/docs/source/faqs.md +++ b/docs/source/faqs.md @@ -130,7 +130,7 @@ vllm-ascend is tested in three aspects, functions, performance, and accuracy. - **Nightly test**: we'll run full test every night to make sure the code is working. -Finnall, for each release, we'll publish the performance test and accuracy test report in the future. +Finally, for each release, we'll publish the performance test and accuracy test report in the future. ### 12. How to fix the error "InvalidVersion" when using vllm-ascend? diff --git a/docs/source/locale/zh_CN/LC_MESSAGES/faqs.po b/docs/source/locale/zh_CN/LC_MESSAGES/faqs.po index f961f09a..6a8513be 100644 --- a/docs/source/locale/zh_CN/LC_MESSAGES/faqs.po +++ b/docs/source/locale/zh_CN/LC_MESSAGES/faqs.po @@ -282,7 +282,7 @@ msgstr "11. 如何运行 w8a8 DeepSeek 模型?" #: ../../faqs.md:87 msgid "" -"Please following the [inferencing tutorail](https://vllm-" +"Please following the [inferencing tutorial](https://vllm-" "ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model" " to DeepSeek." msgstr "" @@ -459,7 +459,7 @@ msgstr "" "install` 进行安装,或者使用 `python setup.py clean` 清除缓存。" #: ../../faqs.md:130 -msgid "18. How to generate determinitic results when using vllm-ascend?" +msgid "18. How to generate deterministic results when using vllm-ascend?" msgstr "18. 使用 vllm-ascend 时如何生成确定性结果?" 
#: ../../faqs.md:131 @@ -475,5 +475,5 @@ msgstr "" "sample)**,例如:" #: ../../faqs.md:158 -msgid "Set the following enveriments parameters:" +msgid "Set the following environment parameters:" msgstr "设置以下环境参数:" diff --git a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po index ddd1eb54..bafbce5a 100644 --- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po +++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po @@ -695,7 +695,7 @@ msgstr "" #: ../../user_guide/release_notes.md:118 msgid "" "Qwen3 and Qwen3MOE is supported now. The performance and accuracy of Qwen3 " -"is well tested. You can try it now. Mindie Turbo is recomanded to improve " +"is well tested. You can try it now. Mindie Turbo is recommended to improve " "the performance of Qwen3. [#903](https://github.com/vllm-project/vllm-" "ascend/pull/903) [#915](https://github.com/vllm-project/vllm-ascend/" "pull/915)" @@ -1204,7 +1204,7 @@ msgid "" "visit [official guide](https://docs.vllm.ai/en/latest/getting_started/" "v1_user_guide.html) to get more detail. By default, vLLM will fallback to " "V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want " -"to use V1 forcely." +"to use V1 forcibly." msgstr "" "本版本包含了对 vLLM V1 引擎的实验性支持。你可以访问[官方指南](https://docs." "vllm.ai/en/latest/getting_started/v1_user_guide.html)获取更多详细信息。默认" @@ -1335,7 +1335,7 @@ msgstr "" #: ../../user_guide/release_notes.md:232 msgid "" -"Add Ascend Custom Ops framewrok. Developers now can write customs ops using " +"Add Ascend Custom Ops framework. Developers now can write customs ops using " "AscendC. An example ops `rotary_embedding` is added. More tutorials will " "come soon. The Custom Ops compilation is disabled by default when " "installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371]" @@ -1512,7 +1512,7 @@ msgstr "" msgid "" "Improved and reduced the garbled code in model output. But if you still hit " "the issue, try to change the generation config value, such as " -"`temperature`, and try again. There is also a knonwn issue shown below. Any " +"`temperature`, and try again. There is also a known issue shown below. Any " "[feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is " "welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)" msgstr "" diff --git a/docs/source/tutorials/DeepSeek-R1.md b/docs/source/tutorials/DeepSeek-R1.md index 1305b454..5371ec8c 100644 --- a/docs/source/tutorials/DeepSeek-R1.md +++ b/docs/source/tutorials/DeepSeek-R1.md @@ -3,7 +3,7 @@ ## Introduction DeepSeek-R1 is a high-performance Mixture-of-Experts (MoE) large language model developed by DeepSeek Company. It excels in complex logical reasoning, mathematical problem-solving, and code generation. By dynamically activating its expert networks, it delivers exceptional performance while maintaining computational efficiency. Building upon R1, DeepSeek-R1-W8A8 is a fully quantized version of the model. It employs 8-bit integer (INT8) quantization for both weights and activations, which significantly reduces the model's memory footprint and computational requirements, enabling more efficient deployment and application in resource-constrained environments. -This article takes the deepseek- R1-W8A8 version as an example to introduce the deployment of the R1 series models. +This article takes the `DeepSeek-R1-W8A8` version as an example to introduce the deployment of the R1 series models. 
## Supported Features @@ -25,7 +25,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm ### Installation -You can using our official docker image to run `DeepSeek-R1-W8A8` directly. +You can use our official docker image to run `DeepSeek-R1-W8A8` directly. Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). @@ -293,7 +293,7 @@ Run performance evaluation of `DeepSeek-R1-W8A8` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/DeepSeek-V3.1.md b/docs/source/tutorials/DeepSeek-V3.1.md index 7781ce75..acac9c17 100644 --- a/docs/source/tutorials/DeepSeek-V3.1.md +++ b/docs/source/tutorials/DeepSeek-V3.1.md @@ -38,7 +38,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm ### Installation -You can using our official docker image to run `DeepSeek-V3.1` directly. +You can use our official docker image to run `DeepSeek-V3.1` directly. Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). @@ -96,7 +96,7 @@ nic_name="xxxx" local_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD # AIV @@ -158,7 +158,7 @@ local_ip="xxxx" node0_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD export HCCL_IF_IP=$local_ip @@ -210,7 +210,7 @@ local_ip="xxx" node0_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD export HCCL_IF_IP=$local_ip @@ -275,7 +275,7 @@ local_ip="141.xx.xx.1" node0_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD export HCCL_IF_IP=$local_ip @@ -352,7 +352,7 @@ local_ip="141.xx.xx.2" node0_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. 
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD export HCCL_IF_IP=$local_ip @@ -429,7 +429,7 @@ local_ip="141.xx.xx.3" node0_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD export HCCL_IF_IP=$local_ip @@ -505,7 +505,7 @@ local_ip="141.xx.xx.4" node0_ip="xxxx" # [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on. +# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD export HCCL_IF_IP=$local_ip @@ -707,7 +707,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/DeepSeek-V3.2.md b/docs/source/tutorials/DeepSeek-V3.2.md index b44a4059..528176f3 100644 --- a/docs/source/tutorials/DeepSeek-V3.2.md +++ b/docs/source/tutorials/DeepSeek-V3.2.md @@ -29,7 +29,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm ### Installation -You can using our official docker image to run `DeepSeek-V3.2` directly.. +You can use our official docker image to run `DeepSeek-V3.2` directly. :::::{tab-set} :sync-group: install @@ -456,11 +456,11 @@ Before you start, please dp_rpc_port = args.dp_rpc_port vllm_start_port = args.vllm_start_port - def run_command(visiable_devices, dp_rank, vllm_engine_port): + def run_command(visible_devices, dp_rank, vllm_engine_port): command = [ "bash", "./run_dp_template.sh", - visiable_devices, + visible_devices, str(vllm_engine_port), str(dp_size), str(dp_rank), @@ -481,9 +481,9 @@ Before you start, please for i in range(dp_size_local): dp_rank = dp_rank_start + i vllm_engine_port = vllm_start_port + i - visiable_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size)) + visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size)) process = multiprocessing.Process(target=run_command, - args=(visiable_devices, dp_rank, + args=(visible_devices, dp_rank, vllm_engine_port)) processes.append(process) process.start() @@ -895,7 +895,7 @@ Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/GLM4.x.md b/docs/source/tutorials/GLM4.x.md index e142ae26..5a6d3178 100644 --- a/docs/source/tutorials/GLM4.x.md +++ b/docs/source/tutorials/GLM4.x.md @@ -29,7 +29,7 @@ It is recommended to download the model weight to the shared directory of multip ### Installation -You can using our official docker image to run `GLM-4.x` directly. +You can use our official docker image to run `GLM-4.x` directly. 
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). @@ -146,7 +146,7 @@ Run performance evaluation of `GLM-4.x` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/PaddleOCR-VL.md b/docs/source/tutorials/PaddleOCR-VL.md index 8891b161..8ba020a7 100644 --- a/docs/source/tutorials/PaddleOCR-VL.md +++ b/docs/source/tutorials/PaddleOCR-VL.md @@ -22,7 +22,7 @@ It is recommended to download the model weights to a local directory (e.g., `./P ### Installation -You can using our official docker image to run `PaddleOCR-VL` directly. +You can use our official docker image to run `PaddleOCR-VL` directly. Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). diff --git a/docs/source/tutorials/Qwen-VL-Dense.md b/docs/source/tutorials/Qwen-VL-Dense.md index 94841a5f..55f71a12 100644 --- a/docs/source/tutorials/Qwen-VL-Dense.md +++ b/docs/source/tutorials/Qwen-VL-Dense.md @@ -548,7 +548,7 @@ lm_eval \ Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/Qwen2.5-7B.md b/docs/source/tutorials/Qwen2.5-7B.md index abab2564..be92c1cd 100644 --- a/docs/source/tutorials/Qwen2.5-7B.md +++ b/docs/source/tutorials/Qwen2.5-7B.md @@ -24,7 +24,7 @@ It is recommended to download the model weights to a local directory (e.g., `./Q ### Installation -You can using our official docker image and install extra operator for supporting `Qwen2.5-7B-Instruct`. +You can use our official docker image and install extra operator for supporting `Qwen2.5-7B-Instruct`. :::::{tab-set} :sync-group: install @@ -158,7 +158,7 @@ Run performance evaluation of `Qwen2.5-7B-Instruct` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/Qwen2.5-Omni.md b/docs/source/tutorials/Qwen2.5-Omni.md index d5582d5b..ff5eecaa 100644 --- a/docs/source/tutorials/Qwen2.5-Omni.md +++ b/docs/source/tutorials/Qwen2.5-Omni.md @@ -19,11 +19,11 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur - `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) - `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) -Following examples use the 7B version deafultly. +Following examples use the 7B version by default. ### Installation -You can using our official docker image to run `Qwen2.5-Omni` directly. +You can use our official docker image to run `Qwen2.5-Omni` directly. 
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).

@@ -97,7 +97,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"

`--allowed-local-media-path` is optional, only set it if you need infer model with local media file

-`--gpu-memory-utilization` should not be set manually only if yous know what this parameter aims to.
+`--gpu-memory-utilization` should not be set manually unless you know what this parameter is for.

#### Multiple NPU (Qwen2.5-Omni-7B)

@@ -195,7 +195,7 @@ Run performance evaluation of `Qwen2.5-Omni-7B` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.

diff --git a/docs/source/tutorials/Qwen3-235B-A22B.md b/docs/source/tutorials/Qwen3-235B-A22B.md
index bcdde1bb..3a2c0a9f 100644
--- a/docs/source/tutorials/Qwen3-235B-A22B.md
+++ b/docs/source/tutorials/Qwen3-235B-A22B.md
@@ -300,7 +300,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.

diff --git a/docs/source/tutorials/Qwen3-Coder-30B-A3B.md b/docs/source/tutorials/Qwen3-Coder-30B-A3B.md
index a0670f31..03fb0085 100644
--- a/docs/source/tutorials/Qwen3-Coder-30B-A3B.md
+++ b/docs/source/tutorials/Qwen3-Coder-30B-A3B.md
@@ -24,7 +24,7 @@ It is recommended to download the model weight to the shared directory of multip

`Qwen3-Coder` is first supported in `vllm-ascend:v0.10.0rc1`, please run this model using a later version.

-You can using our official docker image to run `Qwen3-Coder-30B-A3B-Instruct` directly.
+You can use our official docker image to run `Qwen3-Coder-30B-A3B-Instruct` directly.

```{code-block} bash
:substitutions:

diff --git a/docs/source/tutorials/Qwen3-Dense.md b/docs/source/tutorials/Qwen3-Dense.md
index cf99beb8..413d125d 100644
--- a/docs/source/tutorials/Qwen3-Dense.md
+++ b/docs/source/tutorials/Qwen3-Dense.md
@@ -42,7 +42,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm

### Installation

-You can using our official docker image for supporting Qwen3 Dense models.
+You can use our official docker image to run Qwen3 Dense models.
Currently, we provide the all-in-one images.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)

#### Docker Pull (by tag)
@@ -161,7 +161,7 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export TASK_QUEUE_ENABLE=1

# [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
# if os is Ubuntu
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
# if os is openEuler
@@ -291,7 +291,7 @@ Run performance evaluation of `Qwen3-32B-W8A8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
-There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md index d778ecc9..3c92f565 100644 --- a/docs/source/tutorials/Qwen3-Next.md +++ b/docs/source/tutorials/Qwen3-Next.md @@ -154,7 +154,7 @@ Run performance evaluation of `Qwen3-Next` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md index 748a8a8b..578d91a7 100644 --- a/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md +++ b/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md @@ -24,7 +24,7 @@ It is recommended to download the model weight to the shared directory of multip :::::{tab-set} ::::{tab-item} Use docker image -You can using our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly +You can use our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). @@ -269,7 +269,7 @@ Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example. Refer to vllm benchmark for more details. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/Qwen3-VL-235B-A22B-Instruct.md b/docs/source/tutorials/Qwen3-VL-235B-A22B-Instruct.md index 279f8867..55a5d280 100644 --- a/docs/source/tutorials/Qwen3-VL-235B-A22B-Instruct.md +++ b/docs/source/tutorials/Qwen3-VL-235B-A22B-Instruct.md @@ -260,7 +260,7 @@ Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/long_sequence_context_parallel_multi_node.md b/docs/source/tutorials/long_sequence_context_parallel_multi_node.md index 9754fe26..3cbe6748 100644 --- a/docs/source/tutorials/long_sequence_context_parallel_multi_node.md +++ b/docs/source/tutorials/long_sequence_context_parallel_multi_node.md @@ -24,7 +24,7 @@ Refer to [verify multi-node communication environment](../installation.md#verify ### Installation -You can using our official docker image to run `DeepSeek-V3.1` directly. +You can use our official docker image to run `DeepSeek-V3.1` directly. Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). @@ -351,7 +351,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example. 
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md index f7298473..cc49fa42 100644 --- a/docs/source/tutorials/long_sequence_context_parallel_single_node.md +++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md @@ -159,7 +159,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. -There are three `vllm bench` subcommand: +There are three `vllm bench` subcommands: - `latency`: Benchmark the latency of a single batch of requests. - `serve`: Benchmark the online serving throughput. diff --git a/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md b/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md index 965cdc06..7794fc7a 100644 --- a/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md +++ b/docs/source/user_guide/feature_guide/Multi_Token_Prediction.md @@ -109,7 +109,7 @@ if self.speculative_config: got {self.decode_threshold}" ``` -## Limitation +## Limitations - Due to the fact that only a single layer of weights is exposed in DeepSeek's MTP, the accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15. - In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1). diff --git a/docs/source/user_guide/feature_guide/weight_prefetch.md b/docs/source/user_guide/feature_guide/weight_prefetch.md index ad01d2f8..46b447df 100644 --- a/docs/source/user_guide/feature_guide/weight_prefetch.md +++ b/docs/source/user_guide/feature_guide/weight_prefetch.md @@ -29,7 +29,7 @@ However, this may not be the optimal configuration for your scenario. For higher Notices: 1) Weight prefetch of MLP `down` project prefetch dependence sequence parallel, if you want open for mlp `down` please also enable sequence parallel. -2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18MB. If `prefetch_ration * lineaer_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18MB. +2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18MB. ## Example diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md index 60b1cea9..9b60e859 100644 --- a/docs/source/user_guide/release_notes.md +++ b/docs/source/user_guide/release_notes.md @@ -341,7 +341,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o - [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/KV_Cache_Pool_Guide.html) feature is added - [Experimental] A new graph mode `xlite` is introduced. It performs good with some models. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it. 
- LLMdatadist kv connector is removed. Please use mooncake connector instead.
-- Ascend scheduler is removed. `--additional-config {"ascend_scheudler": {"enabled": true}` doesn't work anymore.
+- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}}` doesn't work anymore.
- Torchair graph mode is removed. `--additional-config {"torchair_graph_config": {"enabled": true}}` doesn't work anymore. Please use aclgraph instead.
- `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now.
- speculative decode method `Ngram` is back now.
@@ -637,7 +637,7 @@ This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Fix header include issue in rope [#2398](https://github.com/vllm-project/vllm-ascend/pull/2398)
- Fix mtp config bug [#2412](https://github.com/vllm-project/vllm-ascend/pull/2412)
- Fix error info and adapt `attn_metedata` refactor [#2402](https://github.com/vllm-project/vllm-ascend/pull/2402)
-- Fix torchair runtime error caused by configuration mismtaches and `.kv_cache_bytes` file missing [#2312](https://github.com/vllm-project/vllm-ascend/pull/2312)
+- Fix torchair runtime error caused by configuration mismatches and `.kv_cache_bytes` file missing [#2312](https://github.com/vllm-project/vllm-ascend/pull/2312)
- Move `with_prefill` allreduce from cpu to npu [#2230](https://github.com/vllm-project/vllm-ascend/pull/2230)

### Docs