[main][Docs] Fix spelling errors across documentation (#6649)
Fix various spelling mistakes in the project documentation to improve
clarity and correctness.
- vLLM version: v0.15.0
- vLLM main: d7e17aaacd
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@@ -29,7 +29,7 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
 - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).

-**Benchmarking Duration**: about 800 senond for single model.
+**Benchmarking Duration**: about 800 seconds for single model.

 ## Quick Use
@@ -141,7 +141,7 @@ Usually, each minor version of vLLM (such as 0.7) corresponds to a vLLM Ascend v
 ### Backward compatibility

-For main branch, vLLM Ascend should works with vLLM main branch and latest 1 or 2 releases. To ensure backward compatibility, do as follows:
+For main branch, vLLM Ascend should work with vLLM main branch and latest 1 or 2 releases. To ensure backward compatibility, do as follows:

 - Both main branch and target vLLM release, such as the vLLM main branch and vLLM 0.8.4, are tested by Ascend E2E CI.
 - To make sure that code changes are compatible with the latest 1 or 2 vLLM releases, vLLM Ascend introduces a version check mechanism inside the code. It checks the version of the installed vLLM package first to decide which code logic to use. If users hit the `InvalidVersion` error, it may indicate that they have installed a dev or editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use.
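The `VLLM_VERSION` override described in this hunk can be applied before launching; a minimal sketch (the version string is an illustrative assumption, not a recommendation):

```shell
# Assumed scenario: a dev/editable vLLM install makes version detection fail
# with InvalidVersion, so we tell vLLM Ascend which release to assume.
export VLLM_VERSION=0.11.0   # illustrative value; match your actual vLLM base
echo "vLLM Ascend will treat vLLM as version ${VLLM_VERSION}"
```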
@@ -96,7 +96,7 @@ There are some other rules about ACL Graph and stream. Currently, we use func `u
 We will expand the stream resource limitation in the future.

-## Limitation
+## Limitations

 1. `FULL` and `FULL_AND_PIECEWISE` are not supported now;
 2. When use ACL Graph and MTP and `num_speculative_tokens > 1`, as vLLM don't support this case in v0.11.0, we need to set `cudagraph_capture_sizes` explicitly.
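Item 2 of this hunk can be sketched as an engine configuration; the config key layout and size values below are assumptions for illustration only, not the verified vLLM Ascend interface:

```python
# Sketch (assumed values): explicit graph capture sizes for ACL Graph + MTP
# with num_speculative_tokens > 1, set instead of relying on defaults.
num_speculative_tokens = 2
engine_kwargs = {
    "speculative_config": {
        "num_speculative_tokens": num_speculative_tokens,  # MTP depth
    },
    "compilation_config": {
        "cudagraph_capture_sizes": [4, 8, 16, 32],  # chosen explicitly
    },
}
# These kwargs would be passed along to the engine, e.g. LLM(model=..., **engine_kwargs).
```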
@@ -12,7 +12,7 @@ vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Moo
 While one can utilize mooncake store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports mooncake store and can utilize the data transfer strategy to one that is best fit to Huawei NPU hardware.

-Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largly inspired by **LMCacheConnectorV1** (see the `How is MooncakestoreConnectorV1 Implemented?` section).
+Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).

 ## Usage

@@ -47,7 +47,7 @@ By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Ca
 When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.

-Currently, we only perform put and get operation of KV Pool for **Prefiil Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e. MooncakeConnector.
+Currently, we only perform put and get operation of KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e. MooncakeConnector.

 The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfer KV Caches between NPU devices directly.
@@ -55,11 +55,11 @@ To Enable this feature, we need to setup both Mooncake Connector and Mooncake St
 For details, please also refer to the Mooncake Connector Store Deployment Guide.

-## How is MooncakestoreConnectorV1 Implemented?
+## How is MooncakeStoreConnectorV1 Implemented?

-**MooncakestoreConnectorV1** inhereits the KV Connector V1 class in vLLM V1: through implementing the required methods defined in the KV connector V1 base class, one can integrate a thrid-party KV cache transfer/storage backend into the vLLM framework.
+**MooncakeStoreConnectorV1** inherits the KV Connector V1 class in vLLM V1: through implementing the required methods defined in the KV connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.

-MooncakeStoreConnectorV1 is also largly inspried by LMCacheConnectorV1 in term of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing related designs. On top of this, we have also added our own design including `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimization such as removing the `LocalBuffer` in LMCache to remove redundant data transfer.
+MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, and the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes as well as other hashing related designs. On top of this, we have also added our own design including `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimization such as removing the `LocalBuffer` in LMCache to remove redundant data transfer.

 The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in V1 scheduler and worker-side methods that are called in V1 worker, namely:
@@ -77,14 +77,14 @@ The KV Connector methods that need to be implemented can be categorized into sch
 `wait_for_layer_load`: Optional; Wait for layer load in layerwise + async KV load scenario.

 `save_kv_layer`: Optional Do layerwise KV cache put into KV Pool.

 `wait_for_save`: Wait for KV Save to finish if async KV cache save/put.

-`get_finished` Get request that finished KV transfer, `done_sending` if `put` finished, `done_reciving` if `get` finished.
+`get_finished` Get request that finished KV transfer, `done_sending` if `put` finished, `done_receiving` if `get` finished.

 ## DFX

 1. When looking up a key in KV Pool, if we cannot find the key, there is no Cache Hit for this specific block; we return no hit for this block and do not look up further blocks for current request.
-2. Similaly, when we are trying to put a block into KV Pool and failed, we do not put further blocks (subject to change).
+2. Similarly, when we are trying to put a block into KV Pool and failed, we do not put further blocks (subject to change).

-## Limitation
+## Limitations

 1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
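The worker-side methods listed in this hunk can be sketched as a minimal connector skeleton. The method names come from the doc; the class name, bodies, and types are simplified assumptions, not the actual vLLM KV Connector V1 interface:

```python
# Minimal sketch of the worker-side KV connector surface described above.
class SketchKVConnector:
    def __init__(self) -> None:
        self._done_sending: set[str] = set()    # requests whose `put` finished
        self._done_receiving: set[str] = set()  # requests whose `get` finished

    def wait_for_layer_load(self, layer_name: str) -> None:
        """Optional: block until this layer's KV is loaded (layerwise async load)."""

    def save_kv_layer(self, layer_name: str, kv_layer: object) -> None:
        """Optional: layerwise KV cache put into the KV Pool."""

    def wait_for_save(self) -> None:
        """Block until an async KV cache save/put completes."""

    def get_finished(self) -> tuple[set[str], set[str]]:
        """Return (done_sending, done_receiving) request ids."""
        return self._done_sending, self._done_receiving

conn = SketchKVConnector()
conn._done_sending.add("req-1")  # simulate a finished `put`
sending, receiving = conn.get_finished()
```

In the real connector these hooks are driven by the V1 worker loop; the sketch only shows the shape of the contract.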
@@ -243,7 +243,7 @@ During the update process, if only a few layers or the expert table of a certain
 When updating expert weights, ensure that the memory allocated for the expert weights has been released, or that the expert (referring to the old version) is no longer in use.

-## Limitation
+## Limitations

 Before using EPLB, start the script and add `export DYNAMIC_EPLB="true"`.

 Before performing load data collection (or performance data collection), start the script and add `export EXPERT_MAP_RECORD="true"`.
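The two switches in this hunk can be set before launching the serving script; a minimal sketch (the launch command itself is omitted, and which flag you need depends on the run):

```shell
# For EPLB-enabled serving runs (from the doc above):
export DYNAMIC_EPLB="true"
# For load / performance data collection runs, use this flag instead:
export EXPERT_MAP_RECORD="true"
```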
@@ -94,7 +94,7 @@ with fake_mode:
 )
 ```

-The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascned and nugraph_ex code repositories.
+The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascend and npugraph_ex code repositories.

 ### DFX

@@ -70,7 +70,7 @@ Before writing a patch, following the principle above, we should patch the least
 7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md)

-## Limitation
+## Limitations

 1. In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only can patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
 2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work.

@@ -26,7 +26,7 @@ The `create_weights` method is used for weight initialization; the `process_weig
 We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**).

-**Supplemnet**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
+**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:

 ```json
 {
@@ -130,7 +130,7 @@ vllm-ascend is tested in three aspects, functions, performance, and accuracy.
 - **Nightly test**: we'll run full test every night to make sure the code is working.

-Finnall, for each release, we'll publish the performance test and accuracy test report in the future.
+Finally, for each release, we'll publish the performance test and accuracy test report in the future.

 ### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
@@ -282,7 +282,7 @@ msgstr "11. 如何运行 w8a8 DeepSeek 模型?"
 #: ../../faqs.md:87
 msgid ""
-"Please following the [inferencing tutorail](https://vllm-"
+"Please following the [inferencing tutorial](https://vllm-"
 "ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model"
 " to DeepSeek."
 msgstr ""
@@ -459,7 +459,7 @@ msgstr ""
 "install` 进行安装,或者使用 `python setup.py clean` 清除缓存。"

 #: ../../faqs.md:130
-msgid "18. How to generate determinitic results when using vllm-ascend?"
+msgid "18. How to generate deterministic results when using vllm-ascend?"
 msgstr "18. 使用 vllm-ascend 时如何生成确定性结果?"

 #: ../../faqs.md:131
@@ -475,5 +475,5 @@ msgstr ""
 "sample)**,例如:"

 #: ../../faqs.md:158
-msgid "Set the following enveriments parameters:"
+msgid "Set the following environment parameters:"
 msgstr "设置以下环境参数:"
@@ -695,7 +695,7 @@ msgstr ""
 #: ../../user_guide/release_notes.md:118
 msgid ""
 "Qwen3 and Qwen3MOE is supported now. The performance and accuracy of Qwen3 "
-"is well tested. You can try it now. Mindie Turbo is recomanded to improve "
+"is well tested. You can try it now. Mindie Turbo is recommended to improve "
 "the performance of Qwen3. [#903](https://github.com/vllm-project/vllm-"
 "ascend/pull/903) [#915](https://github.com/vllm-project/vllm-ascend/"
 "pull/915)"
@@ -1204,7 +1204,7 @@ msgid ""
 "visit [official guide](https://docs.vllm.ai/en/latest/getting_started/"
 "v1_user_guide.html) to get more detail. By default, vLLM will fallback to "
 "V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want "
-"to use V1 forcely."
+"to use V1 forcibly."
 msgstr ""
 "本版本包含了对 vLLM V1 引擎的实验性支持。你可以访问[官方指南](https://docs."
 "vllm.ai/en/latest/getting_started/v1_user_guide.html)获取更多详细信息。默认"
@@ -1335,7 +1335,7 @@ msgstr ""
 #: ../../user_guide/release_notes.md:232
 msgid ""
-"Add Ascend Custom Ops framewrok. Developers now can write customs ops using "
+"Add Ascend Custom Ops framework. Developers now can write customs ops using "
 "AscendC. An example ops `rotary_embedding` is added. More tutorials will "
 "come soon. The Custom Ops compilation is disabled by default when "
 "installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371]"
@@ -1512,7 +1512,7 @@ msgstr ""
 msgid ""
 "Improved and reduced the garbled code in model output. But if you still hit "
 "the issue, try to change the generation config value, such as "
-"`temperature`, and try again. There is also a knonwn issue shown below. Any "
+"`temperature`, and try again. There is also a known issue shown below. Any "
 "[feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is "
 "welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)"
 msgstr ""
@@ -3,7 +3,7 @@
 ## Introduction

 DeepSeek-R1 is a high-performance Mixture-of-Experts (MoE) large language model developed by DeepSeek Company. It excels in complex logical reasoning, mathematical problem-solving, and code generation. By dynamically activating its expert networks, it delivers exceptional performance while maintaining computational efficiency. Building upon R1, DeepSeek-R1-W8A8 is a fully quantized version of the model. It employs 8-bit integer (INT8) quantization for both weights and activations, which significantly reduces the model's memory footprint and computational requirements, enabling more efficient deployment and application in resource-constrained environments.
-This article takes the deepseek- R1-W8A8 version as an example to introduce the deployment of the R1 series models.
+This article takes the `DeepSeek-R1-W8A8` version as an example to introduce the deployment of the R1 series models.

 ## Supported Features
@@ -25,7 +25,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm
 ### Installation

-You can using our official docker image to run `DeepSeek-R1-W8A8` directly.
+You can use our official docker image to run `DeepSeek-R1-W8A8` directly.

 Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
@@ -293,7 +293,7 @@ Run performance evaluation of `DeepSeek-R1-W8A8` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
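The subcommands named in this hunk can be invoked roughly as follows. The model name and flags are illustrative assumptions drawn from vLLM's benchmark CLI, and `throughput` is assumed as the third subcommand; the commands are commented out because they need a vllm install and model weights:

```shell
subcommands="latency throughput serve"
# Illustrative invocations (flags and model are assumptions; see the linked
# vllm benchmark guide for the authoritative option list):
# vllm bench latency --model Qwen/Qwen2.5-7B-Instruct --input-len 128 --output-len 128
# vllm bench throughput --model Qwen/Qwen2.5-7B-Instruct --num-prompts 200
# vllm bench serve --model Qwen/Qwen2.5-7B-Instruct --dataset-name random --num-prompts 200
echo "$subcommands"
```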
@@ -38,7 +38,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm
 ### Installation

-You can using our official docker image to run `DeepSeek-V3.1` directly.
+You can use our official docker image to run `DeepSeek-V3.1` directly.

 Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
@@ -96,7 +96,7 @@ nic_name="xxxx"
 local_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 # AIV
@@ -158,7 +158,7 @@ local_ip="xxxx"
 node0_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 export HCCL_IF_IP=$local_ip
@@ -210,7 +210,7 @@ local_ip="xxx"
 node0_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 export HCCL_IF_IP=$local_ip
@@ -275,7 +275,7 @@ local_ip="141.xx.xx.1"
 node0_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 export HCCL_IF_IP=$local_ip
@@ -352,7 +352,7 @@ local_ip="141.xx.xx.2"
 node0_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 export HCCL_IF_IP=$local_ip
@@ -429,7 +429,7 @@ local_ip="141.xx.xx.3"
 node0_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 export HCCL_IF_IP=$local_ip
@@ -505,7 +505,7 @@ local_ip="141.xx.xx.4"
 node0_ip="xxxx"

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

 export HCCL_IF_IP=$local_ip
@@ -707,7 +707,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -29,7 +29,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm
 ### Installation

-You can using our official docker image to run `DeepSeek-V3.2` directly..
+You can use our official docker image to run `DeepSeek-V3.2` directly.

 :::::{tab-set}
 :sync-group: install
@@ -456,11 +456,11 @@ Before you start, please
     dp_rpc_port = args.dp_rpc_port
     vllm_start_port = args.vllm_start_port

-    def run_command(visiable_devices, dp_rank, vllm_engine_port):
+    def run_command(visible_devices, dp_rank, vllm_engine_port):
         command = [
             "bash",
             "./run_dp_template.sh",
-            visiable_devices,
+            visible_devices,
             str(vllm_engine_port),
             str(dp_size),
             str(dp_rank),
@@ -481,9 +481,9 @@ Before you start, please
     for i in range(dp_size_local):
         dp_rank = dp_rank_start + i
         vllm_engine_port = vllm_start_port + i
-        visiable_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
+        visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
         process = multiprocessing.Process(target=run_command,
-                                          args=(visiable_devices, dp_rank,
+                                          args=(visible_devices, dp_rank,
                                                 vllm_engine_port))
         processes.append(process)
         process.start()
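The device-partitioning line in this hunk maps data-parallel rank `i` to a contiguous slice of `tp_size` device ids; a standalone illustration (`tp_size = 2` is an assumed value):

```python
# Each data-parallel rank i gets tp_size contiguous device ids,
# mirroring the `visible_devices` expression in the diff above.
tp_size = 2

def devices_for_rank(i: int) -> str:
    return ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))

slices = [devices_for_rank(i) for i in range(3)]
# rank 0 -> "0,1", rank 1 -> "2,3", rank 2 -> "4,5"
```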
@@ -895,7 +895,7 @@ Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -29,7 +29,7 @@ It is recommended to download the model weight to the shared directory of multip
 ### Installation

-You can using our official docker image to run `GLM-4.x` directly.
+You can use our official docker image to run `GLM-4.x` directly.

 Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
@@ -146,7 +146,7 @@ Run performance evaluation of `GLM-4.x` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -22,7 +22,7 @@ It is recommended to download the model weights to a local directory (e.g., `./P
 ### Installation

-You can using our official docker image to run `PaddleOCR-VL` directly.
+You can use our official docker image to run `PaddleOCR-VL` directly.

 Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
@@ -548,7 +548,7 @@ lm_eval \
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -24,7 +24,7 @@ It is recommended to download the model weights to a local directory (e.g., `./Q
 ### Installation

-You can using our official docker image and install extra operator for supporting `Qwen2.5-7B-Instruct`.
+You can use our official docker image and install extra operator for supporting `Qwen2.5-7B-Instruct`.

 :::::{tab-set}
 :sync-group: install
@@ -158,7 +158,7 @@ Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -19,11 +19,11 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
 - `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
 - `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)

-Following examples use the 7B version deafultly.
+Following examples use the 7B version by default.

 ### Installation

-You can using our official docker image to run `Qwen2.5-Omni` directly.
+You can use our official docker image to run `Qwen2.5-Omni` directly.

 Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
@@ -97,7 +97,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
 `--allowed-local-media-path` is optional, only set it if you need infer model with local media file

-`--gpu-memory-utilization` should not be set manually only if yous know what this parameter aims to.
+`--gpu-memory-utilization` should not be set manually only if you know what this parameter aims to.

 #### Multiple NPU (Qwen2.5-Omni-7B)
@@ -195,7 +195,7 @@ Run performance evaluation of `Qwen2.5-Omni-7B` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -300,7 +300,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -24,7 +24,7 @@ It is recommended to download the model weight to the shared directory of multip
 `Qwen3-Coder` is first supported in `vllm-ascend:v0.10.0rc1`, please run this model using a later version.

-You can using our official docker image to run `Qwen3-Coder-30B-A3B-Instruct` directly.
+You can use our official docker image to run `Qwen3-Coder-30B-A3B-Instruct` directly.

 ```{code-block} bash
 :substitutions:
@@ -42,7 +42,7 @@ If you want to deploy multi-node environment, you need to verify multi-node comm
 ### Installation

-You can using our official docker image for supporting Qwen3 Dense models.
+You can use our official docker image for supporting Qwen3 Dense models.
 Currently, we provide the all-in-one images.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)

 #### Docker Pull (by tag)
@@ -161,7 +161,7 @@ export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
 export TASK_QUEUE_ENABLE=1

 # [Optional] jemalloc
-# jemalloc is for better performance, if `libjemalloc.so` is install on your machine, you can turn it on.
+# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
 # if os is Ubuntu
 # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
 # if os is openEuler
@@ -291,7 +291,7 @@ Run performance evaluation of `Qwen3-32B-W8A8` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -154,7 +154,7 @@ Run performance evaluation of `Qwen3-Next` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -24,7 +24,7 @@ It is recommended to download the model weight to the shared directory of multip
 :::::{tab-set}
 ::::{tab-item} Use docker image

-You can using our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly
+You can use our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly

 Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
@@ -269,7 +269,7 @@ Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
-Refer to vllm benchmark for more details.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

-There are three `vllm bench` subcommand:
+There are three `vllm bench` subcommands:

 - `latency`: Benchmark the latency of a single batch of requests.
 - `serve`: Benchmark the online serving throughput.
@@ -260,7 +260,7 @@ Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example.
|
||||
|
||||
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
|
||||
|
||||
There are three `vllm bench` subcommand:
|
||||
There are three `vllm bench` subcommands:
|
||||
|
||||
- `latency`: Benchmark the latency of a single batch of requests.
|
||||
- `serve`: Benchmark the online serving throughput.
|
||||
|
||||
@@ -24,7 +24,7 @@ Refer to [verify multi-node communication environment](../installation.md#verify
|
||||
|
||||
### Installation
|
||||
|
||||
You can using our official docker image to run `DeepSeek-V3.1` directly.
|
||||
You can use our official docker image to run `DeepSeek-V3.1` directly.
|
||||
|
||||
Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
|
||||
|
||||
@@ -351,7 +351,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommand:
There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
@@ -159,7 +159,7 @@ Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommand:
There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
@@ -109,7 +109,7 @@ if self.speculative_config:
got {self.decode_threshold}"
```

## Limitation
## Limitations

- Due to the fact that only a single layer of weights is exposed in DeepSeek's MTP, the accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).
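The capture-size constraint in the limitation above can be sketched as a small helper. This is a hypothetical illustration, not code from vllm-ascend:

```python
# Hypothetical helper illustrating the fullgraph/MTP constraint; not part of vllm-ascend.
def filter_capture_sizes(capture_sizes, num_speculative_tokens):
    """Keep only aclgraph capture sizes that are an integer multiple of
    (num_speculative_tokens + 1), as required in fullgraph mode with MTP > 1."""
    step = num_speculative_tokens + 1
    return [size for size in capture_sizes if size % step == 0]

# With 2 speculative tokens, only multiples of 3 are valid capture sizes.
print(filter_capture_sizes([2, 3, 4, 6, 9], 2))  # [3, 6, 9]
```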
@@ -29,7 +29,7 @@ However, this may not be the optimal configuration for your scenario. For higher

Notices:

1) Weight prefetch of MLP `down` project prefetch dependence sequence parallel, if you want open for mlp `down` please also enable sequence parallel.
2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18MB. If `prefetch_ration * lineaer_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18MB.
2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18MB.
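The 18MB clamp described in notice 2 amounts to a simple min(). A minimal sketch, assuming the ratio-times-weight-size semantics above (not the actual vllm-ascend implementation):

```python
# Hypothetical sketch of the L2-cache prefetch cap; not vllm-ascend code.
L2_PREFETCH_CAP = 18 * 1024 * 1024  # bytes

def effective_prefetch_bytes(prefetch_ratio, linear_layer_weight_size):
    """The backend prefetches ratio * weight size, clamped to the 18MB cap."""
    requested = int(prefetch_ratio * linear_layer_weight_size)
    return min(requested, L2_PREFETCH_CAP)

print(effective_prefetch_bytes(1.0, 32 * 1024 * 1024))   # capped at 18MB
print(effective_prefetch_bytes(0.25, 32 * 1024 * 1024))  # 8MB, under the cap
```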
## Example
@@ -341,7 +341,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o

- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs good with some models. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it.
- LLMdatadist kv connector is removed. Please use mooncake connector instead.
- Ascend scheduler is removed. `--additional-config {"ascend_scheudler": {"enabled": true}` doesn't work anymore.
- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}` doesn't work anymore.
- Torchair graph mode is removed. `--additional-config {"torchair_graph_config": {"enabled": true}}` doesn't work anymore. Please use aclgraph instead.
- `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now.
- speculative decode method `Ngram` is back now.
@@ -637,7 +637,7 @@ This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the [

- Fix header include issue in rope [#2398](https://github.com/vllm-project/vllm-ascend/pull/2398)
- Fix mtp config bug [#2412](https://github.com/vllm-project/vllm-ascend/pull/2412)
- Fix error info and adapt `attn_metedata` refactor [#2402](https://github.com/vllm-project/vllm-ascend/pull/2402)
- Fix torchair runtime error caused by configuration mismtaches and `.kv_cache_bytes` file missing [#2312](https://github.com/vllm-project/vllm-ascend/pull/2312)
- Fix torchair runtime error caused by configuration mismatches and `.kv_cache_bytes` file missing [#2312](https://github.com/vllm-project/vllm-ascend/pull/2312)
- Move `with_prefill` allreduce from cpu to npu [#2230](https://github.com/vllm-project/vllm-ascend/pull/2230)

### Docs