[Doc][Misc] Improve readability and fix typos in documentation (#8340)
### What this PR does / why we need it?
This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>
@@ -302,7 +302,7 @@ bash proxy.sh
The parameters are explained as follows:
- `--tensor-parallel-size` 16 is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism (PCP) sizes.
- `--prefill-context-parallel-size` 2 is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size` 8 is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
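The sizing rule above can be sketched as a small helper. This is an illustrative check, not part of vLLM; the concurrency numbers below are made up for the example.

```python
# Sketch: verify that the configured --max-num-seqs covers the target
# concurrency, per the rule max_num_seqs * data_parallel_size >= concurrency.
# The numbers used here are illustrative, not taken from the guide.

def min_max_num_seqs(total_concurrency: int, data_parallel_size: int) -> int:
    """Smallest --max-num-seqs that keeps requests out of the waiting queue."""
    # Ceiling division: each DP group must absorb its share of the requests.
    return -(-total_concurrency // data_parallel_size)

# e.g. 256 concurrent requests spread over 4 DP groups -> at least 64 each
print(min_max_num_seqs(256, 4))  # 64
print(min_max_num_seqs(100, 8))  # 13 (ceiling of 12.5)
```

If the configured value is below this bound, the excess requests sit in the waiting queue and inflate TTFT/TPOT measurements, as noted above.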
@@ -4,7 +4,7 @@
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide provides step-by-step instructions to verify these features in resource-constrained environments.
Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs 16 chips to deploy one service instance.
Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs and 16 chips to deploy one service instance.
## Verify Multi-Node Communication Environment
@@ -735,14 +735,14 @@ vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
### Start the service
```bash
# on 190.0.0.1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100
# on 190.0.0.2
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100
# on 190.0.0.3
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
# on 190.0.0.4
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.2
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.3
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
# on 192.0.0.4
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
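The `--dp-rank-start` / `--dp-size-local` flags above partition the global DP ranks across nodes: each node owns ranks `[dp_rank_start, dp_rank_start + dp_size_local)`. The real logic inside `launch_online_dp.py` may differ; this sketch only mirrors what the flags imply.

```python
# Sketch of the DP rank partitioning implied by the launcher flags:
# each node owns dp_size_local consecutive global ranks starting at
# dp_rank_start. Illustrative only; launch_online_dp.py's internals may differ.

def node_dp_ranks(dp_rank_start: int, dp_size_local: int) -> list[int]:
    """Global DP ranks hosted on one node."""
    return list(range(dp_rank_start, dp_rank_start + dp_size_local))

# Decode instance from the example: dp-size 32 split across two nodes.
print(node_dp_ranks(0, 16))   # ranks 0..15 on the first decode node
print(node_dp_ranks(16, 16))  # ranks 16..31 on the second decode node
```

Note that both nodes of one instance point `--dp-address` at the same master node, which is why the second decode node still passes the first node's IP.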
## Example Proxy for Deployment
@@ -2,9 +2,9 @@
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources.
vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide provides step-by-step instructions to verify these features in resource-constrained environments.
Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1.
Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture (one Prefiller and one Decoder on the same node). Assume the IP address is 192.0.0.1.
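Once the 1P1D deployment is up, a quick smoke test is to send one request through vLLM's OpenAI-compatible API. The payload below is a minimal sketch; the serving port (8000) and the model name are assumptions for illustration, and in a PD setup the request goes to the proxy endpoint rather than to an individual instance.

```python
# Sketch of a minimal smoke-test request body for the deployed service.
# Port 8000 and the model name are assumptions, not values from this guide.
import json

payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # assumed served model name
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
# Send it with e.g.:
#   curl http://192.0.0.1:8000/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$BODY"
```

A well-formed JSON response with a `choices` field indicates the prefiller and decoder are cooperating end to end.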
## Verify Communication Environment