[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701)
### What this PR does / why we need it?

This PR fixes various documentation issues and improves code examples throughout the project.

Signed-off-by: MrZ20 <2609716663@qq.com>
````diff
@@ -38,9 +38,9 @@ So far, dynamic batch performs better on several dense models including Qwen and
 Dynamic batch is used in the online inference. A fully executable example is as follows:

 ```shell
-SLO_LITMIT=50
+SLO_LIMIT=50
 vllm serve Qwen/Qwen2.5-14B-Instruct\
-    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LITMIT}'}' \
+    --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}' \
     --max-num-seqs 256 \
     --block-size 128 \
     --tensor_parallel_size 8 \
````
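Once the server above is running, a request like the following can serve as a quick smoke test. This is a minimal sketch that assumes the server listens on vLLM's default port 8000; dynamic batch needs no extra client-side options, so any OpenAI-compatible request exercises it.

```shell
# Minimal smoke test against the OpenAI-compatible endpoint (default port 8000 assumed).
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-14B-Instruct",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 32
        }'
```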
````diff
@@ -54,25 +54,25 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi
 ### Server

 ```shell
-VLLM_SLEEP_WHEN_IDLE=1 vllm serve `<model_file>` \
+VLLM_SLEEP_WHEN_IDLE=1 vllm serve <model_file> \
     --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
     --enforce-eager \
-    --port `<port>` \
+    --port <port> \
     --load-format netloader
 ```

 ### Client

 ```shell
-export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["`<server_IP>`:`<server_Port>`"]}]}'
+export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["<server_IP>:<server_Port>"]}]}'

-VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=`<device_id_diff_from_server>` \
-vllm serve `<model_file>` \
+VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=<device_id_diff_from_server> \
+vllm serve <model_file> \
     --tensor-parallel-size 1 \
-    --served-model-name `<model_name>` \
+    --served-model-name <model_name> \
     --enforce-eager \
-    --port `<client_port>` \
+    --port <client_port> \
     --load-format netloader \
     --model-loader-extra-config="${NETLOADER_CONFIG}"
 ```
````
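After both instances are up, you can confirm that the client instance actually loaded the weights streamed over Netloader by listing its models. A minimal sketch, assuming the placeholders above are filled in and both instances run on the same host:

```shell
# List the models served by the Netloader client instance (placeholders as in the commands above).
curl -s http://localhost:<client_port>/v1/models

# Same check against the source server instance, for comparison.
curl -s http://localhost:<port>/v1/models
```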
````diff
@@ -80,7 +80,7 @@ A simple planner implementation is provided at [`rfork_planner.py`](../../../../
 ```shell
 python rfork_planner.py \
     --host 0.0.0.0 \
-    --port `<planner_port>`
+    --port <planner_port>
 ```

 ### 3. Start vLLM Instances
````
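Before starting any vLLM instances, it can be useful to confirm the planner is listening. The planner's HTTP routes are not documented here, so the sketch below only checks that the configured host and port respond at all:

```shell
# Hypothetical reachability check: prints the HTTP status code returned by the planner's
# root path; any response (even 404) means the port is being served.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:<planner_port>/
```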
````diff
@@ -93,15 +93,15 @@ For later instances, if the planner can allocate a compatible seed, RFork will t

 ```shell
 export RFORK_CONFIG='{
-    "model_url": "`<model_url>`",
-    "model_deploy_strategy_name": "`<deploy_strategy>`",
-    "rfork_scheduler_url": "http://`<planner_ip>`:`<planner_port>`"
+    "model_url": "<model_url>",
+    "model_deploy_strategy_name": "<deploy_strategy>",
+    "rfork_scheduler_url": "http://<planner_ip>:<planner_port>"
 }'

-vllm serve `<model_path>` \
+vllm serve <model_path> \
     --tensor-parallel-size 1 \
-    --served-model-name `<served_model_name>` \
-    --port `<port>` \
+    --served-model-name <served_model_name> \
+    --port <port> \
     --load-format rfork \
     --model-loader-extra-config "${RFORK_CONFIG}"
 ```
````
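Once an instance is up, a request such as the following can verify that the RFork-loaded weights serve traffic as usual. A minimal sketch, reusing the `<port>` and `<served_model_name>` placeholders from the command above:

```shell
# Send a simple completion request to the RFork-loaded instance.
curl -s http://localhost:<port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<served_model_name>", "prompt": "Hello", "max_tokens": 16}'
```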
````diff
@@ -381,7 +381,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
 - Fix multimodal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
 - `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
 - SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
-- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
+- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)

 ### Dependencies
````