[Doc][v0.18.0] Fix documentation formatting and improve code examples (#8701)
### What this PR does / why we need it?

This PR fixes various documentation issues and improves code examples throughout the project.

Signed-off-by: MrZ20 <2609716663@qq.com>
@@ -96,7 +96,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
     --served_model_name qwen --dtype float16 \
     --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
     --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
-    --quantization ascend --max_model_len 16384
+    --quantization ascend --max-model-len 16384
 # `--load_format` is required only for the W8A8SC quantized weight format.
 #
 ```
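The rename from `--max_model_len` to `--max-model-len` follows the usual argparse convention: CLI options are declared with dashes, and the parsed value is stored under the underscore attribute name. A minimal sketch with stock `argparse` (not vLLM's actual parser, which may be more lenient about spellings):

```python
import argparse

# Declare the option the way the docs now spell it: with dashes.
parser = argparse.ArgumentParser()
parser.add_argument("--max-model-len", type=int)

# argparse stores the value under the underscore name max_model_len...
args = parser.parse_args(["--max-model-len", "16384"])
print(args.max_model_len)  # 16384

# ...but the underscore spelling is not a recognized option string
# for a plain ArgumentParser: it ends up in the unknown-args list.
_, unknown = parser.parse_known_args(["--max_model_len", "16384"])
print(unknown)  # ['--max_model_len', '16384']
```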
@@ -134,7 +134,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
     --enforce-eager \
     --dtype float16 \
     --quantization ascend \
-    --max_model_len 10240
+    --max-model-len 10240
 ```
 
 Argument notes: `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression. `--model` is the path to the input `w8a8s` weights, and `--output` is the output path for the compressed `w8a8sc` weights.
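The argument notes above can be sketched as an invocation. This is a sketch only: `compress_w8a8s.py` is a hypothetical placeholder for the compression entry point (the real tool name is not shown in this hunk); the flags are the ones the notes describe.

```shell
# Sketch, not the actual tool: "compress_w8a8s.py" is a hypothetical
# placeholder. --tensor-parallel-size must match the TP size you will
# serve with, since W8A8SC weights are coupled to the TP layout.
python compress_w8a8s.py \
    --model /path/to/qwen3-w8a8s \
    --output /path/to/qwen3-w8a8sc \
    --tensor-parallel-size 4
```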
@@ -159,7 +159,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
     --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
     --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
     --quantization ascend \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
     --no-enable-prefix-caching \
     --load_format="sharded_state"
 ```
@@ -178,7 +178,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
     --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
     --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
     --quantization ascend \
-    --max_model_len 16384 \
+    --max-model-len 16384 \
     --no-enable-prefix-caching \
     --load_format="sharded_state"
 ```
@@ -199,7 +199,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser
     --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
     --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
     --quantization ascend \
-    --max_model_len 20480 \
+    --max-model-len 20480 \
     --no-enable-prefix-caching \
     --load_format="sharded_state"
 ```
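The inline values passed to `--additional-config` and `--compilation-config` in the commands above must be valid JSON, which is easy to get wrong when quoting them in a shell. A quick sanity check before launching (the strings are the exact ones from the last hunk):

```python
import json

# The exact config strings from the serve command; json.loads raises
# ValueError/JSONDecodeError on malformed input, so a clean parse means
# the shell quoting preserved valid JSON.
additional_config = '{"ascend_compilation_config": {"fuse_norm_quant": false}}'
compilation_config = '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}'

print(json.loads(additional_config))
print(json.loads(compilation_config)["cudagraph_capture_sizes"])  # [16, 32]
```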