[Doc] Sensitive word modification (#8303)
### What this PR does / why we need it?

This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -48,7 +48,7 @@ docker run --rm \
 ```
 
 :::{note}
-The 310P device is supported from version 0.15.0rc1. You need to select the corresponding image for installation.
+The Atlas 300 inference products are supported from version 0.15.0rc1. You need to select the corresponding image for installation.
 :::
 
 ## Deployment
@@ -57,7 +57,7 @@ The 310P device is supported from version 0.15.0rc1. You need to select the corr
 
 #### Single NPU (PaddleOCR-VL)
 
-PaddleOCR-VL supports single-node single-card deployment on the 910B4 and 310P platform. Follow these steps to start the inference service:
+PaddleOCR-VL supports single-node single-card deployment on the 910B4 and Atlas 300 inference products platform. Follow these steps to start the inference service:
 
 1. Prepare model weights: Ensure the downloaded model weights are stored in the `PaddleOCR-VL` directory.
 2. Create and execute the deployment script (save as `deploy.sh`):
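The hunk above references a `deploy.sh` script whose body is not fully visible in this diff. As a rough sketch of what such a launch script looks like — the model path, served name, port, and length limit below are illustrative assumptions, not the script from the docs:

```shell
#!/bin/sh
# Illustrative sketch only; MODEL_PATH and every flag value are assumptions.
MODEL_PATH="./PaddleOCR-VL"   # directory holding the downloaded weights

vllm serve ${MODEL_PATH} \
    --served-model-name PaddleOCR-VL \
    --max-model-len 16384 \
    --port 8000
```

Running `bash deploy.sh` then starts an OpenAI-compatible server on the chosen port; consult the real deployment section for the flags the docs actually use.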
@@ -90,10 +90,10 @@ vllm serve ${MODEL_PATH} \
 ```
 
 ::::
-::::{tab-item} 310P
-:sync: 310P
+::::{tab-item} Atlas 300 inference products
+:sync: Atlas 300 inference products
 
-Run the following script to start the vLLM server on single 310P:
+Run the following script to start the vLLM server on single Atlas 300 inference products:
 
 ```shell
 #!/bin/sh
@@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \
 ```
 
 :::{note}
-The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the 310P device.
+The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
 :::
 
 ::::
@@ -260,7 +260,7 @@ The 910B4 device supports inference using the PaddlePaddle framework.
 ::::{tab-item} OM inference
 :sync: om
 
-The 310P device supports only the OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
+The Atlas 300 inference products support only the OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
 
 ::::
 :::::
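OM inference means converting the exported model to an offline OM file with the Ascend ATC tool before running it. As a heavily hedged sketch of that conversion step only — the file names and `--soc_version` value here are assumptions, and the linked ModelZoo guide is the authoritative recipe:

```shell
# Convert an exported ONNX model to an offline OM model with ATC.
# All paths and the soc_version below are illustrative assumptions.
atc --framework=5 \
    --model=pp_doclayoutv2.onnx \
    --output=pp_doclayoutv2 \
    --soc_version=Ascend310P3
```

Here `--framework=5` selects the ONNX front end; the `--soc_version` must match your actual chip.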
@@ -328,7 +328,7 @@ vllm serve Qwen/Qwen3-VL-8B-Instruct \
 ```
 
 :::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
 :::
 
 If your service start successfully, you can see the info shown below:
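The note in the hunk above ties the usable `--max_model_len` to the accelerator's on-chip memory, because every cached token costs a fixed number of bytes in the KV cache. A back-of-the-envelope sketch — every number below is an illustrative assumption, not the real Qwen3-VL-8B-Instruct config or a real device's memory:

```shell
# Rough KV-cache token budget; all numbers are illustrative assumptions.
LAYERS=36; KV_HEADS=8; HEAD_DIM=128; DTYPE_BYTES=2   # fp16 elements
MEM_BYTES=$((32 * 1024 * 1024 * 1024))               # 32 GiB left for KV cache

# Each token stores one K and one V vector per layer per KV head.
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES))
echo "bytes per cached token: ${PER_TOKEN}"          # 147456
echo "token budget: $((MEM_BYTES / PER_TOKEN))"      # 233016
```

Under these made-up numbers the cache tops out well below a 256000-token context, which is exactly the ValueError the note is guarding against; a series with more on-chip memory tolerates a larger `--max_model_len`.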
@@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
 ```
 
 :::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
 :::
 
 If your service start successfully, you can see the info shown below: