# PaddleOCR-VL
## Introduction
PaddleOCR-VL is a state-of-the-art (SOTA), resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition.

This document describes the complete workflow for deploying and verifying the model: supported features, environment preparation, single-node deployment, and functional verification.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.

Refer to the [feature guide](../../user_guide/feature_guide/index.md) for how to configure each feature.
## Environment Preparation
### Model Weight
* `PaddleOCR-VL-0.9B`: [PaddleOCR-VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)

It is recommended to download the model weights to a local directory (e.g., `./PaddleOCR-VL`) for quick access during deployment.
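
As one way to fetch the weights ahead of time, the sketch below uses `snapshot_download` from the `modelscope` Python package. It assumes a recent `modelscope` release that accepts the `local_dir` argument. The `weights_look_complete` helper and the file names it checks (`config.json`, `*.safetensors`) are illustrative assumptions, not an official validation step.

```python
from pathlib import Path


def download_weights(model_id: str = "PaddlePaddle/PaddleOCR-VL",
                     target_dir: str = "./PaddleOCR-VL") -> str:
    """Download the model snapshot from ModelScope into target_dir."""
    # Imported lazily so the helper below works without modelscope installed.
    from modelscope import snapshot_download

    # Assumption: the installed modelscope version supports `local_dir`.
    return snapshot_download(model_id, local_dir=target_dir)


def weights_look_complete(target_dir: str) -> bool:
    """Rough heuristic: a config file plus at least one weight file exist."""
    root = Path(target_dir)
    has_config = (root / "config.json").is_file()
    has_weights = any(root.glob("*.safetensors"))
    return has_config and has_weights
```

Calling `download_weights()` once and then checking `weights_look_complete("./PaddleOCR-VL")` gives a quick sanity check before starting the container.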
### Installation
You can use the official Docker image to run `PaddleOCR-VL` directly.

Select an image based on your machine type and start the container on your node. For details, refer to [using docker](../../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
:::{note}
The 310P device is supported from version 0.15.0rc1. You need to select the corresponding image for installation.
:::
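
To make the version requirement concrete, the following sketch compares an image tag against `v0.15.0rc1`. It assumes plain tags of the form `vMAJOR.MINOR.PATCH` with an optional `rcN` suffix. Tags with other suffixes are not handled, and the helper names are hypothetical.

```python
import re


def parse_tag(tag: str) -> tuple:
    """Turn a tag such as 'v0.15.0rc1' into a sortable tuple."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?", tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    major, minor, patch, rc = match.groups()
    # A final release sorts after all of its release candidates.
    rc_rank = int(rc) if rc is not None else float("inf")
    return (int(major), int(minor), int(patch), rc_rank)


def supports_310p(tag: str, minimum: str = "v0.15.0rc1") -> bool:
    """True if the tag is at or after the first version with 310P support."""
    return parse_tag(tag) >= parse_tag(minimum)


print(supports_310p("v0.13.0rc1"))  # tag used in the docker command above
print(supports_310p("v0.15.0rc1"))
```

The first call prints `False` and the second prints `True`, so the `v0.13.0rc1` tag shown in the docker command predates 310P support.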
## Deployment
### Single-node Deployment
#### Single NPU (PaddleOCR-VL)
PaddleOCR-VL supports single-node, single-card deployment on the 910B4 and 310P platforms. Follow these steps to start the inference service:

1. Prepare model weights: ensure the downloaded model weights are stored in the `PaddleOCR-VL` directory.
2. Create and execute the deployment script (save as `deploy.sh`):

:::::{tab-set}
:sync-group: install
::::{tab-item} 910B4
:sync: 910B4
Run the following script to start the vLLM server on a single 910B4:

```shell
#!/bin/sh
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
vllm serve ${MODEL_PATH} \
--max-num-batched-tokens 16384 \
--served-model-name PaddleOCR-VL-0.9B \
--trust-remote-code \
--no-enable-prefix-caching \
--mm-processor-cache-gb 0 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional_config '{"enable_cpu_binding":true}' \
--port 8000
```
::::
::::{tab-item} 310P
:sync: 310P
Run the following script to start the vLLM server on a single 310P:

```shell
#!/bin/sh
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
vllm serve ${MODEL_PATH} \
--max_model_len 16384 \
--served-model-name PaddleOCR-VL-0.9B \
--trust-remote-code \
--no-enable-prefix-caching \
--mm-processor-cache-gb 0 \
--enforce-eager \
--dtype float16 \
--port 8000
```
:::{note}
The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the 310P device.
:::
::::
:::::
#### Multiple NPU (PaddleOCR-VL)
Single-node deployment is recommended.
### Prefill-Decode Disaggregation
Not supported yet.
## Functional Verification
If the service starts successfully, you will see output like the following:
```bash
INFO: Started server process [87471]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
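
Instead of watching the log, readiness can also be checked from a script. The sketch below polls the OpenAI-compatible `/v1/models` endpoint on the port used above until it answers with HTTP 200. The function names, timeout, and polling interval are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ok(base_url: str = "http://localhost:8000/v1") -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe=models_endpoint_ok, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Call probe() repeatedly until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A deployment script could call `wait_until_ready()` after launching `deploy.sh` in the background and only send requests once it returns `True`.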
Once your server is started, you can use the OpenAI API client to make queries.
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

# Task-specific base prompts
TASKS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": TASKS["ocr"],
            },
        ],
    }
]

response = client.chat.completions.create(
    model="PaddleOCR-VL-0.9B",
    messages=messages,
    temperature=0.0,
)
print(f"Generated text: {response.choices[0].message.content}")
```
If the query succeeds, the client prints output like the following:
```bash
Generated text: CINNAMON SUGAR
1 x 17,000
17,000
SUB TOTAL
17,000
GRAND TOTAL
17,000
CASH IDR
20,000
CHANGE DUE
3,000
```
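
The request above references a remote image. For a local file, one common option with OpenAI-compatible servers is to embed the image as a base64 data URL in the same `image_url` field. The sketch below builds such a message. `build_messages` is a hypothetical helper, and the default MIME type is an assumption that should match the actual file.

```python
import base64

# Same task prompts as in the client example above.
TASKS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
}


def build_messages(image_bytes: bytes, task: str = "ocr",
                   mime_type: str = "image/png") -> list:
    """Build a chat message with the image embedded as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
                {"type": "text", "text": TASKS[task]},
            ],
        }
    ]
```

Reading a file with `open(path, "rb").read()` and passing the result of `build_messages(...)` as `messages` to `client.chat.completions.create` keeps the rest of the client code unchanged. Switching `task` to `"table"`, `"formula"`, or `"chart"` selects the corresponding prompt.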
## Offline Inference with vLLM and PP-DocLayoutV2
The example above shows how to run inference on the PaddleOCR-VL-0.9B model with vLLM. In practice, the PP-DocLayoutV2 layout detection model is usually combined with it to use the full capabilities of PaddleOCR-VL, which is consistent with the examples in the official PaddlePaddle documentation.
:::{note}
Use separate virtual environments for vLLM and PP-DocLayoutV2 to prevent dependency conflicts.
:::
:::::{tab-set}
:sync-group: install
::::{tab-item} PaddlePaddle
:sync: paddlepaddle
The 910B4 device supports inference using the PaddlePaddle framework.

1. Pull the PaddlePaddle-compatible CANN image:

```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-aarch64-gcc84
```
Start the container using the following command:
```bash
docker run -it --name paddle-npu-dev -v $(pwd):/work \
--privileged --network=host --shm-size=128G -w=/work \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/dcmi:/usr/local/dcmi \
-e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-$(uname -m)-gcc84 /bin/bash
```

2. Install [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR):

```bash
python -m pip install paddlepaddle==3.2.0
wget https://paddle-whl.bj.bcebos.com/stable/npu/paddle-custom-npu/paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
pip install paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
python -m pip install -U "paddleocr[doc-parser]"
pip install safetensors
```
:::{note}
System libraries required by OpenCV may be missing. If so, install them:

```bash
apt-get update
apt-get install -y libgl1 libglib2.0-0
```
CANN-8.0.0 does not support some versions of NumPy and OpenCV. Installing the following pinned versions is recommended:

```bash
python -m pip install numpy==1.26.4
python -m pip install opencv-python==3.4.18.65
```
:::
::::
::::{tab-item} OM inference
:sync: om
The 310P device supports only OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
::::
:::::
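
For the PaddlePaddle route, it can help to confirm that the environment actually has the pinned NumPy and OpenCV versions before running inference. The following is a minimal sketch using only the standard library. `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names, and the pins are copied from the commands in the PaddlePaddle tab.

```python
from importlib import metadata

# Pins taken from the pip commands in the PaddlePaddle tab.
PINNED = {"numpy": "1.26.4", "opencv-python": "3.4.18.65"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins: dict = PINNED, lookup=installed_version) -> dict:
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Running `print(find_mismatches())` inside the PP-DocLayoutV2 environment returns an empty dictionary when both pins match. Otherwise, it lists the expected and found versions for each package.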
### Using vLLM as the backend, combined with PP-DocLayoutV2 for offline inference
```python
from paddleocr import PaddleOCRVL

# Local directory that contains the PP-DocLayoutV2 layout detection model
doclayout_model_path = "/path/to/your/PP-DocLayoutV2/"

# Use the running vLLM server for recognition and PP-DocLayoutV2 for layout detection
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://localhost:8000/v1",
    layout_detection_model_name="PP-DocLayoutV2",
    layout_detection_model_dir=doclayout_model_path,
    device="npu",
)

output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png")

# Save each result as JSON and Markdown
for i, res in enumerate(output):
    res.save_to_json(save_path=f"output_{i}.json")
    res.save_to_markdown(save_path=f"output_{i}.md")
```
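
To process several documents in one run, the same calls can be wrapped in a small loop. The sketch below is an illustration under stated assumptions: `parse_documents` and the output naming scheme are hypothetical, and it relies only on the `predict`, `save_to_json`, and `save_to_markdown` calls shown above.

```python
from pathlib import Path


def parse_documents(pipeline, inputs, out_dir: str = "output") -> list:
    """Run pipeline.predict on each input and save JSON/Markdown per result."""
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    saved = []
    for doc_index, source in enumerate(inputs):
        # The example above iterates over the output, so allow several results per input.
        for res_index, res in enumerate(pipeline.predict(source)):
            stem = out_root / f"doc{doc_index}_res{res_index}"
            res.save_to_json(save_path=f"{stem}.json")
            res.save_to_markdown(save_path=f"{stem}.md")
            saved.append(str(stem))
    return saved
```

With the `pipeline` object created above, `parse_documents(pipeline, ["a.png", "b.png"])` would write files such as `output/doc0_res0.json` and `output/doc0_res0.md`. The input file names here are placeholders.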