
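# Iluvatar CoreX BI-100 (天垓100) Vision-Language Understanding
This test framework runs on the Iluvatar CoreX BI-100 accelerator. Using the Transformers framework, it adapts three models: gemma-3-4b-it, MiniCPM-Llama3-V-2_5, and MiniCPM_V_2_6.

* Gemma 3-4B-IT is the 4B-parameter lightweight multimodal model in Google's Gemma 3 series. It accepts image and text input, supports a 128K context window and more than 140 languages, and is designed for fast deployment on embedded devices.
* MiniCPM-Llama3-V 2.5 is an 8B multimodal model from OpenBMB built on SigLIP-400M and Llama3-8B-Instruct. It performs well in OCR, multilingual support, and deployment efficiency, with overall performance reported at the GPT-4V level.
* MiniCPM-V 2.6 is the newest and most capable 8B model in the MiniCPM-V series at the time of writing. It offers stronger single-image, multi-image, and video understanding, excellent OCR, and a low hallucination rate, and supports real-time video understanding on end-side devices such as the iPad.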

## Quick Start
1. Download a VLM from ModelScope, for example `gemma-3-4b-it`:
```bash
modelscope download --model LLM-Research/gemma-3-4b-it --local_dir /mnt/contest_ceph/wenyunqing/models/gemma-3-4b-it
```
2. Pull the server image:
```bash
docker pull git.modelhub.org.cn:9443/enginex-iluvatar/bi100-3.2.1-x86-ubuntu20.04-py3.10-poc-vlm-infer:0.0.1
```
3. Start the container, mounting the downloaded model at `/model`:
```bash
docker run -it --rm \
  -p 10086:8000 \
  --name test_wyq1 \
  -v /mnt/contest_ceph/wenyunqing/models/gemma-3-4b-it:/model:rw \
  --privileged git.modelhub.org.cn:9443/enginex-iluvatar/bi100-3.2.1-x86-ubuntu20.04-py3.10-poc-vlm-infer:0.0.1
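```

This must run on a host with a local BI-100 accelerator. The service listens on port 8000 inside the container, which `-p 10086:8000` publishes on host port 10086.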

4. Test the service.

4.1 Load the model:
```bash
curl -X POST http://localhost:10086/load_model \
  -H "Content-Type: application/json" \
  -d '{"model_path":"/model","dtype":"auto"}'
```
4.2 Run inference:
```bash
base64 -w 0 demo.jpeg | \
jq -Rs --arg mp "/model" --arg prompt "Describe the picture" \
   '{model_path: $mp, prompt: $prompt, images: ["data:image/jpeg;base64," + .], generation: {max_new_tokens: 50, temperature: 0.7}}' | \
curl -X POST "http://localhost:10086/infer" \
     -H "Content-Type: application/json" \
     -d @-
```

If `jq` is not installed locally, post a prepared request body from `test.json` instead:
```bash
curl -X POST "http://localhost:10086/infer" \
     -H "Content-Type: application/json" \
     -d @test.json
```
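
The README does not show how `test.json` is produced. One way to build an equivalent body without `jq` is sketched here; it assumes `test.json` should have the same shape as the `jq` example above, with the same generation settings. `make_infer_payload` and `json_escape` are hypothetical helper names, not part of this repository.

```bash
#!/usr/bin/env bash
# Sketch: write an /infer request body without jq.
# Assumption: test.json has the same shape as the jq example in this README;
# max_new_tokens and temperature are copied from that example.

# Escape backslashes and double quotes so the value is a valid JSON string.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# make_infer_payload IMAGE PROMPT [MODEL_PATH]: print the JSON body to stdout
make_infer_payload() {
  local image=$1 prompt=$2 model_path=${3:-/model} b64
  # GNU base64; -w 0 disables line wrapping, as in the jq pipeline
  b64=$(base64 -w 0 "$image") || return 1
  printf '{"model_path":"%s","prompt":"%s","images":["data:image/jpeg;base64,%s"],"generation":{"max_new_tokens":50,"temperature":0.7}}\n' \
    "$(json_escape "$model_path")" "$(json_escape "$prompt")" "$b64"
}
```

Usage: after sourcing the file, `make_infer_payload demo.jpeg "Describe the picture" > test.json` writes the body that the `curl ... -d @test.json` command sends.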


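## How the test service works
There is no industry-wide standard API for vision-language understanding, but the widely used Transformers framework can still serve as a common base for **adapting** different vision-language models.
To keep the test framework general, we added an adaptation layer on top of Transformers for each model family and expose it as an HTTP service.

The framework currently requires the user to mount the directory of the model under test into the local file system (for example at `/model`); the service is then started with uvicorn.

During a test, the external test harness first calls the load-model endpoint: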

```bash
curl -X POST http://localhost:10086/load_model \
  -H "Content-Type: application/json" \
  -d '{"model_path":"/model","dtype":"auto"}'
```
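
The call order described here (service up, then `load_model`, then `infer`) can be sketched as a small bash library. This is a minimal sketch, not the actual harness: `VLM_URL`, `wait_for_server`, `load_model`, `infer`, and `run_once` are hypothetical names, and because no health endpoint is documented, it treats any HTTP reply on `/` (even a 404) as a sign that uvicorn is up.

```bash
#!/usr/bin/env bash
# Sketch of the harness call order: wait for uvicorn, load once, then infer.
# Assumptions: VLM_URL is the host port published by `docker run`; there is
# no health endpoint, so any HTTP response on "/" counts as "server is up".
VLM_URL=${VLM_URL:-http://localhost:10086}

# wait_for_server [MAX_TRIES]: poll roughly once per second
wait_for_server() {
  local max_tries=${1:-120} i
  for (( i = 0; i < max_tries; i++ )); do
    # Without -f, curl exits 0 on any HTTP response, including 404.
    curl -s -o /dev/null --max-time 2 "$VLM_URL/" && return 0
    sleep 1
  done
  echo "no response from $VLM_URL after $max_tries attempts" >&2
  return 1
}

# load_model [MODEL_PATH]: same body as the load_model example in this README
load_model() {
  local model_path=${1:-/model}
  curl -sf -X POST "$VLM_URL/load_model" \
    -H "Content-Type: application/json" \
    -d "{\"model_path\":\"$model_path\",\"dtype\":\"auto\"}"
}

# infer BODY_FILE: post a prepared request body such as test.json
infer() {
  curl -sf -X POST "$VLM_URL/infer" \
    -H "Content-Type: application/json" \
    -d @"$1"
}

# run_once MODEL_PATH BODY_FILE: full sequence; prints the raw /infer response
run_once() {
  local model_path=$1 body_file=$2
  wait_for_server 120 || return 1
  load_model "$model_path" > /dev/null || { echo "load_model failed" >&2; return 1; }
  infer "$body_file"
}
```

Usage: after starting the container, `run_once /model test.json` prints whatever JSON the service returns. `curl -f` makes HTTP error statuses count as failures, so a failed load stops the sequence before any inference request is sent.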


## Request example
Prepare a test image and a question, then get the inference result from the `infer` endpoint:

```bash
base64 -w 0 demo.jpeg | \
jq -Rs --arg mp "/model" --arg prompt "Describe the picture" \
   '{model_path: $mp, prompt: $prompt, images: ["data:image/jpeg;base64," + .], generation: {max_new_tokens: 50, temperature: 0.7}}' | \
curl -X POST "http://localhost:10086/infer" \
     -H "Content-Type: application/json" \
     -d @-
```

Here the image is `demo.jpeg` and the question is `Describe the picture`; replace them as needed.
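
The results section times 10 questions per model. A client-side way to time a batch of prepared requests could look like this. It is an illustration only, not the script behind the published numbers: `time_requests` is a hypothetical name, and it assumes one JSON body per question (built like `test.json`), a model that is already loaded, and GNU `date`.

```bash
#!/usr/bin/env bash
# Sketch: time a batch of /infer requests end to end (client-side wall time).
# Assumptions: DIR holds one JSON request body per question, the model is
# already loaded, and GNU date (for %N nanoseconds) is available.
VLM_URL=${VLM_URL:-http://localhost:10086}

# time_requests DIR: post every DIR/*.json once, then print count and seconds
time_requests() {
  local dir=$1 start end n=0 f
  start=$(date +%s.%N)
  for f in "$dir"/*.json; do
    [[ -e $f ]] || continue   # glob matched nothing
    curl -sf -X POST "$VLM_URL/infer" \
      -H "Content-Type: application/json" \
      -d @"$f" > /dev/null || { echo "request failed: $f" >&2; return 1; }
    n=$((n + 1))
  done
  end=$(date +%s.%N)
  # Bash arithmetic is integer-only, so let awk do the subtraction.
  awk -v s="$start" -v e="$end" -v n="$n" \
    'BEGIN { printf "%d requests in %.4f s\n", n, e - s }'
}
```

Because this measures wall time on the client, it includes HTTP and base64 transfer overhead and may differ from the latency statistics reported by `server.py`.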

## Using the vision-language test framework
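VLMs usually need a lot of storage, so for efficient testing, download the model files in advance to persistent storage that the k8s cluster can mount (for example CephFS), then specify the model location when submitting a test.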
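
A small wrapper around the `modelscope download` command from Quick Start can fetch a model onto shared storage and sanity-check it before a test is submitted. This is a sketch: `download_model` and `MODEL_ROOT` are hypothetical names, the default root is the example path from Quick Start, and the `config.json` check assumes a Transformers-format checkpoint.

```bash
#!/usr/bin/env bash
# Sketch: pre-download a model onto shared storage before submitting a test.
# Assumptions: MODEL_ROOT is a CephFS (or similar) path the cluster can mount;
# the checkpoint is in Transformers format and therefore contains config.json.
MODEL_ROOT=${MODEL_ROOT:-/mnt/contest_ceph/wenyunqing/models}

# download_model MODEL_ID [ROOT]: print the local model directory on success
download_model() {
  local model_id=$1 root=${2:-$MODEL_ROOT}
  local dir="$root/${model_id##*/}"   # e.g. .../gemma-3-4b-it
  # Send the CLI's own output to stderr so stdout carries only the path.
  modelscope download --model "$model_id" --local_dir "$dir" >&2 || return 1
  if [[ ! -f $dir/config.json ]]; then
    echo "no config.json in $dir; is this a Transformers checkpoint?" >&2
    return 1
  fi
  printf '%s\n' "$dir"
}
```

Usage: `dir=$(download_model LLM-Research/gemma-3-4b-it)` returns the directory that can then be mounted with `docker run -v "$dir:/model:rw" ...`.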

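`docker-images/server.py` implements a VLM HTTP service that accepts an image and a question and returns the answer text together with latency statistics. The framework provides a ready-made image, `git.modelhub.org.cn:9443/enginex-iluvatar/bi100-3.2.1-x86-ubuntu20.04-py3.10-poc-vlm-infer:0.0.1` (with `server.py` as its entry point), for local testing on a machine with a suitable accelerator card.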

For comparison, we also provide the corresponding A100 image, `git.modelhub.org.cn:9443/enginex-iluvatar/a100-3.2.1-x86-ubuntu20.04-py3.10-poc-vlm-infer:0.0.1`.
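
Both images share the same naming scheme, so a single helper can start either one. This is a minimal sketch under assumptions that this README does not confirm: the A100 image also serves on container port 8000, the A100 host has the NVIDIA Container Toolkit installed (hence Docker's `--gpus all`), and the container runs detached (`-d`) instead of interactively. `start_server`, `REGISTRY`, and `TAG` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: start the BI-100 image or the A100 comparison image the same way.
# Assumptions: both images listen on port 8000 inside the container; the
# A100 host has the NVIDIA Container Toolkit, so --gpus all is used there.
REGISTRY=git.modelhub.org.cn:9443/enginex-iluvatar
TAG=3.2.1-x86-ubuntu20.04-py3.10-poc-vlm-infer:0.0.1

# start_server PLATFORM MODEL_DIR [HOST_PORT]; PLATFORM is bi100 or a100
start_server() {
  local platform=$1 model_dir=$2 host_port=${3:-10086}
  local -a extra
  case $platform in
    bi100) extra=(--privileged) ;;   # as in the BI-100 docker run example
    a100)  extra=(--gpus all) ;;     # assumption: NVIDIA runtime available
    *) echo "unknown platform: $platform (expected bi100 or a100)" >&2; return 1 ;;
  esac
  docker run -d --rm \
    -p "$host_port:8000" \
    -v "$model_dir:/model:rw" \
    "${extra[@]}" "$REGISTRY/$platform-$TAG"
}
```

Usage: `start_server a100 /mnt/contest_ceph/wenyunqing/models/gemma-3-4b-it 10087` would run the comparison image on a different host port, so the same `load_model` and `infer` requests can be pointed at either accelerator.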
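
## Benchmark results on BI-100
Several vision-language models were adapted to the BI-100. Each model answered 10 image-related questions on an NVIDIA A100 and on a BI-100 accelerator, and the running time was recorded.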

| Model | Model type | Adaptation status | BI-100 time (s) | NVIDIA A100 time (s) |
| ---------- | ---------------------- | -------- | ----------------- | --------------------- |
| Gemma 3-4B-IT | Gemma 3 series | Success | 30.9210 | 5.9024 |
| MiniCPM-Llama3-V 2.5 | OpenBMB 8B multimodal | Success | 112.0708 | 6.5424 |
| MiniCPM-V 2.6 | MiniCPM-V series | Success | 80.7702 | 3.7767 |
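
Since the table reports times for 10 questions, per-question time and the BI-100/A100 ratio follow from simple division. The `awk` snippet below does that arithmetic on rows copied from the table; `summarize_results` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Derive per-question time and the BI-100 / A100 time ratio from the table.
# Input lines: "model|BI-100 seconds|A100 seconds"; each total covers the
# 10 questions stated in the methodology.
summarize_results() {
  awk -F'|' -v q=10 '{
    printf "%s: BI-100 %.2f s/question, A100 %.2f s/question, BI-100/A100 = %.1fx\n",
           $1, $2 / q, $3 / q, $2 / $3
  }'
}

summarize_results <<'EOF'
Gemma 3-4B-IT|30.9210|5.9024
MiniCPM-Llama3-V 2.5|112.0708|6.5424
MiniCPM-V 2.6|80.7702|3.7767
EOF
```

For these rows it prints ratios of about 5.2x, 17.1x, and 21.4x: roughly 3.09 s, 11.21 s, and 8.08 s per question on the BI-100, against 0.38 to 0.65 s on the A100.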