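# Chinese Food Multimodal Recognition Model

## Project Overview

This project builds a multimodal large model for food recognition and food-composition analysis. It uses `qwen-vl2.5-3B` as the base model and fine-tunes it with LoRA (Low-Rank Adaptation). The training data combines images of Chinese and foreign dishes with descriptions of their main ingredients, framed as question-answer pairs, so that the fine-tuned model can identify a dish and list what it is made of.
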
## Model Architecture

- **Base model**: `qwen-vl2.5-3B`
- **Fine-tuning method**: LoRA (Low-Rank Adaptation)
- **Training data**:
  - Paired data linking each dish to an analysis of its ingredients.

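## Dataset

### Data Sources

1. **Chinese food images**: photos of dishes, each paired with an analysis of the main ingredients.
2. **Foreign food images**: photos of dishes, each paired with an analysis of the main ingredients.
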
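### Data Preprocessing

- **Image processing**: standardize the food images (for example, resizing and normalization).
- **Text processing**: clean and annotate the question-answer dialogues so that questions and answers are accurate.
- **Multimodal alignment**: align each image with its corresponding text to build multimodal training samples.
- **Data splitting**: split the dataset into training, validation, and test sets.
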
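The alignment and splitting steps above might look roughly like the following sketch. The `build_sample` and `split_dataset` helpers, the 80/10/10 ratio, the fixed seed, and the file names are illustrative assumptions rather than the project's actual pipeline; the message layout mirrors the chat format used in the inference example in this README.

```python
import json
import random


def build_sample(image_path, dish, ingredients, question="图上有什么?"):
    """Pair one image with a question and a JSON answer (hypothetical layout)."""
    answer = {"食物名称": {dish: {"食品组成": list(ingredients)}}}
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": question},
                ],
            },
            {"role": "assistant", "content": json.dumps(answer, ensure_ascii=False)},
        ]
    }


def split_dataset(samples, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle deterministically, then cut into train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical records: (image file, dish name, main ingredients).
records = [(f"dish_{i}.jpg", "咖喱肉", ["牛肉", "咖喱酱", "米饭"]) for i in range(20)]
samples = [build_sample(*record) for record in records]
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

Seeding the shuffle keeps the split reproducible, so the same images stay in the test set across training runs.
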
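## Model Training

### Fine-Tuning Method

The `qwen-vl2.5` model is fine-tuned with LoRA. LoRA freezes the pretrained weight matrices and learns a low-rank update for each adapted matrix, which sharply reduces the number of trainable parameters and lowers the compute cost of fine-tuning while largely preserving model performance.
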
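To make the parameter saving concrete: for a weight matrix W of shape d × k, LoRA learns two small matrices B (d × r) and A (r × k) and uses W + (alpha / r) · B · A in place of W, so only r · (d + k) values are trained instead of d · k. The sketch below works through that arithmetic in plain Python. The dimensions, rank, and scaling factor are illustrative assumptions, not the settings used for this model.

```python
def lora_trainable_params(d, k, r):
    """Number of trainable values in the LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, the weight used after adaptation."""
    r = len(a)  # rank = number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative sizes (assumed): a 2048 x 2048 projection adapted with rank 8.
d, k, r = 2048, 2048, 8
full = d * k
lora = lora_trainable_params(d, k, r)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")

# Tiny rank-1 example: W stays frozen, only B and A would be trained.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # shape 2 x 1
a = [[0.5, -0.5]]    # shape 1 x 2
print(lora_effective_weight(w, b, a, alpha=2.0))
```

In this assumed configuration the LoRA factors hold under one percent of the values in the full matrix, which is where the cost saving comes from.
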
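### Training Steps

1. **Load the pretrained model**: load the `qwen-vl2.5` model.
2. **Apply LoRA**: apply LoRA to the model's key layers.
3. **Train the model**: fine-tune on the prepared multimodal dataset.
4. **Validate and test**: evaluate the model on the validation and test sets, and tune hyperparameters to improve the results.
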
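## Model Evaluation

### Evaluation Metrics

- **Precision**: the proportion of the model's answers in the question-answering task that are correct.
- **Recall**: the proportion of questions that the model is able to answer correctly.
- **F1 score**: the harmonic mean of precision and recall.
- **Multimodal understanding**: the model's ability to jointly understand image and text information.
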
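One way to compute these scores for ingredient lists is to compare the predicted set of ingredients with a reference set, as sketched below. Treating each ingredient as one item and matching on exact strings are assumptions made for illustration; the project's actual evaluation protocol is not specified in this README.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 for one sample."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical prediction and reference for a single image.
predicted = ["牛肉", "咖喱酱", "米饭", "土豆"]
reference = ["牛肉", "咖喱酱", "米饭"]
p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here the extra predicted ingredient lowers precision while recall stays at 1.0, and F1 balances the two.
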
### Evaluation Results

Results on the test set are as follows:

To be updated.

## Usage Guide

### Environment Setup

1. **Python version**: 3.10+
2. **Dependencies**:
   - `transformers`
   - `torch`
   - `modelscope`
   - `qwen-vl-utils` (provides `qwen_vl_utils`, which the inference example imports)

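A quick way to confirm that the environment matches the requirements above is sketched below. It only checks that the interpreter is new enough and that each package can be found; the helper names are illustrative, and no particular package versions are assumed.

```python
import importlib.util
import sys

# Package names taken from the dependency list above.
REQUIRED = ["transformers", "torch", "modelscope"]


def python_ok(minimum=(3, 10)):
    """True if the running interpreter is at least the given version."""
    return sys.version_info >= minimum


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python OK:", python_ok())
print("Missing packages:", missing_packages(REQUIRED))
```
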
### Download the Model

SDK download

```bash
# Install ModelScope
pip install modelscope
```

```python
# Download the model
from modelscope import snapshot_download

model_dir = snapshot_download('zpeng1989/Chinese_Food_Qwen25vl_3B_Model')
```

Git download

```bash
# Download the model with Git
git clone https://www.modelscope.cn/zpeng1989/Chinese_Food_Qwen25vl_3B_Model.git
```

### Model Inference

The following example shows how to load the model and run inference:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "qwen25_3b_vl_lora_sft_v20250511", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
# export_dir: /data/cgu/Model_Train/Model/qwen25_3b_vl_lora_sft_v20250511
processor = AutoProcessor.from_pretrained("qwen25_3b_vl_lora_sft_v20250511")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "*.jpg",
            },
            {"type": "text", "text": "图上有什么?"},  # "What is in the picture?"
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

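The `min_pixels` and `max_pixels` comments in the example encode a token budget: each visual token covers a 28 × 28 pixel patch, so a range of 256 to 1280 tokens maps to `256*28*28` to `1280*28*28` pixels. The sketch below reproduces that arithmetic and estimates how many visual tokens an image would use after being scaled into the range. The `pixel_bounds` and `estimate_visual_tokens` helpers are simplified approximations written for illustration, not the processor's actual resizing code.

```python
import math

PATCH = 28  # pixels per side covered by one visual token, per the comments above


def pixel_bounds(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH * PATCH, max_tokens * PATCH * PATCH


def estimate_visual_tokens(height, width, min_pixels, max_pixels):
    """Scale the image area into the pixel range, snap each side to a
    multiple of PATCH, and count the resulting patches."""
    area = height * width
    scale = 1.0
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)
    h = max(PATCH, round(height * scale / PATCH) * PATCH)
    w = max(PATCH, round(width * scale / PATCH) * PATCH)
    return (h // PATCH) * (w // PATCH)


min_pixels, max_pixels = pixel_bounds(256, 1280)
print(min_pixels, max_pixels)
print(estimate_visual_tokens(1080, 1920, min_pixels, max_pixels))
```

Because each side is rounded to the nearest multiple of 28, this estimate can land slightly outside the requested token range; it is meant only to give a feel for the cost of a given image size.
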
## Cases

### Case 01

Input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "A.jpg",
            },
            {"type": "text", "text": "图上有什么?"},  # "What is in the picture?"
        ],
    }
]
```

Output:

```json
{
    "食物名称": {
        "炒菜": {
            "食品组成": [
                "鸡肉",
                "蘑菇",
                "青豆",
                "胡萝卜",
                "玉米",
                "辣椒"
            ]
        }
    }
}
```

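Because the answer is returned as JSON text, downstream code can parse it into a dictionary. The sketch below assumes the output follows the structure shown in the cases (`食物名称` mapping each dish to a `食品组成` list). `parse_food_output` is a hypothetical helper, and it falls back to an empty result when the text is not valid JSON.

```python
import json


def parse_food_output(text):
    """Return {dish: [ingredients]} from the model's JSON answer, or {} on failure."""
    # Keep only the outermost {...} in case extra characters surround the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    dishes = data.get("食物名称", {})
    if not isinstance(dishes, dict):
        return {}
    return {
        name: info.get("食品组成", [])
        for name, info in dishes.items()
        if isinstance(info, dict)
    }


raw = '{"食物名称": {"炒菜": {"食品组成": ["鸡肉", "蘑菇", "青豆"]}}}'
print(parse_food_output(raw))
print(parse_food_output("not json"))
```

With the inference example, the input would be an element of `output_text`, for instance `parse_food_output(output_text[0])`.
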
### Case 02

Input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "B.jpg",
            },
            {"type": "text", "text": "图上有什么?"},  # "What is in the picture?"
        ],
    }
]
```

Output:

```json
{
    "食物名称": {
        "咖喱肉": {
            "食品组成": [
                "牛肉",
                "咖喱酱",
                "米饭"
            ]
        }
    }
}
```

### Case 03

Input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "E.png",
            },
            {"type": "text", "text": "图上有什么?"},  # "What is in the picture?"
        ],
    }
]
```

Output:

```json
{
    "食物名称": {
        "红油火锅": {
            "食品组成": [
                "红油",
                "辣椒",
                "花椒",
                "蔬菜",
                "肉类",
                "配菜"
            ]
        },
        "蔬菜": {
            "食品组成": [
                "胡萝卜",
                "青菜"
            ]
        },
        "肉类": {
            "食品组成": [
                "牛肉",
                "鸡肉",
                "猪肉"
            ]
        },
        "配菜": {
            "食品组成": [
                "蘑菇",
                "香菇",
                "玉米"
            ]
        }
    }
}
```

## Acknowledgements

- Thanks to the developers of the `qwen-vl2.5` model.

## Contact

For any questions, please contact [592392714@qq.com].

---

**Note**: This project is for academic research use only and does not constitute medical advice.