---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- medical
- vqa
- qwen2.5-vl
- synthvision
pipeline_tag: visual-question-answering
---

# Qwen2.5-3B-MedVL

Qwen2.5-VL-3B-Instruct fine-tuned on ~200K medical VQA records from the SynthVision pipeline.

## Benchmark Results (Exact Match)

| Model | VQA-RAD | PathVQA | SLAKE | Avg EM |
|-------|---------|---------|-------|--------|
| Base (Qwen2.5-VL-3B-Instruct) | 0.5033 | 0.3038 | 0.5438 | 0.4503 |
| **Fine-tuned** | **0.5211** | **0.3468** | **0.6032** | **0.4903** |
| Relative improvement | +3.5% | +14.2% | +10.9% | +8.9% |
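
The percentages in the last row are relative gains over the base model, not absolute differences in exact match. The following sketch recomputes them from the rounded scores in the table; the helper names `mean` and `relative_gain` are illustrative and not part of the SynthVision code. Because the per-dataset scores are already rounded to four decimals, the recomputed fine-tuned average rounds to 0.4904 rather than the 0.4903 reported above.

```python
# Exact-match scores copied from the table above (already rounded to 4 decimals).
base = {"VQA-RAD": 0.5033, "PathVQA": 0.3038, "SLAKE": 0.5438}
tuned = {"VQA-RAD": 0.5211, "PathVQA": 0.3468, "SLAKE": 0.6032}


def mean(scores):
    """Unweighted mean over the three benchmarks."""
    return sum(scores.values()) / len(scores)


def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


for name in base:
    print(f"{name}: {relative_gain(base[name], tuned[name]):+.1f}%")

print(f"Avg EM: {relative_gain(mean(base), mean(tuned)):+.1f}%")
```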

## Usage

### Transformers

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "OpenMed/Qwen2.5-3B-MedVL"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# One user turn containing an image and a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/xray.jpg"},
            {"type": "text", "text": "What are the key findings in this chest X-ray?"},
        ],
    }
]

# Apply the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(model="OpenMed/Qwen2.5-3B-MedVL", max_model_len=4096, limit_mm_per_prompt={"image": 1})

messages = [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://example.com/xray.jpg"}},
    {"type": "text", "text": "What are the key findings in this chest X-ray?"},
]}]

output = llm.chat(messages, SamplingParams(temperature=0, max_tokens=512))
print(output[0].outputs[0].text)
```

### SGLang

```bash
# Launch server
python -m sglang.launch_server --model-path OpenMed/Qwen2.5-3B-MedVL --chat-template qwen2-vl --port 8000
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="OpenMed/Qwen2.5-3B-MedVL",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/xray.jpg"}},
        {"type": "text", "text": "What are the key findings in this chest X-ray?"},
    ]}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
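
The examples above load the image from a URL. For a local file, one common approach with OpenAI-style chat APIs is to embed the image as a base64 data URL in the `image_url` field. The sketch below uses only the standard library; the helpers `image_to_data_url` and `build_messages` are hypothetical names, and it assumes the serving backend accepts data URLs the way the OpenAI chat API does.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        # Unknown extension: fall back to a generic binary type.
        mime = "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(image_path, question):
    """Build a single-turn message list in the image_url format used above."""
    return [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        {"type": "text", "text": question},
    ]}]
```

The result of `build_messages("xray.jpg", "What are the key findings in this chest X-ray?")` can be passed as `messages` to the client call shown above, where `xray.jpg` stands for any local image path.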

## Training Details

- **Base model**: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- **Data**: ~200K medical VQA records from the [SynthVision pipeline](https://huggingface.co/blog/OpenMed/synthvision)
- **Method**: LoRA (rank=32, alpha=32)
- **Target modules**: q_proj, v_proj, k_proj, o_proj
- **Learning rate**: 7e-5, cosine schedule
- **Steps**: 700
- **Weight decay**: 0.03
- **Hardware**: 4x NVIDIA A100 80GB (48 vCPU, 568 GB RAM) via [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs)
- **Training time**: 1h 14m
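
To make the hyperparameters above concrete, here is a minimal sketch of what a cosine schedule with a peak learning rate of 7e-5 over 700 steps looks like, along with the scaling factor implied by rank=32 and alpha=32. It assumes no warmup, decay to zero, and the standard LoRA scaling of alpha / rank; the card does not state any of these, so treat the numbers as illustrative rather than as the actual training configuration.

```python
import math

PEAK_LR = 7e-5     # learning rate from the list above
TOTAL_STEPS = 700  # training steps from the list above
LORA_RANK = 32
LORA_ALPHA = 32


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS):
    """Cosine decay from peak_lr to 0 (assumption: no warmup, final LR of 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Assumption: standard LoRA scaling (alpha / rank), not the rank-stabilized variant.
lora_scale = LORA_ALPHA / LORA_RANK

for step in (0, 175, 350, 525, 700):
    print(f"step {step:>3}: lr = {cosine_lr(step):.2e}")
print(f"LoRA scaling factor: {lora_scale}")
```

With rank equal to alpha, the scaling factor is 1.0, so under this assumption the low-rank update is added to the frozen weights without rescaling.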

## Links

- [SynthVision blog post](https://huggingface.co/blog/OpenMed/synthvision)
- [Source code](https://github.com/openmed-labs/synthvision)
- [All SynthVision artifacts](https://huggingface.co/collections/OpenMed/synthvision-69baac655b557943aa1babd3)
- [OpenMed on Hugging Face](https://huggingface.co/OpenMed)