model(vlm): pixtral (#5084)

2025-05-13 00:16:10 -07:00
parent b2e95f62b4
commit 5380cd7ea3
16 changed files with 1125 additions and 39 deletions
--- a/docs/supported_models/vision_language_models.md
+++ b/docs/supported_models/vision_language_models.md
@@ -20,6 +20,7 @@ python3 -m sglang.launch_server \
 | **Janus-Pro** (1B, 7B)      | `deepseek-ai/Janus-Pro-7B`                     | `janus-pro`       | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
 | **MiniCPM-V / MiniCPM-o**      | `openbmb/MiniCPM-V-2_6`                          | `minicpmv`           | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
 | **Llama 3.2 Vision** (11B)     | `meta-llama/Llama-3.2-11B-Vision-Instruct`        | `llama_3_vision`     | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
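+| **Pixtral** (12B, 124B)        | `mistral-community/pixtral-12b`                   | `mistral`            | Vision-language model from Mistral AI that pairs a vision encoder with a Mistral text decoder; accepts text interleaved with images for visual question answering and other multimodal tasks. |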
 | **LLaVA** (v1.5 & v1.6)        | *e.g.* `liuhaotian/llava-v1.5-13b`               | `vicuna_v1.1`        | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
 | **LLaVA-NeXT** (8B, 72B)       | `lmms-lab/llava-next-72b`                        | `chatml-llava`       | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
 | **LLaVA-OneVision**            | `lmms-lab/llava-onevision-qwen2-7b-ov`           | `chatml-llava`       | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |