"""You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think></think> tags. The final answer MUST BE put in \\boxed{}."""
We recommend setting `temperature=0` to reproduce the reported performance. Note that performance may vary depending on the version of vLLM being used.
## Inference Guide
### Installation
Install the required dependencies:
```bash
pip install vllm==0.8.1
```
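Since performance may shift across vLLM releases (see the note above), it can be worth confirming the installed version before serving. A minimal sketch of such a check:
```python
# Confirm the installed vLLM version matches the pinned release above.
import vllm

assert vllm.__version__ == "0.8.1", f"expected vLLM 0.8.1, found {vllm.__version__}"
print(f"vLLM {vllm.__version__} installed.")
```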
### Starting the Server
Start the vLLM server with the following command:
```bash
CUDA_VISIBLE_DEVICES=0 vllm serve /PATH/TO/FAST \
--max-model-len 12800 \
--dtype auto \
--gpu-memory-utilization 0.75 \
--trust-remote-code \
--max-num-seqs 12 \
--mm-processor-kwargs '{"max_pixels":1002112}'
```
Replace `/PATH/TO/FAST` with the actual path to your model.
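Before running the demo, you can verify that the server is up and serving the model. A minimal check against the OpenAI-compatible endpoint that vLLM exposes, assuming the default host and port (`localhost:8000`):
```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The listed model ID is the path that was passed to `vllm serve`.
for model in client.models.list():
    print(model.id)
```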
### Simple Demo
```python
import base64
# Define the system prompt
SYSTEM_PROMPT = """You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think></think> tags. The final answer MUST BE put in \\boxed{}."""