---
license: apache-2.0
base_model: Qwen/Qwen3-4B
datasets:
- PeterJinGo/nq_hotpotqa_train
- orbit-ai/orbit-20k
language:
- en
tags:
- qwen3
- conversational
- verl
- verl-tool
pipeline_tag: text-generation
library_name: transformers
---

> [!Note]
> For full details, see the ORBIT paper [here](https://arxiv.org/abs/2604.01195).
> This is an ablation model: ORBIT-4B fine-tuned for 165 GRPO steps with top-10 search documents at each turn.
> We recommend [orbit-ai/orbit-4b-v0.1](https://huggingface.co/orbit-ai/orbit-4b-v0.1) for most use cases.

<img src="https://huggingface.co/orbit-ai/orbit-4b-v0.1/resolve/main/orbit-with-name-logo.png" alt="Figure 1" width="500"/>

## Orbit-4B (v0.1)

This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) expert open search agent built on the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained with the [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) framework against a live DDGS-based retriever.

Training ran for **165 GRPO steps** on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (orbit-ai/orbit) in a 1:1:1 ratio, with top-10 search documents at each turn.
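
The 1:1:1 mix can be pictured as a simple round-robin interleave over the three sources; `round_robin_mix` below is a hypothetical illustration, not the actual data-loading code:

```python
from itertools import cycle

def round_robin_mix(*datasets):
    """Interleave datasets one example at a time (a 1:1:1 ratio for three
    datasets), stopping once the shortest dataset is exhausted."""
    iterators = [iter(d) for d in datasets]
    mixed = []
    for it in cycle(iterators):
        try:
            mixed.append(next(it))
        except StopIteration:
            break
    return mixed

# Toy stand-ins for the three QA sources
nq, hotpot, orbit = ["nq_0", "nq_1"], ["hp_0", "hp_1"], ["ob_0", "ob_1"]
print(round_robin_mix(nq, hotpot, orbit))
# → ['nq_0', 'hp_0', 'ob_0', 'nq_1', 'hp_1', 'ob_1']
```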

---

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Qwen3-4B |
| **Base checkpoint** | `Qwen/Qwen3-4B` |
| **Training algorithm** | GRPO |
| **Training steps** | 165 |
| **Training framework** | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) |
| **Hardware** | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| **Parallelism** | FSDP (param + optimizer offload) |
| **Rollout mode** | Async (vLLM v1) |

---

## Training Dataset

The model was trained on a mixed retrieval-QA dataset with equal sampling across three tasks:

| Task | Category | Type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult multi-hop reasoning queries |

**Dataset names:** `PeterJinGo/nq_hotpotqa_train`, `orbit-ai/orbit-20k`
**Train batch size:** 256 samples per step (n = 8 rollouts per sample → 2048 trajectories/step)

---

## Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| GRPO rollouts per sample (n) | 8 |
| Learning rate | `1e-6` |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | `low_var_kl` |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |
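
For intuition: GRPO scores each of the n = 8 rollouts relative to its own group rather than against a learned value baseline. A minimal sketch of the group-relative advantage computation (not the verl-tool implementation):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and std, so a
    correct answer among mostly-wrong siblings gets a large positive
    advantage, and vice versa."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# One prompt, n = 8 rollouts, terminal exact-match rewards in {0, 1}
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = grpo_advantages(rewards)
```

Advantages within a group sum to (approximately) zero, which is what makes the per-prompt baseline unnecessary.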

---

## Tool Configuration

The model was trained with live web search via a DDGS-based retrieval server:

| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | `google`, `brave`, `bing`, `wikipedia`, `grokipedia` |
| Top-k documents per query | 10 |
| Backend strategy | **Parallel fan-out**: all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | `</search>`, `</answer>` |
| Observations masked in loss | Yes (`mask_observations=True`) |
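
The fan-out merge can be pictured as below; `merge_backend_results` is a hypothetical sketch (the retriever server's actual merge and ranking logic may differ):

```python
def merge_backend_results(per_backend, topk=10):
    """Merge result lists from backends queried in parallel: interleave by
    rank so no single backend dominates, deduplicate by URL (first
    occurrence wins), and truncate to the top-k documents."""
    seen, merged = set(), []
    longest = max((len(r) for r in per_backend.values()), default=0)
    for rank in range(longest):
        for results in per_backend.values():
            if rank < len(results):
                doc = results[rank]
                if doc["url"] not in seen:
                    seen.add(doc["url"])
                    merged.append(doc)
    return merged[:topk]

results = {
    "google":    [{"url": "a"}, {"url": "b"}],
    "wikipedia": [{"url": "a"}, {"url": "c"}],
}
print([d["url"] for d in merge_backend_results(results, topk=3)])
# → ['a', 'b', 'c']
```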

The retriever server runs as a FastAPI service. At each agent turn the model issues a `<search> query </search>` action; the tool server retrieves results and returns them as `<information>…</information>` observations. The trajectory ends when the model emits `<answer> … </answer>` or the turn budget is exhausted.
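
That turn loop can be sketched as follows; `generate` and `retrieve` stand in for the policy model and the retrieval client, and all names here are illustrative rather than part of verl-tool's API:

```python
import re

def parse_action(text):
    """Extract the trailing <search> or <answer> action from a model turn.
    Returns (kind, payload), or (None, None) if no action is present."""
    m = re.search(r"<search>(.*?)</search>\s*$", text, re.DOTALL)
    if m:
        return "search", m.group(1).strip()
    m = re.search(r"<answer>(.*?)</answer>\s*$", text, re.DOTALL)
    if m:
        return "answer", m.group(1).strip()
    return None, None

def run_episode(generate, retrieve, prompt, max_turns=5):
    """Multi-turn loop: generate until a stop token, feed search results
    back as an <information> observation, stop on an answer or when the
    turn budget is exhausted."""
    context = prompt
    for _ in range(max_turns):
        turn = generate(context)
        context += turn
        kind, payload = parse_action(turn)
        if kind == "answer":
            return payload
        if kind == "search":
            docs = retrieve(payload)
            context += f"\n<information>{docs}</information>\n"
        else:
            break  # malformed turn: no recognizable action
    return None  # turn budget exhausted without an answer
```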

---

## Training Infrastructure

| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB (`$TMPDIR`, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no `NCCL_P2P_DISABLE`) |
| vLLM version | v1 (`VLLM_USE_V1=1`) |
| Checkpoint frequency | Every 5 steps |

---

## Reward Model

**Reward manager:** `search_r1_qa_em`

Reward is computed as exact match (EM) between the model's `<answer>` and the reference target string. Multi-turn rollouts are scored on the final answer only; intermediate search turns receive no reward. Observations are masked from the KL and policy-gradient loss.
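
A minimal sketch of such an EM reward, assuming SQuAD-style answer normalization (lowercasing, dropping punctuation and articles); the actual `search_r1_qa_em` manager may normalize differently:

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, drop punctuation and the
    articles a/an/the, and collapse whitespace before comparing."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_reward(response, golden_answers):
    """Terminal reward: 1.0 if the extracted <answer> exactly matches any
    reference answer after normalization, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0  # no well-formed answer emitted
    pred = normalize_answer(m.group(1))
    return 1.0 if any(pred == normalize_answer(g) for g in golden_answers) else 0.0
```

With terminal-only EM, a trajectory's 8-rollout group typically mixes 0s and 1s, which is exactly what the group-relative GRPO advantage needs.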

---

## Usage

This model is designed to be used with a running tool server that handles `<search>` actions. Inference without a live retriever falls back to the model's parametric knowledge.

### With verl-tool (recommended)

```bash
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate

# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```

### Direct inference (parametric knowledge only)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

---

## Intended Use & Limitations

- **Intended use:** Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- **Language:** English only.
- **Search dependency:** Peak performance requires a live web-search backend. Without search, the model still reasons from parametric knowledge, but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- **Not intended for production deployment** without additional safety filtering.

---

## Citation

If you use this model or the training methodology, please cite:

```
@misc{thakur2026orbit,
      title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
      author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
      year={2026},
      eprint={2604.01195},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.01195},
}
```

---