---
license: apache-2.0
base_model: Qwen/Qwen3-4B
datasets:
- PeterJinGo/nq_hotpotqa_train
- orbit-ai/orbit-20k
language:
- en
tags:
- qwen3
- conversational
- verl
- verl-tool
pipeline_tag: text-generation
library_name: transformers
---
> [!NOTE]
> For full details, see the ORBIT paper [here](https://arxiv.org/abs/2604.01195).
> This is an ablation model: ORBIT-4B fine-tuned for 165 GRPO steps with the top-10 search documents retrieved at each turn.
> For general use cases, we recommend [orbit-ai/orbit-4b-v0.1](https://huggingface.co/orbit-ai/orbit-4b-v0.1) instead.
<img src="https://huggingface.co/orbit-ai/orbit-4b-v0.1/resolve/main/orbit-with-name-logo.png" alt="Figure 1" width="500"/>
## Orbit-4B (v0.1)
This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) open search agent built on the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained with the [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) framework and a live DDGS-based retriever.
Training ran for **165 GRPO steps** on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (`orbit-ai/orbit-20k`) in a 1:1:1 ratio, with the top-10 search documents retrieved at each turn.
---
## Model Details
| Property | Value |
|---|---|
| **Architecture** | Qwen3-4B |
| **Base checkpoint** | `Qwen/Qwen3-4B` |
| **Training algorithm** | GRPO |
| **Training steps** | 165 |
| **Training framework** | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) |
| **Hardware** | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| **Parallelism** | FSDP (param + optimizer offload) |
| **Rollout mode** | Async (vLLM v1) |
---
## Training Dataset
The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:
| Task | Category | Type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult multi-hop reasoning queries |
**Datasets:** `PeterJinGo/nq_hotpotqa_train`, `orbit-ai/orbit-20k`
**Train batch size:** 256 prompts per step (n = 8 rollouts per prompt → 2,048 trajectories per step)
---
## Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Rollouts | 8 |
| Learning rate | `1e-6` |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| GRPO rollouts per sample (n) | 8 |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (1) |
| KL loss coefficient | 0.0 |
| KL loss type | `low_var_kl` |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | 1 (full sharding) |
---
## Tool Configuration
The model was trained with live web search via a DDGS-based retrieval server:
| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | `google`, `brave`, `bing`, `wikipedia`, `grokipedia` |
| Top-k documents per query | 10 |
| Backend strategy | **Parallel fan-out** — all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | `</search>`, `</answer>` |
| Observations masked in loss | Yes (`mask_observations=True`) |
The retriever server runs as a FastAPI service. At each agent turn the model issues a `<search> query </search>` action; the tool server retrieves results and returns them as `<information>…</information>` observations. The trajectory ends when the model emits `<answer> … </answer>` or the turn budget is exhausted.
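The parallel fan-out strategy above can be sketched in a few lines. This is a minimal illustration, not the actual `ddgs_retrieval_optimized.py` implementation; the backend callables and the `{"url": ..., "snippet": ...}` document shape are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out_search(query, backends, top_k=10, timeout=10):
    """Query every backend in parallel, then merge and deduplicate by URL.

    `backends` maps a backend name to a callable that takes a query string
    and returns a list of {"url": ..., "snippet": ...} dicts.
    """
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(fn, query) for fn in backends.values()]
        results = []
        for f in futures:
            try:
                results.extend(f.result(timeout=timeout))
            except Exception:
                continue  # one slow or failed backend should not sink the request
    # Deduplicate while preserving first-seen order, then truncate to top-k
    seen, merged = set(), []
    for doc in results:
        if doc["url"] not in seen:
            seen.add(doc["url"])
            merged.append(doc)
    return merged[:top_k]
```

Because every backend is queried simultaneously, end-to-end latency is bounded by the slowest (or timed-out) backend rather than the sum of all of them.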
---
## Training Infrastructure
| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB (`$TMPDIR`, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no `NCCL_P2P_DISABLE`) |
| vLLM version | v1 (`VLLM_USE_V1=1`) |
| Checkpoint frequency | Every 5 steps |
---
## Reward Model
**Reward manager:** `search_r1_qa_em`
Reward is computed as exact match (EM) between the model's final `<answer>` and the reference target string. Multi-turn rollouts are scored on the final answer only; intermediate search turns receive no reward. Observation tokens are masked out of both the KL term and the policy-gradient loss.
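In spirit, the reward function extracts the final `<answer>` span and compares it against the target after SQuAD-style normalization. A minimal sketch (the exact normalization used by `search_r1_qa_em` may differ):

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_reward(trajectory, target):
    """Return 1.0 iff the final <answer>...</answer> exactly matches the target."""
    answers = re.findall(r"<answer>(.*?)</answer>", trajectory, flags=re.DOTALL)
    if not answers:
        return 0.0  # no terminal answer emitted within the turn budget
    return 1.0 if normalize(answers[-1]) == normalize(target) else 0.0
```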
---
## Usage
This model is designed to be used with a running tool server that handles `<search>` actions. Inference without a live retriever will fall back to the model's parametric knowledge.
### With verl-tool (recommended)
```bash
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate
# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 10 \
    --backend "google,brave,bing,wikipedia,grokipedia"
# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```
### Direct inference (parametric knowledge only)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "orbit-ai/orbit-4b-ablation-top-10-docs-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
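To approximate the trained tool-use loop without verl-tool, one could alternate generation with retrieval, feeding results back as `<information>` blocks. The sketch below uses caller-supplied `generate` and `search` callables as stand-ins; with `transformers`, `generate` could wrap `model.generate(..., stop_strings=["</search>", "</answer>"], tokenizer=tokenizer)` so decoding halts at each action boundary:

```python
import re

def run_agent(generate, search, question, max_turns=5):
    """Alternate model generation and retrieval until <answer> or turn budget.

    `generate(context)` returns the model's next text chunk, ending at
    </search> or </answer>; `search(query)` returns a documents string
    to wrap in <information> tags.
    """
    context = question
    for _ in range(max_turns):
        chunk = generate(context)
        context += chunk
        answer = re.search(r"<answer>(.*?)</answer>", chunk, re.DOTALL)
        if answer:
            return answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if query:
            docs = search(query.group(1).strip())
            context += f"\n<information>{docs}</information>\n"
    return None  # turn budget exhausted without a final answer
```

This mirrors the training-time protocol: the trajectory ends on `<answer>` or after `max_turns` (5 during training), and each `<search>` is answered with an `<information>` observation.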
---
## Intended Use & Limitations
- **Intended use:** Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- **Language:** English only.
- **Search dependency:** Peak performance requires a live web search backend. Without search, the model still reasons using parametric knowledge but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- **Not intended for production deployment** without additional safety filtering.
---
## Citation
If you use this model or the training methodology, please cite:
```bibtex
@misc{thakur2026orbit,
  title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
  author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
  year={2026},
  eprint={2604.01195},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.01195},
}
```
---