---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- miromind-ai/MiroRL-GenQA
language:
- en
license: apache-2.0
tags:
- agent
- deepresearch
- llm
- rl
- reinforcementlearning
pipeline_tag: text-generation
library_name: transformers
---
# Model Card for PokeeResearch
## Model Details
### Model Description
**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.
- **Developed by:** Pokee AI
- **Model type:** Tool-augmented large language model (LLM) research agent
- **Language(s):** English (primary); Chinese and other languages are inherited from the multilingual base model
- **License:** Apache 2.0
- **Finetuned from model:** Qwen2.5-7B-Instruct
### Model Sources
- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025
- **Project Page:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview)
---
## Uses
### Direct Use
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:
- Decomposes complex user queries
- Retrieves and reads from external sources
- Synthesizes factual, verifiable, and grounded answers
It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.
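The decompose–retrieve–synthesize loop above can be sketched as a minimal tool-call cycle. This is an illustrative sketch only: the `generate` callable, the `search` tool name, and the `<tool_response>` tag are assumptions for the demo; the actual tool interface is defined in the PokeeResearchOSS codebase.

```python
import re

def run_research_loop(generate, tools, question, max_steps=8):
    """Drive one research thread until the model emits an <answer>."""
    transcript = question
    for _ in range(max_steps):
        output = generate(transcript)  # model produces the next turn
        answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
        if answer:
            return answer.group(1).strip()  # research thread finished
        call = re.search(r"<tool_call>(.*?)</tool_call>", output, re.S)
        if call:
            name, _, arg = call.group(1).strip().partition(":")
            result = tools[name.strip()](arg.strip())  # dispatch the tool
            transcript += output + f"\n<tool_response>{result}</tool_response>\n"
        else:
            transcript += output  # keep intermediate reasoning text
    return None  # no answer within the step budget

# Stub model that searches once, then answers (for demonstration only).
_turns = iter([
    "<tool_call>search: capital of France</tool_call>",
    "<answer>Paris</answer>",
])
demo_generate = lambda _ctx: next(_turns)
demo_tools = {"search": lambda q: "Paris is the capital of France."}
answer = run_research_loop(demo_generate, demo_tools,
                           "What is the capital of France?")
print(answer)  # prints "Paris"
```

In the real agent, `generate` is the PokeeResearch-7B model and the tools perform live web search and page reading; the loop structure is the same.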
### Downstream Use
PokeeResearch-7B can be **fine-tuned** or **extended** for:
- Domain-specific scientific discovery
- Autonomous document retrieval and synthesis
- Multi-source verification and summarization pipelines
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)
### Out-of-Scope Use
The model should **not** be used for:
- Generating unverified or speculative claims
- Automated decision-making in high-stakes domains (medical, legal, or financial)
- Applications requiring strict factual precision without external verification
- Generating content without citation or evidence tracing
---
## Bias, Risks, and Limitations
PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
- Dependence on **external data quality** and **retrieval accuracy**
- Potential **semantic bias** introduced by AI-based feedback signals
- Limited coverage for **non-English** or **multi-modal** reasoning tasks
- Risk of **hallucinated synthesis** when sources conflict or lack clarity
### Recommendations
Users should:
- Cross-verify answers, especially in multi-hop reasoning cases
- Monitor output for citation accuracy and alignment with source data
- Refrain from using outputs as sole evidence in decision-critical contexts
---
## How to Get Started with the Model
Please refer to the [PokeeResearchOSS README](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md) for instructions on running PokeeResearch-7B.
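As a minimal starting point, the checkpoint can be loaded with standard `transformers` APIs (a sketch of plain chat inference; the full agentic loop with search and reading tools requires the PokeeResearchOSS codebase):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/pokee_research_7b"

# Load tokenizer and model; bf16 matches the training precision.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Who proposed the transformer architecture?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that without tool access the model cannot retrieve external evidence, so answers from bare chat inference should be treated with extra caution.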
---
## Training Details
### Training Data
- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025)
- **Data characteristics:** Complex, multi-turn question-answer pairs requiring multi-step reasoning
- **Data filtering:** No benchmark test data was included in training; the model was trained only on open-domain text Q&A samples
### Training Procedure
#### Preprocessing
- Normalization and tokenization aligned with Qwen2.5 tokenizer
- Structured prompt-response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`)
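The structured format can be checked with a small tag parser (an illustrative sketch; the exact schema lives in the training code):

```python
import re

# Minimal parser for the structured agent format used in training:
# a turn may contain <tool_call>, <answer>, and <verification> blocks.
TAGS = ("tool_call", "answer", "verification")

def parse_turn(text):
    """Return a dict mapping each tag present in `text` to its content."""
    parsed = {}
    for tag in TAGS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        if m:
            parsed[tag] = m.group(1).strip()
    return parsed

turn = "<answer>Paris</answer><verification>Matches the retrieved source.</verification>"
print(parse_turn(turn))
# prints {'answer': 'Paris', 'verification': 'Matches the retrieved source.'}
```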
#### Training Hyperparameters
- **Algorithm:** RLOO (REINFORCE Leave-One-Out)
- **Batch size:** 64
- **Research threads per prompt:** 8
- **Learning rate:** 3e-6
- **Context limit:** 32,768 tokens
- **Steps:** 140 fine-tuning iterations
- **Regularization:** None (no entropy or KL regularization)
- **Precision regime:** bf16 mixed precision
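RLOO uses the other research threads sampled for the same prompt as a baseline: thread *i*'s advantage is its reward minus the mean reward of the remaining threads. A minimal sketch of that computation:

```python
def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out advantages: the baseline for each sample
    is the mean reward of the *other* samples for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    # advantage_i = r_i - mean(r_j for j != i)
    return [r - (total - r) / (k - 1) for r in rewards]

# 8 research threads for one prompt (k = 8, as in training).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
advs = [round(a, 3) for a in rloo_advantages(rewards)]
print(advs)
# prints [0.429, -0.714, 0.429, 0.429, -0.714, 0.429, -0.714, 0.429]
```

Because the baseline excludes the current sample, the gradient estimate stays unbiased while still reducing variance, which is why no extra entropy or KL regularization is needed here.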
#### Reward Design
- Combined reward signal from:
  - **AI feedback** (semantic equivalence judged by an external LLM)
  - **Format-adherence reward** (ensures correct agent behavior)
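The combination can be sketched as follows. The format check and the weighting `w_format` are illustrative assumptions; the paper defines the exact blend of the AI-judge signal and the format reward.

```python
import re

def format_reward(completion):
    """1.0 if the completion follows the expected agent format:
    exactly one <answer>...</answer> block (illustrative check only)."""
    n = len(re.findall(r"<answer>.*?</answer>", completion, re.S))
    return 1.0 if n == 1 else 0.0

def combined_reward(correctness, completion, w_format=0.1):
    """Blend the AI-judge correctness score with format adherence.
    The 0.1 weighting is an assumption for this sketch."""
    return correctness + w_format * format_reward(completion)

good = "<answer>Paris</answer>"
r_good = combined_reward(1.0, good)      # judged correct, well-formatted
r_bare = combined_reward(1.0, "Paris")   # judged correct, missing tags
print(r_good, r_bare)
```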
#### Speeds, Sizes, Times
- **Model size:** 7 billion parameters
- **Training duration:** ~5 days on 8 × A100 80G GPUs
- **Checkpoint size:** ~13 GB
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
10 open-domain research and QA benchmarks:
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity's Last Exam
#### Factors
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).
#### Metrics
- Mean accuracy (mean@4): accuracy averaged over four independent research threads per question
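The mean@k metric can be computed as a per-question average over k independent research threads, then averaged over questions (sketch):

```python
def mean_at_k(per_thread_correct):
    """per_thread_correct: one list per question, each holding k
    0/1 correctness flags (k = 4 independent research threads).
    Returns accuracy averaged over threads, then over questions."""
    per_q = [sum(flags) / len(flags) for flags in per_thread_correct]
    return sum(per_q) / len(per_q)

# Two questions, four threads each.
runs = [[1, 1, 0, 1], [0, 1, 0, 0]]
score = mean_at_k(runs)
print(score)  # (0.75 + 0.25) / 2 = 0.5
```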
### Results
**PokeeResearch-7B** (PR) and its **Research Threads Synthesis variant** (PR+) outperform all baselines at the 7B scale across the 10 benchmarks.
Highlights (mean@4 accuracy):
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |
#### Summary
PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF training and reasoning-scaffold design for robust, verifiable research workflows.
---
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning
### Compute Infrastructure
#### Hardware
- 8× NVIDIA A100 80 GB GPUs for training; 1× for inference
---
## Citation
**BibTeX:**
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan* and Jiuqi Wang* and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```
**APA:**
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.
---
## Glossary
- **RLAIF:** Reinforcement Learning from AI Feedback; policy optimization using LLM-based reward signals.
- **RLOO:** REINFORCE Leave-One-Out; an unbiased policy-gradient variant for on-policy learning.
- **RTS:** Research Threads Synthesis; synthesis of multiple independent reasoning threads at inference time.
---
## More Information
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
For inquiries, contact: hello@pokee.ai
---
## Model Card Authors
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team
## Model Card Contact
Pokee AI Team — hello@pokee.ai