---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- miromind-ai/MiroRL-GenQA
language:
- en
license: apache-2.0
tags:
- agent
- deepresearch
- llm
- rl
- reinforcementlearning
pipeline_tag: text-generation
library_name: transformers
---
# Model Card for PokeeResearch
## Model Details
### Model Description
**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.
- **Developed by:** Pokee AI
- **Model type:** Tool-augmented large language model (LLM) research agent
- **Language(s):** English (primary); Chinese and other languages are inherited from the multilingual base model
- **License:** Apache 2.0
- **Finetuned from model:** Qwen2.5-7B-Instruct
### Model Sources
- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025
- **Project Page:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview)
---
## Uses
### Direct Use
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:
- Decomposes complex user queries
- Retrieves and reads from external sources
- Synthesizes factual, verifiable, and grounded answers
It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.
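The decompose–retrieve–synthesize loop above can be sketched as a minimal tool-call cycle. This is an illustrative sketch only: the `generate` callable, the `search` tool name, and the `<tool_response>` tag are assumptions for the demo; the actual tool interface is defined in the PokeeResearchOSS codebase.

```python
import re

def run_research_loop(generate, tools, question, max_steps=8):
    """Drive one research thread until the model emits an <answer>."""
    transcript = question
    for _ in range(max_steps):
        output = generate(transcript)  # model produces the next turn
        answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
        if answer:
            return answer.group(1).strip()  # research thread finished
        call = re.search(r"<tool_call>(.*?)</tool_call>", output, re.S)
        if call:
            name, _, arg = call.group(1).strip().partition(":")
            result = tools[name.strip()](arg.strip())  # dispatch the tool
            transcript += output + f"\n<tool_response>{result}</tool_response>\n"
        else:
            transcript += output  # keep intermediate reasoning text
    return None  # no answer within the step budget

# Stub model that searches once, then answers (for demonstration only).
_turns = iter([
    "<tool_call>search: capital of France</tool_call>",
    "<answer>Paris</answer>",
])
demo_generate = lambda _ctx: next(_turns)
demo_tools = {"search": lambda q: "Paris is the capital of France."}
answer = run_research_loop(demo_generate, demo_tools,
                           "What is the capital of France?")
print(answer)  # prints "Paris"
```

In the real agent, `generate` is the PokeeResearch-7B model and the tools perform live web search and page reading; the loop structure is the same.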
### Downstream Use
PokeeResearch-7B can be **fine-tuned** or **extended** for:
- Domain-specific scientific discovery
- Autonomous document retrieval and synthesis
- Multi-source verification and summarization pipelines
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)
### Out-of-Scope Use
The model should **not** be used for:
- Generating unverified or speculative claims
- Automated decision-making in high-stakes domains (medical, legal, or financial)
- Applications requiring strict factual precision without external verification
- Generating content without citation or evidence tracing
---
## Bias, Risks, and Limitations
PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
- Dependence on **external data quality** and **retrieval accuracy**
- Potential **semantic bias** introduced by AI-based feedback signals
- Limited coverage for **non-English** or **multi-modal** reasoning tasks
- Risk of **hallucinated synthesis** when sources conflict or lack clarity
### Recommendations
Users should:
- Cross-verify answers, especially in multi-hop reasoning cases
- Monitor output for citation accuracy and alignment with source data
- Refrain from using outputs as sole evidence in decision-critical contexts
---
## How to Get Started with the Model
Please refer to the [PokeeResearchOSS README](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md) for instructions on running PokeeResearch-7B.
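As a minimal starting point, the checkpoint can be loaded with standard `transformers` APIs (a sketch of plain chat inference; the full agentic loop with search and reading tools requires the PokeeResearchOSS codebase):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/pokee_research_7b"

# Load tokenizer and model; bf16 matches the training precision.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Who proposed the transformer architecture?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that without tool access the model cannot retrieve external evidence, so answers from bare chat inference should be treated with extra caution.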
---
## Training Details
### Training Data
- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025)
- **Data characteristics:** Complex, multi-turn question-answer pairs requiring multi-step reasoning
- **Data filtering:** No benchmark test data was included in training; the model was trained only on open-domain text Q&A samples
### Training Procedure
#### Preprocessing
- Normalization and tokenization aligned with Qwen2.5 tokenizer
- Structured prompt-response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`)
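The structured format can be checked with a small tag parser (an illustrative sketch; the exact schema lives in the training code):

```python
import re

# Minimal parser for the structured agent format used in training:
# a turn may contain <tool_call>, <answer>, and <verification> blocks.
TAGS = ("tool_call", "answer", "verification")

def parse_turn(text):
    """Return a dict mapping each tag present in `text` to its content."""
    parsed = {}
    for tag in TAGS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        if m:
            parsed[tag] = m.group(1).strip()
    return parsed

turn = "<answer>Paris</answer><verification>Matches the retrieved source.</verification>"
print(parse_turn(turn))
# prints {'answer': 'Paris', 'verification': 'Matches the retrieved source.'}
```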
#### Training Hyperparameters
- **Algorithm:** RLOO (REINFORCE Leave-One-Out)
- **Batch size:** 64
- **Research threads per prompt:** 8
- **Learning rate:** 3e-6
- **Context limit:** 32,768 tokens
- **Steps:** 140 fine-tuning iterations
- **Regularization:** None (no entropy or KL regularization)
- **Precision regime:** bf16 mixed precision
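RLOO uses the other research threads sampled for the same prompt as a baseline: thread *i*'s advantage is its reward minus the mean reward of the remaining threads. A minimal sketch of that computation:

```python
def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out advantages: the baseline for each sample
    is the mean reward of the *other* samples for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    # advantage_i = r_i - mean(r_j for j != i)
    return [r - (total - r) / (k - 1) for r in rewards]

# 8 research threads for one prompt (k = 8, as in training).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
advs = [round(a, 3) for a in rloo_advantages(rewards)]
print(advs)
# prints [0.429, -0.714, 0.429, 0.429, -0.714, 0.429, -0.714, 0.429]
```

Because the baseline excludes the current sample, the gradient estimate stays unbiased while still reducing variance, which is why no extra entropy or KL regularization is needed here.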
#### Reward Design
- Combined reward signal from:
  - **AI feedback** (semantic equivalence judged by an external LLM)
  - **Format-adherence reward** (ensures correct agent behavior)
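The combination can be sketched as follows. The format check and the weighting `w_format` are illustrative assumptions; the paper defines the exact blend of the AI-judge signal and the format reward.

```python
import re

def format_reward(completion):
    """1.0 if the completion follows the expected agent format:
    exactly one <answer>...</answer> block (illustrative check only)."""
    n = len(re.findall(r"<answer>.*?</answer>", completion, re.S))
    return 1.0 if n == 1 else 0.0

def combined_reward(correctness, completion, w_format=0.1):
    """Blend the AI-judge correctness score with format adherence.
    The 0.1 weighting is an assumption for this sketch."""
    return correctness + w_format * format_reward(completion)

good = "<answer>Paris</answer>"
r_good = combined_reward(1.0, good)      # judged correct, well-formatted
r_bare = combined_reward(1.0, "Paris")   # judged correct, missing tags
print(r_good, r_bare)
```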
#### Speeds, Sizes, Times
- **Model size:** 7 billion parameters
- **Training duration:** ~5 days on 8 × A100 80G GPUs
- **Checkpoint size:** ~13 GB
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
10 open-domain research and QA benchmarks:
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity's Last Exam
#### Factors
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).
#### Metrics
- Mean accuracy (mean@4): accuracy averaged over four independent research threads per question
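The mean@k metric can be computed as a per-question average over k independent research threads, then averaged over questions (sketch):

```python
def mean_at_k(per_thread_correct):
    """per_thread_correct: one list per question, each holding k
    0/1 correctness flags (k = 4 independent research threads).
    Returns accuracy averaged over threads, then over questions."""
    per_q = [sum(flags) / len(flags) for flags in per_thread_correct]
    return sum(per_q) / len(per_q)

# Two questions, four threads each.
runs = [[1, 1, 0, 1], [0, 1, 0, 0]]
score = mean_at_k(runs)
print(score)  # (0.75 + 0.25) / 2 = 0.5
```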
### Results
**PokeeResearch-7B** (PR) and its **Research Threads Synthesis variant** (PR+) outperform all baselines at the 7B scale across the 10 benchmarks.
Highlights (mean@4 accuracy):
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |
#### Summary
PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF training and reasoning-scaffold design for robust, verifiable research workflows.
---
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning
### Compute Infrastructure
#### Hardware
- 8× NVIDIA A100 80 GB GPUs for training; 1× for inference
---
## Citation
**BibTeX:**
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan* and Jiuqi Wang* and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```
**APA:**
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.
---
## Glossary
- **RLAIF:** Reinforcement Learning from AI Feedback; policy optimization using LLM-based reward signals.
- **RLOO:** REINFORCE Leave-One-Out; an unbiased policy-gradient variant for on-policy learning.
- **RTS:** Research Threads Synthesis; synthesis of multiple independent reasoning threads at inference time.
---
## More Information
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
For inquiries, contact: hello@pokee.ai
---
## Model Card Authors
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team
## Model Card Contact
Pokee AI Team — hello@pokee.ai