---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- adaptive-rag
- uncertainty-quantification
- retrieval-augmented-generation
- question-answering
- reinforcement-learning
- grpo
language:
- en
---
# uncertain-calibrate
This model is fine-tuned from `meta-llama/Llama-3.1-8B-Instruct` with **GRPO reinforcement learning** to emit a special `<uncertain>` token whenever it is unsure of a fact during reasoning, enabling uncertainty-guided adaptive retrieval.
## What it does
The model reasons step-by-step and inserts `<uncertain>` at any point where it lacks confidence in a fact. A lightweight ridge regression probe (trained on layer-13 hidden states at the `<uncertain>` span) then decides whether to trigger BM25 retrieval and a second-pass generation.
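The first-pass output can be post-processed to locate `<uncertain>` markers and recover the text preceding each one, for example to build a retrieval query. A minimal sketch; the helper name `find_uncertain_spans` and the fixed-window heuristic are illustrative, not part of the released code:

```python
import re

def find_uncertain_spans(text, window=80):
    """Locate <uncertain> markers in a first-pass generation and return
    the preceding context for each, usable as a retrieval query."""
    spans = []
    for m in re.finditer(re.escape("<uncertain>"), text):
        start = max(0, m.start() - window)
        spans.append(text[start:m.start()].strip())
    return spans

draft = ("Interstellar was directed by Christopher Nolan. It premiered in "
         "<uncertain> October 2014 at a festival.")
print(find_uncertain_spans(draft))
```
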
## Training
- **Base model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Training method**: GRPO (Group Relative Policy Optimization) with EM-based reward; the model is rewarded for correct final answers, encouraging it to emit `<uncertain>` in contexts where retrieval would help
- **Target datasets**: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
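The exact reward code is not published here, but an EM-based reward for GRPO typically parses the final `Answer:` line, normalizes it, and checks exact match against the gold answers. A hedged sketch; the function names and the normalization rules (lowercasing, punctuation and article removal, as in common open-domain QA evaluation) are assumptions:

```python
import re
import string

def normalize(s):
    """SQuAD-style answer normalization: lowercase, strip punctuation
    and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_reward(completion, gold_answers):
    """Binary reward: 1.0 iff the 'Answer:' line exactly matches a gold answer."""
    m = re.search(r"Answer:\s*(.+)\s*$", completion, re.MULTILINE)
    if not m:
        return 0.0
    pred = normalize(m.group(1))
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0
```

Because the reward depends only on the final answer, the policy is free to place `<uncertain>` wherever doing so (and the retrieval it can trigger) improves the chance of answering correctly.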
## Retrieval gating (probe)
To use this model for adaptive RAG, a separate ridge regression probe must be trained on layer-13 hidden states over the `<uncertain>` spans. On held-out data the probe reaches an AUROC of ~0.82. A ready-made probe artifact, `uncertain_probe_layer13_alpha3000.pkl`, is available in the [AdaRAGUE repository](https://github.com/JamesJunyuGuo/AdaRAGUE).
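As an illustration of the gating idea (not the released probe), a ridge probe can be fit in closed form on pooled hidden states and thresholded to decide whether to retrieve. The sketch below uses random stand-in features and NumPy only; the feature dimension, labels, threshold, and the `should_retrieve` helper are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for mean-pooled layer-13 hidden states over <uncertain>
# spans (the real hidden size is 4096 for Llama-3.1-8B; 64 here for brevity)
# and binary labels indicating whether retrieval helped on that example.
d = 64
X = rng.standard_normal((200, d))
y = rng.integers(0, 2, 200).astype(float)

alpha = 3000.0  # matches the alpha in the artifact filename
# Closed-form ridge regression: w = (X^T X + alpha * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def should_retrieve(h, threshold=0.5):
    """Gate BM25 retrieval on the probe's continuous score."""
    return bool(float(h @ w) >= threshold)
```

In the actual pipeline the probe score would be computed from the model's own layer-13 activations at generation time, and a positive decision triggers retrieval plus a second-pass generation.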
## Evaluation (dev_500_subsampled, 500 questions × 5 datasets, with probe gating)
| Dataset | EM | F1 | Trigger Rate |
|---|---|---|---|
| HotpotQA | 32.6 | 42.7 | 67.4% |
| MuSiQue | 7.6 | 14.1 | 94.2% |
| 2WikiMultiHopQA | 26.2 | 29.6 | 59.2% |
| NQ | 31.4 | 41.0 | 52.0% |
| TriviaQA | 56.6 | 63.2 | 34.0% |
| **Overall** | **30.9** | **38.1** | **61.4%** |
*Trigger rate* = fraction of questions for which the probe decided to retrieve.
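EM and F1 in the table are presumably the standard open-domain QA metrics, with F1 computed as token-level overlap between prediction and gold answer. A small sketch of that F1 computation (the `f1` helper is illustrative; real evaluation would also apply answer normalization):

```python
from collections import Counter

def f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

print(round(f1("christopher nolan", "Christopher Nolan directed it"), 3))  # → 0.667
```
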
## Intended use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jamesjunyuguo/uncertain-calibrate")
model = AutoModelForCausalLM.from_pretrained("jamesjunyuguo/uncertain-calibrate")

SYSTEM = (
    "You are a helpful reasoning assistant. Think step by step. "
    "If at any point you are uncertain about a fact, emit the special token "
    "<uncertain> to signal that you need more information. "
    "End your response with 'Answer: <your answer>' on the last line."
)

prompt = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Who directed the film Interstellar?"},
], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```