---
language:
- en
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- question-answering
- uncertainty-estimation
- retrieval-augmented-generation
- calibration
- llama-3.1
pipeline_tag: text-generation
---
# verbal-calibrate
This checkpoint is a fine-tuned variant of `meta-llama/Llama-3.1-8B-Instruct` for factual QA with explicit verbalized confidence.
## Intended behavior
Given a factual question, the model answers step by step and ends with exactly:
```text
Answer: <answer>
Confidence: <decimal between 0 and 1>
```
The confidence score is intended to reflect the model's uncertainty about the answer and can be used as a retrieval trigger in adaptive RAG pipelines.
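For downstream use, the trailing two lines can be pulled out with a small helper. A minimal sketch (the function name and regexes are illustrative, not part of the model's API):

```python
import re

def parse_verbal_output(text: str):
    """Extract the final Answer/Confidence lines from a model response.

    Returns (None, None) when the response does not follow the format.
    """
    answer = re.search(r"^Answer:\s*(.+)$", text, re.MULTILINE)
    conf = re.search(r"^Confidence:\s*([01](?:\.\d+)?)$", text, re.MULTILINE)
    if answer is None or conf is None:
        return None, None
    return answer.group(1).strip(), float(conf.group(1))

answer, confidence = parse_verbal_output(
    "The capital of France is Paris.\nAnswer: Paris\nConfidence: 0.95"
)
```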
## Motivation
- Adaptive retrieval gating with verbalized confidence
- Confidence-aware factual QA
- Research on uncertainty calibration and selective retrieval
## Retrieval gating
At inference, a verbalized confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model fetches external evidence only when it needs it.
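The gating loop can be sketched as follows. This is a sketch, not the released pipeline: `generate`, `retrieve_bm25`, and `parse` are placeholder callables standing in for whatever generation wrapper, retriever, and output parser the pipeline uses:

```python
THRESHOLD = 0.5  # confidence below this triggers retrieval

def answer_with_gating(question, generate, retrieve_bm25, parse):
    """First pass without retrieval; fall back to a retrieval-augmented
    second pass when verbalized confidence is low or unparseable."""
    first = generate(question)
    answer, confidence = parse(first)
    if confidence is not None and confidence >= THRESHOLD:
        return answer, confidence, False  # confident: no retrieval needed
    # Low (or missing) confidence: fetch evidence and regenerate with context.
    docs = retrieve_bm25(question)
    second = generate(question, context=docs)
    answer, confidence = parse(second)
    return answer, confidence, True
```

The selective second pass is what keeps retrieval cost proportional to the trigger rate rather than the full question load.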
## Training
- **Base model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Training method**: Supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
- **Target datasets**: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
## Evaluation (dev_500_subsampled, 500 questions × 5 datasets)
| Dataset | EM | F1 | Trigger Rate |
|---|---|---|---|
| HotpotQA | 32.0 | 43.8 | 61.6% |
| MuSiQue | 11.8 | 18.8 | 76.8% |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2% |
| NQ | 32.4 | 44.4 | 25.0% |
| TriviaQA | 53.2 | 62.5 | 28.8% |
| **Overall** | **31.6** | **40.5** | **48.1%** |
Trigger rate = fraction of questions where confidence < 0.5 triggered retrieval.
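The trigger rate in the table is just the fraction of confidences below the 0.5 threshold, e.g. (illustrative values, not the evaluation data):

```python
def trigger_rate(confidences, threshold=0.5):
    """Fraction of questions whose verbalized confidence falls below the
    retrieval threshold, i.e. the fraction that trigger a retrieval pass."""
    return sum(c < threshold for c in confidences) / len(confidences)

rate = trigger_rate([0.3, 0.9, 0.45, 0.8])  # two of four fall below 0.5
```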
## Intended use
```python
import re

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained("your-username/verbal-calibrate")

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question="What is the capital of France?"),
}], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Parse the trailing confidence line (assumes the model followed the format);
# a value below 0.5 would trigger the retrieval pass described above.
confidence = float(re.search(r"Confidence:\s*([01](?:\.\d+)?)", response).group(1))
```