Initialize project; model provided by the ModelHub XC community
Model: jamesjunyuguo/verbal-calibrate Source: Original Platform
---
language:
- en
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- question-answering
- uncertainty-estimation
- uncertainty-quantification
- retrieval-augmented-generation
- adaptive-rag
- calibration
- llama-3.1
pipeline_tag: text-generation
---

# verbal-calibrate

This checkpoint is a fine-tuned variant of `meta-llama/Llama-3.1-8B-Instruct` that expresses **calibrated verbal confidence** for factual QA and adaptive retrieval-augmented generation (RAG).

## What it does

Given a factual question, the model reasons step by step and ends every response with exactly two lines:

```text
Answer: <answer>
Confidence: <decimal between 0 and 1>
```

The confidence score is intended to reflect the model's genuine uncertainty about its answer. At inference, a confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model selectively retrieves only when it needs external evidence.

## Motivation

- Adaptive retrieval gating with verbalized confidence
- Confidence-aware factual QA
- Research on uncertainty calibration and selective retrieval
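The gating rule above can be sketched as a small post-processing step. This is a minimal sketch, not the released inference code: the regex and the handling of malformed outputs are assumptions; only the two-line format and the 0.5 threshold come from this card.

```python
import re

def parse_confidence(response: str):
    """Extract the final Answer/Confidence lines from a model response."""
    answer = re.search(r"^Answer:\s*(.+)$", response, re.MULTILINE)
    conf = re.search(r"^Confidence:\s*([01](?:\.\d+)?)\s*$", response, re.MULTILINE)
    if not (answer and conf):
        return None, None  # malformed output: caller should treat as low confidence
    return answer.group(1).strip(), float(conf.group(1))

def needs_retrieval(response: str, threshold: float = 0.5) -> bool:
    """Trigger retrieval when confidence is missing or below the threshold."""
    _, confidence = parse_confidence(response)
    return confidence is None or confidence < threshold

demo = "Paris is the capital of France.\nAnswer: Paris\nConfidence: 0.92"
print(needs_retrieval(demo))  # False: confident answer, no retrieval pass needed
```

Treating malformed outputs as low-confidence is a conservative default: when the model fails to emit the required two lines, the pipeline falls back to retrieval rather than trusting an unparseable answer.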
## Training

- **Base model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Training method**: Supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
- **Target datasets**: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
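As a concrete illustration of the supervision target, every training label must end with the card's required two lines. The helper below is hypothetical (the actual training prompt/label layout is not published here); only the trailing `Answer:`/`Confidence:` format is specified by this card.

```python
def format_target(reasoning: str, answer: str, confidence: float) -> str:
    # Hypothetical SFT target: free-form reasoning, then the two required lines.
    return f"{reasoning}\nAnswer: {answer}\nConfidence: {confidence:.2f}"

target = format_target(
    "The Eiffel Tower is in Paris, and Paris is the capital of France.",
    "Paris",
    0.95,
)
print(target.splitlines()[-2:])  # ['Answer: Paris', 'Confidence: 0.95']
```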
## Evaluation (dev_500_subsampled, 500 questions × 5 datasets)

| Dataset | EM | F1 | Trigger Rate |
|---|---|---|---|
| HotpotQA | 32.0 | 43.8 | 61.6% |
| MuSiQue | 11.8 | 18.8 | 76.8% |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2% |
| NQ | 32.4 | 44.4 | 25.0% |
| TriviaQA | 53.2 | 62.5 | 28.8% |
| **Overall** | **31.6** | **40.5** | **48.1%** |

Trigger rate = fraction of questions on which confidence < 0.5 triggered retrieval.
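The **Overall** row is consistent with an unweighted (macro) average over the five datasets, which can be checked directly from the per-dataset numbers above:

```python
# Per-dataset scores copied from the evaluation table.
em = [32.0, 11.8, 28.4, 32.4, 53.2]
f1 = [43.8, 18.8, 32.9, 44.4, 62.5]
trigger = [61.6, 76.8, 48.2, 25.0, 28.8]

def macro(xs):
    """Unweighted mean, rounded to one decimal like the table."""
    return round(sum(xs) / len(xs), 1)

print(macro(em), macro(f1), macro(trigger))  # 31.6 40.5 48.1
```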
## Intended use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained("your-username/verbal-calibrate")

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question="What is the capital of France?")
}], tokenize=False, add_generation_prompt=True)

# Generate, then decode only the newly generated tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
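When the confidence gate fires, a second-pass prompt can be assembled from the retrieved passages. This is a minimal sketch: the passage numbering and prompt wording are assumptions, since the card only specifies BM25 retrieval followed by a second-pass generation with retrieved context.

```python
def build_second_pass_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved evidence to the question for the second generation pass."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use the following retrieved passages to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence"
    )

prompt = build_second_pass_prompt(
    "What is the capital of France?",
    ["Paris is the capital and largest city of France."],
)
```

The second-pass prompt reuses the same two-line output contract, so the same parsing logic applies to both generation passes.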