Initialize project; model provided by the ModelHub XC community
Model: jamesjunyuguo/verbal-calibrate Source: Original Platform
---
language:
- en
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- question-answering
- uncertainty-estimation
- uncertainty-quantification
- retrieval-augmented-generation
- adaptive-rag
- calibration
- llama-3.1
pipeline_tag: text-generation
---

# verbal-calibrate

This checkpoint is a fine-tuned variant of `meta-llama/Llama-3.1-8B-Instruct` that expresses **calibrated verbal confidence** for factual QA and adaptive retrieval-augmented generation (RAG).

## What it does

Given a factual question, the model reasons step by step and ends every response with exactly two lines:

```text
Answer: <answer>
Confidence: <decimal between 0 and 1>
```

The confidence score is intended to reflect the model's genuine uncertainty about its answer. At inference, a confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model selectively retrieves only when it needs external evidence.

## Motivation

- Adaptive retrieval gating with verbalized confidence
- Confidence-aware factual QA
- Research on uncertainty calibration and selective retrieval
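The gating rule above can be sketched as a small post-processing step. This is a minimal sketch, not the released inference code: the regex and the handling of malformed outputs are assumptions; only the two-line format and the 0.5 threshold come from this card.

```python
import re

def parse_confidence(response: str):
    """Extract the final Answer/Confidence lines from a model response."""
    answer = re.search(r"^Answer:\s*(.+)$", response, re.MULTILINE)
    conf = re.search(r"^Confidence:\s*([01](?:\.\d+)?)\s*$", response, re.MULTILINE)
    if not (answer and conf):
        return None, None  # malformed output: caller should treat as low confidence
    return answer.group(1).strip(), float(conf.group(1))

def needs_retrieval(response: str, threshold: float = 0.5) -> bool:
    """Trigger retrieval when confidence is missing or below the threshold."""
    _, confidence = parse_confidence(response)
    return confidence is None or confidence < threshold

demo = "Paris is the capital of France.\nAnswer: Paris\nConfidence: 0.92"
print(needs_retrieval(demo))  # False: confident answer, no retrieval pass needed
```

Treating malformed outputs as low-confidence is a conservative default: when the model fails to emit the required two lines, the pipeline falls back to retrieval rather than trusting an unparseable answer.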
## Training

- **Base model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Training method**: Supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
- **Target datasets**: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
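As a concrete illustration of the supervision target, every training label must end with the card's required two lines. The helper below is hypothetical (the actual training prompt/label layout is not published here); only the trailing `Answer:`/`Confidence:` format is specified by this card.

```python
def format_target(reasoning: str, answer: str, confidence: float) -> str:
    # Hypothetical SFT target: free-form reasoning, then the two required lines.
    return f"{reasoning}\nAnswer: {answer}\nConfidence: {confidence:.2f}"

target = format_target(
    "The Eiffel Tower is in Paris, and Paris is the capital of France.",
    "Paris",
    0.95,
)
print(target.splitlines()[-2:])  # ['Answer: Paris', 'Confidence: 0.95']
```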
## Evaluation (dev_500_subsampled, 500 questions × 5 datasets)

| Dataset | EM | F1 | Trigger Rate |
|---|---|---|---|
| HotpotQA | 32.0 | 43.8 | 61.6% |
| MuSiQue | 11.8 | 18.8 | 76.8% |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2% |
| NQ | 32.4 | 44.4 | 25.0% |
| TriviaQA | 53.2 | 62.5 | 28.8% |
| **Overall** | **31.6** | **40.5** | **48.1%** |

Trigger rate = fraction of questions on which confidence < 0.5 triggered retrieval.
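The **Overall** row is consistent with an unweighted (macro) average over the five datasets, which can be checked directly from the per-dataset numbers above:

```python
# Per-dataset scores copied from the evaluation table.
em = [32.0, 11.8, 28.4, 32.4, 53.2]
f1 = [43.8, 18.8, 32.9, 44.4, 62.5]
trigger = [61.6, 76.8, 48.2, 25.0, 28.8]

def macro(xs):
    """Unweighted mean, rounded to one decimal like the table."""
    return round(sum(xs) / len(xs), 1)

print(macro(em), macro(f1), macro(trigger))  # 31.6 40.5 48.1
```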
## Intended use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained("your-username/verbal-calibrate")

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question="What is the capital of France?")
}], tokenize=False, add_generation_prompt=True)

# Generate, then decode only the newly generated tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
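When the confidence gate fires, a second-pass prompt can be assembled from the retrieved passages. This is a minimal sketch: the passage numbering and prompt wording are assumptions, since the card only specifies BM25 retrieval followed by a second-pass generation with retrieved context.

```python
def build_second_pass_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved evidence to the question for the second generation pass."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use the following retrieved passages to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence"
    )

prompt = build_second_pass_prompt(
    "What is the capital of France?",
    ["Paris is the capital and largest city of France."],
)
```

The second-pass prompt reuses the same two-line output contract, so the same parsing logic applies to both generation passes.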