---
language:
- en
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- question-answering
- uncertainty-estimation
- uncertainty-quantification
- retrieval-augmented-generation
- adaptive-rag
- calibration
- llama-3.1
pipeline_tag: text-generation
---

# verbal-calibrate

verbal-calibrate is a fine-tuned variant of `meta-llama/Llama-3.1-8B-Instruct` for factual QA with explicit, **calibrated verbal confidence**, intended for adaptive retrieval-augmented generation (RAG).

## What it does

Given a factual question, the model reasons step by step and ends every response with exactly two lines:

```text
Answer: $Answer
Confidence: $Confidence
```

The confidence score is a decimal between 0 and 1 and is intended to reflect the model's uncertainty about the answer. At inference time, a confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model fetches external evidence only when it needs it.

## Motivation

- Adaptive retrieval gating with verbalized confidence
- Confidence-aware factual QA
- Research on uncertainty calibration and selective retrieval

## Training

- **Base model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Method**: supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
- **Target datasets**: multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)

## Evaluation (dev_500_subsampled, 500 questions × 5 datasets)

| Dataset | EM | F1 | Trigger Rate |
|---|---|---|---|
| HotpotQA | 32.0 | 43.8 | 61.6% |
| MuSiQue | 11.8 | 18.8 | 76.8% |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2% |
| NQ | 32.4 | 44.4 | 25.0% |
| TriviaQA | 53.2 | 62.5 | 28.8% |
| **Overall** | **31.6** | **40.5** | **48.1%** |

Trigger rate = fraction of questions where confidence < 0.5 triggered retrieval.

## Intended use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained("your-username/verbal-calibrate")

question = "What is the capital of France?"

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question=question)
}], tokenize=False, add_generation_prompt=True)

# Generate, then decode only the newly produced tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
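The checkpoint ships only the model; the retrieval gate itself is a few lines of application code. Below is a minimal sketch of that loop, continuing from the snippet above. The `bm25_search` helper is a hypothetical stand-in for your own BM25 index (e.g. built with the `rank_bm25` package), and the exact second-pass prompt used in training is not documented here, so the context-prefixed prompt is an assumption; the 0.5 threshold matches the trigger rule reported in the evaluation.

```python
import re

def parse_answer_confidence(text):
    """Pull the final 'Answer:' and 'Confidence:' lines out of a response."""
    answers = re.findall(r"^Answer:\s*(.+?)\s*$", text, flags=re.M)
    confs = re.findall(r"^Confidence:\s*([01](?:\.\d+)?)\s*$", text, flags=re.M)
    if not answers or not confs:
        return None, None  # malformed output; treat as maximally uncertain
    return answers[-1], float(confs[-1])

answer, confidence = parse_answer_confidence(response)

# Second pass: retrieve only when the model is unsure (or the format broke).
if confidence is None or confidence < 0.5:
    passages = bm25_search(question, k=5)  # hypothetical BM25 helper over your corpus
    second_prompt = tokenizer.apply_chat_template([{
        "role": "user",
        "content": (
            "Context:\n{context}\n\n"
            "Answer the following factual question step by step, then state your "
            "answer and how confident you are.\n\n"
            "{question}\n\n"
            "Your response must end with exactly these two lines:\n"
            "Answer: $Answer\n"
            "Confidence: $Confidence"
        ).format(context="\n\n".join(passages), question=question)
    }], tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(second_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    answer, confidence = parse_answer_confidence(response)
```

Treating a malformed response as low confidence keeps the gate fail-safe: when the model breaks format, the pipeline falls back to retrieval rather than trusting an unparsed answer.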
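Since the training recipe includes a calibration stage, it is worth verifying on your own data that expressed confidence tracks empirical accuracy. A standard check is expected calibration error (ECE): bin predictions by stated confidence and compare each bin's mean confidence to its accuracy. The sketch below assumes you have collected `(confidence, is_correct)` pairs from an eval run; the `records` values shown are placeholders, not results.

```python
def expected_calibration_error(records, n_bins=10):
    """ECE over (confidence, is_correct) pairs:
    bin-count-weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / len(records) * abs(avg_conf - accuracy)
    return ece

# Placeholder records; in practice, parse the confidence from each response and
# score correctness with EM/F1 against gold answers.
records = [(0.9, True), (0.8, True), (0.7, False), (0.4, False), (0.3, True)]
print(f"ECE: {expected_calibration_error(records):.3f}")
print(f"Trigger rate: {sum(c < 0.5 for c, _ in records) / len(records):.1%}")
```

The same records also reproduce the table's trigger rate, since it is just the share of questions whose stated confidence falls below 0.5.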