Initialize project; model provided by the ModelHub XC community

Model: Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering Source: Original Platform
2026-04-27 04:45:32 +08:00
commit 14713c3f6d
12 changed files with 2629 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,134 @@
+---
+library_name: transformers
+license: cc-by-nc-4.0
+base_model: meta-llama/Llama-3.1-8B-Instruct
+tags:
+- rag
+- filtering
+---
+
+
+## Model Description
+
+This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), trained for 🚀**evidence relevance classification or evidence filtering**🚀 in medical RAG pipelines.  
+Given a clinical query and a candidate passage, the model outputs *“Yes”* if the passage contains supporting evidence and *“No”* otherwise.
+
+This lightweight classifier is designed to help researchers:
+- Improve retrieval quality in medical RAG systems.
+- Filter irrelevant passages before generation.
+- Build more reliable, interpretable RAG pipelines for medical QA.
+
+For additional context, methodology, and full experimental details, please refer to our paper below.
+
+📄 **Paper**: [Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights](https://arxiv.org/abs/2511.06738) 
+
+
+## Quick Start
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+
+# Instruction used during training
+INSTRUCTION = (
+    "Given a query and a text passage, determine whether the passage contains supporting evidence for the query. "
+    "Supporting evidence means that the passage provides clear, relevant, and factual information that directly backs or justifies the answer to the query.\n\n"
+    "Respond with one of the following labels:\n\"Yes\" if the passage contains supporting evidence for the query.\n"
+    "\"No\" if the passage does not contain supporting evidence.\n"
+    "You should respond with only the label (Yes or No) without any additional explanation."
+)
+
+# Example query + retrieved passage
+query = "What is the first-line treatment for acute angle-closure glaucoma?"
+doc = "Acute angle-closure glaucoma requires immediate treatment with topical beta-blockers, alpha agonists, and systemic carbonic anhydrase inhibitors."
+
+# Build chat-style prompt
+content = tokenizer.apply_chat_template(
+    [
+        {"role": "system", "content": INSTRUCTION},
+        {"role": "user", "content": f"Question: {query}\nPassage: {doc}"}
+    ],
+    add_generation_prompt=True,
+    tokenize=False,
+)
+
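+# Tokenize. The chat template already inserts the BOS token, so do not add it a second time.
+input_ids = tokenizer(content, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)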
+
+# Define stopping tokens (Llama-3 style)
+terminators = [
+    tokenizer.eos_token_id,
+    tokenizer.convert_tokens_to_ids("<|eot_id|>")
+]
+
+# Generate evidence-filtering judgment (greedy decoding)
+outputs = model.generate(
+    input_ids=input_ids,
+    max_new_tokens=256,
+    eos_token_id=terminators,
+    do_sample=False,  # deterministic; sampling parameters such as temperature are not used
+)
+
+# Decode model response
+response = outputs[0][input_ids.shape[-1]:]
+print(tokenizer.decode(response, skip_special_tokens=True))
+```
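+
+To use the model as a filter inside a RAG pipeline, the decoded response has to be mapped to a keep/drop decision. The sketch below shows one possible way to do that. `parse_label`, `filter_passages`, and the `judge` callable are hypothetical names that are not part of the released code; `judge` is assumed to wrap the prompt construction and `model.generate` call shown above and to return the decoded response string.
+
+```python
+from typing import Callable, List, Optional
+
+
+def parse_label(response: str) -> Optional[bool]:
+    """Map a raw model response to True ("Yes"), False ("No"), or None."""
+    # Drop surrounding whitespace, quotes, and a trailing period before comparing.
+    text = response.strip().strip("\"'.").lower()
+    if text == "yes":
+        return True
+    if text == "no":
+        return False
+    return None  # unexpected output; the caller decides how to handle it
+
+
+def filter_passages(
+    query: str,
+    passages: List[str],
+    judge: Callable[[str, str], str],
+) -> List[str]:
+    """Keep only the passages for which the judge answers "Yes"."""
+    # `judge` is a hypothetical callable: (query, passage) -> decoded response.
+    return [p for p in passages if parse_label(judge(query, p)) is True]
+```
+
+Passing the model call in as a function keeps the filtering logic independent of how inference is run, and treating unparseable responses as "drop" is a conservative choice that may not suit every pipeline.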
+
+
+## Training Setup
+
+- **Dataset:** 3,200 query–passage pairs with expert-provided Yes/No labels (dataset to be released in a future update).
+- **Task:** Given a query and a candidate passage, the model generates *"Yes"* if the passage contains supporting evidence and *"No"* otherwise.
+- **Objective:** Causal language modeling (cross-entropy next-token loss).
+- **Prompt:** See the *Quick Start* section for an example usage prompt.
+- **Hyperparameter Tuning:** Five-fold cross-validation.
+- **Final Hyperparameters:**  
+  - Learning rate: 2e-6
+  - Batch size: 8  
+  - Epochs: 3
+- **Training Framework:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
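+
+As a rough illustration of five-fold cross-validation on a dataset of this size, the sketch below splits 3,200 example indices into five folds of 640, so each round trains on 2,560 pairs and validates on the remaining 640. The function name `k_fold_splits`, the shuffling, and the seed are assumptions made for illustration; the splitting procedure actually used (for example, whether folds were stratified by label) is not described here.
+
+```python
+import random
+from typing import Iterator, List, Tuple
+
+
+def k_fold_splits(
+    n_examples: int, n_folds: int = 5, seed: int = 0
+) -> Iterator[Tuple[List[int], List[int]]]:
+    """Yield (train_indices, val_indices) for each fold."""
+    indices = list(range(n_examples))
+    random.Random(seed).shuffle(indices)  # assumed: random, unstratified split
+    # Strided slicing gives folds whose sizes differ by at most one.
+    folds = [indices[i::n_folds] for i in range(n_folds)]
+    for k in range(n_folds):
+        val = folds[k]
+        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
+        yield train, val
+
+
+for fold_id, (train, val) in enumerate(k_fold_splits(3200)):
+    print(fold_id, len(train), len(val))  # each fold: 2560 train, 640 validation
+```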
+
+
+## Performance
+
+Evaluation was conducted on 3,200 expert-annotated query–passage pairs using five-fold cross-validation.
+
+| Model                               | Precision | Recall | F1   |
+|-------------------------------------|-----------|--------|------|
+| **Llama-3.1-8B (zero-shot)**        | 0.483     | 0.566  | 0.521 |
+| **GPT-4o (zero-shot)**              | 0.697     | 0.324  | 0.442 |
+| **Llama-3.1-8B (fine-tuned, ours)** | **0.592** | **0.657** | **0.623** |
+
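+🔥 Fine-tuning yields the highest recall and F1 among the compared models. GPT-4o (zero-shot) reaches higher precision (0.697 vs. 0.592) but at a much lower recall (0.324 vs. 0.657).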
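+
+The F1 column is consistent with the harmonic mean of the reported precision and recall, which the short check below reproduces. The helper `f1_score` is written here for illustration only; how the metrics were aggregated across folds is not stated, so this is a consistency check rather than a reproduction of the evaluation.
+
+```python
+def f1_score(precision: float, recall: float) -> float:
+    """Harmonic mean of precision and recall."""
+    if precision + recall == 0:
+        return 0.0
+    return 2 * precision * recall / (precision + recall)
+
+
+# Precision and recall values copied from the table above.
+rows = {
+    "Llama-3.1-8B (zero-shot)": (0.483, 0.566),
+    "GPT-4o (zero-shot)": (0.697, 0.324),
+    "Llama-3.1-8B (fine-tuned)": (0.592, 0.657),
+}
+
+for name, (p, r) in rows.items():
+    # Prints 0.521, 0.442, and 0.623, matching the F1 column.
+    print(f"{name}: F1 = {f1_score(p, r):.3f}")
+```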
+
+
+## Intended Use
+
+This model is intended for research purposes only.
+
+
+## Reference
+
+If you use this model, please cite our paper:
+```bibtex
+@article{kim2025rethinking,
+  title={Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights},
+  author={Kim, Hyunjae and Sohn, Jiwoong and Gilson, Aidan and Cochran-Caggiano, Nicholas and Applebaum, Serina and Jin, Heeju and Park, Seihee and Park, Yujin and Park, Jiyeong and Choi, Seoyoung and others},
+  journal={arXiv preprint arXiv:2511.06738},
+  year={2025}
+}
+```
+
+## Contact
+
+Feel free to email `hyunjae.kim@yale.edu` if you have any questions.