Initialize project; model provided by the ModelHub XC community

Model: metaresearch/PapersRAG-1.5B Source: Original Platform
2026-05-16 18:44:58 +08:00
commit 60c1651765
22 changed files with 531708 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,120 @@
+---
+license: apache-2.0
+language:
+- en
+tags:
+- rag
+- question-answering
+- scientific-literature
+- arxiv
+- nlp
+- research-tool
+pipeline_tag: text-generation
+base_model:
+- Qwen/Qwen2.5-1.5B
+---
+
+# PapersRAG-1.5B 🧪
+
+**A retrieval-augmented generation system for querying recent scientific literature — continuously updated.**
+
+PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a curated knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.
+
+The knowledge base is **automatically refreshed every day** with the latest `cs.CL` papers, so it expands on its own. No manual upkeep required.
+
+---
+
+## Model description
+
+- **Type:** Retrieval-augmented generation (RAG)
+- **Base language model:** Qwen 2.5 1.5B — small, fast, coherent when grounded with good context
+- **Knowledge base:** A continuously growing collection of abstracts from the most recent `cs.CL` papers on arXiv, updated daily via an automated pipeline
+- **Retrieval pipeline:** Dense embeddings for initial candidate retrieval, cross-encoder for re-ranking — only the most relevant chunks reach the language model
+- **Answer style:** Every answer cites the paper title it draws from. If no relevant paper is found, the model says so instead of fabricating one
+
+---
+
+## Intended use
+
+PapersRAG is a **research assistant**. It helps scientists and students locate information within indexed NLP papers, ask survey-style questions like *"What are the latest trends in retrieval-augmented generation?"*, and surface specific details about a paper's methodology or findings, as far as the abstract reports them.
+
+It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.
+
+---
+
+## How it works
+
+1. **Indexing** — Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
+2. **Retrieval** — The bi-encoder fetches a pool of candidate chunks for any given question
+3. **Re-ranking** — A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
+4. **Generation** — Retained chunks are passed as context to the 1.5B model, which generates a cited answer
+5. **Safety** — If nothing clears the confidence threshold, the model refuses to answer rather than hallucinate
+
+No relevant chunk, no answer. That's the rule.
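+
+The retrieve, re-rank, and refuse steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `pipeline.py`: a bag-of-words vector stands in for the bi-encoder, a token-overlap ratio stands in for the cross-encoder, and the names `select_context`, `answer`, `pool_size`, `threshold`, and the sample chunks are all hypothetical.
+
+```python
+import math
+from collections import Counter
+
+REFUSAL = "No relevant paper found in the index."
+
+
+def embed(text):
+    """Toy bi-encoder stand-in: a bag-of-words count vector."""
+    return Counter(text.lower().split())
+
+
+def cosine(a, b):
+    """Cosine similarity between two sparse count vectors."""
+    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
+    norm_a = math.sqrt(sum(v * v for v in a.values()))
+    norm_b = math.sqrt(sum(v * v for v in b.values()))
+    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
+
+
+def rerank_score(question, chunk_text):
+    """Toy cross-encoder stand-in: share of question tokens found in the chunk."""
+    q = set(question.lower().split())
+    c = set(chunk_text.lower().split())
+    return len(q & c) / len(q) if q else 0.0
+
+
+def select_context(question, chunks, pool_size=3, threshold=0.5):
+    """Return the (title, text) chunks that survive retrieval and re-ranking."""
+    q_vec = embed(question)
+    # Stage 1, retrieval: keep the pool_size chunks closest to the question.
+    pool = sorted(chunks, key=lambda c: cosine(q_vec, embed(c[1])), reverse=True)
+    pool = pool[:pool_size]
+    # Stage 2, re-ranking: score each candidate, keep those at or above threshold.
+    scored = sorted(((rerank_score(question, c[1]), c) for c in pool),
+                    key=lambda pair: pair[0], reverse=True)
+    return [chunk for score, chunk in scored if score >= threshold]
+
+
+def answer(question, chunks):
+    kept = select_context(question, chunks)
+    if not kept:
+        return REFUSAL  # nothing cleared the threshold, so refuse
+    # A real pipeline would pass `kept` to the language model at this point.
+    return "Context drawn from: " + ", ".join(title for title, _ in kept)
+
+
+# Hypothetical chunks, for illustration only.
+chunks = [
+    ("Paper A", "dense retrieval improves question answering"),
+    ("Paper B", "tokenization strategies for low resource languages"),
+]
+print(answer("dense retrieval for question answering", chunks))
+print(answer("protein folding benchmarks", chunks))
+```
+
+The first question keeps only "Paper A"; the second matches nothing and returns the refusal string. In this sketch the threshold is applied to the re-ranker score, not the retrieval score, which mirrors step 3.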
+
+---
+
+## Automated daily updates
+
+Every day, the update pipeline:
+
+- Downloads the existing index and chunk store from this repository
+- Scrapes the 100 most recent papers from `cs.CL` on arXiv
+- Chunks, embeds, and appends the new papers to the existing knowledge base
+- Rebuilds the FAISS index and uploads everything back
+
+The knowledge base grows by roughly **100 papers per day**, automatically.
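+
+The chunk-and-append step can be sketched as follows. This is an illustration under assumptions, not the repository's update script: the store layout, the `paper_id` field, the window sizes, and the functions `chunk_words` and `append_papers` are hypothetical, and embedding and the FAISS rebuild are left out. Skipping already-seen IDs is also an assumption; this README does not say how the pipeline treats papers that appear in the "100 most recent" list on consecutive days.
+
+```python
+def chunk_words(text, size=50, overlap=10):
+    """Split text into word windows of `size` that share `overlap` words."""
+    if not 0 <= overlap < size:
+        raise ValueError("overlap must be non-negative and smaller than size")
+    words = text.split()
+    step = size - overlap
+    chunks = []
+    for start in range(0, len(words), step):
+        chunks.append(" ".join(words[start:start + size]))
+        if start + size >= len(words):
+            break  # this window already reaches the end of the text
+    return chunks
+
+
+def append_papers(store, papers, size=50, overlap=10):
+    """Append chunks for papers whose id is not yet in the store.
+
+    `store` is a list of {"paper_id", "title", "text"} dicts (hypothetical layout).
+    Returns the number of papers added.
+    """
+    seen = {row["paper_id"] for row in store}
+    added = 0
+    for paper in papers:
+        if paper["paper_id"] in seen:
+            continue  # already indexed on an earlier run
+        for text in chunk_words(paper["abstract"], size, overlap):
+            store.append({"paper_id": paper["paper_id"],
+                          "title": paper["title"],
+                          "text": text})
+        seen.add(paper["paper_id"])
+        added += 1
+    return added
+
+
+# Hypothetical one-paper batch, applied twice to show the second run is a no-op.
+store = []
+batch = [{"paper_id": "example-1", "title": "Paper A",
+          "abstract": "one two three four five six seven"}]
+print(append_papers(store, batch, size=4, overlap=2))
+print(append_papers(store, batch, size=4, overlap=2))
+print([row["text"] for row in store])
+```
+
+Keeping the title on every chunk row is what lets the generation step cite the paper each chunk came from.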
+
+---
+
+## Quick start
+
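+```python
+import sys
+
+from huggingface_hub import snapshot_download
+
+# Download a snapshot of the repository into the local Hugging Face cache.
+model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")
+
+# pipeline.py ships inside the snapshot, so make it importable first.
+sys.path.insert(0, model_dir)
+from pipeline import PapersRAG
+
+rag = PapersRAG(model_dir)
+
+print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))
+```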
+
+Requires `huggingface_hub`, `transformers`, `sentence-transformers`, and `faiss` (the CPU build is published on PyPI as `faiss-cpu`). Everything else is in `pipeline.py`.
+
+---
+
+## Model composition
+
+| Component | Description |
+|---|---|
+| **Language Model** | Qwen 2.5 1.5B (float16) |
+| **Bi-encoder** | Dense embedding model for initial retrieval |
+| **Cross-encoder** | Re-ranking model that scores chunks for relevance |
+| **Vector Index** | FAISS index of embedded paper chunks |
+| **Knowledge Chunks** | Processed snippets from indexed arXiv abstracts |
+| **Pipeline** | `pipeline.py` — one class, handles loading, retrieval, and generation |
+
+Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.
+
+---
+
+## Limitations
+
+**Knowledge base scope.** Only `cs.CL` papers from arXiv. Papers from other fields are not included unless manually added.
+
+**Abstracts only.** Full paper text is not indexed. Deep methodological comparisons may be incomplete.
+
+**Small language model.** 1.5B parameters is lightweight. The retrieval pipeline handles factual accuracy well, but nuanced multi-paper synthesis has limits.
+
+**English only.**
+
+---
+
+## License
+
+Apache-2.0.
+
+---
+
+*PapersRAG is part of the Meta Research initiative — building open tools that accelerate scientific discovery.*