Initialize project; model provided by the ModelHub XC community

Model: metaresearch/PapersRAG-1.5B
Source: Original Platform
ModelHub XC
2026-05-16 18:44:58 +08:00
commit 60c1651765
22 changed files with 531708 additions and 0 deletions

README.md (new file, 120 lines)

@@ -0,0 +1,120 @@
---
license: apache-2.0
language:
- en
tags:
- rag
- question-answering
- scientific-literature
- arxiv
- nlp
- research-tool
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
---
# PapersRAG-1.5B 🧪
**A retrieval-augmented generation system for querying recent scientific literature — continuously updated.**
PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a curated knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.
The knowledge base is **automatically refreshed every day** with the latest `cs.CL` papers, so it expands without manual upkeep. The language model itself is not retrained.
---
## Model description
- **Type:** Retrieval-augmented generation (RAG)
- **Base language model:** Qwen 2.5 1.5B — small, fast, coherent when grounded with good context
- **Knowledge base:** A continuously growing collection of abstracts from the most recent `cs.CL` papers on arXiv, updated daily via an automated pipeline
- **Retrieval pipeline:** Dense embeddings for initial candidate retrieval, cross-encoder for re-ranking — only the most relevant chunks reach the language model
- **Answer style:** Every answer cites the paper title it draws from. If no relevant paper is found, the model says so instead of fabricating one
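
The re-ranking and answer-style bullets above can be illustrated with a minimal sketch. It is not the code in `pipeline.py`: the `Chunk` class, `toy_relevance`, `build_prompt`, the prompt wording, and the `0.5` threshold are all hypothetical, and the word-overlap score is only a stand-in for a real cross-encoder. The sketch shows the contract: chunks below the threshold are dropped, kept chunks carry their paper title into the prompt, and an empty result means the caller should refuse.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    title: str  # paper title, used as the citation label
    text: str   # snippet taken from the paper's abstract


def toy_relevance(question: str, chunk: Chunk) -> float:
    """Stand-in for a cross-encoder: share of question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.text.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def build_prompt(
    question: str,
    candidates: List[Chunk],
    score_fn: Callable[[str, Chunk], float] = toy_relevance,
    threshold: float = 0.5,  # hypothetical value, not taken from the repository
) -> Optional[str]:
    """Return a prompt with title-tagged context, or None if nothing is relevant."""
    scored = [(score_fn(question, c), c) for c in candidates]
    # Keep only chunks that clear the threshold, best first.
    kept = sorted(
        (pair for pair in scored if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not kept:
        return None  # caller should answer "no relevant paper found"
    context = "\n".join(f"[{c.title}] {c.text}" for _, c in kept)
    return (
        "Answer using only the context and cite the paper titles.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```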
---
## Intended use
PapersRAG is a **research assistant**. It helps scientists and students locate information within indexed NLP papers, ask comparative questions like *"What are the latest trends in retrieval-augmented generation?"*, and surface specific details about a paper's methodology or findings.
It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.
---
## How it works
1. **Indexing** — Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
2. **Retrieval** — The bi-encoder fetches a pool of candidate chunks for any given question
3. **Re-ranking** — A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
4. **Generation** — Retained chunks are passed as context to the 1.5B model, which generates a cited answer
5. **Safety** — If nothing clears the confidence threshold, the model refuses to answer rather than hallucinate
No relevant chunk, no answer. That's the rule.
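
Steps 1 and 2 above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation: `chunk_words`, `embed`, `cosine`, and `retrieve` are hypothetical names, the chunk size and overlap are arbitrary, the bag-of-words vector is a toy stand-in for the dense bi-encoder, and the linear scan stands in for the FAISS index lookup.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(
    question: str, index: List[Tuple[str, Counter]], k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question, best first."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, vec), chunk) for chunk, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```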
---
## Automated daily updates
Every day, the update pipeline:
- Downloads the existing index and chunk store from this repository
- Scrapes the 100 most recent papers from `cs.CL` on arXiv
- Chunks, embeds, and appends the new papers to the existing knowledge base
- Rebuilds the FAISS index and uploads the updated index and chunk store back to this repository
The knowledge base grows by roughly **100 papers per day**, automatically.
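
The append step can be sketched as follows. One assumption goes beyond what is stated here: the sketch skips papers whose identifier is already in the store, since the 100 most recent papers on one day may overlap with those fetched the day before. Whether the actual pipeline deduplicates is not documented, and `merge_new_papers` and the `id` field are hypothetical names.

```python
from typing import Dict, Iterable, List


def merge_new_papers(
    store: List[Dict[str, str]], fetched: Iterable[Dict[str, str]]
) -> int:
    """Append fetched papers not yet in the store; return how many were added.

    Each paper is a dict with at least an "id" key (assumed here to be the
    arXiv identifier). Deduplication by id is an assumption of this sketch.
    """
    seen = {paper["id"] for paper in store}
    added = 0
    for paper in fetched:
        if paper["id"] not in seen:
            store.append(paper)
            seen.add(paper["id"])
            added += 1
    return added
```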
---
## Quick start
```python
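import sys
from huggingface_hub import snapshot_download

# Download the repository snapshot (weights, index, chunks, pipeline.py).
model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")
# pipeline.py ships inside the snapshot, so make it importable first.
sys.path.insert(0, model_dir)
from pipeline import PapersRAG
rag = PapersRAG(model_dir)
print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))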
```
Requires `huggingface_hub`, `transformers`, `sentence-transformers`, and `faiss` (published on PyPI as `faiss-cpu`). Everything else is in `pipeline.py`.
---
## Model composition
| Component | Description |
|---|---|
| **Language Model** | Qwen 2.5 1.5B (float16) |
| **Bi-encoder** | Dense embedding model for initial retrieval |
| **Cross-encoder** | Re-ranking model that scores chunks for relevance |
| **Vector Index** | FAISS index of embedded paper chunks |
| **Knowledge Chunks** | Processed snippets from indexed arXiv abstracts |
| **Pipeline** | `pipeline.py` — one class, handles loading, retrieval, and generation |
Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.
---
## Limitations
**Knowledge base scope.** Only `cs.CL` papers from arXiv. Papers from other fields are not included unless manually added.
**Abstracts only.** Full paper text is not indexed. Deep methodological comparisons may be incomplete.
**Small language model.** 1.5B parameters is lightweight. Retrieval keeps answers grounded in the indexed abstracts, but the model's capacity for nuanced multi-paper synthesis is limited.
**English only.**
---
## License
Apache-2.0.
---
*PapersRAG is part of the Meta Research initiative — building open tools that accelerate scientific discovery.*