---
license: apache-2.0
language:
- en
tags:
- rag
- question-answering
- scientific-literature
- arxiv
- nlp
- research-tool
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
---

# PapersRAG-1.5B 🧪

**A retrieval-augmented generation system for querying recent scientific literature, continuously updated.**

PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.

The knowledge base is **automatically refreshed every day** with the latest `cs.CL` papers, so it expands on its own with no manual upkeep.

---

## Model description

- **Type:** Retrieval-augmented generation (RAG)
- **Base language model:** Qwen 2.5 1.5B, a small, fast model that stays coherent when grounded in good context
- **Knowledge base:** A continuously growing collection of abstracts from the most recent `cs.CL` papers on arXiv, updated daily by an automated pipeline
- **Retrieval pipeline:** Dense embeddings for initial candidate retrieval, then a cross-encoder for re-ranking, so only the most relevant chunks reach the language model
- **Answer style:** Every answer cites the title of the paper it draws from. If no relevant paper is found, the model says so instead of fabricating one
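
The answer style described above could be enforced when the prompt is assembled. The following is a minimal sketch under stated assumptions, not the code in `pipeline.py`: the names `Chunk`, `build_prompt`, and `REFUSAL`, the prompt wording, and the example paper title are all hypothetical. It shows the two behaviours the card promises: context lines carry the paper title so the model can cite it, and an empty context leads to a refusal rather than a generated answer.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical refusal message; the wording used by pipeline.py may differ.
REFUSAL = "I could not find a relevant paper in the indexed knowledge base."


@dataclass
class Chunk:
    """A retrieved snippet plus the title of the paper it came from."""
    title: str
    text: str


def build_prompt(question: str, chunks: List[Chunk]) -> Optional[str]:
    """Return a grounded prompt, or None when there is nothing to cite."""
    if not chunks:
        return None  # the caller should reply with REFUSAL instead of generating
    # Prefix every snippet with its paper title so the answer can cite it.
    context = "\n".join(f"[{c.title}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below and cite paper titles in brackets.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunk used purely for illustration.
chunks = [Chunk("Example Paper Title", "RAG combines retrieval with generation.")]
print(build_prompt("What is RAG?", chunks))
print(build_prompt("What is RAG?", []) or REFUSAL)
```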

---

## Intended use

PapersRAG is a **research assistant**. It helps scientists and students locate information within indexed NLP papers, ask comparative questions like *"What are the latest trends in retrieval-augmented generation?"*, and surface specific details about a paper's methodology or findings as reported in its abstract.

It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.

---

## How it works

1. **Indexing:** Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
2. **Retrieval:** For each question, the bi-encoder fetches a pool of candidate chunks from the index
3. **Re-ranking:** A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
4. **Generation:** Retained chunks are passed as context to the 1.5B model, which generates a cited answer
5. **Safety:** If nothing clears the confidence threshold, the pipeline refuses to answer rather than hallucinate
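
Two of the steps above, overlapping chunking and the threshold gate after re-ranking, can be sketched in plain Python. This is an illustration under assumptions, not the repository's implementation: `chunk_words`, `rerank_and_gate`, and `toy_score` are hypothetical names, the window sizes and the 0.5 threshold are arbitrary, and `toy_score` is a word-overlap stand-in for a real cross-encoder. Chunks scoring at or above the threshold are kept here; the card does not say whether the real comparison is strict.

```python
from typing import Callable, List, Tuple


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into word windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def rerank_and_gate(
    question: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[float, str]]:
    """Score candidates and keep those at or above the threshold, best first."""
    scored = [(score(question, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def toy_score(question: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: fraction of question words in the chunk."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)


kept = rerank_and_gate(
    "dense retrieval",
    ["dense retrieval with bi-encoders", "a new tokenizer"],
    toy_score,
    threshold=0.5,
)
# An empty result means generation is skipped and the pipeline refuses.
print(kept if kept else "No relevant chunk found, so no answer is generated.")
```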

No relevant chunk, no answer. That's the rule.

---

## Automated daily updates

Every day, the update pipeline:

- Downloads the existing index and chunk store from this repository
- Scrapes the 100 most recent papers from `cs.CL` on arXiv
- Chunks, embeds, and appends the new papers to the existing knowledge base
- Rebuilds the FAISS index and uploads everything back
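
The card does not say how the 100 most recent papers are fetched or how repeats are handled. One plausible approach, shown here purely as an assumption, is to query the public arXiv API and skip papers whose identifier is already in the chunk store. The helper names `recent_cs_cl_url` and `merge_new_papers`, the dictionary-based store, and the placeholder identifiers are all hypothetical; the sketch builds the request URL but does not perform any network call.

```python
from typing import Dict, List
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def recent_cs_cl_url(max_results: int = 100) -> str:
    """Build an arXiv API query for the newest cs.CL submissions."""
    params = {
        "search_query": "cat:cs.CL",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def merge_new_papers(store: Dict[str, str], fetched: List[Dict[str, str]]) -> int:
    """Add abstracts keyed by arXiv id; return how many were actually new."""
    added = 0
    for paper in fetched:
        if paper["id"] not in store:  # skip papers indexed on an earlier run
            store[paper["id"]] = paper["abstract"]
            added += 1
    return added


# Placeholder records standing in for yesterday's store and today's fetch.
store = {"0000.00001": "An abstract indexed yesterday."}
fetched = [
    {"id": "0000.00001", "abstract": "An abstract indexed yesterday."},
    {"id": "0000.00002", "abstract": "An abstract fetched today."},
]
print(recent_cs_cl_url())
print(merge_new_papers(store, fetched), "new paper(s) added")
```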

The knowledge base grows by roughly **100 papers per day**, automatically.

---

## Quick start
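
```python
import sys

from huggingface_hub import snapshot_download

# Download the repository snapshot (weights, index, chunk store, pipeline.py).
model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")

# Assumes pipeline.py sits at the repository root; make it importable first.
sys.path.insert(0, model_dir)
from pipeline import PapersRAG

rag = PapersRAG(model_dir)

print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))
```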

Requires `huggingface_hub`, `transformers`, `sentence-transformers`, and `faiss` (for example, the `faiss-cpu` package on PyPI). Everything else is in `pipeline.py`.

---

## Model composition

| Component | Description |
|---|---|
| **Language Model** | Qwen 2.5 1.5B (float16) |
| **Bi-encoder** | Dense embedding model for initial retrieval |
| **Cross-encoder** | Re-ranking model that scores chunks for relevance |
| **Vector Index** | FAISS index of embedded paper chunks |
| **Knowledge Chunks** | Processed snippets from indexed arXiv abstracts |
| **Pipeline** | `pipeline.py`: one class that handles loading, retrieval, and generation |

Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.

---

## Limitations

**Knowledge base scope.** Only `cs.CL` papers from arXiv are indexed. Papers from other fields are not included unless manually added.

**Abstracts only.** Full paper text is not indexed, so deep methodological comparisons may be incomplete.

**Small language model.** At 1.5B parameters, the model is lightweight. Retrieval keeps answers grounded in the indexed abstracts, but nuanced multi-paper synthesis has limits.

**English only.**

---

## License

Apache-2.0.

---

*PapersRAG is part of the Meta Research initiative, building open tools that accelerate scientific discovery.*