---
license: apache-2.0
language:
- en
tags:
- rag
- question-answering
- scientific-literature
- arxiv
- nlp
- research-tool
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
---

# PapersRAG-1.5B 🧪

**A retrieval-augmented generation system for querying recent scientific literature — continuously updated.**

PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a curated knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.

The **knowledge base is automatically refreshed every day** with the latest `cs.CL` papers. It expands on its own. No manual upkeep required.

---

## Model description

- **Type:** Retrieval-augmented generation (RAG)
- **Base language model:** Qwen 2.5 1.5B — small, fast, coherent when grounded with good context
- **Knowledge base:** A continuously growing collection of abstracts from the most recent `cs.CL` papers on arXiv, updated daily via an automated pipeline
- **Retrieval pipeline:** Dense embeddings for initial candidate retrieval, cross-encoder for re-ranking — only the most relevant chunks reach the language model
- **Answer style:** Every answer cites the paper title it draws from. If no relevant paper is found, the model says so instead of fabricating one
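
The answer style above can be illustrated with a small prompt-assembly sketch. This is an assumption about how such a prompt might look, not the prompt used in `pipeline.py`; the function name `build_prompt` and the instruction wording are hypothetical.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks (hypothetical format).

    `chunks` is a list of (paper_title, text) pairs. Returns None when the
    list is empty, signalling that the caller should refuse to answer.
    """
    if not chunks:
        return None
    # Number each chunk and attach its paper title so the model can cite it.
    context = "\n\n".join(
        f"[{i}] Title: {title}\n{text}"
        for i, (title, text) in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the title of every paper you draw from. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Returning `None` for an empty chunk list keeps the "no relevant paper" case out of the language model entirely, so the refusal does not depend on the model following instructions.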

---

## Intended use

PapersRAG is a **research assistant**. It helps scientists and students locate information within indexed NLP papers, ask comparative questions like *"What are the latest trends in retrieval-augmented generation?"*, and surface specific details about a paper's methodology or findings.

It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.

---

## How it works

1. **Indexing** — Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
2. **Retrieval** — The bi-encoder fetches a pool of candidate chunks for any given question
3. **Re-ranking** — A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
4. **Generation** — Retained chunks are passed as context to the 1.5B model, which generates a cited answer
5. **Safety** — If nothing clears the confidence threshold, the model refuses to answer rather than hallucinate

No relevant chunk, no answer. That's the rule.
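
The retrieve, re-rank, and refuse steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: `embed` and `rerank_score` are toy stand-ins for the bi-encoder and cross-encoder, a sorted list stands in for the FAISS search, and `pool_size` and `threshold` are made-up values, not the ones used in `pipeline.py`.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for the bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rerank_score(question, chunk_text):
    """Toy stand-in for the cross-encoder: share of question terms in the chunk."""
    q_terms = set(question.lower().split())
    c_terms = set(chunk_text.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

def retrieve_and_rerank(question, chunks, pool_size=3, threshold=0.5):
    """Return the chunks that clear the re-ranking threshold, best first."""
    q_vec = embed(question)
    # Stage 1: cheap similarity search over all chunks (the FAISS index's role).
    pool = sorted(
        chunks, key=lambda c: cosine(q_vec, embed(c["text"])), reverse=True
    )[:pool_size]
    # Stage 2: score each candidate against the question and filter by threshold.
    scored = [(rerank_score(question, c["text"]), c) for c in pool]
    kept = [(s, c) for s, c in scored if s >= threshold]
    kept.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in kept]

def answer(question, chunks):
    """Refuse when nothing clears the threshold; otherwise hand off to generation."""
    kept = retrieve_and_rerank(question, chunks)
    if not kept:
        return "No relevant paper found in the index."
    titles = ", ".join(c["title"] for c in kept)
    return f"(would generate an answer citing: {titles})"
```

With two toy chunks, one about dense retrieval and one about tokenizers, `answer("dense retrieval", chunks)` reaches the generation branch, while an unrelated question such as `"protein folding"` is refused before any text is generated.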

---

## Automated daily updates

Every day, the update pipeline:

- Downloads the existing index and chunk store from this repository
- Scrapes the 100 most recent papers from `cs.CL` on arXiv
- Chunks, embeds, and appends the new papers to the existing knowledge base
- Rebuilds the FAISS index and uploads the updated index and chunk store back to this repository

The knowledge base grows by roughly **100 papers per day**, automatically.
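
One way the fetch, chunk, and append steps might look is sketched below. It assumes the public arXiv API (Atom feed) as the source, word-window chunking, and de-duplication by arXiv ID; none of these details are confirmed by this README, and the function names and chunk sizes are hypothetical. Embedding and the FAISS rebuild are left out.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_query_url(category="cs.CL", max_results=100):
    """Build an arXiv API URL for the newest submissions in a category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)

def parse_feed(atom_xml):
    """Extract id, title, and abstract from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(ATOM + "entry"):
        papers.append({
            "id": entry.findtext(ATOM + "id", "").strip(),
            # Collapse the line breaks and runs of spaces found in feed text.
            "title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "abstract": " ".join(entry.findtext(ATOM + "summary", "").split()),
        })
    return papers

def chunk_words(text, size=50, overlap=10):
    """Split text into word windows of `size` that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

def append_new(store, papers, size=50, overlap=10):
    """Append chunks for papers not yet in the store; return how many were added."""
    seen = {c["paper_id"] for c in store}
    added = 0
    for p in papers:
        if p["id"] in seen:
            continue  # Assumed de-duplication: skip papers indexed on an earlier run.
        for text in chunk_words(p["abstract"], size, overlap):
            store.append({"paper_id": p["id"], "title": p["title"], "text": text})
        seen.add(p["id"])
        added += 1
    return added
```

Skipping already-seen IDs matters if consecutive runs overlap: when fewer than 100 new papers appear between runs, the same entries would otherwise be indexed twice.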

---

## Quick start

```python
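import sys

from huggingface_hub import snapshot_download

# Download a local snapshot of the repository.
model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")

# pipeline.py ships inside the snapshot, so put that directory on the import path.
sys.path.insert(0, model_dir)
from pipeline import PapersRAG

rag = PapersRAG(model_dir)

print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))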
```

Requires `huggingface_hub`, `transformers`, `sentence-transformers`, and FAISS (the `faiss-cpu` package on PyPI). Everything else is in `pipeline.py`.

---

## Model composition

| Component | Description |
|---|---|
| **Language Model** | Qwen 2.5 1.5B (float16) |
| **Bi-encoder** | Dense embedding model for initial retrieval |
| **Cross-encoder** | Re-ranking model that scores chunks for relevance |
| **Vector Index** | FAISS index of embedded paper chunks |
| **Knowledge Chunks** | Processed snippets from indexed arXiv abstracts |
| **Pipeline** | `pipeline.py` — one class, handles loading, retrieval, and generation |

Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.

---

## Limitations

**Knowledge base scope.** Only `cs.CL` papers from arXiv. Papers from other fields are not included unless manually added.

**Abstracts only.** Full paper text is not indexed. Deep methodological comparisons may be incomplete.

**Small language model.** Retrieval keeps answers grounded in the indexed abstracts, but a 1.5B-parameter model has limited capacity for nuanced synthesis across multiple papers.

**English only.**

---

## License

Apache-2.0.

---

*PapersRAG is part of the Meta Research initiative — building open tools that accelerate scientific discovery.*