---
license: apache-2.0
language:
- en
tags:
- rag
- question-answering
- scientific-literature
- arxiv
- nlp
- research-tool
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
---

# PapersRAG-1.5B 🧪

**A retrieval-augmented generation system for querying recent scientific literature, continuously updated.**

PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a curated knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.

The model's knowledge base is **automatically refreshed every day** with the latest `cs.CL` papers and expands on its own. No manual upkeep required.

---

## Model description

- **Type:** Retrieval-augmented generation (RAG)
- **Base language model:** Qwen 2.5 1.5B: small, fast, and coherent when grounded with good context
- **Knowledge base:** A continuously growing collection of abstracts from the most recent `cs.CL` papers on arXiv, updated daily via an automated pipeline
- **Retrieval pipeline:** Dense embeddings for initial candidate retrieval, then a cross-encoder for re-ranking, so only the most relevant chunks reach the language model
- **Answer style:** Every answer cites the title of the paper it draws from. If no relevant paper is found, the model says so instead of fabricating one

---

## Intended use

PapersRAG is a **research assistant**. It helps scientists and students locate information within indexed NLP papers, ask comparative questions like *"What are the latest trends in retrieval-augmented generation?"*, and surface specific details about a paper's methodology or findings.

It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.

---

## How it works

1. **Indexing**: Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
2. **Retrieval**: The bi-encoder fetches a pool of candidate chunks for any given question
3. **Re-ranking**: A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
4. **Generation**: Retained chunks are passed as context to the 1.5B model, which generates a cited answer
5. **Safety**: If nothing clears the confidence threshold, the model refuses to answer rather than hallucinate

No relevant chunk, no answer. That's the rule.

---

## Automated daily updates

Every day, the update pipeline:

- Downloads the existing index and chunk store from this repository
- Scrapes the 100 most recent papers from `cs.CL` on arXiv
- Chunks, embeds, and appends the new papers to the existing knowledge base
- Rebuilds the FAISS index and uploads everything back

The knowledge base grows by roughly **100 papers per day**, automatically.

---

## Quick start

```python
import sys

from huggingface_hub import snapshot_download

# Download the repository snapshot (models, index, chunk store, pipeline.py)
model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")

# Assumes pipeline.py sits at the repository root; make it importable first
sys.path.insert(0, model_dir)
from pipeline import PapersRAG

rag = PapersRAG(model_dir)
print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))
```

Requires `transformers`, `sentence-transformers`, and `faiss` (the PyPI package is `faiss-cpu`). Everything else is in `pipeline.py`.

---

## Model composition

| Component | Description |
|---|---|
| **Language Model** | Qwen 2.5 1.5B (float16) |
| **Bi-encoder** | Dense embedding model for initial retrieval |
| **Cross-encoder** | Re-ranking model that scores chunks for relevance |
| **Vector Index** | FAISS index of embedded paper chunks |
| **Knowledge Chunks** | Processed snippets from indexed arXiv abstracts |
| **Pipeline** | `pipeline.py`: one class that handles loading, retrieval, and generation |

Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.

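
The "no relevant chunk, no answer" rule from the re-ranking and safety steps can be illustrated with a small sketch. This is not the code in `pipeline.py`: the `Chunk` class, `overlap_score`, `rerank`, `build_prompt`, the `0.5` threshold, and `top_k` are hypothetical names and values chosen for illustration. A simple word-overlap score stands in for the cross-encoder so the example runs without downloading any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    title: str  # title of the paper the chunk came from
    text: str   # a piece of that paper's abstract


def overlap_score(question: str, passage: str) -> float:
    """Toy relevance score: fraction of question words found in the passage.

    Stand-in for a cross-encoder, which would score the pair jointly.
    """
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(
    question: str,
    candidates: List[Chunk],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical confidence threshold
    top_k: int = 3,          # hypothetical cap on retained chunks
) -> List[Chunk]:
    """Score every candidate and keep the best ones at or above the threshold."""
    scored = [(score_fn(question, c.text), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in kept[:top_k]]


def build_prompt(question: str, context: List[Chunk]) -> Optional[str]:
    """Return a citation-oriented prompt, or None when the caller should refuse."""
    if not context:
        return None  # nothing cleared the threshold: refuse instead of guessing
    sources = "\n".join(f"[{c.title}] {c.text}" for c in context)
    return (
        "Answer using only these sources and cite paper titles.\n"
        f"{sources}\nQuestion: {question}"
    )


candidates = [
    Chunk("Paper A", "dense retrieval improves retrieval-augmented generation"),
    Chunk("Paper B", "a new tokenizer for low-resource languages"),
]
question = "does dense retrieval help generation"
prompt = build_prompt(question, rerank(question, candidates))
print(prompt if prompt is not None else "No relevant paper found in the index.")
```

Returning `None` from `build_prompt` keeps the refusal decision outside the language model: the generator is never called when no chunk clears the threshold, so it cannot fabricate a citation.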

---

## Limitations

**Knowledge base scope.** Only `cs.CL` papers from arXiv. Papers from other fields are not included unless manually added.

**Abstracts only.** Full paper text is not indexed. Deep methodological comparisons may be incomplete.

**Small language model.** 1.5B parameters is lightweight. The retrieval pipeline handles factual accuracy well, but nuanced multi-paper synthesis has limits.

**English only.**

---

## License

Apache-2.0.

---

*PapersRAG is part of the Meta Research initiative, building open tools that accelerate scientific discovery.*