Initialize project; model provided by the ModelHub XC community
Model: metaresearch/PapersRAG-1.5B Source: Original Platform
---
license: apache-2.0
language:
- en
tags:
- rag
- question-answering
- scientific-literature
- arxiv
- nlp
- research-tool
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
---

# PapersRAG-1.5B 🧪

**A retrieval-augmented generation system for querying recent scientific literature, continuously updated.**

PapersRAG-1.5B helps researchers explore and answer questions across a growing corpus of recent NLP papers from arXiv. It pairs a lightweight language model with a curated knowledge base of paper abstracts and a retrieval pipeline that prioritizes faithful, citation-backed answers over hallucination.

The knowledge base is **automatically refreshed every day** with the latest `cs.CL` papers, so it expands on its own with no manual upkeep.

---

## Model description

- **Type:** Retrieval-augmented generation (RAG)
- **Base language model:** Qwen 2.5 1.5B (small, fast, and coherent when grounded with good context)
- **Knowledge base:** A continuously growing collection of abstracts from the most recent `cs.CL` papers on arXiv, updated daily via an automated pipeline
- **Retrieval pipeline:** Dense embeddings for initial candidate retrieval, then a cross-encoder for re-ranking, so only the most relevant chunks reach the language model
- **Answer style:** Every answer cites the title of the paper it draws from. If no relevant paper is found, the model says so instead of fabricating one
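
The answer style described above depends on how retrieved chunks are handed to the language model. The following is a minimal sketch of one way to assemble such a prompt, assuming each retained chunk carries the title of its paper. The names `Chunk`, `build_prompt`, `answer_or_refuse`, the instruction wording, and the refusal message are all illustrative and are not taken from `pipeline.py`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Chunk:
    title: str  # title of the paper the chunk came from
    text: str   # chunk text taken from the abstract


# Hypothetical refusal message; the wording used by the real pipeline may differ.
NO_ANSWER = "I could not find a relevant paper in the indexed collection."


def build_prompt(question: str, chunks: List[Chunk]) -> Optional[str]:
    """Return a grounded prompt, or None when there is no context to cite."""
    if not chunks:
        return None
    context = "\n\n".join(
        f"[{i}] {c.title}\n{c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite paper titles.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer_or_refuse(
    question: str, chunks: List[Chunk], generate: Callable[[str], str]
) -> str:
    """Call the generator only when there is context; otherwise refuse."""
    prompt = build_prompt(question, chunks)
    return NO_ANSWER if prompt is None else generate(prompt)
```

Here `generate` stands in for whatever function wraps the language model. Keeping the refusal outside the model call means an empty context never reaches generation at all.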

---

## Intended use

PapersRAG is a **research assistant**. It helps scientists and students locate information within indexed NLP papers, ask comparative questions like *"What are the latest trends in retrieval-augmented generation?"*, and surface specific details about a paper's methodology or findings.

It is not a general-purpose chatbot. It does not have access to full paper text. It only knows what has been explicitly indexed. It will tell you when it doesn't know something.

---

## How it works

1. **Indexing:** Paper abstracts are split into overlapping chunks, embedded with a dense bi-encoder, and stored in a FAISS index
2. **Retrieval:** The bi-encoder fetches a pool of candidate chunks for any given question
3. **Re-ranking:** A cross-encoder scores each candidate; only chunks above a confidence threshold are kept
4. **Generation:** Retained chunks are passed as context to the 1.5B model, which generates a cited answer
5. **Safety:** If nothing clears the confidence threshold, the model refuses to answer rather than hallucinate

No relevant chunk, no answer. That's the rule.
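
The retrieve, re-rank, and threshold steps can be sketched with the standard library alone. This is an illustration of the control flow, not the code in `pipeline.py`: brute-force cosine similarity stands in for the FAISS search, a word-overlap function stands in for the cross-encoder, and the function names and the threshold value of 0.3 are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, pool_size):
    """Stage 1: return the pool_size chunks whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:pool_size]]


def rerank(question, candidates, score, threshold):
    """Stage 2: keep candidates scoring above the threshold, best first."""
    scored = [(score(question, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept]


def overlap_score(question, chunk):
    """Toy relevance score: fraction of question words that appear in the chunk."""
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Toy index of (chunk text, embedding) pairs with made-up 2-d vectors.
index = [
    ("RAG reduces hallucination", [1.0, 0.0]),
    ("Parsing with transformers", [0.0, 1.0]),
]

pool = retrieve([0.9, 0.1], index, pool_size=2)
context = rerank("does rag reduce hallucination", pool, overlap_score, threshold=0.3)
print(context if context else "No relevant chunk, no answer.")
```

An empty list from `rerank` is the signal to refuse: a question such as `"tokenization"` shares no words with either toy chunk, so nothing clears the threshold and generation is skipped.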

---

## Automated daily updates

Every day, the update pipeline:

- Downloads the existing index and chunk store from this repository
- Scrapes the 100 most recent papers from `cs.CL` on arXiv
- Chunks, embeds, and appends the new papers to the existing knowledge base
- Rebuilds the FAISS index and uploads everything back

The knowledge base grows by roughly **100 papers per day**, automatically.
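
A minimal sketch of the fetch-and-append part of such an update, under several assumptions: that papers are listed through the public arXiv API (the description above only says "scrapes"), that chunks are fixed-size word windows with a fixed overlap, and that papers already in the chunk store are skipped by ID. The names `arxiv_query_url`, `chunk_words`, and `append_new_papers`, the record fields, and the window sizes are illustrative.

```python
from urllib.parse import urlencode


def arxiv_query_url(category="cs.CL", max_results=100):
    """Build an arXiv API query URL for the newest submissions in a category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def append_new_papers(store, papers, size=50, overlap=10):
    """Append chunks of unseen papers to `store`; return how many papers were added."""
    seen = {row["paper_id"] for row in store}
    added = 0
    for paper in papers:
        if paper["id"] in seen:
            continue  # already indexed on an earlier run
        for piece in chunk_words(paper["abstract"], size, overlap):
            store.append(
                {"paper_id": paper["id"], "title": paper["title"], "text": piece}
            )
        seen.add(paper["id"])
        added += 1
    return added
```

Embedding the new chunks and rebuilding the index are left out because they depend on the models named in the repository's configuration files. Skipping known IDs matters when consecutive daily fetches overlap, since the 100 most recent papers on one day can include some from the day before.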

---

## Quick start

```python
import sys

from huggingface_hub import snapshot_download

model_dir = snapshot_download("metaresearch/PapersRAG-1.5B")

# Assumes pipeline.py sits at the root of the snapshot; make it importable
# before importing from it (not needed when running inside a local clone).
sys.path.insert(0, model_dir)
from pipeline import PapersRAG

rag = PapersRAG(model_dir)

print(rag.ask("What are the latest approaches to retrieval-augmented generation?"))
```

Requires `transformers`, `sentence-transformers`, and `faiss` (published on PyPI as `faiss-cpu`), plus `huggingface_hub` for the download step. Everything else is in `pipeline.py`.
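
Before running the quick start, it can help to confirm that the required packages are importable. The check below is a small convenience sketch and is not part of the repository; `missing_modules` is a hypothetical helper, and the list uses import names, which differ from the install names for `sentence-transformers` and `faiss`.

```python
from importlib.util import find_spec

# Import names, which can differ from the package names used to install them.
REQUIRED_MODULES = ("transformers", "sentence_transformers", "faiss", "huggingface_hub")


def missing_modules(names=REQUIRED_MODULES):
    """Return the modules from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```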

---

## Model composition

| Component | Description |
|---|---|
| **Language Model** | Qwen 2.5 1.5B (float16) |
| **Bi-encoder** | Dense embedding model for initial retrieval |
| **Cross-encoder** | Re-ranking model that scores chunks for relevance |
| **Vector Index** | FAISS index of embedded paper chunks |
| **Knowledge Chunks** | Processed snippets from indexed arXiv abstracts |
| **Pipeline** | `pipeline.py`: one class that handles loading, retrieval, and generation |

Exact model names for the bi-encoder and cross-encoder are in the repository's configuration files.

---

## Limitations

**Knowledge base scope.** Only `cs.CL` papers from arXiv. Papers from other fields are not included unless manually added.

**Abstracts only.** Full paper text is not indexed. Deep methodological comparisons may be incomplete.

**Small language model.** 1.5B parameters is lightweight. The retrieval pipeline handles factual accuracy well, but nuanced multi-paper synthesis has limits.

**English only.**

---

## License

Apache-2.0.

---

*PapersRAG is part of the Meta Research initiative, building open tools that accelerate scientific discovery.*