---
base_model: Qwen/Qwen3-8B
tags:
- transformers
- factual-grounding
- fact-checking
- qwen3
license: apache-2.0
language:
- en
---

# About Resect Research Labs

- Resect Research Labs focuses on improving factual grounding as well as detecting, reducing, and mitigating hallucinations in AI models through proprietary reinforcement learning and novel fine-tuning techniques.

## Introducing: Veritas-8B-Fact-Checker-Non-Thinking-1.0

- **Veritas-8B-Fact-Checker-Non-Thinking-1.0** is built on the **Qwen3 architecture**, starting from [**Qwen/Qwen3-8B**](https://huggingface.co/Qwen/Qwen3-8B). Resect Research Labs has specialized, fine-tuned, and optimized this model for **fact-checking and factual consistency verification**.

### Model Performance

- This model is evaluated on [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact), a benchmark that aggregates 11 human-annotated datasets on fact-checking and grounding. The model did not see this benchmark during training.

### Overall Performance

- **Veritas-8B-Fact-Checker-Non-Thinking-1.0** achieves an **average balanced accuracy of 75.47%**, an improvement of **2.3 percentage points** over Qwen3-8B in non-thinking mode (73.17%).

### Benchmark Details (LLM-AggreFact)

Balanced accuracy scores (%)

| Model | Size | Avg | CNN | XSum | MediaS | MeetB | WiCE | REVEAL | Claim Verify | Fact Check | Expert QA | LFQA | RAG Truth |
|------|------|------|------|------|--------|-------|------|--------|--------------|------------|-----------|------|-----------|
| Qwen3-8B (non-thinking) | 8B | 73.17 | 66.30 | 71.25 | 69.05 | **77.50** | 76.16 | 84.81 | 67.71 | **77.39** | 57.14 | 80.62 | 76.91 |
| Veritas-8B-Fact-Checker-Non-Thinking-1.0 | 8B | **75.47** | **68.84** | **74.33** | **71.83** | 76.44 | **79.06** | **86.94** | **72.40** | 76.75 | **58.12** | **84.14** | **81.30** |

<sup>**The benchmarks noted here for Veritas-8B-Fact-Checker-Non-Thinking-1.0 were run on the test set, and a pull request has been submitted to [MiniCheck's library (Pull Request)](https://github.com/Liyan06/MiniCheck/pull/17) to support additional operating modes, including this model.**</sup>

<sup>Note: Performance may vary slightly depending on hardware configuration and vLLM version.</sup>

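The "Avg" column appears to be the unweighted mean of the 11 per-dataset scores, and the 2.3-point gain is the difference between the two averages. A quick check in Python, using only the numbers copied from the table above (the variable and function names are illustrative):

```python
# Per-dataset balanced accuracy (%) copied from the LLM-AggreFact table above,
# in column order: CNN, XSum, MediaS, MeetB, WiCE, REVEAL, Claim Verify,
# Fact Check, Expert QA, LFQA, RAG Truth.
qwen3_8b = [66.30, 71.25, 69.05, 77.50, 76.16, 84.81, 67.71, 77.39, 57.14, 80.62, 76.91]
veritas_8b = [68.84, 74.33, 71.83, 76.44, 79.06, 86.94, 72.40, 76.75, 58.12, 84.14, 81.30]


def macro_average(scores):
    """Unweighted mean over datasets: each dataset counts equally."""
    return sum(scores) / len(scores)


qwen_avg = macro_average(qwen3_8b)
veritas_avg = macro_average(veritas_8b)

print(f"Qwen3-8B avg:   {qwen_avg:.2f}")     # 73.17
print(f"Veritas-8B avg: {veritas_avg:.2f}")  # 75.47
print(f"Gain: {veritas_avg - qwen_avg:.2f} percentage points")  # 2.30
```

Because every dataset is weighted equally, a large dataset does not dominate the average; the gain is an absolute difference in percentage points, not a relative percentage.
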

---

# Model Usage

## Scope of Use

- Veritas-8B-Fact-Checker-Non-Thinking-1.0 must be used only in the prescribed scoring mode, which produces a binary classification from the specified prompt template. Using the model in any other way may lead to unexpected outputs.

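In the simple use case below, the scorer returns both a raw support probability and a binary label for each claim. The sketch below shows one plausible way such labels are derived from probabilities. It assumes a fixed 0.5 decision threshold; both the threshold and the helper name `to_binary_labels` are hypothetical and are not taken from MiniCheck's source.

```python
from typing import List


def to_binary_labels(support_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Map support probabilities to labels: 1 = supported, 0 = unsupported.

    Hypothetical helper; the 0.5 threshold is an assumption for illustration.
    """
    return [1 if p > threshold else 0 for p in support_probs]


# Probabilities printed by the simple use case in this card.
probs = [0.9465315396494047, 0.008577206810662688]
print(to_binary_labels(probs))  # [1, 0]
```

Keeping the raw probability alongside the label lets you pick a different threshold later, for example to trade recall of unsupported claims against false alarms.
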
## Using MiniCheck's library [^2]

**Requires the changes from our pull request to be merged; see [^2].**

Run the following command to install the **MiniCheck package** and all necessary dependencies.

```sh
pip install "minicheck[llm] @ git+https://github.com/Liyan06/MiniCheck.git@main"
```

[^2]: A [pull request to MiniCheck's library](https://github.com/Liyan06/MiniCheck/pull/17) has been submitted and is awaiting review.

#### A simple use case

```python
import os

# Select the GPU before importing libraries that may initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from minicheck.minicheck import MiniCheck

# A short source document and two claims to verify against it.
doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# Apply the Qwen3 chat template in non-thinking mode.
chat_kwargs = {'enable_thinking': False}

scorer = MiniCheck(
    model_name='resect-ai/veritas-8B-fact-checker-non-thinking-1.0',
    enable_prefix_caching=False,
    extra_chat_template_kwargs=chat_kwargs,
    operating_mode="bespoke",
    max_new_tokens=1,
    cache_dir='./ckpts',
    bypass_model_check=True,
)

# Score each (document, claim) pair. Optionally pass `chunk_size=...` here (default chunk size: 32K).
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label)  # [1, 0]
print(raw_prob)    # [0.9465315396494047, 0.008577206810662688]
```

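The `chunk_size` option in the example above relates to splitting long documents before scoring. The following is a simplified, hypothetical sketch of the general chunk-then-aggregate idea: split the document into fixed-size chunks, score the claim against each chunk, and treat the claim as supported if any chunk supports it (max aggregation). Whitespace words stand in for model tokens, and `toy_score_chunk` is a toy stand-in for the model; MiniCheck's actual chunking and aggregation may differ.

```python
from typing import Callable, List


def split_into_chunks(doc: str, chunk_size: int) -> List[str]:
    """Split a document into chunks of at most `chunk_size` whitespace words.

    Words are a simplifying stand-in for model tokens (assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = doc.split()
    if not words:
        return [""]
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def score_long_document(
    doc: str,
    claim: str,
    score_chunk: Callable[[str, str], float],
    chunk_size: int,
) -> float:
    """Return the highest support probability over all chunks (max aggregation)."""
    return max(score_chunk(chunk, claim) for chunk in split_into_chunks(doc, chunk_size))


def toy_score_chunk(chunk: str, claim: str) -> float:
    """Hypothetical scorer: fraction of claim words that appear in the chunk."""
    claim_words = set(claim.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(claim_words & chunk_words) / max(len(claim_words), 1)


doc = "The museum opened in 1990 and was renovated in 2015 after a long closure"
print(split_into_chunks(doc, chunk_size=7))
print(score_long_document(doc, "renovated in 2015", toy_score_chunk, chunk_size=7))  # 1.0
```

Max aggregation means a single strongly supporting chunk is enough; the trade-off is that evidence split across a chunk boundary may be missed, which is one reason to keep chunks large.
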
### Test on [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark [^2]

```python
import os

# Select the GPU before importing libraries that may initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

# Load the 30K-example LLM-AggreFact test split.
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

# Apply the Qwen3 chat template in non-thinking mode.
chat_kwargs = {'enable_thinking': False}

scorer = MiniCheck(
    model_name='resect-ai/veritas-8B-fact-checker-non-thinking-1.0',
    enable_prefix_caching=False,
    extra_chat_template_kwargs=chat_kwargs,
    operating_mode="bespoke",
    max_new_tokens=1,
    cache_dir='./ckpts',
    bypass_model_check=True,
)
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)
```

To evaluate the results on the benchmark, compute balanced accuracy for each dataset and then average across datasets:

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Continues from the previous block: `df` holds the test split, `pred_label` the predictions.
df['preds'] = pred_label

# Balanced accuracy (%) for each source dataset in LLM-AggreFact.
rows = []
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    rows.append({'Dataset': dataset, 'BAcc': bacc})
result_df = pd.DataFrame(rows)

# Unweighted mean over datasets, i.e. the "Avg" column.
result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
print(result_df.round(1))
```

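`balanced_accuracy_score` averages recall over the classes present in the true labels, so a model cannot score well just by always predicting the majority label; for binary labels this equals (true-positive rate + true-negative rate) / 2. A dependency-free sketch of the same quantity, using made-up labels for illustration (the function name `balanced_accuracy` is hypothetical):

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean recall over the classes present in y_true (binary or multiclass)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative labels: 1 = supported, 0 = unsupported (made-up data).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
always_supported = [1] * len(y_true)
print(balanced_accuracy(y_true, always_supported))  # 0.5 (plain accuracy would be 0.75)
```

This is why the benchmark reports balanced accuracy: several LLM-AggreFact datasets have skewed label distributions, and plain accuracy would reward a constant predictor.
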
# License

- This model, **Veritas-8B-Fact-Checker-Non-Thinking-1.0**, is licensed under the Apache 2.0 license, available at https://choosealicense.com/licenses/apache-2.0. By downloading or using this model, you agree to the license terms.

# Acknowledgements

Model fine-tuned by [Resect Research Labs](https://www.resect.ai/).