---
library_name: transformers
tags:
- generated_from_trainer
- cybersecurity
- continual-pretraining
- targeted-pretraining
- text-generation
- causal-lm
- risys-lab
model-index:
- name: RedSage-Qwen3-8B-Base
  results: []
language:
- en
base_model:
- RISys-Lab/RedSage-Qwen3-8B-CFW
pipeline_tag: text-generation
---

# RedSage-Qwen3-8B-Base

<div align="center">
<img src="https://img.shields.io/badge/Task-Cybersecurity-red" alt="Cybersecurity">
<img src="https://img.shields.io/badge/Stage-Targeted_Pretraining-blue" alt="Targeted Pretraining">
</div>

## Model Summary

**RedSage-Qwen3-8B-Base** is a cybersecurity-specialized Large Language Model (LLM) developed by **RISys-Lab**. It represents the **second stage** of the RedSage pre-training pipeline.

This model builds upon **RedSage-Qwen3-8B-CFW** by undergoing **Targeted Pre-Training** on high-quality, curated cybersecurity resources (`RedSage-Seed` and `RedSage-Dump`). While the previous stage focused on breadth using web data, this stage focuses on depth, technical standards, and verified skills.

- **Paper:** [RedSage: A Cybersecurity Generalist LLM](https://openreview.net/forum?id=W4FAenIrQ2) ([arXiv](https://arxiv.org/abs/2601.22159))
- **Repository:** [GitHub](https://github.com/RISys-Lab/RedSage)
- **Base Model:** [RISys-Lab/RedSage-Qwen3-8B-CFW](https://huggingface.co/RISys-Lab/RedSage-Qwen3-8B-CFW)
- **Variant:** Base (Final Pre-trained Checkpoint)

## Intended Use

This model is a **base model** intended for:

1. **Fine-tuning:** Serving as a high-quality foundation for downstream cybersecurity tasks (e.g., incident response, malware analysis); a starter sketch follows this section.
2. **Research:** Investigating the impact of curated versus web-scale data in domain adaptation.
3. **Completion:** Code completion and technical writing in cybersecurity contexts.

**Note:** As a base model, this checkpoint has **not** been instruction-tuned (SFT) or aligned (DPO). It behaves like a completion engine. For a chat-ready assistant, please see `RISys-Lab/RedSage-Qwen3-8B-DPO`.
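
For use case 1, the snippet below sketches plain causal-LM fine-tuning with the `transformers` `Trainer`. It is a minimal illustration, not the RedSage training recipe: the dataset file `my_security_corpus.jsonl`, its `"text"` field, and every hyperparameter are placeholder assumptions.

```python
# Minimal fine-tuning sketch (NOT the RedSage recipe). The dataset path,
# "text" field, and all hyperparameters below are placeholder assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "RISys-Lab/RedSage-Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical JSONL corpus with one {"text": ...} document per line.
ds = load_dataset("json", data_files="my_security_corpus.jsonl", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="redsage-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=ds,
    # Standard causal-LM collation: labels mirror the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```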

## Training Lineage

RedSage employs a multi-stage training pipeline. This model represents the output of **Stage 2**.

1. Stage 1: Continual Pre-Training (CPT) -> [RedSage-Qwen3-8B-CFW](https://huggingface.co/RISys-Lab/RedSage-Qwen3-8B-CFW) (CyberFineWeb data)
2. **Stage 2: Targeted Pre-Training** -> **`RedSage-Qwen3-8B-Base`** (Current Model)
   * *Data:* RedSage-Seed (\~150M Tokens) + RedSage-Dump (\~700M Tokens)
3. Stage 3: Supervised Fine-Tuning (SFT) -> [RedSage-Qwen3-8B-Ins](https://huggingface.co/RISys-Lab/RedSage-Qwen3-8B-Ins)
4. Stage 4: Direct Preference Optimization (DPO) -> [RedSage-Qwen3-8B-DPO](https://huggingface.co/RISys-Lab/RedSage-Qwen3-8B-DPO)

## Training Data: RedSage-Seed & Dump

This model was trained on approximately **850 million tokens** of curated data, split into two collections:

1. **RedSage-Seed (~150M Tokens):** A highly curated collection of 28,637 samples converted to structured Markdown.
   * **Knowledge:** General concepts and Frameworks (MITRE ATT&CK, CAPEC, CWE, OWASP).
   * **Skills:** Offensive security resources including write-ups, hacking techniques, and payload examples.
   * **Tools:** Manuals and cheat sheets for CLI tools and Kali Linux.

2. **RedSage-Dump (~700M Tokens):** A larger aggregation of 459K technical documents.
   * **Sources:** Computer education portals, cybersecurity news, RFC entries, NIST publications, and the National Vulnerability Database (NVD).
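
For readers auditing corpus sizes, token totals like those above are typically measured with the model's own tokenizer. A minimal sketch follows; the shard filename and `"text"` field are hypothetical, since the card does not specify the corpus format.

```python
# Hypothetical sketch: tally tokens in a local JSONL shard with the model's
# tokenizer. The filename and "text" field are assumptions for illustration.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RISys-Lab/RedSage-Qwen3-8B-Base")

total = 0
with open("redsage_dump_shard.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        total += len(tokenizer(doc["text"])["input_ids"])
print(f"{total:,} tokens")
```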

## Performance

RedSage-8B-Base achieves state-of-the-art performance among 8B models, with significant improvements over the general-purpose Qwen3-8B-Base, and records the highest mean score on external benchmarks among all 8B base models tested.

### RedSage-Bench (0-shot Accuracy)

| Category | Qwen3-8B-Base | **RedSage-8B-Base** |
| :--- | :---: | :---: |
| **Macro Average** | 84.24 | **85.05** |
| Knowledge (General) | 83.08 | 83.12 |
| Knowledge (Frameworks) | 81.94 | **84.94** |
| Skill (Offensive) | 88.23 | **88.72** |
| Tools (CLI) | 85.08 | **85.44** |
| Tools (Kali) | 78.86 | **79.36** |

### External Cybersecurity Benchmarks (5-shot)

| Benchmark | Qwen3-8B-Base | **RedSage-8B-Base** |
| :--- | :---: | :---: |
| **Mean** | 80.81 | **84.56** |
| CTI-Bench (MCQ) | 68.80 | **71.04** |
| CTI-Bench (RCM) | 63.50 | **78.40** |
| CyberMetric (500) | 92.00 | **92.60** |
| MMLU (Security) | 83.00 | **87.00** |
| SecBench (En) | **82.84** | 81.76 |
| SecEva (MCQ) | 75.60 | **75.83** |
| SECURE (CWET) | 92.70 | **93.22** |
| SECURE (KCV) | 75.05 | **87.20** |
| SECURE (MEAT) | 93.81 | **94.00** |

## Training Procedure

The model was trained using the [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) framework.

- **Learning Rate:** 2.5e-6 (constant with linear warmup)
- **Optimizer:** AdamW
- **Epochs:** 1
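
In PyTorch terms, the stated schedule corresponds to AdamW driven by a constant learning rate after a linear warmup; `transformers` ships this as `get_constant_schedule_with_warmup`. The sketch below illustrates the shape only: the warmup length and the tiny stand-in module are assumptions, as the card does not state them.

```python
# Illustrative sketch of "constant LR with linear warmup" using AdamW.
# The warmup length and the stand-in model are placeholder assumptions.
import torch
from torch import nn
from transformers import get_constant_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in for the 8B model
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-6)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=100)

for step in range(200):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # LR ramps linearly to 2.5e-6, then stays constant
    optimizer.zero_grad()
```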

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "RISys-Lab/RedSage-Qwen3-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = "The primary difference between a firewall and an IDS is"
# Move inputs to the model's device (works with device_map="auto",
# where hard-coding "cuda" can fail on sharded or CPU-only setups).
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
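
Because this is a base checkpoint, the model continues the prompt rather than answering it. Continuing from the snippet above, sampling parameters can be passed to `generate` for less deterministic completions; the values below are illustrative, not tuned recommendations.

```python
# Sampled continuation; temperature/top_p values are illustrative only.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```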

## Citation

If you use this model or dataset, please cite our paper:

```
@inproceedings{suryanto2026redsage,
  title={RedSage: A Cybersecurity Generalist {LLM}},
  author={Naufal Suryanto and Muzammal Naseer and Pengfei Li and Syed Talal Wasim and Jinhui Yi and Juergen Gall and Paolo Ceravolo and Ernesto Damiani},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=W4FAenIrQ2}
}
```