初始化项目,由ModelHub XC社区提供模型
Model: openSUSE/CVE-Backport-Qwen2.5-Coder-32B Source: Original Platform
This commit is contained in:
180
README.md
Normal file
180
README.md
Normal file
@@ -0,0 +1,180 @@
|
||||
---
|
||||
license: apache-2.0
|
||||
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
|
||||
tags:
|
||||
- security
|
||||
- patch-backporting
|
||||
- code-generation
|
||||
- qwen2
|
||||
- qlora
|
||||
- opensuse
|
||||
datasets:
|
||||
- openSUSE/cve-backport-codegen-dataset
|
||||
language:
|
||||
- en
|
||||
pipeline_tag: text-generation
|
||||
---
|
||||
|
||||
# CVE Backport Code Generation — Qwen2.5-Coder-32B (v5)
|
||||
|
||||
Fine-tuned [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) for security patch backporting via per-hunk code generation. Maintained as part of the openSUSE security tooling effort, alongside the [cve-backport-tool](https://github.com/openSUSE/cve-backport-tool) CLI.
|
||||
|
||||
Instead of generating unified diffs, this model takes a vulnerable code region and a fix description, and outputs the **fixed version of the code**. A programmatic diff then produces the final patch.
|
||||
|
||||
> **MoE variant available:** An MoE-based alternative built on
|
||||
> Qwen3-Coder-30B-A3B (3B active parameters) is hosted at
|
||||
> [anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b).
|
||||
> It scores 91.9% recall on the same 100-example eval — 1.2 pt below this
|
||||
> dense model — while running roughly 10× faster at inference due to sparse
|
||||
> MoE activation. Recommended for bulk CVE backport workflows where
|
||||
> throughput matters.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
git clone https://github.com/openSUSE/cve-backport-tool
|
||||
cd cve-backport-tool
|
||||
./setup.sh # downloads GGUF, registers with ollama
|
||||
|
||||
python3 cve-backport.py \
|
||||
--cve CVE-2024-1234 \
|
||||
--package curl \
|
||||
--patch upstream-fix.patch \
|
||||
--obs-fetch --obs-project openSUSE:Leap:15.6:Update \
|
||||
--retry 3
|
||||
```
|
||||
|
||||
## GGUF Downloads
|
||||
|
||||
| File | Quant | Size | Notes |
|
||||
|------|-------|------|-------|
|
||||
| `cve-backport-codegen-v5-q8_0.gguf` | Q8_0 | 33 GB | **Recommended** (v5, 93.1% recall, 94.4% precision, codegen-only) |
|
||||
| `cve-backport-codegen-v4-q8_0.gguf` | Q8_0 | 33 GB | v4, 93% recall, 95% precision (includes test generation training) |
|
||||
| `cve-backport-codegen-v3-q8_0.gguf` | Q8_0 | 33 GB | v3, 94% recall, 98% precision (legacy, smaller eval set) |
|
||||
|
||||
## Evaluation (v5)
|
||||
|
||||
Per-hunk evaluation on 100 held-out examples the model never saw during training:
|
||||
|
||||
| Metric | v5 | v4 | v3 (n=20) |
|
||||
|--------|:--:|:--:|:---------:|
|
||||
| Average recall | **93.1%** | 93% | 94% |
|
||||
| Average precision | **94.4%** | 95% | 98% |
|
||||
| Exact match | **83/100** | 87/100 | 16/20 |
|
||||
| Failures (<10%) | **3/100** | 4/100 | 0/20 |
|
||||
|
||||
By tier:
|
||||
- **Identical** (upstream patch applies directly): 93.7% recall (77/85 perfect)
|
||||
- **Adapted** (line numbers/context differ): 90.0% recall (13/15 perfect)
|
||||
|
||||
Adapted-tier recall has steadily improved: 71% (v1) → 86% (v4) → **90% (v5)**.
|
||||
|
||||
### What changed in v5
|
||||
|
||||
v5 uses a codegen-only dataset — all 36,166 training examples follow the same 3-turn format. v4 mixed in 772 five-turn test-generation examples which diluted codegen focus. Dropping those and training for 2 epochs (vs 1 in v4) improved adapted-tier recall.
|
||||
|
||||
### Comparison with Frontier Models
|
||||
|
||||
Same eval, same 100 examples, optimized prompts with markdown stripping:
|
||||
|
||||
| Model | Recall | Precision | Exact | Failures |
|
||||
|-------|--------|-----------|-------|----------|
|
||||
| **CVE Backport v5** (32B fine-tuned) | **93%** | **94%** | **83/100** | **3** |
|
||||
| Gemini 3.1 Pro (frontier, zero-shot) | 27% | 24% | 10/100 | 50 |
|
||||
| Gemini 2.0 Flash (frontier, zero-shot) | 13% | 17% | 4/100 | 81 |
|
||||
|
||||
Fine-tuning on 36K domain-specific examples outperforms frontier models by 3-7x on this task.
|
||||
|
||||
## Prompt Format
|
||||
|
||||
ChatML format. Each prompt covers one hunk region with 15 lines of context padding.
|
||||
|
||||
### Code Generation (3-turn)
|
||||
|
||||
**System:**
|
||||
```
|
||||
You are a security patch backporting assistant.
|
||||
|
||||
Given vulnerable source code and a description of the upstream fix, output the FIXED version of the code.
|
||||
|
||||
Rules:
|
||||
- Output ONLY the fixed code, nothing else — no explanations, no markdown fences
|
||||
- Preserve exact formatting, indentation, and style of the original
|
||||
- Make ONLY the changes described in the fix — do not modify anything else
|
||||
- Do not add comments about what you changed
|
||||
```
|
||||
|
||||
**User:**
|
||||
```
|
||||
## File: crypto/bn/bn.h
|
||||
## Lines: 280-310
|
||||
|
||||
\```c
|
||||
/* vulnerable source code region with 15 lines of context */
|
||||
\```
|
||||
|
||||
## Fix
|
||||
Add bounds check for BN_num_bits to prevent buffer over-read (CVE-2024-XXXX).
|
||||
```
|
||||
|
||||
**Assistant:** The fixed version of the code region (just the code, no markup).
|
||||
|
||||
## Training
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| Base model | Qwen2.5-Coder-32B-Instruct |
|
||||
| Method | QLoRA (4-bit NF4, bf16 compute, double quantization) |
|
||||
| LoRA rank / alpha | 64 / 128 |
|
||||
| Epochs | 2 (8,228 steps) |
|
||||
| Training data | 36,166 train / 1,834 eval (codegen-only, all 3-turn) |
|
||||
| Effective batch size | 8 |
|
||||
| Learning rate | 1e-4 (cosine, 5% warmup) |
|
||||
| Max sequence length | 4,096 tokens |
|
||||
| Hardware | 2× NVIDIA H100 NVL 94GB |
|
||||
| Training time | 46.1 hours |
|
||||
| Final eval loss | 0.00602 |
|
||||
|
||||
## Reproduction via Teapot
|
||||
|
||||
This model is reproducible via the [teapot](https://github.com/anicka-net/teapot) training pipeline. Once the dataset is composed, training is a four-command sequence:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/anicka-net/teapot
|
||||
cd teapot
|
||||
pip install -e .
|
||||
|
||||
# 1. Compose training data from the cve-backport module
|
||||
teapot compose configs/cve-backport.config \
|
||||
--output train-cve-backport.jsonl
|
||||
|
||||
# 2. Generate the QLoRA-HF launch script
|
||||
teapot train configs/cve-backport.config \
|
||||
--backend qlora-hf \
|
||||
--train-data train-cve-backport.jsonl \
|
||||
--eval-data eval-cve-backport.jsonl \
|
||||
--output train-cve-backport.sh
|
||||
|
||||
# 3. Train (2× H100 NVL 94GB; ~46 hours)
|
||||
bash train-cve-backport.sh
|
||||
|
||||
# 4. Final adapter is at output-teapot-cve-backport/final/
|
||||
```
|
||||
|
||||
The teapot config (`configs/cve-backport.config`) pins all the hyperparameters listed in the Training table above. The `qlora-hf` backend invokes `teapot.train_qlora_hf`, a thin wrapper over the HuggingFace `Trainer` with bitsandbytes 4-bit quantization and PEFT LoRA.
|
||||
|
||||
## LoRA Adapter and MoE Variant
|
||||
|
||||
The LoRA adapter for this model is hosted at [anicka/cve-backport-codegen-v5-qwen25-32b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b) for use with PEFT/transformers.
|
||||
|
||||
An MoE variant trained on the same dataset is available at [anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b) — built on Qwen3-Coder-30B-A3B (3B active params), 91.9% recall on the same n=100 eval, ~10× faster inference.
|
||||
|
||||
## Known Issues
|
||||
|
||||
- The 3 failure cases (0% recall) are all complex libvirt patches involving multi-function adaptations across large files with significant structural differences. These likely require an agentic approach with source tree context.
|
||||
- Very long hunks (>2000 tokens) may be truncated due to the 4096-token training context.
|
||||
- Always review generated patches before applying to production systems.
|
||||
|
||||
## License
|
||||
|
||||
Apache-2.0 (inherited from Qwen2.5-Coder-32B-Instruct).
|
||||
Reference in New Issue
Block a user