---
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
tags:
- security
- cve
- patches
- backporting
- opensuse
- suse
- linux
- code-generation
- lora
- qlora
- transformers
datasets:
- anicka/cve-backport-codegen-dataset
model-index:
- name: cve-backport-codegen-v5-qwen25-32b
  results:
  - task:
      type: text-generation
      name: Security Patch Backporting
    dataset:
      type: anicka/cve-backport-codegen-dataset
      name: CVE Backport Codegen Dataset
    metrics:
    - name: Recall
      type: recall
      value: 0.931
    - name: Precision
      type: precision
      value: 0.944
    - name: Exact Match
      type: exact_match
      value: 0.83
---

# CVE Backport Codegen v5: Qwen2.5-Coder-32B QLoRA

Fine-tuned code generation model for backporting upstream CVE security fixes
to older SUSE/openSUSE package versions. Given vulnerable source code and an
upstream fix description, the model outputs the corrected code. A separate
tool then diffs the output against the original to produce a patch.

This is a **per-hunk code generation** approach: the model sees one region of
source code at a time and returns the fixed version, rather than generating
raw unified diffs. This yields higher accuracy than patch-format models
because the model works in its natural domain (code) rather than a
meta-format (diffs).

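
The diff step described above can be sketched with Python's standard `difflib`. This is only an illustration under stated assumptions, not the code used by cve-backport-tool: the function name `region_to_patch`, its parameters, and the sample C snippet are all hypothetical.

```python
import difflib


def region_to_patch(path, original_lines, start, end, fixed_region):
    """Splice a fixed region into a file and return a unified diff.

    original_lines: full file content as a list of lines (with newlines).
    start, end: 1-based inclusive line range that was shown to the model.
    fixed_region: the model's output for that range, as a single string.
    """
    fixed_lines = fixed_region.splitlines(keepends=True)
    # Make sure the last line of the model output ends with a newline.
    if fixed_lines and not fixed_lines[-1].endswith("\n"):
        fixed_lines[-1] += "\n"
    # Replace lines start..end of the original with the model output.
    patched = original_lines[:start - 1] + fixed_lines + original_lines[end:]
    diff = difflib.unified_diff(
        original_lines,
        patched,
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical three-line file; the model was shown line 2 only.
original = ["int f(int n) {\n", "    return buf[n];\n", "}\n"]
fixed = "    if (n < 0 || n >= LEN) return -1;\n    return buf[n];\n"
patch = region_to_patch("src/f.c", original, 2, 2, fixed)
print(patch)
```

Diffing the whole file after splicing, rather than diffing the region in isolation, keeps the hunk headers relative to the real file so the result can be applied with standard patch tooling.
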
> **MoE sibling now available:** [anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b)
> reaches 91.9% recall on the same n=100 eval (within 1.2 points of this model)
> while running ~10× faster at inference, thanks to Qwen3-Coder-30B-A3B's
> sparse 3B-active MoE architecture. Same training data, same config style,
> trained in 1/5 the wall time on a single H100.

## What's New in v5

v5 uses a unified **codegen-only dataset**: all 36,166 training examples
follow the same 3-turn format (system / user with code + fix description /
assistant with fixed code). v4 mixed in 5-turn test-generation examples;
v5 drops those to focus entirely on codegen quality.

| Metric | v5 | v4 | v1 |
|--------|:--:|:--:|:--:|
| **Recall** | **93.1%** | 93% | 91% |
| **Precision** | **94.4%** | 95% | n/a |
| **Exact match** | **83/100** | 87/100 | n/a |
| **Adapted recall** | **90.0%** | 86% | 71% |
| **Identical recall** | 93.7% | 94% | 94% |

Adapted-tier recall has steadily improved: 71% (v1) → 86% (v4) → **90% (v5)**.
Precision and exact match are slightly below v4 (94.4% vs. 95%, and 83 vs. 87
out of 100). The codegen-only dataset gives the model a cleaner training
signal for the core task.

## Model Details

| Parameter | Value |
|---|---|
| **Base model** | [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) |
| **Method** | QLoRA (4-bit NF4, double quantization, bf16 compute) |
| **LoRA rank / alpha** | 64 / 128 |
| **LoRA dropout** | 0.05 |
| **LoRA targets** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Training data** | 36,166 train / 1,834 eval examples |
| **Epochs** | 2 (8,228 steps) |
| **Effective batch size** | 8 (1 × grad_accum 8) |
| **Learning rate** | 1e-4 (cosine schedule, 5% warmup) |
| **Max sequence length** | 4,096 tokens |
| **Optimizer** | AdamW fused, weight decay 0.01 |
| **Hardware** | 2× NVIDIA H100 NVL 94GB |
| **Training time** | 46.1 hours |
| **Train loss (avg)** | 0.0215 |
| **Eval loss (final)** | 0.00602 |
| **PEFT version** | 0.18.1 |

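
To make the learning-rate row concrete, the sketch below evaluates a cosine schedule with 5% warmup over the 8,228 steps listed in the table. It assumes linear warmup, a warmup length rounded up to a whole step, and decay all the way to zero; the table does not state these details, so treat the exact curve as illustrative. The function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 1e-4       # learning rate from the table
TOTAL_STEPS = 8228   # optimizer steps from the table
WARMUP_RATIO = 0.05  # 5% warmup from the table

# Assumption: the warmup length is rounded up to a whole number of steps.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the peak of 1e-4 is reached after roughly the first 400 steps, and the rate then falls smoothly for the remainder of the two epochs.
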
## Files

This repository contains:

- **LoRA adapter** (`adapter_model.safetensors`, `adapter_config.json`): merge with the base model using PEFT
- **GGUF Q8_0** (`cve-backport-codegen-v5-q8_0.gguf`, 33 GB): ready for llama.cpp / ollama

## Reproduction via Teapot

This model was trained via the [teapot](https://github.com/anicka-net/teapot)
training pipeline. Once the cve-backport dataset is prepared, the full
reproduction is a four-step sequence:

```bash
git clone https://github.com/anicka-net/teapot
cd teapot
pip install -e .

# 1. Compose training data from the cve-backport module
teapot compose configs/cve-backport.config \
    --output train-cve-backport.jsonl

# 2. Generate the QLoRA-HF launch script
teapot train configs/cve-backport.config \
    --backend qlora-hf \
    --train-data train-cve-backport.jsonl \
    --eval-data eval-cve-backport.jsonl \
    --output train-cve-backport.sh

# 3. Train (2× H100 NVL 94GB; ~46 hours)
bash train-cve-backport.sh

# 4. Final adapter is at output-teapot-cve-backport/final/
```

The teapot config (`configs/cve-backport.config`) pins all the hyperparameters:
`method: qlora`, `epochs: 2`, `lr: 1e-4`, `batch_size: 1`, `gradient_accumulation: 8`,
`lora_r: 64`, `lora_alpha: 128`, `max_length: 4096`, `warmup_ratio: 0.05`,
`hardware.gpus: 2`. See the config file in the teapot repo for the full
declaration.

The `qlora-hf` backend invokes `python3 -m teapot.train_qlora_hf`, which is
a thin wrapper over the HuggingFace `Trainer` with bitsandbytes 4-bit
quantization and PEFT LoRA. Training data is composed from the
[cve-backport-codegen-dataset](https://huggingface.co/datasets/anicka/cve-backport-codegen-dataset)
HF repo (the `domain/cve-backport` teapot module fetches it automatically).

## Evaluation

Evaluated on 100 held-out examples (zero CVE overlap with training) using
the Q8_0 GGUF served via llama-server (temperature=0, ctx=8192).

### Overall

| Metric | Value |
|--------|-------|
| Avg recall | 93.1% |
| Avg precision | 94.4% |
| Exact match | 83/100 |
| Perfect (100% recall) | 90/100 |
| Failures (0% recall) | 3/100 |

### By Tier

| Tier | Count | Avg Recall | Perfect |
|------|:-----:|:----------:|:-------:|
| **Identical** (upstream applies as-is) | 85 | 93.7% | 77/85 |
| **Adapted** (requires modification) | 15 | 90.0% | 13/15 |

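
As a quick consistency check, the per-tier rows can be recombined into the overall figures. The short script below uses only numbers from the two tables; the variable names are illustrative.

```python
# Per-tier figures copied from the "By Tier" table.
tiers = {
    "identical": {"count": 85, "avg_recall": 93.7, "perfect": 77},
    "adapted": {"count": 15, "avg_recall": 90.0, "perfect": 13},
}

total = sum(t["count"] for t in tiers.values())
# Weight each tier's average recall by its number of examples.
overall_recall = sum(t["count"] * t["avg_recall"] for t in tiers.values()) / total
perfect = sum(t["perfect"] for t in tiers.values())

print(f"examples: {total}")
print(f"count-weighted recall: {overall_recall:.1f}%")
print(f"perfect: {perfect}/{total}")
```

The count-weighted average of 93.7% and 90.0% reproduces the 93.1% overall recall, and the per-tier perfect counts add up to the 90/100 reported in the overall table.
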
### Failure Analysis

The 3 zero-recall cases are all complex libvirt patches (multi-function
adaptations across large files with significant structural differences
between versions). These are known hard cases that likely require an
agentic approach with source tree context.

## Training Data

The v5 dataset contains real SUSE/openSUSE maintenance patches paired
with their upstream CVE fixes, converted to a per-hunk codegen format:

- **36,166 train + 1,834 eval** examples (strict CVE-level split, zero overlap)
- All examples use a **3-turn ChatML format** (system / user / assistant)
- Per-hunk extraction with 15-line context padding, nearby hunks merged
- Covers C, C++, Python, shell, Java, JavaScript, Go, and more
- Sources: openSUSE Build Service maintenance incidents

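
The per-hunk extraction with context padding and merging can be pictured roughly as follows. This is a sketch under assumptions, not the dataset's actual preprocessing code: the function name `hunk_regions` is hypothetical, and "nearby" is interpreted here as "the padded regions overlap or are adjacent", which the list above does not spell out.

```python
def hunk_regions(changed_ranges, file_length, context=15):
    """Pad each changed range with context lines and merge overlapping regions.

    changed_ranges: list of (start, end) 1-based inclusive line ranges.
    Returns a sorted list of merged (start, end) regions clipped to the file.
    """
    regions = []
    for start, end in sorted(changed_ranges):
        lo = max(1, start - context)
        hi = min(file_length, end + context)
        # Merge with the previous region when the padded ranges touch or overlap.
        if regions and lo <= regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], max(regions[-1][1], hi))
        else:
            regions.append((lo, hi))
    return regions


# Two changes close together and one far away, in a 500-line file.
print(hunk_regions([(115, 115), (140, 142), (400, 401)], file_length=500))
```

With 15 lines of padding, a single changed line at 115 yields the region 100-130, and the change at lines 140-142 is close enough to be folded into the same region, so the model sees both edits with shared context.
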
### Input Format

````
## File: path/to/file.c
## Lines: 100-130

```c
/* 15 lines before the change */
vulnerable_code_here();
/* 15 lines after the change */
```

## Fix
Description of what the upstream patch changes in this region.
````

### Output Format

The model outputs the fixed version of the code region (just the code,
no diff headers or markup).

## Usage

### With llama.cpp / llama-server (GGUF)

```bash
llama-server \
    --model cve-backport-codegen-v5-q8_0.gguf \
    --port 8403 \
    --n-gpu-layers 99 \
    --ctx-size 8192
```

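
Once the server is running, it can be queried over llama-server's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at `localhost:8403` as in the command above. The helper names (`build_user_message`, `build_request`, `backport_region`) are hypothetical, and the system prompt is the one from the prompt template in this card.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a security patch backporting assistant.\n\n"
    "Given vulnerable source code and a description of the upstream fix, "
    "output the FIXED version of the code.\n\n"
    "Rules:\n"
    "- Output ONLY the fixed code, nothing else\n"
    "- Preserve all surrounding context exactly\n"
    "- Apply only the described fix"
)


def build_user_message(path, start, end, code, fix, lang="c"):
    """Format one code region and its fix description in the card's input format."""
    fence = "`" * 3
    return (
        f"## File: {path}\n## Lines: {start}-{end}\n\n"
        f"{fence}{lang}\n{code}\n{fence}\n\n## Fix\n{fix}"
    )


def build_request(user_message, url="http://localhost:8403/v1/chat/completions"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0,  # matches the evaluation setting
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def backport_region(user_message):
    """Send the request to a running llama-server and return the fixed code."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running llama-server):
# msg = build_user_message("crypto/bn/bn.h", 280, 310,
#                          "/* source code region */",
#                          "Add bounds check for BN_num_bits.")
# print(backport_region(msg))
```

Because the chat endpoint applies the model's chat template on the server side, the client only needs to send the system and user messages; the ChatML markers do not have to be added by hand.
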
### With the CVE Backport Tool

The recommended way to use this model is via the
[cve-backport-tool](https://github.com/openSUSE/cve-backport-tool),
which handles patch parsing, source extraction, model inference, and
diff generation:

```bash
python3 cve-backport.py \
    --cve CVE-2024-1234 \
    --package openssl-1.1.1d \
    --patch upstream.patch \
    --source-dir /path/to/source/ \
    --backend openai \
    --retry 3
```

### With transformers + PEFT (adapter)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, spread across the available GPUs.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="bfloat16",
    device_map="auto",
)
# Attach the LoRA adapter from this repository on top of the base model.
model = PeftModel.from_pretrained(base, "anicka/cve-backport-codegen-v5-qwen25-32b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```

### Prompt Template (ChatML)

````
<|im_start|>system
You are a security patch backporting assistant.

Given vulnerable source code and a description of the upstream fix, output the FIXED version of the code.

Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix
<|im_end|>
<|im_start|>user
## File: crypto/bn/bn.h
## Lines: 280-310

```c
/* source code region */
```

## Fix
Add bounds check for BN_num_bits to prevent buffer over-read.
<|im_end|>
<|im_start|>assistant
````

## Limitations

- **Best at identical-tier patches** (upstream fix applies directly): 93.7% recall
- **Good at adapted patches** (90% recall), but complex multi-function adaptations
  across structurally different versions remain challenging
- **Context window**: the 4,096-token training limit means very large functions or
  multi-file patches may be truncated
- **No compilation feedback**: the model generates code in a single pass without
  verifying that it compiles. Use `--retry` in the CLI tool for iterative correction.
- Always review generated patches before applying them to production systems

## Related

- **MoE sibling**: [anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b) (Qwen3-Coder-30B-A3B, 3B active, MoE; 91.9% recall on the same n=100 eval, ~10× faster inference)
- **openSUSE mirror**: [openSUSE/CVE-Backport-Qwen2.5-Coder-32B](https://huggingface.co/openSUSE/CVE-Backport-Qwen2.5-Coder-32B)
- **CLI tool**: [openSUSE/cve-backport-tool](https://github.com/openSUSE/cve-backport-tool)
- **Dataset**: [anicka/cve-backport-codegen-dataset](https://huggingface.co/datasets/anicka/cve-backport-codegen-dataset)
- **Training pipeline**: [teapot](https://github.com/anicka-net/teapot)
- **Previous version (v1)**: [anicka/cve-backport-codegen-qwen25-32b-v1](https://huggingface.co/anicka/cve-backport-codegen-qwen25-32b-v1)

## Citation

```bibtex
@misc{cve-backport-codegen-v5,
  title={CVE Backport Codegen v5: Fine-tuned Qwen2.5-Coder-32B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b}
}
```