Initialize project; model provided by the ModelHub XC community

Model: openSUSE/CVE-Backport-Qwen2.5-Coder-32B
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-15 22:52:28 +08:00
commit b86a0b588f
33 changed files with 154352 additions and 0 deletions

261
v5-lora-adapter/README.md Normal file

@@ -0,0 +1,261 @@
---
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
tags:
- security
- cve
- patches
- backporting
- opensuse
- suse
- linux
- code-generation
- lora
- qlora
- transformers
datasets:
- anicka/cve-backport-codegen-dataset
model-index:
- name: cve-backport-codegen-v5-qwen25-32b
  results:
  - task:
      type: text-generation
      name: Security Patch Backporting
    dataset:
      type: anicka/cve-backport-codegen-dataset
      name: CVE Backport Codegen Dataset
    metrics:
    - name: Recall
      type: recall
      value: 0.931
    - name: Precision
      type: precision
      value: 0.944
    - name: Exact Match
      type: exact_match
      value: 0.83
---
# CVE Backport Codegen v5 — Qwen2.5-Coder-32B QLoRA
Fine-tuned code generation model for backporting upstream CVE security fixes
to older SUSE/openSUSE package versions. Given vulnerable source code and an
upstream fix description, the model outputs the corrected code. A separate
tool then diffs the output against the original to produce a patch.
This is a **per-hunk code generation** approach: the model sees one region of
source code at a time and returns the fixed version, rather than generating
raw unified diffs. This yields higher accuracy than patch-format models
because the model works in its natural domain (code) rather than a
meta-format (diffs).
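The diffing step that follows generation can be sketched with Python's standard `difflib`. This is a minimal illustration, not the tool's actual implementation; the `make_patch` helper, the sample C snippet, and the `src/example.c` path are hypothetical.

```python
import difflib


def make_patch(original: str, fixed: str, path: str) -> str:
    """Return a unified diff that turns `original` into `fixed`."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical vulnerable region and the fixed region a model might return.
original = "int get(int *buf, int i)\n{\n    return buf[i];\n}\n"
fixed = (
    "int get(int *buf, int i)\n{\n"
    "    if (i < 0 || i >= BUF_LEN)\n"
    "        return -1;\n"
    "    return buf[i];\n}\n"
)

patch = make_patch(original, fixed, "src/example.c")
print(patch)
```

Because the model only ever returns code, the patch headers and hunk offsets come from a deterministic tool rather than from generated text.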
## What's New in v5
v5 uses a unified **codegen-only dataset** — all 36,166 training examples
follow the same 3-turn format (system / user with code + fix description /
assistant with fixed code). v4 mixed in 5-turn test-generation examples;
v5 drops those to focus entirely on codegen quality.
| Metric | v5 | v4 | v1 |
|--------|:--:|:--:|:--:|
| **Recall** | **93.1%** | 93% | 91% |
| **Precision** | **94.4%** | 95% | — |
| **Exact match** | **83/100** | 87/100 | — |
| **Adapted recall** | **90.0%** | 86% | 71% |
| **Identical recall** | 93.7% | 94% | 94% |
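Adapted-tier recall has steadily improved: 71% (v1) → 86% (v4) → **90% (v5)**.
The codegen-only dataset gives the model a cleaner training signal for the
core task. Precision and exact match are slightly below v4 (94.4% vs. 95% and
83 vs. 87 out of 100), so the gain is concentrated in the adapted tier.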
## Model Details
| | |
|---|---|
| **Base model** | [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) |
| **Method** | QLoRA (4-bit NF4, double quantization, bf16 compute) |
| **LoRA rank / alpha** | 64 / 128 |
| **LoRA dropout** | 0.05 |
| **LoRA targets** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Training data** | 36,166 train / 1,834 eval examples |
| **Epochs** | 2 (8,228 steps) |
| **Effective batch size** | 8 (1 × grad_accum 8) |
| **Learning rate** | 1e-4 (cosine schedule, 5% warmup) |
| **Max sequence length** | 4,096 tokens |
| **Optimizer** | AdamW fused, weight decay 0.01 |
| **Hardware** | 2× NVIDIA H100 NVL 94GB |
| **Training time** | 46.1 hours |
| **Train loss (avg)** | 0.0215 |
| **Eval loss (final)** | 0.00602 |
| **PEFT version** | 0.18.1 |
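As a rough sanity check on the LoRA settings above, the number of trainable adapter parameters can be estimated from the rank and the shapes of the targeted projections. The layer dimensions below are assumptions taken from the published Qwen2.5-32B configuration, not from this README, and the 4-bytes-per-parameter figure assumes the adapter is stored in fp32.

```python
# Assumed Qwen2.5-32B dimensions (not stated in this README).
HIDDEN = 5120         # hidden size
INTERMEDIATE = 27648  # MLP intermediate size
KV_DIM = 1024         # 8 key/value heads x 128 head dim
LAYERS = 64

# LoRA settings from the table above.
RANK = 64
ALPHA = 128

# (in_features, out_features) of each targeted projection in one layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int, shapes: dict) -> int:
    """Each adapted matrix adds A (rank x in) and B (out x rank)."""
    return sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())


per_layer = lora_params(RANK, TARGETS)
total = per_layer * LAYERS
scaling = ALPHA / RANK  # standard LoRA scaling, since use_rslora is false

print(f"trainable LoRA parameters: {total:,}")
print(f"approx. fp32 size: {total * 4:,} bytes")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the estimate comes to about 537M parameters, or roughly 2.15 GB in fp32, which is close to the size of the roughly 2.1 GB Git LFS object listed after the adapter config in this commit (presumably the adapter weights).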
## Files
This repository contains:
- **LoRA adapter** (`adapter_model.safetensors`, `adapter_config.json`) — merge with the base model using PEFT
- **GGUF Q8_0** (`cve-backport-codegen-v5-q8_0.gguf`, 33GB) — ready for llama.cpp / ollama
## Evaluation
Evaluated on 100 held-out examples (zero CVE overlap with training) using
the Q8_0 GGUF served via llama-server (temperature=0, ctx=8192).
### Overall
| Metric | Value |
|--------|-------|
| Avg recall | 93.1% |
| Avg precision | 94.4% |
| Exact match | 83/100 |
| Perfect (100% recall) | 90/100 |
| Failures (0% recall) | 3/100 |
### By Tier
| Tier | Count | Avg Recall | Perfect |
|------|:-----:|:----------:|:-------:|
| **Identical** (upstream applies as-is) | 85 | 93.7% | 77/85 |
| **Adapted** (requires modification) | 15 | 90.0% | 13/15 |
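The overall figures follow from the per-tier rows as a count-weighted average. A quick check, using only the numbers in the tables above:

```python
# (examples, average recall %, perfect-recall count) per tier
tiers = {
    "identical": (85, 93.7, 77),
    "adapted": (15, 90.0, 13),
}

total = sum(n for n, _, _ in tiers.values())
overall_recall = sum(n * r for n, r, _ in tiers.values()) / total
perfect = sum(p for _, _, p in tiers.values())

print(f"{overall_recall:.1f}% recall over {total} examples, {perfect} perfect")
```

This reproduces the 93.1% average recall and the 90/100 perfect count reported under Overall.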
### Failure Analysis
The 3 zero-recall cases are all complex libvirt patches (multi-function
adaptations across large files with significant structural differences
between versions). These are known hard cases that likely require an
agentic approach with source tree context.
## Training Data
The v5 dataset contains real SUSE/openSUSE maintenance patches paired
with their upstream CVE fixes, converted to a per-hunk codegen format:
- **36,166 train + 1,834 eval** examples (strict CVE-level split, zero overlap)
- All examples use a **3-turn ChatML format** (system / user / assistant)
- Per-hunk extraction with 15-line context padding, nearby hunks merged
- Covers C, C++, Python, shell, Java, JavaScript, Go, and more
- Sources: openSUSE Build Service maintenance incidents
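The per-hunk extraction described above can be sketched as follows. This is an illustrative reconstruction, not the dataset's actual code: the `padded_regions` helper is hypothetical, and the merge rule (join regions whose padded windows overlap or touch) is an assumption, since the README only says that nearby hunks are merged.

```python
def padded_regions(changed_ranges, total_lines, context=15):
    """Pad each 1-based inclusive (start, end) range with `context` lines
    on both sides, clamp to the file, and merge windows that overlap or touch."""
    regions = []
    for start, end in sorted(changed_ranges):
        lo = max(1, start - context)
        hi = min(total_lines, end + context)
        if regions and lo <= regions[-1][1] + 1:
            # Window reaches the previous region: extend it instead of starting a new one.
            regions[-1][1] = max(regions[-1][1], hi)
        else:
            regions.append([lo, hi])
    return [tuple(r) for r in regions]


# Hypothetical changes at line 115, lines 120-122, and lines 400-401 of a 410-line file.
regions = padded_regions([(115, 115), (120, 122), (400, 401)], total_lines=410)
print(regions)  # [(100, 137), (385, 410)]
```

A single change at line 115 on its own would yield the region 100-130.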
### Input Format
````
## File: path/to/file.c
## Lines: 100-130
```c
/* 15 lines before the change */
vulnerable_code_here();
/* 15 lines after the change */
```
## Fix
Description of what the upstream patch changes in this region.
````
### Output Format
The model outputs the fixed version of the code region (just the code,
no diff headers or markup).
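A user message in this format could be assembled with a small helper like the one below. The `build_user_message` function is hypothetical, and the exact whitespace between sections is an assumption; match the dataset's own formatting if it differs.

```python
FENCE = "`" * 3  # Markdown code fence, built indirectly so this example nests cleanly


def build_user_message(path, start, end, code, fix, lang="c"):
    """Format one source region and its fix description as a user turn."""
    return (
        f"## File: {path}\n"
        f"## Lines: {start}-{end}\n"
        f"{FENCE}{lang}\n"
        f"{code.rstrip()}\n"
        f"{FENCE}\n"
        f"## Fix\n"
        f"{fix.strip()}"
    )


message = build_user_message(
    "crypto/bn/bn.h",
    280,
    310,
    "/* source code region */\n",
    "Add bounds check for BN_num_bits to prevent buffer over-read.",
)
print(message)
```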
## Usage
### With llama.cpp / llama-server (GGUF)
```bash
llama-server \
  --model cve-backport-codegen-v5-q8_0.gguf \
  --port 8403 \
  --n-gpu-layers 99 \
  --ctx-size 8192
```
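Once the server is running, it can be queried through llama-server's OpenAI-compatible chat endpoint. The sketch below uses only the standard library; the `build_request` and `request_fix` helpers are hypothetical, the port matches the command above, and the system prompt is copied from the Prompt Template section (its exact line breaks are an assumption). It relies on the server applying the chat template embedded in the GGUF.

```python
import json
import urllib.request

# System prompt from the Prompt Template section; line breaks are assumed.
SYSTEM_PROMPT = (
    "You are a security patch backporting assistant.\n"
    "Given vulnerable source code and a description of the upstream fix, "
    "output the FIXED version of the code.\n"
    "Rules:\n"
    "- Output ONLY the fixed code, nothing else\n"
    "- Preserve all surrounding context exactly\n"
    "- Apply only the described fix"
)


def build_request(user_message, base_url="http://localhost:8403"):
    """Build a chat-completions request; temperature 0 mirrors the evaluation setup."""
    payload = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_fix(user_message):
    """Send one region to the running server and return the generated code."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call `request_fix(...)` with a user message in the input format described earlier; the return value is the fixed code region as plain text.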
### With the CVE Backport Tool
The recommended way to use this model is via the
[cve-backport-tool](https://github.com/anicka-net/cve-backport-tool),
which handles patch parsing, source extraction, model inference, and
diff generation:
```bash
python3 cve-backport.py \
  --cve CVE-2024-1234 \
  --package openssl-1.1.1d \
  --patch upstream.patch \
  --source-dir /path/to/source/ \
  --backend openai \
  --retry 3
```
### With transformers + PEFT (adapter)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16 and spread it across the available devices.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="bfloat16",
    device_map="auto",
)
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "anicka/cve-backport-codegen-v5-qwen25-32b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
### Prompt Template (ChatML)
````
<|im_start|>system
You are a security patch backporting assistant.
Given vulnerable source code and a description of the upstream fix, output the FIXED version of the code.
Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix
<|im_end|>
<|im_start|>user
## File: crypto/bn/bn.h
## Lines: 280-310
```c
/* source code region */
```
## Fix
Add bounds check for BN_num_bits to prevent buffer over-read.
<|im_end|>
<|im_start|>assistant
````
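For backends that take a raw prompt string, the ChatML layout can be rendered by hand. The `render_chatml` helper below is a hypothetical sketch that mirrors the no-tools path of the tokenizer's chat template when the first message is a system message; with transformers, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the usual way to get the same string.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a security patch backporting assistant."},
        {"role": "user", "content": "## File: crypto/bn/bn.h\n## Lines: 280-310"},
    ]
)
print(prompt)
```

Generation should stop at `<|im_end|>`, which the tokenizer config in this repository sets as the EOS token.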
## Limitations
- **Best at identical-tier patches** (upstream fix applies directly) — 93.7% recall
- **Good at adapted patches** (90% recall) but complex multi-function adaptations
across structurally different versions remain challenging
- **Context window**: the 4,096-token training limit means very large functions or
  multi-file patches may be truncated
- **No compilation feedback**: the model generates code in a single pass without
verifying it compiles. Use `--retry` in the CLI tool for iterative correction.
- Always review generated patches before applying to production systems
## Related
- **CLI tool**: [cve-backport-tool](https://github.com/anicka-net/cve-backport-tool)
- **Dataset**: [anicka/cve-backport-codegen-dataset](https://huggingface.co/datasets/anicka/cve-backport-codegen-dataset)
- **Previous version (v1)**: [anicka/cve-backport-codegen-qwen25-32b-v1](https://huggingface.co/anicka/cve-backport-codegen-qwen25-32b-v1)
## Citation
```bibtex
@misc{cve-backport-codegen-v5,
  title={CVE Backport Codegen v5: Fine-tuned Qwen2.5-Coder-32B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b}
}
```


@@ -0,0 +1,46 @@
{
"alora_invocation_tokens": null,
"alpha_pattern": {},
"arrow_config": null,
"auto_mapping": null,
"base_model_name_or_path": "Qwen/Qwen2.5-Coder-32B-Instruct",
"bias": "none",
"corda_config": null,
"ensure_weight_tying": false,
"eva_config": null,
"exclude_modules": null,
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 128,
"lora_bias": false,
"lora_dropout": 0.05,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"peft_version": "0.18.1",
"qalora_group_size": 16,
"r": 64,
"rank_pattern": {},
"revision": null,
"target_modules": [
"gate_proj",
"o_proj",
"down_proj",
"up_proj",
"k_proj",
"v_proj",
"q_proj"
],
"target_parameters": null,
"task_type": "CAUSAL_LM",
"trainable_token_indices": null,
"use_dora": false,
"use_qalora": false,
"use_rslora": false
}


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:644bd0c861027440a38e5a6d59e4fc8e5629568a86a68881f735d68dd04b839c
size 2147605960


@@ -0,0 +1,24 @@
{
"</tool_call>": 151658,
"<tool_call>": 151657,
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|file_sep|>": 151664,
"<|fim_middle|>": 151660,
"<|fim_pad|>": 151662,
"<|fim_prefix|>": 151659,
"<|fim_suffix|>": 151661,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|repo_name|>": 151663,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}


@@ -0,0 +1,54 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0]['role'] == 'system' %}
{{- messages[0]['content'] }}
{%- else %}
{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
{%- endif %}
{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0]['role'] == 'system' %}
{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{{- '<|im_start|>' + message.role }}
{%- if message.content %}
{{- '\n' + message.content }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{{- tool_call.arguments | tojson }}
{{- '}\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}

151388
v5-lora-adapter/merges.txt Normal file

File diff suppressed because it is too large


@@ -0,0 +1,31 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83396048d512ec1f3178af0d7c1f79a226bba041822614b0e26a4fd2d4b55bf7
size 11421995


@@ -0,0 +1,207 @@
{
"add_bos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151657": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151658": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151659": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151660": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151661": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151662": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151663": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151664": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"extra_special_tokens": {},
"model_max_length": 32768,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18d5482439b903314c5777c6cb1050193782f8e89ed3d18122237dc3b827c686
size 5905

File diff suppressed because one or more lines are too long