Initialize project; model provided by the ModelHub XC community

Model: openSUSE/CVE-Backport-Qwen2.5-Coder-32B Source: Original Platform
2026-05-15 22:52:28 +08:00
commit b86a0b588f
33 changed files with 154352 additions and 0 deletions
--- a/v5-lora-adapter/README.md
+++ b/v5-lora-adapter/README.md
@@ -0,0 +1,261 @@
+---
+base_model: Qwen/Qwen2.5-Coder-32B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+license: apache-2.0
+language:
+  - en
+tags:
+  - security
+  - cve
+  - patches
+  - backporting
+  - opensuse
+  - suse
+  - linux
+  - code-generation
+  - lora
+  - qlora
+  - transformers
+datasets:
+  - anicka/cve-backport-codegen-dataset
+model-index:
+  - name: cve-backport-codegen-v5-qwen25-32b
+    results:
+      - task:
+          type: text-generation
+          name: Security Patch Backporting
+        dataset:
+          type: anicka/cve-backport-codegen-dataset
+          name: CVE Backport Codegen Dataset
+        metrics:
+          - name: Recall
+            type: recall
+            value: 0.931
+          - name: Precision
+            type: precision
+            value: 0.944
+          - name: Exact Match
+            type: exact_match
+            value: 0.83
+---
+
+# CVE Backport Codegen v5 — Qwen2.5-Coder-32B QLoRA
+
+Fine-tuned code generation model for backporting upstream CVE security fixes
+to older SUSE/openSUSE package versions. Given vulnerable source code and an
+upstream fix description, the model outputs the corrected code. A separate
+tool then diffs the output against the original to produce a patch.
+
+This is a **per-hunk code generation** approach: the model sees one region of
+source code at a time and returns the fixed version, rather than generating
+raw unified diffs. This yields higher accuracy than patch-format models
+because the model works in its natural domain (code) rather than a
+meta-format (diffs).
+
+## What's New in v5
+
+v5 uses a unified **codegen-only dataset** — all 36,166 training examples
+follow the same 3-turn format (system / user with code + fix description /
+assistant with fixed code). v4 mixed in 5-turn test-generation examples;
+v5 drops those to focus entirely on codegen quality.
+
+| Metric | v5 | v4 | v1 |
+|--------|:--:|:--:|:--:|
+| **Recall** | **93.1%** | 93% | 91% |
+| **Precision** | **94.4%** | 95% | — |
+| **Exact match** | **83/100** | 87/100 | — |
+| **Adapted recall** | **90.0%** | 86% | 71% |
+| **Identical recall** | 93.7% | 94% | 94% |
+
+Adapted-tier recall has steadily improved: 71% (v1) → 86% (v4) → **90% (v5)**.
+The codegen-only dataset gives the model a cleaner training signal for the
+core task.
+
+## Model Details
+
+| | |
+|---|---|
+| **Base model** | [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) |
+| **Method** | QLoRA (4-bit NF4, double quantization, bf16 compute) |
+| **LoRA rank / alpha** | 64 / 128 |
+| **LoRA dropout** | 0.05 |
+| **LoRA targets** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
+| **Training data** | 36,166 train / 1,834 eval examples |
+| **Epochs** | 2 (8,228 steps) |
+| **Effective batch size** | 8 (1 × grad_accum 8) |
+| **Learning rate** | 1e-4 (cosine schedule, 5% warmup) |
+| **Max sequence length** | 4,096 tokens |
+| **Optimizer** | AdamW fused, weight decay 0.01 |
+| **Hardware** | 2× NVIDIA H100 NVL 94GB |
+| **Training time** | 46.1 hours |
+| **Train loss (avg)** | 0.0215 |
+| **Eval loss (final)** | 0.00602 |
+| **PEFT version** | 0.18.1 |
+
+## Files
+
+This repository contains:
+
+- **LoRA adapter** (`adapter_model.safetensors`, `adapter_config.json`) — merge with the base model using PEFT
+- **GGUF Q8_0** (`cve-backport-codegen-v5-q8_0.gguf`, 33GB) — ready for llama.cpp / ollama
+
+## Evaluation
+
+Evaluated on 100 held-out examples (zero CVE overlap with training) using
+the Q8_0 GGUF served via llama-server (temperature=0, ctx=8192).
+
+### Overall
+
+| Metric | Value |
+|--------|-------|
+| Avg recall | 93.1% |
+| Avg precision | 94.4% |
+| Exact match | 83/100 |
+| Perfect (100% recall) | 90/100 |
+| Failures (0% recall) | 3/100 |
+
+### By Tier
+
+| Tier | Count | Avg Recall | Perfect |
+|------|:-----:|:----------:|:-------:|
+| **Identical** (upstream applies as-is) | 85 | 93.7% | 77/85 |
+| **Adapted** (requires modification) | 15 | 90.0% | 13/15 |
+
+### Failure Analysis
+
+The 3 zero-recall cases are all complex libvirt patches (multi-function
+adaptations across large files with significant structural differences
+between versions). These are known hard cases that likely require an
+agentic approach with source tree context.
+
+## Training Data
+
+The v5 dataset contains real SUSE/openSUSE maintenance patches paired
+with their upstream CVE fixes, converted to a per-hunk codegen format:
+
+- **36,166 train + 1,834 eval** examples (strict CVE-level split, zero overlap)
+- All examples use a **3-turn ChatML format** (system / user / assistant)
+- Per-hunk extraction with 15-line context padding, nearby hunks merged
+- Covers C, C++, Python, shell, Java, JavaScript, Go, and more
+- Sources: openSUSE Build Service maintenance incidents
+
+### Input Format
+
+````
+## File: path/to/file.c
+## Lines: 100-130
+
+```c
+/* 15 lines before the change */
+vulnerable_code_here();
+/* 15 lines after the change */
+```
+
+## Fix
+Description of what the upstream patch changes in this region.
+````
+
+### Output Format
+
+The model outputs the fixed version of the code region (just the code,
+no diff headers or markup).
+
+## Usage
+
+### With llama.cpp / llama-server (GGUF)
+
+```bash
+llama-server \
+    --model cve-backport-codegen-v5-q8_0.gguf \
+    --port 8403 \
+    --n-gpu-layers 99 \
+    --ctx-size 8192
+```
+
+### With the CVE Backport Tool
+
+The recommended way to use this model is via the
+[cve-backport-tool](https://github.com/anicka-net/cve-backport-tool),
+which handles patch parsing, source extraction, model inference, and
+diff generation:
+
+```bash
+python3 cve-backport.py \
+    --cve CVE-2024-1234 \
+    --package openssl-1.1.1d \
+    --patch upstream.patch \
+    --source-dir /path/to/source/ \
+    --backend openai \
+    --retry 3
+```
+
+### With transformers + PEFT (adapter)
+
+```python
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+base = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen2.5-Coder-32B-Instruct",
+    torch_dtype="bfloat16",  # same bf16 compute dtype as in training
+    device_map="auto",  # spread layers across the available devices
+)
+model = PeftModel.from_pretrained(base, "anicka/cve-backport-codegen-v5-qwen25-32b")  # attach the LoRA adapter
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
+```
+
+### Prompt Template (ChatML)
+
+````
+<|im_start|>system
+You are a security patch backporting assistant.
+
+Given vulnerable source code and a description of the upstream fix, output the FIXED version of the code.
+
+Rules:
+- Output ONLY the fixed code, nothing else
+- Preserve all surrounding context exactly
+- Apply only the described fix
+<|im_end|>
+<|im_start|>user
+## File: crypto/bn/bn.h
+## Lines: 280-310
+
+```c
+/* source code region */
+```
+
+## Fix
+Add bounds check for BN_num_bits to prevent buffer over-read.
+<|im_end|>
+<|im_start|>assistant
+````
+
+## Limitations
+
+- **Best at identical-tier patches** (upstream fix applies directly) — 93.7% recall
+- **Good at adapted patches** (90% recall) but complex multi-function adaptations
+  across structurally different versions remain challenging
+- **Context window**: 4,096 token training limit means very large functions or
+  multi-file patches may be truncated
+- **No compilation feedback**: the model generates code in a single pass without
+  verifying it compiles. Use `--retry` in the CLI tool for iterative correction.
+- Always review generated patches before applying to production systems
+
+## Related
+
+- **CLI tool**: [cve-backport-tool](https://github.com/anicka-net/cve-backport-tool)
+- **Dataset**: [anicka/cve-backport-codegen-dataset](https://huggingface.co/datasets/anicka/cve-backport-codegen-dataset)
+- **Previous version (v1)**: [anicka/cve-backport-codegen-qwen25-32b-v1](https://huggingface.co/anicka/cve-backport-codegen-qwen25-32b-v1)
+
+## Citation
+
+```bibtex
+@misc{cve-backport-codegen-v5,
+  title={CVE Backport Codegen v5: Fine-tuned Qwen2.5-Coder-32B for Security Patch Backporting},
+  author={Anna Maresova},
+  year={2026},
+  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b}
+}
+```
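The per-hunk flow described in the README above can be sketched in a few lines: format one source region into the user message shown under "Input Format", then diff the returned code against the original region to obtain a patch. This is only an illustration under stated assumptions. The helper names `build_user_message` and `region_to_patch` and the sample C snippet are hypothetical, the model reply is hard-coded rather than generated, and the actual cve-backport-tool may work differently.

```python
import difflib


def build_user_message(path, start, end, code, fix, lang="c"):
    """Format one source region following the README's Input Format."""
    fence = "`" * 3  # code fence marker, built indirectly to keep this listing simple
    return (
        f"## File: {path}\n"
        f"## Lines: {start}-{end}\n\n"
        f"{fence}{lang}\n{code.rstrip()}\n{fence}\n\n"
        f"## Fix\n{fix}\n"
    )


def region_to_patch(path, original, fixed):
    """Diff the original region against the fixed region returned by the model."""
    old = original.splitlines(keepends=True)
    new = fixed.splitlines(keepends=True)
    diff = difflib.unified_diff(old, new, fromfile=f"a/{path}", tofile=f"b/{path}")
    return "".join(diff)


# Hypothetical vulnerable region and a hard-coded stand-in for the model reply.
original = "int n = get_len(buf);\nmemcpy(dst, buf, n);\n"
fixed = (
    "int n = get_len(buf);\n"
    "if (n > DST_MAX)\n"
    "    return -1;\n"
    "memcpy(dst, buf, n);\n"
)

message = build_user_message(
    "src/copy.c", 100, 101, original,
    "Reject lengths above DST_MAX before copying.",
)
patch = region_to_patch("src/copy.c", original, fixed)
print(message)
print(patch)
```

The hunk header produced here counts lines relative to the extracted region, so a real tool would still need to offset it by the region's starting line before the patch applies to the full file.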
--- a/v5-lora-adapter/adapter_config.json
+++ b/v5-lora-adapter/adapter_config.json
@@ -0,0 +1,46 @@
+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-Coder-32B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 128,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.18.1",
+  "qalora_group_size": 16,
+  "r": 64,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "gate_proj",
+    "o_proj",
+    "down_proj",
+    "up_proj",
+    "k_proj",
+    "v_proj",
+    "q_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
--- a/v5-lora-adapter/adapter_model.safetensors
+++ b/v5-lora-adapter/adapter_model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:644bd0c861027440a38e5a6d59e4fc8e5629568a86a68881f735d68dd04b839c
+size 2147605960
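As a rough cross-check of `adapter_config.json` against the LFS pointer above, the sketch below counts the LoRA parameters implied by `r = 64` on the seven target modules. The layer dimensions (hidden size 5120, MLP size 27648, 64 layers, 8 key/value heads of width 128) are assumptions based on the published Qwen2.5-32B architecture, not values from this commit, and the 4-byte element size assumes the adapter is stored in float32.

```python
# Values from adapter_config.json in this commit.
R = 64
LORA_ALPHA = 128

# Assumed Qwen2.5-32B dimensions (not stated anywhere in this commit).
HIDDEN = 5120
INTERMEDIATE = 27648
LAYERS = 64
KV_DIM = 8 * 128  # 8 key/value heads of width 128

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
per_layer = sum(R * (fin + fout) for fin, fout in TARGETS.values())
total_params = per_layer * LAYERS
scaling = LORA_ALPHA / R  # use_rslora is false, so the scale is alpha / r

lfs_size = 2147605960  # "size" field of the adapter_model.safetensors pointer
overhead = lfs_size - total_params * 4  # bytes left over, assuming float32

print(per_layer, total_params, scaling, overhead)
```

Under these assumptions the adapter holds 536,870,912 parameters, and the float32 payload leaves about 122 KB unaccounted for, which is a plausible size for the safetensors header.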
--- a/v5-lora-adapter/added_tokens.json
+++ b/v5-lora-adapter/added_tokens.json
@@ -0,0 +1,24 @@
+{
+  "</tool_call>": 151658,
+  "<tool_call>": 151657,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}
--- a/v5-lora-adapter/chat_template.jinja
+++ b/v5-lora-adapter/chat_template.jinja
@@ -0,0 +1,54 @@
+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
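To make the template above concrete, here is a plain-Python sketch of its non-tool path only: the leading system message (or the default Qwen one) is emitted first, the remaining messages follow, and the assistant header is appended when a generation prompt is requested. `render_chatml` is a hypothetical helper that ignores the `tools`, `tool_calls` and `tool` branches; for real use, the tokenizer's `apply_chat_template` method is the safer choice.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Render messages like the template's branch without tools."""
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
    else:
        system = DEFAULT_SYSTEM
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]

    for index, message in enumerate(messages):
        role = message["role"]
        # The first system message was already emitted above.
        if role == "system" and index == 0:
            continue
        if role in ("user", "system", "assistant"):
            parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")

    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a security patch backporting assistant."},
    {"role": "user", "content": "## File: crypto/bn/bn.h\n## Lines: 280-310"},
])
print(prompt)
```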
--- a/v5-lora-adapter/merges.txt
+++ b/v5-lora-adapter/merges.txt
--- a/v5-lora-adapter/special_tokens_map.json
+++ b/v5-lora-adapter/special_tokens_map.json
@@ -0,0 +1,31 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/v5-lora-adapter/tokenizer.json
+++ b/v5-lora-adapter/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:83396048d512ec1f3178af0d7c1f79a226bba041822614b0e26a4fd2d4b55bf7
+size 11421995
--- a/v5-lora-adapter/tokenizer_config.json
+++ b/v5-lora-adapter/tokenizer_config.json
@@ -0,0 +1,207 @@
+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 32768,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
--- a/v5-lora-adapter/training_args.bin
+++ b/v5-lora-adapter/training_args.bin
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:18d5482439b903314c5777c6cb1050193782f8e89ed3d18122237dc3b827c686
+size 5905
--- a/v5-lora-adapter/vocab.json
+++ b/v5-lora-adapter/vocab.json