Model: 0x404/ccs-code-llama-7b
Source: Original Platform
2026-05-25 09:27:19 +08:00


pipeline_tag: text-generation
tags: git commit, conventional commits
datasets: 0x404/ccs_dataset

ccs-code-llama-7b

ccs-code-llama-7b is a model based on code-llama-7b and fine-tuned with LoRA supervised fine-tuning (SFT). It classifies a git commit into one of ten Conventional Commits types: feat, fix, perf, style, refactor, docs, test, ci, build, and chore.

See our GitHub Repo for more details.

How to Use

The model takes two inputs: the commit message and the corresponding git diff. Both must be wrapped in a specific prompt format, which the following snippet builds:

from transformers import pipeline

pipe = pipeline("text-generation", model="0x404/ccs-code-llama-7b", device_map="auto")
tokenizer = pipe.tokenizer


def prepare_prompt(commit_message: str, git_diff: str, context_window: int = 1024):
    prompt_head = "<s>[INST] <<SYS>>\nYou are a commit classifier based on commit message and code diff.Please classify the given commit into one of the ten categories: docs, perf, style, refactor, feat, fix, test, ci, build, and chore. The definitions of each category are as follows:\n**feat**: Code changes aim to introduce new features to the codebase, encompassing both internal and user-oriented features.\n**fix**: Code changes aim to fix bugs and faults within the codebase.\n**perf**: Code changes aim to improve performance, such as enhancing execution speed or reducing memory consumption.\n**style**: Code changes aim to improve readability without affecting the meaning of the code. This type encompasses aspects like variable naming, indentation, and addressing linting or code analysis warnings.\n**refactor**: Code changes aim to restructure the program without changing its behavior, aiming to improve maintainability. To avoid confusion and overlap, we propose the constraint that this category does not include changes classified as ``perf'' or ``style''. Examples include enhancing modularity, refining exception handling, improving scalability, conducting code cleanup, and removing deprecated code.\n**docs**: Code changes that modify documentation or text, such as correcting typos, modifying comments, or updating documentation.\n**test**: Code changes that modify test files, including the addition or updating of tests.\n**ci**: Code changes to CI (Continuous Integration) configuration files and scripts, such as configuring or updating CI/CD scripts, e.g., ``.travis.yml'' and ``.github/workflows''.\n**build**: Code changes affecting the build system (e.g., Maven, Gradle, Cargo). Change examples include updating dependencies, configuring build configurations, and adding scripts.\n**chore**: Code changes for other miscellaneous tasks that do not neatly fit into any of the above categories.\n<</SYS>>\n\n"
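    # Tokenize the fixed system prompt; special tokens are already spelled out in the text.
    prompt_head_encoded = tokenizer.encode(prompt_head, add_special_tokens=False)

    # The commit message is capped at 64 tokens.
    prompt_message = f"- given commit message:\n{commit_message}\n"
    prompt_message_encoded = tokenizer.encode(prompt_message, max_length=64, truncation=True, add_special_tokens=False)

    # The diff gets whatever remains of the context window after the head and the message,
    # minus a 6-token margin (presumably for the closing " [/INST]" marker).
    prompt_diff = f"- given commit diff: \n{git_diff}\n"
    remaining_length = (context_window - len(prompt_head_encoded) - len(prompt_message_encoded) - 6)
    prompt_diff_encoded = tokenizer.encode(prompt_diff, max_length=remaining_length, truncation=True, add_special_tokens=False)

    # Close the instruction block and turn the token ids back into a prompt string.
    prompt_end = tokenizer.encode(" [/INST]", add_special_tokens=False)
    return tokenizer.decode(prompt_head_encoded + prompt_message_encoded + prompt_diff_encoded + prompt_end)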


def classify_commit(commit_message: str, git_diff: str, context_window: int = 1024):
    prompt = prepare_prompt(commit_message, git_diff, context_window)
    result = pipe(prompt, max_new_tokens=10, pad_token_id=pipe.tokenizer.eos_token_id)
    label = result[0]["generated_text"].split()[-1]
    return label
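
classify_commit takes the last whitespace-separated token of the generated text as the label, so an off-format generation would be returned unchecked. Below is a minimal sketch of a guard, assuming you would rather flag anything outside the ten types than pass it through. The names CCS_LABELS and normalize_label are hypothetical and not part of the model's repository.

```python
from typing import Optional

# The ten types named in the system prompt above.
CCS_LABELS = frozenset(
    {"docs", "perf", "style", "refactor", "feat", "fix", "test", "ci", "build", "chore"}
)


def normalize_label(raw: str) -> Optional[str]:
    """Return the label if it is one of the ten types, otherwise None."""
    # Drop surrounding whitespace and common stray punctuation, then lowercase.
    cleaned = raw.strip().strip("*.,:;`'\"").lower()
    return cleaned if cleaned in CCS_LABELS else None
```

For example, normalize_label(classify_commit(message, diff)) yields None when the generation does not end in a known type, which the caller can treat as "unclassified".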

The classify_commit function takes a commit's message and git diff and returns the predicted type. The context_window parameter sets the token budget for the entire prompt; it defaults to 1024 and can be raised (for example, to 2048) so that more of the git diff fits in one prompt. Here is an example of its usage:

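import requests
from github import Github  # provided by the PyGithub package

def fetch_message_and_diff(repo_name, commit_sha):
    g = Github()  # unauthenticated client, subject to GitHub's anonymous rate limits
    try:
        repo = g.get_repo(repo_name)
        commit = repo.get_commit(commit_sha)
        if commit.parents:
            # Diff the commit against its first parent.
            parent_sha = commit.parents[0].sha
            diff_url = repo.compare(parent_sha, commit_sha).diff_url
            response = requests.get(diff_url, timeout=30)
            response.raise_for_status()  # do not classify an HTTP error page as a diff
            return commit.commit.message, response.text
        else:
            raise ValueError("No parent found for this commit, unable to retrieve diff.")
    except Exception as e:
        # Keep the original exception attached as the cause.
        raise RuntimeError(f"Error retrieving commit information: {e}") from e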

message, diff = fetch_message_and_diff("pytorch/pytorch", "9856bc50a251ac054debfdbbb5ed29fc4f6aeb39")
print(classify_commit(message, diff))

Here, fetch_message_and_diff retrieves the commit message and the diff against the first parent for a given SHA in a GitHub repository, and classify_commit then predicts the commit's type.
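
For commits in a local clone, the same two inputs can be read with the git command line instead of the GitHub API. The following is a sketch under that assumption, not part of the model's repository: git_commands and fetch_local_message_and_diff are hypothetical helper names, and git must be installed and on PATH.

```python
import subprocess
from typing import List, Tuple


def git_commands(commit_sha: str) -> Tuple[List[str], List[str]]:
    """Build the argv lists for reading a commit's message and its diff."""
    # %B prints the raw commit message (subject and body); -s suppresses the patch.
    message_cmd = ["git", "show", "-s", "--format=%B", commit_sha]
    # Diff against the first parent, mirroring the GitHub example above.
    diff_cmd = ["git", "diff", f"{commit_sha}^1", commit_sha]
    return message_cmd, diff_cmd


def fetch_local_message_and_diff(repo_path: str, commit_sha: str) -> Tuple[str, str]:
    """Run both commands inside repo_path and return (message, diff)."""
    message_cmd, diff_cmd = git_commands(commit_sha)
    # check=True raises CalledProcessError on failure, e.g. for a root commit
    # that has no parent to diff against.
    message = subprocess.run(
        message_cmd, cwd=repo_path, check=True, capture_output=True, text=True
    ).stdout
    diff = subprocess.run(
        diff_cmd, cwd=repo_path, check=True, capture_output=True, text=True
    ).stdout
    return message.strip(), diff
```

The returned pair can be passed straight to classify_commit, for instance with repo_path set to "." and commit_sha set to "HEAD".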