---
pipeline_tag: text-generation
tags:
- git commit
- conventional commits
datasets:
- 0x404/ccs_dataset
---
# ccs-code-llama-7b
ccs-code-llama-7b is a model based on code-llama-7b and fine-tuned with LoRA SFT. It automatically classifies a git commit into one of ten [conventional commits types](https://www.conventionalcommits.org/en/v1.0.0/): docs, perf, style, refactor, feat, fix, test, ci, build, and chore.
See our [GitHub Repo](https://github.com/0x404/conventional-commit-classification) for more details.
## How to Use
The model takes two inputs: the commit message and the corresponding git diff. Both must be wrapped in a specific prompt format, which the following snippet builds:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="0x404/ccs-code-llama-7b", device_map="auto")
tokenizer = pipe.tokenizer


def prepare_prompt(commit_message: str, git_diff: str, context_window: int = 1024):
    # System prompt with the ten category definitions; keep the text unchanged.
    prompt_head = "<s>[INST] <<SYS>>\nYou are a commit classifier based on commit message and code diff.Please classify the given commit into one of the ten categories: docs, perf, style, refactor, feat, fix, test, ci, build, and chore. The definitions of each category are as follows:\n**feat**: Code changes aim to introduce new features to the codebase, encompassing both internal and user-oriented features.\n**fix**: Code changes aim to fix bugs and faults within the codebase.\n**perf**: Code changes aim to improve performance, such as enhancing execution speed or reducing memory consumption.\n**style**: Code changes aim to improve readability without affecting the meaning of the code. This type encompasses aspects like variable naming, indentation, and addressing linting or code analysis warnings.\n**refactor**: Code changes aim to restructure the program without changing its behavior, aiming to improve maintainability. To avoid confusion and overlap, we propose the constraint that this category does not include changes classified as ``perf'' or ``style''. Examples include enhancing modularity, refining exception handling, improving scalability, conducting code cleanup, and removing deprecated code.\n**docs**: Code changes that modify documentation or text, such as correcting typos, modifying comments, or updating documentation.\n**test**: Code changes that modify test files, including the addition or updating of tests.\n**ci**: Code changes to CI (Continuous Integration) configuration files and scripts, such as configuring or updating CI/CD scripts, e.g., ``.travis.yml'' and ``.github/workflows''.\n**build**: Code changes affecting the build system (e.g., Maven, Gradle, Cargo). Change examples include updating dependencies, configuring build configurations, and adding scripts.\n**chore**: Code changes for other miscellaneous tasks that do not neatly fit into any of the above categories.\n<</SYS>>\n\n"
    prompt_head_encoded = tokenizer.encode(prompt_head, add_special_tokens=False)

    # The commit message is truncated to at most 64 tokens.
    prompt_message = f"- given commit message:\n{commit_message}\n"
    prompt_message_encoded = tokenizer.encode(prompt_message, max_length=64, truncation=True, add_special_tokens=False)

    # The diff receives whatever token budget is left in the context window.
    prompt_diff = f"- given commit diff: \n{git_diff}\n"
    remaining_length = (context_window - len(prompt_head_encoded) - len(prompt_message_encoded) - 6)
    prompt_diff_encoded = tokenizer.encode(prompt_diff, max_length=remaining_length, truncation=True, add_special_tokens=False)

    prompt_end = tokenizer.encode(" [/INST]", add_special_tokens=False)
    return tokenizer.decode(prompt_head_encoded + prompt_message_encoded + prompt_diff_encoded + prompt_end)


def classify_commit(commit_message: str, git_diff: str, context_window: int = 1024):
    prompt = prepare_prompt(commit_message, git_diff, context_window)
    result = pipe(prompt, max_new_tokens=10, pad_token_id=pipe.tokenizer.eos_token_id)
    # The predicted type is the last whitespace-separated token of the output.
    label = result[0]["generated_text"].split()[-1]
    return label
```
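The token budget that `prepare_prompt` leaves for the diff can be worked through in isolation. The sketch below only reproduces the arithmetic behind `remaining_length`; the helper name `diff_token_budget`, the head length of 600 tokens, and the clamp to zero are assumptions for illustration, not part of the original snippet or measured values.

```python
def diff_token_budget(context_window: int, head_tokens: int,
                      message_tokens: int, reserve: int = 6) -> int:
    """Tokens left for the diff once the fixed prompt parts are accounted for.

    Mirrors: context_window - len(head) - len(message) - 6.
    The clamp to zero is an addition of this sketch.
    """
    remaining = context_window - head_tokens - message_tokens - reserve
    return max(remaining, 0)


# Hypothetical numbers: a 600-token system prompt (assumed, not measured)
# and a commit message that uses its full 64-token allowance.
for window in (1024, 2048):
    budget = diff_token_budget(window, head_tokens=600, message_tokens=64)
    print(f"context_window={window}: {budget} tokens available for the diff")
```

Under these assumed numbers, doubling the window from 1024 to 2048 roughly quadruples the room for the diff, because the system prompt and message take a fixed share of the window.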
Call `classify_commit` with a commit's message and git diff to classify it. The `context_window` parameter caps the length of the entire prompt in tokens; it defaults to 1024 and can be raised, for example to 2048, to fit more of the git diff into one prompt. Here is an example of its usage:
```python
import requests
from github import Github  # provided by the PyGithub package


def fetch_message_and_diff(repo_name, commit_sha):
    g = Github()  # unauthenticated client
    try:
        repo = g.get_repo(repo_name)
        commit = repo.get_commit(commit_sha)
        if commit.parents:
            # Diff the commit against its first parent.
            parent_sha = commit.parents[0].sha
            diff_url = repo.compare(parent_sha, commit_sha).diff_url
            return commit.commit.message, requests.get(diff_url).text
        else:
            raise ValueError("No parent found for this commit, unable to retrieve diff.")
    except Exception as e:
        raise RuntimeError(f"Error retrieving commit information: {e}") from e


message, diff = fetch_message_and_diff("pytorch/pytorch", "9856bc50a251ac054debfdbbb5ed29fc4f6aeb39")
print(classify_commit(message, diff))
```
Here, `fetch_message_and_diff` retrieves the commit message and the diff against the first parent for any specified SHA in a GitHub repository, and `classify_commit` then classifies that commit. Commits without a parent raise an error, since no diff can be retrieved for them.
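`classify_commit` returns the last whitespace-separated token of the generated text without checking that it is one of the ten types. As a possible hardening step, the output can be validated against the known labels. This is a minimal sketch: the helper `extract_label` and the set `CCS_TYPES` are hypothetical names, and it assumes the model's answer follows the final `[/INST]` marker of the prompt.

```python
from typing import Optional

# The ten categories listed in the system prompt.
CCS_TYPES = {"docs", "perf", "style", "refactor", "feat",
             "fix", "test", "ci", "build", "chore"}


def extract_label(generated_text: str) -> Optional[str]:
    """Return a valid commit type from the model output, or None."""
    # Keep only what follows the last "[/INST]" marker (assumed to be the
    # completion); if the marker is absent, the whole text is scanned.
    _, _, completion = generated_text.rpartition("[/INST]")
    # Scan from the end, mirroring the original "last token" heuristic.
    for token in reversed(completion.split()):
        candidate = token.strip(".,:;*`'\"").lower()
        if candidate in CCS_TYPES:
            return candidate
    return None


print(extract_label("<s>[INST] ... [/INST] feat"))
```

Returning `None` instead of an arbitrary token lets the caller decide how to handle an unexpected completion, for example by retrying with a larger `context_window`.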