---
license: apache-2.0
language:
- en
- zh
datasets:
- m-a-p/Code-Feedback
metrics:
- pass@k
base_model: Qwen/Qwen3-4B-Instruct
tags:
- code
- agent
- react
- code-review
- lora
- unsloth
- qwen3
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Qwen3-4B-CodeAgent
  results:
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval (10-problem subset)
      type: openai/openai_humaneval
    metrics:
    - name: Pass@1
      type: pass@1
      value: 62.6
    - name: Pass@3
      type: pass@3
      value: 75.61
---

# Qwen3-4B-CodeAgent

A fine-tuned code-execution and code-review agent based on Qwen3-4B-Instruct, trained to follow a structured ReAct workflow (Plan → Execute → Reflect → Finish) and to reply in XML-formatted responses.

## Model Description

This model is a LoRA fine-tuned version of [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B) designed to function as an autonomous coding agent. It generates structured XML responses that an orchestration framework can parse to execute code, review results, and debug iteratively.

| Attribute | Value |
|-----------|-------|
| Base Model | Qwen3-4B-Instruct (4.0B params, 3.6B non-embedding) |
| Architecture | Qwen3ForCausalLM, 36 layers, 2560 hidden size, GQA (32 heads / 8 KV heads) |
| Fine-tuning Method | LoRA on a 4-bit quantized base (r=32, alpha=32) |
| Framework | Unsloth + TRL SFTTrainer |
| Training Data | m-a-p/Code-Feedback (~47K train samples) |
| Context Length | 4096 tokens |
| Precision | bfloat16 (merged weights) |

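As a rough cross-check of the adapter size implied by the table, the sketch below counts the LoRA parameters added at r=32 on the seven projection matrices listed under Hyperparameters. The head dimension (128) and MLP intermediate size (9728) are not stated in this card; they are assumptions taken from the published Qwen3-4B configuration and should be checked against the model's `config.json`.

```python
# Hidden size, layer count and head counts come from the table above.
# HEAD_DIM and INTERMEDIATE are assumptions, not stated in this card.
HIDDEN, LAYERS, HEADS, KV_HEADS = 2560, 36, 32, 8
HEAD_DIM = 128        # assumption
INTERMEDIATE = 9728   # assumption
RANK, ALPHA = 32, 32

# (in_features, out_features) for each LoRA target module.
shapes = {
    "q_proj": (HIDDEN, HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"{per_layer:,} per layer, {total:,} total, scaling {scaling}")
```

Under these assumptions the adapter holds roughly 66M trainable parameters, and alpha = r gives a scaling factor of 1.0.
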
## Intended Use

This model is designed for building code agent systems that need structured, parseable output. It is suitable for:

- Automated code generation with execution feedback loops
- Code review and iterative debugging pipelines
- Tool-augmented LLM applications with sandbox execution
- Educational coding assistants

## Output Format

The model outputs XML-structured responses following a ReAct workflow:

```xml
<agent_response>
<node>Plan</node>
<next_node>Execute</next_node>
<content>
## Analysis
The task requires implementing a binary search algorithm.

## Plan
1. Define the function signature
2. Implement iterative binary search
3. Handle edge cases (empty array, target not found)
</content>
</agent_response>
```

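A minimal sketch of how an orchestrator might parse such a response. The names `AgentResponse` and `parse_agent_response` are hypothetical and not part of this release. A regular expression is used instead of an XML parser on the assumption that `<content>` can hold Markdown or JSON with characters that are not valid XML text.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentResponse:
    node: str
    next_node: str
    content: str


# Non-greedy matches: this stops at the first closing tag, so a literal
# "</content>" inside the payload would truncate it.
_PATTERN = re.compile(
    r"<agent_response>\s*"
    r"<node>(?P<node>.*?)</node>\s*"
    r"<next_node>(?P<next_node>.*?)</next_node>\s*"
    r"<content>(?P<content>.*?)</content>\s*"
    r"</agent_response>",
    re.DOTALL,
)


def parse_agent_response(text: str) -> AgentResponse:
    """Extract node, next_node and content from the first response block."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("no well-formed <agent_response> block found")
    return AgentResponse(
        node=match["node"].strip(),
        next_node=match["next_node"].strip(),
        content=match["content"].strip(),
    )


example = """<agent_response>
<node>Plan</node>
<next_node>Execute</next_node>
<content>
## Plan
1. Define the function signature
</content>
</agent_response>"""

parsed = parse_agent_response(example)
print(parsed.node, "->", parsed.next_node)
```

Raising on malformed output lets the caller decide whether to retry the generation or abort the run.
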
### Node Types

| Node | Trigger | Content | Next Node |
|------|---------|---------|-----------|
| **Plan** | User sends a task | Markdown-formatted solution plan | Execute |
| **Execute** | After Plan or Reflect | `{"tool_name": "python_sandbox", "arguments": {"code": "..."}}` | Execute (awaits the execution result) |
| **Reflect** | Execute fails (exit_code=1) | Root cause analysis and fix direction | Execute (re-run after the fix) |
| **Finish** | Execute succeeds (exit_code=0) | Task summary | Finish |

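The Execute row can be handled roughly as follows: decode the JSON tool call, run the code, and map the exit code to the next node. This is a sketch only. `extract_code`, `run_python`, `next_node_after` and the 10-second timeout are hypothetical, and a child interpreter started with `subprocess` provides no isolation, so it stands in for a real sandbox rather than replacing one.

```python
import json
import subprocess
import sys


def extract_code(execute_content: str) -> str:
    """Pull the code string out of an Execute node's JSON tool call."""
    call = json.loads(execute_content)
    if call.get("tool_name") != "python_sandbox":
        raise ValueError(f"unexpected tool: {call.get('tool_name')!r}")
    return call["arguments"]["code"]


def run_python(code: str, timeout: float = 10.0) -> dict:
    """Run code in a child interpreter. NOT a sandbox: there is no isolation."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": 1, "stdout": "", "stderr": "timeout"}
    # Collapse any non-zero status to 1 to match the exit_code convention above.
    return {
        "exit_code": 0 if proc.returncode == 0 else 1,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def next_node_after(exit_code: int) -> str:
    """Success leads to Finish, failure to Reflect."""
    return "Finish" if exit_code == 0 else "Reflect"


content = '{"tool_name": "python_sandbox", "arguments": {"code": "print(2 + 2)"}}'
result = run_python(extract_code(content))
print(result["exit_code"], next_node_after(result["exit_code"]))
```
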
### Standard Workflow

```
Plan → Execute → (failure → Reflect → Execute → ...) → Finish
```

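The workflow can be read as a small state machine. The sketch below traces the nodes visited for a given sequence of execution outcomes. `run_workflow` and the `max_attempts` cap are assumptions made for illustration; this card does not specify a retry limit.

```python
from typing import Callable, List


def run_workflow(execute: Callable[[int], int], max_attempts: int = 3) -> List[str]:
    """Return the nodes visited; `execute(attempt)` returns an exit code."""
    trace = ["Plan"]
    for attempt in range(1, max_attempts + 1):
        trace.append("Execute")
        if execute(attempt) == 0:
            trace.append("Finish")
            return trace
        # Reflect only when another Execute attempt is still allowed.
        if attempt < max_attempts:
            trace.append("Reflect")
    return trace  # attempt budget exhausted without reaching Finish


# Hypothetical run: the first execution fails, the second succeeds.
print(" -> ".join(run_workflow(lambda attempt: 0 if attempt >= 2 else 1)))
```

Capping the number of attempts keeps a task that never passes from looping between Reflect and Execute indefinitely.
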
## Usage

### With Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Nanami14138/qwen3-4b-instruct-code-agent"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

### 🛠️ Prompting Strategy (System Prompt)

This model is designed as an intelligent Code Agent built on the ReAct framework. To make it follow the state machine (Plan -> Execute -> Reflect -> Finish) strictly and emit structured XML, **using the following system prompt at inference time is strongly recommended**:

```python
# The system prompt is reproduced verbatim in Chinese. In English it says: you are
# a code-execution and code-review agent following the ReAct workflow; every reply
# must use the <agent_response> XML format; the node definitions and the standard
# workflow match the Node Types and Standard Workflow sections above.
system_prompt = """你是一个专业的代码执行与Code Review智能Agent,遵循ReAct工作流。

## 输出格式
你的每一次回复都必须严格使用以下XML格式:
<agent_response>
<node>当前节点</node>
<next_node>下一个节点</next_node>
<content>输出内容</content>
</agent_response>

## 节点定义
### Plan(规划)
- 触发:收到用户任务后立即进入
- <content>:分析任务需求,以 Markdown 格式输出解决方案规划
- <next_node>:Execute

### Execute(执行)
- 触发:Plan 或 Reflect 之后进入
- <content>:输出 {"tool_name": "python_sandbox", "arguments": {"code": "你的代码"}}
- <next_node>:Execute(等待执行结果)

### Reflect(反思)
- 触发:Execute 执行失败(exit_code=1)后进入
- <content>:分析失败原因,定位根因,给出修正方向
- <next_node>:Execute(修正后重新执行)

### Finish(完成)
- 触发:Execute 执行成功(exit_code=0)后进入
- <content>:输出任务总结
- <next_node>:Finish

## 标准工作流
Plan → Execute → (失败 → Reflect → Execute → ...) → Finish"""

# The user turn keeps its Chinese labels: 任务 = task, 当前状态 = current state.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "任务:Write a Python function to check if a number is prime.\n\n当前状态:Start"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True is set explicitly so that temperature and top_p take effect.
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.1, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### With Unsloth (Faster Inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Nanami14138/qwen3-4b-instruct-code-agent",  # same repository as in the Transformers example
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
# Then use the same message format as above
```

## Training Details

### Data

Trained on [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback), a multi-turn code conversation dataset with ~66K examples. The data was processed into three pools totaling 47,416 training samples:

| Pool | Description | Train Samples | Ratio |
|------|-------------|---------------|-------|
| Pool A (Base SFT) | Single-turn code Q&A, plain text | 117 | 0.2% |
| Pool B (Code Review) | Multi-turn debug/review → ReAct XML format | 29,562 | 62.3% |
| Pool C (Discussion) | Multi-turn code discussion → ReAct XML format | 17,737 | 37.4% |

The system prompt is injected at training time (not stored in the data) to ensure consistent behavior.

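Injecting the prompt at training time could look roughly like the sketch below. The actual preprocessing code is not part of this card, so `inject_system_prompt` and the sample conversation are hypothetical. The idea is only that each stored conversation gets exactly one leading system message before the chat template is applied.

```python
from typing import Dict, List

Message = Dict[str, str]


def inject_system_prompt(conversation: List[Message], system_prompt: str) -> List[Message]:
    """Return a copy of the conversation with exactly one leading system message."""
    without_system = [m for m in conversation if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}] + without_system


# Hypothetical stored sample; the data itself carries no system message.
sample = [
    {"role": "user", "content": "Write a function that reverses a string."},
    {"role": "assistant", "content": "<agent_response>...</agent_response>"},
]
prepared = inject_system_prompt(sample, "SYSTEM PROMPT PLACEHOLDER")
print([m["role"] for m in prepared])
```

Keeping the prompt out of the stored data means it can be revised in one place without regenerating the dataset.
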
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup ratio | 0.1 |
| Batch size | 4 per device × 4 gradient-accumulation steps = 16 effective |
| Max sequence length | 4096 |
| Precision | 4-bit base weights (training), bfloat16 (merged) |
| Optimizer | AdamW 8-bit |
| Epochs | 3 planned (stopped early at step 620 of 8,892, ~7.0% of the schedule) |

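The step counts follow from the pool sizes and the batch settings. The short calculation below reproduces the 8,892-step schedule, assuming a single training device and that the final partial batch of each epoch counts as one optimizer step; the trainer's exact rounding is an assumption here.

```python
import math

train_samples = 117 + 29_562 + 17_737  # pool sizes from the Data table
per_device_batch, grad_accum, epochs = 4, 4, 3
effective_batch = per_device_batch * grad_accum

steps_per_epoch = math.ceil(train_samples / effective_batch)  # assumed rounding
total_steps = steps_per_epoch * epochs

stopped_at = 620
schedule_progress = stopped_at / total_steps
epoch_progress = stopped_at * effective_batch / train_samples

print(train_samples, steps_per_epoch, total_steps)
print(f"{schedule_progress:.1%} of the schedule, {epoch_progress:.1%} of one epoch")
```

On these numbers, step 620 corresponds to about 7.0% of the three-epoch schedule, or roughly a fifth of one pass over the training set.
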
### Training Curve

| Step | Train Loss | Eval Loss |
|------|-----------|-----------|
| 20 | 1.927 | 1.905 |
| 100 | 0.649 | 0.573 |
| 200 | 0.463 | 0.454 |
| 300 | 0.412 | 0.422 |
| 400 | 0.413 | 0.409 |
| 500 | 0.374 | 0.401 |
| 600 | 0.383 | 0.397 |

Eval loss fell from 1.905 at step 20 to 0.397 at step 600 and stayed close to train loss (0.383 at step 600), with no sign of overfitting. The checkpoint at step 620 was merged for this release.

### Hardware

- 8× NVIDIA L20 node (48 GB per GPU); LoRA training ran on a single GPU

## Evaluation

### HumanEval (10-problem subset)

| Metric | Score |
|--------|-------|
| Pass@1 | 62.6% |
| Pass@2 | 71.14% |
| Pass@3 | 75.61% |
| Avg tokens/problem | 215.2 |

Evaluation was conducted on a 10-problem subset of HumanEval. Full 164-problem evaluation is planned.

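This card does not state how many samples were drawn per problem or which estimator produced the Pass@k figures. For reference, the sketch below implements the unbiased pass@k estimator commonly used with HumanEval, on the assumption that n samples are generated per problem and c of them pass. The per-problem counts are made up for illustration and are not this model's results.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples drawn from n (c correct) passes."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts for five problems with n=3 samples each.
counts = [3, 3, 2, 1, 0]
for k in (1, 2, 3):
    print(k, round(mean_pass_at_k(counts, n=3, k=k), 4))
```
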
## Limitations

- **Early checkpoint**: This model was merged at step 620 out of 8892 total steps (~7.0% of the planned schedule). Performance will likely improve with continued training.
- **English-centric data**: The training data (Code-Feedback) is predominantly in English. Output quality on Chinese-language coding tasks may be lower.
- **XML format dependency**: The model is trained to output structured XML. Without the system prompt, it may not follow the expected format.
- **No real execution**: The training data simulates tool responses; the model has not been trained with actual code execution feedback.
- **Limited code languages**: While the training data covers multiple languages, Python is heavily overrepresented.
- **Hallucination risk**: Like all LLMs, the model may generate plausible but incorrect code, especially for complex algorithms or domain-specific tasks.

## Ethical Considerations

- The model should not be used to generate malicious code or exploit vulnerabilities.
- Generated code should always be reviewed by a human before deployment in production systems.
- The model may reproduce biases present in the training data (e.g., coding style preferences, library choices).

## Citation

If you use this model, please cite the base model and training dataset:

```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}

@misc{code-feedback,
  title={Code-Feedback: Multi-turn Code Conversation Dataset},
  author={m-a-p},
  url={https://huggingface.co/datasets/m-a-p/Code-Feedback}
}
```