A fine-tuned code execution and code review agent based on Qwen3-4B-Instruct, trained to follow a structured ReAct workflow (Plan → Execute → Reflect → Finish) with XML-formatted responses.
Model Description
This model is a LoRA fine-tuned version of Qwen3-4B-Instruct designed to function as an autonomous coding agent. It generates structured XML responses that can be parsed by an orchestration framework to execute code, review results, and iteratively debug.
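As a rough illustration of how an orchestration layer might parse these responses, the sketch below pulls the three fields out of the `<agent_response>` envelope that the system prompt defines. It uses a regular expression rather than an XML parser, on the assumption that `<content>` can carry raw code or JSON containing characters that are not valid XML. The names `AgentResponse`, `parse_agent_response`, and `extract_tool_call` are hypothetical and not part of this release.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Matches one <agent_response> envelope; DOTALL lets <content> span lines.
_RESPONSE_RE = re.compile(
    r"<agent_response>\s*"
    r"<node>(?P<node>.*?)</node>\s*"
    r"<next_node>(?P<next_node>.*?)</next_node>\s*"
    r"<content>(?P<content>.*?)</content>\s*"
    r"</agent_response>",
    re.DOTALL,
)


@dataclass
class AgentResponse:
    node: str
    next_node: str
    content: str


def parse_agent_response(text: str) -> Optional[AgentResponse]:
    """Return the first well-formed response found in `text`, or None."""
    match = _RESPONSE_RE.search(text)
    if match is None:
        return None
    return AgentResponse(
        node=match.group("node").strip(),
        next_node=match.group("next_node").strip(),
        content=match.group("content").strip(),
    )


def extract_tool_call(response: AgentResponse) -> Optional[dict]:
    """Decode the JSON tool call carried by an Execute node, if any."""
    if response.node != "Execute":
        return None
    try:
        payload = json.loads(response.content)
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


sample = """<agent_response>
<node>Execute</node>
<next_node>Execute</next_node>
<content>{"tool_name": "python_sandbox", "arguments": {"code": "print(2 + 2)"}}</content>
</agent_response>"""

parsed = parse_agent_response(sample)
tool_call = extract_tool_call(parsed) if parsed else None
print(parsed.node, tool_call["arguments"]["code"])
```

Returning `None` on malformed output, instead of raising, lets the caller decide whether to retry the generation or stop.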
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Nanami14138/qwen3-4b-instruct-code-agent"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

🛠️ Prompting Strategy

This model is designed as a Code Agent built on the ReAct framework. To keep it strictly on the Plan → Execute → Reflect → Finish state machine and emitting structured XML, **it is strongly recommended to use the following system prompt at inference time**:

```python
import torch

# Kept in the original Chinese, as recommended for inference.
system_prompt = """你是一个专业的代码执行与Code Review智能Agent,遵循ReAct工作流。
## 输出格式
你的每一次回复都必须严格使用以下XML格式:
<agent_response>
<node>当前节点</node>
<next_node>下一个节点</next_node>
<content>输出内容</content>
</agent_response>
## 节点定义
### Plan(规划)
- 触发:收到用户任务后立即进入
- <content>:分析任务需求,以 Markdown 格式输出解决方案规划
- <next_node>:Execute
### Execute(执行)
- 触发:Plan 或 Reflect 之后进入
- <content>:输出 {"tool_name": "python_sandbox", "arguments": {"code": "你的代码"}}
- <next_node>:Execute(等待执行结果)
### Reflect(反思)
- 触发:Execute 执行失败(exit_code=1)后进入
- <content>:分析失败原因,定位根因,给出修正方向
- <next_node>:Execute(修正后重新执行)
### Finish(完成)
- 触发:Execute 执行成功(exit_code=0)后进入
- <content>:输出任务总结
- <next_node>:Finish
## 标准工作流
Plan → Execute → (失败 → Reflect → Execute → ...) → Finish"""

messages = [
    {"role": "system", "content": system_prompt},
    # User turn format: "任务" is the task, "当前状态" is the current state.
    {
        "role": "user",
        "content": "任务:Write a Python function to check if a number is prime.\n\n当前状态:Start",
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,  # temperature and top_p only take effect when sampling
        temperature=0.1,
        top_p=0.95,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
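The node definitions in the system prompt amount to a small state machine driven by the sandbox exit code. The sketch below is one way an orchestrator could encode it. `next_state` and `run_python` are hypothetical helpers, treating `Start` as the initial state and any non-zero exit code as a failure are assumptions, and running code in a local subprocess is only a stand-in for a real `python_sandbox`, which the text does not specify and which would need proper isolation.

```python
import subprocess
import sys
from typing import Optional, Tuple


def next_state(current: str, exit_code: Optional[int] = None) -> str:
    """Transition table derived from the system prompt's node definitions."""
    if current == "Start":
        return "Plan"
    if current in ("Plan", "Reflect"):
        return "Execute"
    if current == "Execute":
        if exit_code is None:
            return "Execute"  # still waiting for the execution result
        # Assumption: any non-zero exit code counts as a failure.
        return "Finish" if exit_code == 0 else "Reflect"
    if current == "Finish":
        return "Finish"
    raise ValueError(f"unknown node: {current!r}")


def run_python(code: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Stand-in sandbox (assumption): run code in a child interpreter.

    Returns (exit_code, combined stdout and stderr). Not isolated.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 1, "timed out"
    return proc.returncode, proc.stdout + proc.stderr


exit_code, output = run_python("print(7 % 2 == 1)")
print(next_state("Execute", exit_code))
```

In a full loop, the orchestrator would feed `output` back to the model as the next turn and stop once the state reaches `Finish`, ideally with a cap on the number of Reflect → Execute retries.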
With Unsloth (Faster Inference)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/qwen3-4b-code-agent",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Then use the same message format as above
```
Training Details
Data
Trained on m-a-p/Code-Feedback, a multi-turn code conversation dataset with ~66K examples. The data was processed into three pools:

| Pool | Description | Train Samples | Ratio |
|---|---|---|---|
| Pool A (Base SFT) | Single-turn code Q&A, plain text | 117 | 0.2% |
| Pool B (Code Review) | Multi-turn debug/review → ReAct XML format | 29,562 | 62.3% |
| Pool C (Discussion) | Multi-turn code discussion → ReAct XML format | 17,737 | 37.4% |
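The ratios follow directly from the sample counts above. A quick check, using only those numbers:

```python
# Train sample counts per pool, as listed above.
pools = {"Pool A": 117, "Pool B": 29_562, "Pool C": 17_737}

total = sum(pools.values())
ratios = {name: round(100 * count / total, 1) for name, count in pools.items()}

print(total)
print(ratios)
```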
The system prompt is injected at training time (not stored in the data) to ensure consistent behavior.
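A minimal sketch of what injecting the prompt at training time could look like, assuming each stored sample is a list of chat messages without a system turn. `SYSTEM_PROMPT` is a placeholder for the full prompt text, and `inject_system_prompt` is a hypothetical helper, not the author's preprocessing code.

```python
from typing import Dict, List

Message = Dict[str, str]

# Placeholder (assumption): in practice this would hold the full system prompt.
SYSTEM_PROMPT = "<full system prompt text>"


def inject_system_prompt(
    messages: List[Message],
    system_prompt: str = SYSTEM_PROMPT,
) -> List[Message]:
    """Return a copy of the conversation with the system turn prepended."""
    if messages and messages[0]["role"] == "system":
        return list(messages)  # already has a system turn; leave it alone
    return [{"role": "system", "content": system_prompt}, *messages]


stored = [{"role": "user", "content": "Fix this function."}]
prepared = inject_system_prompt(stored)
print([m["role"] for m in prepared])
```

Keeping the prompt out of the stored data means a single definition serves every sample, so the prompt can be revised without regenerating the dataset.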
Epochs: 3 planned (stopped early at step 620 of 8892, about 7.0% of the planned steps)
Training Curve
| Step | Train Loss | Eval Loss |
|---|---|---|
| 20 | 1.927 | 1.905 |
| 100 | 0.649 | 0.573 |
| 200 | 0.463 | 0.454 |
| 300 | 0.412 | 0.422 |
| 400 | 0.413 | 0.409 |
| 500 | 0.374 | 0.401 |
| 600 | 0.383 | 0.397 |
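Eval loss fell from 1.905 at step 20 to 0.397 at step 600, and train and eval loss stayed within 0.08 of each other at every logged step, so there is no sign of overfitting so far. The checkpoint at step 620 was merged for this release.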
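One simple way to read the curve is to look at the gap between eval and train loss at each logged step: a gap that keeps widening while train loss falls would suggest overfitting. The check below uses only the logged values, and this reading of the gap is a heuristic rather than a formal test.

```python
# (step, train_loss, eval_loss) as logged in the training curve
curve = [
    (20, 1.927, 1.905),
    (100, 0.649, 0.573),
    (200, 0.463, 0.454),
    (300, 0.412, 0.422),
    (400, 0.413, 0.409),
    (500, 0.374, 0.401),
    (600, 0.383, 0.397),
]

# Positive gap: eval loss is above train loss at that step.
gaps = {step: round(ev - tr, 3) for step, tr, ev in curve}

# Eval loss should keep falling if the model is still generalizing.
eval_losses = [ev for _, _, ev in curve]
eval_still_falling = all(a > b for a, b in zip(eval_losses, eval_losses[1:]))

print(gaps)
print(eval_still_falling)
```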
Hardware
8× NVIDIA L20 (48 GB each) available; LoRA training ran on a single GPU
Evaluation
HumanEval (10-problem subset)
| Metric | Score |
|---|---|
| Pass@1 | 62.6% |
| Pass@2 | 71.14% |
| Pass@3 | 75.61% |
| Avg tokens/problem | 215.2 |
Evaluation was conducted on a 10-problem subset of HumanEval. Full 164-problem evaluation is planned.
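The text does not say how Pass@k was computed. A common choice is the unbiased estimator introduced with HumanEval, `1 - C(n - c, k) / C(n, k)` for `n` samples of which `c` pass, averaged over problems. The sketch below implements that estimator as an assumption about the method. The sample counts in the usage lines are made up for illustration and are not this model's results.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when fewer than k samples are incorrect, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Illustrative counts only (hypothetical): 3 problems, 5 samples each.
illustrative = [(5, 5), (5, 2), (5, 0)]
for k in (1, 2, 3):
    print(k, round(mean_pass_at_k(illustrative, k), 4))
```

With only 10 problems, each problem moves the average by up to 10 percentage points, so scores on a subset of this size carry wide uncertainty.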
Limitations
- Early checkpoint: This model was merged at step 620 out of 8892 total steps (about 7.0% of the planned training). Performance will likely improve with continued training.
- English-centric data: The training data (Code-Feedback) is predominantly in English. Output quality may be lower for coding tasks posed in Chinese.
- XML format dependency: The model is trained to output structured XML. Without the system prompt, it may not follow the expected format.
- No real execution: The training data simulates tool responses; the model has not been trained with actual code execution feedback.
- Limited code languages: While the training data covers multiple languages, Python is heavily overrepresented.
- Hallucination risk: Like all LLMs, the model may generate plausible but incorrect code, especially for complex algorithms or domain-specific tasks.
Ethical Considerations
- The model should not be used to generate malicious code or exploit vulnerabilities.
- Generated code should always be reviewed by a human before deployment in production systems.
- The model may reproduce biases present in the training data (e.g., coding style preferences, library choices).
Citation
If you use this model, please cite the base model and training dataset: