---
license: apache-2.0
language:
- en
- zh
datasets:
- m-a-p/Code-Feedback
metrics:
- pass@k
base_model: Qwen/Qwen3-4B-Instruct
tags:
- code
- agent
- react
- code-review
- lora
- unsloth
- qwen3
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Qwen3-4B-CodeAgent
  results:
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval (10-problem subset)
      type: openai/openai_humaneval
    metrics:
    - name: Pass@1
      type: pass@1
      value: 62.6
    - name: Pass@3
      type: pass@3
      value: 75.61
---

# Qwen3-4B-CodeAgent

A fine-tuned code-execution and code-review agent based on Qwen3-4B-Instruct, trained to follow a structured ReAct (Plan → Execute → Reflect → Finish) workflow with XML-formatted responses.

## Model Description

This model is a LoRA fine-tuned version of [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B) designed to function as an autonomous coding agent. It generates structured XML responses that an orchestration framework can parse to execute code, review results, and debug iteratively.

| Attribute | Value |
|-----------|-------|
| Base Model | Qwen3-4B-Instruct (3.6B non-embedding params) |
| Architecture | Qwen3ForCausalLM, 36 layers, 2560 hidden size, GQA (32 heads / 8 KV heads) |
| Fine-tuning Method | LoRA on a 4-bit quantized base (r=32, alpha=32) |
| Framework | Unsloth + TRL SFTTrainer |
| Training Data | m-a-p/Code-Feedback (~47K train samples) |
| Context Length | 4096 tokens |
| Precision | bfloat16 (merged weights) |

## Intended Use

This model is designed for building code agent systems that need structured, parseable output. It is suitable for:

- Automated code generation with execution feedback loops
- Code review and iterative debugging pipelines
- Tool-augmented LLM applications with sandbox execution
- Educational coding assistants

## Output Format

The model outputs XML-structured responses following a ReAct workflow:

```xml
Plan
Execute
## Analysis
The task requires implementing a binary search algorithm.

## Plan
1. Define the function signature
2. Implement iterative binary search
3. Handle edge cases (empty array, target not found)
```

### Node Types

| Node | Trigger | Content | Next Node |
|------|---------|---------|-----------|
| **Plan** | User sends a task | Markdown-formatted solution plan | Execute |
| **Execute** | After Plan or Reflect | `{"tool_name": "python_sandbox", "arguments": {"code": "..."}}` | Execute (awaits the execution result) |
| **Reflect** | Execute fails (exit_code=1) | Root cause analysis and fix direction | Execute (re-run after the fix) |
| **Finish** | Execute succeeds (exit_code=0) | Task summary | Finish |

### Standard Workflow

```
Plan → Execute → (failure → Reflect → Execute → ...) → Finish
```

## Usage

### With Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Nanami14138/qwen3-4b-instruct-code-agent"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

### 🛠️ Prompting Strategy

This model is designed as a ReAct-based Code Agent. To make it follow the state machine (Plan -> Execute -> Reflect -> Finish) strictly and emit structured XML, **using the following system prompt at inference time is strongly recommended**:

```python
# Recommended system prompt, reproduced verbatim in Chinese. It defines the
# output format, the four nodes (Plan, Execute, Reflect, Finish) and the
# standard workflow.
system_prompt = """你是一个专业的代码执行与Code Review智能Agent,遵循ReAct工作流。

## 输出格式
你的每一次回复都必须严格使用以下XML格式:
当前节点
下一个节点
输出内容

## 节点定义

### Plan(规划)
- 触发:收到用户任务后立即进入
- :分析任务需求,以 Markdown 格式输出解决方案规划
- :Execute

### Execute(执行)
- 触发:Plan 或 Reflect 之后进入
- :输出 {"tool_name": "python_sandbox", "arguments": {"code": "你的代码"}}
- :Execute(等待执行结果)

### Reflect(反思)
- 触发:Execute 执行失败(exit_code=1)后进入
- :分析失败原因,定位根因,给出修正方向
- :Execute(修正后重新执行)

### Finish(完成)
- 触发:Execute 执行成功(exit_code=0)后进入
- :输出任务总结
- :Finish

## 标准工作流
Plan → Execute → (失败 → Reflect → Execute → ...) → Finish"""

# The user turn keeps the Chinese field labels: "任务" = task, "当前状态" = current state.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "任务:Write a Python function to check if a number is prime.\n\n当前状态:Start"},
]

# Build the prompt with Qwen3 thinking mode disabled.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect.
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=1024, do_sample=True, temperature=0.1, top_p=0.95
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### With Unsloth (Faster Inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Nanami14138/qwen3-4b-instruct-code-agent",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Then use the same message format as above
```

## Training Details

### Data

Trained on [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback), a multi-turn code conversation dataset with ~66K examples. The data was processed into three pools (47,416 training samples in total):

| Pool | Description | Train Samples | Ratio |
|------|-------------|---------------|-------|
| Pool A (Base SFT) | Single-turn code Q&A, plain text | 117 | 0.2% |
| Pool B (Code Review) | Multi-turn debug/review → ReAct XML format | 29,562 | 62.3% |
| Pool C (Discussion) | Multi-turn code discussion → ReAct XML format | 17,737 | 37.4% |

The system prompt is injected at training time (not stored in the data) to ensure consistent behavior.
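
The training-time injection described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual training script: `inject_system_prompt`, the placeholder prompt string, and the sample record are hypothetical, and each stored sample is assumed to be a list of `{"role", "content"}` dicts.

```python
from typing import Dict, List

Message = Dict[str, str]


def inject_system_prompt(messages: List[Message], system_prompt: str) -> List[Message]:
    """Return a new message list with the system prompt as the first turn.

    Hypothetical helper: assumes each stored sample is a list of
    {"role": ..., "content": ...} dicts.
    """
    # Drop any system turn already present so the prompt is never duplicated.
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}] + rest


# Illustrative sample, not taken from the dataset.
sample = [
    {"role": "user", "content": "Write a function that reverses a string."},
    {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
]
with_prompt = inject_system_prompt(sample, "SYSTEM PROMPT PLACEHOLDER")
print([m["role"] for m in with_prompt])  # ['system', 'user', 'assistant']
```

Keeping the prompt out of the stored samples means it can be revised in one place, and the stored conversation is left unmodified because the helper returns a new list.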

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| LoRA rank (r) | 32 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup ratio | 0.1 |
| Batch size | 4 per device × 4 gradient-accumulation steps = 16 effective |
| Max sequence length | 4096 |
| Precision | 4-bit base weights with LoRA (training), bfloat16 (merged) |
| Optimizer | AdamW 8-bit |
| Epochs | 3 planned (stopped early at step 620 of 8,892, ~7.0% of the schedule) |

### Training Curve

| Step | Train Loss | Eval Loss |
|------|-----------|-----------|
| 20 | 1.927 | 1.905 |
| 100 | 0.649 | 0.573 |
| 200 | 0.463 | 0.454 |
| 300 | 0.412 | 0.422 |
| 400 | 0.413 | 0.409 |
| 500 | 0.374 | 0.401 |
| 600 | 0.383 | 0.397 |

Eval loss fell from 1.905 at step 20 to 0.397 at step 600 and stayed close to the train loss throughout, with no sign of overfitting. The checkpoint at step 620 was merged for this release.

### Hardware

- 8× NVIDIA L20 (48 GB each); LoRA training ran on a single GPU

## Evaluation

### HumanEval (10-problem subset)

| Metric | Score |
|--------|-------|
| Pass@1 | 62.6% |
| Pass@2 | 71.14% |
| Pass@3 | 75.61% |
| Avg tokens/problem | 215.2 |

Evaluation was conducted on a 10-problem subset of HumanEval. Full 164-problem evaluation is planned.

## Limitations

- **Early checkpoint**: This model was merged at step 620 out of 8,892 total steps (~7.0% of training). Performance will likely improve with continued training.
- **English-centric data**: The training data (Code-Feedback) is predominantly in English. Output quality may be lower for coding tasks posed in Chinese.
- **XML format dependency**: The model is trained to output structured XML. Without the system prompt, it may not follow the expected format.
- **No real execution**: The training data simulates tool responses; the model has not been trained with actual code execution feedback.
- **Limited programming-language coverage**: While the training data covers multiple languages, Python is heavily overrepresented.
- **Hallucination risk**: Like all LLMs, the model may generate plausible but incorrect code, especially for complex algorithms or domain-specific tasks.

## Ethical Considerations

- The model should not be used to generate malicious code or exploit vulnerabilities.
- Generated code should always be reviewed by a human before deployment in production systems.
- The model may reproduce biases present in the training data (e.g., coding style preferences, library choices).

## Citation

If you use this model, please cite the base model and training dataset:

```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}

@misc{code-feedback,
  title={Code-Feedback: Multi-turn Code Conversation Dataset},
  author={m-a-p},
  url={https://huggingface.co/datasets/m-a-p/Code-Feedback}
}
```