Initialize project; model provided by the ModelHub XC community

Model: joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3
Source: Original Platform
ModelHub XC
2026-06-16 03:40:17 +08:00
commit acfc2cee00
9 changed files with 713 additions and 0 deletions

36
.gitattributes vendored Normal file

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text

511
README.md Normal file

@@ -0,0 +1,511 @@
---
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
base_model_relation: finetune
datasets:
- joshuasundance/mypo-4k-rfc
language:
- en
- code
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
model_name: mypo-qwen2.5-coder-1.5b-dpo-v3
tags:
- generated_from_trainer
- dpo
- trl
- preference-optimization
- python
- type-hints
- code
- qwen2.5-coder
- mypo
- hf_jobs
- codecarbon
- carbon-emissions
co2_eq_emissions:
emissions: 134.115
source: "CodeCarbon v3.2.6 (measured)"
training_type: "fine-tuning"
geographical_location: "Virginia, USA (AWS us-east-1)"
hardware_used: "1 x NVIDIA A10G (HF Jobs a10g-large)"
model-index:
- name: mypo-qwen2.5-coder-1.5b-dpo-v3
results:
- task:
type: text-generation
name: Python type-hinted code generation
dataset:
name: mypo-4k-rfc
type: joshuasundance/mypo-4k-rfc
split: validation
metrics:
- type: pass_rate
name: parse rate
value: 1.000
- type: pass_rate
name: black pass rate
value: 0.953
- type: pass_rate
name: ruff pass rate
value: 0.913
- type: pass_rate
name: mypy --strict pass rate
value: 0.920
- type: coverage
name: annotation slot coverage
value: 0.963
- type: win_rate
name: preference win-rate vs gold (chosen)
value: 0.527
- task:
type: text-generation
name: Python code generation
dataset:
name: HumanEval+
type: humaneval-plus
split: test
metrics:
- type: pass_rate
name: pass@1 (base tests)
value: 0.5853658536585366
- type: pass_rate
name: pass@1 (plus tests)
value: 0.5121951219512195
---
# Model Card for `mypo-qwen2.5-coder-1.5b-dpo-v3`
**Preference-tuned Python coding model** that prefers fully type-annotated code by default.
- **Base:** [`Qwen/Qwen2.5-Coder-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
- **Pipeline:** base → [SFT adapter](https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-sft) (merged) → DPO LoRA (merged) → this model
- **Training data:** [`joshuasundance/mypo-4k-rfc`](https://huggingface.co/datasets/joshuasundance/mypo-4k-rfc) — `chosen` = type-hinted Python, `rejected` = unhinted Python
- **This repo ships a fully merged standalone model**, not a LoRA adapter. Load directly with `AutoModelForCausalLM.from_pretrained(...)`.
- **Training scripts, raw generations, per-subject analysis, and the comparison report live in** [`joshuasundance/mypo-training`](https://huggingface.co/joshuasundance/mypo-training).
---
## TL;DR
v3 is the first DPO model in the MyPO project that actually shifts argmax decoding past the base. Two complementary measurements are reported — both published, both reproducible:
| metric | base | dpo-v2 | SFT | **dpo-v3** | gold (`chosen`) |
|---|---|---|---|---|---|
| `mypy --strict` pass — n=150 batched | 6.0% | 6.0% | 92.7% | **92.0%** | 100% |
| `mypy --strict` pass — n=30 single-prompt | **0.0%** | **0.0%** | **73.3%** | **73.3%** | — |
| annotation slot coverage — n=150 batched | 0.000 | 0.000 | 0.953 | **0.963** | 0.955 |
| annotation slot coverage — n=30 single-prompt | 0.000 | 0.000 | 0.971 | **0.976** | — |
| `black` pass — n=150 batched | 12.0% | 12.0% | 97.3% | **95.3%** | 98.0% |
| preference win-rate vs gold (n=150) | — | 0.0% | 49.0% | **52.7%** | — |
The large effects are robust: **0 % → 73 %** `mypy --strict` pass and **0.0 → 0.976** annotation slot coverage under real-world single-prompt inference (batch=1, no padding). The earlier batched and single-prompt validations are both retained as in-domain measurements, but we no longer attribute their gap to left-padding or batching as a general causal explanation.
An external benchmark now exists as well: on the latest canonical full
HumanEval+ run (n=164), this model reaches **96 / 164 = 58.5 %** pass@1 on
base tests and **84 / 164 = 51.2 %** on plus tests. That still underperforms
the Qwen base model (`112 / 164` base-test pass, `99 / 164` plus-test pass),
so v3 should be understood as an in-domain type-hinting preference model
rather than a generally stronger code generation model.
At n=30 single-prompt, **SFT and v3 are statistically indistinguishable** on the hard metrics; v3's clearer advantage over SFT is the 52.7 % preference win-rate vs gold on the n=150 batched eval (first model to exceed 50 % vs gold). v2 is indistinguishable from base under both decoding conditions — see the [v2 card](https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v2) for the failure-mode post-mortem.
---
## What changed vs v2
v2 logged healthy training telemetry (`rewards/accuracies → 1.0`) but generated text indistinguishable from the base model at greedy decode. The DPO ranking objective can be satisfied by infinitesimal weight deltas when both the LoRA scale and the learning rate are small. v2's effective scale was `α/r = 16/256 = 0.0625`, and its lr was `1e-6`; the product was too small to move argmax decoding.
v3 addresses all proximate causes at once:
| Design choice | v2 | **v3** | Rationale |
|---|---|---|---|
| Starting point | Base model | **Base + SFT (merged)** | DPO optimizes *beyond* SFT instead of re-deriving type-hint behavior |
| LoRA α | 16 | **256** | Matches r=256 → effective scale α/r = 1.0 (was 0.0625) |
| Learning rate | 1e-6 | **5e-5** | 50× higher; calibrated to the matched LoRA scale |
| DPO β | 0.1 | **0.3** | Stronger preference margin target |
| Epochs | 3 | **2** | Higher lr + scale + warm-start → faster convergence |
| Precision | 4-bit (QLoRA) | **bf16 full** | 1.5B bf16 fits on A10G 24 GB; clean `merge_and_unload` |
| Optimizer | `paged_adamw_8bit` | **`adamw_torch`** | No bitsandbytes dep in bf16 |
| Published as | PEFT adapter | **Fully merged model** | v3's DPO LoRA is only valid on top of (base+SFT); shipping a bare adapter would break the obvious load pattern |
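
The scale argument in the table can be checked with a few lines of arithmetic. This is an illustrative sketch only: `update_budget` is a hypothetical heuristic (LoRA scale times learning rate) introduced here as a rough proxy for step size. It is not a quantity that TRL or PEFT reports, and it ignores optimizer dynamics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRun:
    name: str
    rank: int
    alpha: int
    learning_rate: float

    @property
    def scale(self) -> float:
        # Standard LoRA scaling: the low-rank update is multiplied by alpha / r.
        return self.alpha / self.rank

    @property
    def update_budget(self) -> float:
        # Hypothetical heuristic (not a library metric): scale x learning rate.
        return self.scale * self.learning_rate


# Values taken from the v2 / v3 comparison table above.
v2 = LoraRun("dpo-v2", rank=256, alpha=16, learning_rate=1e-6)
v3 = LoraRun("dpo-v3", rank=256, alpha=256, learning_rate=5e-5)
ratio = v3.update_budget / v2.update_budget

for run in (v2, v3):
    print(f"{run.name}: scale={run.scale:.4f}, scale*lr={run.update_budget:.2e}")
print(f"v3 / v2 ratio: {ratio:.0f}x")
```

Under this heuristic the combined change is the 16× scale increase multiplied by the 50× learning-rate increase.
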
---
## Quick start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

messages = [{"role": "user", "content": "Write a function that returns the nth Fibonacci number."}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([rendered], return_tensors="pt", padding=True, truncation=True, max_length=2048).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False, use_cache=True, pad_token_id=tokenizer.pad_token_id)

# `inputs` is a BatchEncoding, so take the prompt length from input_ids
# and decode only the newly generated tokens.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```
No PEFT dependency required — v3 is a merged full model.
### Real one-prompt demo
The repo [`joshuasundance/mypo-training`](https://huggingface.co/joshuasundance/mypo-training) includes a runnable comparison script at [`examples/reproduce_v3.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/examples/reproduce_v3.py). We executed that exact script on HF Jobs (`69e959a92aa1660eaffa8ca6`) with the prompt `Write a function that returns the nth Fibonacci number.`
Observed outputs:
**Base**
```python
def fibonacci(n: int) -> int:
    if n == 0 or n == 1:
        return n
    prev = 0
    curr = 1
    for i in range(2, n + 1):
        temp = curr
        curr += prev
        prev = temp
    return curr
# Driver code
n = 9
print(fibonacci(n))
```
**SFT**
```python
def fibonacci(n: int) -> int:
    if n == 0 or n == 1:
        return n
    prev = 0
    curr = 1
    for i in range(2, n + 1):
        temp = curr
        curr += prev
        prev = temp
    return curr
# Driver code
n = 9
print(fibonacci(n))
```
**DPO-v2**
```python
def fibonacci(n):
    # Base cases: F(0) = 0, F(1) = 1
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
```
...followed by a natural-language explanation block in the same response.
```
**DPO-v3**
```python
from typing import Union
def fibonacci(n: int) -> Union[int, float]:
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
```
This single prompt is useful as a smoke test, but it is **not** the main evidence for v3's value because the base model already returns typed code here. The stronger evidence is the 150-prompt characterization table above: across that broader sample, v3 materially improves annotation coverage and is the only model to exceed 50% preference win-rate vs gold.
If you want to reproduce a specific stored row from the published eval artifacts, use [`examples/reproduce_eval_row.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/examples/reproduce_eval_row.py). We validated this on row 13 from `samples.jsonl`: replaying the prompt by itself did not match the stored sample, but replaying the original 8-prompt batch window did.
---
## Training
Trained with [TRL](https://github.com/huggingface/trl) `DPOTrainer` on a single NVIDIA A10G via [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs). Training script: [`mypo_dpo_train_v3.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v3.py). Job id: `69e933522aa1660eaffa8c51`.
### Hyperparameters (full `DPOConfig`)
| Group | Setting |
|---|---|
| Base model | `Qwen/Qwen2.5-Coder-1.5B-Instruct` |
| Warm-start | `joshuasundance/mypo-qwen2.5-coder-1.5b-sft` (merged into base before DPO LoRA attached) |
| Dataset | `joshuasundance/mypo-4k-rfc` (train + validation concatenated → 6,361 pairs) |
| LoRA | `r=256`, `α=256`, `dropout=0.05`, `target_modules="all-linear"`, `task_type=CAUSAL_LM` |
| Optimization | `adamw_torch`, `lr=5e-5`, cosine schedule, `warmup_steps=100` |
| DPO | `β=0.3`, `loss_type="sigmoid"` |
| Batching | `per_device_train_batch_size=1`, `gradient_accumulation_steps=8` (effective 8) |
| Schedule | `num_train_epochs=2`, `max_length=2048` |
| Precision | bf16, gradient checkpointing on, `attn_implementation="sdpa"` |
| Reporting | `report_to=["codecarbon"]`, `logging_steps=10` |
| Seed | 42 |
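
As a rough cross-check of the batching and schedule rows, the sketch below derives the effective batch size and an approximate optimizer-step count from the table values. The rounding rule (a trailing partial accumulation window counts as one step) and the `num_devices` entry are assumptions made here; the exact count depends on the Trainer version, so the job logs remain the authoritative source.

```python
import math

# Values copied from the hyperparameter table above.
hparams = {
    "num_pairs": 6361,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 2,
    "warmup_steps": 100,
    "num_devices": 1,  # single A10G
}

effective_batch = (
    hparams["per_device_train_batch_size"]
    * hparams["gradient_accumulation_steps"]
    * hparams["num_devices"]
)

# Assumption: a trailing partial accumulation window still counts as one step.
steps_per_epoch = math.ceil(hparams["num_pairs"] / effective_batch)
total_steps = steps_per_epoch * hparams["num_train_epochs"]
warmup_fraction = hparams["warmup_steps"] / total_steps

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {total_steps}")
print(f"warmup share of schedule: {warmup_fraction:.1%}")
```
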
### Final training metrics (from job logs)
| Metric | Value |
|---|---|
| `train_runtime` | 6,005 s (~1 h 40 m) |
| `train_loss` (DPO sigmoid) | 3.28 × 10⁻³ |
| `rewards/accuracies` (final) | 1.000 |
| `rewards/margins` (peak / plateau) | ~26 |
| `rewards/chosen` (final) | +6.24 |
| `rewards/rejected` (final) | ~−20 |
| `mean_token_accuracy` (final) | 0.910 |
| `grad_norm` (late training) | ≲ 1 × 10⁻⁵ |
**Convergence note:** `rewards/accuracies` saturated to 1.0 by epoch ~0.3 and `rewards/margins` plateaued by epoch ~0.5. The remaining ~1.5 epochs were cosine-decay ride-out with near-zero grads. [v4 draft](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v4.py) adds `EarlyStoppingCallback` and a held-out eval split to cut this.
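
The near-zero late-training gradients follow from the shape of the sigmoid DPO loss. The sketch below assumes TRL's convention that the logged rewards are β-scaled log-probability ratios, so the per-pair loss is `-log(sigmoid(margin))` with `margin = rewards/chosen - rewards/rejected`, and it assumes no label smoothing. The helper names are illustrative and are not TRL functions.

```python
import math


def dpo_sigmoid_loss(reward_margin: float) -> float:
    """-log(sigmoid(m)), computed stably as softplus(-m)."""
    m = reward_margin
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def gradient_weight(reward_margin: float) -> float:
    """sigmoid(-m): the factor that scales each pair's gradient."""
    m = reward_margin
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


for margin in (0.0, 2.0, 10.0, 26.0):
    print(
        f"margin={margin:5.1f}  loss={dpo_sigmoid_loss(margin):.3e}  "
        f"grad weight={gradient_weight(margin):.3e}"
    )
```

At a margin near the reported plateau of ~26, both the loss and the gradient weight are vanishingly small, which is consistent with the remaining epochs contributing almost no further updates.
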
---
## Evaluation
Evaluated on 150 stratified held-out validation prompts from `joshuasundance/mypo-4k-rfc`. Full report: [`reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md`](https://huggingface.co/joshuasundance/mypo-training/blob/main/reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md). Raw generations and per-subject JSON/CSV are also published in the training repo under `generations/` and `analysis/`.
| metric | base | dpo-v2 | SFT | **dpo-v3** | gold (`chosen`) | `rejected` |
|---|---|---|---|---|---|---|
| parse rate | 0.973 | 0.973 | 1.000 | **1.000** | 1.000 | 1.000 |
| `black` pass rate | 0.120 | 0.120 | 0.973 | **0.953** | 0.980 | 0.060 |
| `ruff` pass rate | 0.933 | 0.940 | 0.960 | **0.913** | 1.000 | 0.913 |
| `mypy --strict` pass rate | 0.060 | 0.060 | 0.927 | **0.920** | 1.000 | 0.000 |
| annotation slot coverage | 0.000 | 0.000 | 0.953 | **0.963** | 0.955 | 0.000 |
| fully-annotated fn fraction | 0.000 | 0.000 | 0.893 | **0.903** | 0.898 | 0.000 |
| mean `ruff` violations / sample | 0.47 | 0.46 | 0.07 | **0.09** | 0.00 | 0.11 |
| mean `mypy` errors / sample | 2.30 | 2.35 | 0.13 | **0.13** | 0.00 | 2.25 |
| preference win-rate vs gold | — | 0.000 | 0.490 | **0.527** | — | — |
| preference win-rate vs base | — | 0.500 | 1.000 | **1.000** | — | — |
| preference win-rate vs `rejected` | — | 0.500 | 1.000 | **1.000** | — | — |
**Interpretation (batched n=150):**
- v3 matches SFT on the parse, `black`, and `mypy --strict` gates (within noise); `ruff` is the one gate where it trails.
- v3 has **the highest annotation slot coverage of any model, including gold** (0.963 vs gold 0.955 vs SFT 0.953). Judgment call whether this is "more thorough" or "slight over-annotation."
- v3 is **the only subject to exceed 50% win-rate vs gold** (52.7%) on this eval — measurable DPO-level gain on top of SFT at this sample size.
- ruff regression (0.913 vs SFT 0.960) is small but real; likely a handful of idiomatic style issues introduced by more aggressive annotation.
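
This card does not spell out how "annotation slot coverage" is computed; the published eval scripts are the authoritative definition. As a minimal sketch of one plausible reading, assumed here, each function parameter and each return position counts as one slot, and coverage is the fraction of slots that carry an annotation. The function name and the decision to skip `self`/`cls` are assumptions for illustration.

```python
import ast


def annotation_slot_coverage(source: str) -> float:
    """Return annotated slots / total slots over all functions in `source`."""
    filled = 0
    total = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        spec = node.args
        params = [*spec.posonlyargs, *spec.args, *spec.kwonlyargs]
        params += [p for p in (spec.vararg, spec.kwarg) if p is not None]
        for param in params:
            if param.arg in {"self", "cls"}:
                continue  # assumption: implicit receivers are not slots
            total += 1
            filled += param.annotation is not None
        total += 1  # the return annotation is one more slot
        filled += node.returns is not None
    return filled / total if total else 0.0


typed = "def fib(n: int) -> int:\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
untyped = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
partial = "def add(a: int, b):\n    return a + b\n"

for label, code in (("typed", typed), ("untyped", untyped), ("partial", partial)):
    print(f"{label}: {annotation_slot_coverage(code):.3f}")
```
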
### Single-prompt validation (n=30)
A follow-up job re-decoded 30 stratified validation prompts with `batch_size=1` and no padding — i.e., the realistic one-user inference condition — across all four subjects. This directly tests whether the batched characterization numbers reflect real-world behavior or batching/left-padding artifacts. Full artifacts: [`single-prompt-validation/single-prompt-2026-04-23T002137Z/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/single-prompt-validation/single-prompt-2026-04-23T002137Z).
| metric | base | dpo-v2 | SFT | **dpo-v3** |
|---|---|---|---|---|
| parse rate | 0.933 | 0.967 | 1.000 | **1.000** |
| `black` pass rate | 0.067 | 0.067 | 1.000 | **0.967** |
| `ruff` pass rate | 0.900 | 0.967 | 0.933 | **0.800** |
| `mypy --strict` pass rate | **0.000** | **0.000** | 0.733 | **0.733** |
| annotation slot coverage | 0.000 | 0.000 | 0.971 | **0.976** |
| mean `mypy` errors / sample | 2.33 | 2.40 | 0.30 | **0.30** |
**What this tells us:**
- The **core claim holds under real-world inference.** 0 % → 73 % `mypy --strict` is not a batching artifact.
- The batched n=150 and single-prompt n=30 validations should be treated as two different measurement regimes. We no longer claim that the gap is specifically caused by left-padding or batching as a general explanation.
- **v2's no-op is confirmed** under both decoding modes. Rules out "v2 adapter not loading" as an alternative explanation.
- **SFT and v3 are indistinguishable** at n=30 single-prompt (both 0.733 `mypy`, both ≈ 0.97 annotation coverage). At this sample size we cannot claim v3 is hard-metric better than SFT; the case for v3 over SFT rests on the 52.7 % preference win-rate vs gold in the batched eval.
- **v3's ruff regression is larger in single-prompt mode** (0.800 vs SFT 0.933). Consistent with v3 trading some style-conformance for stronger annotation behavior.
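
To make the sample-size caveat concrete, the sketch below computes a 95% Wilson score interval for a pass rate observed at n=30. It assumes the 0.733 `mypy --strict` rate corresponds to 22 passing samples out of 30; the interval method is a standard choice made here and is not taken from the project's eval scripts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 0.733 at n=30 means 22 passing samples.
low, high = wilson_interval(22, 30)
print(f"mypy --strict pass rate 22/30: 95% interval = ({low:.3f}, {high:.3f})")
```

An interval this wide is why identical point estimates at n=30 support "indistinguishable" rather than "equal".
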
### HumanEval+ external benchmark (n=164)
We also ran a full evalplus HumanEval+ benchmark. That is the stronger
out-of-domain coding benchmark, and it does **not** show a general gain for v3:
| subject | pass@1 base tests | pass@1 plus tests |
|---|---:|---:|
| `base` | 112 / 164 (68.3%) | 99 / 164 (60.4%) |
| `dpo-v2` | 110 / 164 (67.1%) | 97 / 164 (59.1%) |
| `sft` | 97 / 164 (59.1%) | 86 / 164 (52.4%) |
| `dpo-v3` | 96 / 164 (58.5%) | 84 / 164 (51.2%) |
So the honest reading is: v3 changes the model's in-domain type-hinting
behavior, but it is not a generally stronger HumanEval+ solver than the base
model.
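
The percentages in the table can be recomputed from the raw counts. The sketch assumes one generated sample per problem, in which case pass@1 is simply the number of passing problems divided by the total; the variable names are illustrative.

```python
TOTAL_PROBLEMS = 164

# (base-test passes, plus-test passes) copied from the table above.
passed = {
    "base": (112, 99),
    "dpo-v2": (110, 97),
    "sft": (97, 86),
    "dpo-v3": (96, 84),
}


def pct(count: int, total: int) -> float:
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


table = {
    name: (pct(base_ok, TOTAL_PROBLEMS), pct(plus_ok, TOTAL_PROBLEMS))
    for name, (base_ok, plus_ok) in passed.items()
}
problems_lost_vs_base = passed["base"][0] - passed["dpo-v3"][0]

for name, (base_pct, plus_pct) in table.items():
    print(f"{name:7s} base={base_pct:5.1f}%  plus={plus_pct:5.1f}%")
print(f"dpo-v3 solves {problems_lost_vs_base} fewer base-test problems than base")
```
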
---
## Environmental impact
Reported with [CodeCarbon](https://codecarbon.io/) v3.2.6. Raw data: [`emissions.csv`](./emissions.csv).
### Training (this model)
| Metric | Value |
|---|---|
| Duration | 6,005.4 s (1 h 40 m) |
| **Energy consumed** | **0.363 kWh** |
| **CO₂e emissions** | **0.134 kg** |
| GPU energy / avg power | 0.242 kWh / 144.9 W |
| CPU energy / avg power | 0.034 kWh / 21.4 W |
| RAM energy / avg power | 0.087 kWh / 54.0 W |
| Hardware | 1 × NVIDIA A10G, AMD EPYC 7R32 (48 vCPU), 187 GB RAM |
| Region | AWS `us-east-1` (Virginia, USA); PUE 1.0 |
| Tracker | codecarbon 3.2.6, `tracking_mode=machine` |
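
The table values can be cross-checked against the single row in `emissions.csv`. The sketch below hard-codes that row's figures, confirms that the component energies sum to the reported total, and derives the carbon intensity implied by the reported emissions. The derived intensity is a back-calculation, not a value CodeCarbon reports in this file.

```python
# Figures copied from the emissions.csv row in this repo.
duration_s = 6005.371236527004
cpu_kwh = 0.03448764703697292
gpu_kwh = 0.24181496956293191
ram_kwh = 0.08702064124127941
reported_total_kwh = 0.3633232578411839
reported_emissions_kg = 0.13411510475644237

component_sum_kwh = cpu_kwh + gpu_kwh + ram_kwh

# Back-calculated grid intensity implied by the reported numbers.
implied_kg_per_kwh = reported_emissions_kg / reported_total_kwh

# Average GPU power implied by GPU energy over the run duration.
avg_gpu_watts = gpu_kwh * 1000 / (duration_s / 3600)

print(f"component sum:     {component_sum_kwh:.4f} kWh")
print(f"implied intensity: {implied_kg_per_kwh:.3f} kg CO2e / kWh")
print(f"average GPU power: {avg_gpu_watts:.1f} W")
```
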
### Cumulative project footprint
v3 builds on the SFT warm-start, and the failed v2 run consumed training compute as well, so the full energy cost of this model's lineage is:
| Stage | Duration | Energy | CO₂e |
|---|---|---|---|
| SFT training | 8,340 s | 0.472 kWh | 0.174 kg |
| v2 DPO training (failed) | 10,938 s | 0.646 kWh | 0.238 kg |
| **v3 DPO training (this)** | 6,005 s | 0.363 kWh | 0.134 kg |
| v3 characterization (generate × 4 models) | 937 s | 0.052 kWh | 0.019 kg |
| 6 analysis jobs (cpu-upgrade) | ~3 min each, parallel | ~0.01 kWh | ~0.004 kg |
| **Cumulative (SFT + v2 + v3 + eval)** | **~7.3 h** | **~1.55 kWh** | **~0.57 kg** |
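
The cumulative row can be reproduced by summing the stages. This sketch assumes the parallel analysis jobs add energy and emissions but are left out of the wall-clock total, since the table gives only an approximate per-job duration for them.

```python
# (stage, duration in seconds or None, energy in kWh, CO2e in kg), from the table above.
stages = [
    ("SFT training", 8340, 0.472, 0.174),
    ("v2 DPO training", 10938, 0.646, 0.238),
    ("v3 DPO training", 6005, 0.363, 0.134),
    ("v3 characterization", 937, 0.052, 0.019),
    ("analysis jobs (approx.)", None, 0.01, 0.004),
]

total_hours = sum(sec for _, sec, _, _ in stages if sec is not None) / 3600
total_kwh = sum(kwh for _, _, kwh, _ in stages)
total_kg = sum(kg for _, _, _, kg in stages)

print(f"wall-clock: ~{total_hours:.1f} h")
print(f"energy:     ~{total_kwh:.2f} kWh")
print(f"CO2e:       ~{total_kg:.2f} kg")
```
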
### Approximate compute cost
HF Jobs wall-clock billed at published [HF Jobs rates](https://huggingface.co/docs/hub/jobs). Rates shown are approximate.
| Stage | Flavor | Wall-clock | Approx cost |
|---|---|---|---|
| SFT training | a10g-large | 2.32 h | ~$3.50 |
| v2 DPO training | a10g-large | 3.04 h | ~$4.60 |
| **v3 DPO training** | a10g-large | 1.67 h | **~$2.50** |
| v3 characterization generate | a10g-large | 0.26 h | ~$0.40 |
| 6 analysis jobs | cpu-upgrade × 6 parallel | ~3 min each | <$0.05 |
| Rollup report | cpu-basic | <1 min | ~$0 |
| **Cumulative project cost** | | | **~$11** |
---
## Limitations and biases
- **Narrow objective:** optimized only for Python type-hint preference. Docstring style, line length, complexity, security idioms, etc. were not objectives.
- **Possible over-annotation:** `rewards/rejected` fell to ~−20 during training, meaning the model strongly suppresses unhinted outputs. In principle this could cause annotations where Python idiom doesn't require them (trivial lambdas, short list comprehensions). v3's annotation coverage slightly exceeding gold's is mild evidence of this; watch for it in your downstream use.
- **No eval split during training:** v3 trained on the full 6,361-pair pool with no held-out metric for best-checkpoint selection. [v4 draft](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v4.py) adds a 2% eval split and `load_best_model_at_end`.
- **bf16 weights only:** merged safetensors are bf16. Fine for A10G/A100/H100; float16 consumers should cast.
- **Small base model:** 1.5B parameters. For larger code tasks, consider applying the same recipe to Qwen2.5-Coder-7B or similar.
- **English + code only:** training data is English prompts, English/Python responses.
---
## Reproducibility
Everything needed to reproduce this model is on the Hub:
| Artifact | Location |
|---|---|
| Training script | [`mypo-training/mypo_dpo_train_v3.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v3.py) |
| Training data | [`joshuasundance/mypo-4k-rfc`](https://huggingface.co/datasets/joshuasundance/mypo-4k-rfc) |
| SFT warm-start | [`joshuasundance/mypo-qwen2.5-coder-1.5b-sft`](https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-sft) |
| Training energy log | [`emissions.csv`](./emissions.csv) (this repo) |
| Evaluation pipeline | [`mypo-training/eval/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/eval/) (generate / analyze / report scripts) |
| Raw generations | [`mypo-training/generations/2026-04-22-qwen2.5-1.5b-v3/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/generations/2026-04-22-qwen2.5-1.5b-v3) |
| Per-subject analysis | [`mypo-training/analysis/2026-04-22-qwen2.5-1.5b-v3/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/analysis/2026-04-22-qwen2.5-1.5b-v3) |
| Characterization report | [`mypo-training/reports/2026-04-22-qwen2.5-1.5b-v3/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/reports/2026-04-22-qwen2.5-1.5b-v3) |
| Single-prompt validation (n=30) | [`mypo-training/single-prompt-validation/single-prompt-2026-04-23T002137Z/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/single-prompt-validation/single-prompt-2026-04-23T002137Z) |
To re-train from scratch:
```bash
hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py
```
---
## Framework versions
- Python 3.12
- PyTorch 2.4+, Transformers 4.45+, TRL 0.15+, PEFT 0.12+, Datasets 3.0+, Accelerate 0.34+
- CodeCarbon 3.2.6
---
## License
Apache 2.0 (inherits from the [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) base model).
---
## Citations
### This model
```bibtex
@software{mypo_dpo_v3_2026,
title = {{MyPO DPO v3: Qwen2.5-Coder-1.5B Type-Hint Preference Optimization}},
author = {Bailey, Joshua Sundance},
year = 2026,
url = {https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3}
}
```
### CodeCarbon (emissions tracking)
```bibtex
@software{codecarbon,
author = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
title = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
year = 2024,
doi = {10.5281/zenodo.11171501},
url = {https://github.com/mlco2/codecarbon}
}
```
### DPO
```bibtex
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
year = 2023
}
```
### TRL
```bibtex
@software{vonwerra2020trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
license = {Apache-2.0},
url = {https://github.com/huggingface/trl},
year = 2020
}
```
### LoRA
```bibtex
@inproceedings{hu2022lora,
title = {{LoRA: Low-Rank Adaptation of Large Language Models}},
author = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle = {International Conference on Learning Representations},
year = 2022
}
```
### Qwen2.5-Coder (base model)
```bibtex
@article{hui2024qwen25coder,
title = {{Qwen2.5-Coder Technical Report}},
author = {Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
journal = {arXiv preprint arXiv:2409.12186},
year = 2024
}
```

54
chat_template.jinja Normal file

@@ -0,0 +1,54 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0]['role'] == 'system' %}
{{- messages[0]['content'] }}
{%- else %}
{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
{%- endif %}
{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0]['role'] == 'system' %}
{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{{- '<|im_start|>' + message.role }}
{%- if message.content %}
{{- '\n' + message.content }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{{- tool_call.arguments | tojson }}
{{- '}\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}

61
config.json Normal file

@@ -0,0 +1,61 @@
{
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": null,
"dtype": "bfloat16",
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 1536,
"initializer_range": 0.02,
"intermediate_size": 8960,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 12,
"num_hidden_layers": 28,
"num_key_value_heads": 2,
"pad_token_id": 151643,
"rms_norm_eps": 1e-06,
"rope_parameters": {
"rope_theta": 1000000.0,
"rope_type": "default"
},
"sliding_window": null,
"tie_word_embeddings": true,
"transformers_version": "5.6.0",
"use_cache": false,
"use_sliding_window": false,
"vocab_size": 151936
}

2
emissions.csv Normal file

@@ -0,0 +1,2 @@
timestamp,project_name,run_id,experiment_id,duration,emissions,emissions_rate,cpu_power,gpu_power,ram_power,cpu_energy,gpu_energy,ram_energy,energy_consumed,water_consumed,country_name,country_iso_code,region,cloud_provider,cloud_region,os,python_version,codecarbon_version,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,cpu_utilization_percent,gpu_utilization_percent,ram_utilization_percent,ram_used_gb,on_cloud,pue,wue
2026-04-22T22:26:33,codecarbon,ba950a38-b3a3-46b2-9911-7183a9df1633,5b0fa12a-3dd7-45bb-9766-cc326314d9f1,6005.371236527004,0.13411510475644237,2.2332525246849374e-05,21.40210806247722,144.90591554415892,54.0,0.03448764703697292,0.24181496956293191,0.08702064124127941,0.3633232578411839,0.0,United States,USA,virginia,,,Linux-6.12.79-101.147.amzn2023.x86_64-x86_64-with-glibc2.36,3.12.12,3.2.6,48,AMD EPYC 7R32,1,1 x NVIDIA A10G,-77.4903,39.0469,186.68793869018555,machine,3.375802139037433,43.505848930481285,7.6515207219251336,14.27656325936955,N,1.0,0.0

13
generation_config.json Normal file

@@ -0,0 +1,13 @@
{
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.1,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8,
"transformers_version": "5.6.0"
}

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7da19c381d69959c66d3447ab77b4e155b6000744ae4b4d532a1678792d14fd5
size 3087467144

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
size 11421892

30
tokenizer_config.json Normal file

@@ -0,0 +1,30 @@
{
"add_prefix_space": false,
"backend": "tokenizers",
"bos_token": null,
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"extra_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"is_local": false,
"local_files_only": false,
"model_max_length": 32768,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}