Initialize project; model provided by the ModelHub XC community

Model: joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3 Source: Original Platform
2026-06-16 03:40:17 +08:00
commit acfc2cee00
9 changed files with 713 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,511 @@
+---
+base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
+base_model_relation: finetune
+datasets:
+- joshuasundance/mypo-4k-rfc
+language:
+- en
+- code
+library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
+model_name: mypo-qwen2.5-coder-1.5b-dpo-v3
+tags:
+- generated_from_trainer
+- dpo
+- trl
+- preference-optimization
+- python
+- type-hints
+- code
+- qwen2.5-coder
+- mypo
+- hf_jobs
+- codecarbon
+- carbon-emissions
+co2_eq_emissions:
+  emissions: 134.115
+  source: "CodeCarbon v3.2.6 (measured)"
+  training_type: "fine-tuning"
+  geographical_location: "Virginia, USA (AWS us-east-1)"
+  hardware_used: "1 x NVIDIA A10G (HF Jobs a10g-large)"
+model-index:
+- name: mypo-qwen2.5-coder-1.5b-dpo-v3
+  results:
+  - task:
+      type: text-generation
+      name: Python type-hinted code generation
+    dataset:
+      name: mypo-4k-rfc
+      type: joshuasundance/mypo-4k-rfc
+      split: validation
+    metrics:
+    - type: pass_rate
+      name: parse rate
+      value: 1.000
+    - type: pass_rate
+      name: black pass rate
+      value: 0.953
+    - type: pass_rate
+      name: ruff pass rate
+      value: 0.913
+    - type: pass_rate
+      name: mypy --strict pass rate
+      value: 0.920
+    - type: coverage
+      name: annotation slot coverage
+      value: 0.963
+    - type: win_rate
+      name: preference win-rate vs gold (chosen)
+      value: 0.527
+  - task:
+      type: text-generation
+      name: Python code generation
+    dataset:
+      name: HumanEval+
+      type: humaneval-plus
+      split: test
+    metrics:
+    - type: pass_rate
+      name: pass@1 (base tests)
+      value: 0.5853658536585366
+    - type: pass_rate
+      name: pass@1 (plus tests)
+      value: 0.5121951219512195
+---
+
+# Model Card for `mypo-qwen2.5-coder-1.5b-dpo-v3`
+
+**Preference-tuned Python coding model** that prefers fully type-annotated code by default.
+
+- **Base:** [`Qwen/Qwen2.5-Coder-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
+- **Pipeline:** base → [SFT adapter](https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-sft) (merged) → DPO LoRA (merged) → this model
+- **Training data:** [`joshuasundance/mypo-4k-rfc`](https://huggingface.co/datasets/joshuasundance/mypo-4k-rfc) — `chosen` = type-hinted Python, `rejected` = unhinted Python
+- **This repo ships a fully merged standalone model**, not a LoRA adapter. Load directly with `AutoModelForCausalLM.from_pretrained(...)`.
+- **Training scripts, raw generations, per-subject analysis, and the comparison report live in** [`joshuasundance/mypo-training`](https://huggingface.co/joshuasundance/mypo-training).
+
+---
+
+## TL;DR
+
+v3 is the first DPO model in the MyPO project that actually shifts argmax decoding past the base. Two complementary measurements are reported — both published, both reproducible:
+
+| metric | base | dpo-v2 | SFT | **dpo-v3** | gold (`chosen`) |
+|---|---|---|---|---|---|
+| `mypy --strict` pass — n=150 batched | 6.0% | 6.0% | 92.7% | **92.0%** | 100% |
+| `mypy --strict` pass — n=30 single-prompt | **0.0%** | **0.0%** | **73.3%** | **73.3%** | — |
+| annotation slot coverage — n=150 batched | 0.000 | 0.000 | 0.953 | **0.963** | 0.955 |
+| annotation slot coverage — n=30 single-prompt | 0.000 | 0.000 | 0.971 | **0.976** | — |
+| `black` pass — n=150 batched | 12.0% | 12.0% | 97.3% | **95.3%** | 98.0% |
+| preference win-rate vs gold (n=150) | — | 0.0% | 49.0% | **52.7%** | — |
+
+The large effects are robust: **0 % → 73 %** `mypy --strict` pass and **0.0 → 0.976** annotation slot coverage under real-world single-prompt inference (batch=1, no padding). The earlier batched and single-prompt validations are both retained as in-domain measurements, but we no longer attribute their gap to left-padding or batching as a general causal explanation.
+
+An external benchmark now exists as well: on the latest canonical full
+HumanEval+ run (n=164), this model reaches **96 / 164 = 58.5 %** pass@1 on
+base tests and **84 / 164 = 51.2 %** on plus tests. That still underperforms
+the Qwen base model (`112 / 164` base-test pass, `99 / 164` plus-test pass),
+so v3 should be understood as an in-domain type-hinting preference model
+rather than a generally stronger code generation model.
+
+At n=30 single-prompt, **SFT and v3 are statistically indistinguishable** on the hard metrics; v3's clearer advantage over SFT is the 52.7 % preference win-rate vs gold on the n=150 batched eval (first model to exceed 50 % vs gold). v2 is indistinguishable from base under both decoding conditions — see the [v2 card](https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v2) for the failure-mode post-mortem.
+
+---
+
+## What changed vs v2
+
+v2 logged healthy training telemetry (`rewards/accuracies → 1.0`) but generated text indistinguishable from the base model at greedy decode. The DPO ranking objective can be satisfied by infinitesimal weight deltas when both the LoRA scale and the learning rate are small. v2's effective scale was `α/r = 16/256 = 0.0625`, and its lr was `1e-6`; the product was too small to move argmax decoding.
+
+v3 addresses all proximate causes at once:
+
+| Design choice | v2 | **v3** | Rationale |
+|---|---|---|---|
+| Starting point | Base model | **Base + SFT (merged)** | DPO optimizes *beyond* SFT instead of re-deriving type-hint behavior |
+| LoRA α | 16 | **256** | Matches r=256 → effective scale α/r = 1.0 (was 0.0625) |
+| Learning rate | 1e-6 | **5e-5** | 50× higher; calibrated to the matched LoRA scale |
+| DPO β | 0.1 | **0.3** | Stronger preference margin target |
+| Epochs | 3 | **2** | Higher lr + scale + warm-start → faster convergence |
+| Precision | 4-bit (QLoRA) | **bf16 full** | 1.5B bf16 fits on A10G 24 GB; clean `merge_and_unload` |
+| Optimizer | `paged_adamw_8bit` | **`adamw_torch`** | No bitsandbytes dep in bf16 |
+| Published as | PEFT adapter | **Fully merged model** | v3's DPO LoRA is only valid on top of (base+SFT); shipping a bare adapter would break the obvious load pattern |
+
+---
+
+## Quick start
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    dtype=torch.bfloat16,
+    device_map="auto",
+    attn_implementation="sdpa",
+)
+
+messages = [{"role": "user", "content": "Write a function that returns the nth Fibonacci number."}]
+rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer([rendered], return_tensors="pt", padding=True, truncation=True, max_length=2048).to(model.device)
+out = model.generate(**inputs, max_new_tokens=512, do_sample=False, use_cache=True, pad_token_id=tokenizer.pad_token_id)
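+# `inputs` is a BatchEncoding, so read the prompt length from its input_ids tensor.
+prompt_len = inputs["input_ids"].shape[-1]
+print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))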
+```
+
+No PEFT dependency required — v3 is a merged full model.
+
+### Real one-prompt demo
+
+The repo [`joshuasundance/mypo-training`](https://huggingface.co/joshuasundance/mypo-training) includes a runnable comparison script at [`examples/reproduce_v3.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/examples/reproduce_v3.py). We executed that exact script on HF Jobs (`69e959a92aa1660eaffa8ca6`) with the prompt `Write a function that returns the nth Fibonacci number.`
+
+Observed outputs:
+
+**Base**
+
+```python
+def fibonacci(n: int) -> int:
+  if n == 0 or n == 1:
+    return n
+
+  prev = 0
+  curr = 1
+
+  for i in range(2, n + 1):
+    temp = curr
+    curr += prev
+    prev = temp
+
+  return curr
+
+
+# Driver code
+n = 9
+print(fibonacci(n))
+```
+
+**SFT**
+
+```python
+def fibonacci(n: int) -> int:
+  if n == 0 or n == 1:
+    return n
+
+  prev = 0
+  curr = 1
+
+  for i in range(2, n + 1):
+    temp = curr
+    curr += prev
+    prev = temp
+
+  return curr
+
+
+# Driver code
+n = 9
+print(fibonacci(n))
+```
+
+**DPO-v2**
+
+```python
+def fibonacci(n):
+  # Base cases: F(0) = 0, F(1) = 1
+  if n == 0:
+    return 0
+  elif n == 1:
+    return 1
+  else:
+    return fibonacci(n-1) + fibonacci(n-2)
+```
+```
+...followed by a natural-language explanation block in the same response.
+```
+
+**DPO-v3**
+
+```python
+from typing import Union
+
+
+def fibonacci(n: int) -> Union[int, float]:
+  if n == 0:
+    return 0
+  elif n == 1:
+    return 1
+  else:
+    return fibonacci(n - 1) + fibonacci(n - 2)
+```
+
+This single prompt is useful as a smoke test, but it is **not** the main evidence for v3's value because the base model already returns typed code here. The stronger evidence is the 150-prompt characterization table above: across that broader sample, v3 materially improves annotation coverage and is the only model to exceed 50% preference win-rate vs gold.
+
+If you want to reproduce a specific stored row from the published eval artifacts, use [`examples/reproduce_eval_row.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/examples/reproduce_eval_row.py). We validated this on row 13 from `samples.jsonl`: replaying the prompt by itself did not match the stored sample, but replaying the original 8-prompt batch window did.
+
+---
+
+## Training
+
+Trained with [TRL](https://github.com/huggingface/trl) `DPOTrainer` on a single NVIDIA A10G via [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs). Training script: [`mypo_dpo_train_v3.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v3.py). Job id: `69e933522aa1660eaffa8c51`.
+
+### Hyperparameters (full `DPOConfig`)
+
+| Group | Setting |
+|---|---|
+| Base model | `Qwen/Qwen2.5-Coder-1.5B-Instruct` |
+| Warm-start | `joshuasundance/mypo-qwen2.5-coder-1.5b-sft` (merged into base before DPO LoRA attached) |
+| Dataset | `joshuasundance/mypo-4k-rfc` (train + validation concatenated → 6,361 pairs) |
+| LoRA | `r=256`, `α=256`, `dropout=0.05`, `target_modules="all-linear"`, `task_type=CAUSAL_LM` |
+| Optimization | `adamw_torch`, `lr=5e-5`, cosine schedule, `warmup_steps=100` |
+| DPO | `β=0.3`, `loss_type="sigmoid"` |
+| Batching | `per_device_train_batch_size=1`, `gradient_accumulation_steps=8` (effective 8) |
+| Schedule | `num_train_epochs=2`, `max_length=2048` |
+| Precision | bf16, gradient checkpointing on, `attn_implementation="sdpa"` |
+| Reporting | `report_to=["codecarbon"]`, `logging_steps=10` |
+| Seed | 42 |
+
+### Final training metrics (from job logs)
+
+| Metric | Value |
+|---|---|
+| `train_runtime` | 6,005 s (~1 h 40 m) |
+| `train_loss` (DPO sigmoid) | 3.28 × 10⁻³ |
+| `rewards/accuracies` (final) | 1.000 |
+| `rewards/margins` (peak / plateau) | ~26 |
+| `rewards/chosen` (final) | +6.24 |
+| `rewards/rejected` (final) | −15.7 |
+| `mean_token_accuracy` (final) | 0.910 |
+| `grad_norm` (late training) | ≲ 1 × 10⁻⁵ |
+
+**Convergence note:** `rewards/accuracies` saturated to 1.0 by epoch ~0.3 and `rewards/margins` plateaued by epoch ~0.5. The remaining ~1.5 epochs were cosine-decay ride-out with near-zero grads. [v4 draft](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v4.py) adds `EarlyStoppingCallback` and a held-out eval split to cut this.
+
+---
+
+## Evaluation
+
+Evaluated on 150 stratified validation prompts from `joshuasundance/mypo-4k-rfc`. These prompts are not held out from v3's DPO training, because the training pool concatenated the train and validation splits (see Hyperparameters). Full report: [`reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md`](https://huggingface.co/joshuasundance/mypo-training/blob/main/reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md). Raw generations and per-subject JSON/CSV are also published in the training repo under `generations/` and `analysis/`.
+
+| metric | base | dpo-v2 | SFT | **dpo-v3** | gold (`chosen`) | `rejected` |
+|---|---|---|---|---|---|---|
+| parse rate | 0.973 | 0.973 | 1.000 | **1.000** | 1.000 | 1.000 |
+| `black` pass rate | 0.120 | 0.120 | 0.973 | **0.953** | 0.980 | 0.060 |
+| `ruff` pass rate | 0.933 | 0.940 | 0.960 | **0.913** | 1.000 | 0.913 |
+| `mypy --strict` pass rate | 0.060 | 0.060 | 0.927 | **0.920** | 1.000 | 0.000 |
+| annotation slot coverage | 0.000 | 0.000 | 0.953 | **0.963** | 0.955 | 0.000 |
+| fully-annotated fn fraction | 0.000 | 0.000 | 0.893 | **0.903** | 0.898 | 0.000 |
+| mean `ruff` violations / sample | 0.47 | 0.46 | 0.07 | **0.09** | 0.00 | 0.11 |
+| mean `mypy` errors / sample | 2.30 | 2.35 | 0.13 | **0.13** | 0.00 | 2.25 |
+| preference win-rate vs gold | — | 0.000 | 0.490 | **0.527** | — | — |
+| preference win-rate vs base | — | 0.500 | 1.000 | **1.000** | — | — |
+| preference win-rate vs `rejected` | — | 0.500 | 1.000 | **1.000** | — | — |
+
+**Interpretation (batched n=150):**
+- v3 matches SFT on every quality gate (within noise).
+- v3 has **the highest annotation slot coverage of any model, including gold** (0.963 vs gold 0.955 vs SFT 0.953). Whether this reflects more thorough annotation or slight over-annotation is a judgment call.
+- v3 is **the only subject to exceed 50% win-rate vs gold** (52.7%) on this eval — measurable DPO-level gain on top of SFT at this sample size.
+- ruff regression (0.913 vs SFT 0.960) is small but real; likely a handful of idiomatic style issues introduced by more aggressive annotation.
+
+### Single-prompt validation (n=30)
+
+A follow-up job re-decoded 30 stratified validation prompts with `batch_size=1` and no padding — i.e., the realistic one-user inference condition — across all four subjects. This directly tests whether the batched characterization numbers reflect real-world behavior or batching/left-padding artifacts. Full artifacts: [`single-prompt-validation/single-prompt-2026-04-23T002137Z/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/single-prompt-validation/single-prompt-2026-04-23T002137Z).
+
+| metric | base | dpo-v2 | SFT | **dpo-v3** |
+|---|---|---|---|---|
+| parse rate | 0.933 | 0.967 | 1.000 | **1.000** |
+| `black` pass rate | 0.067 | 0.067 | 1.000 | **0.967** |
+| `ruff` pass rate | 0.900 | 0.967 | 0.933 | **0.800** |
+| `mypy --strict` pass rate | **0.000** | **0.000** | 0.733 | **0.733** |
+| annotation slot coverage | 0.000 | 0.000 | 0.971 | **0.976** |
+| mean `mypy` errors / sample | 2.33 | 2.40 | 0.30 | **0.30** |
+
+**What this tells us:**
+- The **core claim holds under real-world inference.** 0 % → 73 % `mypy --strict` is not a batching artifact.
+- The batched n=150 and single-prompt n=30 validations should be treated as two different measurement regimes. We no longer claim that the gap is specifically caused by left-padding or batching as a general explanation.
+- **v2's no-op is confirmed** under both decoding modes. Rules out "v2 adapter not loading" as an alternative explanation.
+- **SFT and v3 are indistinguishable** at n=30 single-prompt (both 0.733 `mypy`, both ≈ 0.97 annotation coverage). At this sample size we cannot claim v3 is hard-metric better than SFT; the case for v3 over SFT rests on the 52.7 % preference win-rate vs gold in the batched eval.
+- **v3's ruff regression is larger in single-prompt mode** (0.800 vs SFT 0.933). Consistent with v3 trading some style-conformance for stronger annotation behavior.
+
+### HumanEval+ external benchmark (n=164)
+
+We also ran a full evalplus HumanEval+ benchmark. That is the stronger
+out-of-domain coding benchmark, and it does **not** show a general gain for v3:
+
+| subject | pass@1 base tests | pass@1 plus tests |
+|---|---:|---:|
+| `base` | 112 / 164 (68.3%) | 99 / 164 (60.4%) |
+| `dpo-v2` | 110 / 164 (67.1%) | 97 / 164 (59.1%) |
+| `sft` | 97 / 164 (59.1%) | 86 / 164 (52.4%) |
+| `dpo-v3` | 96 / 164 (58.5%) | 84 / 164 (51.2%) |
+
+So the honest reading is: v3 changes the model's in-domain type-hinting
+behavior, but it is not a generally stronger HumanEval+ solver than the base
+model.
+
+---
+
+## Environmental impact
+
+Reported with [CodeCarbon](https://codecarbon.io/) v3.2.6. Raw data: [`emissions.csv`](./emissions.csv).
+
+### Training (this model)
+
+| Metric | Value |
+|---|---|
+| Duration | 6,005.4 s (1 h 40 m) |
+| **Energy consumed** | **0.363 kWh** |
+| **CO₂e emissions** | **0.134 kg** |
+| GPU energy / avg power | 0.242 kWh / 144.9 W |
+| CPU energy / avg power | 0.034 kWh / 21.4 W |
+| RAM energy / avg power | 0.087 kWh / 54.0 W |
+| Hardware | 1 × NVIDIA A10G, AMD EPYC 7R32 (48 vCPU), 187 GB RAM |
+| Region | AWS `us-east-1` (Virginia, USA); PUE 1.0 |
+| Tracker | codecarbon 3.2.6, `tracking_mode=machine` |
+
+### Cumulative project footprint
+
+Because v3 warm-starts from the SFT model, and the failed v2 run was also part of the project, the full energy cost of this model's lineage is:
+
+| Stage | Duration | Energy | CO₂e |
+|---|---|---|---|
+| SFT training | 8,340 s | 0.472 kWh | 0.174 kg |
+| v2 DPO training (failed) | 10,938 s | 0.646 kWh | 0.238 kg |
+| **v3 DPO training (this)** | 6,005 s | 0.363 kWh | 0.134 kg |
+| v3 characterization (generate × 4 models) | 937 s | 0.052 kWh | 0.019 kg |
+| 6 analysis jobs (cpu-upgrade) | ~3 min each, parallel | ~0.01 kWh | ~0.004 kg |
+| **Cumulative (SFT + v2 + v3 + eval)** | **~7.3 h** | **~1.55 kWh** | **~0.57 kg** |
+
+### Approximate compute cost
+
+HF Jobs wall-clock billed at published [HF Jobs rates](https://huggingface.co/docs/hub/jobs). Rates shown are approximate.
+
+| Stage | Flavor | Wall-clock | Approx cost |
+|---|---|---|---|
+| SFT training | a10g-large | 2.32 h | ~$3.50 |
+| v2 DPO training | a10g-large | 3.04 h | ~$4.60 |
+| **v3 DPO training** | a10g-large | 1.67 h | **~$2.50** |
+| v3 characterization generate | a10g-large | 0.26 h | ~$0.40 |
+| 6 analysis jobs | cpu-upgrade × 6 parallel | ~3 min each | <$0.05 |
+| Rollup report | cpu-basic | <1 min | ~$0 |
+| **Cumulative project cost** | | | **~$11** |
+
+---
+
+## Limitations and biases
+
+- **Narrow objective:** optimized only for Python type-hint preference. Docstring style, line length, complexity, security idioms, etc. were not objectives.
+- **Possible over-annotation:** `rewards/rejected` fell to ~−20 during training, meaning the model strongly suppresses unhinted outputs. In principle this could cause annotations where Python idiom doesn't require them (trivial lambdas, short list comprehensions). v3's annotation coverage slightly exceeding gold's is mild evidence of this; watch for it in your downstream use.
+- **No eval split during training:** v3 trained on the full 6,361-pair pool with no held-out metric for best-checkpoint selection. [v4 draft](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v4.py) adds a 2% eval split and `load_best_model_at_end`.
+- **bf16 weights only:** merged safetensors are bf16. Fine for A10G/A100/H100; float16 consumers should cast.
+- **Small base model:** 1.5B parameters. For larger code tasks, consider applying the same recipe to Qwen2.5-Coder-7B or similar.
+- **English + code only:** training data is English prompts, English/Python responses.
+
+---
+
+## Reproducibility
+
+Everything needed to reproduce this model is on the Hub:
+
+| Artifact | Location |
+|---|---|
+| Training script | [`mypo-training/mypo_dpo_train_v3.py`](https://huggingface.co/joshuasundance/mypo-training/blob/main/mypo_dpo_train_v3.py) |
+| Training data | [`joshuasundance/mypo-4k-rfc`](https://huggingface.co/datasets/joshuasundance/mypo-4k-rfc) |
+| SFT warm-start | [`joshuasundance/mypo-qwen2.5-coder-1.5b-sft`](https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-sft) |
+| Training energy log | [`emissions.csv`](./emissions.csv) (this repo) |
+| Evaluation pipeline | [`mypo-training/eval/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/eval/) (generate / analyze / report scripts) |
+| Raw generations | [`mypo-training/generations/2026-04-22-qwen2.5-1.5b-v3/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/generations/2026-04-22-qwen2.5-1.5b-v3) |
+| Per-subject analysis | [`mypo-training/analysis/2026-04-22-qwen2.5-1.5b-v3/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/analysis/2026-04-22-qwen2.5-1.5b-v3) |
+| Characterization report | [`mypo-training/reports/2026-04-22-qwen2.5-1.5b-v3/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/reports/2026-04-22-qwen2.5-1.5b-v3) |
+| Single-prompt validation (n=30) | [`mypo-training/single-prompt-validation/single-prompt-2026-04-23T002137Z/`](https://huggingface.co/joshuasundance/mypo-training/tree/main/single-prompt-validation/single-prompt-2026-04-23T002137Z) |
+
+To re-train from scratch:
+
+```bash
+hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
+  https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py
+```
+
+---
+
+## Framework versions
+
+- Python 3.12
+- PyTorch 2.4+, Transformers 4.45+, TRL 0.15+, PEFT 0.12+, Datasets 3.0+, Accelerate 0.34+
+- CodeCarbon 3.2.6
+
+---
+
+## License
+
+Apache 2.0 (inherits from the [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) base model).
+
+---
+
+## Citations
+
+### This model
+
+```bibtex
+@software{mypo_dpo_v3_2026,
+  title   = {{MyPO DPO v3: Qwen2.5-Coder-1.5B Type-Hint Preference Optimization}},
+  author  = {Bailey, Joshua Sundance},
+  year    = 2026,
+  url     = {https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3}
+}
+```
+
+### CodeCarbon (emissions tracking)
+
+```bibtex
+@software{codecarbon,
+  author  = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
+  title   = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
+  year    = 2024,
+  doi     = {10.5281/zenodo.11171501},
+  url     = {https://github.com/mlco2/codecarbon}
+}
+```
+
+### DPO
+
+```bibtex
+@inproceedings{rafailov2023direct,
+  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
+  author    = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
+  booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
+  year      = 2023
+}
+```
+
+### TRL
+
+```bibtex
+@software{vonwerra2020trl,
+  title   = {{TRL: Transformer Reinforcement Learning}},
+  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
+  license = {Apache-2.0},
+  url     = {https://github.com/huggingface/trl},
+  year    = 2020
+}
+```
+
+### LoRA
+
+```bibtex
+@inproceedings{hu2022lora,
+  title     = {{LoRA: Low-Rank Adaptation of Large Language Models}},
+  author    = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
+  booktitle = {International Conference on Learning Representations},
+  year      = 2022
+}
+```
+
+### Qwen2.5-Coder (base model)
+
+```bibtex
+@article{hui2024qwen25coder,
+  title   = {{Qwen2.5-Coder Technical Report}},
+  author  = {Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
+  journal = {arXiv preprint arXiv:2409.12186},
+  year    = 2024
+}
+```
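Review note on the README's "What changed vs v2" section: the claim that the DPO ranking objective can be satisfied by very small weight deltas can be illustrated with a few lines of arithmetic. The sketch below is not taken from the training script. `lora_scale`, `dpo_sigmoid_loss`, and `reward_accuracy` are hypothetical helper names, and the log-ratio inputs are made-up illustrative numbers rather than logged values. It assumes the standard sigmoid DPO loss, -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Effective LoRA scaling factor alpha / r, as in the README's v2/v3 table."""
    return alpha / r


def dpo_sigmoid_loss(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(beta * (chosen - rejected))).

    Each log-ratio stands for log pi_policy(y|x) - log pi_ref(y|x) (illustrative inputs).
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def reward_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs whose chosen log-ratio exceeds the rejected one."""
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)


# Hypothetical pairs with a tiny but consistently positive margin.
tiny_pairs = [(2e-6, 1e-6), (5e-6, 0.0), (1e-6, -1e-6)]

print("v2 scale:", lora_scale(16, 256))   # alpha=16, r=256
print("v3 scale:", lora_scale(256, 256))  # alpha=256, r=256
print("accuracy:", reward_accuracy(tiny_pairs))
mean_loss = sum(dpo_sigmoid_loss(c, r, beta=0.1) for c, r in tiny_pairs) / len(tiny_pairs)
print("mean loss:", mean_loss)
print("log(2):   ", math.log(2))
```

With margins this small, ranking accuracy is already 1.0 while the loss sits next to log 2, which is the regime the README describes for v2: healthy-looking accuracy telemetry without a meaningful change in the policy.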
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,54 @@
+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
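Review note on `chat_template.jinja`: for the common case of plain text messages with no `tools`, the template reduces to ChatML with a default system prompt. The Python function below is a hand-written approximation of that path only. `render_chatml` is a hypothetical name, tool calls and tool responses are deliberately not handled, and real use should go through `tokenizer.apply_chat_template` as in the README.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Approximate the template's no-tools, text-only branch."""
    parts: list[str] = []
    # The system block uses the first message if it is a system message, else the default.
    if messages and messages[0]["role"] == "system":
        system_text = messages[0]["content"]
    else:
        system_text = DEFAULT_SYSTEM
    parts.append(f"<|im_start|>system\n{system_text}<|im_end|>\n")

    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            continue  # already emitted in the system block above
        if message["role"] not in {"user", "system", "assistant"}:
            raise ValueError("this sketch does not cover tool messages")
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")

    if add_generation_prompt:
        # Open the assistant turn so generation continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "user", "content": "Write a function that returns the nth Fibonacci number."}]
)
print(prompt)
```

Printing the rendered prompt this way is mainly useful for checking that a serving stack which builds prompts by hand matches what the bundled template would produce.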
--- a/config.json
+++ b/config.json
@@ -0,0 +1,61 @@
+{
+  "architectures": [
+    "Qwen2ForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": null,
+  "dtype": "bfloat16",
+  "eos_token_id": 151645,
+  "hidden_act": "silu",
+  "hidden_size": 1536,
+  "initializer_range": 0.02,
+  "intermediate_size": 8960,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen2",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 2,
+  "pad_token_id": 151643,
+  "rms_norm_eps": 1e-06,
+  "rope_parameters": {
+    "rope_theta": 1000000.0,
+    "rope_type": "default"
+  },
+  "sliding_window": null,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.6.0",
+  "use_cache": false,
+  "use_sliding_window": false,
+  "vocab_size": 151936
+}
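Review note on `config.json`: the shape fields are enough to estimate the parameter count and compare it with the `model.safetensors` LFS pointer size in this commit (3087467144 bytes). The sketch below assumes the usual Qwen2 layer layout, which the config does not spell out: biases on the q/k/v projections only, a gated MLP with three weight matrices, two RMSNorm weight vectors per layer plus a final one, and an output head tied to the embedding (`tie_word_embeddings: true`). `count_params` is a hypothetical helper.

```python
CONFIG = {
    "hidden_size": 1536,
    "intermediate_size": 8960,
    "num_attention_heads": 12,
    "num_hidden_layers": 28,
    "num_key_value_heads": 2,
    "vocab_size": 151936,
}
SAFETENSORS_BYTES = 3087467144  # size field of the model.safetensors LFS pointer
BYTES_PER_BF16 = 2


def count_params(cfg: dict[str, int]) -> int:
    """Estimate the parameter count under the layout assumptions stated above."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]

    # Attention: q/k/v with bias (assumed), output projection without bias (assumed).
    attention = (hidden * hidden + hidden) + 2 * (hidden * kv_dim + kv_dim) + hidden * hidden
    # Gated MLP: gate, up, and down projections without biases (assumed).
    mlp = 3 * hidden * cfg["intermediate_size"]
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attention + mlp + norms

    embedding = cfg["vocab_size"] * hidden  # counted once because the head is tied
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + embedding + final_norm


params = count_params(CONFIG)
tensor_bytes = params * BYTES_PER_BF16
print(f"estimated parameters: {params:,}")
print(f"estimated bf16 tensor bytes: {tensor_bytes:,}")
print(f"pointer size minus estimate: {SAFETENSORS_BYTES - tensor_bytes:,} bytes")
```

Under these assumptions the estimate lands within a few tens of kilobytes of the pointer size. That gap would be consistent with a safetensors header, though this reading is an inference and not something the commit states.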
--- a/emissions.csv
+++ b/emissions.csv
@@ -0,0 +1,2 @@
+timestamp,project_name,run_id,experiment_id,duration,emissions,emissions_rate,cpu_power,gpu_power,ram_power,cpu_energy,gpu_energy,ram_energy,energy_consumed,water_consumed,country_name,country_iso_code,region,cloud_provider,cloud_region,os,python_version,codecarbon_version,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,cpu_utilization_percent,gpu_utilization_percent,ram_utilization_percent,ram_used_gb,on_cloud,pue,wue
+2026-04-22T22:26:33,codecarbon,ba950a38-b3a3-46b2-9911-7183a9df1633,5b0fa12a-3dd7-45bb-9766-cc326314d9f1,6005.371236527004,0.13411510475644237,2.2332525246849374e-05,21.40210806247722,144.90591554415892,54.0,0.03448764703697292,0.24181496956293191,0.08702064124127941,0.3633232578411839,0.0,United States,USA,virginia,,,Linux-6.12.79-101.147.amzn2023.x86_64-x86_64-with-glibc2.36,3.12.12,3.2.6,48,AMD EPYC 7R32,1,1 x NVIDIA A10G,-77.4903,39.0469,186.68793869018555,machine,3.375802139037433,43.505848930481285,7.6515207219251336,14.27656325936955,N,1.0,0.0
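Review note on `emissions.csv`: the README's energy and CO₂e figures can be cross-checked against this row. The sketch below copies a subset of the columns and values from the row above into an inline string (the real file has more columns) and uses hypothetical names such as `load_row`. It checks that the component energies sum to `energy_consumed` and converts the kilogram figure to the grams used in the model card metadata.

```python
import csv
import io

# Subset of columns copied from emissions.csv in this commit.
CSV_TEXT = """duration,emissions,emissions_rate,cpu_energy,gpu_energy,ram_energy,energy_consumed
6005.371236527004,0.13411510475644237,2.2332525246849374e-05,0.03448764703697292,0.24181496956293191,0.08702064124127941,0.3633232578411839
"""

JOULES_PER_KWH = 3.6e6


def load_row(text: str) -> dict[str, float]:
    """Parse the single data row into floats keyed by column name."""
    row = next(csv.DictReader(io.StringIO(text)))
    return {key: float(value) for key, value in row.items()}


row = load_row(CSV_TEXT)
component_kwh = row["cpu_energy"] + row["gpu_energy"] + row["ram_energy"]
emissions_g = row["emissions"] * 1000.0  # CodeCarbon reports emissions in kg CO2e
avg_gpu_watts = row["gpu_energy"] * JOULES_PER_KWH / row["duration"]

print(f"component energy sum: {component_kwh:.6f} kWh")
print(f"energy_consumed:      {row['energy_consumed']:.6f} kWh")
print(f"emissions:            {emissions_g:.3f} g CO2e")
print(f"mean GPU power from energy/duration: {avg_gpu_watts:.1f} W")
```

The mean power derived from energy and duration need not equal the `gpu_power` column exactly, so small differences from the README's power figures are expected.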
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,13 @@
+{
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "pad_token_id": 151643,
+  "repetition_penalty": 1.1,
+  "temperature": 0.7,
+  "top_k": 20,
+  "top_p": 0.8,
+  "transformers_version": "5.6.0"
+}
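Review note on `generation_config.json`: these defaults enable sampling (`do_sample: true`, `temperature: 0.7`, `top_k: 20`, `top_p: 0.8`), whereas the README's quick start passes `do_sample=False` for greedy decoding. The sketch below is a simplified, pure-Python illustration of how temperature, top-k, and top-p narrow a next-token distribution. It is not the Transformers implementation, it ignores `repetition_penalty`, and `filter_probs` and the toy logits are hypothetical.

```python
import math


def filter_probs(
    logits: list[float],
    temperature: float = 0.7,
    top_k: int = 20,
    top_p: float = 0.8,
) -> dict[int, float]:
    """Return renormalized probabilities for the token ids that survive filtering."""
    scaled = [value / temperature for value in logits]
    # Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors, shifted by the max for numerical stability.
    peak = scaled[order[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(weights.values())

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept: dict[int, float] = {}
    cumulative = 0.0
    for i in order:
        probability = weights[i] / total
        kept[i] = probability
        cumulative += probability
        if cumulative >= top_p:
            break

    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


toy_logits = [2.0, 1.0, 0.0, -1.0]
kept = filter_probs(toy_logits)
greedy_choice = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

print("ids still eligible for sampling:", sorted(kept))
print("greedy choice:", greedy_choice)
```

Greedy decoding always takes the single highest-scoring token, while the sampled defaults can pick any surviving id. This is why outputs from a bare `model.generate(**inputs)` call may differ from the README's greedy examples.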
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7da19c381d69959c66d3447ab77b4e155b6000744ae4b4d532a1678792d14fd5
+size 3087467144
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,30 @@
+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "local_files_only": false,
+  "model_max_length": 32768,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}