Qwen3-4B-Reasoning-Backfill…/README.md

---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- axolotl
- generated_from_trainer
datasets:
- joeyzero/OpenThought-144k-Backfill-0.2
- joeyzero/dolphin-r1-backfill-0.0.2
model-index:
- name: Qwen3-4B-Reasoning-Backfill-V0.1
  results: []
thumbnail: "https://cdn-uploads.huggingface.co/production/uploads/6334832f86c3fdcdc7acbe8e/DcjT8q4QvTjfCJWEa8kC7.png"
---

<!DOCTYPE html>
<style>
  :root { --bg:#0b0f14; --panel:#121824; --ink:#c9d1d9; --acc:#6e00ff; --cy:#00ffff; --line:#2a3242; }
  html,body{background:#000;color:var(--ink);font-family:ui-sans-serif,system-ui,Segoe UI,Roboto,Helvetica,Arial,sans-serif;margin:0;padding:0}
  .markdown-body{color:var(--ink);max-width:1000px;margin:36px auto;padding:36px;border-radius:12px;position:relative;overflow:hidden}
  .markdown-body::after{content:'';position:absolute;inset:0;background:linear-gradient(180deg,rgba(16,20,32,0.9),rgba(10,14,22,0.95));z-index:-1}
   h1,h2,h3{background:linear-gradient(45deg,#6e00ff,#00ffff);-webkit-background-clip:text;-webkit-text-fill-color:transparent;border-bottom:1px solid var(--line);padding-bottom:.3em;margin-top:1.6em}
  .card{background:rgba(12,16,26,0.6);border:2px solid var(--line);border-radius:12px;padding:20px;margin:16px 0;box-shadow:0 0 12px rgba(110,0,255,.15)}
  .grid{display:grid;grid-template-columns:repeat(auto-fit,minmax(260px,1fr));gap:16px}
  code{background:#0f1522;padding:.2em .4em;border-radius:4px}
  pre{background:#0f1522;border:1px solid var(--line);border-radius:8px;padding:14px;overflow:auto}
  table{width:100%;border-collapse:collapse;margin:14px 0;background:rgba(0,0,0,0.2)}
  th,td{border:1px solid var(--line);padding:10px;text-align:left}
  a{color:var(--cy);text-decoration:none} a:hover{color:var(--acc)}
  details>summary{cursor:pointer}
  *{color-scheme:dark}
</style>

<div class="markdown-body">

<div align="center" style="margin-top:8px">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6334832f86c3fdcdc7acbe8e/DcjT8q4QvTjfCJWEa8kC7.png" alt="Model Visualization" width="500px" style="border: 3px solid #333; box-shadow: 0 0 15px rgba(66, 0, 131, 0.5);" />
  <div style="height:12px"></div>
  <div style="font-size:1.6em;font-weight:800;background:linear-gradient(45deg,#6e00ff,#00ffff);-webkit-background-clip:text;-webkit-text-fill-color:transparent;">
    Qwen3-4B-Reasoning-Backfill-v0.1
  </div>
  <div style="opacity:.75">Experimental reasoning-trace backfiller fine-tuned from <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen/Qwen3-4B</a></div>
</div>

<div class="card">
<h2>Overview</h2>
This is an experimental model trained to <b>reconstruct a plausible chain of reasoning</b> connecting a user-provided <code>INSTRUCTION</code> to a fixed <code>SOLUTION</code> while preserving the original answer. It focuses entirely on the route, not rewriting the destination, producing stepwise “thinking” traces that align with the target output. The goal is to enable <b>reasoning backfill</b> for legacy or non-reasoning datasets where collecting thought processes directly is impractical, such as older chat logs or instruction corpora. These traces can help bootstrap process-supervision signals, support teacher-style models, and deepen auditability of output behavior.

<p>
I would love to try this with larger or more stylized models to see how feasible it is to produce stylized or limited-domain reasoning traces. If you’re a compute partner interested in scaling this line of work to larger backfill models, let’s talk.
</p>

</div>

<div class="card">
    <h3>Train setup</h3>
    <ul>
      <li>Base: Qwen/Qwen3-4B</li>
      <li>Hardware: 1× H100</li>
      <li>Epochs: 4 · Cosine schedule · warmup 40</li>
      <li>Optimizer: <code>adamw_bnb_8bit</code> · lr <code>2.5e-5</code></li>
    </ul>
  </div>

<div class="card">
<h2>Intended Uses</h2>
<ul>
  <li><b>Dataset augmentation</b>: generate process-supervision style traces for older instruction pairs.</li>
  <li><b>Teacher bootstrapping</b>: seed trace-rich examples to train or distill teachers.</li>
  <li><b>Analysis tooling</b>: produce rationales for audit of solution adherence.</li>
</ul>

<h3>Limitations</h3>
<ul>
  <li>Traces are <b>plausible reconstructions</b>, not ground truth cognition.</li>
  <li>Model can over-rationalize if solution is underspecified.</li>
</ul>
</div>

<div class="card">
<h2>Prompting · ChatML</h2>
<p>The model expects the input and output to be wrapped in <code>&lt;|instruction_start|&gt;&lt;|instruction_end|&gt;</code> and <code><|solution_start><|solution_end|></code> tokens respectively. The model will should render the output in <code>&lt;|thinking_start|&gt;&lt;|thinking_end|&gt;</code> tokens to help with parsing and identification of malformed outputs.</p>
<details><summary><b>Example Prompt</b></summary>
<pre>
&lt;|im_start|&gt;system
Your role as an assistant involves thoroughly reconstructing a plausible reasoning process that leads from a user-provided INSTRUCTION to a user-provided SOLUTION. You must not alter the SOLUTION. Each step should include concrete decisions and validations, such as: interpreting the INSTRUCTION, extracting constraints, selecting an approach, justifying key choices, verifying that intermediate results remain consistent with the provided SOLUTION, refining any errors, and a final consistency check noting any residual ambiguities. Use domain-appropriate specifics and avoid filler. Do not introduce new facts that would change the SOLUTION. Now, given an INSTRUCTION and a SOLUTION, reconstruct the Thought and present the Solution per the above guidelines. Your reasoning should begin and end with '&lt;|thinking_start|&gt;' and '&lt;|thinking_end&gt;' respectively.
&lt;|im_end|&gt;
&lt;|im_start|&gt;user
&lt;|instruction_start|&gt;
What are some tips for reducing stress at work? Your response should contain at least 4 bullet points.
Use markdown bullets like "* point". Include the keyword "mindfulness" twice.
&lt;|instruction_end|&gt;

&lt;|solution_start|&gt;
&dash; Practice mindfulness during breaks ...
&dash; Prioritize tasks and set boundaries ...
&dash; Incorporate mindfulness into routine activities ...
&dash; Stay physically active ...
<|solution_end|>
&lt;|im_end|&gt;
&lt;|im_start|&gt;assistant
&lt;|thinking_start|&gt;

</pre>
</details>

</div>

  <div class="card">
    <h2>Recommended Sampling</h2>
    <pre><code>temperature: 0.7–1.0
top_p: 0.9
min_p: 0.05
max_tokens: as needed for trace length
stop: ["&lt;|im_end|&gt;"]</code></pre>
    <p>For tighter adherence, drop temperature toward 0.5–0.7.</p>
  </div>

  <div class="card">
    <h2>Quantizations</h2>
    <ul>
      <li><b>GGUF</b>: coming soon</li>
    </ul>
  </div>

<div class="card">
<details><summary><b>Axolotl config</b></summary>

```yaml
base_model: Qwen/Qwen3-4B
hub_model_id: joeyzero/Qwen3-4B-Reasoning-Backfill-V0.1
hf_use_auth_token: true

load_in_8bit: false
load_in_4bit: false
strict: false

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2.5e-5
max_grad_norm: 1.0
bf16: auto
tf32: false

datasets:
  - path: joeyzero/OpenThought-144k-Backfill-0.2
    type: chat_template
    field_messages: messages
  - path: joeyzero/dolphin-r1-backfill-0.0.2
    type: chat_template
    field_messages: messages

chat_template: chatml
dataset_prepared_path: prepared_data2
output_dir: ./thinking-backfill-0.1.17

sequence_len: 1024
sample_packing: true
pad_to_sequence_len: true
xformers_attention:
flash_attention: true
warmup_steps: 40
save_steps: 0.5

weight_decay: 0.02
wandb_project: reasoning-backfill
wandb_name: reasoning-backfill-attempt-04

```

</details>
</div>
<div align="center" style="opacity:.8;margin-top:10px">
  <small>Made by <a href="https://huggingface.co/joeyzero">joeyzero</a>· contributions and issues welcome.</small>
</div>
</div>