---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- jacobi-forcing
- speculative-decoding
- qwen3
- text-generation
language:
- en
pipeline_tag: text-generation
---

# Qwen3-1.7B Jacobi Forcing (v2math811, AR×10)

A Jacobi-Forcing fine-tune of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), trained on a mixed code + math trajectory dataset (`v2math811`). Its output quality matches the base AR model while supporting **Jacobi parallel decoding for ~1.5–1.7× wall-clock speedup**.

## Highlights

- **Lossless quality**: HumanEval pass@1 and GSM8K accuracy match base AR generation (within noise).
- **Speedup**: 1.65× on HumanEval and 1.53× on GSM8K (vs. greedy AR decoding with the same model).
- **Drop-in compatible** with Hugging Face `AutoModelForCausalLM` for AR generation. Jacobi inference requires the [JacobiForcing repo](https://github.com/) (custom forward kernel).

## Training recipe

Training continues from base Qwen3-1.7B with the consistency + AR loss from the [JacobiForcing](https://arxiv.org/abs/2403.00835) paper (arXiv:2403.00835, listed under Citation as CLLMs):

| Setting | Value |
| --- | --- |
| Base | `Qwen/Qwen3-1.7B` |
| Dataset | code (OpenCodeInstruct buckets 8-11) + math (OpenThought2 buckets 8-11); 26 510 trajectory samples after the `traj_count ≤ 3` filter |
| Strategy | progressive noise window, N=32, window=16 |
| Epochs | 1 |
| Optimizer | AdamW |
| LR | 5e-6 (cosine schedule, warmup 0.03) |
| Batch | per-device 1 × grad-accum 4 = effective 4 |
| Precision | bf16 |
| `AR_LOSS_WEIGHT` | **10** (paper default; 20 was also tested and gave slightly worse Jacobi acceptance) |
| GPU | 1× A100-80GB, ~4h47m |

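
The schedule and loss weighting in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the training script: it assumes every one of the 26 510 samples is seen once at effective batch size 4, that "warmup 0.03" is a ratio of total steps with linear warmup followed by cosine decay to zero, and that the two loss terms are combined as `consistency + AR_LOSS_WEIGHT * ar`. The function names `lr_at` and `combined_loss` are hypothetical.

```python
import math

NUM_SAMPLES = 26_510      # trajectory samples (from the table)
EFFECTIVE_BATCH = 1 * 4   # per-device 1 x grad-accum 4, single GPU
PEAK_LR = 5e-6
WARMUP_RATIO = 0.03       # assumption: a ratio of total optimizer steps
AR_LOSS_WEIGHT = 10.0

# Assumption: one optimizer step per effective batch, last partial batch kept.
TOTAL_STEPS = math.ceil(NUM_SAMPLES / EFFECTIVE_BATCH)
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_loss(consistency_loss: float, ar_loss: float) -> float:
    """Assumed weighting: consistency term plus AR term scaled by AR_LOSS_WEIGHT."""
    return consistency_loss + AR_LOSS_WEIGHT * ar_loss


if __name__ == "__main__":
    print("optimizer steps:", TOTAL_STEPS, "warmup steps:", WARMUP_STEPS)
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run has 6628 optimizer steps, of which the first 198 are warmup. With a weight of 10, the AR term dominates the sum whenever the two raw losses are of similar size.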
## Benchmarks (1× A100, greedy)

| Bench | AR pass@1 / acc | Jacobi pass@1 / acc | AR tok/s | Jacobi tok/s | Speedup |
| --- | ---: | ---: | ---: | ---: | ---: |
| HumanEval (n=164) | 60.4 % | **61.0 %** | 37.2 | 61.3 | **1.65×** |
| GSM8K (n=653 subset) | 72.4 % | **74.3 %** | 38.0 | 58.3 | **1.53×** |

Jacobi internals (HumanEval): tok/iter = 1.74, average accept-window 87 % of N=32.

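
The Speedup column follows from the two throughput columns, and the accuracy gaps can be read as problem counts. The check below is a small sketch: the solved-problem counts are inferred by rounding accuracy × n and are not reported in the source, and the helper names are hypothetical.

```python
# Rows copied from the benchmark table:
# (n, AR acc %, Jacobi acc %, AR tok/s, Jacobi tok/s)
ROWS = {
    "HumanEval": (164, 60.4, 61.0, 37.2, 61.3),
    "GSM8K": (653, 72.4, 74.3, 38.0, 58.3),
}


def speedup(ar_tps: float, jacobi_tps: float) -> float:
    """Wall-clock speedup as the ratio of decoding throughputs."""
    return jacobi_tps / ar_tps


def implied_solved(n: int, acc_percent: float) -> int:
    """Inferred number of solved problems (rounded; not reported in the source)."""
    return round(n * acc_percent / 100)


for name, (n, ar_acc, jf_acc, ar_tps, jf_tps) in ROWS.items():
    gap = implied_solved(n, jf_acc) - implied_solved(n, ar_acc)
    print(f"{name}: speedup {speedup(ar_tps, jf_tps):.2f}x, "
          f"Jacobi solves {gap} more problem(s) out of {n}")
```

Under this rounding, the 0.6-point HumanEval gap corresponds to a single problem out of 164, which is consistent with reading the difference as noise.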
## Usage — standard AR

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ckpt = "zcyzcyzcy/qwen3-1.7b-jf-v2math811-ar10"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Build a chat prompt with Qwen3 thinking mode disabled.
msgs = [{"role": "user", "content": "Write a Python is_prime(n)."}]
inp = tok.apply_chat_template(
    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
ids = tok(inp, return_tensors="pt").to("cuda")

# Greedy decoding; print only the newly generated tokens.
out = model.generate(**ids, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```

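
To compare against the AR tok/s column, throughput can be estimated by timing a few generations. The helper below is a rough sketch, not the benchmark harness behind the reported numbers: `tokens_per_second` and its `generate_fn` argument are hypothetical, and the measurement method used for the table is not stated in this card.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate_fn: Callable[[], Sequence[int]],
                      prompt_len: int, repeats: int = 3) -> float:
    """Average number of new tokens per second over `repeats` generations.

    `generate_fn` is a hypothetical zero-argument callable returning the full
    token sequence (prompt + completion) of one generation.
    """
    total_new_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        sequence = generate_fn()
        total_new_tokens += len(sequence) - prompt_len
    elapsed = time.perf_counter() - start
    return total_new_tokens / elapsed


# Example with the objects from the snippet above (needs a GPU and the checkpoint):
# rate = tokens_per_second(
#     lambda: model.generate(**ids, max_new_tokens=200, do_sample=False)[0],
#     prompt_len=ids["input_ids"].shape[1],
# )
```

A warm-up call before timing, and a `torch.cuda.synchronize()` inside the callable, would likely make GPU measurements more stable.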
## Usage — Jacobi parallel decoding

Jacobi inference uses a custom `jacobi_forward_greedy` method registered on `Qwen3ForCausalLM`. See the [JacobiForcing repo](https://github.com/) for the full inference script, or start from this snippet:

```python
from transformers import Qwen3ForCausalLM
from generate_trajectory.generation.qwen3_modeling_jacobi_forcing_greedy import (
    jacobi_forward_greedy,
)

# Attach the Jacobi forward pass to the stock Qwen3 class.
Qwen3ForCausalLM.jacobi_forward_greedy = jacobi_forward_greedy
# ... call model.jacobi_forward_greedy(...) for prefill + generation phases.
```

The model checkpoint itself is a standard Qwen3 with no architecture changes, so any speculative-decoding framework that accepts a Qwen3 base model can drive it.

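
For readers new to Jacobi decoding, the idea can be shown on a toy problem: guess a whole block of tokens, refresh every position from the previous guess, and repeat until nothing changes. The fixed point equals the greedy AR output. The sketch below does not use the model or the repo's `jacobi_forward_greedy`; `toy_next_token` is a hypothetical deterministic stand-in for one greedy LM step.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_next_token(prefix: List[int]) -> int:
    """Deterministic stand-in for a greedy LM step (hypothetical, not a real model)."""
    return (31 * sum(prefix) + 7 * len(prefix)) % 50


def ar_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Greedy autoregressive decoding: one token per step."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(prompt + out))
    return out


def jacobi_decode(next_token: NextToken, prompt: List[int],
                  n: int) -> Tuple[List[int], int]:
    """Jacobi fixed-point decoding of an n-token block.

    Each iteration recomputes every position from the previous iterate; a real
    model would do this in a single batched forward pass per iteration.
    """
    guess = [0] * n  # arbitrary initial block
    iterations = 0
    while True:
        iterations += 1
        new = [next_token(prompt + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: every token agrees with its prefix
            return new, iterations
        guess = new


if __name__ == "__main__":
    prompt = [3, 1, 4]
    block, iters = jacobi_decode(toy_next_token, prompt, n=8)
    print(block == ar_decode(toy_next_token, prompt, n=8), iters)
```

Position `i` is guaranteed correct after iteration `i + 1`, so the loop ends within `n + 1` iterations. The toy function gives no saving because it rarely settles more than one token per iteration; a speedup appears only when several tokens settle at once, which is what the reported tok/iter = 1.74 on HumanEval measures.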
## Citation

```bibtex
@article{kou2024cllm,
  title={CLLMs: Consistency Large Language Models},
  author={Kou, Siqi and Hu, Lanxiang and He, Zhezhi and Deng, Zhijie and Zhang, Hao},
  journal={arXiv preprint arXiv:2403.00835},
  year={2024}
}
```

## License

Apache 2.0, inherited from the base Qwen3-1.7B model.