---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- jacobi-forcing
- speculative-decoding
- qwen3
- text-generation
language:
- en
pipeline_tag: text-generation
---

# Qwen3-1.7B Jacobi Forcing (v2math811, AR×10)

Jacobi-Forcing fine-tune of Qwen/Qwen3-1.7B, trained on a mixed code + math trajectory dataset (v2math811). Output quality matches the base AR model, while Jacobi parallel decoding delivers a ~1.5–1.7× wall-clock speedup.

## Highlights

- Lossless quality: HumanEval pass@1 and GSM8K accuracy match base AR generation (within noise).
- Speedup: 1.65× on HumanEval, 1.53× on GSM8K (vs. greedy AR decoding with the same model).
- Drop-in compatible with Hugging Face `AutoModelForCausalLM` for AR generation. Jacobi inference requires the JacobiForcing repo (custom forward kernel).

## Training recipe

Continued training from base Qwen3-1.7B with the consistency + AR loss from the JacobiForcing paper (a hedged sketch of the objective follows the table below):

| Setting | Value |
| --- | --- |
| Base | Qwen/Qwen3-1.7B |
| Dataset | code (OpenCodeInstruct buckets 8–11) + math (OpenThought2 buckets 8–11); 26,510 trajectory samples after the `traj_count ≤ 3` filter |
| Strategy | progressive noise window, N=32, window=16 |
| Epochs | 1 |
| Optimizer | AdamW |
| LR | 5e-6 (cosine schedule, warmup ratio 0.03) |
| Batch | per-device 1 × grad-accum 4 = effective batch 4 |
| Precision | bf16 |
| AR_LOSS_WEIGHT | 10 (paper default; 20 was tested and gave slightly worse Jacobi acceptance) |
| GPU | 1× A100-80GB, ~4h47m |
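
For orientation, here is a minimal sketch of what a consistency + AR objective can look like in PyTorch. The tensor names, the KL form of the consistency term, and the function signature are illustrative assumptions, not the repo's actual code; only `AR_LOSS_WEIGHT = 10` comes from the recipe above.

```python
import torch.nn.functional as F

AR_LOSS_WEIGHT = 10.0  # from the recipe above; 20 gave worse Jacobi acceptance

def jacobi_forcing_loss(logits_traj, logits_fixed, logits_ar, labels):
    """Hypothetical sketch: consistency term + weighted AR cross-entropy."""
    # Consistency: pull predictions at intermediate Jacobi-trajectory states
    # toward the converged fixed-point distribution (no gradient through it).
    consistency = F.kl_div(
        F.log_softmax(logits_traj, dim=-1),
        F.softmax(logits_fixed.detach(), dim=-1),
        reduction="batchmean",
    )
    # Standard next-token cross-entropy anchors AR generation quality.
    ar = F.cross_entropy(logits_ar.view(-1, logits_ar.size(-1)), labels.view(-1))
    return consistency + AR_LOSS_WEIGHT * ar
```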

## Benchmarks (1× A100, greedy decoding)

| Bench | AR pass@1 / acc | Jacobi pass@1 / acc | AR tok/s | Jacobi tok/s | Speedup |
| --- | --- | --- | --- | --- | --- |
| HumanEval (n=164) | 60.4% | 61.0% | 37.2 | 61.3 | 1.65× |
| GSM8K (n=653 subset) | 72.4% | 74.3% | 38.0 | 58.3 | 1.53× |

Jacobi internals (HumanEval): 1.74 tokens committed per iteration; the accepted window averages 87% of N=32.
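
The Speedup column follows directly from the throughput numbers; a quick sanity check:

```python
# Pure arithmetic on the table above; no model required.
ar_tok_s, jacobi_tok_s = 37.2, 61.3        # HumanEval throughput
print(f"{jacobi_tok_s / ar_tok_s:.2f}x")   # 1.65x, matching the Speedup column
# Each Jacobi iteration commits ~1.74 tokens on average, vs. exactly 1 token
# per forward pass under standard AR decoding.
```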

## Usage — standard AR

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

ckpt = "zcyzcyzcy/qwen3-1.7b-jf-v2math811-ar10"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Build a chat prompt; enable_thinking=False disables Qwen3's thinking mode.
msgs = [{"role": "user", "content": "Write a Python function is_prime(n)."}]
inp = tok.apply_chat_template(
    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
ids = tok(inp, return_tensors="pt").to("cuda")
out = model.generate(**ids, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens (skip the prompt).
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Usage — Jacobi parallel decoding

Jacobi inference uses a custom `jacobi_forward_greedy` method registered on `Qwen3ForCausalLM`. See the JacobiForcing repo for the full inference script, or adapt the snippet below:

```python
from transformers import Qwen3ForCausalLM
from generate_trajectory.generation.qwen3_modeling_jacobi_forcing_greedy import (
    jacobi_forward_greedy,
)

# Monkey-patch the Jacobi forward pass onto the model class.
Qwen3ForCausalLM.jacobi_forward_greedy = jacobi_forward_greedy
# ... call model.jacobi_forward_greedy(...) for prefill + generation phases.
```
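
For intuition about what the kernel computes, here is a self-contained, unoptimized sketch of greedy Jacobi decoding using only the standard Hugging Face forward pass. This is an illustration of the algorithm, not the repo's implementation: the real kernel reuses the KV cache across iterations instead of re-encoding the full prefix, and the function name and defaults here are made up.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, input_ids, n=32):
    """Educational sketch: decode one n-token block via greedy Jacobi
    fixed-point iteration (re-encodes the whole prefix every step)."""
    # Initialize the n-token draft arbitrarily, e.g. the last prompt token.
    draft = input_ids[:, -1:].repeat(1, n)
    for _ in range(n):  # at least one more position becomes final per step
        seq = torch.cat([input_ids, draft], dim=-1)
        logits = model(seq).logits
        # One parallel forward pass predicts every draft position at once.
        new_draft = logits[:, -n - 1:-1, :].argmax(dim=-1)
        if torch.equal(new_draft, draft):
            break  # fixed point: the block now equals the greedy AR output
        draft = new_draft
    return torch.cat([input_ids, draft], dim=-1)
```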

The model checkpoint itself is a standard Qwen3 — no architecture changes — so any speculative-decoding framework that accepts a Qwen3 base model can drive it.
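
For example, the checkpoint loads as a stock Qwen3 in vLLM (illustrative; any Qwen3-capable stack should work the same way, though this gives plain AR decoding, not Jacobi):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="zcyzcyzcy/qwen3-1.7b-jf-v2math811-ar10", dtype="bfloat16")
out = llm.generate(
    ["Write a Python function is_prime(n)."],
    SamplingParams(temperature=0.0, max_tokens=200),  # greedy, as benchmarked
)
print(out[0].outputs[0].text)
```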

## Citation

```bibtex
@article{kou2024cllm,
  title={CLLMs: Consistency Large Language Models},
  author={Kou, Siqi and Hu, Lanxiang and He, Zhezhi and Deng, Zhijie and Zhang, Hao},
  journal={arXiv preprint arXiv:2403.00835},
  year={2024}
}
```

## License

Apache 2.0, inherited from the base Qwen3-1.7B model.
