Initialize project; model provided by the ModelHub XC community
Model: Efficient-Large-Model/Fast_dLLM_1.5B Source: Original Platform
150
README.md
Normal file
@@ -0,0 +1,150 @@
---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---

# Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM

## 📖 Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherent sequential decoding limits inference efficiency**.

We present **Fast-dLLM v2**, a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-1.5B-Instruct**) into a diffusion-style decoder for **parallel text generation**.

Our approach introduces a novel decoding recipe that combines a block diffusion mechanism with a complementary attention mask. Together they enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further improve inference speed, we design a hierarchical caching mechanism with two levels: a block-level cache that stores historical context representations, and a sub-block-level cache that supports efficient parallel decoding within partially generated blocks.

### ✨ Key Innovations

- **Block Diffusion Mechanism + Complementary Attention Mask**: Enables **blockwise bidirectional context modeling** without sacrificing AR objectives.
- **Hierarchical Caching**
  - **Block-level cache**: Stores historical context representations across blocks.
  - **Sub-block cache**: Supports parallel decoding within partially generated blocks.
- **Token Shift Mechanism**: Retains autoregressive characteristics while supporting bidirectional context within blocks.
- **Parallel Decoding Pipeline**: Achieves up to **2.5× speedup** over standard AR decoding **without compromising quality**.

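
To make "blockwise bidirectional context" concrete, here is a minimal sketch of a plain block-causal attention pattern: a token may attend to every token in its own block and to all earlier blocks. This is an illustration only, not the complementary mask used by Fast-dLLM v2 (which this README does not specify); the function name `block_attention_mask` and the sizes are hypothetical.

```python
def block_attention_mask(seq_len, block_size):
    """Return a seq_len x seq_len boolean matrix (list of lists).

    mask[q][k] is True when query position q may attend to key position k:
    bidirectional inside a block, causal across blocks.
    """
    # Block index of every position, e.g. block_size=2 -> 0, 0, 1, 1, 2, 2
    block_ids = [pos // block_size for pos in range(seq_len)]
    return [
        [block_ids[q] >= block_ids[k] for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    # Print each row as a string of 1s (visible) and 0s (hidden)
    print("".join("1" if allowed else "0" for allowed in row))
```

With `seq_len=6` and `block_size=2`, position 0 can attend to position 1 (same block), while position 1 cannot attend to position 2 (a later block).
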
> 🚀 Fast-dLLM v2 uses **only ~1B tokens** for fine-tuning, roughly a **500× reduction** compared with full-attention diffusion LLMs (Dream: 580B tokens), while **matching or surpassing AR baselines** in accuracy.



---

## 🛠 Model Overview

- **Type**: Block Diffusion Language Model (dLLM)
- **Base Model**: `Qwen/Qwen2.5-1.5B-Instruct`
- **Architecture**: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied embeddings
- **Params**: 1.54B (non-embedding: 1.31B)
- **Layers**: 28
- **Attention Heads**: 12 for Q and 2 for KV (GQA)
- **Key Feature**: Parallel **block-wise decoding** + **hierarchical caching**

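
The figures in this list imply a few useful quantities, worked out in the sketch below: how many query heads share each KV head, how many parameters sit in the embeddings, and how much KV cache one token needs. The layer, head, and parameter counts come from the list; `head_dim = 128` and 2-byte (bf16/fp16) cache entries are assumptions that this README does not state.

```python
# Values from the Model Overview list
num_layers = 28
num_q_heads = 12
num_kv_heads = 2
total_params_b = 1.54          # billions
non_embedding_params_b = 1.31  # billions

# Assumptions, not stated in this README
head_dim = 128       # assumed per-head dimension
bytes_per_value = 2  # assumed bf16/fp16 cache entries

# With GQA, each KV head is shared by a group of query heads
queries_per_kv_head = num_q_heads // num_kv_heads

# Parameters that live in the (tied) embedding matrix, in billions
embedding_params_b = round(total_params_b - non_embedding_params_b, 2)


def kv_cache_bytes_per_token(kv_heads):
    # Two tensors (K and V) are cached in every layer
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value


gqa_bytes = kv_cache_bytes_per_token(num_kv_heads)
mha_bytes = kv_cache_bytes_per_token(num_q_heads)  # if every Q head had its own KV head

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"embedding parameters: {embedding_params_b}B")
print(f"KV cache per token: {gqa_bytes} bytes (GQA) vs {mha_bytes} bytes (no sharing)")
```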
---

## 📦 Installation

You will need `transformers`, `torch`, and `numpy`. The **custom generation function** is loaded from the model repository when you pass `trust_remote_code=True`, as in the Quickstart.

```bash
pip install transformers torch numpy
```

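
After installing, a quick check like the following sketch can confirm that the three packages are visible to your Python environment. It uses only the standard library; the helper name `check_packages` is hypothetical, and no minimum versions are checked because this README does not specify any.

```python
from importlib import metadata


def check_packages(names=("transformers", "torch", "numpy")):
    """Return {package name: installed version, or None if not installed}."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


for package, version in check_packages().items():
    print(f"{package}: {version or 'NOT INSTALLED'}")
```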
---

## 🚀 Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

# trust_remote_code=True is required: the model ships its own modeling
# and generation code in the repository.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the chat messages into a single prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

# Drop the prompt tokens and decode only the newly generated ones
response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

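
The Quickstart passes `threshold=0.9` without defining it. The sketch below shows one plausible reading, assuming `threshold` is a confidence cutoff: in each step, every still-masked position whose top prediction is at least that confident is filled in parallel, and the single most confident position is filled otherwise so that decoding always progresses. This is not the model's actual `generate` implementation; `MASK`, `unmask_step`, and the toy predictions are hypothetical.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def unmask_step(block, predictions, threshold=0.9):
    """Fill masked positions whose top prediction is confident enough.

    block:       list of tokens, with MASK at undecoded positions
    predictions: {position: (token, confidence)} for every masked position
    """
    masked = [i for i, token in enumerate(block) if token is MASK]
    if not masked:
        return list(block)

    # Accept every masked position that clears the confidence threshold
    accepted = [i for i in masked if predictions[i][1] >= threshold]
    if not accepted:
        # Nothing cleared the bar: accept the most confident position anyway
        accepted = [max(masked, key=lambda i: predictions[i][1])]

    new_block = list(block)
    for i in accepted:
        new_block[i] = predictions[i][0]
    return new_block


# Toy example: two positions are confident enough to be decoded in one step
block = [MASK, MASK, MASK, MASK]
step1 = unmask_step(block, {0: ("The", 0.98), 1: ("cat", 0.95), 2: ("sat", 0.60), 3: (".", 0.40)})
# A real model would re-predict the remaining positions before the next step
step2 = unmask_step(step1, {2: ("sat", 0.70), 3: (".", 0.50)})
print(step1)
print(step2)
```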
---

## 📊 Performance & Benchmarks

### ▶ Real-time Throughput

Fast-dLLM v2 offers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.


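
To compare throughput on your own hardware, a timing helper along these lines may be enough. It is a generic sketch, not the benchmark setup behind the number above: `tokens_per_second` and `fake_generate` are hypothetical names, and on a GPU you would also want to call `torch.cuda.synchronize()` before reading the clock.

```python
import time


def tokens_per_second(generate_fn, num_runs=3):
    """Average generated tokens per second over several runs.

    generate_fn: zero-argument callable that runs one generation
                 and returns the number of new tokens it produced.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        total_tokens += generate_fn()
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


def fake_generate():
    # Stand-in for a real model call: pretend 64 tokens took about 10 ms
    time.sleep(0.01)
    return 64


rate = tokens_per_second(fake_generate)
print(f"{rate:.0f} tokens/s")
```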
---

### 🏆 Benchmark Results

We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks: HumanEval and MBPP (code), GSM8K and MATH (reasoning), IFEval (instruction following), and MMLU and GPQA (knowledge QA).

- **1B group**: Fast-dLLM v2 (1.5B) achieves the **best average score: 45.0**.
- **7B group**: Fast-dLLM v2 (7B) achieves the **best average score: 60.3**, surpassing LLaDA and Dream models.



---

## 📜 Citation

If you use Fast-dLLM v2 in your research or products, please cite:

```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
      title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2509.26328},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26328},
}
```

---

## 📄 License

Released under **Apache 2.0**, following the base Qwen2.5 license.

---

## 🔗 Resources

- 📄 [Paper](https://arxiv.org/abs/2509.26328)
- 💻 [Code](https://github.com/NVlabs/Fast-dLLM)
- 🤗 [HuggingFace Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B)