---
license: mit
pipeline_tag: text-generation
library_name: transformers
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/66a056d0229269a861ac1245/UmJOD5HnhCfvy3nAXgxgE.png" alt="PARD" width="100" align="left">
<div align="center">
<h1>PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation</h1>
</div>
<p align="center"> |
<a href="https://arxiv.org/abs/2504.18583"><b>Paper</b></a> |
<a href="https://github.com/AMD-AIG-AIMA/PARD"><b>Github</b></a> |
<a href="https://www.amd.com/en/developer/resources/technical-articles/accelerating-generative-llms-interface-with-parallel-draft-model-pard.html"><b>Blog</b></a> |
</p>
## Introduction
PARD is a high-performance speculative decoding method that also enables low-cost adaptation of autoregressive draft models into parallel draft models. It offers the following advantages:
- **Low-Cost Training**: PARD adapts AR (autoregressive) draft models into parallel draft models with minimal overhead. Compared to pure AR draft models, PARD achieves an average inference speedup of 1.78×. By introducing a conditional drop-token strategy, PARD improves training efficiency by up to 3× while maintaining the same level of accuracy.
- **Generalizability**: Thanks to its target-independent design, a single PARD draft model can accelerate an entire family of target models. This contrasts with target-dependent approaches such as Medusa and EAGLE, which require retraining or tuning for each new target. As a result, PARD significantly reduces both deployment complexity and adaptation cost.
- **High Performance**: When integrated into an optimized inference framework called Transformers+, PARD delivers up to a 4.08× speedup, with LLaMA3.1 8B reaching a state-of-the-art 311.5 tokens per second. When integrated into vLLM, PARD delivers up to a 3.06× speedup, outperforming other speculative decoding methods in vLLM by 1.51×.
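
For context, vanilla speculative decoding (the VSD baseline in the figure below) lets a small draft model propose several tokens that the target then verifies in a single forward pass; PARD differs by generating the draft tokens in parallel rather than autoregressively. Below is a minimal greedy-verification sketch of the vanilla scheme (batch size 1, no KV caching; `target` and `draft` are assumed to be Hugging Face causal LMs). It is illustrative only and is not PARD's implementation:

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    """One step of vanilla speculative decoding with greedy verification."""
    n = input_ids.shape[1]
    proposal = input_ids
    # The draft model proposes k tokens one at a time. This sequential loop
    # is the autoregressive cost that PARD's parallel drafting removes.
    for _ in range(k):
        next_tok = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_tok], dim=-1)
    # The target scores the whole proposal in a single forward pass.
    logits = target(proposal).logits
    # Logits at position i predict token i+1, so positions n-1 .. n+k-2
    # give the target's own choice for each drafted token.
    target_choice = logits[:, n - 1 : -1].argmax(-1)
    drafted = proposal[:, n:]
    # Accept the longest prefix where draft and target agree (batch size 1).
    n_accept = int((target_choice == drafted).cumprod(-1).sum())
    # The target contributes one extra token after the accepted prefix,
    # so every step makes progress even when nothing is accepted.
    bonus = logits[:, n - 1 + n_accept].argmax(-1, keepdim=True)
    return torch.cat([proposal[:, : n + n_accept], bonus], dim=-1)
```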
<p align="center">
<figure style="display: inline-block; text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/630cb01cc169245d78fe76b6/Dh-7wE-l0YAfU9lXWssKf.png" width="100%">
<figcaption style="font-style: italic; margin-top: 2px;">
AR and AR+ represent baseline auto-regressive generation using Transformers and Transformers+, respectively. VSD denotes vanilla speculative decoding. PARD refers to the proposed method in this work.
</figcaption>
</figure>
</p>
## Model Weights
| Model Series | Model Name | Download |
|--------------|---------------------------------------|---------------|
| llama3 | PARD-Llama-3.2-1B | [🤗 HuggingFace](https://huggingface.co/amd/PARD-Llama-3.2-1B) |
| DSR Qwen | PARD-DeepSeek-R1-Distill-Qwen-1.5B | [🤗 HuggingFace](https://huggingface.co/amd/PARD-DeepSeek-R1-Distill-Qwen-1.5B) |
| Qwen | PARD-Qwen2.5-0.5B | [🤗 HuggingFace](https://huggingface.co/amd/PARD-Qwen2.5-0.5B) |
## How To Use
Please visit the [PARD](https://github.com/AMD-AIG-AIMA/PARD) repository for more information.
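
The PARD repo provides the parallel-draft integrations for Transformers+ and vLLM. As a rough starting point only, the checkpoint also loads as an ordinary causal LM; the sketch below pairs it with an assumed Llama 3.1 target via transformers' built-in assisted generation (standard speculative decoding, not PARD's parallel drafting, so it will not reproduce the speedups reported above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# meta-llama/Llama-3.1-8B-Instruct is an assumed target from the Llama 3 family.
target_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("amd/PARD-Llama-3.2-1B", device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)
# assistant_model enables transformers' assisted generation: the draft proposes
# tokens and the target verifies them. PARD's parallel drafting requires the
# integration from the PARD repo instead.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```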
## Citation
```
@article{an2025pard,
  title={PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation},
  author={An, Zihao and Bai, Huajun and Liu, Ziqiong and Li, Dong and Barsoum, Emad},
  journal={arXiv preprint arXiv:2504.18583},
  year={2025}
}
```