This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the ultrachat_200k dataset.
On the AlpacaEval benchmark (version 1), it achieves a score of 61.1, compared to 62.0 for the fine-tuned dense model Llama-3.1-8B-ultrachat_200k — demonstrating a 98.5% accuracy recovery.
Model Optimizations
This model inherits the optimizations from its parent model, Sparse-Llama-3.1-8B-2of4.
Namely, all linear operators within transformer blocks were pruned to the 2:4 sparsity pattern: in each group of four weights, two are retained while two are pruned.
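As an illustration only (not part of the pruning recipe itself), a minimal sketch of how one might verify the 2:4 pattern on a weight tensor is shown below; the function name and example tensor are hypothetical.

```python
import torch

def check_2of4_sparsity(weight: torch.Tensor) -> bool:
    """Return True if every contiguous group of 4 weights along the last
    dimension contains at most 2 non-zero values (the 2:4 pattern)."""
    # Reshape the last dimension into groups of 4 weights each.
    groups = weight.reshape(-1, 4)
    # Count non-zero entries per group; 2:4 sparsity keeps at most 2 of them.
    nonzero_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzero_per_group <= 2).all())

# Example: each group of 4 weights keeps exactly 2 non-zero values.
w = torch.tensor([[0.5, 0.0, -0.3, 0.0],
                  [0.0, 1.2, 0.0, 0.7]])
print(check_2of4_sparsity(w))  # True
```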
Deployment with vLLM
This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.
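A minimal offline-inference sketch with the vLLM Python API is shown below; the Hugging Face repository ID is an assumption and should be replaced with the actual model identifier, and the `LLM.chat` helper requires a reasonably recent vLLM release.

```python
from vllm import LLM, SamplingParams

# Assumed model identifier; substitute the actual Hugging Face repo ID.
model_id = "neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4"

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
llm = LLM(model=model_id)

# Multi-turn chat input; the model's chat template is applied automatically.
messages = [
    {"role": "user", "content": "Give me a short introduction to sparse LLMs."}
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, the same model can be exposed over HTTP with `vllm serve <model_id>`, after which standard OpenAI client libraries can send chat-completion requests to it.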