Compressing Llama 3.1 8B and Mistral NeMo 12B into more efficient 4B and 8B models using pruning and distillation.
We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
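To make the two pruning strategies concrete, the following is a minimal sketch on toy data structures, not the authors' implementation. Depth pruning drops whole layers, while width pruning keeps every layer but removes individual neurons. The function names (`depth_prune`, `width_prune_mlp`) are hypothetical, and the L2 weight norm used to rank neurons is an assumed stand-in for whatever importance score is actually used.

```python
import math

def depth_prune(layers, keep):
    """Depth pruning: keep only the layers whose indices appear in `keep`."""
    keep_set = set(keep)
    return [layer for i, layer in enumerate(layers) if i in keep_set]

def width_prune_mlp(w_in, w_out, num_keep):
    """Width pruning of a single MLP block (toy version).

    w_in:  one row per intermediate neuron (shape: neurons x hidden)
    w_out: one row per hidden output       (shape: hidden x neurons)

    Keeps the `num_keep` neurons with the largest L2 weight norm.
    ASSUMPTION: the norm is only an illustrative importance score.
    """
    scores = [math.sqrt(sum(x * x for x in row)) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # preserve the original neuron order
    new_w_in = [w_in[i] for i in kept]
    # Drop the matching columns so the two matrices stay compatible.
    new_w_out = [[row[i] for i in kept] for row in w_out]
    return new_w_in, new_w_out, kept

# Usage: a 4-layer stack reduced to 2 layers, and a 3-neuron MLP reduced to 2.
shallow = depth_prune(["layer0", "layer1", "layer2", "layer3"], keep=[0, 3])
w_in = [[1.0, 0.0], [0.0, 0.1], [3.0, 4.0]]
w_out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
narrow_in, narrow_out, kept = width_prune_mlp(w_in, w_out, num_keep=2)
```

In this sketch the depth-pruned model has fewer layers of the original size, while the width-pruned block keeps its position in the stack but has smaller weight matrices.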
In "LLM Pruning and Distillation in Practice: The Minitron Approach," the researchers compress Llama 3.1 8B and Mistral NeMo 12B into 4B and 8B models, respectively, using pruning and distillation. They compare two ways of shrinking a model: depth pruning, which removes entire layers, and width pruning, which jointly trims the hidden, attention, and MLP dimensions. The results are evaluated on common benchmarks from the LM Evaluation Harness, and the models are also aligned with NeMo Aligner and tested in instruct-tuned versions. Because the original training data was not available, the authors found it beneficial to slightly fine-tune the teacher models on the distillation dataset first. The outcome is a compelling 4B model from Llama 3.1 8B and a state-of-the-art 8B model, MN-Minitron-8B, from Mistral NeMo 12B. The base model weights are released on Hugging Face under a permissive license.
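As a rough illustration of what distillation optimizes, the sketch below computes a KL-divergence loss between a teacher's and a student's next-token distributions. This is a common formulation of logit distillation and is assumed here for illustration; the text above does not specify the exact loss. The names `softmax` and `kd_loss` and the `temperature` parameter are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    ASSUMPTION: forward KL on softmax outputs, a common distillation
    loss; not necessarily the exact objective used for these models.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Usage: the loss is zero when the student matches the teacher exactly
# and positive when the two distributions differ.
same = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = kd_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

Training the pruned student to minimize a loss of this kind pushes its output distribution toward the teacher's, which is why the quality of the teacher on the distillation dataset matters.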