Mistral-NeMo-Minitron 8B Foundation Model Delivers Unparalleled Accuracy | NVIDIA Technical Blog (2024)

Last month, NVIDIA and Mistral AI unveiled Mistral NeMo 12B, a leading state-of-the-art large language model (LLM). Mistral NeMo 12B consistently outperforms similarly sized models on a wide range of benchmarks.

Today, we announce Mistral-NeMo-Minitron 8B, one of the most advanced open-access models in its size class. This model consistently delivers leading accuracy on nine popular benchmarks. The Mistral-NeMo-Minitron 8B base model was obtained by width-pruning the Mistral NeMo 12B base model, followed by a light retraining process using knowledge distillation. This is a successful recipe that NVIDIA originally proposed in the paper, Compact Language Models via Pruning and Knowledge Distillation, and that has since proven itself with the NVIDIA Minitron 8B and 4B models and the Llama-3.1-Minitron 4B model.

| Model | Training tokens | WinoGrande 5-shot | ARC Challenge 25-shot | MMLU 5-shot | HellaSwag 10-shot | GSM8K 5-shot | TruthfulQA 0-shot | XLSum en (20%) 3-shot | MBPP 0-shot | HumanEval 0-shot |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 15T | 77.27 | 57.94 | 65.28 | 81.80 | 48.60 | 45.06 | 30.05 | 42.27 | 24.76 |
| Gemma 7B | 6T | 78 | 61 | 64 | 82 | 50 | 45 | 17 | 39 | 32 |
| Mistral-NeMo-Minitron 8B | 380B | 80.35 | 64.42 | 69.51 | 83.03 | 58.45 | 47.56 | 31.94 | 43.77 | 36.22 |
| Mistral NeMo 12B | N/A | 82.24 | 65.10 | 68.99 | 85.16 | 56.41 | 49.79 | 33.43 | 42.63 | 23.78 |

Overview of model pruning and distillation

Model pruning is the process of making a model smaller and leaner, either by dropping layers (depth pruning) or by dropping neurons, attention heads, and embedding channels (width pruning). Pruning is typically accompanied by some amount of retraining to recover accuracy.
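To make the distinction concrete, the sketch below (plain PyTorch, with a hypothetical module layout) contrasts the two axes: depth pruning removes whole transformer blocks, while width pruning shrinks dimensions inside each block.

```python
import torch.nn as nn

def depth_prune(blocks: nn.ModuleList, keep_indices: list) -> nn.ModuleList:
    # Depth pruning: drop entire transformer blocks, keeping only the selected ones.
    return nn.ModuleList(blocks[i] for i in keep_indices)

def width_prune_linear(layer: nn.Linear, keep_outputs: list) -> nn.Linear:
    # Width pruning: drop output neurons of a linear layer (for example, MLP hidden
    # units), copying over only the rows of the weight matrix that survive.
    pruned = nn.Linear(layer.in_features, len(keep_outputs), bias=layer.bias is not None)
    pruned.weight.data = layer.weight.data[keep_outputs].clone()
    if layer.bias is not None:
        pruned.bias.data = layer.bias.data[keep_outputs].clone()
    return pruned
```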

Model distillation is a technique used to transfer knowledge from a large, complex model, often called the teacher model, to a smaller, simpler student model. The goal is to create a more efficient model that retains much of the predictive power of the original, larger model while being faster and less resource-intensive to run. Here, we employ distillation as a light retraining procedure after pruning, using a dataset much smaller than the one used to train the model from scratch.
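For intuition, a minimal logit-distillation loss could look like the sketch below (plain PyTorch); the exact loss composition and temperature used in the Minitron recipe are assumptions here, not a description of the actual implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # KL divergence between the teacher's and the student's next-token
    # distributions, averaged over the batch; the temperature softens both.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
```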

Iterative pruning and distillation is an approach where, starting from a single pretrained model, multiple progressively smaller models can be obtained. For example, a 15B model can be pruned and distilled to obtain an 8B model, which in turn serves as a starting point for pruning and distilling a 4B model, and so on.

The combination of model pruning followed by light retraining through distillation has proven to be an effective and cost-efficient way to train a family of models. For each additional model, just 100-400 billion tokens are used for retraining, a greater than 40x reduction compared to training from scratch. As a result, training a family of models (12B, 8B, and 4B) requires up to 1.95x less compute than training all of them from scratch.
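As a rough sanity check on that figure, assume training compute scales with parameters × tokens and use illustrative token counts (roughly 15T for from-scratch pretraining, 380B per distilled model); these are assumptions for the back-of-envelope math, not published cost numbers.

```python
# Back-of-envelope check of the ~1.95x figure; compute is assumed to scale
# with (parameters x tokens), and the token counts are illustrative.
pretrain_tokens = 15e12   # from-scratch pretraining budget per model (assumption)
distill_tokens = 380e9    # light retraining budget per pruned model

from_scratch = (12e9 + 8e9 + 4e9) * pretrain_tokens
prune_and_distill = 12e9 * pretrain_tokens + (8e9 + 4e9) * distill_tokens

print(from_scratch / prune_and_distill)  # ~1.95
```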

The learnings from extensive ablation studies have been summarized into 10 best practices for structured weight pruning combined with knowledge distillation. We found that width pruning consistently outperforms depth pruning and, most importantly, that pruned and distilled models outperform models trained from scratch in quality.

Mistral-NeMo-Minitron 8B

Following our best practices, we width-pruned the Mistral NeMo 12B model to obtain an 8B target model. This section details the steps and parameters used to obtain the Mistral-NeMo-Minitron 8B base model, as well as its performance.

Teacher fine-tuning

To correct for the distribution shift between the dataset the model was originally trained on and our distillation dataset, we first fine-tuned the unpruned Mistral NeMo 12B model on our dataset using 127B tokens. Experiments showed that, without this correction, the teacher provides suboptimal guidance on the dataset during distillation.

Width-only pruning

Given our goal of obtaining the strongest possible 8B model, we proceeded with width-only pruning. We pruned both the embedding (hidden) and MLP intermediate dimensions along the width axis to compress Mistral NeMo 12B. Specifically, we computed importance scores for each attention head, embedding channel, and MLP hidden dimension using the activation-based strategy (a simplified sketch follows the list below). Following importance estimation, we:

  • Pruned the MLP intermediate dimension from 14336 to 11520
  • Pruned the hidden size from 5120 to 4096
  • Retained the attention head count and number of layers
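Below is a simplified sketch of what activation-based importance scoring for the MLP intermediate dimension can look like. The module path `blocks[i].mlp.up_proj` and the exact aggregation are assumptions for illustration; the paper describes the full criterion.

```python
import torch

@torch.no_grad()
def mlp_neuron_importance(model, calib_batches, layer_idx: int) -> torch.Tensor:
    # Score each MLP intermediate neuron by its mean absolute activation over a
    # small calibration set; the lowest-scoring neurons are the ones to prune away.
    captured = {}
    hook = model.blocks[layer_idx].mlp.up_proj.register_forward_hook(
        lambda module, inputs, output: captured.update(act=output)
    )
    scores = None
    for batch in calib_batches:
        model(batch)
        per_neuron = captured["act"].abs().mean(dim=(0, 1))  # average over batch, sequence
        scores = per_neuron if scores is None else scores + per_neuron
    hook.remove()
    return scores / len(calib_batches)  # keep the top 11520 of 14336 neurons
```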

Distillation parameters

We distilled the model using 380 billion tokens (the same dataset used for teacher fine-tuning), with a peak learning rate of 1e-4, a minimum learning rate of 4.5e-7, 60 steps of linear warmup, a cosine decay schedule, and a global batch size of 768.
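Those hyperparameters correspond to a standard linear-warmup-plus-cosine-decay schedule; a minimal sketch is shown below, with the total step count left as a parameter (it follows from the token budget, global batch size, and sequence length).

```python
import math

def distillation_lr(step: int, total_steps: int,
                    warmup_steps: int = 60,
                    peak_lr: float = 1e-4,
                    min_lr: float = 4.5e-7) -> float:
    # Linear warmup to the peak learning rate, then cosine decay to the minimum.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```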

Conclusion

Mistral-NeMo-Minitron 8B provides class-leading accuracy and consistently outperforms recently introduced state-of-the-art models of similar size. It is our first work on distilling the Mistral NeMo 12B model and provides strong support for our best practices for structured weight pruning combined with knowledge distillation. Further work to distill even smaller and more accurate models is planned. The implementation of these techniques will be gradually rolled out in the NVIDIA NeMo framework for generative AI.

To learn more, check out these resources:

Acknowledgments

This work would not have been possible without contributions from many people at NVIDIA. To mention a few of them:

Foundation model: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Pavlo Molchanov, Mostofa Patwary, Daniel Korzekwa, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro, and Jan Kautz
Alignment: Ameya Sunil Mahabaleshwarkar, Hayley Ross, Brandon Rowlett, Oluwatobi Olabiyi, Shizhe Diao, and Yoshi Suhara
Datasets: Sanjeev Satheesh, Jupinder Parmar, Shengyang Sun, Jiaqi Zeng, Zhilin Wang, Yi Dong, Zihan Liu, Rajarshi Roy, Wei Ping, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev
TensorRT-LLM: Bobby Chen, James Shen, and Chenhan Yu
Hugging Face support: Ao Tang, Yoshi Suhara, and Greg Heinrich
