
May 2, 2023

New algorithms cut size of GPT-scale models

Alistarh group’s novel methods bring large language models to laptops

The enormous sizes of large language models are prohibitive to most users—new algorithms from the Alistarh group and their collaborators may change this. © Shutterstock

ChatGPT has taken the world by storm. However, its enormous size leads to extremely high computational and storage costs, making storing, training, or running an instance impossible for most people. Now, the Alistarh group at the Institute of Science and Technology Austria (ISTA) has developed two algorithms that dramatically reduce these costs without reducing accuracy, allowing individuals and companies to use these models more easily.

Size appears to be key when it comes to large language models—huge computational neural networks trained on large quantities of text. For example, the model behind ChatGPT, the fastest-growing application ever, tracks and computes around 175 billion numbers, known as “weights”, linked together in a complex network to come up with its results. Unlike smaller versions, large models have astonishing language capabilities and, moreover, display surprising “emergent behaviors”—excelling at certain tasks for which they have not been trained, such as addition or unscrambling words.

However, the massive size of these neural networks also means that most users—individuals and independent companies alike—are unable to even store them. Previously, attempts to address this issue focused primarily on pruning and quantization methods that are scalable but not very accurate. Now, PhD student Elias Frantar and Professor Dan Alistarh have developed highly efficient and accurate algorithms in both of these areas, significantly expanding the realm of possibility.

The minds behind the algorithms. PhD student Elias Frantar (center) and Professor Dan Alistarh (right) have developed groundbreaking methods to decrease the size of large language models. Eldar Kurtic (left), the group’s machine-learning research technician, also supported the research. © ISTA

Pruning the network

With pruning, algorithms seek to trim unnecessary weights from the networks that run large language models. Past algorithms in this area have either been too expensive for use on large networks or required extensive retraining, limiting their value in practice. Therefore, when several large language models were made publicly available in the summer of 2022, Frantar and Alistarh were eager to try out their ideas. After just a few months of work, first author Frantar had a flash of inspiration: “The key idea was to freeze some of the weights in specific patterns. That way, we could extensively reuse information that is expensive to compute,” he explains. Using this, the team created SparseGPT, the first accurate one-shot pruning method that works efficiently at the scale of models with 10-100+ billion weights. This work will be presented in July at the International Conference on Machine Learning (ICML 2023).

In under four and a half hours, their algorithm can cut up to 60 percent of the weights with minimal loss of accuracy—and without needing to retrain the network. “Surprisingly, our algorithm reveals that significant portions of these models are redundant,” adds Alistarh. Key to the team’s success was the use of ISTA’s Scientific Computing Facility, which provided a computing cluster for storing the models and testing algorithms as well as regular technical support.
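For readers curious what pruning looks like in code, the minimal sketch below zeroes out the smallest-magnitude entries of a weight matrix using NumPy. This is plain magnitude pruning, shown only to illustrate the general idea of removing weights; it is not SparseGPT, whose one-shot procedure additionally updates the remaining weights to compensate for those removed. The function name and the 60-percent sparsity target are assumptions made for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.6) -> np.ndarray:
    """Zero out the smallest-magnitude weights until roughly `sparsity` of them are gone.

    Illustrative magnitude pruning only; SparseGPT's actual algorithm is far
    more sophisticated and adjusts the surviving weights to preserve accuracy.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                   # number of weights to drop
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]    # k-th smallest magnitude
    return np.where(np.abs(weights) > threshold, weights, 0.0)

# Example: prune roughly 60 percent of a small random weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
W_sparse = magnitude_prune(W, sparsity=0.6)
print(f"fraction of zeroed weights: {(W_sparse == 0).mean():.2f}")
```

Because the zeroed entries no longer need to be stored or multiplied, a sufficiently sparse network takes less memory and less compute to run.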

A high-performance team. Stefano Elefante (second from the right) and Alois Schlögl (second from the left) provided expert support as part of ISTA’s Scientific Computing Facility. © ISTA
The High-Performance Computing (HPC) cluster room. The cluster provided by the Scientific Computing Facility allows scientists to store and experiment with enormous models. © ISTA/Stephan Stadlbauer

Compressing the model

The second main method, quantization, works by shrinking the set of numerical values each weight is allowed to take, which in turn allows the model to be stored far more compactly. Earlier algorithms of this type, however, were limited by the size and complexity of GPT models. Together with their collaborators at ETH Zurich, Alistarh and Frantar created a highly accurate and efficient one-shot algorithm, OPTQ.

This new method can quantize GPT models with 175 billion weights in just four hours. “Previous methods,” explains Frantar, “were designed for models a thousand times smaller and would be challenging to scale up.” Moreover, their method more than doubles the compression gains relative to the one-shot methods proposed previously. “In practice, this means it is now possible to use a 175 billion-weight model with a single graphics card,” Alistarh adds. Their paper will be presented this week at the International Conference on Learning Representations (ICLR 2023).
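As a rough illustration of the underlying idea, the sketch below applies basic round-to-nearest uniform quantization in NumPy, mapping each weight to one of 2^bits evenly spaced levels so it can be stored as a small integer code plus a shared scale and offset. This is not OPTQ itself, which quantizes far more accurately by compensating for rounding error as it goes; the function name and the 4-bit setting are assumptions made for the example.

```python
import numpy as np

def quantize_uniform(weights: np.ndarray, bits: int = 4):
    """Round each weight to the nearest of 2**bits evenly spaced levels.

    Illustrative round-to-nearest quantization only; OPTQ's one-shot method
    is considerably more accurate at very low bit-widths.
    """
    levels = 2 ** bits
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (levels - 1)          # step between adjacent levels
    codes = np.round((weights - w_min) / scale)     # integer codes in [0, levels - 1]
    dequantized = (codes * scale + w_min).astype(weights.dtype)
    return dequantized, codes.astype(np.uint8)

# Example: 4-bit quantization of a small random weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
W_q, codes = quantize_uniform(W, bits=4)
print(f"max rounding error: {np.abs(W - W_q).max():.3f}")
```

Storing a few-bit integer code per weight instead of a 16- or 32-bit floating-point number is what shrinks the memory footprint enough to bring such models within reach of ordinary hardware.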

Democratizing large language models

While the model behind ChatGPT remains corporate property, other large language models like it have been made available—and the public is eager to experiment. “Since its release, we’ve been contacted about our work at least ten times a week, by individuals as well as companies,” says Frantar. “We hope that our work will stimulate further research in this area and serve as another step towards making these models available to a wider audience.”


