Model compression as constrained optimization, with application to neural nets

Speaker
Miguel Á. Carreira-Perpiñán

In recent years, deep neural nets have become a widespread practical technology, with impressive performance in computer vision, speech recognition, natural language processing and many other applications. Deploying deep nets on mobile phones, robots, sensors and IoT devices is of great interest. However, state-of-the-art deep nets for tasks such as object recognition are too large to be deployed on these devices because of the limits the devices impose on CPU speed, memory, bandwidth, battery life and energy consumption. This has made compressing neural nets an active research problem. More generally, compression can be seen as a sophisticated form of regularization that allows the model designer to learn the structure of a model.


We give a general formulation of model compression as constrained optimization. This includes many types of compression: quantization, low-rank decomposition, pruning, lossless compression and others. We then give a general algorithm, based on the augmented Lagrangian and alternating optimization, to solve this nonconvex problem. This results in a "learning-compression" (LC) algorithm, which alternates a learning step of the uncompressed model, independent of the compression type, with a compression step of the model parameters, independent of the learning task. Under some assumptions, this simple, efficient algorithm is guaranteed to find a compressed model that is locally optimal for the task.
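As a rough illustration of the alternation described above, here is a minimal sketch on a toy problem. It is not the speaker's implementation. It assumes a least-squares loss, so the learning (L) step has a closed form, and a 1-D k-means quantizer as the compression (C) step; the names kmeans_quantize and lc_least_squares, the multiplier lam, the penalty schedule (mu0, growth) and the step count are all hypothetical choices made for the example.

```python
import numpy as np

def kmeans_quantize(v, k=2, iters=20):
    """C step (assumed form): closest vector to v with at most k distinct values."""
    codebook = np.quantile(v, np.linspace(0, 1, k))  # deterministic initialization
    for _ in range(iters):
        assign = np.argmin(np.abs(v[:, None] - codebook[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = v[assign == j].mean()
    assign = np.argmin(np.abs(v[:, None] - codebook[None, :]), axis=1)
    return codebook[assign]

def lc_least_squares(X, y, compress, mu0=1e-3, growth=1.5, steps=40):
    """Alternate L and C steps on  min_w ||Xw - y||^2 / (2n)  s.t.  w = compress(.)."""
    n, d = X.shape
    A, b = X.T @ X / n, X.T @ y / n
    w = np.linalg.solve(A + 1e-8 * np.eye(d), b)  # uncompressed reference model
    delta = compress(w)                           # start from direct compression
    lam = np.zeros(d)                             # Lagrange multiplier estimates
    mu = mu0
    for _ in range(steps):
        # L step: fit the task loss plus a quadratic pull toward delta.
        # It never looks at how delta was produced.
        w = np.linalg.solve(A + mu * np.eye(d), b + mu * delta + lam)
        # C step: compress the shifted weights. It never looks at X or y.
        delta = compress(w - lam / mu)
        # Multiplier update, then tighten the penalty.
        lam = lam - mu * (w - delta)
        mu *= growth
    return w, delta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
w, delta = lc_least_squares(X, y, lambda v: kmeans_quantize(v, k=2))
print("distinct values in compressed weights:", np.unique(delta).size)
print("gap between w and its compression:", np.linalg.norm(w - delta))
```

Swapping the compress argument for another projection changes the compression type without touching the L step, which is the separation the abstract describes.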

We then describe specializations of the LC algorithm for various types of compression, such as quantization (including binarization and ternarization), pruning and low-rank decomposition, among other variations. We show experimentally with large deep neural nets such as ResNets that the LC algorithm can achieve much higher compression rates than previous work on deep net compression for a given target classification accuracy. For example, we can often quantize down to just 1 bit per weight with negligible accuracy degradation.
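To make the compression step concrete for a few of the types listed above, the following sketch gives well-known closed-form least-squares projections. These are assumed forms chosen for illustration (binarization with a single learned scale, pruning under a fixed budget of k nonzeros, rank-r approximation via truncated SVD); the talk may use different variants, and the function names are hypothetical.

```python
import numpy as np

def binarize(w):
    """Closest vector of the form a * s with s in {-1, +1}^d: a = mean(|w|), s = sign(w)."""
    a = np.mean(np.abs(w))
    return a * np.where(w >= 0, 1.0, -1.0)

def prune_top_k(w, k):
    """Closest vector with at most k nonzeros: keep the k largest-magnitude entries."""
    out = np.zeros_like(w)
    if k > 0:
        keep = np.argsort(np.abs(w))[-k:]
        out[keep] = w[keep]
    return out

def low_rank(W, r):
    """Closest matrix of rank at most r in Frobenius norm: truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

w = np.array([0.5, -1.5, 2.0, -0.1])
print(binarize(w))        # two values, +a and -a, i.e. 1 bit per weight plus one scale
print(prune_top_k(w, 2))  # only the two largest-magnitude weights survive
```

Each function depends only on the weights it is given, not on the data or the loss, which is why such steps can be reused across learning tasks.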

This is joint work with my PhD students Yerlan Idelbayev, Pooya Tavallali and Arman Zharmagambetov.
