Post-training Compression of Neural Networks via SVD

Robert Dyro
Introduction

A well-known demonstration of how singular value decomposition (SVD) can extract the most important features of a matrix is image compression, where only a few components of the full SVD of the original image are retained. When the image is reconstructed, it tends to resemble the original, but the representation occupies much less space. The compression can often result in a 100x reduction in size (compared to naive storage).1 Here's an algorithm for compressing an image using SVD:
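The following is a minimal sketch of that idea (assuming a grayscale image stored as a 2D NumPy array; the function names are illustrative):

    import numpy as np

    def compress_image_svd(image: np.ndarray, k: int):
        # Full SVD of the image matrix: image = U @ diag(S) @ Vt.
        U, S, Vt = np.linalg.svd(image, full_matrices=False)
        # Keep only the first k singular values and vectors.
        return U[:, :k], S[:k], Vt[:k, :]

    def reconstruct_image(U_k, S_k, Vt_k):
        # Approximate the original image from the truncated factors.
        return (U_k * S_k) @ Vt_k

For an m-by-n image, the truncated factors take (m + n + 1) * k numbers instead of m * n, which is where the large compression ratio comes from.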
Question: Can we use SVD to compress a neural network?

The idea behind this project is to investigate whether SVD compression can be used to reduce the memory (storage) footprint of a neural network (NN) after it has been trained.
Post-training Compression

This problem is addressed in the field of post-training compression of NNs. Most approaches to reducing the model footprint focus exclusively on a single compression target: the matrix weights. Biases, activations, buffers, and other learned parameters are typically much smaller than the weights and so are left uncompressed. Weights are typically quantized, with each value mapped to one of $2^b$ discrete levels,
where $b$ is the number of bits available. The computation with quantized weights is then typically done by dequantizing the matrix into the smallest floating point representation that does not introduce numerical problems, usually half-precision (float16), and performing a regular matrix multiplication with the right-hand side. For faster matrix multiplication, the matrix is usually dequantized in small optimized blocks to make the best use of modern GPU architectures' local memory caches and fastest memory access patterns.

SVD Compression

The SVD compression of a matrix is an extremely simple technique. For every weight matrix $W$ to compress, we compute its SVD, $W = U \Sigma V^T$, and retain only the first $k$ components, so that $W \approx U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ contain the first $k$ columns of $U$ and $V$, and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values.

Combining SVD and Quantization

Because quantization both results in enormous memory savings and can be applied to any matrix, we use quantized SVD compression as the technique of choice in this project: the factors $U_k$ and $V_k^T$ are quantized, while $\Sigma_k$ is not, because its size is usually negligible. A short code sketch of this combined scheme is given below.

Experimental Approach

We investigate the combined SVD + quantization compression technique on 4 different, representative, and popular neural networks: VGG19, ResNet-101, BERT, and Phi-2.
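Here is the promised sketch of the combined SVD + quantization compression of a single weight matrix. It is only an illustration: the symmetric per-tensor int8 quantizer and the function names are my own choices and not necessarily what the project's code uses.

    import torch

    def quantize_int8(x: torch.Tensor):
        # Symmetric per-tensor int8 quantization (one possible scheme).
        scale = x.abs().max() / 127.0
        q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
        return q, scale

    def compress_weight(W: torch.Tensor, k: int):
        # Truncated SVD: keep the k largest singular values and vectors.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        Uq, u_scale = quantize_int8(U[:, :k])
        Vq, v_scale = quantize_int8(Vh[:k, :])
        # The k singular values stay in full precision; their size is negligible.
        return Uq, u_scale, S[:k], Vq, v_scale

    def decompress_weight(Uq, u_scale, S_k, Vq, v_scale, dtype=torch.float16):
        # Dequantize the factors and rebuild the approximate weight matrix.
        U_k = Uq.to(dtype) * u_scale.to(dtype)
        V_k = Vq.to(dtype) * v_scale.to(dtype)
        return (U_k * S_k.to(dtype)) @ V_k

Stored this way, an m-by-n float32 weight becomes two int8 factors of sizes m-by-k and k-by-n plus k full-precision singular values, combining the low-rank and the quantization savings.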
We use the excellent

Deciding on the compression fraction is a bit of an art. To make the experiments tractable, we adopt the following rank selection rule: for every layer, pick the smallest rank $k$ for which the relative error between the compressed and the uncompressed layer output (on a reference input) stays below a threshold $\epsilon$.
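A minimal sketch of this selection rule, assuming a per-layer evaluate_error(k) routine like the one shown in the next paragraph (max_k and eps are illustrative names):

    def select_rank(max_k: int, eps: float) -> int:
        # Binary search for the smallest k whose relative error is <= eps.
        lo, hi = 1, max_k
        while lo < hi:
            mid = (lo + hi) // 2
            if evaluate_error(mid) <= eps:
                hi = mid      # error small enough, try a smaller rank
            else:
                lo = mid + 1  # error too large, a larger rank is needed
        return lo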
Since the error is a monotonic function of $k$, we can use binary search to find the smallest $k$ that satisfies the error condition.

    def evaluate_error(k: int) -> float:
        # Relative output error of this layer once compressed to rank k.
        # `input` and `output` are the layer's reference input and its
        # uncompressed output, captured from the surrounding scope.
        self.rescale(CompressionConfig(Compression.SIZE_K, k))
        with torch.no_grad():
            new_output = self.forward(input)
        err = torch.norm(new_output.to(output) - output) / torch.norm(output)
        return err.item()

Finally, if the required error cannot be satisfied without the rank $k$ growing so large that the compression ratio exceeds 1.0, we simply change the layer back to its original linear form. There is no need to use SVD if the factored form would exceed the number of parameters of the original linear layer.

Results

VGG19 ImageNet Classification Model

We start with the VGG19 network trained on ImageNet. The inspiration for this starting point comes from the Sparse low rank factorization for deep neural network compression paper.2 The VGG model family is an attractive target for SVD weight compression because over 90% of the model parameters are contained in the extremely large linear classifier layers at the end of the network. By scanning the layer error threshold $\epsilon$, we obtain two dependent quantities, which we plot against each other: the model size (in bytes) and a performance measure. For VGG19, the performance measure is top-1 accuracy on the ImageNet validation set, and we only compress the linear layers (in the classifier).
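As a point of reference, the classifier weights that dominate VGG19's parameter count can be inspected directly with torchvision's pretrained model (a small sketch, not part of the project's code):

    import torch
    from torchvision.models import vgg19, VGG19_Weights

    model = vgg19(weights=VGG19_Weights.IMAGENET1K_V1)
    # The classifier holds the large Linear layers targeted for SVD compression.
    for name, module in model.classifier.named_children():
        if isinstance(module, torch.nn.Linear):
            print(name, tuple(module.weight.shape), module.weight.numel())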
ResNet-101 ImageNet Classification Model

Following VGG19, we turn to the ResNet-101 model. The ResNet model family also contains linear layers, but they are much smaller; the vast majority of parameters are contained in the convolutional layers. We compress both the linear layers and the convolutional layers, the latter by first flattening all but the first (output channels) dimension, thus converting the 4D convolutional layer weight into a 2D matrix.
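A minimal sketch of this flattening step (the function name and the use of torch.linalg.svd are my own illustration, not necessarily the project's implementation):

    import torch

    def truncated_svd_conv_weight(weight: torch.Tensor, k: int):
        # weight has shape (out_channels, in_channels, kh, kw);
        # flatten everything except the output-channel dimension.
        out_channels = weight.shape[0]
        w2d = weight.reshape(out_channels, -1)
        U, S, Vh = torch.linalg.svd(w2d, full_matrices=False)
        # Keep the first k components of the factorization.
        return U[:, :k], S[:k], Vh[:k, :]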
BERT

After ResNet, we attempt to compress the BERT model. BERT is a transformer model in which the vast majority of parameters are contained in the linear weight matrices used for the attention projections: the query, key, value, and output projections.
All of these are linear layers. The BERT model is not a classifier, at least in its base form, so we compare the cosine similarity of the final pooler embeddings between an uncompressed (float32) model and a compressed (SVD + quantization) model.
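A minimal sketch of this comparison (reference_model and compressed_model are placeholders for the uncompressed and compressed BERT models; the rest uses the standard Hugging Face transformers API):

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer("An example sentence.", return_tensors="pt")
    with torch.no_grad():
        emb_ref = reference_model(**inputs).pooler_output   # uncompressed float32 BERT
        emb_cmp = compressed_model(**inputs).pooler_output  # SVD + quantized BERT
    similarity = torch.nn.functional.cosine_similarity(emb_ref, emb_cmp, dim=-1).mean()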
A cosine similarity of 1.0 means the two vectors are aligned, while 0.0 means they are uncorrelated. The result of the compressed model achieving a cosine similarity of -0.5 is odd, as random vectors should have a cosine similarity of around 0.0.

Phi-2

Finally, we take a look at a small language model (SLM). Phi-2 is a 2.7B-parameter model for which the vast majority of parameters are contained in the transformer layers, that is, in linear layers (the attention projections). Here, the performance metric is the perplexity of the model on a validation set. Low perplexity means the model predicts the ground-truth next word in a text well; it is defined as the exponentiation of the cross-entropy loss.
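In code, this definition amounts to the following (a minimal sketch over a batch of next-token predictions; the shapes are illustrative):

    import torch

    def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (num_tokens, vocab_size), targets: (num_tokens,)
        nll = torch.nn.functional.cross_entropy(logits, targets)  # mean cross-entropy
        return torch.exp(nll)  # perplexity = exp(mean cross-entropy)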
Discussion and Conclusions

The truncated SVD does not appear to be a particularly useful compression method. It is also noteworthy that quantization by itself is a very effective way of scaling the model down. Existing academic literature includes several competing compression ideas.
Recovering Performance Through Brief Retraining
In the face of terrible results, we need to find another approach. What is particularly surprising about quantization is that, despite being performed per-layer, with error introduced to each layer independently, it does not degrade global performance very much. This is in stark contrast to the truncated SVD approach investigated here. Truncated SVD is the optimal low-rank approximation under the operator norm, but it does not appear to be a particularly useful compression method for neural networks, perhaps with the exception of VGG19, where the majority of parameters are contained in a single linear layer. We should abandon our initial assumption of not using data to fine-tune the model. The obvious next step involves looking at the network error globally: instead of compressing every layer independently, we should compress the network as a whole. This leads to two possible ideas, investigated to some degree in the literature:
Work in progress...

References

Work in progress...