Part III: What does Low Rank Factorization of a Convolutional Layer really do?
Decomposition of a Convolutional Layer
In a previous post I described (in some detail) what it means to decompose a matrix multiplication into a sequence of low rank matrix multiplications. We can do something similar for a tensor as well, though it is somewhat harder to see, since tensors (particularly in higher dimensions) are difficult to visualize.
Recall the matrix formulation: a fully connected layer computes y = Wx, and we approximate the weight matrix W (of size m × n) by a product of two low rank factors,

W ≈ UV,

where U is m × r, V is r × n, and the rank r is much smaller than m and n, so that y ≈ U(Vx) can be computed with two small multiplications instead of one large one.

Now instead of a weight matrix multiplication, a convolutional layer applies a 4-dimensional kernel tensor (output channels × input channels × height × width) to the input map stack. Interestingly, you can also think about this as a matrix multiplication, by creating a Toeplitz matrix version of the kernel and multiplying it with the flattened input, so the same low rank reasoning carries over to convolutions.
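As a concrete illustration of the matrix case, the snippet below (with made-up sizes and a made-up rank) builds a truncated SVD of a random weight matrix and compares y = Wx with the low rank version U(Vx):

```python
import numpy as np

# Hypothetical sizes: a 256 x 512 weight matrix approximated with rank 32.
m, n, r = 256, 512, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Truncated SVD: W ~= U @ V with U of shape (m, r) and V of shape (r, n).
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]     # fold the singular values into U
V = Vt[:r, :]

y_full = W @ x                # one big multiply:     m * n multiplications
y_low = U @ (V @ x)           # two small multiplies: r * (m + n) multiplications

print("relative error:", np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))
```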
Convolution Operation
At the heart of it, a convolution operation takes a smaller cube subset of a “cube” of numbers (also known as the map stack), multiplies each of those numbers by a fixed set of numbers (also known as the kernel), and sums them into a single scalar output. Let us start with what each “slice” of the cube really represents.
Now that we have a working example of the representation, let us try to visualize what a convolution is.
A convolution operation takes a subset of the RGB image across all channels and maps it to one number (a scalar), by multiplying the cube of numbers with a fixed set of numbers (a.k.a. the kernel, not pictured here) and adding them together.
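In code, that single cube-to-scalar step is just an element-wise multiply followed by a sum; a tiny sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
patch = rng.standard_normal((3, 3, 3))    # 3x3 spatial window across 3 channels (the small "cube")
kernel = rng.standard_normal((3, 3, 3))   # the fixed set of numbers, same shape as the patch

# Element-wise multiply and add everything up: the whole cube collapses to one scalar.
out = float((patch * kernel).sum())
print(out)
```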
Low Rank Approximation of Convolution
Now that we have a good idea of what a convolution looks like, we can try to visualize what a low rank approximation to a convolution might look like. The particular kind of approximation we have chosen here replaces the single convolution with four simpler operations, which are described in detail later in this post.
Painful Example of Convolution by hand
Consider a small input matrix and a kernel. The convolution at a given position is the element-wise multiplication of the kernel with the corresponding patch of the input, followed by a sum of all the products.
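Since the mechanics matter more than the particular numbers, here is a small self-contained version of the same computation with made-up values (not the exact matrices used above):

```python
import numpy as np

# Made-up 4x4 input and 3x3 kernel (not the exact numbers from the text).
X = np.array([[1, 2, 0, 1],
              [0, 1, 3, 2],
              [2, 1, 0, 1],
              [1, 0, 2, 3]], dtype=float)
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]], dtype=float)

# "Valid" convolution (really cross-correlation, as in deep learning frameworks):
# slide the kernel over the input, multiply element-wise, and sum at every position.
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        out[i, j] = (X[i:i + 3, j:j + 3] * K).sum()
print(out)
```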
Low Rank Approximation of convolution
Now we will painfully do a low rank decomposition of the convolution kernel above. There is a theorem (the singular value decomposition) which says that any matrix can be written as a sum of rank-one matrices, each of which is the outer product of a column vector and a row vector; keeping just the first term gives a rank-one approximation of the kernel. We can easily guess a column vector and a row vector whose outer product approximates the kernel above.
This is easy because I chose values for the kernel that were easy to break down. How to perform this breakdown is the subject of the later sections.
Consider the original kernel matrix and the input matrix from the example above.
Convolution with Original Kernel
Perform the convolution at the top-left corner of the input matrix: multiply each kernel entry with the corresponding entry of the top-left patch and sum the products.
Convolution with Low-Rank Vectors
Using the low-rank vectors:
Step 1: Apply the column (vertical) vector to each column of the patch, collapsing the patch to a single row of values.
Step 2: Apply the row (horizontal) vector to that row, collapsing it to a single number.
Comparison
Convolution with Original Kernel: -2
Convolution with Low-Rank Vectors: 0
The results are different due to the simplifications made by the low-rank approximation. But this is part of the problem that we need to optimize for when picking low rank approximations: in practice, we will ALWAYS lose some accuracy, and the goal is to keep that loss acceptably small.
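To make the comparison concrete in code, the sketch below (again with made-up numbers, not the ones above) convolves one patch with a full kernel and with a rank-one approximation of that kernel obtained from the SVD; the two-vector version matches the rank-one kernel exactly but differs from the full kernel:

```python
import numpy as np

# Made-up 3x3 kernel and one 3x3 input patch (not the numbers used above).
K = np.array([[2, 0, -2],
              [1, 0, -1],
              [1, 1, -1]], dtype=float)    # deliberately NOT exactly rank one
P = np.array([[1, 2, 0],
              [0, 1, 3],
              [2, 1, 0]], dtype=float)

# Rank-one approximation K ~= outer(u, v) from the SVD.
U, s, Vt = np.linalg.svd(K)
u = U[:, 0] * s[0]
v = Vt[0, :]

full = (P * K).sum()                  # convolution with the original kernel
approx = (P * np.outer(u, v)).sum()   # convolution with the rank-one kernel
two_step = v @ (u @ P)                # the same thing done as two 1-D passes: u, then v

print(full, approx, two_step)         # approx == two_step, but both differ from full
```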
PyTorch Implementation
Below you can find the original definition of AlexNet.
```python
class Net(nn.Module):
    ...
```
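A minimal AlexNet-style definition along these lines, written as a sketch (the exact layer sizes and number of classes are assumptions, not necessarily the ones used in the original experiments):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """AlexNet-style CNN. Layer sizes here are illustrative assumptions."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
        self.conv2 = nn.Conv2d(64, 192, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(192, 384, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(384, 256, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.pool(F.relu(self.conv5(x)))
        x = torch.flatten(self.avgpool(x), 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```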
Now let us decompose the first convolutional layer into three simpler layers using the SVD.
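A minimal sketch of this SVD-based split (the helper name is illustrative, and it assumes a weight tensor of shape (C_out, C_in, k, k)):

```python
import torch
import torch.nn as nn

def svd_decompose_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Replace one k x k convolution with three smaller layers via a truncated SVD.

    The (C_out, C_in, k, k) weight is flattened to a (C_out, C_in*k*k) matrix,
    factored as U S V^T, and re-rolled into:
      1. a k x k convolution from C_in to `rank` channels  (rows of V^T),
      2. a 1 x 1 convolution scaling the `rank` channels   (diag(S)),
      3. a 1 x 1 convolution from `rank` to C_out channels (columns of U).
    """
    W = conv.weight.data                       # (C_out, C_in, k, k)
    C_out, C_in, k, _ = W.shape
    U, S, Vh = torch.linalg.svd(W.reshape(C_out, -1), full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]

    first = nn.Conv2d(C_in, rank, k, stride=conv.stride,
                      padding=conv.padding, bias=False)
    core = nn.Conv2d(rank, rank, 1, bias=False)
    last = nn.Conv2d(rank, C_out, 1, bias=conv.bias is not None)

    first.weight.data = Vh.reshape(rank, C_in, k, k)
    core.weight.data = torch.diag(S).reshape(rank, rank, 1, 1)
    last.weight.data = U.reshape(C_out, rank, 1, 1)
    if conv.bias is not None:
        last.bias.data = conv.bias.data
    return nn.Sequential(first, core, last)
```

Calling something like `svd_decompose_conv(net.conv1, rank=16)` returns a drop-in `nn.Sequential` replacement for that layer; the rank here is arbitrary, and choosing it properly is discussed below.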
Decomposition into a list of simpler operations
The examples above are quite simple and are perfectly good for simplifying neural networks, but this is still an active area of research. One thing researchers try to do is to further simplify each already-simplified operation; of course, you pay the price of performing more operations. The scheme we will use for this example breaks the convolution down into four simpler operations:
- (Green) Takes one pixel from the image across all channels and maps it to one value.
- (Red) Takes one long (vertical) strip of pixels from one channel and maps it to one value.
- (Blue) Takes one wide (horizontal) strip of pixels from one channel and maps it to one value.
- (Green) Takes one pixel across all channels and maps it to one value.
Intuitively, we are still taking the same subset “cube”, but we have broken the operation down so that any given step only mixes information along one dimension of the cube, either a single spatial direction or the channel axis, at a time.
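A sketch of what this four-step replacement looks like as PyTorch layers (the rank and shapes are illustrative; this mirrors the CP-style decomposition used below):

```python
import torch.nn as nn

def cp_style_block(c_in, c_out, k, rank, stride=1, padding=0):
    """Four cheap operations that stand in for one c_in -> c_out, k x k convolution."""
    return nn.Sequential(
        # (Green) 1x1 conv: one pixel across all input channels -> `rank` values.
        nn.Conv2d(c_in, rank, kernel_size=1, bias=False),
        # (Red) k x 1 depthwise conv: a vertical strip from one channel -> one value.
        nn.Conv2d(rank, rank, kernel_size=(k, 1), stride=(stride, 1),
                  padding=(padding, 0), groups=rank, bias=False),
        # (Blue) 1 x k depthwise conv: a horizontal strip from one channel -> one value.
        nn.Conv2d(rank, rank, kernel_size=(1, k), stride=(1, stride),
                  padding=(0, padding), groups=rank, bias=False),
        # (Green) 1x1 conv: one pixel across all `rank` channels -> c_out values.
        nn.Conv2d(rank, c_out, kernel_size=1, bias=True),
    )
```

For example, `cp_style_block(3, 64, k=11, rank=16, stride=4, padding=2)` has roughly 1.5k parameters, versus about 23k for the dense `nn.Conv2d(3, 64, 11, stride=4, padding=2)` it stands in for.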
PyTorch Implementation
In this section, we will take AlexNet (`Net`), evaluate it on some data (`evaluate_model`), and then decompose the convolutional layers.
Declaring both the original and low rank network
Here we will decompose the second convolutional layer, given by the `layer_to_replace` argument. The two important calls to pay attention to are `est_rank` and `cp_decomposition_conv_layer`: the first function estimates the rank of the convolutional layer, and the second decomposes the convolutional layer into a list of simpler operations.
```python
class lowRankNet(Net):
    ...
```
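A sketch of what such a class might look like, assuming the helpers described above; the exact signatures of `est_rank` and `cp_decomposition_conv_layer` here are assumptions, not the original code:

```python
import torch.nn as nn

class lowRankNet(Net):
    """Copy of Net with one convolutional layer replaced by its low rank decomposition."""
    def __init__(self, trained_net: Net, layer_to_replace: str = "conv2"):
        super().__init__()
        # Start from the trained weights.
        self.load_state_dict(trained_net.state_dict())
        conv = getattr(self, layer_to_replace)
        # Estimate the rank of the layer, then swap it for the decomposed version.
        rank = est_rank(conv)                                 # assumed helper, defined below
        decomposed = cp_decomposition_conv_layer(conv, rank)  # assumed helper, see references
        setattr(self, layer_to_replace, decomposed)
```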
Evaluate the Model
You can evaluate the model by running the following code. This will print the accuracy of the original model and the low rank model.
```python
decomp_alexnet = lowRankNetSVD(net)
```
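The rest of the comparison is just two calls to the evaluation helper; a sketch, assuming `evaluate_model` returns an accuracy and `test_loader` holds the evaluation data:

```python
# Sketch only: evaluate_model and test_loader are the helpers/data described above.
acc_original = evaluate_model(net, test_loader)
acc_low_rank = evaluate_model(decomp_alexnet, test_loader)
print(f"original model accuracy: {acc_original:.3f}")
print(f"low rank model accuracy: {acc_low_rank:.3f}")
```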
Let us first discuss `est_rank`. For a complete discussion, see the references by Nakajima et al. [1, 2]. The basic idea is that we take the tensor, “unfold” it along one axis (basically reduce the tensor to a matrix by collapsing the other axes), and estimate the rank of that matrix.
You can find `est_rank` below.
```python
from __future__ import division
```
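A sketch of `est_rank` consistent with the description above: it unfolds the weight tensor along the output-channel and input-channel modes and feeds each unfolding to the EVBMF routine from the references (the exact import path, return values, and the way the two estimates are combined are assumptions):

```python
import tensorly as tl
import VBMF   # the Python EVBMF/VBMF implementation linked in the references

def est_rank(layer):
    """Estimate a rank for a convolutional layer by unfolding its weight tensor
    along two modes and letting EVBMF pick the number of significant components."""
    W = layer.weight.data.cpu().numpy()       # shape: (C_out, C_in, k, k)
    unfold_0 = tl.base.unfold(W, 0)           # collapse all modes except output channels
    unfold_1 = tl.base.unfold(W, 1)           # collapse all modes except input channels
    _, diag_0, _, _ = VBMF.EVBMF(unfold_0)    # diagonal matrix of retained singular values
    _, diag_1, _, _ = VBMF.EVBMF(unfold_1)
    # One simple choice: keep the larger of the two estimated ranks.
    return max(diag_0.shape[0], diag_1.shape[1])
```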
You can find the EVBMF code on my GitHub page; I do not go into it in detail here. Jacob Gildenblat’s code is a great resource for an in-depth look at this algorithm.
Conclusion
So why is all this needed? The main reason is that we can reduce the number of operations needed to perform a convolution. This is particularly important in embedded systems where the number of operations is a hard constraint. The other reason is that we can reduce the number of parameters in a neural network, which can help with overfitting. The final reason is that we can reduce the amount of memory needed to store the neural network. This is particularly important in mobile devices where memory is a hard constraint.
What does this mean mathematically? Fundamentally, it means that neural networks are over-parameterized, i.e. they have far more parameters than the information they represent. By reducing the rank of the matrices needed to carry out a convolution, we are representing the same operation (as closely as possible) with a lot less information.
References
- [Low Rank Approximation of CNNs](https://arxiv.org/pdf/1511.06067)
- [CP Decomposition](https://arxiv.org/pdf/1412.6553)
- Kolda & Bader, “Tensor Decompositions and Applications,” SIAM Review, 2009.
- [1] Nakajima, Shinichi, et al. “Global analytic solution of fully-observed variational Bayesian matrix factorization.” Journal of Machine Learning Research 14.Jan (2013): 1-37.
- [2] Nakajima, Shinichi, et al. “Perfect dimensionality recovery by variational Bayesian PCA.” Advances in Neural Information Processing Systems (NIPS), 2012.
- [Python implementation of EVBMF](https://github.com/CasvandenBogaard/VBMF)
- [Accelerating Deep Neural Networks with Tensor Decompositions - Jacob Gildenblat](https://jacobgil.github.io/deeplearning/tensor-decompositions-deep-learning)
- [A similar, more high-level article](https://medium.com/@anishhilary97/low-rank-approximation-for-4d-kernels-in-convolutional-neural-networks-through-svd-65b30dc55f6b)