Deconvolution or Transposed Convolution: the reverse operation in a CNN
A quick gripe before I start: I slept with the balcony door open last night (2017-08-29) and, sure enough, came down with a cold today. I hadn't been sick in ages, so I suppose I was due for a crash (facepalm). Back to the topic: lately I've been using an AutoEncoder (AE) for sample pre-training and feature-representation learning, and I've picked up a few insights, so today I want to talk about "reverse" convolution in neural networks. I'm digging the hole now; once the code is written tomorrow this should read more coherently. [Update: ran a fever for two days; back now to fill in the hole...]
The open-source deep-learning frameworks I have used recently are Theano/Lasagne, TensorFlow, and caffe. The first two are mainly Python-based and easy to pick up; the third is written in C++ but also provides Python and MATLAB interfaces. All three are widely used in Deep Learning. The transposed convolution I discuss today draws on the Theano tutorial, as well as this repo of a "fully convolutional" autoencoder.
Note: I'll try to write in Chinese as much as possible, but I may get lazy here and there...
Auto-encoder
First, a brief word about the autoencoder (AutoEncoder, AE). Following the definition on Wiki, an AE is an artificial neural network (ANN) that is well suited to unsupervised learning, feature representation, and dimensionality reduction. Put more plainly, the goal of an AE is to extract, without any labels, the features that best characterize the data. The training idea described on Wiki is, roughly, to make the network reconstruct its own input, so that the hidden encoding is forced to capture the input's most salient features.
To achieve this, the most naive approach is to build a mirror-symmetric network: split it into an encoder and a decoder, where the decoder mirrors the encoder's structure and shares the weight parameters of the corresponding layers. In a network trained this way, the encoder can be used for feature extraction and dimensionality reduction, while the decoder can be used to generate new samples, which is also the main idea behind the generator of a Generative Adversarial Network (GAN).
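As a rough sketch of this tied-weight idea (plain numpy, a single fully connected hidden layer; the layer sizes, the sigmoid non-linearity, and all variable names are illustrative assumptions of mine, not something from the post):

```python
import numpy as np

rng = np.random.RandomState(0)
n_visible, n_hidden = 16, 4                # illustrative sizes
W = 0.1 * rng.randn(n_hidden, n_visible)   # encoder weights, reused by the decoder
b_enc = np.zeros(n_hidden)
b_dec = np.zeros(n_visible)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    # feature extraction / dimensionality reduction
    return sigmoid(W @ x + b_enc)

def decode(h):
    # mirrored layer: the shared weight matrix is used transposed
    return sigmoid(W.T @ h + b_dec)

x = rng.rand(n_visible)
x_hat = decode(encode(x))                  # reconstruction of the input
loss = np.mean((x - x_hat) ** 2)           # training would minimize this reconstruction error
```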
That said, some papers point out that weight sharing and mirrored architectures are not necessarily a good choice, especially in convolutional networks. Although convolution is linear, the corresponding matrix is in general not square and therefore has no inverse, so a mirrored network is not automatically meaningful (I still need to double-check this; something about it feels odd to me...). For a convolutional autoencoder (CAE), then, the "reverse" convolution in the decoder, although commonly called deconvolution, is not a true inverse and should really be called a transposed convolution.
Deconvolution or transposed convolution
Now, on to transposed convolution. I'll start with a few points taken from the Theano tutorial; its view of convolution as matrix multiplication is excellent!!!
Understanding transposed convolution
- Transposed convolution: a map from the output-vector space back to the input-vector space, while keeping the connectivity pattern of the convolution. Also called fractionally strided convolution.
- The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution. One might use such a transformation as the decoding layer of a convolutional autoencoder or to project feature maps to a higher-dimensional space. (this is what transposed convolutions are for)
- Every convolution boils down to an efficient implementation of a matrix operation, thus the insights gained from the fully-connected case are useful for solving the convolutional case. (the intrinsic nature of a convolution is still a matrix multiplication)
- The discussion of transposed convolution arithmetic is simplified by the fact that transposed convolution properties don't interact across axes.
The relation between convolution and matrix multiplication
To see why transposed convolution is possible, we first have to spell out the relation between convolution and matrix multiplication. Define the input matrix as $\mathbf{I}$, the kernel as $\mathbf{K}$, and the output matrix as $\mathbf{O}$, related by the convolution $\mathbf{O} = \mathbf{I} \ast \mathbf{K}$.
The convolution used in a CNN is not the same as the classical 2-D convolution: it only accumulates the element-wise products of a local region with the kernel and never mirrors (flips) the kernel beforehand, so it is really closer to a correlation. The figure below, taken from the Theano tutorial, gives an example: a 3×3 kernel slid over a 4×4 input with unit stride and no padding, producing a 2×2 output.
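A quick numpy/scipy check of the "no kernel flip" point (the input and kernel below are arbitrary numbers I made up): the CNN-style operation matches scipy's correlate2d, not its convolve2d, whenever the kernel is not symmetric under a 180° rotation.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1., 0., 0.],
              [0., 2., 0.],
              [0., 0., 3.]])

cnn_style = correlate2d(I, K, mode='valid')   # what a CNN layer computes (bias aside)
true_conv = convolve2d(I, K, mode='valid')    # classical 2-D convolution (kernel flipped)
assert not np.allclose(cnn_style, true_conv)  # they differ for this non-symmetric kernel
```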
So what is this operation really doing? We can turn the convolution into a matrix multiplication. Unroll $\mathbf{I}$ and $\mathbf{O}$ into vectors in row-major order (across each row first, then down the rows), denoted $\mathbf{I}_{v}$ and $\mathbf{O}_{v}$; with the parameters of the example above, $\mathbf{I}_{v}$ has 16 entries and $\mathbf{O}_{v}$ has 4.
Next, we encode the convolution as a matrix $\mathbf{C}$ whose entries are taken from the kernel matrix $\mathbf{K}$.
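Concretely, sticking with the 4×4-input / 3×3-kernel example and writing $k_{i,j}$ for the entries of $\mathbf{K}$ (my notation; this is a reconstruction of the matrix shown in the Theano tutorial), $\mathbf{C}$ is the sparse $4 \times 16$ matrix

$$
\mathbf{C} = \left(\begin{array}{cccccccccccccccc}
k_{0,0} & k_{0,1} & k_{0,2} & 0 & k_{1,0} & k_{1,1} & k_{1,2} & 0 & k_{2,0} & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 & 0 \\
0 & k_{0,0} & k_{0,1} & k_{0,2} & 0 & k_{1,0} & k_{1,1} & k_{1,2} & 0 & k_{2,0} & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & k_{0,0} & k_{0,1} & k_{0,2} & 0 & k_{1,0} & k_{1,1} & k_{1,2} & 0 & k_{2,0} & k_{2,1} & k_{2,2} & 0 \\
0 & 0 & 0 & 0 & 0 & k_{0,0} & k_{0,1} & k_{0,2} & 0 & k_{1,0} & k_{1,1} & k_{1,2} & 0 & k_{2,0} & k_{2,1} & k_{2,2}
\end{array}\right),
$$

where each row simply picks out the $3 \times 3$ patch of $\mathbf{I}_{v}$ that one output pixel sees.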
With $\mathbf{C}$ in hand, the convolution above reduces to a single matrix multiplication, $\mathbf{O}_{v} = \mathbf{C}\,\mathbf{I}_{v}$.
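Here is a small numerical sanity check of this relation in numpy/scipy (a sketch under the same assumptions as above: 4×4 input, 3×3 kernel, unit stride, no padding; conv_matrix is a helper I wrote for illustration, not a library function):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.RandomState(0)
I = rng.rand(4, 4)
K = rng.rand(3, 3)
O = correlate2d(I, K, mode='valid')          # CNN-style convolution, shape (2, 2)

def conv_matrix(K, in_shape, out_shape):
    """Build C so that O_v = C @ I_v for row-major unrolling (unit stride, no padding)."""
    C = np.zeros((out_shape[0] * out_shape[1], in_shape[0] * in_shape[1]))
    for m in range(out_shape[0]):            # output row
        for n in range(out_shape[1]):        # output column
            for i in range(K.shape[0]):
                for j in range(K.shape[1]):
                    C[m * out_shape[1] + n, (m + i) * in_shape[1] + (n + j)] = K[i, j]
    return C

C = conv_matrix(K, I.shape, O.shape)         # shape (4, 16)
O_v = C @ I.flatten()                        # the convolution as one matrix product
assert np.allclose(O_v, O.flatten())
```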
The tutorial also points out that this matrix $\mathbf{C}$ is precisely what the forward and backward (back-propagation) passes of CNN training use: transposing $\mathbf{C}$ lets the gradient be passed from the network's output back to its input, i.e.,

$$\frac{\partial\,\text{loss}}{\partial \mathbf{I}_{v}} = \mathbf{C}^{T}\,\frac{\partial\,\text{loss}}{\partial \mathbf{O}_{v}}.$$
This last relation is exactly the "reverse convolution" we are after. Here is how the Theano tutorial puts it:
- Though the kernel defines a convolution, whether it’s a direct convolution or a transposed convolution is determined by how the forward and backward passes are computed.
- For instance, the kernel $w$ defines a convolution whose forward and backward passes are computed by multiplying with $C$ and $C^{T}$ respectively, but it also defines a transposed convolution whose forward and backward passes are computed by multiplying with $C^{T}$ and $(C^{T})^{T} = C$ respectively.
Therefore, carrying out a transposed convolution with the machinery of back-propagation is entirely feasible: all we need to do is swap the roles of $\mathbf{C}$ and $\mathbf{C}^{T}$ in the computation.
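To make the "swap $\mathbf{C}$ and $\mathbf{C}^{T}$" point concrete, here is a tiny self-contained 1-D sketch (the length-4 input, length-3 kernel, and all numbers are made up for illustration):

```python
import numpy as np

# Length-4 input, length-3 kernel, unit stride, no padding -> length-2 output, so C is 2x4.
k = np.array([1.0, 2.0, 3.0])
C = np.array([[k[0], k[1], k[2], 0.0],
              [0.0,  k[0], k[1], k[2]]])

x = np.array([1.0, 0.0, -1.0, 2.0])
y = C @ x                  # forward pass of the convolution:               R^4 -> R^2
g = np.array([0.5, -1.0])
x_bar = C.T @ g            # backward pass, i.e. the transposed convolution: R^2 -> R^4
```

Multiplying by $\mathbf{C}^{T}$ does not invert the convolution; it only maps back to the input space while preserving the connectivity pattern, which is exactly the point made above.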
Implementing transposed convolution
Finally, let's look at how to actually carry out the "reverse convolution". From the analysis above, a transposed convolution runs in the opposite direction of a convolution, so the simplest way to implement it is with a direct convolution, which is where the name deconvolution comes from. As the tutorial points out, however, this usually requires zero padding the input and therefore wastes computation, so Theano defines a dedicated function, theano.tensor.nnet.abstract_conv.conv2d_grad_wrt_inputs, that implements the operation more efficiently (a small numerical check of the padding equivalence follows the quoted points below).
- It is always possible to implement a transposed convolution with a direct convolution. The disadvantage is that it usually involves adding many columns and rows of zeros to the input, resulting in a much less efficient implementation.
- The simplest way to think about a transposed convolution is by computing the output shape of the direct convolution for a given input shape first, and then inverting the input and output shapes for the transposed convolution.
- To maintain the same connectivity pattern in the equivalent convolution it is necessary to zero pad the input in such a way that the first (top-left) application of the kernel only touches the top-left pixel, i.e., the padding has to be equal to the size of the kernel minus one.
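As a hedged numerical check of these quoted points (numpy/scipy, reusing the assumed 4×4-input / 3×3-kernel setup; A, conv_matrix, and the other names are mine): multiplying by $\mathbf{C}^{T}$ gives exactly the same 4×4 result as zero-padding the 2×2 map with kernel-size-minus-one zeros on every side and then running a direct convolution. Note that, with the no-flip "CNN convolution" convention used throughout, the equivalent direct operation also rotates the kernel by 180°.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.RandomState(0)
K = rng.rand(3, 3)        # kernel
A = rng.rand(2, 2)        # a 2x2 map living in the convolution's output space

# (a) Transposed convolution via C^T, with C built exactly as in the earlier sketch.
def conv_matrix(K, in_shape, out_shape):
    C = np.zeros((out_shape[0] * out_shape[1], in_shape[0] * in_shape[1]))
    for m in range(out_shape[0]):
        for n in range(out_shape[1]):
            for i in range(K.shape[0]):
                for j in range(K.shape[1]):
                    C[m * out_shape[1] + n, (m + i) * in_shape[1] + (n + j)] = K[i, j]
    return C

C = conv_matrix(K, (4, 4), (2, 2))
via_matrix = (C.T @ A.flatten()).reshape(4, 4)

# (b) The same result as a direct convolution: pad A with (kernel size - 1) zeros
#     on every side and correlate with the 180-degree-rotated kernel.
padded = np.pad(A, K.shape[0] - 1, mode='constant')            # 2x2 -> 6x6
via_padding = correlate2d(padded, K[::-1, ::-1], mode='valid') # 6x6 -> 4x4

assert np.allclose(via_matrix, via_padding)
assert np.allclose(via_matrix, convolve2d(A, K, mode='full'))  # equivalent one-liner
```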