The auto-encoder has been widely applied to unsupervised learning. It is usually composed of two symmetric parts, namely the encoder and the decoder. Realizing an autoencoder with only fully-connected layers, i.e., a DNN, is easy, but the convolutional case is less obvious.
In the convolutional case, each layer of the decoder mirrors the shape and kernel configuration of its symmetric layer in the encoder, so the deconvolution, or transpose convolution, operation is used in place of the convolution operation.
TensorFlow provides a convenient method named conv2d_transpose in both the tf.nn module and the tf.contrib.layers module. However, with tf.contrib.layers.conv2d_transpose, when the convolution stride is 2 and the desired output shape is odd, the output shape of the transpose convolution cannot be forced to the desired one.
For example, let $x$ denote a 4-D tensor of shape [None, 9, 9, 1]. Convolving it with a [3, 3] kernel, stride 2, and half (SAME) padding gives a 4-D tensor $y$ of shape [None, 5, 5, 1]. However, the transpose convolution of $y$ with the same parameter settings produces $x'$ of shape [None, 10, 10, 1], not [None, 9, 9, 1].
To handle this, I use a naive but effective workaround: let the transpose convolution produce the even-sized output and then crop the extra row and column, as follows.
```python
import tensorflow as tf
import tensorflow.contrib.layers as layers

# x: the encoded (down-sampled) feature map; y: the desired reconstruction target
x = tf.placeholder(tf.float32, shape=[None, 5, 5, 1])
y = tf.placeholder(tf.float32, shape=[None, 9, 9, 1])
kernel_size = [3, 3]
stride = 2

# With stride 2 and SAME padding this produces a [None, 10, 10, 1] tensor
x_r = layers.conv2d_transpose(
    inputs=x,
    num_outputs=y.get_shape().as_list()[-1],  # number of output channels (1 here)
    kernel_size=kernel_size,
    padding='SAME',
    stride=stride,
    scope='conv2d_transpose'
)
# Crop the extra row and column to recover the desired [None, 9, 9, 1] shape
x_r = x_r[:, 0:-1, 0:-1, :]
```
The above solution works well in my code, though the crop may introduce some bias.
A convolutional auto-encoder is usually composed of two symmetric parts, i.e., the encoder and the decoder. With TensorFlow, it is easy to build the encoder using modules such as tf.contrib.layers or tf.nn, which encapsulate methods for convolution, downsampling, and dense operations.
However, for the decoder, TF does not provide an upsampling method, i.e., the reverse of downsampling (avg_pool2d, max_pool2d). This is probably because max pooling is used more often than average pooling, and recovering an image from a max-pooled matrix is difficult since the locations of the max points are lost.
For average-pooled feature maps, there is a simple way to realize upsampling without a high-level API like Keras, using only basic TF functions.
Now, suppose the input is a 4-D tensor of shape [1, 4, 4, 1] and the sampling rate is [1, 2, 2, 1]; then the upsampled output is a 4-D tensor of shape [1, 8, 8, 1]. The following lines realize this operation.
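A minimal sketch of one possible realization, using only tf.reshape and tf.tile to copy every value into a rate × rate block (the exact basic ops used may differ):

```python
import tensorflow as tf

def upsample(x, rate=2):
    """Copy each value of a [N, H, W, C] tensor into a rate x rate block."""
    _, h, w, c = x.get_shape().as_list()
    # [N, H, W, C] -> [N, H, 1, W, 1, C]
    out = tf.reshape(x, [-1, h, 1, w, 1, c])
    # duplicate along the two singleton axes
    out = tf.tile(out, [1, 1, rate, 1, rate, 1])
    # collapse back to [N, H*rate, W*rate, C]
    return tf.reshape(out, [-1, h * rate, w * rate, c])

x = tf.placeholder(tf.float32, shape=[1, 4, 4, 1])
x_up = upsample(x)   # shape: [1, 8, 8, 1]
```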
The official TF documentation explains the recommended approach as follows: "Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op." For example:
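Roughly, the dependency looks like this (here `optimizer` and `loss` stand in for whatever the model defines):

```python
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    # running train_op now also refreshes moving_mean and moving_variance
    train_op = optimizer.minimize(loss)
```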
See this blog post, whose author gives an even better explanation: "When you execute an operation (such as train_step), only the subgraph components relevant to train_step will be executed. Unfortunately, the update_moving_averages operation is not a parent of train_step in the computational graph, so we will never update the moving averages!"
The author's solution: "Personally, I think it makes more sense to attach the update ops to the train_step itself. So I modified the code a little and created the following training function."
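A condensed sketch of that idea (the function name and hyper-parameters here are illustrative, not the blog's exact code):

```python
def build_train_step(loss, learning_rate=0.01):
    # Attach the batch-norm update ops to the train step itself, so that a
    # single sess.run(train_step) also updates the moving averages.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        return tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
```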
First, create the TensorFlow graph that you’d like to collect summary data from, and decide which nodes you would like to annotate with summary operations.
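For instance, a loss node can be annotated and written out like this (a minimal sketch; `loss`, `train_op`, `sess`, `feed`, `step`, and the log directory are placeholders):

```python
tf.summary.scalar('loss', loss)                    # annotate the loss node
merged = tf.summary.merge_all()                    # collect all summary ops
writer = tf.summary.FileWriter('./logs', sess.graph)

summary, _ = sess.run([merged, train_op], feed_dict=feed)
writer.add_summary(summary, step)                  # view in TensorBoard
```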
First, Ioffe and Szegedy state the motivation for proposing BN in the abstract; in their words: "Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities."
In his 1998 paper, LeCun made the point that "the network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero means and unit variances, and decorrelated." That is, normalizing the network's inputs helps speed up convergence. This is fairly intuitive: in traditional machine learning, the inputs to a classifier are usually hand-crafted features whose units and magnitudes differ considerably, and if fed into the network directly, the features with larger values would dominate the training result. To address this, the features are usually made dimensionless, and the best way to do so is normalization. This per-dimension normalization is also the core of BN.
Although the above normalizes the inputs, as the BN paper notes, "simply normalizing each input of a layer may change what the layer can represent." Take the sigmoid activation as an example: if $x^{(k)}$ originally takes large values, it falls near the saturated ends of the sigmoid, whereas after whitening it concentrates around $x = 0$, the roughly linear middle of the sigmoid, so the original distribution is lost. To address this, a scale and shift transform is applied to $\hat{x}$, giving
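$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)},$$

where $\gamma^{(k)}$ and $\beta^{(k)}$ are parameters learned along with the network, so the transform can recover the original activations if that is what the network needs.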