Variational autoencoder and the conditional case

Apr 15, 2018

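Today's post discusses the variational auto-encoder (VAE) and its conditionally constrained variant, the conditional VAE: how they differ from the traditional AE, how the objective is derived, and how to implement them in TensorFlow.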

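The traditional auto-encoder is an unsupervised learning tool used mainly for feature extraction, dimensionality reduction, and compression. It consists of an encoder and a decoder. The encoder plays the role of the feature-extraction module in a classification network: it extracts features from a sample and compresses them into a short vector. The decoder then decodes this feature vector to recover the original sample. The usual quality measure (or objective function) for an auto-encoder is the mean squared error (MSE),

$\begin{equation} L_{MSE} = \frac{1}{N}\sum^{N}_{i=1}{\left(X_\mathrm{in}^{(i)} - X_\mathrm{out}^{(i)}\right)^2}, \end{equation}$

where $X_\mathrm{in}^{(i)}$ and $X_\mathrm{out}^{(i)}$ are the encoder input and the decoder output for the $i$-th of $N$ samples.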

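The problem with the traditional AE is that the decoder can only reconstruct samples from feature vectors that the encoder has extracted from existing samples; it is hard to produce new samples by generating feature vectors directly (features can of course be generated with Monte Carlo simulation or a model such as a GMM; see this repo of mine if you are interested). The traditional AE is therefore not really a generative model.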

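To address this, Kingma & Welling proposed the variational auto-encoder in Auto-Encoding Variational Bayes. Although its components resemble those of an AE, the two are quite different: the VAE is built on Bayesian theory, uses the encoder and decoder networks to capture the nonlinearity of the parameterization, and maximizes the likelihood of the input samples under the decoder output. In short, it tries to learn the distribution of the data itself and then draws new samples from that distribution.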

Variational auto-encoder

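The VAE first defines a multi-dimensional random variable $z$, called the latent variable (its dimension is much smaller than that of $X$), with distribution $P(z)$. It then defines a mapping $f: z\times \theta \to X$; that is, the mapping $f$, governed by parameters $\theta$ and applied to $z \sim P(z)$, should approximate the real samples $X$ (see Eq. (2), where $P(X|z;\theta)$ denotes the probability distribution of $f(z;\theta)$). The sample $X$ can thus be characterized through $z$. Because $z$ cannot be observed directly, it is called a latent (hidden) variable.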

$\begin{equation} P(X) = \int{P(X|z;\theta)P(z)dz}. \end{equation}$

The VAE assumes that $P(z)$ is the standard normal distribution $N(0, I)$ and maximizes Eq. (2) by estimating the parameters $\theta$. The VAE network is therefore really solving for the distribution $P(X|z;\theta)$, which differs from the usual understanding of an AE. As Tutorial on Variational Autoencoders puts it, They are called “autoencoders” only because the final training objective that derives from this setup does have an encoder and a decoder, and resembles a traditional autoencoder.

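The VAE encoder can be understood as $P(z|X)$ and the decoder as $P(X|z)$; the goal is to maximize the decoder output $P(X|z)$, or equivalently $\log P(X|z)$. Because $z$ cannot be observed, the distribution $P(z|X)$ cannot be computed directly. The VAE neatly applies a variational approach: it defines a distribution $Q(z|X)$ to approximate $P(z|X)$. In the end $Q(z|X)$ is the encoder, and $E_{z\sim Q}\log P(X|z)$ represents the decoder.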

The KL divergence is commonly used to describe how similar two distributions are; the closer it is to zero, the more similar they are. Define $\mathrm{KL}(Q(z|X)||P(z|X))$,

$\begin{equation} \mathrm{KL}(Q(z|X)||P(z|X)) = E_{z\sim Q}\left[\log{\frac{Q(z|X)}{P(z|X)}}\right] \end{equation}$

Using Bayes' theorem, $P(z|X) = P(X|z)P(z)/P(X)$, we obtain (see reference 2 for the detailed steps),

$\begin{equation} \log{P(X)} - \mathrm{KL}(Q(z|X)||P(z|X)) = E_{z \sim Q}\log{P(X|z)} - \mathrm{KL}(Q(z|X)||P(z)) \end{equation}$

Because $\mathrm{KL}(Q(z|X)||P(z|X))\geq 0$, the right-hand side of the equation above is a lower bound on $\log P(X)$, i.e.

$\begin{equation} \log{P(X)} \geq E_{z\sim Q}\log{P(X|z)} - \mathrm{KL}(Q(z|X)||P(z)) \end{equation}$

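The two terms on the right-hand side of this bound correspond to the loss functions of the decoder and the encoder in the VAE, i.e. the objective to be maximized, which is very similar in spirit to a GAN. $E_{z\sim Q}\log P(X|z)$ is usually expressed as a mean squared error, or alternatively as a cross-entropy. Here I derive $\mathrm{KL}(Q(z|X)||P(z))$ in detail, which helps with the TensorFlow implementation. Assume $Q(z|X)$ is the normal distribution $N(\mu_Q, \sigma_Q^2 I)$ and $P(z) = N(\mu_P, \sigma_P^2 I)$ is the standard normal, so $\mu_P = 0$ and $\sigma_P = 1$. The derivation is written for a single dimension; the full KL is the sum over dimensions.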

$\begin{align} \mathrm{KL}\left(Q(z|X)||P(z)\right) &= \int{Q(z|X)\log{\frac{Q(z|X)}{P(z)}}dz} \notag \\ &= \int{Q(z|X)\log{\frac{\frac{1}{\sqrt{2\pi}\sigma_Q}e^{-\frac{(z-\mu_Q)^2}{2\sigma^2_Q}}}{\frac{1}{\sqrt{2\pi}\sigma_P}e^{-\frac{(z-\mu_P)^2}{2\sigma^2_P}}}}}dz \notag \\ &= \int{Q(z|X)}\left(\log{\frac{\sigma_P}{\sigma_Q}}-\frac{(z-\mu_Q)^2}{2\sigma^2_Q} + \frac{(z-\mu_P)^2}{2\sigma^2_P} \right)dz \notag \\ &= \log{\sigma_P} - \log{\sigma_Q} - 0.5 + \int{Q(z|X)}\frac{(z-\mu_Q + \mu_Q - \mu_P)^2}{2\sigma^2_P}dz \notag \\ &= \log{\sigma_P} - \log{\sigma_Q} - 0.5 + \notag \\ & \quad \frac{1}{2\sigma^2_P}\int{Q(z|X)}\left[(z-\mu_Q)^2 + (\mu_Q - \mu_P)^2 + 2(z-\mu_Q)(\mu_Q-\mu_P)\right]dz \notag \\ &= \log{\sigma_P} - \log{\sigma_Q} - 0.5 + \frac{\sigma^2_Q}{2\sigma^2_P} + \frac{(\mu_Q - \mu_P)^2}{2\sigma^2_P} \notag \\ &= -0.5 - \log{\sigma_Q} + \frac{\sigma^2_Q}{2} + \frac{\mu_Q^2}{2} \end{align}$
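As a sanity check on the closed form, here is a minimal pure-Python sketch (not part of the original notebook) that compares the one-dimensional result with a Monte Carlo estimate of $E_{z\sim Q}[\log Q(z) - \log P(z)]$. The function names and the values of `mu` and `sigma` are illustrative assumptions.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one dimension."""
    return -0.5 - math.log(sigma) + 0.5 * sigma ** 2 + 0.5 * mu ** 2

def kl_monte_carlo(mu, sigma, n=200000, seed=0):
    """Estimate the same KL as the average of log Q(z) - log P(z), z ~ Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mu, sigma)
        log_q = -math.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2
        log_p = -0.5 * z ** 2
        total += log_q - log_p  # the 0.5 * log(2 * pi) terms cancel
    return total / n

closed = kl_to_standard_normal(0.7, 0.5)
estimate = kl_monte_carlo(0.7, 0.5)
print(closed, estimate)
```

The two numbers should agree to about two decimal places, and the closed form is exactly zero when $\mu_Q = 0$ and $\sigma_Q = 1$, i.e. when $Q$ already equals the prior.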

Conditional Variational auto-encoder

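A VAE can only generate samples that follow the overall data distribution. For data with explicit labels, such as MNIST, can we impose a condition so that the VAE generates handwritten digits of a specified class? This motivation led to the conditional VAE. Let $Y$ denote the sample label; using $Y$ as an extra constraint on both the encoder and the decoder (see Fig. 1(c)) achieves this. Accordingly, the likelihood of the conditional variational auto-encoder becomes,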

$\begin{equation} \log{P(X|Y)} - \mathrm{KL}(Q(z|X,Y)||P(z|X,Y)) = E_{z\sim Q}\log P(X|z,Y) - \mathrm{KL}(Q(z|X,Y)||P(z)) \notag \end{equation}$

The objective of the conditional VAE is essentially the same as that of the VAE and can still be split into an encoder part and a decoder part.

TensorFlow implementation of the VAE and the conditional VAE

Here is a brief walk-through of the VAE implementation in TensorFlow; see the notebook here for details.

Init

import tensorflow as tf  # TensorFlow 1.x API
batch_size = 64
X_in = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='X_in')
X_out = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='X_out')
keep_prob = tf.placeholder(dtype=tf.float32, shape=(), name='keep_prob')
n_latent = 32

Encoder

def encoder(X_in, keep_prob):
    with tf.variable_scope("encoder", reuse=None):
        x = tf.layers.dense(X_in, units=128, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)        
        x = tf.layers.dense(x, units=64, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)        
        # The latent layer
        mu = tf.layers.dense(x, units=n_latent)
        sigma = 1e-6 + tf.nn.softplus(tf.layers.dense(x, units=n_latent)) # softplus to avoid negative sigma
        # reparameterization
        epsilon = tf.random_normal(tf.shape(mu))
        z  = mu + tf.multiply(epsilon, sigma)        
        return z, mu, sigma
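
The reparameterization step in the encoder (`z = mu + epsilon * sigma`) can be checked outside TensorFlow. The following is a minimal sketch in plain Python, with illustrative values for `mu` and `sigma`: sampling $\epsilon \sim N(0, 1)$ and shifting and scaling it yields samples distributed as $N(\mu, \sigma^2)$, while keeping `mu` and `sigma` as ordinary inputs through which gradients can flow.

```python
import random
import statistics

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of noise eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(1)
samples = [reparameterize(2.0, 0.5, rng) for _ in range(50000)]
sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)
```

The empirical mean and standard deviation should land close to 2.0 and 0.5.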

Decoder

def decoder(z, keep_prob):
    with tf.variable_scope("decoder", reuse=None):
        x = tf.layers.dense(z, units=64, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.dense(x, units=128, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.dense(x, units=784, activation=tf.nn.sigmoid) 
        # sigmoid constrains the output to (0, 1)
        return x

Network instance

z, mu, sigma = encoder(X_in, keep_prob)
dec = decoder(z, keep_prob)

Loss function

loss_d = - tf.reduce_sum(
    X_out * tf.log(1e-8+dec) + (1.0 - X_out) * tf.log(1e-8+ 1.0 - dec), 1)
loss_e = 0.5 * tf.reduce_sum(
    tf.square(mu) + tf.square(sigma) - tf.log(1e-8 + tf.square(sigma)) - 1, 1)
# Mind the signs here: both terms are losses to be minimized
loss = tf.reduce_mean(loss_d + loss_e)
optimizer = tf.train.AdamOptimizer(0.0001).minimize(loss)

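Different definitions of $E_{z\sim Q}\log(P(X|z))$ and a comparison of results

Finally, here is a result: I compared cross-entropy and mean squared error as the loss expressing $E_{z\sim Q}\log(P(X|z))$. With identical parameters, MSE appears to work somewhat better, as shown in the figure below.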

References


Generative adversarial network and its conditional case

Apr 14, 2018

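I originally wanted to discuss InfoGAN, but I will leave that for later. Today's topic is the standard GAN, the generative adversarial network. The GAN was first proposed by Ian Goodfellow in 2014, around the same time as the variational auto-encoder (VAE). Both are very good generative networks and play an important role in unsupervised learning.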

Basic theory

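A generative adversarial network is a minimax game whose goal is to learn a generator, whose distribution $P_G(x)$ matches the real data distribution $P_\mathrm{data}(x)$; that is, to build a generator $G(z)$ that turns noise $z$ drawn from a random distribution (usually Gaussian) into simulated samples $x_g$ that resemble the target samples $x_\mathrm{data}$.

To measure how similar $x_g$ is to $x_\mathrm{data}$, we design an adversarial network $D(x)$ called the discriminator. Its input can be either a real sample $x_\mathrm{data}$ or a fake sample $x_{g}$ produced by the generator, and we require $D(x)$ to tell whether its input is real. The discriminator output should therefore be the probability that the sample is real; for a fixed generator the optimal discriminator is,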

$\begin{equation} D(x) = \frac{P_\mathrm{data}(x)}{P_\mathrm{data}(x)+P_{G}(x)} \end{equation}$

When $P_{G}(x)$ is close to $P_\mathrm{data}(x)$, $D(x) \approx 1/2$: the discriminator cannot tell whether its input comes from the real data or from the generator. $G$ and $D$ thus form a pair of mutually adversarial networks, hence the name. The corresponding objective function is,

$\begin{equation} \min\limits_{G}\max\limits_{D} V(D,G) = \mathrm{E}_{x \sim P_\mathrm{data}}[\log D(x)] + \mathrm{E}_{z}[\log(1 - D(G(z)))] \end{equation}$
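To make the role of the discriminator concrete, here is a small sketch on discrete distributions, using the value function in its standard form from Goodfellow et al., $V = \mathrm{E}_{x\sim P_\mathrm{data}}[\log D(x)] + \mathrm{E}_{x\sim P_G}[\log(1 - D(x))]$. The two toy distributions are made up for illustration. With the optimal discriminator plugged in, the value is smallest when $P_G = P_\mathrm{data}$, where it equals $-\log 4$.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = P_data(x) / (P_data(x) + P_G(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_function(d, p_data, p_g):
    """V(D, G) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))]."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for dx, pd, pg in zip(d, p_data, p_g))

p_data = [0.1, 0.2, 0.3, 0.4]
p_g_bad = [0.4, 0.3, 0.2, 0.1]   # a generator that is far from the data

v_bad = value_function(optimal_discriminator(p_data, p_g_bad), p_data, p_g_bad)
v_match = value_function(optimal_discriminator(p_data, p_data), p_data, p_data)
print(v_bad, v_match)
```

When the generator matches the data, every entry of the optimal discriminator is 0.5, which is the "cannot tell" situation described above.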

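Following Goodfellow's paper, the objective is optimized in two alternating steps,

Step1

Randomly generate a minibatch of noise vectors $z$ as generator input and obtain the corresponding outputs $x_g$;
Randomly select a minibatch of real samples $x_\mathrm{data}$;
Compute $L_D = \mathrm{E}_{x \sim P_\mathrm{data}}[\log D(x)] + \mathrm{E}_{z}[\log(1 - D(G(z)))]$;
Update the discriminator parameters along the gradient of $L_D$ (ascent, since $D$ maximizes $L_D$).

Step2

Randomly generate a minibatch of noise vectors $z$ as generator input and obtain the corresponding outputs $x_g$;
Compute $L_G = \mathrm{E}_{z}[\log(1 - D(G(z)))]$;
Update the generator parameters along the gradient of $L_G$ (descent, since $G$ minimizes $L_G$).

Alternate between these two steps until $L_D$ and $L_G$ converge, i.e. After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve.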

Conditional case

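The above is the original GAN, an unsupervised network. Taking the MNIST handwritten-digit dataset as an example, what should we do if we want a generator that produces a specified digit? This can be viewed as supervised learning with a GAN, i.e. the conditional GAN. As an aside, InfoGAN splits $z$ into noise + latent parts and, without any labels, learns contextual semantic representations of the samples in an unsupervised way, which in my view leaves the original GAN far behind.

So what is the idea behind the conditional GAN? In the figure above, the original GAN and the conditional GAN are shown from left to right. The conditional GAN adds a new variable $y$, the sample label, to the inputs of both the generator and the discriminator as a fixed constraint that guides the network to learn the differences between classes. The objective function and the training procedure change very little. This idea is very similar to the conditional variational auto-encoder. The updated objective is,

$\begin{equation} \min\limits_{G}\max\limits_{D} V(D,G) = \mathrm{E}_{x \sim P_\mathrm{data}}[\log D(x|y)] + \mathrm{E}_{z}[\log(1 - D(G(z|y)))] \end{equation}$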

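TensorFlow implementation of the GAN

First, a complaint: tuning a GAN is really not easy. When the generator and discriminator fight each other, I found that the losses often drift off partway through training, and the final output looks strange. The choice of activation function, the dropout keep_prob, the length of z, and the batch size all have an effect. I have provided notebooks for the GAN and the conditional GAN; try them if you are interested. Below I go through the conditional GAN implementation step by step.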

Initialization

import tensorflow as tf  # TensorFlow 1.x API
batch_size = 64
z_len = 100
z_noise = tf.placeholder(dtype=tf.float32, shape=[None, z_len], name='z_noise')
y = tf.placeholder(dtype=tf.float32, shape=[None, 10], name='y')
x_data = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='x_data')
keep_prob = tf.placeholder(dtype=tf.float32, shape=(), name='keep_prob')

Generator

def generator(z_noise, y, keep_prob, namescope='generator'):
    """The generator"""
    with tf.name_scope(namescope):
        net = tf.concat([z_noise, y], axis=1)
        net = tf.layers.dense(net, units=150, activation=tf.nn.relu, name='g_fc1')
        net = tf.nn.dropout(net, keep_prob=keep_prob)
        net = tf.layers.dense(net, units=300, activation=tf.nn.relu, name='g_fc2')
        net = tf.nn.dropout(net, keep_prob=keep_prob)
        net = tf.layers.dense(net, units=784, activation=tf.nn.sigmoid, name='g_fc3')
    return net

Discriminator

def discriminator(d_in, y, z_len, keep_prob, namescope='discriminator', reuse=True):
    """The discriminator"""
    with tf.name_scope(namescope):
        net = tf.concat([d_in, y], axis=1)
        net = tf.layers.dense(net, units=300, activation=tf.nn.relu, name='d_fc1', reuse=reuse)
        net = tf.nn.dropout(net, keep_prob=keep_prob)
        net = tf.layers.dense(net, units=150, activation=tf.nn.relu, name='d_fc2', reuse=reuse)
        net = tf.nn.dropout(net, keep_prob=keep_prob)
        net = tf.layers.dense(net, units=1, activation=tf.nn.sigmoid, name='d_fc4', reuse=reuse)
        return net

Network instance

# generate the network
x_g = generator(z_noise, y, keep_prob)
d_g = discriminator(x_g, y, z_len, keep_prob, reuse=False)
d_data = discriminator(x_data, y, z_len, keep_prob)

Loss and optimizer

# get variables
varlist = tf.trainable_variables() # list the trainable variables to obtain the parameter lists of G and D
# The objective
with tf.name_scope("loss"):
    loss_d = - (tf.reduce_mean(tf.log(1e-8 + d_data)) + tf.reduce_mean(tf.log(1e-8 + 1 - d_g)))
    loss_g = - tf.reduce_mean(tf.log(1e-8 + d_g))
    train_op_g = tf.train.AdamOptimizer(0.0001).minimize(loss_g, var_list=varlist[0:6])
    train_op_d = tf.train.AdamOptimizer(0.0001).minimize(loss_d, var_list=varlist[6:])
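
Note that `loss_g` above is $-\log D(G(z))$ rather than the $\log(1 - D(G(z)))$ term of the minimax objective. This is the non-saturating variant that Goodfellow's paper suggests for practice. The sketch below, which is my own illustration and not taken from the notebook, compares the gradient of both losses with respect to the discriminator logit `a`, where $D = \mathrm{sigmoid}(a)$, at a point where the discriminator confidently rejects the fake sample.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the generator term of the minimax objective."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the generator loss used in the code above."""
    return -(1.0 - sigmoid(a))

logit = -6.0  # D(G(z)) = sigmoid(-6), roughly 0.0025: the fake is easily spotted
g_sat = grad_saturating(logit)
g_non = grad_non_saturating(logit)
print(g_sat, g_non)
```

Early in training, when fakes are easy to spot, the minimax term gives the generator almost no gradient, while the non-saturating loss still gives a gradient of magnitude close to 1.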

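Conditional GAN results on MNIST

For the MNIST handwritten-digit dataset I implemented a conditional GAN that generates a specified digit from its label. The network configuration, listed in the table below, follows this blog post.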

Subnet	Layer	Nodes	Activation	Dropout
Generator	input [z, y]	32+10	-	-
Generator	FC	150	relu	T
Generator	FC	300	relu	T
Generator	FC	784	sigmoid	F
Discriminator	input [x, y]	784+10	-	-
Discriminator	FC	300	relu	T
Discriminator	FC	150	relu	T
Discriminator	FC	1	sigmoid	F

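Below are the experimental results (generated handwritten digits, one digit per row). As the number of iterations increases, the generated digits become progressively sharper and more accurate.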

References


Differences between tf.layers.dense and tf.contrib.layers.fully_connected

Apr 08, 2018

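Both tf.layers and tf.contrib.layers in TensorFlow provide modules for building neural networks, but they differ from each other within a single release and also change between releases. TF really does update fast.

Today's topic is the difference between tf.layers.dense and tf.contrib.layers.fully_connected, both of which build a fully connected layer. According to the TensorFlow documentation, their parameters are as follows,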

tf.layers.dense

tf.layers.dense(
    inputs,
    units,
    activation=None,
    use_bias=True,
    kernel_initializer=None,
    bias_initializer=tf.zeros_initializer(),
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    trainable=True,
    name=None,
    reuse=None
)

tf.contrib.layers.fully_connected

fully_connected(
    inputs,
    num_outputs,
    activation_fn=tf.nn.relu,
    normalizer_fn=None,
    normalizer_params=None,
    weights_initializer=initializers.xavier_initializer(),
    weights_regularizer=None,
    biases_initializer=tf.zeros_initializer(),
    biases_regularizer=None,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    scope=None
)

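As can be seen, tf.layers.dense is the simpler of the two: its activation and kernel_initializer both default to None, whereas fully_connected defaults to activation_fn=tf.nn.relu and weights_initializer=initializers.xavier_initializer(). Always specify these arguments explicitly; otherwise switching from one function to the other silently changes the network and leads to hard-to-trace errors.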

Reference


Confusion matrix and generation with TensorFlow

Mar 17, 2018

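A short note on an evaluation method for multi-class classification and its visualization: the confusion matrix (CM). I first give its definition and purpose, then an example.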

Confusion matrix

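A confusion matrix evaluates a classifier by comparing its predictions with the true labels, and is typically used for multi-class problems. The TF audio recognition tutorial describes it as follows: This matrix can be more useful than just a single accuracy score because it gives a good summary of what mistakes the network is making. In other words, the confusion matrix helps identify the classes on which the classifier performs worst, which is more informative than a single accuracy figure.

The confusion matrix is a two-dimensional square matrix whose rows correspond to the true labels and whose columns correspond to the predicted labels; the element $CM_{ij}$ is the number of samples that actually belong to class $i$ and were classified as class $j$. The more the non-zero elements concentrate on the diagonal, the better the classifier.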
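The definition can be written out in a few lines of plain Python. This is a sketch with made-up labels for a three-class problem, with rows indexing the true class and columns the predicted class; it is not the code used for the figures in this post.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """cm[i][j] counts samples of true class i that were predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
for row in cm:
    print(row)
print(accuracy(cm))
```

The off-diagonal entries show which classes get confused with which, something the single accuracy number hides.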

Examples

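Here is an example, a six-class problem with results from our recent work. A has about 1,000 samples, B has about 10,000, and "A random" is the CM obtained with randomly generated predicted labels. In the CM map of A, most samples lie on the diagonal. B also does well, but the (6,6) cell is clearly much paler, which shows that the classifier handles the sixth class poorly. For A random, because the predictions are random, the counts are spread fairly evenly over all elements of the CM, and the classification accuracy is correspondingly very poor.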

References

[1] Simple Audio Recognition
[2] tf.confusion_matrix


Recurrent neural network and LSTM

Mar 15, 2018

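A short note on the recurrent neural network (RNN), the other "RNN", which mainly deals with prediction, classification, and recognition of time series. A bit of self-promotion: I discussed residual networks in earlier posts; the links are at the end of this post if you are interested.
This post first covers the motivation and characteristics of the RNN, then the long short-term memory (LSTM) structure proposed to handle long-term dependencies, and finally simple examples of both in TensorFlow. There are many references; I strongly recommend the two articles below!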

Recurrent neural network

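The motivation of the recurrent neural network is to capture the relation between the current input of a time series and earlier information. Structurally, it remembers information through a module called the recurrent cell (see the figure below): the output state $\mathbf{h_{t-1}}$ of the layer at time $t-1$ is stored and becomes part of the input of the module at time $t$, concatenated with $\mathbf{x_t}$ to form the input $[\mathbf{h_{t-1}}, \mathbf{x_{t}}]$.

In theory the recurrence in the cell is unbounded, but in practice the number of unrolled steps is limited to avoid the vanishing-gradient problem. It is set by num_step: the basic module of the recurrent cell is copied and unrolled num_step times. As described in reference [4], the recurrent cell is the foundation of the RNN, and the unrolled copies share their parameters, much like weight sharing in convolutional neural networks. This article gives many application scenarios for RNNs; I particularly like the handwriting recognition example, which shows the effect of weight sharing.

Let $\mathbf{x}(t)$ be the input of the recurrent cell at time $t$ and $\mathbf{h}(t-1)$ its output state at time $t-1$. The output $\mathbf{h}(t)$ of the RNN at time $t$ is then,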

$\begin{equation} \mathbf{h}(t) = \rm{tanh}\left(W\cdot[\mathbf{h}(t-1),\mathbf{x}(t)] + b\right). \end{equation}$

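where $W$ is the weight matrix, whose shape is $[\mathrm{len}(\mathbf{h}) + \mathrm{len}(\mathbf{x}), \mathrm{len}(\mathbf{h})]$ (following TensorFlow's convention, in which the concatenated row vector multiplies $W$ from the left), and $b$ is the bias. The activation function is tanh, which confines the values to $[-1,1]$.

So why tanh rather than the highly regarded ReLU? This Zhihu discussion gives a great explanation, citing the view in Hinton's paper: ReLUs seem inappropriate for RNNs because they can have very large outputs so they might be expected to be far more likely to explode than units that have bounded values. ReLU maps its output to $[0, \infty)$, and the weights are shared across the unrolled cells, so applying Eq. (1) repeatedly amounts to multiplying by $W$ many times; with ReLU this can lead to exploding gradients. tanh instead keeps the output of every step within a bounded range, which mitigates the explosion problem (it does not by itself remove vanishing gradients over long intervals, which is what the LSTM below is for).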
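The effect of the shared weight being applied over and over can be seen in a toy scalar recurrence $h_t = \mathrm{act}(w\,h_{t-1})$. This is a deliberately simplified sketch with a hypothetical weight `w = 1.5` and no input term; it is not the PTB experiment.

```python
import math

def unroll(activation, w=1.5, h0=1.0, num_steps=50):
    """Apply h <- activation(w * h) num_steps times with a shared weight w."""
    h = h0
    for _ in range(num_steps):
        h = activation(w * h)
    return h

def relu(v):
    return max(0.0, v)

h_relu = unroll(relu)        # grows like 1.5 ** 50
h_tanh = unroll(math.tanh)   # stays inside [-1, 1]
print(h_relu, h_tanh)
```

With ReLU the state grows geometrically in the number of steps, whereas tanh keeps it bounded regardless of how many times the cell is unrolled.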

LSTM

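Time-series prediction suffers from long-term dependencies: the network needs to remember information from a moment far from time $t$. A fixed num_step is not suited to this case, and the vanishing gradient over long time intervals cannot be handled. The recurrent cell of the RNN therefore needs to be modified, which leads to the long short-term memory (LSTM) network.

The basic LSTM module is shown in the figure below. Quoting Understanding LSTM, The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt. That is, the core component of the LSTM is the cell state, which stores and remembers earlier context, much like a conveyor belt.

The cell state is updated by three gates together with $\mathbf{x_t}$ and $\mathbf{h_{t-1}}$: (1) the forget gate, (2) the input gate, and (3) the output gate. Each gate uses a sigmoid to map values into $[0,1]$ and multiplies the result element-wise with the signal to be processed, which in effect implements a soft decision.

As in the RNN, the output $\mathbf{h}_{t-1}$ of the LSTM cell at time $t-1$ is first concatenated with the input $\mathbf{x}_{t}$ at time $t$ and then passed through the three gates,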

Forget gate $\begin{equation} \mathbf{f_t} = \sigma\left(W_f \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_f\right). \end{equation}$
Input gate $\begin{equation} \mathbf{i_t} = \sigma\left(W_i \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_i \right). \end{equation}$
Output gate $\begin{equation} \mathbf{o_t} = \sigma\left(W_o \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_o \right). \end{equation}$

The forget gate discards information that is no longer needed in the cell state and acts on $\mathbf{C_{t-1}}$. The input gate decides which information in the cell state is to be updated and acts on $\mathbf{\tilde{C}_t}$, the state candidates. The output gate decides the output state from the updated cell state. Combining the three gates, the cell state and the cell output are updated as follows (products between vectors are element-wise),

Cell state update $\begin{align} \mathbf{\tilde{C}_t} &= \rm{tanh}\left(W_c \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_c \right) \\ \mathbf{C_t} &= \mathbf{f_t}\cdot\mathbf{C_{t-1}} + \mathbf{i_t} \cdot \mathbf{\tilde{C}_t} \end{align}$
Cell output update $\begin{equation} \mathbf{h_{t}} = \rm{tanh}(\mathbf{C_{t}}) \cdot \mathbf{o_t} \end{equation}$
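
The gate equations translate almost line by line into code. The following is a minimal pure-Python sketch of a single LSTM step; the `params` dictionary layout and the helper names are my own assumptions, and real implementations such as BasicLSTMCell fuse the four matrices and add a forget bias.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    hx = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], hx)]
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], hx)]
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], hx)]
    c_tilde = [math.tanh(v) for v in affine(params["Wc"], params["bc"], hx)]
    # Element-wise: forget part of the old state, add gated candidates
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [math.tanh(ck) * ok for ck, ok in zip(c, o)]
    return h, c

# Hidden size 2, input size 1; all-zero weights make every gate equal 0.5
zeros = [[0.0] * 3 for _ in range(2)]
params = {k: zeros for k in ("Wf", "Wi", "Wo", "Wc")}
params.update({k: [0.0, 0.0] for k in ("bf", "bi", "bo", "bc")})
h, c = lstm_step(params, [0.0, 0.0], [1.0, -1.0], [0.3])
print(h, c)
```

With all gates at 0.5 and zero candidates, the new cell state is simply half of the old one, which makes the conveyor-belt picture easy to verify by hand.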

Examples

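Following the PTB language-modeling example with LSTM in Chapter 7 of 《TensorFlow实战》 and the TensorFlow tutorial, I designed a small experiment comparing the performance of the RNN and the LSTM, as well as the effect of the activation function on the RNN. (See the notebooks here: RNN, LSTM)

Below is how to instantiate the classes in tf.contrib.rnn that build RNN and LSTM cells, and how to stack multiple recurrent layers,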

import tensorflow as tf

# RNN
def rnn_cell(num_units, activation, reuse=None):
    return tf.contrib.rnn.BasicRNNCell(
        num_units=num_units,
        activation=activation,
        reuse=reuse)

# LSTM
def lstm_cell(num_units, forget_bias=0.0, state_is_tuple=True, reuse=None):
    return tf.contrib.rnn.BasicLSTMCell(
        num_units=num_units,
        forget_bias=forget_bias,
        state_is_tuple=state_is_tuple,
        reuse=reuse)

# Multiple layers
num_units = 200  # hidden size; an illustrative value
numlayers = 2
cell = tf.contrib.rnn.MultiRNNCell(
    [rnn_cell(num_units, tf.tanh) for _ in range(numlayers)],
    state_is_tuple=True)

In addition, the PTB tutorial for TF uses an adjustable learning rate and gradient clipping to suppress exploding gradients; the code is as follows

# Adjustable learning rate
lr = tf.Variable(0.0, trainable=False, name="learning_rate")
new_lr = tf.placeholder(tf.float32, shape=[], name="new_learning_rate")
lr_update = tf.assign(lr, new_lr)  # use tf.assign to transfer the updated lr

def assign_lr(session, lr_value):
    session.run(lr_update, feed_dict={new_lr: lr_value})

# Gradient clipping
...
max_grad_norm = 5.0  # maximum global norm of the gradients
tvars = tf.trainable_variables()  # get all trainable variables
grads, _ = tf.clip_by_global_norm(
    tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(
    zip(grads, tvars),
    global_step=tf.contrib.framework.get_or_create_global_step())
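
For readers who want to see what clipping by global norm does numerically, here is a plain-Python sketch of the same rule that tf.clip_by_global_norm documents: every gradient is multiplied by `clip_norm / max(global_norm, clip_norm)`. The gradient values are made up so that the norm is a round number.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / max(global_norm, max_norm)
    return [[g * scale for g in vec] for vec in grads], global_norm

grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, clipped_norm)
```

Because all gradients share one scale factor, the direction of the overall update is preserved; gradients whose global norm is already below the threshold pass through unchanged.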

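Now let us compare the RNN and the LSTM. The metric is perplexity, which measures how well the model predicts a sentence: it is derived from the probability the model assigns to the sentence, and the lower the value, the better the model.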
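As a concrete reading of the metric, perplexity is the exponential of the average negative log-probability the model assigns to each token. The sketch below uses made-up token probabilities; the 10,000-word vocabulary matches the PTB setup, but the numbers are otherwise illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform = perplexity([1.0 / 10000] * 20)    # guessing uniformly over 10k words
better = perplexity([0.2, 0.05, 0.5, 0.1])  # a model that favors the right words
print(uniform, better)
```

A model that guesses uniformly has a perplexity equal to the vocabulary size, so a perplexity of a few hundred already means the model has narrowed the choice considerably.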

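It is easy to see that the LSTM outperforms the RNN. Moreover, the RNN with tanh is clearly better than the one with ReLU; training with ReLU also produced the warning RuntimeWarning: overflow encountered in exp, which points to exploding gradients. Finally, I tried adding more RNN layers, but the results did not improve. Perhaps there were too many parameters? Or perhaps I was lazy and did not run enough trials.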

References

[1] Understanding LSTM
[2] The Unreasonable Effectiveness of Recurrent Neural Networks
[3] Tensorflow tutorial
[4] TensorFlow实战Google深度学习框架
[5] TensorFlow实战
[6] RNN中为什么要采用tanh而不是ReLu作为激活函数？

Related posts

Residual network I — block and bottleneck
Residual network II — realize with tensorflow


Train selected variables in tensorflow graph

Feb 04, 2018

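A complaint before I start: I once again swapped the arguments of tf.nn.softmax_cross_entropy_with_logits and lost a whole evening to it. This post is about training selected variables in TensorFlow. This blog post gives a very neat method, which I record here briefly.

1. List the trainable variables and their indices

tf.trainable_variables() returns the variables available for training; the following prints their information,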

variables_names = [v.name for v in tf.trainable_variables()]
values = sess.run(variables_names)
i = 0
for k, v in zip(variables_names, values):
    print(i, "Variable: ", k)
    print("Shape: ", v.shape)
    i += 1

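2. Create train ops and give each its own list of trainable variables

Suppose there are two loss functions, each tied to the variables in a different part of the network. To make each gradient update only its own variables, use the following,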

loss1 = ...
loss2 = ...
var_list1 = tf.trainable_variables()[0:10]
var_list2 = tf.trainable_variables()[10:]
train_op1 = tf.train.AdamOptimizer(learning_rate).minimize(loss1, var_list=var_list1)
train_op2 = tf.train.AdamOptimizer(learning_rate).minimize(loss2, var_list=var_list2)
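
Slicing by index breaks as soon as a layer is added or reordered. An alternative, sketched here in plain Python on hypothetical variable names, is to select variables by their scope prefix. In TensorFlow the same filter would be applied to `v.name` for each `v` in `tf.trainable_variables()`, assuming the two sub-networks were built under separate variable scopes.

```python
def split_by_scope(variable_names, scope):
    """Split names into those under `scope/` and all the others."""
    prefix = scope + "/"
    inside = [n for n in variable_names if n.startswith(prefix)]
    outside = [n for n in variable_names if not n.startswith(prefix)]
    return inside, outside

# Hypothetical names, in the format TensorFlow uses for variables
names = ["encoder/dense/kernel:0", "encoder/dense/bias:0",
         "decoder/dense/kernel:0", "decoder/dense/bias:0"]
enc, rest = split_by_scope(names, "encoder")
print(enc)
print(rest)
```

The index-based version in this post is shorter, so this is only worth the extra line when the network layout is still changing.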

reference

[1] [tensorflow] 在不同层上设置不同的学习率，fine-tuning


Transpose convolution by tensorflow--odd kernel shape

Jan 31, 2018

The auto-encoder is widely used for unsupervised learning and usually consists of two symmetric parts, the encoder and the decoder. An auto-encoder built only from fully connected layers (i.e., a DNN) is easy to implement, but the convolutional (CNN) case is less straightforward.

In the convolutional case, each decoder layer keeps the shape and kernel configuration of its mirror layer in the encoder, so a deconvolution (transposed convolution) is used in place of the convolution.

TensorFlow provides a conv2d_transpose method in both the tf.nn and tf.contrib.layers modules, which is very convenient. However, with tf.contrib.layers.conv2d_transpose, if the desired output size of the transposed convolution is odd and the stride is 2, the output shape cannot be set to the desired one.

For example, take a 4-D tensor $x$ of shape [None, 9, 9, 1]. Convolving it with a [3, 3] kernel, a stride of 2, and half padding (SAME) gives an output $y$ of shape [None, 5, 5, 1]. However, the transposed convolution of $y$ with the same settings produces a tensor $x'$ of shape [None, 10, 10, 1], not [None, 9, 9, 1].
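The mismatch follows from the shape arithmetic of SAME padding: a convolution maps a size `n` to `ceil(n / stride)`, while the transposed convolution by default maps `n` back to `n * stride`. The helper names below are my own; the sketch only reproduces the arithmetic and does not call TensorFlow.

```python
import math

def conv_same_output(size, stride):
    """Spatial output size of a SAME-padded convolution."""
    return math.ceil(size / stride)

def conv_transpose_same_output(size, stride):
    """Default spatial output size of a SAME-padded transposed convolution."""
    return size * stride

encoded = conv_same_output(9, 2)
decoded = conv_transpose_same_output(encoded, 2)
print(encoded, decoded)
```

Both 9 and 10 are mapped to 5 by the forward convolution, so the inverse is ambiguous and the layer picks the even size.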

To handle this, I use a naive but effective workaround, shown below,

import tensorflow as tf
import tensorflow.contrib.layers as layers
x = tf.placeholder(tf.float32, shape=[None, 5, 5, 1])
y = tf.placeholder(tf.float32, shape=[None, 9, 9, 1])
kernel_size = [3, 3]
stride = 2
x_r = layers.conv2d_transpose(
        inputs=x,
        num_outputs=x.get_shape().as_list()[-1],  # number of output channels
        kernel_size=kernel_size,
        padding='SAME',
        stride=stride,
        scope='conv2d_transpose'
        )
x_r = x_r[:, 0:-1, 0:-1, :]

This workaround worked well in my code, though the crop may introduce some bias.


Upsampling for 2D convolution by tensorflow

Jan 27, 2018

A convolutional auto-encoder is usually composed of two symmetric parts, the encoder and the decoder. With TensorFlow, the encoder is easy to build using modules such as tf.contrib.layers or tf.nn, which encapsulate methods for convolution, downsampling, and dense operations.

For the decoder, however, TF does not provide an upsampling method, i.e. the reverse of downsampling (avg_pool2d, max_pool2d). This is probably because max pooling is used more often than average pooling, and recovering an image from a max-pooled matrix is difficult because the locations of the maxima are lost.

For average-pooled feature maps, there is a simple way to implement upsampling with basic TF functions, without a high-level API such as Keras.

Suppose the input is a 4-D tensor of shape [1, 4, 4, 1] and the sampling rate is [1, 2, 2, 1]; the upsampled result is then a 4-D tensor of shape [1, 8, 8, 1]. The following lines implement this operation.

import tensorflow as tf
x = tf.ones([1, 4, 4, 1])
k = tf.ones([2, 2, 1, 1]) # for conv2d_transpose, k.shape = [rows, cols, depth_output, depth_in]
output_shape=[1, 8, 8, 1]
y = tf.nn.conv2d_transpose(
    value=x,
    filter=k,
    output_shape=output_shape,
    strides=[1, 2, 2, 1],
    padding='SAME'
        )
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))

Then, y is the upsampled matrix.
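
With a 2x2 all-ones kernel and a stride of 2, the kernel footprints of neighboring inputs do not overlap, so, as far as I can tell, the transposed convolution simply copies every input value into a 2x2 block. The pure-Python sketch below reproduces that behavior on a small 2-D list so the result can be inspected without running a session.

```python
def upsample_2x(image):
    """Repeat every pixel of a 2-D list into a 2x2 block."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(2)]  # repeat along columns
        out.append(wide)
        out.append(list(wide))                     # repeat along rows
    return out

small = [[1, 2], [3, 4]]
big = upsample_2x(small)
for row in big:
    print(row)
```

Average-pooling `big` with a 2x2 window and stride 2 returns `small`, which is why this form of upsampling pairs naturally with average pooling.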

You can also implement upsampling with the resize_images function of the tf.image module,

y = tf.image.resize_images(
    images=x,
    size=[8, 8],  # [new_height, new_width]
    method=tf.image.ResizeMethod.NEAREST_NEIGHBOR
    )

Enjoy yourself.

References

[1] Transposed convolution arithmetic


Residual network II -- realization by tensorflow

Jan 27, 2018

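Time to fill in the ResNet gap: the previous post introduced the principle of the residual network, and this one discusses how to implement it with TensorFlow.

Although TF provides the slim library, which makes building networks very convenient, I decided to write it with functions from tf.contrib.layers and basic tf functions for the sake of portability and extensibility. The core module of ResNet is the bottleneck, shown in the figure below: the input of each bottleneck travels along two paths that merge at the output, where the residual is computed, and the result becomes the input of the next layer.

Several bottlenecks make up a block, and downsampling is usually applied in the last bottleneck of each block to shrink the feature maps.

See my notebook for the full implementation. Below is a test result on handwritten-digit recognition, compared with the DNN discussed in this post.

ResNet clearly brings a substantial improvement. It should be stressed, though, that because the network is much deeper, training takes a great deal of GPU memory and is a struggle for an ordinary GPU.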


How to apply the batch-normalized net

Jan 25, 2018

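Continuing from this post: how do we test and apply a network that contains Batch Normalization layers? During training, each BN layer computes the mean and variance directly from the input mini-batch; they are not fixed values obtained by learning. At test time these two values therefore have to be supplied explicitly.

The BN paper handles this by collecting the means and variances of all mini-batches used in training and forming unbiased estimates of the population mean and variance, which then stand in for the mean and variance of the test samples. The formulas are,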

$\begin{align} E[x] &= E[\mu_B], \notag \\ \mathrm{Var}[x] &= \frac{m}{m-1} \cdot E[{\sigma_B}^2], \notag \end{align}$

The output of the BN layer is then defined as,

$y = \frac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}}\cdot x + (\beta - \frac{\gamma E[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}).$
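
The two formulas can be tried on scalars. This is a minimal sketch with made-up statistics, where `m` is the mini-batch size; it shows that the inference-time BN layer is just a fixed affine map once the population statistics are frozen.

```python
import math

def unbiased_variance(batch_vars, m):
    """Var[x] = m / (m - 1) * E[sigma_B^2] for mini-batches of size m."""
    return m / (m - 1) * sum(batch_vars) / len(batch_vars)

def bn_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Apply batch normalization with fixed population statistics."""
    scale = gamma / math.sqrt(var + eps)
    return scale * x + (beta - scale * mean)

var = unbiased_variance([3.0, 5.0], m=5)  # 5/4 * 4 = 5.0
y = bn_inference(x=7.0, mean=2.0, var=var, gamma=2.0, beta=1.0, eps=0.0)
print(var, y)
```

The result equals `gamma * (x - mean) / sqrt(var) + beta`, i.e. the usual normalize-then-scale form, only regrouped into a single multiply and add.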

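This leaves a few questions to resolve,

How are the mean and variance passed to BN during training and testing? That is, how is this operation expressed in the computation graph?
How are the mean and variance of each mini-batch collected dynamically for the population estimates moving_mean and moving_variance?

TensorFlow's answer to these questions is the is_training flag: if it is true, the mean and variance are computed for each mini-batch and the network is trained; if it is false, the test path is taken.

Low-level implementation with tf.nn.batch_normalization

TF provides the tf.nn.batch_normalization function for building the network from low-level pieces; it follows the paper of Ioffe & Szegedy directly. Here tf.nn.moments is needed to compute the mean and variance of the mini-batch. See here for the full implementation.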

with tf.name_scope('BatchNorm'):
    axis = list(range(len(x_h.get_shape()) - 1))
    mean, var = tf.nn.moments(x_h, axis)
    with tf.name_scope('gamma'):
        gamma = tf.Variable(tf.constant(0.1, shape=mean.get_shape()))
    with tf.name_scope('beta'):
        beta = tf.Variable(tf.constant(0.1, shape=mean.get_shape()))
    y = tf.nn.batch_normalization(
           x = x_h,
           mean = mean,
           variance = var,
           offset = beta,
           scale = gamma,
           variance_epsilon = 1e-5,
           name= 'BN')

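Implementation with tf.contrib.layers.batch_norm

tf.contrib.layers provides the batch_norm method, a wrapper around tf.nn.batch_normalization that adds arguments such as center and is_training and updates the basic BN algorithm by using moving averages to estimate the mean and variance.

So how do we train and test a network that contains BN layers? The key is to use is_training as a flag that controls where the mean and variance fed to BN come from, and to make sure that the updates of moving_mean and moving_variance are part of the training process.

The official TF recommendation is,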
Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

This blog post explains it even better!
When you execute an operation (such as train_step), only the subgraph components relevant to train_step will be executed. Unfortunately, the update_moving_averages operation is not a parent of train_step in the computational graph, so we will never update the moving averages!

The author's solution: Personally, I think it makes more sense to attach the update ops to the train_step itself. So I modified the code a little and created the following training function

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    # Ensures that we execute the update_ops before performing the train_step
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())

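The same pattern also appears in tf.slim.batch_norm; slim is a higher-level wrapper around tf, and a ResNet-v2-152 implemented with slim can be found here.

Finally, here is an example based on tf.contrib.layers.batch_norm; see my notebook for a more detailed implementation.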

import tensorflow as tf
import tensorflow.contrib.layers as layers
with tf.name_scope('BatchNorm'):
    y = layers.batch_norm(
        x_h,
        center=True,
        scale=True,
        is_training=is_training)

# Train step 
# note: should add update_ops to the train graph
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    with tf.name_scope('train'):
        train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)

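Comparison of an MLP with and without BN

Finally, here is a comparison with and without BN layers; the effect is quite noticeable. However, I also found that, because the number of layers and the FC widths I chose are fairly large, the advantage of BN becomes less pronounced as the number of epochs grows.

Enjoy it !! I finally understand this problem, and I am happy about it.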

References

[1] Ioffe, S. and Szegedy, C., 2015, June. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).
[2] tensorflow 中batch normalize 的使用
[3] docs: batch normalization usage in slim #7469
[4] tf.layers.batch_normalization
[5] TENSORFLOW GUIDE: BATCH NORMALIZATION


Jason Ma
