Variational autoencoder and the conditional case

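This post discusses the variational autoencoder (Variational Auto-Encoder, VAE) and the conditional VAE, which adds a conditioning constraint. It covers how they differ from a traditional autoencoder (AE), the main steps of the derivation, and a TensorFlow implementation.

A traditional autoencoder is an unsupervised learning tool used mainly for feature extraction, dimensionality reduction, and compression. It consists of an encoder and a decoder. The encoder plays the role of the feature-extraction module in a classification model: it extracts features from a sample and compresses them into a short vector. The decoder then decodes this feature vector to recover the original sample. The usual quality metric (or objective function) for an autoencoder is the mean squared error (MSE),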

$\begin{equation} L_{MSE} = \frac{1}{N}\sum^{N}_{i=1}{(X_{\mathrm{in},i} - X_{\mathrm{out},i})^2}, \end{equation}$

where $X_{\mathrm{in},i}$ and $X_{\mathrm{out},i}$ are the encoder input and the decoder output for the $i$-th of $N$ samples.
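As a quick illustration of Eq. (1), here is a minimal pure-Python sketch of the MSE between an input and its reconstruction. The function name `mse` and the toy values are made up for this example and are not part of the original notebook.

```python
def mse(x_in, x_out):
    """Mean squared error between two equally long sequences of floats."""
    if len(x_in) != len(x_out):
        raise ValueError("inputs must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x_in, x_out)) / len(x_in)

# Toy 4-value "sample" and an imperfect reconstruction (made-up numbers).
x_in = [0.0, 0.5, 1.0, 0.25]
x_out = [0.1, 0.5, 0.8, 0.25]
print(mse(x_in, x_out))  # small but non-zero: two of the four values differ
```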

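The problem with a traditional AE is that the decoder can only reconstruct samples from feature vectors that the encoder extracted from existing samples; it is hard to produce new samples by generating feature vectors directly. (Features can of course be generated with something like Monte Carlo simulation or a GMM; see this repo of mine if you are interested.) A traditional AE is therefore not really a generative model.

To address this, Kingma & Welling proposed the variational autoencoder in Auto-Encoding Variational Bayes. Although a VAE is built from the same components as an AE, the two are fundamentally different. A VAE is grounded in Bayesian theory: the encoder and decoder networks model the nonlinear dependence of the distribution parameters on their inputs, and training maximizes the likelihood of the input samples under the decoder's output distribution. In short, a VAE tries to learn the distribution of the data itself and then generates new samples from it.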

Variational auto-encoder

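A VAE first defines a multi-dimensional random variable $z$, called the latent variable, with distribution $P(z)$; its dimension is much smaller than that of $X$. It then defines a mapping $f: z\times \theta \to X$ such that $f$, parameterized by $\theta$ and driven by $z \sim P(z)$, approximates the real samples $X$ (see Eq. (2), where $P(X|z;\theta)$ is the probability distribution of $f(z;\theta)$). The sample $X$ can thus be described through $z$. Because $z$ cannot be observed directly, it is called a latent (hidden) variable.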

$\begin{equation} P(X) = \int{P(X|z;\theta)P(z)dz}. \end{equation}$

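A VAE assumes that $P(z)$ is the standard normal distribution, $P(z) = N(0, I)$, and estimates the parameters $\theta$ so as to maximize Eq. (2). The VAE network is therefore really a tool for learning the distribution $P(X|z;\theta)$, which is quite different from how a traditional AE is understood. As the Tutorial on Variational Autoencoders puts it: “They are called ‘autoencoders’ only because the final training objective that derives from this setup does have an encoder and a decoder, and resembles a traditional autoencoder.”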
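To make Eq. (2) concrete, the sketch below estimates the integral by Monte Carlo for a one-dimensional toy model. The model is an assumption chosen for illustration, not the VAE itself: $z \sim N(0,1)$ and $P(X|z;\theta) = N(\theta z, 1)$, for which the marginal is known in closed form, $P(X) = N(0, \theta^2 + 1)$. The names `marginal_mc` and `marginal_exact` are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def marginal_mc(x, theta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(X) = E_{z ~ N(0,1)}[P(X | z; theta)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                 # draw z from the prior P(z)
        total += normal_pdf(x, theta * z, 1.0)  # toy "decoder": X | z ~ N(theta * z, 1)
    return total / n_samples

def marginal_exact(x, theta):
    """Closed-form marginal of the toy model: X ~ N(0, theta^2 + 1)."""
    return normal_pdf(x, 0.0, theta ** 2 + 1.0)

print(marginal_mc(1.0, 2.0), marginal_exact(1.0, 2.0))
```

Once a neural network replaces the linear map $\theta z$, the integral no longer has a closed form, which is where the approximate posterior $Q(z|X)$ discussed next comes in.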

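The encoder of a VAE can be understood as $P(z|X)$ and the decoder as $P(X|z)$; the goal is to maximize the decoder output $P(X|z)$, or equivalently $\log P(X|z)$. Because $z$ cannot be observed, the posterior $P(z|X)$ cannot be computed directly. The VAE gets around this with a variational approach: it defines a distribution $Q(z|X)$ to approximate $P(z|X)$. In the end, $Q(z|X)$ is the encoder, and $E_{z\sim Q}\log P(X|z)$ represents the decoder.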

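The KL divergence is commonly used to measure how close two distributions are: it is non-negative, and the closer it is to zero, the more similar the two distributions. Define $\mathrm{KL}(Q(z|X)||P(z|X))$ as

$\begin{equation} \mathrm{KL}(Q(z|X)||P(z|X)) = E_{z\sim Q}\left[\log{\frac{Q(z|X)}{P(z|X)}}\right] \end{equation}$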

By Bayes' theorem, $P(z|X) = P(X|z)P(z)/P(X)$, which leads to (see reference 2 for the full derivation)

$\begin{equation} \log{P(X)} - \mathrm{KL}(Q(z|X)||P(z|X)) = E_{z \sim Q}\log{P(X|z)} - \mathrm{KL}(Q(z|X)||P(z)) \end{equation}$

Because $\mathrm{KL}(Q(z|X)||P(z|X))\geq 0$, the right-hand side of the identity above is a lower bound on $\log P(X)$:

$\begin{equation} \log{P(X)} \geq E_{z\sim Q}\log{P(X|z)} - \mathrm{KL}(Q(z|X)||P(z)) \end{equation}$
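The bound can be checked numerically on a toy model where everything is available in closed form. The sketch below assumes a one-dimensional model that is not from the original post: $z \sim N(0,1)$, $P(X|z) = N(z, 1)$, and $Q(z|X) = N(m, s^2)$. For this model $P(X) = N(0, 2)$ and the true posterior is $N(X/2, 1/2)$, so $E_{z\sim Q}\log P(X|z) - \mathrm{KL}(Q(z|X)||P(z))$ should never exceed $\log P(X)$ and should match it when $Q$ equals the posterior. The function names are hypothetical.

```python
import math

def log_evidence(x):
    """log P(X) for the toy model, where X ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s):
    """E_{z~Q} log P(X|z) - KL(Q || P(z)) for Q = N(m, s^2)."""
    # E_Q[(x - z)^2] = (x - m)^2 + s^2 for z ~ N(m, s^2)
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # KL(N(m, s^2) || N(0, 1)) in closed form
    kl = 0.5 * (m ** 2 + s ** 2 - math.log(s ** 2) - 1.0)
    return expected_log_lik - kl

x = 1.2
# The last (m, s) pair is the exact posterior N(x/2, 1/2).
for m, s in [(0.0, 1.0), (1.0, 0.5), (x / 2.0, math.sqrt(0.5))]:
    gap = log_evidence(x) - elbo(x, m, s)  # equals KL(Q || P(z|X)), so it is >= 0
    print(f"m={m:.2f} s={s:.3f} gap={gap:.6f}")
```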

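The two terms on the right-hand side of Eq. (5) correspond to the decoder and encoder losses of the VAE, respectively, and together they form the objective to maximize, which is very similar in spirit to a GAN. The reconstruction term $E_{z\sim Q}\log P(X|z)$ is usually implemented as a negated mean squared error or a negated cross-entropy. The term $\mathrm{KL}(Q(z|X)||P(z))$ is derived in detail here, since its closed form is what gets implemented in TensorFlow. Assume that $Q(z|X)$ is the normal distribution $N(\mu_Q, \sigma^2_Q I)$ and that $P(z)$ is the standard normal distribution, so $\mu_P = 0$ and $\sigma_P = 1$. For a single latent dimension: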

$\begin{align} \mathrm{KL}\left(Q(z|X)||P(z)\right) &= \int{Q(z|X)\log{\frac{Q(z|X)}{P(z)}}dz} \notag \\ &= \int{Q(z|X)\log{\frac{\frac{1}{\sqrt{2\pi}\sigma_Q}e^{-\frac{(z-\mu_Q)^2}{2\sigma^2_Q}}}{\frac{1}{\sqrt{2\pi}\sigma_P}e^{-\frac{(z-\mu_P)^2}{2\sigma^2_P}}}}}dz \notag \\ &= \int{Q(z|X)}\left(\log{\frac{\sigma_P}{\sigma_Q}}-\frac{(z-\mu_Q)^2}{2\sigma^2_Q} + \frac{(z-\mu_P)^2}{2\sigma^2_P} \right)dz \notag \\ &= \log{\sigma_P} - \log{\sigma_Q} - 0.5 + \int{Q(z|X)}\frac{(z-\mu_Q + \mu_Q - \mu_P)^2}{2\sigma^2_P}dz \notag \\ &= \log{\sigma_P} - \log{\sigma_Q} - 0.5 \notag \\ &\quad + \frac{1}{2\sigma^2_P}\int{Q(z|X)}\left[(z-\mu_Q)^2 + (\mu_Q - \mu_P)^2 + 2(z-\mu_Q)(\mu_Q-\mu_P)\right]dz \notag \\ &= \log{\sigma_P} - \log{\sigma_Q} - 0.5 + \frac{\sigma^2_Q}{2\sigma^2_P} + \frac{(\mu_Q - \mu_P)^2}{2\sigma^2_P} \notag \\ &= -0.5 - \log{\sigma_Q} + \frac{\sigma^2_Q}{2} + \frac{\mu_Q^2}{2} \end{align}$
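As a sanity check on the last line of the derivation, the sketch below compares the closed form against a direct numerical integration of $\int Q \log(Q/P)\,dz$ for one latent dimension, with $P(z) = N(0, 1)$. The helper names, the integration range, and the grid size are arbitrary choices made for this example.

```python
import math

def log_normal_pdf(z, mu, sigma):
    """Log density of N(mu, sigma^2) at z."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - (z - mu) ** 2 / (2.0 * sigma ** 2)

def kl_closed_form(mu_q, sigma_q):
    """Closed form of KL(N(mu_q, sigma_q^2) || N(0, 1)) from the derivation."""
    return -0.5 - math.log(sigma_q) + 0.5 * sigma_q ** 2 + 0.5 * mu_q ** 2

def kl_numeric(mu_q, sigma_q, lo=-20.0, hi=20.0, n=40001):
    """Trapezoidal approximation of the defining integral on [lo, hi]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        log_q = log_normal_pdf(z, mu_q, sigma_q)
        log_p = log_normal_pdf(z, 0.0, 1.0)
        weight = 0.5 if i in (0, n - 1) else 1.0  # half weight at the two end points
        total += weight * math.exp(log_q) * (log_q - log_p)
    return total * h

print(kl_closed_form(1.0, 0.5), kl_numeric(1.0, 0.5))
```

The closed form is zero only at $\mu_Q = 0$, $\sigma_Q = 1$, that is, when $Q(z|X)$ coincides with the prior.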

Conditional Variational auto-encoder

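A VAE can only generate samples that follow the overall data distribution. For datasets such as MNIST, where every sample has an explicit label, can we impose a condition so that the VAE generates handwritten digits of a specified class? This motivation led to the conditional VAE. Let $Y$ denote the sample label; using $Y$ as an additional constraint on both the encoder and the decoder (see Figure 1(c)) achieves this. The log-likelihood identity for the conditional VAE becomes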

$\begin{equation} \log{P(X|Y)} - \mathrm{KL}(Q(z|X,Y)||P(z|X,Y)) = E_{z\sim Q}\log P(X|z,Y) - \mathrm{KL}(Q(z|X,Y)||P(z)) \notag \end{equation}$

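The objective of the conditional VAE has essentially the same form as that of the VAE and still splits into an encoder part and a decoder part. Writing the prior as $P(z)$ assumes that $z$ is independent of $Y$, so that $P(z|Y) = P(z)$.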
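The post does not spell out how $Y$ enters the networks. One common choice, assumed here purely for illustration, is to one-hot encode the label and concatenate it to the encoder input $X$ and to the decoder input $z$. The sketch below only shows that bookkeeping in plain Python; `one_hot` and `condition_on_label` are hypothetical helpers, and the sizes 784, 32, and 10 assume flattened MNIST digits with a 32-dimensional latent variable.

```python
def one_hot(label, n_classes=10):
    """One-hot encode an integer class label as a list of floats."""
    if not 0 <= label < n_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def condition_on_label(vector, label, n_classes=10):
    """Append the one-hot label Y to a vector (an input X or a latent z)."""
    return list(vector) + one_hot(label, n_classes)

x = [0.0] * 784   # stand-in for a flattened image
z = [0.0] * 32    # stand-in for a latent sample
encoder_input = condition_on_label(x, 3)  # what Q(z|X,Y) would receive
decoder_input = condition_on_label(z, 3)  # what P(X|z,Y) would receive
print(len(encoder_input), len(decoder_input))  # prints: 794 42
```

Under this assumption, generating a chosen digit would amount to sampling $z$ from $N(0, I)$, appending the one-hot code of the desired class, and running only the decoder.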

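TensorFlow implementation of the VAE and conditional VAE

Here is a brief walk-through of the VAE implementation in TensorFlow (the code uses the TensorFlow 1.x API); see the notebook here for the details.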

Init

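import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.layers)

batch_size = 64
# Flattened 784-pixel images: network input and reconstruction target
X_in = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='X_in')
X_out = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='X_out')
# Dropout keep probability, fed at run time
keep_prob = tf.placeholder(dtype=tf.float32, shape=(), name='keep_prob')
n_latent = 32  # dimension of the latent variable z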

Encoder

def encoder(X_in, keep_prob):
    with tf.variable_scope("encoder", reuse=None):
        x = tf.layers.dense(X_in, units=128, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)        
        x = tf.layers.dense(x, units=64, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)        
        # The latent layer
        mu = tf.layers.dense(x, units=n_latent)
        sigma = 1e-6 + tf.nn.softplus(tf.layers.dense(x, units=n_latent)) # softplus to avoid negative sigma
        # reparameterization
        epsilon = tf.random_normal(tf.shape(mu))
        z  = mu + tf.multiply(epsilon, sigma)        
        return z, mu, sigma
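
The reparameterization step in the encoder, `z = mu + epsilon * sigma`, moves the randomness into `epsilon` so that `z` stays a differentiable function of `mu` and `sigma`. The pure-Python sketch below (hypothetical helper `reparameterize`, scalar case, made-up values for `mu` and `sigma`) only checks the distributional claim: samples built this way have mean `mu` and standard deviation `sigma`.

```python
import random
import statistics

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [reparameterize(1.5, 0.3, rng) for _ in range(20_000)]
# Empirical mean and standard deviation should be close to 1.5 and 0.3.
print(statistics.fmean(samples), statistics.pstdev(samples))
```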

Decoder

def decoder(z, keep_prob):
    with tf.variable_scope("decoder", reuse=None):
        x = tf.layers.dense(z, units=64, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.dense(x, units=128, activation=tf.nn.relu)
        x = tf.nn.dropout(x, keep_prob)
        # sigmoid constrains the output to (0, 1)
        x = tf.layers.dense(x, units=784, activation=tf.nn.sigmoid)
        return x

Network instance

z, mu, sigma = encoder(X_in, keep_prob)
dec = decoder(z, keep_prob)

Loss function

loss_d = - tf.reduce_sum(
    X_out * tf.log(1e-8+dec) + (1.0 - X_out) * tf.log(1e-8+ 1.0 - dec), 1)
loss_e = 0.5 * tf.reduce_sum(
    tf.square(mu) + tf.square(sigma) - tf.log(1e-8 + tf.square(sigma)) - 1, 1)
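# Mind the signs: loss_d is -E[log P(X|z)] and loss_e is +KL(Q(z|X)||P(z)),
# so minimizing their sum maximizes the lower bound on log P(X).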
loss = tf.reduce_mean(loss_d + loss_e)
optimizer = tf.train.AdamOptimizer(0.0001).minimize(loss)

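Different definitions of $E_{z\sim Q}\log P(X|z)$ and a comparison of results

Finally, one result: I compared cross-entropy and mean squared error as the loss expressing $E_{z\sim Q}\log P(X|z)$. With all other parameters the same, MSE appears to work somewhat better, as shown in the figure below.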
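
The post does not show the MSE variant, so the following is only a sketch of the two per-sample reconstruction terms being compared, written in pure Python with hypothetical function names. Under a Bernoulli decoder, $-\log P(X|z)$ is the cross-entropy used in `loss_d` above; under a Gaussian decoder with fixed variance, it reduces to a squared error up to a scale factor and an additive constant.

```python
import math

def cross_entropy_loss(x, x_hat, eps=1e-8):
    """Bernoulli negative log-likelihood, summed over pixels (mirrors loss_d)."""
    return -sum(a * math.log(eps + b) + (1.0 - a) * math.log(eps + 1.0 - b)
                for a, b in zip(x, x_hat))

def squared_error_loss(x, x_hat):
    """Sum of squared errors over pixels (assumed form of the MSE variant)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

target = [1.0, 0.0, 1.0, 0.0]      # made-up binary "pixels"
good = [0.9, 0.1, 0.8, 0.2]        # close reconstruction
bad = [0.4, 0.6, 0.5, 0.5]         # poor reconstruction
for name, recon in [("good", good), ("bad", bad)]:
    print(name, cross_entropy_loss(target, recon), squared_error_loss(target, recon))
```

Both losses rank the two reconstructions the same way, but cross-entropy penalizes confident mistakes much more steeply, which is one plausible reason the two choices can train differently.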


Jason Ma
