Recurrent neural network and LSTM

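A brief note on the recurrent neural network (RNN), another kind of RNN, which mainly targets prediction, classification, and recognition problems on time series. A quick plug: I discussed residual neural networks in earlier posts; if you are interested, have a look, the links are at the end of this post.
This post first discusses the motivation and characteristics of the RNN; then the long short term memory (LSTM) structure, which was proposed to handle long-term dependencies; and finally simple examples of both in TensorFlow. There are many references; I strongly recommend the two articles below!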

Recurrent neural network

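The motivation of a recurrent neural network is to capture the relation between the current input of a time series and the information that came before it. Structurally, an RNN remembers information through a module called the recurrent cell (see the figure below): the output state $\mathbf{h_{t-1}}$ of the layer at time $t-1$ is stored and becomes part of the module's input at time $t$, where it is concatenated with $\mathbf{x_t}$ to form the input $[\mathbf{h_{t-1}}, \mathbf{x_{t}}]$.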

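Clearly, the recurrence in the cell is unbounded in theory, but in practice the number of iterations is limited to avoid the vanishing gradient problem. The limit is set by num_step, i.e. the basic module of the cell is copied and unrolled num_step times. As described in [4], the recurrent cell is the foundation of the RNN, and the unrolled copies of the cell share their parameters. This resembles weight sharing in convolutional neural networks. The article gives a great many application scenarios for RNNs; I really like the handwriting recognition problem in it, which shows the effect of weight sharing.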

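Let the input of the recurrent cell at time $t$ be $\mathbf{x}(t)$ and the output state of the cell at time $t-1$ be $\mathbf{h}(t-1)$. The RNN output $\mathbf{h}(t)$ at time $t$ is then

$\begin{equation} \mathbf{h}(t) = \tanh\left(W\cdot[\mathbf{h}(t-1),\mathbf{x}(t)] + b\right). \end{equation}$

Here $W$ is the weight matrix with shape $[\mathrm{len}(\mathbf{h}) + \mathrm{len}(\mathbf{x}), \mathrm{len}(\mathbf{h})]$ (with this shape the concatenated vector multiplies $W$ as a row vector), and $b$ is the bias. The activation function is tanh, which confines the values to $(-1,1)$.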
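Equation (1) and the parameter sharing across unrolled steps can be sketched in a few lines of plain Python. This is only an illustration, not the TensorFlow implementation: the helper names `rnn_step` and `run_rnn`, the sizes, and the random initialization are all assumptions made for the example.

```python
import math
import random

def rnn_step(h_prev, x, W, b):
    """One step of equation (1): h(t) = tanh([h(t-1), x(t)] W + b)."""
    concat = h_prev + x  # list concatenation gives [h(t-1), x(t)]
    return [
        math.tanh(sum(concat[i] * W[i][j] for i in range(len(concat))) + b[j])
        for j in range(len(b))
    ]

def run_rnn(xs, n_hidden, W, b):
    """Unroll the cell over a sequence; the same W and b are used at every step."""
    h = [0.0] * n_hidden  # zero initial state
    for x in xs:
        h = rnn_step(h, x, W, b)
    return h

random.seed(0)
n_hidden, n_input, num_step = 3, 2, 5  # illustrative sizes (assumption)
# W has shape [len(h) + len(x), len(h)], as stated in the text
W = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
     for _ in range(n_hidden + n_input)]
b = [0.0] * n_hidden
xs = [[random.uniform(-1, 1) for _ in range(n_input)] for _ in range(num_step)]
h_final = run_rnn(xs, n_hidden, W, b)
print(h_final)  # every entry lies strictly inside (-1, 1)
```

Note that `run_rnn` never creates new weights inside the loop; that is the weight sharing described above.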

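So why tanh rather than the highly reputed ReLU? This Zhihu discussion gives a great explanation, citing the view in Hinton's paper: "ReLUs seem inappropriate for RNNs because they can have very large outputs so they might be expected to be far more likely to explode than units that have bounded values." ReLU maps its output into $[0, \infty)$, and the weights are shared across the unrolled cells, so applying equation (1) repeatedly amounts to multiplying by $W$ over and over, and with ReLU this can make the values and gradients explode. tanh instead keeps the output of every step within a bounded range, which guards against this explosion. It does not by itself remove the vanishing gradient problem over long time spans, which is what motivates the LSTM.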
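The effect of a bounded versus an unbounded activation under a shared weight can be seen in a scalar toy recurrence. The sketch below is not taken from the cited paper or from the experiment in this post; the function `unroll` and the values of `w`, `x`, and `num_step` are assumptions chosen only to make the growth visible.

```python
import math

def relu(z):
    return max(0.0, z)

def unroll(activation, w, x, num_step):
    """Apply h = activation(w * h + x) num_step times with one shared scalar weight."""
    h = 0.0
    for _ in range(num_step):
        h = activation(w * h + x)
    return h

# Illustrative values (assumption); |w| > 1 is the regime where ReLU blows up
w, x, num_step = 1.5, 1.0, 50
h_tanh = unroll(math.tanh, w, x, num_step)
h_relu = unroll(relu, w, x, num_step)
print(h_tanh)  # stays inside (-1, 1)
print(h_relu)  # grows roughly like w ** num_step
```

With ReLU the state follows $h_{n+1} = 1.5\,h_n + 1$, so it grows geometrically with the number of steps, while tanh pins it below 1 however long the sequence is.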

LSTM

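Time-series prediction suffers from the problem of long-term dependencies: the network needs to remember information from a moment far away from time $t$. A fixed num_step does not suit this situation, and the vanishing gradient over long time intervals cannot be handled. The recurrent cell of the RNN therefore needs to be modified, which gives the long short term memory (LSTM) network.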

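The basic LSTM module is shown in the figure below. Quoting the explanation in Understanding LSTM [1]: "The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt." In other words, the core component of the LSTM is the cell state, which stores and remembers information from the preceding context, much like a conveyor belt.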

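The cell state is updated jointly by three gates together with $\mathbf{x_t}$ and $\mathbf{h_{t-1}}$. The gates are called (1) the forget gate, (2) the input gate, and (3) the output gate. Each gate uses a sigmoid function to squash its values into the interval $(0,1)$ and is then multiplied element-wise with the signal to be processed, which in essence implements a soft decision.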

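As in the RNN, the output $\mathbf{h}_{t-1}$ of the LSTM cell at time $t-1$ is first concatenated with the input $\mathbf{x}_{t}$ at time $t$, and the result is passed through each of the three gates: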

Forget gate $\begin{equation} \mathbf{f_t} = \sigma\left(W_f \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_f\right). \end{equation}$
Input gate $\begin{equation} \mathbf{i_t} = \sigma\left(W_i \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_i \right). \end{equation}$
Output gate $\begin{equation} \mathbf{o_t} = \sigma\left(W_o \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_o \right). \end{equation}$

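The forget gate discards information in the cell state that is no longer needed and acts on $\mathbf{C_{t-1}}$; the input gate decides which information in the cell state is to be updated and acts on $\mathbf{\tilde{C}_t}$, the state candidates; the output gate decides the output state from the updated cell state. Combining the three gates, the cell state and the cell output can be updated: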

Cell state update $\begin{align} \mathbf{\tilde{C}_t} &= \tanh\left(W_c \cdot [\mathbf{h_{t-1}}, \mathbf{x_{t}}] + b_c \right) \\ \mathbf{C_t} &= \mathbf{f_t}\odot\mathbf{C_{t-1}} + \mathbf{i_t} \odot \mathbf{\tilde{C}_t} \end{align}$
Cell output update $\begin{equation} \mathbf{h_{t}} = \tanh(\mathbf{C_{t}}) \odot \mathbf{o_t} \end{equation}$
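The gate and update equations above translate almost line by line into code. The following plain-Python sketch of a single LSTM step is meant only as a reading aid: the names `lstm_step`, `affine`, and `init_gate`, the `params` dictionary layout, and the sizes are assumptions for this example, and it uses the row-vector convention $[\mathbf{h}, \mathbf{x}]\,W$ rather than TensorFlow's internal layout.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(v, W, b):
    """Row-vector convention: returns v W + b, with W of shape [len(v), len(b)]."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j] for j in range(len(b))]

def lstm_step(h_prev, c_prev, x, params):
    """One LSTM step; params maps "f", "i", "o", "c" to a (W, b) pair."""
    concat = h_prev + x                                          # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(concat, *params["f"])]     # forget gate
    i_t = [sigmoid(z) for z in affine(concat, *params["i"])]     # input gate
    o_t = [sigmoid(z) for z in affine(concat, *params["o"])]     # output gate
    c_tilde = [math.tanh(z) for z in affine(concat, *params["c"])]  # state candidates
    # C_t = f_t * C_{t-1} + i_t * C~_t (element-wise)
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = tanh(C_t) * o_t (element-wise)
    h_t = [math.tanh(c) * o for c, o in zip(c_t, o_t)]
    return h_t, c_t

def init_gate(n_in, n_hidden):
    """Random weights and zero bias for one gate (illustrative initialization)."""
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    return W, [0.0] * n_hidden

random.seed(0)
n_hidden, n_input = 3, 2  # illustrative sizes (assumption)
params = {name: init_gate(n_hidden + n_input, n_hidden) for name in ("f", "i", "o", "c")}
h, c = [0.0] * n_hidden, [0.0] * n_hidden
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, c = lstm_step(h, c, x, params)
print(h, c)
```

The cell state `c_t` is only ever scaled by the forget gate and added to, which is the "conveyor belt" behaviour: when `f_t` is close to 1 and `i_t` close to 0, the old state passes through almost unchanged.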

Examples

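Following the PTB-based LSTM language-model example in Chapter 7 of 《TensorFlow实战》 and the TensorFlow tutorial, I designed a small experiment to compare the performance of RNN and LSTM, as well as the effect of the activation function on the RNN. (The detailed notebooks are here: RNN, LSTM)

Here is how to instantiate the classes that tf.contrib.rnn provides for building RNN and LSTM cells, and how to stack multiple recurrent layers: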

import tensorflow as tf  # TensorFlow 1.x (tf.contrib does not exist in 2.x)

# RNN
def rnn_cell(num_units, activation=tf.tanh, reuse=None):
    return tf.contrib.rnn.BasicRNNCell(
        num_units=num_units,
        activation=activation,
        reuse=reuse)

# LSTM
def lstm_cell(num_units, forget_bias=0.0, state_is_tuple=True, reuse=None):
    return tf.contrib.rnn.BasicLSTMCell(
        num_units=num_units,
        forget_bias=forget_bias,
        state_is_tuple=state_is_tuple,
        reuse=reuse)

# Multiple layers
num_units = 200  # hidden size, example value only
num_layers = 2
# Build a fresh cell per layer; swap in lstm_cell to stack LSTM layers instead
attn_cell = lambda: rnn_cell(num_units)
cell = tf.contrib.rnn.MultiRNNCell(
    [attn_cell() for _ in range(num_layers)],
    state_is_tuple=True)

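In addition, the PTB tutorial of TensorFlow uses an adjustable learning rate, as well as gradient clipping to suppress the exploding gradient problem. The code is as follows: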

# Adjustable learning rate
lr = tf.Variable(0.0, trainable=False)  # current learning rate, not trained
new_lr = tf.placeholder(tf.float32, shape=[], name="new_learning_rate")
lr_update = tf.assign(lr, new_lr)  # use tf.assign to transfer the updated lr
def assign_lr(session, lr_value):
    session.run(lr_update, feed_dict={new_lr: lr_value})
# Gradient clipping
...
max_grad_norm = 5.0  # maximum global norm of the gradients
tvars = tf.trainable_variables()  # get all trainable variables
# cost is the scalar training loss defined in the elided part above
grads, _ = tf.clip_by_global_norm(
    tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(
    zip(grads, tvars),
    global_step=tf.contrib.framework.get_or_create_global_step())
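To make the clipping step concrete, here is a plain-Python sketch of clipping by global norm: all gradients are rescaled by the same factor whenever their joint L2 norm exceeds the threshold. It mirrors the rule `tf.clip_by_global_norm` applies, but it is a standalone re-implementation for illustration, operating on nested lists instead of tensors; the example gradients are made up.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """grads: one list of floats per variable. Returns (clipped grads, global norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is 1.0 when the norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

# Made-up gradients for two variables; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)     # 13.0
print(clipped)  # every entry scaled by 5/13
```

Because a single factor is applied to all variables, the direction of the overall update is preserved and only its length is limited.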

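Now let us compare RNN and LSTM. The metric is perplexity, which measures how well the model estimates the probability of a sentence; the lower the value, the better the model.

It is easy to see that the LSTM outperforms the RNN. Moreover, the RNN with tanh is significantly better than the one with ReLU; during training the warning RuntimeWarning: overflow encountered in exp also appeared, which suggests that gradient explosion occurred. Finally, I tried increasing the number of RNN layers, but the results did not improve; perhaps there were too many parameters? Or maybe I was lazy and did not run enough training and test trials...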
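Perplexity is commonly computed as the exponential of the average negative log-probability that the model assigns to the words that actually occur. The sketch below illustrates that definition in plain Python; the function name and the vocabulary size are assumptions for the example, not values from the experiment.

```python
import math

def perplexity(target_probs):
    """target_probs: probability the model gave to each word that actually came next."""
    avg_neg_log = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_neg_log)

vocab_size = 10000  # illustrative vocabulary size (assumption)
# A model that guesses uniformly over the vocabulary
uniform = perplexity([1.0 / vocab_size] * 20)
# A model that is always certain and always right
perfect = perplexity([1.0] * 20)
print(uniform, perfect)
```

A uniform guesser has perplexity equal to the vocabulary size and a perfect model has perplexity 1, so the value can be read as the effective number of words the model is still choosing between.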

References

[1] Understanding LSTM
[2] The Unreasonable Effectiveness of Recurrent Neural Networks
[3] Tensorflow tutorial
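[4] TensorFlow实战Google深度学习框架 (book, in Chinese)
[5] TensorFlow实战 (book, in Chinese)
[6] RNN中为什么要采用tanh而不是ReLu作为激活函数？ (Zhihu question: why do RNNs use tanh rather than ReLU as the activation function?)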

Related posts

Residual network I — block and bottleneck
Residual network II — realize with tensorflow

Jason Ma