Continuing to fill in the gaps of this article: how do we test and deploy a network that contains Batch Normalization layers? During training, each BN layer computes the mean and variance directly from the input mini-batch; they are not fixed values obtained through learning. When testing the network, these two values therefore have to be supplied explicitly.

The approach in the BN paper is to collect the means and variances of all mini-batches seen during training and use them as unbiased estimates of the population mean and variance, which then stand in for the statistics of the test samples. The formulas are as follows:
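
With $\mu_B$ and $\sigma_B^2$ denoting the mean and variance of a training mini-batch $B$ of size $m$, the estimates given in [1] are

$$
E[x] \leftarrow E_B[\mu_B], \qquad \mathrm{Var}[x] \leftarrow \frac{m}{m-1}\, E_B\!\left[\sigma_B^2\right]
$$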

The output of the BN layer at test time is then defined as:
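
Again following [1], with learned scale $\gamma$ and offset $\beta$:

$$
y = \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \epsilon}}\, x + \left(\beta - \frac{\gamma\, E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}\right)
$$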

This leaves a few questions to answer:

  1. How do we pass mean and variance to BN during training and testing? That is, how is this computation expressed in the computational graph?
  2. How do we dynamically collect the mean and variance of every mini-batch to form the unbiased population estimates moving_mean and moving_variance?

TensorFlow's answer to these questions is the is_training flag: if it is True, each mini-batch computes its own mean and variance and the network trains on them; if it is False, the network switches to the test-time path.
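
A minimal sketch of wiring up this flag (my own illustration; the placeholder name is an assumption, not from the original code), so that the same graph can be run in both modes:

import tensorflow as tf

# Scalar boolean that defaults to False (test mode); feed True while training.
is_training = tf.placeholder_with_default(False, shape=(), name='is_training')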

Low-level implementation based on tf.nn.batch_normalization

TF provides the tf.nn.batch_normalization function for building the layer from low-level ops, directly following the Ioffe & Szegedy paper. Here tf.nn.moments is used to compute the mean and variance of the mini-batch; see here for the full implementation.

with tf.name_scope('BatchNorm'):
    # Normalize over all axes except the last (the feature/channel axis).
    axis = list(range(len(x_h.get_shape()) - 1))
    mean, var = tf.nn.moments(x_h, axis)
    with tf.name_scope('gamma'):
        gamma = tf.Variable(tf.constant(0.1, shape=mean.get_shape()))
    with tf.name_scope('beta'):
        beta = tf.Variable(tf.constant(0.1, shape=mean.get_shape()))
    y = tf.nn.batch_normalization(
        x=x_h,
        mean=mean,
        variance=var,
        offset=beta,
        scale=gamma,
        variance_epsilon=1e-5,
        name='BN')
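
In this low-level version mean and var always come from the current batch, which only covers training. One common way to also answer question 2 above, roughly the pattern described in [5], is to keep exponential moving averages of the batch statistics and switch between them with tf.cond. A sketch under that assumption, taking is_training to be a scalar boolean tensor as in the earlier placeholder example:

# Shadow variables that track moving averages of the batch statistics.
ema = tf.train.ExponentialMovingAverage(decay=0.99)

def mean_var_with_update():
    # Update the moving averages, then return the current batch statistics.
    ema_apply_op = ema.apply([mean, var])
    with tf.control_dependencies([ema_apply_op]):
        return tf.identity(mean), tf.identity(var)

# Training: batch statistics (and update the averages);
# testing: the accumulated moving averages.
use_mean, use_var = tf.cond(is_training,
                            mean_var_with_update,
                            lambda: (ema.average(mean), ema.average(var)))

use_mean and use_var would then replace mean and var in the tf.nn.batch_normalization call above.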

Implementation based on tf.contrib.layers.batch_norm

tf.contrib.layers provides the batch_norm method, a wrapper around tf.nn.batch_normalization that adds arguments such as center and is_training, and updates the basic BN algorithm to estimate the mean and variance with moving averages.

So how do we train and test a network that contains BN layers? The key points are to use is_training as the flag that controls where the mean and variance fed to BN come from, and to make sure the updates of moving_mean and moving_variance are hooked into the training process.

TF's official documentation explains the recommended approach as follows:
Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

This blog post explains the issue even better:
When you execute an operation (such as train_step), only the subgraph components relevant to train_step will be executed. Unfortunately, the update_moving_averages operation is not a parent of train_step in the computational graph, so we will never update the moving averages!

The author's solution: "Personally, I think it makes more sense to attach the update ops to the train_step itself. So I modified the code a little and created the following training function"

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    # Ensures that we execute the update_ops before performing the train_step
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())

The same pattern also shows up in tf.slim's batch_norm; slim is a higher-level wrapper around TF, and a ResNet-v2-152 implemented with slim can be found here.
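
For reference, slim models usually do not call batch_norm layer by layer; they attach it to every convolution through arg_scope. A rough sketch of that pattern (layer sizes and the decay value are illustrative, not taken from the linked ResNet code):

import tensorflow.contrib.slim as slim

def conv_net(x, is_training):
    # batch_norm is applied as the normalizer of every slim.conv2d layer.
    with slim.arg_scope([slim.conv2d],
                        normalizer_fn=slim.batch_norm,
                        normalizer_params={'is_training': is_training,
                                           'decay': 0.997}):
        net = slim.conv2d(x, 64, [3, 3], scope='conv1')
        net = slim.conv2d(net, 64, [3, 3], scope='conv2')
    return net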

Finally, here is an example based on tf.contrib.layers.batch_norm; a more complete implementation is in my notebook.

import tensorflow as tf
import tensorflow.contrib.layers as layers
with tf.name_scope('BatchNorm'):
    y = layers.batch_norm(
        x_h,
        center=True,
        scale=True,
        is_training=is_training)

# Train step
# note: should add update_ops to the train graph
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    with tf.name_scope('train'):
        train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
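
At run time the only thing that changes between the two phases is the value fed for is_training (assuming it was created as a placeholder, as sketched earlier); x, train_batch and test_batch are hypothetical names for the input placeholder and data, not taken from the notebook:

# Training step: uses batch statistics and updates moving_mean / moving_variance.
sess.run(train_step, feed_dict={x: train_batch, is_training: True})

# Evaluation: uses the accumulated moving_mean / moving_variance instead.
test_loss = sess.run(cross_entropy, feed_dict={x: test_batch, is_training: False})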

Comparing an MLP with and without BN

Finally, here is a comparison of results with and without the BN layer; the improvement is quite noticeable. I did find, though, that because the network depth and FC widths I chose are fairly large, the advantage of BN becomes less obvious as the number of epochs grows...

Enjoy it!! I've finally figured this one out, and I'm happy.

References

[1] Ioffe, S. and Szegedy, C., 2015, June. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).
[2] Using batch normalization in TensorFlow (tensorflow 中 batch normalize 的使用)
[3] docs: batch normalization usage in slim #7469
[4] tf.layers.batch_normalization
[5] TENSORFLOW GUIDE: BATCH NORMALIZATION