The confusion matrix evaluates a classifier by comparing its predictions against the true labels, and is commonly used for multi-class problems. TF's audio recognition tutorial comments on it as follows: "This matrix can be more useful than just a single accuracy score because it gives a good summary of what mistakes the network is making." In other words, the confusion matrix helps identify the classes on which the classifier performs worst, and is more intuitive than a single accuracy figure.
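As a quick illustration, here is a minimal sketch of building a confusion matrix with scikit-learn's `confusion_matrix`; the labels below are made up purely for demonstration.

```python
# A minimal sketch of a confusion matrix; the labels are invented.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]  # ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0, 2]  # classifier predictions

# Rows are true classes, columns are predicted classes;
# the off-diagonal entries show which classes get confused.
print(confusion_matrix(y_true, y_pred))
```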
So why use tanh instead of ReLU, which enjoys such a high reputation? This Zhihu discussion gives a great explanation, citing the view from Hinton's paper: "ReLUs seem inappropriate for RNNs because they can have very large outputs so they might be expected to be far more likely to explode than units that have bounded values." ReLU only constrains its output to $[0,~+\infty)$, while the recurrent weights in an RNN are shared across time steps; applying Eq. (1) repeatedly amounts to multiplying by the same weight matrix $W$ over and over, so the ReLU function can lead to exploding gradients. By contrast, tanh keeps each layer's output within a bounded range, avoiding both the vanishing and the exploding gradient problems.
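To make this concrete, here is a toy numpy sketch (not from the tutorial or the paper): the same weight matrix is applied repeatedly, as an RNN does across time steps. Under ReLU the activations typically blow up, while tanh keeps them bounded; the matrix size and scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=1.1, size=(8, 8))    # shared recurrent weights
h_relu = rng.normal(size=8)
h_tanh = h_relu.copy()

for _ in range(50):                       # 50 "time steps"
    h_relu = np.maximum(W @ h_relu, 0.0)  # ReLU: only bounded below, at 0
    h_tanh = np.tanh(W @ h_tanh)          # tanh: squashed into (-1, 1)

print(np.linalg.norm(h_relu))  # typically astronomically large
print(np.linalg.norm(h_tanh))  # never exceeds sqrt(8)
```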
LSTM
Time-series prediction suffers from the problem of long-term dependencies, i.e., the network needs to remember information from a moment far back in time. A fixed num_step does not suit this situation, and the vanishing gradients over long time intervals cannot be handled. The RNN's recurrent cell therefore needs to be modified, which leads to the long short-term memory network (LSTM).
The basic LSTM module is shown in the figure below. Following the explanation in Understanding LSTM: "The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt." That is, the core of an LSTM is the cell state, which stores and carries information from earlier context, functioning much like a conveyor belt.
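For concreteness, here is a minimal numpy sketch of a single LSTM step following the standard gate equations; the parameter names (`Wf`, `Wi`, `Wo`, `Wc` and the biases) are my own, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step; a sketch of the standard gate equations."""
    z = np.concatenate([h_prev, x])   # stacked input [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)          # forget gate: what to drop from c
    i = sigmoid(Wi @ z + bi)          # input gate: what to write to c
    o = sigmoid(Wo @ z + bo)          # output gate: what to expose as h
    c_tilde = np.tanh(Wc @ z + bc)    # candidate cell state
    c = f * c_prev + i * c_tilde      # the cell state "conveyor belt"
    h = o * np.tanh(c)                # new hidden state
    return h, c
```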
Support vector machine (SVM), as a shallow model, has been widely applied to classification tasks. To solve the model, groups of support vectors (SVs) of the corresponding classes are extracted, so as to calculate a hyperplane as the classification border.
A brief review
Denote $X = \{x_1,~x_2,~\dots,~x_N\}$ as the samples to be classified, and $y = \{y_1,~y_2,~\dots,~y_N\}$, $y_i \in \{+1,~-1\}$, are the corresponding labels. Take binary classification as an example; the separating hyperplane is

$$w^\top x + b = 0,$$

where $w$ are the coefficients w.r.t. the features in $x$, and $b$ is the bias.
Then the problem becomes an optimization task, where the objective is

$$\min_{w,~b}~\frac{1}{2}\|w\|^2, \quad \text{s.t.}~~y_i\left(w^\top x_i + b\right) \ge 1,~~i = 1,~\dots,~N,$$

which shall be calculated with the Lagrange equation,

$$L(w,~b,~\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i\left(w^\top x_i + b\right) - 1 \right].$$
To save time, a subset of $X$, namely the support vectors, is usually selected to optimize the above equation, instead of all of the samples. Those SVs are the samples standing closest to the classification hyperplane, i.e., at the borders between different types; they are taken as representatives of the classes they belong to.
By solving the Lagrange equation, we obtain the $\alpha_i$, as well as $w$ and $b$:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad b = -\frac{1}{2}\left(w^\top x_{s_1} + w^\top x_{s_2}\right),$$

where $x_{s_1}$ and $x_{s_2}$ are arbitrary support vectors of class one and class two, respectively.
The decision function based on those parameters is

$$f(x) = \mathrm{sgn}\left(w^\top x + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i\, x_i^\top x + b\right),$$

where $\mathrm{sgn}(\cdot)$ is the sign function.
For non-linear classification, which is more general than the linear case, the dot product between $x_i$ and $x$ is replaced by a non-linear kernel function induced by a feature map $\Phi(\cdot)$, i.e.,

$$K(x_i,~x) = \Phi(x_i)^\top \Phi(x), \qquad f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i\, K(x_i,~x) + b\right).$$
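As a small illustration of the substitution, here are two kernels in numpy; the RBF kernel below is one common choice of $K$, not something specific to this post.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                    # plain dot product

def rbf_kernel(xi, xj, gamma=1.0):
    # Phi is implicit here; only the kernel value K(xi, xj) is computed.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```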
Realization and visualization
With the help of libsvm, it is easy to realize SVM-based classification. What I want to show in this post is how to visualize, or replicate, the prediction stage based on the model produced by the svmtrain function. Some comments are as follows (see the sketch after this list):
After training the SVM with svmtrain, a model will be generated;
In the model, the support vectors and parameters like weights and bias are archived;
To save space, the support vectors are saved as a sparse matrix;
A multi-class classification can be transformed into multiple binary-classification tasks.
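Below is a minimal sketch of that replication. It is only an assumption-laden stand-in: I use scikit-learn's `SVC`, which wraps libsvm, in place of the MATLAB svmtrain, and synthetic Gaussian data, but the archived quantities (support vectors, dual weights, bias) play the same roles.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic Gaussian blobs as a stand-in binary problem.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)),
               rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear").fit(X, y)

# The fitted model archives the support vectors, weights, and bias.
sv = clf.support_vectors_       # the SVs themselves
coef = clf.dual_coef_[0]        # alpha_i * y_i for each SV
b = clf.intercept_[0]           # bias term

# Replicate prediction by hand: f(x) = sgn(sum_i alpha_i y_i <x_i, x> + b).
x_new = np.array([0.5, -0.3])
decision = coef @ (sv @ x_new) + b
print(np.sign(decision), np.sign(clf.decision_function([x_new])[0]))
```

The hand-computed decision value coincides with `decision_function`, which is exactly the replication described in the list above.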
Here is a naive two-dimensional, three-class classification example (code is available). I divided the three-class task into three binary classifications. The linear kernel function was used, so the classification hyperplanes were also linear.
In the right figure, only the support vector points are plotted. It can be seen that the SVs are the points standing at the borders between different categories.