The confusion matrix evaluates a classifier by comparing its predictions against the true labels, and is commonly used for multi-class problems. TF's audio recognition tutorial comments on it as follows: "This matrix can be more useful than just a single accuracy score because it gives a good summary of what mistakes the network is making." In other words, the confusion matrix helps identify the classes on which the classifier performs worst, and is more intuitive than a single accuracy figure.
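As a quick illustration, here is a minimal sketch of building a confusion matrix with scikit-learn's `confusion_matrix`; the labels below are made up purely for demonstration.

```python
# A minimal sketch of a confusion matrix; the labels are invented.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]  # ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0, 2]  # classifier predictions

# Rows are true classes, columns are predicted classes;
# the off-diagonal entries show which classes get confused.
print(confusion_matrix(y_true, y_pred))
```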
So why use tanh instead of ReLU, which enjoys such a high reputation? This Zhihu discussion gives a great explanation, citing the view from Hinton's paper: "ReLUs seem inappropriate for RNNs because they can have very large outputs so they might be expected to be far more likely to explode than units that have bounded values." ReLU only constrains its output to $[0,~+\infty)$, while the recurrent weights in an RNN are shared across time steps; applying Eq. (1) repeatedly amounts to multiplying by the same weight matrix $W$ over and over, so the ReLU function can lead to exploding gradients. By contrast, tanh keeps each layer's output within a bounded range, avoiding both the vanishing and the exploding gradient problems.
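To make this concrete, here is a toy numpy sketch (not from the tutorial or the paper): the same weight matrix is applied repeatedly, as an RNN does across time steps. Under ReLU the activations typically blow up, while tanh keeps them bounded; the matrix size and scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=1.1, size=(8, 8))    # shared recurrent weights
h_relu = rng.normal(size=8)
h_tanh = h_relu.copy()

for _ in range(50):                       # 50 "time steps"
    h_relu = np.maximum(W @ h_relu, 0.0)  # ReLU: only bounded below, at 0
    h_tanh = np.tanh(W @ h_tanh)          # tanh: squashed into (-1, 1)

print(np.linalg.norm(h_relu))  # typically astronomically large
print(np.linalg.norm(h_tanh))  # never exceeds sqrt(8)
```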
LSTM
Time-series prediction suffers from the problem of long-term dependencies, i.e., the network needs to remember information from a moment far back in time. A fixed num_step does not suit this situation, and the vanishing gradients over long time intervals cannot be handled. The RNN's recurrent cell therefore needs to be modified, which leads to the long short-term memory network (LSTM).
The basic LSTM module is shown in the figure below. Following the explanation in Understanding LSTM: "The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt." That is, the core of an LSTM is the cell state, which stores and carries information from earlier context, functioning much like a conveyor belt.
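For concreteness, here is a minimal numpy sketch of a single LSTM step following the standard gate equations; the parameter names (`Wf`, `Wi`, `Wo`, `Wc` and the biases) are my own, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step; a sketch of the standard gate equations."""
    z = np.concatenate([h_prev, x])   # stacked input [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)          # forget gate: what to drop from c
    i = sigmoid(Wi @ z + bi)          # input gate: what to write to c
    o = sigmoid(Wo @ z + bo)          # output gate: what to expose as h
    c_tilde = np.tanh(Wc @ z + bc)    # candidate cell state
    c = f * c_prev + i * c_tilde      # the cell state "conveyor belt"
    h = o * np.tanh(c)                # new hidden state
    return h, c
```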
Support vector machine (SVM), as a shallow model, has been widely applied to classification tasks. To solve the model, groups of support vectors (SVs) of the corresponding classes are extracted, so as to calculate a hyperplane as the classification border.
A brief review
Denote $X = \{x_1,~x_2,~\dots,~x_N\}$ as the samples to be classified, and $y = \{y_1,~y_2,~\dots,~y_N\}$, $y_i \in \{+1,~-1\}$, are the corresponding labels. Take binary classification as an example; the separating hyperplane is

$$w^\top x + b = 0,$$

where $w$ are the coefficients w.r.t. the features in $x$, and $b$ is the bias.
Then the problem becomes an optimization task, where the objective is

$$\min_{w,~b}~\frac{1}{2}\|w\|^2, \quad \text{s.t.}~~y_i\left(w^\top x_i + b\right) \ge 1,~~i = 1,~\dots,~N,$$

which shall be calculated with the Lagrange equation,

$$L(w,~b,~\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i\left(w^\top x_i + b\right) - 1 \right].$$
To save time, a subset of $X$, namely the support vectors, is usually selected to optimize the above equation, instead of all of the samples. Those SVs are the samples standing closest to the classification hyperplane, i.e., at the borders between different types; they are taken as representatives of the classes they belong to.
By solving the Lagrange equation, we obtain the $\alpha_i$, as well as $w$ and $b$:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad b = -\frac{1}{2}\left(w^\top x_{s_1} + w^\top x_{s_2}\right),$$

where $x_{s_1}$ and $x_{s_2}$ are arbitrary support vectors of class one and class two, respectively.
The decision function based on those parameters is

$$f(x) = \mathrm{sgn}\left(w^\top x + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i\, x_i^\top x + b\right),$$

where $\mathrm{sgn}(\cdot)$ is the sign function.
For non-linear classification, which is more general than the linear case, the dot product between $x_i$ and $x$ is replaced by a non-linear kernel function induced by a feature map $\Phi(\cdot)$, i.e.,

$$K(x_i,~x) = \Phi(x_i)^\top \Phi(x), \qquad f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i\, K(x_i,~x) + b\right).$$
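As a small illustration of the substitution, here are two kernels in numpy; the RBF kernel below is one common choice of $K$, not something specific to this post.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                    # plain dot product

def rbf_kernel(xi, xj, gamma=1.0):
    # Phi is implicit here; only the kernel value K(xi, xj) is computed.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```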
Realization and visualization
With the help of libsvm, it is easy to realize SVM-based classification. What I want to show in this post is how to visualize, or replicate, the prediction stage based on the model produced by the svmtrain function. Some comments are as follows (see the sketch after this list):
After training the SVM with svmtrain, a model will be generated;
In the model, the support vectors and parameters like weights and bias are archived;
To save space, the support vectors are saved as a sparse matrix;
A multi-class classification can be transformed into multiple binary-classification tasks.
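Below is a minimal sketch of that replication. It is only an assumption-laden stand-in: I use scikit-learn's `SVC`, which wraps libsvm, in place of the MATLAB svmtrain, and synthetic Gaussian data, but the archived quantities (support vectors, dual weights, bias) play the same roles.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic Gaussian blobs as a stand-in binary problem.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)),
               rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear").fit(X, y)

# The fitted model archives the support vectors, weights, and bias.
sv = clf.support_vectors_       # the SVs themselves
coef = clf.dual_coef_[0]        # alpha_i * y_i for each SV
b = clf.intercept_[0]           # bias term

# Replicate prediction by hand: f(x) = sgn(sum_i alpha_i y_i <x_i, x> + b).
x_new = np.array([0.5, -0.3])
decision = coef @ (sv @ x_new) + b
print(np.sign(decision), np.sign(clf.decision_function([x_new])[0]))
```

The hand-computed decision value coincides with `decision_function`, which is exactly the replication described in the list above.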
Here is a naive two-dimensional, three-class classification example (code is available). I divided the three-class task into three binary classifications. The linear kernel function was used, so the classification hyperplanes were also linear.
In the right figure, only the support vector points are plotted. It can be seen that the SVs are the points standing at the borders between different categories.