《机器学习基石》学习笔记——1.2 learning to answer yes/no

handout slides

Machine Learning Foundations 课程由台湾大学NTU林轩田老师开设，课程共16篇，包括四部分内容：

When can machines learn? (illustrative + technical)
Why can machines learn? (theoretical + technical)
How can machines learn? (technical + practical)
How can machines learn better? (practical + theoretical)

下面是Topic 1 Part 2——learning to answer yes/no

1. Perceptron Hypothesis Set

引入信用卡的例子，银行如何根据用户信息来决定是否给用户发放信用卡。
A Simple Hypothesis Set: the Perceptron
- 每个用户信息形成d维向量$\mathbf{x} = (x_1, x_2, …, x_d)$；
- 每个特征赋予不同的权值$w_i$，表示该特征对是否发放信用卡的影响；
- 所有特征的加权求和与一设定的阈值进行比较，大于阈值输出+1，即发放信用卡；小于阈值输出-1，不发放信用卡
- 令$x_0 = +1, w_0 = -threshold$，将阈值吸收进$\mathbf{w}^T$，则：
  
  $h(x) = sign(\sum_{i=0}^{d}w_ix_i) = sign(\mathbf{w}^T\mathbf{x})$

2. Perceptron Learning Algorithm (PLA)

在Perceptron中，hypothesis set由许多直线构成，PLA的目的就是在这些可能是无限的直线中，选择一条最好的直线，能将平面上所有的正类和负类完全分开，也就是找到最好的g，使$g \approx f$

遍历所有可能的直线是不现实的，思路是采用“逐点修正”

基于一条直线，找到错误的点，并更新$w$，更新方法是：

如果错误点的$y = +1$，即正类误判为负类$\mathbf{w}_{(t)}\mathbf{x}_{n(t)} < 0$，则表示$\mathbf{w}$与$\mathbf{x}$夹角大于90，修正方案是将角度变小，即$\mathbf{w} = \mathbf{w} + y\mathbf{x}, y=+1$；
如果错误点的$y = -1$，即负类误判为正类$\mathbf{w}_{(t)}\mathbf{x}_{n(t)} > 0$，则表示$\mathbf{w}$与$\mathbf{x}$夹角小于90，修正方案是将角度变大，即$\mathbf{w} = \mathbf{w} + y\mathbf{x}, y=-1$；

如果数据本身是线性可分的，经过不断的迭代修正之后，所有的点都能正确分类

我们由$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n\mathbf{x}_n$，两边同时乘以$y_n\mathbf{x}_n$
得到$y_n \mathbf{w}_{t+1}\mathbf{x}_n \leftarrow y_n \mathbf{w}_t\mathbf{x}_n + (y_n\mathbf{x}_n)^2$
则有：$y_n \mathbf{w}_{t+1}\mathbf{x}_n \geqslant y_n \mathbf{w}_t\mathbf{x}_n$
上式表明，随着迭代的进行，正确分类的样本逐渐变多
the rule somewhat ‘tries to correct the mistake’.

3. Guarantee of PLA

如果D不是线性可分的（linear separable），则PLA不会停止
在线性可分的情况下，则存在$\mathbf{w}_f$使得$y_n = sing(\mathbf{w}_f^T \mathbf{x}_n)$成立
$$\min_n \ y_n \mathbf{w}_f^T \mathbf{x}n > 0 \Leftrightarrow
y{n(t)} \mathbf{w}f^T \mathbf{x}{n(t)} \geqslant \min_n \ y_n \mathbf{w}_f^T \mathbf{x}_n > 0$$

也就是说，$\mathbf{w}_f^T \mathbf{w}_t$逐渐变大，内积越大似乎就表示两个向量越接近，其实不然，有可能是长度更接近了，而不是角度。所以，下面再证明长度关系。
有错才更新，所以下面这个式子成立

可以看出，$\mathbf{w}_t$的增长被限制了，$\mathbf{w}_{t+1}$与$\mathbf{w}_t$向量的长度不会差别太大。

如果令初始权重$\mathbf{w}_0 = 0$，那么经过T次错误修正后，有如下结论：
$$\frac{\mathbf{w}_f^T}{\left |\mathbf{w}_f \right |}\frac{\mathbf{w}_T}{\left |\mathbf{w}_T \right |} \geqslant \sqrt{T}\cdot constant$$
总结：