机器学习笔记 01 · 简单易懂的现代魔法

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

机器学习算法可以大致分为两类：Supervised Learning（监督学习）和Unsupervised Learning（无监督学习）。

Supervised Learnin：从给定的训练数据集中学习出一个函数，当新的数据到来时，可以根据这个函数预测结果。例如 Regression Problem (Predict real-valued output)
Unsupervised Learning：与监督学习相比，无监督学习的训练集没有人为标注的结果。

机器学习问题中常用的符号定义如下：

m = Number of training examples

x = input values/ features

y = output values/ target variable

(x⁽ⁱ⁾, y⁽ⁱ⁾) = the No.i training example

Hypothesis (假设函数)

一个机器学习算法运行的方式如下图所示

其中函数

\[ h_\theta(x) = \theta_0 + \theta_1x \]

称为 hypothesis (假设函数)

\(^*\)以上函数h(x)仅针对线性情况

Cost Function (代价函数)

针对一个线性回归问题的假设函数

\[ h_\theta(x) = \theta_0 + \theta_1x \]

如何选择 \(\theta_0\) 和 \(\theta_1\) 可以使 \(h_\theta(x)\) 更好地拟合训练数据集呢？一般认为，当通过 \(h_\theta(x)\) 得出的预测值与训练数据集中的样本值方差最小时，\(\theta_0\) 和 \(\theta_1\) 取得最优值，即函数

\[ J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x_i)-y_i)^2 \]

取得最小值时，\(\theta_0\) 和 \(\theta_1\) 取得最优值。\(J(\theta_0,\theta_1)\) 被称作 Cost Function (代价函数)

Gradient Descent (梯度下降)

为了使代价函数 \(J(\theta_0,\theta_1)\) 取得最小值，我们采用一种叫做 Gradient Descent (梯度下降) 的算法——确切地来说，梯度下降算法只能帮助我们到达函数的局部极小值，因此在代价函数较为复杂的情况下可能会出现问题。

首先在代价函数 \(J(\theta_0,\theta_1)\) 上任意取得一点，然后向函数上当前点对应梯度（或者是近似梯度）的反方向的规定步长距离点进行迭代搜索：假设有函数 \(F(x)\) 在 \(a\) 点处存在且可微，那么如果 \(b = a - \gamma\nabla F(a)\) 对于任意小的 \(\gamma > 0\)成立，则有 \(F(a) \geq F(b)\)。因此我们只要对 \(J(\theta_0,\theta_1)\) 上某一点 \((\theta_0,\theta_1)\) 进行如下迭代：

\[ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) \]

可以期望 \(\theta_j\) 最终收敛于局部极小值。其中，迭代步长 \(\alpha\) 被称为 Learning Rate (学习速率) ，并在每一次迭代中都可以改变。