线性回归

模型

线性回归模型，顾名思义就是线性模型求解回归问题。

线性模型

【定义】给定具有$d$个属性的数据样本$x=\{x_1,x_2,…,x_d\}$，$x_i$是$x$在第$i$个属性上的取值，线性模型通过对属性的线性组合来进行预测的函数：

$f(x)=w_1x_1+w_2x_2+...+w_dx_d+b$

向量形式表示为：

$f(x)=w^Tx+b$

其中$w=(w_1;w_2;…;w_d)$，只要确定了$w$和$b$，模型就得以确定。

向量都表示成竖直的一条直线。

回归问题

回归属于监督学习的范畴，用于预测输入变量与输出变量的关系。其本质就是数据拟合，选择一条函数曲线使其很好地拟合已知的数据，同时能够预测未知的数据。

按照输入变量的属性个数，分为一元回归和多元回归；按照输入变量和输出变量之间的映射关系，分为线性回归和非线性回归。

线性回归

线性模型描述的是属性间的组合关系，而回归问题求解的是输入与输出的关系。

即使用一个线性函数去建模输入变量属性间的线性关系，从而发现输入变量与输出变量的关系。给定数据集$D=\{(x_1,y_1),(x_2,y_2),…,(x_m,y_m)\}$，其中$x_i=(x_i1;x_i2;…;x_id)$。

问题描述：线性回归试图学习一个线性模型$f(x_i)=w^Tx_i+b$，使$f(x_i)\approx y_i$。

又称为多元线性回归，或者多变量线性回归。

算法求解

最小二乘法

只要确定了权重$w$和偏差$b$的值，那么我们就可以得到模型了，而我们想让预测值$f(x_i)$无限接近真实值$y_i$，所以使用均方误差作为性能度量，即我们试图让均方差最小：

$Loss(w^*,b^*)=\sum_{i=1}^{m}(f(x_i)-y_i)^2$

均方误差对应了常用的欧几里得距离，基于均方误差最小化进行模型求解的方法称为“最小二乘法”，实际上，最小二乘法就是试图找到一条直线，使样本到直线的欧氏距离之和最小。

关于为什么使用均方误差作为损失度量，一个是因为其本质是最大似然法，ref

常用损失函数对比ref2

参数估计

Loss函数求导取极值

我们可以对损失函数$Loss$求导，并令导数为零来求得最优参数：

$\begin{cases} w=\frac{\sum_{i=1}^{n}y_i(x_i-\bar{x})}{\sum_{i=1}^{n}x_{i}^{2}-\frac{1}{n}(\sum_{i=1}^{n}x_i)^2}\\ b=\frac{1}{m}\sum_{i=1}^{n}(y_i-wx_i) \end{cases}$

点击显/隐内容

公式推导

偏导
$\begin{cases} \frac{\partial{}E}{\partial{w}}=2[w\sum_{i=1}^{n}x_{i}^{2}-\sum_{i=1}^{n}(y_i-b)x_i]\\ \frac{\partial{}E}{\partial{b}}=2[mb-\sum^{n}_{i=1}(y_i-wx_i)] \end{cases}$
求解b，$\frac{\partial{}E}{\partial{b}}=0$
$\begin{align*} mb&=\sum^{n}_{i=1}(y_i-wx_i)\\ &=(y_1-wx_1)+(y_2-wx_2)+...+(y_n-wx_n)=(y_1+y_2+...+y_n)-w(x_1+x_2+...+x_n)\\ &=\sum{y_i}-w\sum{x_i} \end{align*}$
则$b=\frac{1}{n}\sum_{i=1}^{n}(y_i-wx_i)=\bar{y}-w\bar{x}$
求解w，$\frac{\partial{}E}{\partial{w}}=0$
$\begin{align*} w\sum_{i=1}^{n}x_{i}^{2}&=\sum_{i=1}^{n}(y_i-\bar{y}+w\bar{x})x_i\\ &=(y_1-\bar{y}+w\bar{x})x_1+(y_2-\bar{y}+w\bar{x})x_2+...+(x_n-\bar{y}+w\bar{x})x_n\\ &=(y_1x_1+y_2x_2+...+y_nx_n)-\bar{y}(x_1+x_2+...+x_n)+w\bar{x}(x_1+x_2+...+x_n)\\ &=\sum_{i=1}^{n}x_iy_i-\bar{y}\sum_{i=1}^{n}x_i+w\bar{x}\sum_{i=1}^{n}x_i\\ &=\sum_{i=1}^{n}x_iy_i-n\bar{y}\bar{x}+w\bar{x}n\bar{x} \end{align*}$
则$w(\sum_{i=1}^{n}x_{i}^{2}-n\bar{x}^2)=\sum_{i=1}^{n}x_iy_i-n\bar{y}\bar{x}$，故
$w=\frac{\sum_{i=1}^{n}x_iy_i-n\bar{y}\bar{x}}{\sum_{i=1}^{n}x_{i}^{2}-n\bar{x}^2}$

梯度下降法

损失函数是：

$\begin{align*} Loss(w,b)&=\sum_{i=1}^{n}(f(x_i)-y_i)^2\\ &=\sum_{i=1}^{n}(wx_i+b-y_i)^2 \end{align*}$

参数更新：

$\begin{cases} w=w-\alpha(f(x)-y){x}\\ b=b-\alpha(f(x)-y) \end{cases}$

$\begin{align*} \frac{\partial{E}}{\partial{w}}&=\frac{\partial{}}{\partial{w}}{\frac{1}{2}(f(x)-y)^2}\\ &=(f(x)-y)\frac{\partial}{\partial{w}}(wx+b-y)\\ &=(f(x)-y)x \end{align*}$

即：

$w_i=w_i-\alpha\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)\cdot x_i$

此时，$b=w_0$，$x_0=1$。

正则化的损失函数

此时目标函数为：

$\begin{align*} E(w,b)&=\sum_{i=1}^{n}(wx_i+b-y_i)^2+\lambda\sum_{i=1}^{n}w_{i}^{2} \end{align*}$

$w_0$不需要正则化

此时梯度下降法为：

$w_0=w_0-\alpha\frac{1}{n}\sum_{i=1}^{n}(h_w(x^{(i)})-y^{(i)})x_0^{(i)}\\ w_j=w_j-\alpha[\frac{1}{n}\sum_{i=1}^{n}(h_w(x^{(i)})-y^{(i)})x_j^{(i)}-\frac{\lambda}{n}w_j]$

Data Preprocessing

Feature Scaling

It involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

Mean normalization

This involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.

By combining the two techniques and adjust the input values as shown in the following formula:

$x_i=\frac{x_i-\mu_i}{s_i}$

where $\mu_i$ is the average of all the values for feature $(i)$ and $s_i$ is the range of values (max - min), or $s_i$ is the standard deviation.

Convergence Figure

In order to make sure that our algorithm runs correctly, we need to debug gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, $J(\theta)$ over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease learning rate $\alpha$.

Summary

If $\alpha$ is too small, it could result in slow convergence.
If $\alpha$ is too large, it may not decrease on every iteration and thus may not converge, like the following pic:

Screen Shot 2018-08-06 at 12.58.12 PM

Example

Screen Shot 2018-08-06 at 12.54.04 PM

A is $\alpha=0.1$, B is $\alpha=0.01$, A is $\alpha=1$.

In graph C, the cost function is increasing, so the learning rate is set too high. Both graphs A and B converge to an optimum of the cost function, but graph B does so very slowly, so its learning rate is set too low. Graph A lies between the two.

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

Screen Shot 2018-08-07 at 12.18.01 AM

例子

线性回归预测一维数据

来自Andrew Ng第二周的课程练习，给出城市人口及该市的商店利润，预测如何进行店铺扩展，即选择哪座城市开分店？给定的数据只有一个人口特征及利润目标值。

第一步就是进行数据的可视化。

点击显/隐内容

def plot_data():
    data = np.loadtxt('./dataset/ex1data1.txt', delimiter=',')
    x = data[:, 0]
    y = data[:, 1]
    plt.scatter(x, y, marker='x')
    plt.xlabel("Population of City in 10,000s")
    plt.ylabel("Profit in $10,000s")
    plt.show()

第二步：数据处理及参数初始化

我们为数据增加一列全一

点击显/隐内容

length = x.shape[0]
z = np.ones(length)
x = np.array(list(zip(z,x)))  # add a column of ones to x
theta = np.zeros([x.shape[1],1]) # initialize fitting parameters
alpha = 0.01
iters = 1500

第三步：损失函数

我们使用公式

Screen Shot 2018-08-15 at 6.19.39 PM

初次调用，返回值是32.07。

点击显/隐内容

def cost_function(x,y,theta):
    y_ = x.dot(theta)
    loss = (y-y_)**2
    cost = sum(loss) / (2*length)
    return cost

第四步：梯度下降法

检查梯度下降法是否正确的一个手段：查看损失函数是否在逐步减小。我们使用梯度更新公式：

Screen Shot 2018-08-15 at 6.26.20 PM

点击显/隐内容

def gradient_descent(x,y,theta,alpha,iters):
    for i in range(iters):
        # cost_function(x,y,theta)
        y_ = x.dot(theta)
        loss = (y_ - y)
        gd = loss.T.dot(x) # shape = (1,2)
        gd = alpha * gd.T / length
        theta = theta - gd
    return theta

需要注意的一个点是：loss = ( y_pred - y )。反过来的话，会出现loss趋于无穷大。这个是对loss函数求导得到的，无论$(h_\theta(x)-y )^2$还是$(y-h_\theta(x) )^2$，都是一样的。

第五步：可视化拟合曲线

点击显/隐内容

def plot_data(x,y,theta):
    plt.scatter(x, y, marker='x',label='Training data')
    plt.xlabel("Population of City in 10,000s")
    plt.ylabel("Profit in $10,000s")
    x_ = np.linspace(min(x),max(x),1000)
    y_ = x_*theta[1] + theta[0]
    plt.plot(x_,y_,label='Linear regression')
    plt.show()

线性回归预测多特征数据

我们使用线性回归预测房价，数据集有两个特征：第一列是房屋面积(单位：$feet^2$)，第二列是房间数，最后一列是房价。

特征正则化

发现两个特征数据范围相差特别大，所以需要对数据进行正则化，这样可以加快梯度下降法的收敛。

减去均值
除以标准差

正则化之后，别忘记加一个全一列。

注意：记得存储mean和std的值，这样我们在预测位置数据的时候，第一步就是使用mean和std对未知数据做同样的处理。

点击显/隐内容

def feature_norm(x):
    mean = np.mean(x,axis=0)
    std = np.std(x,axis=0)
    x = (x-mean)/std
    ones = np.ones((x.shape[0], 1))
    x = np.concatenate([ones, x], axis=1)
    return x,mean,std

梯度下降法

先实现损失函数

Screen Shot 2018-08-15 at 6.19.39 PM

点击显/隐内容

def cost_func(x,y,thetas):
    y_pred = x.dot(thetas)
    loss = sum((y_pred-y)**2)/(2*x.shape[0])
    return loss

再实现梯度下降法，损失函数都是10次方。

点击显/隐内容

def gradient_dec(x,y,thetas,lr,iter):
    for i in range(iter):
        y_pred = x.dot(thetas)
        gd = (y_pred - y).T.dot(x)
        thetas -= lr * gd.T / x.shape[0]
        loss = cost_func(x, y, thetas)
        print(loss)
    return thetas