Understanding the mathematics behind linear regression
Today we are going to talk about linear regression, one of the most well known and well understood algorithms in machine learning. We are going to focus on simple linear regression, which contains only one input variable, but the same logic and analyses extend to multivariable linear regression. You are probably familiar with its form: y = β0 + β1x. However, you might feel confused about how the linear regression model is learnt. This blog post will cover some of the most common techniques used to train a linear regression model.
Solving linear regression using Ordinary Least Squares – general formula
A simple linear regression function can be written as:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \dots, n$$

We can obtain n equations for the n examples:
$$\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \epsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_2 + \epsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_n + \epsilon_n
\end{aligned}$$

If we add the n equations together, we get:
$$\sum y_i = n\beta_0 + \beta_1 \sum x_i + \sum \epsilon_i$$

For a least-squares fit that includes an intercept, the residuals sum to zero, so we get:
$$\sum y_i = n\beta_0 + \beta_1 \sum x_i \quad (1)$$

The Ordinary Least Squares method aims to minimize the sum of the squared residuals. We define C to be the sum of the squared residuals:
$$C = (\beta_0 + \beta_1 x_1 - y_1)^2 + (\beta_0 + \beta_1 x_2 - y_2)^2 + \dots + (\beta_0 + \beta_1 x_n - y_n)^2$$

C is a quadratic polynomial in β0 and β1, so we minimize it by setting its partial derivatives to zero. Setting ∂C/∂β0 = 0 reproduces equation (1); setting ∂C/∂β1 = 0 gives:
$$\sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2 \quad (2)$$

Solving equations (1) and (2), we then get:
$$\beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2}, \qquad
\beta_1 = \frac{n(\sum x_i y_i) - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$
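To make the closed-form solution concrete, here is a minimal sketch in Python with NumPy. The function name and the toy data are purely illustrative, not part of any particular library:

```python
import numpy as np

def ols_closed_form(x, y):
    """Fit y = b0 + b1*x using the closed-form OLS formulas above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()
    denom = n * sum_x2 - sum_x ** 2              # n*sum(x^2) - (sum(x))^2
    b1 = (n * sum_xy - sum_x * sum_y) / denom
    b0 = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    return b0, b1

# Toy data that lie roughly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 9.1, 10.8]
print(ols_closed_form(x, y))   # (1.14, 1.96): close to intercept 1 and slope 2
```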
Solving linear regression using Ordinary Least Squares – matrix formulation

It is, however, more compact and computationally convenient to define the linear regression model with matrices and to perform the subsequent analyses in matrix form.
The n simple linear regression equations can be written out as:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

In matrix form the linear regression model can be rewritten as:
$$Y = X\beta + \epsilon$$

In the section above, equations (1) and (2) state that:
$$\sum y_i = n\beta_0 + \beta_1 \sum x_i, \qquad \sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2$$

We know:
$$X'X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} =
\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}$$

$$X'Y = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$

We can see that equations (1) and (2) are equivalent to the following matrix equation:
$$X'Y = X'XB$$

Thus, we get:
$$B = (X'X)^{-1}X'Y, \quad \text{with } B = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

Thus, solving linear regression with least squares normally requires that all of the data be available at once and that your computer has enough memory to hold the data and perform the matrix operations.
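As a rough sketch of the matrix formulation in Python (the variable names and toy data are illustrative), the normal equations can be solved directly with NumPy. In practice a least-squares solver is preferred to forming the inverse of X'X explicitly:

```python
import numpy as np

# Toy data: y is roughly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Design matrix X with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# B = (X'X)^{-1} X'Y, solved as a linear system instead of an explicit inverse
B = np.linalg.solve(X.T @ X, X.T @ y)
print(B)   # [beta_0, beta_1], roughly [1.14, 1.96]

# Numerically more stable alternative: a dedicated least-squares solver
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```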
Solving linear regression using Gradient Descent
When the dataset is very large, either because it contains many examples (rows) or many features (columns), the matrix representing it may be too large to fit into memory. In that case it is better to use another method to train the linear regression model. We'll talk about gradient descent in this section.
We first define a cost function which measures how well the regression line fits the data. The cost function E(β) is defined as:
$$E(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$

The goal of the linear regression model is to minimize this cost function. Gradient descent determines a weight vector B that minimizes E by starting from an arbitrary initial weight vector and repeatedly modifying it in small steps. At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface. Because the cost function of linear regression is convex, this process converges to the global minimum provided the learning rate is small enough.
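As a small sketch, the cost function can be computed in Python as follows (the function name is just for illustration):

```python
import numpy as np

def cost(b0, b1, x, y):
    """E(beta) = 1/(2n) * sum_i (y_i - (b0 + b1*x_i))^2"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    residuals = y - (b0 + b1 * x)
    return (residuals ** 2).sum() / (2 * len(x))
```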
So, in what direction should we change the weight vector B so that the cost function decreases? If we change β0 by a small amount Δβ0 and β1 by a small amount Δβ1, then E changes approximately as follows:
$$\Delta E \approx \frac{\partial E}{\partial \beta_0}\Delta\beta_0 + \frac{\partial E}{\partial \beta_1}\Delta\beta_1$$

We can rewrite this as:
$$\Delta E \approx \nabla E \cdot \Delta B$$

where ∇E = (∂E/∂β0, ∂E/∂β1) and ΔB = (Δβ0, Δβ1)^T.
If we choose:
$$\Delta B = -\eta \nabla E$$

where η is a small positive constant (the learning rate), then we get:
$$\Delta E \approx \nabla E \cdot (-\eta \nabla E) = -\eta\,\|\nabla E\|^2 \leq 0$$

This guarantees that the error E always decreases (or stays the same), which is exactly the property we want.
Hence, the update rule for β0 and β1 is:
$$\beta_0 := \beta_0 - \eta\frac{\partial E}{\partial \beta_0}, \qquad \beta_1 := \beta_1 - \eta\frac{\partial E}{\partial \beta_1}$$

We have:
$$\frac{\partial E}{\partial \beta_0} = \frac{1}{n}\sum_{i=1}^{n}\big((\beta_0 + \beta_1 x_i) - y_i\big), \qquad
\frac{\partial E}{\partial \beta_1} = \frac{1}{n}\sum_{i=1}^{n}\big((\beta_0 + \beta_1 x_i) - y_i\big)x_i$$

We get:
$$\beta_0 := \beta_0 - \eta\,\frac{1}{n}\sum_{i=1}^{n}\big((\beta_0 + \beta_1 x_i) - y_i\big), \qquad
\beta_1 := \beta_1 - \eta\,\frac{1}{n}\sum_{i=1}^{n}\big((\beta_0 + \beta_1 x_i) - y_i\big)x_i$$

The learning rate η controls how quickly the weights move towards the minimum. The weights are updated until a minimum sum of squared errors is achieved or no further improvement is possible.
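Putting the update rules together, a minimal batch gradient descent sketch in Python might look like this (the learning rate, iteration count, and toy data are arbitrary choices for illustration):

```python
import numpy as np

def gradient_descent(x, y, eta=0.05, n_iters=5000):
    """Fit y = b0 + b1*x by repeatedly applying the update rules above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b0, b1 = 0.0, 0.0                      # arbitrary initial weights
    for _ in range(n_iters):
        error = (b0 + b1 * x) - y          # (beta_0 + beta_1*x_i) - y_i
        grad_b0 = error.sum() / n          # dE/d(beta_0)
        grad_b1 = (error * x).sum() / n    # dE/d(beta_1)
        b0 -= eta * grad_b0
        b1 -= eta * grad_b1
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 9.1, 10.8]
print(gradient_descent(x, y))   # approaches the closed-form solution (~1.14, ~1.96)
```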
Regularization methods for linear regression
For machine learning problems, overfitting is a common issue we have to be aware of. Overfitting means your model fits the training data well but generalizes poorly to unseen data. One of the most common techniques to overcome overfitting is regularization. Two popular regularization procedures for linear regression are Lasso regression and Ridge regression.
Lasso Regression – L1 regularization
In Lasso regression, we add an extra term to the cost function, which is called the regularization term. The regularized cost function is:
$$E(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

We can rewrite it as:
$$E(\beta) = E_0 + \lambda\sum_{j=1}^{p}|\beta_j|$$

where E0 is the unregularized cost function.
If we take the partial derivative with respect to β1, we get:

$$\frac{\partial E}{\partial \beta_1} = \frac{\partial E_0}{\partial \beta_1} + \lambda\,\mathrm{sgn}(\beta_1)$$

where sgn(β1) is the sign of β1: +1 if β1 is positive, −1 if β1 is negative.
The update rule for β1 then becomes:
$$\beta_1 := \beta_1 - \eta\lambda\,\mathrm{sgn}(\beta_1) - \eta\frac{\partial E_0}{\partial \beta_1}$$

This update rule shrinks the weight towards 0 by a constant amount at each step, regardless of its magnitude. Small weights are therefore driven all the way to zero, while large weights are reduced comparatively little, so the model tends to concentrate on a small number of relatively important weights.
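As a rough sketch, a single gradient descent step with the L1 penalty might look like this in Python (the λ and η values are arbitrary, the intercept is left unpenalized, and np.sign returns 0 when the weight is exactly zero):

```python
import numpy as np

def lasso_update(b0, b1, x, y, eta=0.05, lam=0.1):
    """One gradient descent step with an L1 penalty on the slope beta_1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    error = (b0 + b1 * x) - y
    grad_b0 = error.sum() / n                            # intercept is not penalized
    grad_b1 = (error * x).sum() / n + lam * np.sign(b1)  # dE0/d(beta_1) + lambda*sgn(beta_1)
    return b0 - eta * grad_b0, b1 - eta * grad_b1
```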
Ridge Regression – L2 regularization
In Ridge regression, we add a different term to the cost function. The regularized cost function is:
$$E(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

We can rewrite it as:
$$E(\beta) = E_0 + \lambda\sum_{j=1}^{p}\beta_j^2$$

where E0 is the unregularized cost function.
If we take the partial derivative with respect to β1, we get:
$$\frac{\partial E}{\partial \beta_1} = \frac{\partial E_0}{\partial \beta_1} + 2\lambda\beta_1$$

The update rule for β1 then becomes:
$$\beta_1 := \beta_1 - 2\eta\lambda\beta_1 - \eta\frac{\partial E_0}{\partial \beta_1}$$

It can be rewritten as:
$$\beta_1 := (1 - 2\eta\lambda)\beta_1 - \eta\frac{\partial E_0}{\partial \beta_1}$$

The update rule now rescales the weight by a factor of (1 − 2ηλ) before the gradient step. Because this shrinkage is proportional to the weight itself, it penalizes relatively large weights more heavily.
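A corresponding sketch of one L2-penalized update step (again with illustrative λ and η values and an unpenalized intercept):

```python
import numpy as np

def ridge_update(b0, b1, x, y, eta=0.05, lam=0.1):
    """One gradient descent step with an L2 penalty on the slope beta_1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    error = (b0 + b1 * x) - y
    new_b0 = b0 - eta * error.sum() / n                              # intercept is not penalized
    new_b1 = (1 - 2 * eta * lam) * b1 - eta * (error * x).sum() / n  # rescale, then gradient step
    return new_b0, new_b1
```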
If we compare L1 and L2 regularization: when a particular weight has a large magnitude, L1 regularization tends to shrink it much less than L2 regularization does. By contrast, when a particular weight has a small magnitude, L1 regularization tends to shrink it much more than L2 regularization does.
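To make that comparison concrete, here is a tiny numeric sketch. The η and λ values and the two weights are arbitrary; the shrinkage amounts follow directly from the two update rules above:

```python
eta, lam = 0.1, 0.5

for w in (5.0, 0.05):
    l1_shrink = eta * lam            # constant amount, independent of w
    l2_shrink = 2 * eta * lam * w    # proportional to w
    print(f"w={w}: L1 shrinks by {l1_shrink:.3f}, L2 shrinks by {l2_shrink:.3f}")
# w=5.0:  L1 shrinks by 0.050, L2 shrinks by 0.500  -> L2 shrinks the large weight more
# w=0.05: L1 shrinks by 0.050, L2 shrinks by 0.005  -> L1 shrinks the small weight more
```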
Regularization methods are useful when there is collinearity in your input variables and the unregularized model would overfit the training data.
Extra words about linear regression
There are some heuristics you can use when performing linear regression, especially during data preparation.
- Linear assumption: linear regression assumes that the relationship between the input variables and the output variable is linear. You may need to transform the data to make the relationship linear. Also be careful when the training data contain outliers or noise, because linear regression is sensitive to outliers. You may need to clean the noisy data before feeding them into the model.
- Collinearity assumption: linear regression assumes that there is little or no collinearity among the input variables. Linear regression will overfit the training data when the input variables are highly correlated. You may need to do some feature selection and remove the most correlated features.
- Normality assumption: linear regression assumes that the variables are normally distributed or close to it. Linear regression will make more reliable predictions if the variables are normally distributed. You may need to transform the data to make the variables look more normal.
- Rescaling data: linear regression will make more reliable predictions if the input variables are rescaled, using normalization or standardization (see the short sketch after this list).
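A minimal sketch of the two rescaling options mentioned above, assuming the features are NumPy arrays (the function names are just for illustration):

```python
import numpy as np

def standardize(x):
    """Rescale a feature to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def normalize(x):
    """Rescale a feature to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```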