There are several ways to find the global minimum of the cost function for linear regression, including the normal equation and gradient descent. This article gives a proof of the normal equation.
# Linear regression recap
As an example, suppose we want to model global temperature anomalies (the target $Y$) as a function of the yearly CO₂ level (the feature $X$).

Figure 1. Climate change data from 1959 to 2016.

We assume a linear relationship between the two (Figure 1):
$$h_\theta(x) = \theta_1 x_1 + \theta_0$$
In general, let $m$ be the number of training examples and $n$ the number of features. The cost function is then:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
In vectorized form, where $X$ is the $m \times (n+1)$ design matrix (a leading column of ones accounts for the intercept $\theta_0$) and $y$ is the vector of targets:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$
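As a quick sanity check, here is a minimal NumPy sketch of both forms of the cost function. The data values and variable names are made up for illustration, and $X$ is assumed to already carry the leading column of ones.

```python
import numpy as np

def cost_loop(theta, X, y):
    """Sum form: J(theta) = (1/2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = X.shape[0]
    return sum((X[i] @ theta - y[i]) ** 2 for i in range(m)) / (2 * m)

def cost_vectorized(theta, X, y):
    """Matrix form: J(theta) = (1/2m) * (X theta - y)^T (X theta - y)."""
    m = X.shape[0]
    residual = X @ theta - y
    return (residual @ residual) / (2 * m)

# Toy data: first column of ones so theta[0] plays the role of theta_0.
X = np.array([[1.0, 316.0], [1.0, 354.0], [1.0, 404.0]])
y = np.array([0.03, 0.45, 1.01])
theta = np.array([0.0, 0.001])

print(cost_loop(theta, X, y))        # the two forms agree
print(cost_vectorized(theta, X, y))
```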
#
Normal equation
We need the $\theta$ that minimizes the cost function. Since $J(\theta)$ is convex in $\theta$, any stationary point is a global minimum, so it suffices to find the $\theta$ that satisfies

$$\nabla_\theta J(\theta) = 0$$
Solving this equation for $\theta$ (assuming $X^T X$ is invertible) gives the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$
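Here is a minimal NumPy sketch of the normal equation in action. The CO₂/anomaly numbers below are invented for illustration and are not the actual data behind Figure 1.

```python
import numpy as np

# Hypothetical (CO2 ppm, temperature anomaly) pairs; illustrative values only.
co2 = np.array([316.0, 325.0, 339.0, 354.0, 369.0, 390.0, 404.0])
anomaly = np.array([0.03, 0.05, 0.27, 0.45, 0.42, 0.72, 1.01])

m = co2.shape[0]
X = np.column_stack([np.ones(m), co2])  # design matrix: intercept column + feature

# theta = (X^T X)^{-1} X^T y, computed with a linear solve instead of an
# explicit matrix inverse for numerical stability.
theta = np.linalg.solve(X.T @ X, X.T @ anomaly)
print(theta)  # [theta_0, theta_1]
```

If $X^T X$ is singular or nearly so (for example, with redundant features), `np.linalg.lstsq` or the pseudo-inverse `np.linalg.pinv` are the usual fallbacks.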
# Proof

# Basic knowledge for proof
Following the cited CS229 notes, for a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ that maps matrices to real numbers, we define the derivative of $f$ with respect to the matrix $A$ to be the $m \times n$ matrix whose $(i, j)$ entry is $\partial f / \partial A_{ij}$:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
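For example, for a $2 \times 2$ matrix $A$ with

$$f(A) = \frac{3}{2} A_{11} + 5 A_{12}^2 + A_{21} A_{22}$$

taking the partial derivative with respect to each entry gives

$$\nabla_A f(A) = \begin{bmatrix} \frac{3}{2} & 10 A_{12} \\ A_{22} & A_{21} \end{bmatrix}$$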
# Reference

Ng, A., 2003. Supervised Learning. [online] CS229.stanford.edu. Available at: http://cs229.stanford.edu/notes/cs229-notes1.pdf [Accessed 18 March 2020].