Pre-requisites

Interchange between notations

Before starting out, I would like to familiarise readers with some notational somersaults we will perform in this blog.

Let ${S = x_{1}^{2} + x_{2}^{2} + x_{3}^{2} + ... + x_{m}^{2} }$. Another way of writing this equation is

${S = \Sigma_{i=1}^{m}\space x_{i}^{2}}$.

If we collect ${x_{1}, x_{2},...,x_{m}}$ in a single vector ${\small X}$, such that ${\small X = \left[ \begin{array}{ccc} x_{1}\\ x_{2}\\ . \\ . \\ x_{m} \end{array}\right]}$, then ${\small S}$ can be written as ${X^{T}X}$.
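As a quick numerical check of this identity, here is a tiny numpy sketch (the vector values are made up for illustration): the elementwise sum of squares and the quadratic form ${X^{T}X}$ give the same number.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# elementwise: x_1^2 + x_2^2 + ... + x_m^2
s_sum = np.sum(x ** 2)

# vector form: X^T X
s_vec = x.T @ x

print(s_sum, s_vec)  # 30.0 30.0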

Minimizing the squared error loss

Minimizing the Squared Error is the standard technique for tackling a regression problem, i.e. one in which the target is a continuous variable. Let's say we have collected all our input variables in a matrix ${X}$ of shape (m, n), where m is the number of training examples and the ${i^{th}}$ row of ${X}$, i.e. the ${i^{th}}$ training example, is represented by an n-dimensional vector ${x_{i}}$. All the target variables are collected in a vector ${Y}$ of length m (each training example has a real number as its target), and the ${i^{th}}$ element of ${Y}$ is denoted by ${y_{i}}$. Our task is to find a vector ${\hat\beta}$ of length n such that ${X\hat\beta = Y}$.

Now, this task could simply be solved by saying ${\hat\beta = X^{-1}Y}$, but this is the correct answer only when ${X}$ is a square, invertible matrix and ${Y}$ lies in the column space of ${X}$, i.e. m = n and ${Y}$ is simply a weighted sum of the columns of ${X}$. If that's not the case, ${X\hat\beta}$ can never be equal to ${Y}$. All we can hope to do is minimize the distance between the two vectors ${X\hat\beta}$ and ${Y}$. The distance we choose to minimize is ${(Y-X\hat\beta)^{T}(Y-X\hat\beta)}$, which can also be written as ${\Sigma_{i=1}^{m} \space (y_{i}-x_{i}^{T}\hat\beta)^{2}}$. Now, if you are like me, you may wonder why we don't simply minimize ${Y - X\hat\beta}$ or some other measure of distance. This is a perfectly valid question, and we are going to explore it in this blog. For now, just remember that ${\hat\beta}$ is the value of ${\beta}$ that minimises the squared error, written mathematically as: ${\hat\beta = argmin_{\beta} \space(Y-X\beta)^{T}(Y-X\beta)}$.

We can find ${\hat\beta}$ using calculus.

${\frac{\partial ((Y-X\beta)^{T}(Y-X\beta))}{\partial \beta} = 0}$ at ${\hat\beta}$

${\therefore\small -X^{T}(Y - X\hat\beta) - X^{T}(Y - X\hat\beta) = 0 \implies \hat\beta = (X^{T}X)^{-1}X^{T}Y}$, assuming ${X^{T}X}$ is invertible.
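As a sanity check of this closed-form solution, here is a minimal numpy sketch on synthetic data (the data, noise level, and variable names are mine, purely for illustration): it computes ${\hat\beta}$ with the formula above and verifies that the gradient ${-2X^{T}(Y - X\hat\beta)}$ is numerically zero.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # 50 examples, 3 features
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# closed-form least squares solution
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# the gradient of the squared error at beta_hat should be (numerically) zero
grad = -2 * X.T @ (Y - X @ beta_hat)
print(np.allclose(grad, 0))                         # True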

Maximum Likelihood Estimation

Another way of looking at the regression task is that we have m observations ${(x_{1},y_{1}), (x_{2},y_{2}), ..., (x_{m},y_{m})}$, where each ${x_{i}}$ is an n-dimensional vector and ${y_{i}}$ is its corresponding scalar target value. These m data points can be thought of as m random samples from an unknown data distribution ${P(X,Y\space;\space\theta_{true})}$, where ${\theta_{true}}$ is the true parameter of the distribution. If this doesn't make sense, let's take an example.

Let's say we want to predict how many goals a soccer player will score in his next season by just looking at the player's age and the number of goals he scored last season. Therefore, our independent variable is the random variable ${X =}$ [Age, # of goals scored in last season] and our dependent variable is ${Y =}$ number of goals scored in the next season, where ${0 \leq Y \leq 100}$. Messi may be represented by the vector ${(33,40)}$ and Ronaldo by ${(34,37)}$. Now, what do you think ${P([18,1], 25)}$ will be? In other words, what is the probability that a player aged 18, having scored just one goal in the previous season, scores 25 in his next season? We can safely say that this is a highly unlikely event, since the player is inexperienced and the number of goals he scored in the previous season suggests that he may be a defender. In the same way, we would expect ${P([28,40], 32)}$ to be high, since the player is at an age when soccer players reach their peak, is a proven goalscorer, and is therefore highly likely to score 32 in the next season.

Now, instead of guessing, we want to construct a probability distribution over the random variables ${X}$ and ${Y}$. Once we have such a distribution, we can simply plug in the age and goals scored in the past season of any player, calculate ${P([\text{age}, \text{goals}], y\space ; \space \theta_{true})}$ for each ${y}$ in ${[0,100]}$, and report the ${y}$ which gives the highest probability. Unfortunately, we can't know the true distribution, since for that we would have to collect data from every single active player, which is impractical. But we can estimate the true distribution. Any distribution is entirely characterised by its parameters. For example, a Gaussian distribution is characterised by its mean and variance and is denoted ${N (x; \mu, \sigma^{2})}$. If we can estimate the parameters of a distribution, then we're golden. Let's denote the estimate of the parameter ${\theta_{true}}$ as ${\hat\theta}$.

To construct such an estimate of the distribution, we collect the age and goals scored in the previous season of m players, observe them for the next season, and note down the number of goals they score. Let each of these observations be represented by ${(x_{i}, y_{i})}$. Some players may have the same ${x}$, either by virtue of errors in data collection or because they are of the same age and scored the same number of goals last season. Let's assume every observation is independent of every other observation. Given each ${x_{i}}$ we should predict ${y_{i}}$, and this can only happen when ${P(x_{i}, y_{i}\space ;\space \hat\theta)}$ is greater than every other ${P(x_{i}, y\space ; \space \hat\theta)}$ where ${y \neq y_{i}}$. The quantity ${P(x_{i}, y_{i})}$ is known as the likelihood of the observation ${(x_{i}, y_{i})}$. Since we want the likelihood of all the observations to be as large as possible, and the observations are independent, we might as well say that we want the quantity ${P(x_{1}, y_{1})\cdot P(x_{2}, y_{2})\cdot P(x_{3}, y_{3}) ... P(x_{m}, y_{m})}$ to be maximum, which in short can be written as ${\Pi_{i=1}^{m}P(x_{i}, y_{i}\space ; \space \theta)}$. Then ${\hat\theta}$ will be whatever value of ${\theta}$ achieves this feat.

${\therefore \hat\theta = argmax_{\theta} \space\Pi_{i=1}^{m} P(x_{i}, y_{i}\space ; \space \theta) = argmax_{\theta} \space P(X,Y\space ; \space \theta)}$.

In the real world, since the product of many numbers less than one can lead to numerical underflow, we take the logarithm of the objective function (the logarithm is monotonically increasing, so it doesn't change the location of the maximum):

${\hat\theta = argmax_{\theta} \space log \space\Pi_{i=1}^{m} P(x_{i}, y_{i}\space ; \space \theta) = argmax_{\theta} \space\Sigma_{i=1}^{m}log \space P(x_{i}, y_{i}\space ; \space \theta) = argmax_{\theta} \space log\space P(X,Y\space ; \space \theta)}$.
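To see why the log matters numerically, here is a small illustrative sketch (the likelihood values are made up): multiplying a few thousand likelihoods that are each less than one underflows to zero, while summing their logs stays perfectly representable.

import numpy as np

# 2000 made-up likelihood values, each less than one
likelihoods = np.full(2000, 0.05)

print(np.prod(likelihoods))           # 0.0 -- numerical underflow
print(np.sum(np.log(likelihoods)))    # about -5991.5, no underflow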

This method of estimating ${\theta_{true}}$ is called maximum likelihood estimation, and the obtained ${\hat\theta}$ is called the maximum likelihood estimate of ${\theta_{true}}$. Maximum likelihood estimates of true parameters have many desirable properties, the most important of which is: as we increase the number of training examples, the probability of the estimate being close to the true parameter increases. Well, this should of course be true. What's special about maximum likelihood estimates is that, asymptotically, they do so faster than any other type of estimate. In other words, if we are using maximum likelihood estimates to estimate our parameters and someone else is using other methods, then our predictions will improve faster as we collect more training data.

Minimising Squared Error ${\rightarrow}$ Maximising Likelihood

Let's get back to minimising the squared error and see, through a sequence of logical steps, how it is maximum likelihood estimation in disguise.

${\hat\beta = argmin_{\beta} \space(y_{1}-x_{1}^{T}\beta)^2 + (y_{2}-x_{2}^{T}\beta)^{2} + ... + (y_{m}-x_{m}^{T}\beta)^2 = \small {argmin_{\beta} \space(Y-X\beta)^{T}\space I\space(Y-X\beta)}}$ where ${I}$ is the identity matrix.

As mentioned in the previous section, two or more observations may have the same ${x_{i}}$ but different ${y_{i}}$ in the dataset. For generality, let's assume that only p out of m (where p < m) ${x_{i}}$'s are unique. We can then collect the duplicate ${x}$'s in a separate matrix for each unique value, and their corresponding targets can be arranged in a vector. This way we'll have p matrices, viz. ${X_{1}, X_{2}, ..., X_{p}}$, each possibly containing a different number of rows, where all the rows of a given ${X_{i}}$ are identical (since they carry all the copies of the same ${x}$).

We can understand what we just did with a concrete example. Let's say we have 5 observations: ${[(1,2), 10],\, [(1,3),9],\, [(1,2), 12],\, [(1,4),15],\, [(1,3),11]}$. We can see some duplicate ${x}$'s in these observations; there are only three unique values of ${x}$. Arranging the duplicate ${x}$'s in matrices and their corresponding labels in vectors, we get: ${X_{1} = \left[ \begin{array}{ccc} 1 & 2\\ 1 & 2\\ \end{array} \right]}$, ${X_{2} = \left[ \begin{array}{ccc} 1 & 3\\ 1 & 3\\ \end{array} \right]}$, ${X_{3} = \left[ \begin{array}{ccc} 1 & 4\\ \end{array} \right]}$; ${Y_{1} = \left[ \begin{array}{ccc} 10\\ 12\\ \end{array} \right]}$, ${Y_{2} = \left[ \begin{array}{ccc} 9\\ 11\\ \end{array} \right]}$, ${Y_{3} = \left[ \begin{array}{ccc} 15\\ \end{array} \right]}$

After this regrouping, the squared loss can be re-written as:

${\hat\beta = argmin_{\beta}\space (Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta) \space+\space ...\space +\space (Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)}$
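Here is a small sketch of this regrouping (the grouping code and variable names are mine): it builds ${X_{1}, X_{2}, X_{3}}$ and ${Y_{1}, Y_{2}, Y_{3}}$ from the five observations above and checks that the grouped loss equals the original squared error for an arbitrary ${\beta}$.

import numpy as np

# the five observations from the example above
xs = [(1, 2), (1, 3), (1, 2), (1, 4), (1, 3)]
ys = [10, 9, 12, 15, 11]

# group the targets by their (duplicate) inputs
groups = {}
for x, y in zip(xs, ys):
    groups.setdefault(x, []).append(y)

X_groups, Y_groups = [], []
for x, targets in groups.items():
    X_groups.append(np.tile(np.array(x), (len(targets), 1)))
    Y_groups.append(np.array(targets, dtype=float))

# the grouped loss equals the original squared error for any beta
beta = np.array([1.0, 3.0])                       # arbitrary beta for the check
X, Y = np.array(xs), np.array(ys, dtype=float)
full_loss = (Y - X @ beta) @ (Y - X @ beta)
grouped_loss = sum((Yi - Xi @ beta) @ (Yi - Xi @ beta)
                   for Xi, Yi in zip(X_groups, Y_groups))
print(np.isclose(full_loss, grouped_loss))        # True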

Now we'll modify this objective function through a few steps which may look nonsensical at first but will yield a fascinating interpretation in the end. So, please bear with me for a while.

Multiplying a function by a negative number turns its minima into maxima and vice versa. So, let's multiply our function by ${-0.5}$ and change argmin to argmax.

${\therefore \hat\beta = argmax_{\beta} \space -0.5[\space(Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta) \space+\space ...\space +\space (Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)\space]}$

Exponentiating the function and then taking its log returns the function itself, because log and exp are inverse functions of each other.

${\therefore \hat\beta = argmax_{\beta} \space log\space e^{-0.5[\space(Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta) \space+\space ...\space +\space (Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)\space]}}$

${\therefore \hat\beta = argmax_{\beta} \space log\space e^{\small -0.5\space(Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta)} \space+\space ...\space +\space log\space e^{\small -0.5\space(Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)\space}}$

Subtracting a constant from a function doesn't change the location of its maxima and minima. So, let's subtract ${log (2\pi)^{0.5m}}$, where m is the number of observations in our dataset.

${\therefore \hat\beta = argmax_{\beta} \space log\space e^{\small -0.5\space(Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta)} \space+\space ...\space +\space log\space e^{\small -0.5\space(Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)\space} - log (2\pi)^{0.5m}}$.

Let ${k_{i}}$ be the length of vector ${Y_{i}}$, then

${log (2\pi)^{0.5m} = log(2\pi)^{0.5(k_{1} \space +\space ... \space +\space k_{p})} = log(2\pi)^{0.5k_{1}} \space + \space ...\space+ \space log(2\pi)^{0.5k_{p}}}$

${\therefore \hat\beta = argmax_{\beta} \space [\space log\space e^{\small -0.5\space(Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta)} - log(2\pi)^{0.5k_{1}}]\space+\space ...\space +\space [log\space e^{\small -0.5\space(Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)\space} - log(2\pi)^{0.5k_{p}}]}$.

Simplifying further,

${\hat\beta = argmax_{\beta} \space log\space \frac{e^{\small -0.5\space(Y_{1} - X_{1}\beta)^{T}(Y_{1} - X_{1}\beta)}}{(2\pi)^{0.5k_{1}}}\space+\space ...\space +\space log\space \frac{e^{\small -0.5\space(Y_{p} - X_{p}\beta)^{T}(Y_{p} - X_{p}\beta)}}{(2\pi)^{0.5k_{p}}}}$.

Just to jog your memory, let me write the expression for the probability density of a random variable ${W}$ which has a multivariate normal distribution with mean vector ${\mu}$ and covariance matrix ${I}$:

${P_{normal}(\small W\space ; \mu, \small I) = \frac{\normalsize exp\left(\frac{-1}{2}(W-\mu)^{T}\space(W-\mu)\right)}{\normalsize{(2\pi)^{0.5k}}}}$ where k is the length of the mean vector. This expression looks strikingly similar to each term of our morphed squared-error minimization objective.

${\therefore \hat\beta = argmax_{\beta}\space log P_{normal}(Y_{1}\space ; \space X_{1}\beta, I) \space + \space ... +\space log P_{normal}(Y_{p}\space ; \space X_{p}\beta, I)}$
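If you'd like to verify this equivalence numerically, here is a small sketch using scipy's multivariate_normal (${\beta}$ is arbitrary; the grouped matrices are the ones from the running example): the sum of Gaussian log-densities differs from ${-0.5}$ times the squared error only by the constant ${log (2\pi)^{0.5m}}$ we subtracted.

import numpy as np
from scipy.stats import multivariate_normal

# grouped observations from the running example
X1, Y1 = np.array([[1, 2], [1, 2]]), np.array([10.0, 12.0])
X2, Y2 = np.array([[1, 3], [1, 3]]), np.array([9.0, 11.0])
X3, Y3 = np.array([[1, 4]]), np.array([15.0])
groups = [(X1, Y1), (X2, Y2), (X3, Y3)]

beta = np.array([1.0, 3.0])                       # arbitrary beta for the check
m = sum(len(Yi) for _, Yi in groups)

# sum of Gaussian log-densities with mean X_i @ beta and identity covariance
log_lik = sum(multivariate_normal.logpdf(Yi, mean=Xi @ beta, cov=np.eye(len(Yi)))
              for Xi, Yi in groups)

# -0.5 * squared error minus the constant log(2*pi)^(0.5*m)
sse = sum((Yi - Xi @ beta) @ (Yi - Xi @ beta) for Xi, Yi in groups)
print(np.isclose(log_lik, -0.5 * sse - 0.5 * m * np.log(2 * np.pi)))   # True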

This means that by minimising the squared error we were indeed maximising the log-likelihood of several Gaussian random variables. Also, we were implicitly assuming that, given a particular ${x}$, the labels associated with it form a Gaussian random variable with mean ${x^{T}\beta}$ and a variance of 1.

Now we have turned the task of finding the argument that minimises the squared error into the task of finding the argument that maximises the likelihood of a multivariate Gaussian distribution, which is very easy if you remember what a Gaussian curve looks like. If not, given below is a picture of the graph of the probability density of a random variable with a Gaussian distribution which has scalar mean ${\mu}$ and scalar variance ${\sigma^{2}}$, i.e. ${P_{normal}(X\space ; \space \mu, \sigma^{2})}$

In this picture, we can see that the maximum of the density occurs at the mean of the random variable, also called the expectation of the random variable and denoted ${E[X]}$. This means that ${P_{normal}(Y_{i}\space ;\space X_{i}\beta, I)}$ is maximised when ${X_{i}\beta}$ is the mean of the variable ${Y_{i}}$.

${\therefore X_{i}\hat\beta = E[Y_{i}]}$.

This means that for a given ${x}$ in the dataset, if we predict the mean of all the labels it is associated with, the squared error will be at its minimum.
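To convince yourself of this last point in isolation, here is a tiny sketch (the grid search is mine, purely for illustration): for the labels 10 and 12, the single prediction that minimises the squared error is their mean, 11.

import numpy as np

labels = np.array([10.0, 12.0])
candidates = np.arange(5, 16)                 # candidate predictions 5, 6, ..., 15

# squared error of each candidate prediction against both labels
errors = ((labels[None, :] - candidates[:, None]) ** 2).sum(axis=1)

print(candidates[np.argmin(errors)])          # 11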

A Small Example

Let's take our previous dataset ${([(1,2), 10],\, [(1,3),9],\, [(1,2), 12])}$ and try to fit a linear model on it by minimizing the squared error. Arranging the input and target variables separately, we have ${X = \left[ \begin{array}{ccc} 1 & 2\\ 1 & 3\\ 1 & 2\\ \end{array} \right]}$ and ${Y = \left[ \begin{array}{ccc} 10\\ 9\\ 12\\ \end{array} \right]}$. As we saw earlier, we can use a simple formula to find the vector ${\hat\beta}$ such that ${X\hat\beta \approx Y}$, which is ${\hat\beta = (X^{T}X)^{-1}X^{T}Y}$.

But why go this route when we can calculate ${X\hat\beta}$ in our heads? For a given ${x}$, just predict the mean of all the ${y}$'s it is associated with.

Therefore, for ${(1,2)}$ we predict the mean of 10 and 12, which is 11, and for ${(1,3)}$ we predict just 9.

${\therefore X\hat\beta = \left[ \begin{array}{ccc} 11\\ 9\\ 11\\ \end{array} \right]}$

Let's verify that the normal equations give the same results:

import numpy as np

X = np.array([[1, 2], [1, 3], [1, 2]])
Y = np.array([10, 9, 12])

# normal equations: beta_hat = (X^T X)^{-1} X^T Y
B = np.linalg.inv(X.transpose() @ X) @ (X.transpose() @ Y)

print(X @ B)
[11.  9. 11.]
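As an optional aside (not part of the derivation above), explicitly inverting ${X^{T}X}$ can be numerically fragile for larger or ill-conditioned problems; numpy's least-squares solver, reusing the X and Y defined above, gives the same answer:

# same solution, but computed without forming the inverse explicitly
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(X @ B_lstsq)    # approximately [11.  9. 11.]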

Conclusion

Since Squared Error Minimization is just Maximum Likelihood Estimation in disguise, we can see the reason for its widespread adoption for regression tasks. Maximum likelihood estimates of the parameters of models have many desirable properties, which is why we design our minimization objectives around them. Of course, on large datasets things aren't as simple as applying the normal equations to get to the minimum of the squared error, since there may not be a linear relation between inputs and outputs in the first place, or our implicit assumption that, given an input, its labels are a Gaussian random variable with unit variance may be flawed. Therefore, in those cases ${\hat\beta}$ is found by descending the gradient of the objective function.