Linear Regression in R
7 Aug 2016 · 1211 words

As humans, we’re drawn to cause and effect. Even when it’s wrong of us to do so.

When it gets hot, it’s not necessarily because of global warming. An increase in homeless people will not definitively cause an increase in crime rates. Male drivers crash more than female drivers, but maybe this isn’t because of their increased testosterone levels.

We’ve all heard the correlation vs causation argument before. Our System 2 conscious mind knows that cause and effect isn’t always correct.

System 1, on the other hand, can’t get enough of cause and effect. And thus linear regression was born.

We’ve been slaves to it ever since. Ever since we discovered that plotting data on a graph can look cool, we’ve been drawing straight lines through everything we can - even when we shouldn’t.

Let’s ignore all that for now.

## What’s up with linear regression?

As you probably know, linear regression is fitting straight lines to things.

Typically you’ve got something you’re measuring which we call a dependent variable. This could be something like house prices, body fat percentages, or number of balls taken to dismiss an Australian top-order batsman. We denote this by $y$.

You’re typically interested in how the thing that you’re measuring changes when other factors change, called independent variables. Examples of independent variables are house location, person weight or if the Aussies are playing in the subcontinent (as true or false, or 1 or 0).  This is denoted by $x$.

Simple linear regression is when we only have one independent variable, or one $x$ value. As the name suggests, it’s rather simple.

Multiple linear regression is where we have multiple independent variables. We denote these by $x_1, x_2, \dots, x_n$.

The relationship between $y$ and the $x$’s is represented by

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

where $\beta_0, \dots, \beta_n$ are real-valued coefficients, and $\epsilon$ is a normally distributed random variable with mean zero.

We denote the number of data points (also known as the training set) by $m$.

The dependent variable $y$ is represented by $y_1, y_2, \dots, y_m$, where $y_1$ is the $y$ value of the first data point, $y_5$ is the value of the fifth data point and so forth.

The independent variables are similarly represented  by $x_1^{(1)}, x_1^{(2)}, \dots, x_1^{(m)}, x_2^{(1)}, x_2^{(2)}, \dots, x_2^{(m)}, \dots, x_n^{(1)}, x_n^{(2)}, \dots, x_n^{(m)}$. The value in the brackets isn’t an exponent of each variable - rather, it represents the number of the data point.

There are multiple notations for representing data, and no single one is correct. The notation of other sources will be different to what I’ve used here, but essentially it’s all representing the same thing.

You’ll often see this in vector form:

$$\mathbf{y} = \boldsymbol{\beta X }+ \boldsymbol{\epsilon}$$

where the meanings of $\mathbf{y}, \boldsymbol{\beta}, \mathbf{X}$ and $\boldsymbol{\epsilon}$ are as found here. There’s a lot more to it than this, but the basic idea remains the same. If you’re interested in some of the sweet details visit the Wikipedia page and browse away.

There’s an interesting question here.

If you’ve collected your data, you’ll know what your $y$ values are, and you’ll have your data points for $x_1, \dots, x_n$. But how do you find $\beta_0, \beta_1, \dots, \beta_n$? You can’t do much without your coefficients.

Here’s an easy way.

## The normal equations

What you can do is use your knowledge of calculus and linear algebra to derive what’s called the normal equations. The normal equations are an analytic solution for all your $\beta$’s, so you can just plug your values into the equation and collect your linear regression coefficient.

The derivation and all the details can be found here.

To summarise the above result, the analytic solution for $\boldsymbol{\beta}$ is

$$\boldsymbol{\beta} = \left( \mathbf{X^T X} \right) ^{-1} \mathbf{X^T y}$$

## Let’s code up this shit

What we will do now is simulate some data and see if we can implement linear regression ourselves.

R offers a function to do linear regression:lm(). We can implement the least-squares solution and compare it to the answer that lm() gives us. If they are the same, we’ve done a good job!

First, let’s simulate data. We’ll simulate 10 data points from three independent $x$ variables.

x1 = rnorm(10,0,1)
x2 = rpois(10,1)
x3 = rbinom(10, 40, 0.07)
error = rnorm(10,0,0.5)
y = 4*x1 + 10*x2 - 3*x3 + error

Our independent variables are $x_1, x_2, x_3$ and I’ve given them all different distributions. The “error” variable adds some noise to our data. The dependent variable $y$ is a linear combination of these other factors - just as we would expect for a real world example.

We’ve essentially simulated data from

$$y = 4 x_1 + 10 x_2 - 3 x_3 + \epsilon$$

where $\epsilon \sim N(0,0.5)$.

We can implement the normal equation and see how we go in recreating these coefficient values. We won’t get them exactly correct (due to the noise in the data) but we should get similar numbers.

Let’s create our matrix $\mathbf{X}$. This matrix will have dimensions $m \times (n + 1)$, where we have $m$ data points and $n$ independent variables. In our case we have 10 data points and 3 independent variables, so our matrix will have dimensions $10 \times 4$.

Each column of $\mathbf{X}$ represents one dependent variable except the first column.  The first column is a vector of 1’s added on to allow a parameter estimate for $\beta_0$. The other columns will consist of our values for $x_1, x_2$ and $x_3$.

xmat = cbind(x1 = x1, x2 = x2, x3 = x3)

#make a 1's intercept vector
ones = rep(1,nrow(xmat))
xmat = cbind(intercept = ones,xmat)

We can now solve directly for $\boldsymbol{\beta}$.

beta = solve(t(xmat) %*% xmat)%*% t(xmat) %*% y

Note solve() function takes the inverse of a matrix, while the t() function calculates the transpose of a matrix.

How’d we go for $\boldsymbol{\beta}$?

beta

#              [,1]
# intercept  1.087445
# x1         3.686058
# x2         9.853127
# x3        -3.137238

That’s pretty close to our true coefficient values! Now let’s compare this with the result from lm():

lm(y~x1 + x2 + x3)

# Call:
# lm(formula = y ~ x1 + x2 + x3)

# Coefficients:
#    (Intercept)       x1           x2           x3
#        1.087        3.686        9.853       -3.137

Looks like it works!

## In summary

The normal equation offers an easy way to find the coefficients of $\boldsymbol{\beta}$. It’s not the only way, however.

Another method is called gradient descent. This is an iterative method where the coefficients converge to their true values over time. While it doesn’t give an exact solution as the normal equations do, it does offer other advantages in terms of speed and flexibility. We’ll cover this in a later post.

Just remember: straight lines may not have the answer.