Ever wondered how logistic regression really works?

You’re not alone. It’s confusing.

This post drills down into the mechanics of logistic regression. We’ll go through the mathematical aspects and also construct our own logistic regression function in R. Exciting!

## What is logistic regression?

In linear regression the dependent variable $Y$ takes continuous values, like house prices for an area or the income of a individual.

Logistic regression differs in that it is used for modelling categorical data. Typically you model data that falls into two categories, but using logistic regression with more than two categories is certainly possible.

We will assume for this post that our response variable $Y$ falls into exactly two categories. Here are some examples where you might want to use logistic regression:

- If a plant lives/ dies
- If someone was accepted/rejected for a home loan
- If a student graduates/drops out of university
- Owning/not owning a TV/laptop/segway/aircraft

We represent one of the categories by $Y = 1$ and the other by $Y = 0$. For example a plant that lives could be represented by $Y=1$, and a dead plant by $Y=0$.

## Mathematical representation

Let us have $n$ data points, or training samples. Each one of these training samples is comprised of measurements of $m$ independent variables which are denoted by $x_1, x_2, \cdots, x_n$. The realised values of these variables are given by $x_1 = x_1, x_2 = x_2, \cdots, x_m = x_m$.

Each training sample also has a value of the dependent variable $Y$ associated with it, and we let $y$ be the realised value of $Y$. We have that $Y = y$ where either $y =1 $ or $y=0$.

The $i$th training sample will consist of the data points $x_{i1}, x_{i2}, \cdots, x_{im}$ and the observed response value $y_i$. Because our data is all realised values (we know the actual values of the variables) we use lowercase notation to refer to it.

We refer to individual column vectors like

\begin{align}
\mathbf{x}_{*1} = \begin{bmatrix}
x_{11} \

x_{21} \

\vdots \

x_{n1}
\end{bmatrix}
\end{align}

and row vectors like

$$ \mathbf{x}_{1 *} = \left[ x_{11}, x_{12}, \cdots, x_{1m} \right] $$

We will find it useful to define the matrix $\mathbf{X}$, which is known as the design matrix. Let

$$ \mathbf{X} = \left[ \mathbf{x}_{* 0}, \mathbf{x}_{* 1},\mathbf{x}_{*2}, \cdots,\mathbf{x}_{*m} \right] $$

where $\mathbf{x}_{*0}$ is a vector of 1’s that allows us to estimate the intercept term.

In addition, let

$$ p_i = p_i (\mathbf{x}_{i*}) = P(Y = 1 | X = \mathbf{x}_{i*}) $$

, the probability of the response being 1 given the data $\mathbf{x}_{i*}$. You can then set the vector of probabilities $\mathbf{p}$ as a $N \times 1 $column vector where $p_i$ is the probability of $y_i = 1$.

The logistic regression model can be modelled as

$$ \log \left( \frac{\mathbf{p}}{1-\mathbf{p}} \right) = \mathbf{X} \cdot \boldsymbol{\beta} $$

where $\boldsymbol{\beta} = \left[ \beta_0, \beta_1, \cdots, \beta_m \right]$ is the vector of coefficients. Solving for $\mathbf{p}$ gives

$$ \mathbf{p} = \frac{\exp( \mathbf{X} \cdot \boldsymbol{\beta})}{1+\exp(\mathbf{X} \cdot \boldsymbol{\beta})} = \frac{1}{1+\exp \left(-(\mathbf{X} \cdot \boldsymbol{\beta}) \right) } $$

## Likelihood function

Looking at our data, for each vector of features $\mathbf{x}_{i*}$ we have an observed class $y_i$. We’ll be able to fit a likelihood to the data.

This is how we’ll find the $\beta$ coefficients - by maximising the likelihood function with respect to $\beta$.

I’ll only provide the key findings below, skipping over some of the mathematical steps. Feel free to try the derivations yourself and see if you can get to the same answer.

We have that $y_i = 1$ with probability $p_i$, and $y_i = 0$ with probability $1-p_i$. From this we can formulate the likelihood function

$$ L = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{(1-y_i)} $$

and hence the log likelihood

$$ \log(L) = \ell = \sum_{i=1}^n \left[ y_i \log(p_i) + (1- y_i) \log (1-p_i) \right] $$

which can be rearranged to get

$$ \ell = \sum_{i=1}^n \left[ y_i \boldsymbol{\beta}^T \mathbf{x}_{i*} - \log \left(1 + \exp( \boldsymbol{\beta}^T \mathbf{x}_{i*} \right) \right] $$

by substituting in the definition of $p_i$.

We can take the derivative of the log likelihood and see for what value of $\beta$ it equals zero: in other words, looking for a place where the slope of the $\beta $ vector is zero. This is the standard way of maximising the log likelihood and ends up being

$$ \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \mathbf{x}_{i*}(y_i - p_i) = 0 $$

which is not just one equation, but a series of them!

In fact, matrix notation is the easiest way that we can represent this. We can rewrite it as

$$ \frac{\partial \ell}{\partial \beta} = \mathbf{X}^T (\mathbf{y}-\mathbf{p} ) $$

which is a lot easier to work with in R.

To solve this we’re going to use an iterative algorithm called the Newton-Raphson method. In order to be able to implement the method, it turns out we’re going to require the second derivative of the log likelihood. Let’s just grab that now:

$$ \frac{\partial ^2 \ell}{\partial \beta^2} = - \sum_{i=1}^n \mathbf{x}_{i*} \mathbf{x}_{i*}^T p_i (1-p_i) $$

Again we can rewrite this in matrix form but this time we’re going to define a new matrix $\mathbf{W}$ to help us out. Let $\mathbf{W}$ be a $n \times n$ diagonal matrix where the $i$th diagonal element of $\mathbf{W}$ is equal to $p_i (1-p_i)$. Another way of saying this is $W_{ii} = p_i (1-p_i)$ where $ 1 \leq i \leq n$.

Rewriting gives

$$ \frac{\partial ^2 \ell}{\partial \beta^2} = - \mathbf{X}^T \mathbf{WX} $$

## The Newton-Raphson Method

Lacking an explicit solution for $\boldsymbol{\beta} $ like we had for linear regression, we are forced to turn to alternate non-exact methods. One of these methods is the Newton-Raphson method.

The Newton-Raphson method is an iterative method used to calculate maximum likelihood estimators. Lucky us that $\boldsymbol{\beta}$ is a maximum likelihood estimator!

I won’t go into the method in depth, but a good explanation of Newton-Raphson can be found here. The important thing to remember is that it’s an iterative method where in each iteration $\boldsymbol{\beta}$ moves closer to its best value (ideally). The term “best value” here means the $\boldsymbol{\beta}$ with the highest likelihood, so on every iteration we should find that our likelihood increases with each new $\boldsymbol{\beta}$.

The good news for us is that we’ve already got everything we need to perform the Newton-Raphson method. We’re going to have to clear up the notation a little, however.

Define $\boldsymbol{\beta}_{new}$ as the $\boldsymbol{\beta}$ vector that we get after performing a Newton update, and $\boldsymbol{\beta}_{old}$ as $\boldsymbol{\beta}$ from the previous iteration.

A Newton update is then:

$$ \boldsymbol{\beta}_{new} =\boldsymbol{\beta}_{old} - \left( \frac{\partial ^2 \ell}{\partial \beta^2} \right) ^{-1}\frac{\partial \ell}{\partial \beta } $$

Substituting in the values that we got earlier gives

$$ \boldsymbol{\beta}_{new} =\boldsymbol{\beta}_{old} + \left( \mathbf{X}^T \mathbf{WX} \right) ^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p} ) $$

In each Newton update (or iteration), our vector of probabilities $\mathbf{p}$ will update as we recalculate it using an updated $\boldsymbol{\beta}$. $\mathbf{W}$ will also change with each iteration as a consequence of its direct reliance on $\mathbf{p}$.

## Applying the Newton-Raphson method

The algorithm can be split into two sections: setup for the iterative loop, and the iterative loop itself.

For the setup:

- Set $\boldsymbol{\beta}$ to some initial value - a guess for the coefficients of the logistic regression. You can achieve quicker convergence if you already have a rough idea of what your coefficients are likely to be, since you can set the initial guess to a value close to its true value. This’ll lead to less iterations but the difference shouldn’t be too significant.
- Set a threshold value $\epsilon$ . The threshold value is used to check if the logistic regression algorithm is converging or not. The idea is that the amount that $\boldsymbol{\beta}$ changes by should get smaller and smaller each iteration and eventually drop below our threshold value. If it never drops below the threshold value, then we can say that the algorithm isn’t converging and terminate it.
- Set an iterations counter. We’ll use this counter to track the number of iterations, and if we find that $\boldsymbol{\beta}$ isn’t converging after some predetermined number of iterations (i.e. $| \boldsymbol{\beta}_{old} -\boldsymbol{\beta}_{new} | > \epsilon $) then we’ll end the algorithm.

The iterative part then goes like this:

- Calculate $\mathbf{p}$ using $\mathbf{p} = \frac{\exp(\mathbf{X} \cdot \boldsymbol{\beta})}{1+\exp( \mathbf{X} \cdot \boldsymbol{\beta})}$
- Calculate $\mathbf{W}$ using the updated $\mathbf{p}$. Remember that $\mathbf{W}$ is a diagonal matrix where $W_{ii} = p_i (1-p_i)$.
- Update $ \boldsymbol{\beta}}$ by using $\boldsymbol{\beta}_{new} =\boldsymbol{\beta}_{old} + \left( \mathbf{X}^T \mathbf{WX} \right) ^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p} ) $, where $\boldsymbol{\beta}_{old}$ is the $\boldsymbol{\beta}$ value from the previous iteration.
- See how much $\boldsymbol{\beta}$ has changed by in the current iteration. If it’s less than the threshold value then end the loop and return $\boldsymbol{\beta}$.
- See if the number of iterations we’ve done is higher than some predetermined number (like 100). If so then end the algorithm and display an error message along the lines of “the function failed to converge”.

## In R

We’ll first generate some sample data. We’ll have 30 sample points and three independent variables, which we’ll imaginatively name $x_1, x_2$ and $x_3$. Our dependent variable is denoted by $y$.

```
set.seed(2016)
#simulate data
#independent variables
x1 = rnorm(30,3,2) + 0.1*c(1:30)
x2 = rbinom(30, 1,0.3)
x3 = rpois(n = 30, lambda = 4)
x3[16:30] = x3[16:30] - rpois(n = 15, lambda = 2)
#dependent variable
y = c(rbinom(5, 1,0.1),rbinom(10, 1,0.25),rbinom(10, 1,0.75),rbinom(5, 1,0.9))
```

We construct next our design matrix $\mathbf{X}$ which contains our data for $x_1, x_2$ and $x_3$ as well as a vector $x_0$ as a column of 1’s. Remember that we need $x_0$ in order to calculate the intercept $\beta_0$.

```
x0 = rep(1,30) #bias
X = cbind(x0,x1,x2,x3)
```

Here’s the algorithm. The code pretty much follows the steps described above.

```
manual_logistic_regression = function(X,y,threshold = 1e-10, max_iter = 100)
#A function to find logistic regression coeffiecients
#Takes three inputs:
{
#A function to return p, given X and beta
#We'll need this function in the iterative section
calc_p = function(X,beta)
{
beta = as.vector(beta)
return(exp(X%*%beta) / (1+ exp(X%*%beta)))
}
#### setup bit ####
#initial guess for beta
beta = rep(0,ncol(X))
#initial value bigger than threshold so that we can enter our while loop
diff = 10000
#counter to ensure we're not stuck in an infinite loop
iter_count = 0
#### iterative bit ####
while(diff > threshold ) #tests for convergence
{
#calculate probabilities using current estimate of beta
p = as.vector(calc_p(X,beta))
#calculate matrix of weights W
W = diag(p*(1-p))
#calculate the change in beta
beta_change = solve(t(X)%*%W%*%X) %*% t(X)%*%(y - p)
#update beta
beta = beta + beta_change
#calculate how much we changed beta by in this iteration
#if this is less than threshold, we'll break the while loop
diff = sum(beta_change^2)
#see if we've hit the maximum number of iterations
iter_count = iter_count + 1
if(iter_count > max_iter) {
stop("This isn't converging, mate.")
}
}
#make it pretty
coef = c("(Intercept)" = beta[1], x1 = beta[2], x2 = beta[3], x3 = beta[4])
return(coef)
}
```

One thing to note is that I defined `calc_p()`

(the function to calculate $\mathbf{p}$) inside the `manual_logistic_regression()`

function. This means that `calc_p()`

is a function only accessible from within `manual_logistic_regression()`

and other functions won’t be able to access `calc_p()`

. Keep this in mind if you wanted to extend this code or perform some testing or something.

We can go ahead and see how well our version compares against R’s implementation of logistic regression. We usually run logistic regression by calling the `glm()`

function whilst setting the family argument to “binomial”.
Let’s compare the two:

```
#using R package
M1 = glm(y~x1+x2+x3, family = "binomial")
M1$coefficients
# (Intercept) x1 x2 x3
# -1.3512086 0.3191309 0.2033449 -0.0832102
#our version
manual_logistic_regression(X,y)
# (Intercept) x1 x2 x3
# -1.3512086 0.3191309 0.2033449 -0.0832102
#they're the same! Amazing!
```

And that’s all there is to it.