library(tidyverse)
library(ThemePark) # provides theme_barbie()

X <- rpois(1000, 45)
ggplot(tibble(x=X), aes(x=x)) + geom_histogram() + theme_barbie()
Perhaps no other subject in applied statistics and machine learning has caused people as much trouble as the humble logit model. Most people who learn either subject start off with some form of linear regression, and having grasped it, are then told that if they have a variable with only two values, none of the rules apply. Horror of horrors: logit is a non-linear model, a doom from which few escape unscathed. Interpreting logit coefficients is a fraught exercise: not only applied researchers but also statisticians have endless arguments about how it should be done.
In this post I de-mystify logit—and show why it’s such a useful model—by using maps. Yes, non-linear spaces can be tricky to understand, but as humans we can think reasonably well in up to three dimensions. If you know what the difference between a map and a globe is, then you can run and interpret logit just fine.
I’m also going to argue based on this analogy that people who assiduously avoid logit are not so dissimilar from flat earthers. It’s true that moving from a map to a globe induces additional complications, but the alternative is to run into significant problems in navigation. Similarly, applying the workhorse OLS model to a binary outcome isn’t wrong, but it can lead you astray when you want to figure out the distance between two points. Once you read this post, I hope you’ll see why logit is not all that complicated and should be the preferred model with a binary outcome.
B Is For Bernoulli
I’ll start first with a review of what a logit model actually is, as there is some confusion on this score. Logit is the wrong name for what most people are estimating; the model is in fact the Bernoulli distribution. This distribution has the following form for a variable Y that takes on values of 0 or 1:

P(Y = y) = p^y (1 − p)^(1 − y)

As soon as the math appears, of course, it looks complicated. This formula is actually the simplest statistical distribution once you break it down. What it says is that, on average, the probability that a variable Y equals 1 is p, and the probability that it equals 0 is 1 − p.

What is important to remember, though, is that while Y can only ever be 0 or 1, the probability p can be any value between 0 and 1.
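To make this concrete, here’s a minimal sketch of drawing from a Bernoulli distribution in R. A Bernoulli is just a Binomial with a single trial, so rbinom with size = 1 gives Bernoulli draws, and the proportion of 1s should land near p:

```r
set.seed(1)

# a Bernoulli draw is a Binomial draw with size = 1
p <- 0.3
Y <- rbinom(1000, size = 1, prob = p)

table(Y)  # Y only ever takes the values 0 and 1
mean(Y)   # the proportion of 1s is close to p
```

Even though each individual Y is either 0 or 1, the parameter p that generated them can sit anywhere in between.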
You’ve probably noticed that so far we haven’t mentioned the word “logit” at all. Where does this tricky little monster come in?
Well, if all we want to do is calculate the average proportion of 1s in Y, we don’t need a model at all: we can just take the mean of Y. Things get interesting when we want to relate that proportion to another variable, like the ages X we simulated above.

We see that the average person in our data is about 45 years old and the age range is between roughly 20 and 60. What we want to understand is whether on average older (or younger) people tend to vote more for Biden. But we have a problem – age is on a completely different scale from a proportion.

If you take a look at X, its values are in the tens, while a proportion has to live between 0 and 1. One crude fix is to divide age by 100:
ggplot(tibble(x=X/100),aes(x=x)) + geom_histogram() + theme_barbie()
We just squashed X down so that it fits between 0 and 1.

Here we simply replaced the probability p with X/100, so each person’s age as a fraction is their probability of voting for Biden. To simulate votes from these probabilities, we do the following for each person:

- Draw a random number between 0 and 1.
- If the random number is less than X/100, set Y equal to 1, and 0 otherwise.

Here’s what our random data Y looks like:
Y <- as.numeric(runif(length(X)) < (X/100))
ggplot(data=tibble(Y,X),aes(y=Y,x=X/100)) +
geom_point(alpha=0.5) +
stat_smooth() +
theme_barbie()
What our plot shows is that when X is high, we tend to get more 1s in Y: older people in our simulated data are more likely to vote for Biden.
If you look at this formula, though, you might think something is a bit off. Do we think that someone’s age as a fraction will be exactly equal to their probability of voting for Biden? Will voters who are 70 years old have an exactly 70% chance of voting for him? We only have one option for describing that relationship, and a 1:1 correspondence seems very limiting. So what do we do if we think this relationship is more nuanced?
To allow the relationship to be less exact, we’ll take a page out of linear regression and add a coefficient β to multiply X/100 by, along with an intercept α:

p = α + β(X/100)

Now let’s say the relationship is weaker than 1:1, with α = 0.1 and β = 0.5.

You might see some further issues, though. What if α + β(X/100) ends up below 0 or above 1? For now we’ll set that worry aside; dividing by 100 is a handy dandy trick to keep X on roughly the same scale as a probability.

We now have a linear model we want to stick in place of the probability p. Let’s simulate votes again using it:
Y <- as.numeric(runif(length(X)) < (0.1 + .5*(X/100)))
ggplot(data=tibble(Y,X),aes(y=Y,x=X/100)) +
geom_point(alpha=0.5) +
stat_smooth() +
theme_barbie()
We now have a much weaker relationship between age and voting: voting for Biden increases a bit as people age, but not much. It is also a much more nuanced relationship than the 1:1 divide-by-100 version we started out with. Very cool!
However, you might be wondering: what if we want even more diverse relationships? What if the relationship is much stronger, say β = 3?
Y <- as.numeric(runif(length(X)) < (0.1 + 3*(X/100)))
ggplot(data=tibble(Y,X),aes(y=Y,x=X/100)) +
geom_point(alpha=0.5) +
stat_smooth() +
theme_barbie()
Yikes! Now we’re only getting values for Y equal to 1, except among the very youngest people.
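To see why, we can check how often the “probability” 0.1 + 3·(X/100) spills past 1. A quick sketch, re-simulating X the same way as above (the exact number will vary with the seed):

```r
set.seed(1)

# ages simulated the same way as earlier in the post
X <- rpois(1000, 45)

# 0.1 + 3*(X/100) reaches 1 once X hits 30, i.e. for almost everyone,
# so the comparison with a uniform draw is essentially always true
mean(0.1 + 3*(X/100) >= 1)
```

When the linear model exceeds 1, it stops being a valid probability at all, which is exactly the problem the next section solves.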
Not All Heroes Are Linear
We now have two options. We can keep on guessing what values might work for α and β, or we can find a way to let the linear model take on any value while still producing a valid probability.
This is where our good friend logit comes in. Logit is a function that can take in any number–literally any number–and magically transform it to a number that is strictly between 0 and 1. It’s a magical map that lets us move from the world of ages to the world of proportions between 0 and 1. NB: Technically this is not the logit function but rather the inverse logit function. But everyone calls the model “logit” so I’ll just use that term.
The function itself is a bit ugly, but I’ll include it here along with our linear model as the input and p as the output:

p = 1 / (1 + e^(−(α + βX)))

I won’t spend a lot of time on explaining the function other than to note that it has the number e (Euler’s number) in it, which is what does the squashing. Notice too that we no longer need to divide X by 100: the function can take ages as they are.
beta <- .2
alpha <- -10

logit <- function(x) { 1 / (1 + exp(-(x))) }
tibble(linear_model=alpha + beta * X) %>%
mutate(p=logit(linear_model)) %>%
ggplot(aes(y=p, x=linear_model)) +
geom_point() +
theme_barbie()
On the bottom you can see all the values for our linear model when β = 0.2 and α = −10, and on the left the proportion p that the logit function returns for each of them.

If you notice, though, while the line curves, it never changes the direction. When the linear model goes up, so does p.
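One way to convince yourself of the “never changes direction” claim is to check that logit’s output is strictly increasing, and that our hand-rolled function agrees with R’s built-in inverse logit, plogis. A small sketch:

```r
logit <- function(x) { 1 / (1 + exp(-(x))) }

x <- seq(-10, 10, by = 0.5)
p <- logit(x)

# matches R's built-in inverse-logit function
all.equal(p, plogis(x))

# strictly increasing: every successive p is larger than the last
all(diff(p) > 0)
```

Because logit preserves the ordering of the linear model, a positive β still means “more X, more 1s” – the curve only changes how fast that happens.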
A Whole New World
Often people get confused or suspicious at this point. What’s up with this weird S-curve thing? Linear models are nice and straight. Why do we need the bend-y thing?
To understand this, we can return to our map analogy. The maps that we use are all two-dimensional: we can put them on a table and they lie flat. However, we know that they aren’t completely accurate because the Earth is not flat, it’s round. A globe is the truly accurate representation of the Earth. A flattened map is going to inevitably cause distortions.
As a result, when we trace a flight path on a flat map, the plane’s journey is always curved as in the picture below:
This flight path–which happens to be for a very long flight from Paris to Tahiti–seems bizarrely roundabout on the map. You would think that the straightest path between Paris and Tahiti would be a straight line. This would be true if the Earth were flat, but it’s not. The Earth is round, and the map is a linear projection of a 3-D object onto a 2-D surface. When we do that and calculate the most direct route, we end up with this weird bend-y shape: just like our logit plot above.
By the same analogy, our linear model is our simplified, flat version of reality that lets us understand the relationship between X and Y, while the logit function is the projection that maps that flat space onto the curved world of proportions.
When the linear model is 0–that is, when α + βX = 0–the logit function returns exactly 0.5, the midpoint between 0 and 1.

The magical part of logit is that it creates more space as the linear model gets very big or very small. The linear model keeps moving at the same rate (β for every one-unit change in X), but the resulting proportion changes more and more slowly as it approaches 0 or 1, so it never crosses those boundaries.
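You can see this “more space at the extremes” behavior numerically: equal steps in the linear space produce ever-smaller steps in the proportion. A small sketch:

```r
logit <- function(x) { 1 / (1 + exp(-(x))) }

# equal steps of 2 in the linear space...
lin <- c(0, 2, 4, 6, 8)
p <- logit(lin)

# ...produce ever-shrinking steps in the proportion
round(diff(p), 4)  # → 0.3808 0.1012 0.0155 0.0021
```

Each step of 2 in the linear model moves p by less than the step before, which is exactly how the curve can keep going forever without escaping the 0–1 interval.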
Flat Earthers and the Logit Function
What is logit then? A handy dandy scaling function that maps a linear model to a proportion between 0 and 1. That’s all it does. The actual distribution is the Bernoulli (though sadly, no one calls it that). The logit function isn’t special–there are other similar functions, such as probit, just as there are a lot of map coordinate transformations (talk to a GIS person sometime; it’ll bore you to tears).
There are people who really don’t like logit. Usually it’s something to do with how the function complicates things and makes these bend-y shapes. It does require some thinking, as we have to think a bit about what our linear model coefficient β means once it has been passed through the logit function.
It is this bend-y (or non-linear) nature of logit that leads some people to just ignore the Bernoulli and use a linear model. That is an easy way to deal with the problem, but unfortunately, we cannot simply ignore this transformation. Like flat earthers, if we ignore the fact that the Earth is round, we won’t be able to calculate distances accurately. If we have a linear model that can get larger or smaller than 0% or 100%, then we will end up with nonsensical predictions from our linear model like a plane that ends up at the wrong airport because the pilot believed the Earth was flat.
So how do you interpret a logit model’s coefficients? There are two ways once you understand what the logit function does:
- Leave it as it is. The coefficient β is the relationship of the covariate (X) to the outcome (Y) in the linear or “flat” space. The larger β is, the stronger the relationship is between the covariate and the outcome. No, we don’t know exactly how much the proportion will change as X changes, but we can still interpret β as expressing what that change looks like in the simple space of the linear model.
- Convert it to changes in the outcome via the proportion p. Just like a plane figuring out distances using a transformation of coordinates, we can figure out the average change in a proportion by averaging (or marginalizing) over different logit function values for a given X. I won’t go into the technical details, but we have very handy R packages to calculate what are called marginal effects: how much does the “real world” change as our linear model coefficient changes?
There are plenty of vignettes about how to calculate marginal effects on the R package marginaleffects website. Here I’ll show how to calculate a marginal effect after first fitting the model with glm:
# simulate our outcome Y given our linear model parameters
Y <- as.numeric(runif(length(X)) < logit(alpha + beta*X))

# make a dataset and fit the model for Y and X
# note that glm says the family is "binomial"
# Bernoulli is a special case of the Binomial
# and our link (scaling) function is logit!
sim_data <- tibble(Y=Y, X=X)
logit_fit <- glm(Y ~ X, data=sim_data, family=binomial(link="logit"))
summary(logit_fit)
Call:
glm(formula = Y ~ X, family = binomial(link = "logit"), data = sim_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2091 -0.7986 -0.4686 0.8678 2.5201
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.78933 0.70117 -13.96 <2e-16 ***
X 0.19578 0.01484 13.20 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1250.71 on 999 degrees of freedom
Residual deviance: 998.47 on 998 degrees of freedom
AIC: 1002.5
Number of Fisher Scoring iterations: 5
Above we see the output of the glm command that shows us the values of Estimate for the intercept and for X. These are the values of these parameters in the linear, or simple, space–and note how close they are to the true values of α = −10 and β = 0.2 that we used to simulate the data. There is nothing wrong with these coefficients: they tell you how much the linear model changes as X changes.
If we want to convert these relationships to the distances/effects in the outcome Y, we can use the avg_slopes function from marginaleffects:
library(marginaleffects)
avg_slopes(logit_fit,variables="X")
Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
X 0.0324 0.00158 20.5 <0.001 0.0293 0.0355
Columns: term, estimate, std.error, statistic, p.value, conf.low, conf.high
Note that this Estimate is way smaller than the Estimate for X from the glm command. That estimate was from the linear space and so is much bigger. The marginaleffects command is telling us how much the proportion of people voting for Biden changes, on average, as X increases by one year: about 3.2 percentage points.
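If you’re curious what avg_slopes is doing under the hood, here’s a rough sketch by hand: the slope of the logit curve at any point is β·p·(1 − p), so the average marginal effect is just that quantity averaged over the data. This sketch uses the true simulation values α = −10 and β = 0.2 rather than the fitted estimates, so the number will only roughly match the output above:

```r
set.seed(1)

logit <- function(x) { 1 / (1 + exp(-(x))) }

alpha <- -10
beta <- 0.2
X <- rpois(1000, 45)
p <- logit(alpha + beta * X)

# average of the derivative d p / d X = beta * p * (1 - p)
ame <- mean(beta * p * (1 - p))
ame  # in the same ballpark as the avg_slopes estimate
```

The derivative shrinks toward 0 whenever p is near 0 or 1, which is why the averaged effect is so much smaller than the raw coefficient β in the flat space.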
Very cool! Now that you understand logit models and how to interpret linear model coefficients in both the linear space and via marginal effects, you know pretty much everything you need to know about this wonderful little function logit. Next time you have a binary outcome like voting for Biden, or anything else, give logit a go! It’s your map to a whole new world.