Logistic regression

Introduction

Prof. Eric Friedlander

Nov 01, 2024

Announcements

📋 AE 18 - Intro to Logistic Regression

  • Open up AE 18

Logistic regression

Topics

  • Introduction to modeling categorical data

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(broom)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Predicting categorical outcomes

Types of outcome variables

Quantitative outcome variable:

  • Sales price of a house
  • Model: Expected sales price given the number of bedrooms, lot size, etc.

Categorical outcome variable:

  • Indicator of being high risk of getting coronary heart disease in the next 10 years
  • Model: Probability an adult is high risk of heart disease in the next 10 years given their age, total cholesterol, etc.

Models for categorical outcomes

Logistic regression

2 Outcomes

  • 1: “Success” (the model gives the probability of this category)
  • 0: “Failure”

Multinomial logistic regression

3+ Outcomes

  • 1: Democrat
  • 2: Republican
  • 3: Independent

2024 election forecasts

The Economist

2020 NBA finals predictions

Source: FiveThirtyEight 2019-20 NBA Predictions

Data: Framingham Study

This data set comes from an ongoing cardiovascular study of residents of Framingham, Massachusetts. We want to use total cholesterol to predict whether a randomly selected adult is at high risk for heart disease in the next 10 years.

heart_disease <- read_csv("data/framingham.csv") |>
  select(totChol, TenYearCHD) |>
  drop_na() |>                          # remove rows with missing values
  mutate(high_risk = TenYearCHD) |>     # give the response a descriptive name
  select(totChol, high_risk)

Variables

  • Response:
    • high_risk:
      • 1: High risk of having heart disease in next 10 years
      • 0: Not high risk of having heart disease in next 10 years
  • Predictor:
    • totChol: total cholesterol (mg/dL)

Complete Exercises 1-2.

Plot the data

Let’s fit a linear regression model

Outcome: \(Y\) = 1: high risk, 0: not high risk

What happens if we zoom out?

🛑 This model produces predictions outside of 0 and 1.

Let’s try another model

✅ This model (called a logistic regression model) only produces predictions between 0 and 1.

The code

heart_disease |> 
  gf_point(high_risk ~ totChol) |>
  gf_hline(yintercept = c(0, 1), lty = 2) |> 
  gf_lims(x = c(-1000, 2000), y = c(-1, 2)) |> 
  gf_labs(y = "High Risk", x = "Total Cholesterol") |> 
  gf_refine(stat_smooth(method = "glm", method.args = list(family = "binomial"), 
              fullrange = TRUE, se = FALSE))

Different types of models

Method                            Outcome        Model
Linear regression                 Quantitative   \(Y = \beta_0 + \beta_1~X\)
Linear regression (transform Y)   Quantitative   \(\log(Y) = \beta_0 + \beta_1~X\)
Logistic regression               Binary         \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)

Note: In this class (and in most college-level math classes, and in R), \(\log\) means log base \(e\), i.e., the natural log.
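A quick check of this convention in R, where log() is base-\(e\) by default:

```r
# log() in R is the natural log; log10() and log2() take other bases
log(exp(1))   # natural log of e is 1
log10(100)    # base-10 log of 100 is 2
```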

Linear vs. logistic regression

Complete Exercise 3.

Linear vs. logistic regression

State whether a linear regression model or logistic regression model is more appropriate for each scenario.

  1. Use age and education to predict if a randomly selected person will vote in the next election.

  2. Use budget and run time (in minutes) to predict a movie’s total revenue.

  3. Use age and sex to calculate the probability a randomly selected adult will visit St. Luke's in the next year.

Odds and probabilities

Binary response variable

  • \(Y\):
    • 1: “success” (not necessarily a good thing)
    • 0: “failure”
  • \(\pi\): probability that \(Y=1\), i.e., \(P(Y = 1)\)
  • \(\frac{\pi}{1-\pi}\): odds that \(Y = 1\)
  • \(\log\big(\frac{\pi}{1-\pi}\big)\): log-odds
  • Go from \(\pi\) to \(\log\big(\frac{\pi}{1-\pi}\big)\) using the logit transformation

Odds

Suppose there is a 70% chance it will rain tomorrow

  • Probability it will rain is \(\mathbf{\pi = 0.7}\)
  • Probability it won’t rain is \(\mathbf{1 - \pi = 0.3}\)
  • Odds it will rain are 7 to 3 (7:3), i.e., \(\mathbf{\frac{0.7}{0.3} \approx 2.33}\)
  • Log-odds it will rain are \(\mathbf{\log\big(\frac{0.7}{0.3}\big) \approx \log(2.33) \approx 0.847}\)
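The rain example can be checked directly in R:

```r
p <- 0.7                 # probability it rains
odds <- p / (1 - p)      # odds of rain: 7:3, about 2.33
log_odds <- log(odds)    # log-odds of rain, about 0.847
c(odds = odds, log_odds = log_odds)
```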

What are the odds of developing heart disease?

Complete Exercise 4.

From log-odds to probabilities

log-odds

\[\omega = \log \frac{\pi}{1-\pi}\]

odds

\[e^\omega = \frac{\pi}{1-\pi}\]

probability

\[\pi = \frac{e^\omega}{1 + e^\omega}\]
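These conversions can be verified numerically; R's built-in qlogis() and plogis() are the logit transformation and its inverse, so they should agree with the formulas above:

```r
# probability -> log-odds (logit), then back to probability
p <- 0.7
omega <- log(p / (1 - p))                 # log-odds; same as qlogis(p)
p_back <- exp(omega) / (1 + exp(omega))   # probability; same as plogis(omega)
c(omega = omega, p_back = p_back)
```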

Logistic regression

From odds to probabilities

  1. Logistic model: log-odds = \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)
  2. Odds = \(\exp\big\{\log\big(\frac{\pi}{1-\pi}\big)\big\} = \frac{\pi}{1-\pi}\)
  3. Combining (1) and (2) with what we saw earlier

\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]
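To see the two forms side by side, here is a minimal glm() sketch on simulated data (the variable names mirror the Framingham data, but the data and coefficients are made up for illustration):

```r
set.seed(1)
# simulated stand-in for the Framingham variables (hypothetical coefficients)
sim <- data.frame(totChol = rnorm(500, mean = 240, sd = 45))
sim$high_risk <- rbinom(500, size = 1, prob = plogis(-6 + 0.02 * sim$totChol))

# logistic regression: glm() with family = binomial fits the logit form
fit <- glm(high_risk ~ totChol, data = sim, family = binomial)

log_odds_hat <- predict(fit, type = "link")    # logit form: predicted log-odds
pi_hat <- predict(fit, type = "response")      # probability form: predicted pi

# the two forms agree: pi = exp(log-odds) / (1 + exp(log-odds))
all.equal(pi_hat, exp(log_odds_hat) / (1 + exp(log_odds_hat)))
```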

Recap

  • Introduced logistic regression for binary response variable

  • Described relationship between odds and probabilities