Logistic regression

Introduction

Prof. Eric Friedlander

Nov 01, 2024

Announcements

📋 AE 18 - Intro to Logistic Regression

  • Open up AE 18

Logistic regression

Topics

  • Introduction to modeling categorical data

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(broom)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Predicting categorical outcomes

Types of outcome variables

Quantitative outcome variable:

  • Sales price of a house
  • Model: Expected sales price given the number of bedrooms, lot size, etc.

Categorical outcome variable:

  • Indicator of being high risk of getting coronary heart disease in the next 10 years
  • Model: Probability an adult is high risk of heart disease in the next 10 years given their age, total cholesterol, etc.

Models for categorical outcomes

Logistic regression

2 Outcomes

  • 1: “Success” (the model gives the probability of this category)
  • 0: “Failure”

Multinomial logistic regression

3+ Outcomes

  • 1: Democrat
  • 2: Republican
  • 3: Independent

2024 election forecasts

The Economist

2020 NBA finals predictions

Source: FiveThirtyEight 2019-20 NBA Predictions

Data: Framingham Study

This data set comes from an ongoing cardiovascular study of residents of Framingham, Massachusetts. We want to use total cholesterol to predict whether a randomly selected adult is at high risk for heart disease in the next 10 years.

heart_disease <- read_csv("data/framingham.csv") |>
  select(totChol, TenYearCHD) |>
  drop_na() |>                          # remove rows with missing values
  mutate(high_risk = TenYearCHD) |>     # give the response a descriptive name
  select(totChol, high_risk)

Variables

  • Response:
    • high_risk:
      • 1: High risk of having heart disease in next 10 years
      • 0: Not high risk of having heart disease in next 10 years
  • Predictor:
    • totChol: total cholesterol (mg/dL)

Complete Exercises 1-2.

Plot the data

Let’s fit a linear regression model

Outcome: \(Y\) = 1: high risk, 0: not high risk

What happens if we zoom out?

🛑 This model produces predictions outside of 0 and 1.

Let’s try another model

✅ This model (called a logistic regression model) only produces predictions between 0 and 1.

The code

heart_disease |> 
  gf_point(high_risk ~ totChol) |>
  gf_hline(yintercept = c(0, 1), lty = 2) |> 
  gf_lims(x = c(-1000, 2000), y = c(-1, 2)) |> 
  gf_labs(y = "High Risk", x = "Total Cholesterol") |> 
  gf_refine(stat_smooth(method = "glm", method.args = list(family = "binomial"), 
              fullrange = TRUE, se = FALSE))

Different types of models

Method                            Outcome        Model
Linear regression                 Quantitative   \(Y = \beta_0 + \beta_1~X\)
Linear regression (transform Y)   Quantitative   \(\log(Y) = \beta_0 + \beta_1~X\)
Logistic regression               Binary         \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)

Note: In this class (and in most college-level math classes, and in R), \(\log\) means log base \(e\), i.e., the natural log.
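A quick check of this convention in R, where log() is base-\(e\) by default:

```r
# log() in R is the natural log; log10() and log2() take other bases
log(exp(1))   # natural log of e is 1
log10(100)    # base-10 log of 100 is 2
```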

Linear vs. logistic regression

Complete Exercise 3.

Linear vs. logistic regression

State whether a linear regression model or logistic regression model is more appropriate for each scenario.

  1. Use age and education to predict if a randomly selected person will vote in the next election.

  2. Use budget and run time (in minutes) to predict a movie’s total revenue.

  3. Use age and sex to calculate the probability a randomly selected adult will visit St. Luke's in the next year.

Odds and probabilities

Binary response variable

  • \(Y\):
    • 1: “success” (not necessarily a good thing)
    • 0: “failure”
  • \(\pi\): probability that \(Y=1\), i.e., \(P(Y = 1)\)
  • \(\frac{\pi}{1-\pi}\): odds that \(Y = 1\)
  • \(\log\big(\frac{\pi}{1-\pi}\big)\): log-odds
  • Go from \(\pi\) to \(\log\big(\frac{\pi}{1-\pi}\big)\) using the logit transformation

Odds

Suppose there is a 70% chance it will rain tomorrow

  • Probability it will rain is \(\mathbf{\pi = 0.7}\)
  • Probability it won’t rain is \(\mathbf{1 - \pi = 0.3}\)
  • Odds it will rain are 7 to 3 (7:3), i.e., \(\mathbf{\frac{0.7}{0.3} \approx 2.33}\)
  • Log-odds it will rain are \(\mathbf{\log\big(\frac{0.7}{0.3}\big) \approx \log(2.33) \approx 0.847}\)
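The rain example can be checked directly in R:

```r
p <- 0.7                 # probability it rains
odds <- p / (1 - p)      # odds of rain: 7:3, about 2.33
log_odds <- log(odds)    # log-odds of rain, about 0.847
c(odds = odds, log_odds = log_odds)
```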

What are the odds of developing heart disease?

Complete Exercise 4.

From log-odds to probabilities

log-odds

\[\omega = \log \frac{\pi}{1-\pi}\]

odds

\[e^\omega = \frac{\pi}{1-\pi}\]

probability

\[\pi = \frac{e^\omega}{1 + e^\omega}\]
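These conversions can be verified numerically; R's built-in qlogis() and plogis() are the logit transformation and its inverse, so they should agree with the formulas above:

```r
# probability -> log-odds (logit), then back to probability
p <- 0.7
omega <- log(p / (1 - p))                 # log-odds; same as qlogis(p)
p_back <- exp(omega) / (1 + exp(omega))   # probability; same as plogis(omega)
c(omega = omega, p_back = p_back)
```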

Logistic regression

From odds to probabilities

  1. Logistic model: log-odds = \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)
  2. Odds = \(\exp\big\{\log\big(\frac{\pi}{1-\pi}\big)\big\} = \frac{\pi}{1-\pi}\)
  3. Combining (1) and (2) with what we saw earlier

\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]
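To see the two forms side by side, here is a minimal glm() sketch on simulated data (the variable names mirror the Framingham data, but the data and coefficients are made up for illustration):

```r
set.seed(1)
# simulated stand-in for the Framingham variables (hypothetical coefficients)
sim <- data.frame(totChol = rnorm(500, mean = 240, sd = 45))
sim$high_risk <- rbinom(500, size = 1, prob = plogis(-6 + 0.02 * sim$totChol))

# logistic regression: glm() with family = binomial fits the logit form
fit <- glm(high_risk ~ totChol, data = sim, family = binomial)

log_odds_hat <- predict(fit, type = "link")    # logit form: predicted log-odds
pi_hat <- predict(fit, type = "response")      # probability form: predicted pi

# the two forms agree: pi = exp(log-odds) / (1 + exp(log-odds))
all.equal(pi_hat, exp(log_odds_hat) / (1 + exp(log_odds_hat)))
```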

Recap

  • Introduced logistic regression for binary response variable

  • Described relationship between odds and probabilities