Assessing Logistic Regression Models

Prof. Eric Friedlander

Nov 08, 2024

Announcements

📋 AE 20 - Assessing Logistic Regression Models

  • Open up AE 20 and complete Exercise 0.

Topics

  • Estimating coefficients in logistic regression
  • Checking model conditions for logistic regression

Computational setup

# load packages
library(tidyverse)
library(broom)
library(ggformula)
library(openintro)
library(knitr)
library(kableExtra)  # for table embellishments
library(Stat2Data)   # for empirical logit
library(countdown)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Data

Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

  • TenYearCHD:

    • 1: High risk of having heart disease in next 10 years
    • 0: Not high risk of having heart disease in next 10 years
  • age: Age at exam time (in years)

  • currentSmoker: 0 = nonsmoker, 1 = smoker

Data prep

heart_disease <- read_csv("data/framingham.csv") |>
  select(TenYearCHD, age, currentSmoker) |>
  drop_na() |>
  mutate(currentSmoker = as.factor(currentSmoker))

heart_disease |> head() |> kable()
TenYearCHD age currentSmoker
0 39 0
0 46 0
0 48 1
1 61 1
0 46 1
0 43 0

Estimating coefficients

Statistical model

The form of the statistical model for logistic regression is

\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \]

where \(\pi\) is the probability \(Y = 1\).

Notice there is no error term when writing the statistical model for logistic regression. Why?

  • The statistical model is the “data-generating” model
  • Each individual observed \(Y\) is generated from a Bernoulli distribution, \(Bernoulli(\pi)\)
  • Therefore, the randomness is not produced by an error term but rather in the distribution used to generate \(Y\)
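The data-generating view above can be sketched directly in R. This is an illustrative simulation, not a fit to the Framingham data; the coefficient values are made up:

```r
# Sketch of the logistic data-generating model
# (beta0 and beta1 are hypothetical values, not estimates)
set.seed(1)
beta0 <- -5      # hypothetical intercept
beta1 <- 0.08    # hypothetical slope for age

age <- c(39, 46, 48, 61)
log_odds <- beta0 + beta1 * age
pi <- exp(log_odds) / (1 + exp(log_odds))  # inverse-logit

# Each Y_i is a Bernoulli(pi_i) draw -- the randomness lives in this
# draw, not in an additive error term
y <- rbinom(length(pi), size = 1, prob = pi)
```

Note that no `+ error` appears anywhere: two people with the same age share the same \(\pi_i\) but can still have different outcomes.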

Bernoulli Distribution

  • Think of two possible outcomes:
    • 1 = “Success”, which occurs with probability \(\pi\)
    • 0 = “Failure”, which occurs with probability \(1-\pi\)
  • We can think of each of our observations as having a Bernoulli distribution with mean \(\pi_i\)
  • Our logistic regression model is changing \(\pi_i\) (the probability of success) for each new observation
  • The probability of observing our data, assuming the model is the truth, is called the likelihood \[L = \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i}\]
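The likelihood product is easy to compute by hand for a toy data set. The \(\pi_i\) and \(y_i\) values below are made up for illustration:

```r
# Likelihood of a toy data set under given success probabilities
pi <- c(0.2, 0.7, 0.9)   # hypothetical pi_i for three observations
y  <- c(0,   1,   1)     # observed outcomes

# Each factor is pi_i if y_i = 1 and (1 - pi_i) if y_i = 0
L <- prod(pi^y * (1 - pi)^(1 - y))
L  # 0.8 * 0.7 * 0.9 = 0.504
```

A model that assigns higher probability to the outcomes we actually saw gets a larger likelihood, which is exactly what estimation will exploit.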

Log Likelihood Function

Log-Likelihood Function: the log of the likelihood is easier to work with, and because the log is an increasing function, it is maximized at exactly the same parameter values as the likelihood itself!

\[ \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \]

where

\[\hat{\pi}_i = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1X_{i1} + \dots + \hat{\beta}_pX_{ip}\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1X_{i1} + \dots + \hat{\beta}_pX_{ip}\}}\]

  • The coefficients \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) are estimated using maximum likelihood estimation

  • Basic idea: Find the values of \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) that give the observed data the maximum probability of occurring
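The log-likelihood above translates almost line-for-line into R. This sketch assumes a model with an intercept and a single predictor; the function and variable names are ours:

```r
# Log-likelihood of candidate coefficients beta = c(beta0, beta1)
# for a single-predictor logistic model (illustrative sketch)
log_lik <- function(beta, x, y) {
  eta    <- beta[1] + beta[2] * x       # linear predictor (log-odds)
  pi_hat <- exp(eta) / (1 + exp(eta))   # inverse-logit
  sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))
}

# Toy check against R's built-in Bernoulli log-density
x <- c(1, 2, 3)
y <- c(0, 1, 1)
log_lik(c(-1, 0.5), x, y)
```

Maximum likelihood estimation then amounts to searching for the `beta` that makes this function as large as possible.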

Maximum Likelihood Estimation

  • This is called maximum likelihood estimation and is EXTREMELY common in statistics and data science

  • Need a strong foundation in probability and applied mathematics to fully understand

  • Logistic regression: maximum found through numerical methods (clever computer algorithms that approximate the maximum)

  • Linear regression: maximum found through calculus
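We can see the numerical approach in action by maximizing the log-likelihood "by hand" with a general-purpose optimizer and comparing the result to `glm()`. This uses a small simulated data set, not the Framingham data:

```r
# Numerical MLE via optim(), compared against glm()
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))  # true betas: -0.5, 1.2

# Negative log-likelihood (optim() minimizes by default);
# y*eta - log(1 + exp(eta)) is an algebraic rewrite of the
# log-likelihood from the previous slide
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

fit_optim <- optim(c(0, 0), neg_log_lik)
fit_glm   <- glm(y ~ x, family = binomial)

fit_optim$par   # numerical MLE
coef(fit_glm)   # glm's estimates -- should agree closely
```

`glm()` uses a more specialized algorithm (iteratively reweighted least squares), but both are numerical methods chasing the same maximum.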

Complete Exercise 1.


Recap

  • How do we fit a logistic regression model?
    • Maximum likelihood estimation