Assessing Logistic Regression Models

Prof. Eric Friedlander

Nov 08, 2024

Announcements

📋 AE 20 - Assessing Logistic Regression Models

  • Open up AE 20 and complete Exercise 0.

Topics

  • Estimating coefficients in logistic regression
  • Checking model conditions for logistic regression

Computational setup

# load packages
library(tidyverse)
library(broom)
library(ggformula)
library(openintro)
library(knitr)
library(kableExtra)  # for table embellishments
library(Stat2Data)   # for empirical logit
library(countdown)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Data

Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

  • TenYearCHD:

    • 1: High risk of having heart disease in next 10 years
    • 0: Not high risk of having heart disease in next 10 years
  • age: Age at exam time (in years)

  • currentSmoker: 0 = nonsmoker, 1 = smoker

Data prep

heart_disease <- read_csv("data/framingham.csv") |>
  select(TenYearCHD, age, currentSmoker) |>
  drop_na() |>
  mutate(currentSmoker = as.factor(currentSmoker))

heart_disease |> head() |> kable()
TenYearCHD age currentSmoker
0 39 0
0 46 0
0 48 1
1 61 1
0 46 1
0 43 0

Estimating coefficients

Statistical model

The form of the statistical model for logistic regression is

\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p \]

where \(\pi\) is the probability \(Y = 1\).

Notice there is no error term when writing the statistical model for logistic regression. Why?

  • The statistical model is the “data-generating” model
  • Each individual observed \(Y\) is generated from a Bernoulli distribution, \(Bernoulli(\pi)\)
  • Therefore, the randomness is not produced by an error term but rather in the distribution used to generate \(Y\)
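The data-generating view above can be sketched directly in R. This is an illustrative simulation, not a fit to the Framingham data; the coefficient values are made up:

```r
# Sketch of the logistic data-generating model
# (beta0 and beta1 are hypothetical values, not estimates)
set.seed(1)
beta0 <- -5      # hypothetical intercept
beta1 <- 0.08    # hypothetical slope for age

age <- c(39, 46, 48, 61)
log_odds <- beta0 + beta1 * age
pi <- exp(log_odds) / (1 + exp(log_odds))  # inverse-logit

# Each Y_i is a Bernoulli(pi_i) draw -- the randomness lives in this
# draw, not in an additive error term
y <- rbinom(length(pi), size = 1, prob = pi)
```

Note that no `+ error` appears anywhere: two people with the same age share the same \(\pi_i\) but can still have different outcomes.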

Bernoulli Distribution

  • Think of two possible outcomes:
    • 1 = “Success”, which occurs with probability \(\pi\)
    • 0 = “Failure”, which occurs with probability \(1-\pi\)
  • We can think of each of our observations as having a Bernoulli distribution with mean \(\pi_i\)
  • Our logistic regression model is changing \(\pi_i\) (the probability of success) for each new observation
  • The probability of observing our data, assuming the model is the truth, is called the likelihood \[L = \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i}\]
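The likelihood product is easy to compute by hand for a toy data set. The \(\pi_i\) and \(y_i\) values below are made up for illustration:

```r
# Likelihood of a toy data set under given success probabilities
pi <- c(0.2, 0.7, 0.9)   # hypothetical pi_i for three observations
y  <- c(0,   1,   1)     # observed outcomes

# Each factor is pi_i if y_i = 1 and (1 - pi_i) if y_i = 0
L <- prod(pi^y * (1 - pi)^(1 - y))
L  # 0.8 * 0.7 * 0.9 = 0.504
```

A model that assigns higher probability to the outcomes we actually saw gets a larger likelihood, which is exactly what estimation will exploit.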

Log Likelihood Function

Log-Likelihood Function: the log of the likelihood is easier to work with, and because the log is an increasing function, it is maximized at exactly the same parameter values as the likelihood itself!

\[ \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \]

where

\[\hat{\pi}_i = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1X_{i1} + \dots + \hat{\beta}_pX_{ip}\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1X_{i1} + \dots + \hat{\beta}_pX_{ip}\}}\]

  • The coefficients \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) are estimated using maximum likelihood estimation

  • Basic idea: Find the values of \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) that give the observed data the maximum probability of occurring
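The log-likelihood above translates almost line-for-line into R. This sketch assumes a model with an intercept and a single predictor; the function and variable names are ours:

```r
# Log-likelihood of candidate coefficients beta = c(beta0, beta1)
# for a single-predictor logistic model (illustrative sketch)
log_lik <- function(beta, x, y) {
  eta    <- beta[1] + beta[2] * x       # linear predictor (log-odds)
  pi_hat <- exp(eta) / (1 + exp(eta))   # inverse-logit
  sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))
}

# Toy check against R's built-in Bernoulli log-density
x <- c(1, 2, 3)
y <- c(0, 1, 1)
log_lik(c(-1, 0.5), x, y)
```

Maximum likelihood estimation then amounts to searching for the `beta` that makes this function as large as possible.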

Maximum Likelihood Estimation

  • This is called maximum likelihood estimation and is EXTREMELY common in statistics and data science

  • Need a strong foundation in probability and applied mathematics to fully understand

  • Logistic regression: maximum found through numerical methods (clever computer algorithms that approximate the maximum)

  • Linear regression: maximum found through calculus
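We can see the numerical approach in action by maximizing the log-likelihood "by hand" with a general-purpose optimizer and comparing the result to `glm()`. This uses a small simulated data set, not the Framingham data:

```r
# Numerical MLE via optim(), compared against glm()
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))  # true betas: -0.5, 1.2

# Negative log-likelihood (optim() minimizes by default);
# y*eta - log(1 + exp(eta)) is an algebraic rewrite of the
# log-likelihood from the previous slide
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

fit_optim <- optim(c(0, 0), neg_log_lik)
fit_glm   <- glm(y ~ x, family = binomial)

fit_optim$par   # numerical MLE
coef(fit_glm)   # glm's estimates -- should agree closely
```

`glm()` uses a more specialized algorithm (iteratively reweighted least squares), but both are numerical methods chasing the same maximum.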

Complete Exercise 1.


Recap

  • How do we fit a logistic regression model?
    • Maximum likelihood estimation