Logistic regression continued

Introduction

Prof. Eric Friedlander

Nov 06, 2024

Announcements

πŸ“‹ AE 19 - Intro to Logistic Regression Continued

  • Open up AE 19 and complete Exercise 0.

Logistic regression

Topics

  • Use R to fit logistic regression models
  • Interpret coefficients of logistic regression models
  • Use logistic regression model to calculate predicted odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(ggforce)
library(broom)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Data: Framingham Study

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use total cholesterol to predict whether a randomly selected adult is at high risk for heart disease in the next 10 years.

heart_disease <- read_csv("data/framingham.csv") |>
  select(totChol, TenYearCHD, age, BMI, cigsPerDay, heartRate) |>
  drop_na()

Variables

  • Response:
    • TenYearCHD:
      • 1: Patient developed heart disease within 10 years of exam
      • 0: Patient did not develop heart disease within 10 years of exam
  • Predictors:
    • totChol: total cholesterol (mg/dL)
    • age: age in years

Logistic regression

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]

Today: Using R to fit this model.

High risk vs. age

heart_disease |> 
  gf_sina(age ~ factor(TenYearCHD)) |> 
  gf_labs(x = "High risk - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. High risk of heart disease")

High risk vs. age

heart_disease |> 
  gf_violin(age ~ factor(TenYearCHD), fill = "steelblue") |> 
  gf_labs(x = "High risk - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. High risk of heart disease")

High risk vs. age

heart_disease |> 
  gf_boxplot(age ~ factor(TenYearCHD), fill = "steelblue") |> 
  gf_sina(size = 0.75, alpha = 0.25) |> 
  gf_labs(x = "High risk - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. High risk of heart disease")

Let’s fit a model

heart_disease_fit <- glm(TenYearCHD ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

The model

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

\[\textbf{Logit form:}\qquad\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.661 + 0.076 \times \text{age}\]

\[\textbf{Probability form:}\qquad\hat{\pi} = \frac{\exp(-5.661 + 0.076 \times \text{age})}{1+\exp(-5.661 + 0.076 \times \text{age})}\]

where \(\hat{\pi}\) is the predicted probability of developing heart disease in the next 10 years.
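As a quick check, both forms above can be evaluated directly in R. This is a sketch using the rounded coefficients from the table; values from the unrounded fitted model would differ slightly.

```r
# rounded coefficients from the model output above
b0 <- -5.661
b1 <- 0.076

# logit form: predicted log-odds at age 50
log_odds_50 <- b0 + b1 * 50

# probability form: convert the log-odds to a probability
prob_50 <- exp(log_odds_50) / (1 + exp(log_odds_50))
prob_50
```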

Interpreting \(\hat{\beta}\)’s

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

For every additional year of age, the log-odds of developing heart disease in the next 10 years increases by 0.076.
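Exponentiating the slope moves the interpretation from the log-odds scale to the odds scale. A sketch using the rounded estimate from the table:

```r
b1 <- 0.076   # rounded age coefficient from the table above

# odds ratio for a one-year increase in age
exp(b1)
# the odds of developing heart disease are multiplied by
# roughly 1.08 (about an 8% increase) per additional year
```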

Complete Exercises 1 - 3.

Interpretability of \(\beta\) for predicted probabilities

  • The slope is changing!
  • The increase in \(\hat{\pi}\) from one additional year of age depends on the starting age
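To see this, compare the change in predicted probability from one extra year of age at two different starting ages, using the rounded coefficients from earlier:

```r
# predicted probability at a given age
# (plogis is R's inverse-logit function)
p_hat <- function(age) plogis(-5.661 + 0.076 * age)

# one extra year changes the probability by different amounts
p_hat(41) - p_hat(40)   # small change at age 40
p_hat(71) - p_hat(70)   # noticeably larger change at age 70
```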

glm and augment

The .fitted values in augment correspond to predictions from the logistic form of the model (i.e. the log-odds):

augment(heart_disease_fit)  |> head()
# A tibble: 6 Γ— 8
  TenYearCHD   age .fitted .resid     .hat .sigma   .cooksd .std.resid
       <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>     <dbl>      <dbl>
1          0    39   -2.68 -0.363 0.000472  0.891 0.0000161     -0.363
2          0    46   -2.15 -0.469 0.000330  0.891 0.0000192     -0.469
3          0    48   -2.00 -0.504 0.000295  0.891 0.0000200     -0.504
4          1    61   -1.01  1.62  0.000730  0.891 0.000999       1.62 
5          0    46   -2.15 -0.469 0.000330  0.891 0.0000192     -0.469
6          0    43   -2.38 -0.421 0.000393  0.891 0.0000182     -0.421

Note: These residuals cannot be interpreted the way linear regression residuals are!

For observation 1

\[\text{predicted probability} = \hat{\pi} = \frac{\exp\{-2.685\}}{1 + \exp\{-2.685\}} = 0.064\]
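The same conversion can be done with `plogis()`, R's built-in inverse-logit function:

```r
# plogis() converts a log-odds value to a probability:
# plogis(x) = exp(x) / (1 + exp(x))
plogis(-2.685)   # predicted probability for observation 1, ~0.064
```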

Using predict with glm

Default output is log-odds:

predict(heart_disease_fit, newdata = heart_disease) |> head() |> kable(digits = 3)
x
-2.685
-2.150
-1.998
-1.006
-2.150
-2.379

Using predict with glm

More commonly you want the predicted probability:

predict(heart_disease_fit, newdata = heart_disease, type = "response") |> head() |> kable(digits = 3)
x
0.064
0.104
0.119
0.268
0.104
0.085
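Predicted odds were listed in the topics but not computed above; they follow directly from either `predict()` output. A sketch using observation 1's rounded values:

```r
# odds from the predicted probability
p_hat <- 0.064
p_hat / (1 - p_hat)

# equivalently, exponentiate the predicted log-odds
exp(-2.685)
# both are roughly 0.068
```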

Complete Exercise 4

Recap

  • Fit logistic regression models using glm
  • Interpreted coefficients in logistic regression models
  • Used logistic regression model to calculate predicted odds and probabilities
  • Used predict to make predictions from glm models