Inference for Logistic Regression Models

Prof. Eric Friedlander

Nov 13, 2024

Announcements

📋 AE 21 - Inference for Logistic Regression Models

  • Open up AE 21 and complete Exercise 0.

Topics

  • Inference for coefficients in logistic regression

Computational setup

# load packages
library(tidyverse)
library(broom)
library(ggformula)
library(knitr)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Data

Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

  • TenYearCHD:

    • 1: Developed heart disease in next 10 years
    • 0: Did not develop heart disease in next 10 years
  • age: Age at exam time (in years)

Data prep: What’s wrong with this code?

heart_disease <- read_csv("data/framingham.csv") |>
  select(TenYearCHD, age) |>
  drop_na() |> 
  head() |> 
  kable()

Data prep

heart_disease <- read_csv("data/framingham.csv") |>
  select(TenYearCHD, age) |>
  drop_na()

heart_disease |> head() |> kable()
TenYearCHD  age
         0   39
         0   46
         0   48
         1   61
         0   46
         0   43

Modeling risk of coronary heart disease

What’s wrong with this code?

risk_fit <- glm(TenYearCHD ~ age, data = heart_disease, 
                family = "binomial") |> 
  tidy() |> 
  kable()

Modeling risk of coronary heart disease

Using age:

risk_fit <- glm(TenYearCHD ~ age, data = heart_disease, 
                family = "binomial")

risk_fit |> tidy() |> kable()
term           estimate  std.error  statistic  p.value
(Intercept)  -5.5610898  0.2837460  -19.59883        0
age           0.0746501  0.0052651   14.17821        0

Inference for Logistic Regression

Recall: Inference for Linear Regression

  • t-test: determine whether \(\beta_1\) (the slope) is different from zero
  • ANOVA/F-test: test the full model or compare nested models
  • SSModel/SSE/\(R^2\)/\(\hat{\sigma}_{\epsilon}\): metrics that try to measure the amount of variability explained by competing models

Hypothesis test for \(\beta_1\)

Hypotheses: \(H_0: \beta_1 = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_1 \neq 0\)

Test Statistic: \[z = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}}\]

The statistic \(z\) is sometimes called a Wald statistic, and this test is sometimes called a Wald test.

P-value: \(P(|Z| > |z|)\), where \(Z \sim N(0, 1)\), the Standard Normal distribution
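This test can be reproduced directly in base R. A quick sketch, plugging in the estimate and standard error from the `age` row of the model output above:

```r
# Wald test for the age coefficient, using the values reported in the model output
beta_hat <- 0.0746501
se_hat   <- 0.0052651

z <- (beta_hat - 0) / se_hat                       # Wald statistic
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value from N(0, 1)

z        # roughly 14.18
p_value  # effectively zero
```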

Confidence interval for \(\beta_1\)

We can calculate the \(C\%\) confidence interval for \(\beta_1\) as follows:

\[ \Large{\hat{\beta}_1 \pm z^* SE_{\hat{\beta}_1}} \]

where \(z^*\) is calculated from the \(N(0,1)\) distribution

Note

This is an interval for the change in the log-odds for every one unit increase in \(x\)
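A sketch of this computation in R, using the `age` estimate and standard error from the model output above. Note this is the Wald interval; `tidy(conf.int = TRUE)` reports profile-likelihood intervals, which can differ slightly:

```r
# 95% Wald confidence interval for the age coefficient (log-odds scale)
beta_hat <- 0.0746501
se_hat   <- 0.0052651

z_star <- qnorm(0.975)                       # critical value from N(0, 1)
ci <- beta_hat + c(-1, 1) * z_star * se_hat  # estimate +/- z* x SE
round(ci, 3)                                 # roughly 0.064 to 0.085
```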

Interpretation in terms of the odds

Exponentiating the endpoints gives an interval for the multiplicative change in the odds for every one unit increase in \(x_1\):

\[ \Large{\exp\{\hat{\beta}_1 \pm z^* SE_{\hat{\beta}_1}\}} \]

Interpretation: We are \(C\%\) confident that for every one unit increase in \(x_1\), the odds multiply by a factor of \(\exp\{\hat{\beta}_1 - z^* SE_{\hat{\beta}_1}\}\) to \(\exp\{\hat{\beta}_1 + z^* SE_{\hat{\beta}_1}\}\), holding all else constant.
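In R, this is just `exp()` applied to the Wald interval endpoints computed from the `age` estimate and standard error above:

```r
# 95% CI for the multiplicative change in odds per one-year increase in age
beta_hat <- 0.0746501
se_hat   <- 0.0052651

ci_odds <- exp(beta_hat + c(-1, 1) * qnorm(0.975) * se_hat)
round(ci_odds, 3)   # roughly 1.066 to 1.089
```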

Coefficient for age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -5.561      0.284    -19.599        0    -6.124     -5.011
age             0.075      0.005     14.178        0     0.064      0.085

Hypotheses:

\[ H_0: \beta_{age} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{age} \neq 0 \]

Coefficient for age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -5.561      0.284    -19.599        0    -6.124     -5.011
age             0.075      0.005     14.178        0     0.064      0.085

Test statistic:

\[z = \frac{0.0747 - 0}{0.00527} \approx 14.178\]

Note: the hand computation differs slightly from the table value because the estimate and standard error are rounded.

Coefficient for age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -5.561      0.284    -19.599        0    -6.124     -5.011
age             0.075      0.005     14.178        0     0.064      0.085

P-value:

\[ P(|Z| > 14.178) \approx 0 \]

2 * pnorm(14.178, lower.tail = FALSE)
[1] 1.253689e-45

Coefficient for age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -5.561      0.284    -19.599        0    -6.124     -5.011
age             0.075      0.005     14.178        0     0.064      0.085

Conclusion:

The p-value is very small, so we reject \(H_0\). The data provide sufficient evidence that age is a statistically significant predictor of whether someone will develop heart disease in the next 10 years.

CI for age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -5.561      0.284    -19.599        0    -6.124     -5.011
age             0.075      0.005     14.178        0     0.064      0.085

We are 95% confident that for each additional year of age, the change in the log-odds of someone developing heart disease in the next 10 years is between 0.064 and 0.085.

We are 95% confident that for each additional year of age, the odds of someone developing heart disease in the next 10 years will increase by a factor of \(\exp(0.064) \approx 1.066\) to \(\exp(0.085)\approx 1.089\).
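The whole workflow, on both the log-odds and odds scales, can be sketched end-to-end in base R. This example uses simulated data (not the Framingham data) so it is self-contained; the coefficients in the simulation are assumptions chosen to mimic the fitted model above:

```r
# Illustrative sketch on simulated data: log-odds of the outcome increase with age
set.seed(1)
age <- round(runif(500, 30, 70))
p   <- plogis(-5.5 + 0.075 * age)   # true model on the probability scale
chd <- rbinom(500, 1, p)

fit <- glm(chd ~ age, family = "binomial")

ci_log_odds <- confint.default(fit)["age", ]  # Wald CI on the log-odds scale
ci_odds     <- exp(ci_log_odds)               # same interval on the odds scale

ci_log_odds
ci_odds
```

`confint.default()` gives Wald intervals; `confint()` on a `glm` object would give profile-likelihood intervals instead, which are generally close but not identical.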

Complete Exercises 1-4.

Recap

Inference: Linear vs. Logistic Regression

  • t-test: determine whether \(\beta_1\) (the slope) is different from zero
  • ANOVA/F-test: test the full model or compare nested models
  • SSModel/SSE/\(R^2\)/\(\hat{\sigma}_{\epsilon}\): metrics that try to measure the amount of variability explained by competing models

Inference: Linear vs. Logistic Regression

  • t-test: determine whether \(\beta_1\) (the slope) is different from zero
  • ANOVA/F-test: test the full model or compare nested models (Next time)
  • SSModel/SSE/\(R^2\)/\(\hat{\sigma}_{\epsilon}\): metrics that try to measure the amount of variability explained by competing models (Next time)

Inference for \(\beta_1\)

  • Same idea as linear regression
  • Define null and alternative hypotheses
    • Null: my variable has no effect
    • Alternative: my variable has an effect
  • Compute test statistic (how many standard errors is my observed \(\hat{\beta}_1\) away from 0?)
  • Compute p-value
    • What is the probability of observing my \(\hat{\beta}_1\), or even stronger evidence of an effect, if the null hypothesis were true?