SLR: Math Models Continued

Prof. Eric Friedlander

Sep 11, 2024

Questions from last class?

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# HEB Dataset
heb <- read_csv("data/HEBIncome.csv") |> 
  mutate(Avg_Income_K = Avg_Household_Income/1000)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Regression model, revisited

heb_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)

tidy(heb_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-14.719	9.298	-1.583	0.122
Avg_Income_K	0.959	0.128	7.505	0.000

Mathematical representation, visualized

\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

Image source: Introduction to the Practice of Statistics (5th ed)

Confidence interval for the slope

\[ \text{Estimate} \pm \text{ (critical value) } \times \text{SE} \]

\[ \hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1} \]

where $t^*$ is calculated from a $t$ distribution with $n-2$ degrees of freedom

Confidence interval: Critical value

# confidence level: 95%
qt(0.975, df = nrow(heb) - 2)

[1] 2.030108

# confidence level: 90%
qt(0.95, df = nrow(heb) - 2)

[1] 1.689572

# confidence level: 99%
qt(0.995, df = nrow(heb) - 2)

[1] 2.723806

95% CI for the slope: Calculation

term	estimate	std.error	statistic	p.value
(Intercept)	-14.72	9.30	-1.58	0.12
Avg_Income_K	0.96	0.13	7.50	0.00

\[\hat{\beta}_1 = 0.96 \hspace{15mm} t^* = 2.03 \hspace{15mm} SE_{\hat{\beta}_1} = 0.13\]

\[ 0.96 \pm 2.03 \times 0.13 = (0.70, 1.22) \]

95% CI for the slope: Computation

tidy(heb_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 2)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-14.72	9.30	-1.58	0.12	-33.59	4.16
Avg_Income_K	0.96	0.13	7.50	0.00	0.70	1.22

Intervals for predictions

Question: “What is the predicted number of organic vegetable options in a neighborhood with an average income of $70k?”
We said reporting a single estimate for the slope is not wise, and we should report a plausible range instead
Similarly, reporting a single prediction for a new value is not wise, and we should report a plausible range instead

Two types of predictions

Prediction for the mean: “What is the average predicted number of organic vegetable options in a neighborhood with an average income of $70k?”
Prediction for an individual observation: “What is the predicted number of organic vegetable options at a single HEB in a neighborhood with an average income of $70k?”

Which would you expect to be more variable? The average prediction or the prediction for an individual observation? Based on your answer, how would you expect the widths of plausible ranges for these two predictions to compare?

Uncertainty in predictions

Confidence interval for the mean outcome: \[\large{\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE}_{\hat{\boldsymbol{\mu}}}}}\]

Prediction interval for an individual observation: \[\large{\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE_{\hat{y}}}}}\]

Standard errors

Standard error of the mean outcome: \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

Standard error of an individual outcome: \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

Standard errors

Standard error of the mean outcome: \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

Standard error of an individual outcome: \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{\mathbf{\color{purple}{\Large{1}}} + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

Confidence interval

The 95% confidence interval for the mean outcome:

new_neighborhood <- tibble(Avg_Income_K = 70)

predict(heb_fit, newdata = new_neighborhood, interval = "confidence", level = 0.95) |>
  kable()

fit	lwr	upr
52.4175	46.58558	58.24942

We are 95% confident that the mean number of organic vegetable options offered by HEB in a neighborhood with an average income of $70k is between 46.59 and 58.25.

Prediction interval

The 95% prediction interval for an individual outcome:

predict(heb_fit, newdata = new_neighborhood, interval = "prediction", level = 0.95) |>
  kable()

fit	lwr	upr
52.4175	16.48941	88.34559

We are 95% confident that the number of organic vegetable options offered by HEB in a neighborhood with an average income of $70k is between 16.49 and 88.35.

Comparing intervals

Extrapolation

Using the model to predict for values outside the range of the original data is extrapolation.

Calculate the prediction interval for the number of organic options in an extremely wealthy neighborhood with an average household income of $500k.

No, thanks!

AI Generated Image of "A very rich person eating organic vegetables"

Extrapolation

Why do we want to avoid extrapolation?