SLR: Simulation-based inference

Bootstrap confidence intervals for the slope

Author

Prof. Eric Friedlander

Published

Sep 04, 2024

AE 02 Follow-up

AE 02 Follow-up

Goal: Use simple linear regression to model the relationship between temperature and daily bike rentals in the winter season

AE 02 Follow-up

Statistical Model:

\[count = \beta_0 +\beta_1 ~ temp\_orig + \epsilon, \hspace{5mm} \epsilon \sim N(0, \sigma_{\epsilon}^2)\]

. . .

winter_fit <- lm(count ~ temp_orig, data = winter)

tidy(winter_fit) |> kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -111.038 238.312 -0.466 0.642
temp_orig 222.416 18.459 12.049 0.000

AE 02 Follow-up

Use the output to write out the estimated regression equation.

\[ \hat{count} = -111.038 + 222.416 ~temp\_orig \]

LaTex:

\$\$\hat{count} = -111.038 + 222.416 ~ temp\_orig\$\$

Your turn!
  • Interpret the slope in the context of the data.

  • Why is there no error term in the regression equation?

Simulation-based inference

Bootstrap confidence intervals

Topics

  • Find range of plausible values for the slope using bootstrap confidence intervals

Computational setup

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)  # for modeling
library(scales)      # for pretty axis labels
library(knitr)       # for neatly formatted tables
library(kableExtra)  # also for neatly formatted tablesf


# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

Data: San Antonio Income & Organic Food Access

  • Average household income (per zip code) and number of organic vegetable offerings in San Antonio, TX
  • Data from HEB website, compiles by high school student Linda Saucedo, Fall 2019
  • Source: Skew The Script

Goal: Use the average household income to understand variation in access to organic foods.

Exploratory data analysis

Code
heb <- read_csv("data/HEBIncome.csv") |> 
  mutate(Avg_Income_K = Avg_Household_Income/1000)

gf_point(Number_Organic ~ Avg_Income_K, data = heb, alpha = 0.7) |> 
  gf_labs(
    x = "Average Household Income (in thousands)",
    y = "Number of Organic Vegetables",
  ) |> 
  gf_refine(scale_x_continuous(labels = label_dollar()))

Modeling

heb_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)

tidy(heb_fit) |>
  kable(digits=2) #neatly format table to 2 digits
term estimate std.error statistic p.value
(Intercept) -14.72 9.30 -1.58 0.12
Avg_Income_K 0.96 0.13 7.50 0.00

. . .

  • Intercept: HEBs in Zip Codes with an average household income of $0 are expected to have -14.72 organic vegetable options, on average.
    • Is this interpretation useful?
  • Slope: For each additional $1,000 in average household income, we expect the number of organic options available at nearby HEBs to increase by 0.96, on average.

From sample to population

For each additional $1,000 in average household income, we expect the number of organic options available at nearby HEBs to increase by 0.96, on average.


  • Estimate is valid for the single sample of 37 HEBs
  • What if we’re not interested quantifying the relationship between the size and price of a house in this single sample?
  • What if we want to say something about the relationship between these variables for all supermarkets in America?

Statistical inference

  • Statistical inference refers to ideas, methods, and tools for to generalizing the single observed sample to make statements (inferences) about the population it comes from

  • For our inferences to be valid, the sample should be random and representative of the population we’re interested in

Inference for simple linear regression

  • Calculate a confidence interval for the slope, \(\beta_1\)

  • Conduct a hypothesis test for the slope, \(\beta_1\)

  • Why not \(\beta_0\)?

  • We can but it isn’t super interesting typically

. . .

  • What is a confidence interval?

  • What is a hypothesis test?

Confidence interval for the slope

Confidence interval

  • Confidence interval: plausible range of values for a population parameter
  • single point estimate \(\implies\) fishing in a murky lake with a spear
  • confidence interval \(\implies\) fishing with a net
    • We can throw a spear where we saw a fish but we will probably miss, if we toss a net in that area, we have a good chance of catching the fish
    • If we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values we have a good shot at capturing the parameter
    • High confidence \(\implies\) wider interval (larger net)
  • Remember: single CI \(\implies\) either you hit parameter or you don’t

Confidence interval for the slope

A confidence interval will allow us to make a statement like “For each $1K in average income, the model predicts the number of organic vegetables available at local supermarkets to be higher, on average, by 0.96, plus or minus X options.

. . .

  • Should X be 1? 2? 3?

  • If we were to take another sample of 37 would we expect the slope calculated based on that sample to be exactly 0.96? Off by 1? 2? 3?

  • The answer depends on how variable (from one sample to another sample) the sample statistic (the slope) is

  • We need a way to quantify the variability of the sample statistic

Quantify the variability of the slope

for estimation

  • Two approaches:
    1. Via simulation (what we’ll do today)
    2. Via mathematical models (what we’ll do in the soon)
  • Bootstrapping to quantify the variability of the slope for the purpose of estimation:
    • Generate new samples by sampling with replacement from the original sample
    • Fit models to each of the new samples and estimate the slope
    • Use features of the distribution of the bootstrapped slopes to construct a confidence interval

Original Sample

Bootstrap sample 1

Bootstrap sample 2

Bootstrap sample 3

Bootstrap sample 4

Bootstrap sample 5

Bootstrap samples 1 - 5

Bootstrap samples 1 - 100

Slopes of bootstrap samples

Fill in the blank: For each additional $1k in average household income, the model predicts the number of organic vegetables available to be higher, on average, by 0.96, plus or minus ___.

Slopes of bootstrap samples

Fill in the blank: For each additional $1k in average household income, the model predicts the number of organic vegetables available to be higher, on average, by 0.96, plus or minus ___.

Confidence level

How confident are you that the true slope is between 0.8 and 1.2? How about 0.9 and 1.0? How about 1.0 and 1.4?

95% confidence interval

  • 95% bootstrapped confidence interval: bounded by the middle 95% of the bootstrap distribution
  • We are 95% confident that for each additional $1K in average household income, the model predicts the number of organic vegetables options at local supermarkets to be higher, on average, by 0.81 to 1.31.

Computing the CI for the slope I

Calculate the observed slope:

library(infer) # package that does Simulation-Based Inference

observed_fit <- heb |>
  specify(Number_Organic ~ Avg_Income_K) |>
  fit()

observed_fit
# A tibble: 2 × 2
  term         estimate
  <chr>           <dbl>
1 intercept     -14.7  
2 Avg_Income_K    0.959

Computing the CI for the slope II

Take 100 bootstrap samples and fit models to each one:

set.seed(1120)

boot_fits <- heb |>
  specify(Number_Organic ~ Avg_Income_K) |>
  generate(reps = 100, type = "bootstrap") |>
  fit()

boot_fits
# A tibble: 200 × 3
# Groups:   replicate [100]
   replicate term         estimate
       <int> <chr>           <dbl>
 1         1 intercept     -40.9  
 2         1 Avg_Income_K    1.25 
 3         2 intercept     -23.9  
 4         2 Avg_Income_K    1.09 
 5         3 intercept     -18.6  
 6         3 Avg_Income_K    1.02 
 7         4 intercept      -1.96 
 8         4 Avg_Income_K    0.828
 9         5 intercept     -15.1  
10         5 Avg_Income_K    0.951
# ℹ 190 more rows

Computing the CI for the slope III

Percentile method: Compute the 95% CI as the middle 95% of the bootstrap distribution:

get_confidence_interval(
  boot_fits, 
  point_estimate = observed_fit, 
  level = 0.95,
  type = "percentile" #default method
)
# A tibble: 2 × 3
  term         lower_ci upper_ci
  <chr>           <dbl>    <dbl>
1 Avg_Income_K    0.822     1.27
2 intercept     -34.5      -2.56

Precision vs. accuracy

If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?

. . .

Precision vs. accuracy

How can we get best of both worlds – high precision and high accuracy?

Changing confidence level

How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?

get_confidence_interval(
  boot_fits, 
  point_estimate = observed_fit, 
  level = 0.95,
  type = "percentile"
)
# A tibble: 2 × 3
  term         lower_ci upper_ci
  <chr>           <dbl>    <dbl>
1 Avg_Income_K    0.822     1.27
2 intercept     -34.5      -2.56

Changing confidence level

## confidence level: 90%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit, 
  level = 0.90, type = "percentile"
)
# A tibble: 2 × 3
  term         lower_ci upper_ci
  <chr>           <dbl>    <dbl>
1 Avg_Income_K    0.829     1.23
2 intercept     -31.7      -3.76
## confidence level: 99%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit, 
  level = 0.99, type = "percentile"
)
# A tibble: 2 × 3
  term         lower_ci upper_ci
  <chr>           <dbl>    <dbl>
1 Avg_Income_K    0.795    1.36 
2 intercept     -43.3     -0.535

Application exercise

Recap

  • Population: Complete set of observations of whatever we are studying, e.g., people, tweets, photographs, etc. (population size = \(N\))

  • Sample: Subset of the population, ideally random and representative (sample size = \(n\))

  • Sample statistic \(\ne\) population parameter, but if the sample is good, it can be a good estimate

  • Statistical inference: Discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) process

  • We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics from different samples from the population

  • Since we can’t continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability