SLR: Conditions

Prof. Eric Friedlander

Sep 13, 2024

Application exercise

Questions from last class?

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# HEB Dataset
heb <- read_csv("data/HEBIncome.csv") |> 
  mutate(Avg_Income_K = Avg_Household_Income/1000)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Regression model, revisited

heb_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)

tidy(heb_fit) |>
  kable(digits = 3)
term          estimate  std.error  statistic  p.value
(Intercept)    -14.719      9.298     -1.583    0.122
Avg_Income_K     0.959      0.128      7.505    0.000

Mathematical representation, visualized

\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

Image source: Introduction to the Practice of Statistics (5th ed)
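
To make the model statement concrete, here is a minimal sketch that simulates data from \(Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2)\) and refits the line. The parameter values, the sample size, and the names `sim_fit`, `beta_0`, `beta_1`, and `sigma_eps` are hypothetical choices for illustration only; they are not estimates from the HEB data.

```r
# Minimal sketch: simulate from Y | X ~ N(beta_0 + beta_1 * X, sigma_eps^2).
# All values below are hypothetical and chosen only for illustration.
set.seed(210)

n         <- 500
beta_0    <- -15   # hypothetical intercept
beta_1    <- 1     # hypothetical slope
sigma_eps <- 17    # hypothetical error standard deviation

x <- runif(n, min = 30, max = 130)   # predictor values

# At each x, Y is normal with mean on the line and the same spread everywhere
y <- rnorm(n, mean = beta_0 + beta_1 * x, sd = sigma_eps)

sim_fit <- lm(y ~ x)
coef(sim_fit)    # estimates should land near beta_0 and beta_1
sigma(sim_fit)   # residual standard error should land near sigma_eps
```

Because the simulated data satisfy every model condition by construction, residual plots from `sim_fit` give a reference for what "no problems" tends to look like.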

Model conditions


  1. Linearity: There is a linear relationship between the outcome and predictor variables
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other

Linearity

  • If the linear model, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\), adequately describes the relationship between \(X\) and \(Y\), then the residuals should reflect only random (chance) error

  • To assess this, we can look at a plot of the residuals vs. the fitted values

  • Linearity is satisfied if there is no distinguishable pattern in the residual plot, i.e. the residuals are randomly scattered around zero

  • A non-random pattern (e.g. a parabola) suggests a linear model does not adequately describe the relationship between \(X\) and \(Y\)
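
The parabola case can be sketched with simulated data. Everything below is hypothetical (the data, the coefficients, and the helper `curvature_r2`); it is one rough numerical companion to the residual plot, not a replacement for looking at the plot.

```r
# Minimal sketch with simulated (hypothetical) data: a straight line fit to
# truly linear data vs. a straight line fit to curved data.
set.seed(210)
n <- 200
x <- runif(n, 0, 10)

y_linear <- 2 + 3 * x + rnorm(n, sd = 2)
y_curved <- 2 + 3 * x + 0.8 * (x - 5)^2 + rnorm(n, sd = 2)

fit_linear <- lm(y_linear ~ x)
fit_curved <- lm(y_curved ~ x)   # linear model forced onto curved data

# Least-squares residuals are always uncorrelated with the fitted values, so a
# curved pattern will not show up in a plain correlation. Hypothetical helper:
# regress the residuals on the squared, centered fitted values and report R^2.
curvature_r2 <- function(fit) {
  e <- resid(fit)
  f <- fitted(fit)
  summary(lm(e ~ I((f - mean(f))^2)))$r.squared
}

r2_linear <- curvature_r2(fit_linear)   # expected near 0: random scatter
r2_curved <- curvature_r2(fit_curved)   # expected much larger: parabola
c(linear = r2_linear, curved = r2_curved)
```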

Linearity

✅ The residuals vs. fitted values plot should show a random scatter of residuals (no distinguishable pattern or structure)

The augment function

heb_aug <- augment(heb_fit)

head(heb_aug) |> kable()
Number_Organic  Avg_Income_K   .fitted      .resid       .hat    .sigma    .cooksd  .std.resid
            36        71.186  53.55498  -17.554983  0.0272386  17.45293  0.0145449  -1.0192493
             4        34.234  18.11468  -14.114677  0.0925015  17.53471  0.0366889  -0.8484596
            28        71.186  53.55498  -25.554983  0.0272386  17.15160  0.0308219  -1.4837325
            31        48.760  32.04642   -1.046423  0.0493994  17.71691  0.0000981  -0.0614599
            78        78.096  60.18230   17.817702  0.0312671  17.44374  0.0173428   1.0366517
            14        40.506  24.13009  -10.130092  0.0711183  17.62593  0.0138683  -0.6018889
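
The `.fitted` and `.resid` columns can be reproduced by hand. This small sketch uses the rounded coefficients from the `tidy()` output and the first row of the `augment()` output; the variable names are made up for illustration. Because the coefficients are rounded to three decimals, the results match the table only to about two decimal places.

```r
# Rounded coefficients copied from the tidy() output
b0 <- -14.719
b1 <- 0.959

# First row of the augment() output
income_k <- 71.186   # Avg_Income_K
organic  <- 36       # Number_Organic

fitted_1 <- b0 + b1 * income_k   # fitted value = b0 + b1 * x
resid_1  <- organic - fitted_1   # residual = observed - fitted

round(c(fitted = fitted_1, resid = resid_1), 2)
```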

Residuals vs. fitted values (code)

heb_aug <- augment(heb_fit)

gf_point(.resid ~ .fitted, data = heb_aug) |> 
  gf_hline(yintercept = 0, linetype = "dashed") |> 
  gf_labs(
    x = "Fitted value", y = "Residual",
    title = "Residuals vs. fitted values"
  )

Non-linear relationships

Constant variance

  • If the spread of the distribution of \(Y\) is equal for all values of \(X\), then the spread of the residuals should be approximately equal for each value of \(X\)

  • To assess this, we can look at a plot of the residuals vs. the fitted values

  • Constant variance is satisfied if the vertical spread of the residuals is approximately equal as you move from left to right (i.e. there is no “fan” pattern)

  • A fan pattern suggests the constant variance condition is not satisfied and a transformation or some other remedy is required (more on this later in the semester)

  • CAREFUL: An uneven distribution of \(X\) values can make it seem as if there is non-constant variance, because regions with few observations tend to show a narrower range of residuals
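
A fan pattern can be sketched with simulated data. The data and the helper `spread_ratio` below are hypothetical; comparing the residual spread in the upper and lower halves of the fitted values is only one rough way to put a number on what the plot shows.

```r
# Minimal sketch with simulated (hypothetical) data: constant spread vs. a fan.
set.seed(210)
n <- 400
x <- runif(n, 1, 10)

y_const <- 5 + 2 * x + rnorm(n, sd = 2)         # same spread everywhere
y_fan   <- 5 + 2 * x + rnorm(n, sd = 0.6 * x)   # spread grows with x

# Hypothetical helper: SD of residuals above the median fitted value divided
# by SD of residuals at or below it. A ratio near 1 suggests similar spread.
spread_ratio <- function(fit) {
  e <- resid(fit)
  f <- fitted(fit)
  upper <- f > median(f)
  sd(e[upper]) / sd(e[!upper])
}

ratio_const <- spread_ratio(lm(y_const ~ x))   # expected near 1
ratio_fan   <- spread_ratio(lm(y_fan ~ x))     # expected well above 1
c(constant = ratio_const, fan = ratio_fan)
```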

Constant variance

✅ The vertical spread of the residuals is relatively constant across the plot

Non-constant variance

  • Think: Is my error/variance proportional to the thing I’m predicting?

Normality

  • The linear model assumes that the distribution of \(Y\) is normal for every value of \(X\)

  • This is usually impossible to check directly, since there are rarely enough observations at any single value of \(X\), so we will look at the overall distribution of the residuals to assess if the normality assumption is satisfied

  • Normality is satisfied if a histogram of the residuals is approximately normal

    • Can also check that the points on a normal QQ-plot fall along a diagonal line
  • Most inferential methods for regression are robust to some departures from normality, so we can proceed with inference if the sample size is sufficiently large; a common rule of thumb is \(n > 30\)
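
To see what a QQ-plot compares, here is a minimal sketch with simulated residuals. The data and the helper `qq_cor` are hypothetical. It uses base R's `qqnorm()` with `plot.it = FALSE` to get the theoretical and observed quantiles without drawing anything, then measures how close the points are to a straight line.

```r
# Minimal sketch with simulated (hypothetical) residuals
set.seed(210)
n <- 200

e_normal <- rnorm(n)      # residuals that really are normal
e_skewed <- rexp(n) - 1   # right-skewed residuals centered near 0

# Hypothetical helper: correlation between theoretical and observed quantiles.
# Values very close to 1 mean the QQ-plot points fall near a straight line.
qq_cor <- function(e) {
  qq <- qqnorm(e, plot.it = FALSE)   # qq$x = theoretical, qq$y = observed
  cor(qq$x, qq$y)
}

cor_normal <- qq_cor(e_normal)   # expected very close to 1
cor_skewed <- qq_cor(e_skewed)   # expected noticeably lower
c(normal = cor_normal, skewed = cor_skewed)
```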

Normality

Check normality using a QQ-plot

gf_histogram(~.resid, data = heb_aug,
             bins=7, color = "white") |> 
  gf_labs(
    x = "Residual",
    y = "Count",
    title = "Histogram of residuals"
  )

gf_qq(~.resid, data = heb_aug) |> 
  gf_qqline() |> 
  gf_labs(
    x = "Theoretical quantile",
    y = "Observed quantile",
    title = "Normal QQ-plot of residuals"
  )

  • Assess whether the residuals lie along the diagonal line of the quantile-quantile plot (QQ-plot).

  • If so, the residuals are approximately normally distributed.

Normality

❌ The residuals do not appear to follow a normal distribution, because the points do not lie on the diagonal line, so normality is not satisfied.

βœ… The sample size \(n = 37 > 30\), so the sample size is large enough to relax this condition and proceed with inference.

Independence

  • We can often check the independence assumption based on the context of the data and how the observations were collected

  • Two common violations of the independence assumption:

    • Serial Effect: If the data were collected over time, plot the residuals in time order to see if there is a pattern (serial correlation)

    • Cluster Effect: If there are subgroups represented in the data that are not accounted for in the model (e.g., type of supermarket), you can color the points in the residual plots by group to see if the model systematically over- or under-predicts for a particular subgroup
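
Both violations can be sketched with simulated data. Everything below is hypothetical: the data, the group labels, and the helper `lag1_cor`. The lag-1 correlation of residuals in time order and the mean residual per group are numerical companions to the plots described above, not formal tests.

```r
# Minimal sketch with simulated (hypothetical) data
set.seed(210)
n <- 200
x <- runif(n, 0, 10)

# Serial effect: rows are assumed to be in time order
e_serial <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # AR(1) errors
y_indep  <- 1 + 0.5 * x + rnorm(n)
y_serial <- 1 + 0.5 * x + e_serial

# Hypothetical helper: correlation between each residual and the previous one
lag1_cor <- function(fit) {
  e <- resid(fit)
  cor(e[-1], e[-length(e)])
}

lag1_indep  <- lag1_cor(lm(y_indep ~ x))    # expected near 0
lag1_serial <- lag1_cor(lm(y_serial ~ x))   # expected clearly positive

# Cluster effect: a subgroup shifts the outcome but is left out of the model
group     <- rep(c("A", "B"), each = n / 2)   # hypothetical subgroup label
y_cluster <- 1 + 0.5 * x + ifelse(group == "B", 3, 0) + rnorm(n)

fit_cluster <- lm(y_cluster ~ x)
group_means <- tapply(resid(fit_cluster), group, mean)
group_means   # model over-predicts for A (negative), under-predicts for B
```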

Independence

Recall the description of the data:

  • Average household income (per zip code) and number of organic vegetable offerings in San Antonio, TX

  • Data from HEB website, compiled by high school student Linda Saucedo, Fall 2019


❌ Based on the information we have, it’s unclear if the data are independent. In fact, I’d guess that they are likely geographically correlated.

Recap

Used residual plots to check conditions for SLR:

  • Linearity
  • Constant variance
  • Normality
  • Independence

  1. Which of these conditions are required for fitting an SLR (and not doing any inference)?
  2. Which are required for simulation-based inference for the slope of an SLR?
  3. Which are required for inference with mathematical models?