MLR: Conditions

Prof. Eric Friedlander

Oct 21, 2024

Announcements

  • Project: EDA Due Wednesday, October 30th
  • Oral R Quiz (time to start scheduling it)

📋 AE 13 - Rail Trails

  • Open up AE 13

Topics

  • Checking model conditions

Computational setup

# load packages
library(tidyverse)
library(broom)
library(mosaic)
library(mosaicData)
library(patchwork)
library(knitr)
library(kableExtra)
library(scales)
library(countdown)
library(rms)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Data: rail_trail

  • The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
  • Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.
# A tibble: 90 × 7
   volume hightemp avgtemp season cloudcover precip day_type
    <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
 1    501       83    66.5 Summer       7.60 0      Weekday 
 2    419       73    61   Summer       6.30 0.290  Weekday 
 3    397       74    63   Spring       7.5  0.320  Weekday 
 4    385       95    78   Summer       2.60 0      Weekend 
 5    200       44    48   Spring      10    0.140  Weekday 
 6    375       69    61.5 Spring       6.60 0.0200 Weekday 
 7    417       66    52.5 Spring       2.40 0      Weekday 
 8    629       66    52   Spring       0    0      Weekend 
 9    533       80    67.5 Summer       3.80 0      Weekend 
10    547       79    62   Summer       4.10 0      Weekday 
# ℹ 80 more rows

Source: Pioneer Valley Planning Commission via the mosaicData package.

Conditions for inference

Full model

Complete Exercise 0 to fit the so-called “full-model”.

Full model

term               estimate  std.error   statistic    p.value
(Intercept)       17.622161  76.582860   0.2301058  0.8185826
hightemp           7.070528   2.420523   2.9210743  0.0045045
avgtemp           -2.036685   3.142113  -0.6481896  0.5186733
seasonSpring      35.914983  32.992762   1.0885716  0.2795319
seasonSummer      24.153571  52.810486   0.4573632  0.6486195
cloudcover        -7.251776   3.843071  -1.8869743  0.0627025
precip           -95.696525  42.573359  -2.2478030  0.0272735
day_typeWeekend   35.903750  22.429056   1.6007696  0.1132738
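
A minimal sketch of how a coefficient table like this one can be produced. The data frame below is a simulated stand-in with the same column names as rail_trail; its values are made up, so the estimates will not match the table above. The names rail_sim, full_fit, and coef_table are hypothetical, and base R is used so the sketch runs without extra packages.

```r
# Simulated stand-in for rail_trail (made-up values, same column names)
set.seed(210)
n <- 90
rail_sim <- data.frame(
  hightemp   = round(runif(n, 40, 95)),
  season     = sample(c("Fall", "Spring", "Summer"), n, replace = TRUE),
  cloudcover = round(runif(n, 0, 10), 1),
  precip     = round(rexp(n, rate = 10), 2),
  day_type   = sample(c("Weekday", "Weekend"), n, replace = TRUE)
)
rail_sim$avgtemp <- rail_sim$hightemp - round(runif(n, 8, 18))
rail_sim$volume  <- round(100 + 4 * rail_sim$hightemp - 8 * rail_sim$cloudcover -
                            90 * rail_sim$precip + rnorm(n, sd = 80))

# "Full model": volume regressed on every other column in the data frame
full_fit <- lm(volume ~ ., data = rail_sim)

# Coefficient table: estimate, standard error, t statistic, p-value
coef_table <- summary(full_fit)$coefficients
print(round(coef_table, 4))
```

With the real data, rail_sim would be replaced by rail_trail. With broom loaded, tidy(full_fit) returns the same information as a tibble.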

Model conditions

Our model conditions are the same as they were with SLR:

  1. Linearity: There is a linear relationship between the response and predictor variables.

  2. Constant Variance: The variability about the least squares line is generally constant.

  3. Normality: The distribution of the residuals is approximately normal.

  4. Independence: The residuals are independent from each other.
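
One compact way to write the four conditions together, assuming p predictors and an error term for each observation (this notation is an assumption; it does not appear elsewhere in these slides):

```latex
$$
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i,
\qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2_\epsilon)
$$
```

The linear mean structure corresponds to linearity, the single variance to constant variance, the normal distribution to normality, and "iid" to independence.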

Checking Linearity

  • Look at a plot of the residuals vs. predicted values

  • Look at a plot of the residuals vs. each predictor

    • Use scatter plots for quantitative predictors and boxplots for categorical predictors
  • Linearity is met if there is no discernible pattern in each of these plots
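
The plots described above can be sketched as follows. This is an illustration on simulated data (made-up values with a deliberately curved relationship), not the rail_trail analysis; the names sim and fit are hypothetical, and base R graphics are used so the sketch runs without extra packages.

```r
# Simulated data (made-up values): volume depends on hightemp in a curved way
set.seed(1)
n <- 90
sim <- data.frame(
  hightemp = runif(n, 40, 95),
  day_type = factor(sample(c("Weekday", "Weekend"), n, replace = TRUE))
)
sim$volume <- 400 - 0.5 * (sim$hightemp - 75)^2 +
  40 * (sim$day_type == "Weekend") + rnorm(n, sd = 40)

# Fit a model that is linear in hightemp, then store predictions and residuals
fit <- lm(volume ~ hightemp + day_type, data = sim)
sim$.fitted <- fitted(fit)
sim$.resid  <- resid(fit)

op <- par(mfrow = c(1, 3))
# Residuals vs. predicted values
plot(sim$.fitted, sim$.resid, xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
# Residuals vs. a quantitative predictor: scatter plot
plot(sim$hightemp, sim$.resid, xlab = "hightemp", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
# Residuals vs. a categorical predictor: side-by-side boxplots
boxplot(.resid ~ day_type, data = sim, xlab = "day_type", ylab = "Residuals")
par(op)
```

Because the simulated relationship is curved, the scatter plot against hightemp should show an arch around the zero line, which is the kind of discernible pattern that signals a linearity problem.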

Complete Exercises 1-4

Residuals vs. predicted values

Residuals vs. each predictor

Checking linearity

  • The plot of the residuals vs. predicted values looks OK.

  • The plots of residuals vs. hightemp and avgtemp appear to have a parabolic pattern.

  • The linearity condition does not appear to be satisfied given these plots.

Given this conclusion, what might be a next step in the analysis?

Checking constant variance

Does the constant variance condition appear to be satisfied?

Checking constant variance

  • The vertical spread of the residuals is not constant across the plot.

  • The constant variance condition is not satisfied.
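
A rough numeric companion to the visual check, sketched on simulated data (made-up values in which the noise grows with the predictor). Splitting the predicted values into thirds and comparing the residual standard deviation in each third is an illustrative choice, not a method from these slides; all object names are hypothetical.

```r
# Simulated data (made-up values): error SD increases with x
set.seed(2)
n <- 90
x <- runif(n, 40, 95)
y <- 5 * x + rnorm(n, sd = 2 * (x - 35))

fit      <- lm(y ~ x)
res      <- resid(fit)
fit_vals <- fitted(fit)

# Cut the predicted values into thirds (low / middle / high)
bins <- cut(
  fit_vals,
  breaks = quantile(fit_vals, probs = c(0, 1 / 3, 2 / 3, 1)),
  include.lowest = TRUE,
  labels = c("low", "middle", "high")
)

# Standard deviation of the residuals within each third
spread <- tapply(res, bins, sd)
print(round(spread, 1))
```

Roughly equal values across the three groups would be consistent with constant variance; a clear trend, as in this simulated example, matches a fan shape in the residual plot.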

Given this conclusion, what might be a next step in the analysis?

Complete Exercises 5-6.

Checking normality

The distribution of the residuals is approximately unimodal and symmetric, so the normality condition is satisfied. Even if it were not, the sample size of 90 is large enough to relax this condition.
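
A minimal sketch of two common plots for this check, a histogram and a normal quantile plot of the residuals. It uses simulated data (made-up values with normal errors) and base R graphics; the object names are hypothetical.

```r
# Simulated data (made-up values) with normally distributed errors
set.seed(3)
n <- 90
x <- runif(n, 40, 95)
y <- 100 + 4 * x + rnorm(n, sd = 60)

fit <- lm(y ~ x)
res <- resid(fit)

op <- par(mfrow = c(1, 2))
# Histogram: look for a roughly unimodal, symmetric shape
hist(res, breaks = 15, main = "Histogram of residuals", xlab = "Residuals")
# Normal quantile plot: points should fall close to the reference line
qqnorm(res)
qqline(res, col = "red")
par(op)
```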

Checking independence

  • We can often check the independence condition based on the context of the data and how the observations were collected.

  • If the data were collected in a particular order, examine a scatter plot of the residuals versus order in which the data were collected.

  • If there is a grouping variable lurking in the background, check the residuals based on that grouping variable.

  • Why might the independence condition be violated here?
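
The two checks above (residuals by collection order and residuals by a background grouping variable) can be sketched numerically. The data are simulated (made-up values) so that each day's error carries over into the next day; the weekly grouping and the lag-1 correlation are illustrative choices, and all object names are hypothetical.

```r
# Simulated data (made-up values): errors on consecutive days are correlated
set.seed(4)
n   <- 90
day <- seq_len(n)
x   <- runif(n, 40, 95)

e <- numeric(n)
e[1] <- rnorm(1, sd = 40)
for (i in 2:n) {
  e[i] <- 0.7 * e[i - 1] + rnorm(1, sd = 40)  # today's error depends on yesterday's
}
y <- 100 + 4 * x + e

fit <- lm(y ~ x)
res <- resid(fit)

# Correlation between each residual and the next one in collection order
lag1_cor <- cor(res[-n], res[-1])
print(round(lag1_cor, 2))

# Mean residual within each week, a grouping variable not in the model
week       <- factor((day - 1) %/% 7 + 1)
week_means <- tapply(res, week, mean)
print(round(week_means, 1))
```

A lag-1 correlation near zero and group means scattered around zero would be consistent with independence; a clearly positive correlation, as in this simulated example, suggests that observations close in time are related.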

Checking independence

Residuals vs. order of data collection:

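# Plot residuals against row position (assumes rows are in collection order)
gf_line(.resid ~ 1:nrow(rt_full_aug), data = rt_full_aug) |>
  gf_point() |>
  # Dashed reference line at zero
  gf_hline(yintercept = 0, color = "red", linetype = "dashed") |>
  gf_labs(x = "Order of data collection", y = "Residuals")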