SLR: Outliers

Prof. Eric Friedlander

Sep 25, 2024

Double Application Exercise

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(broom)       # for formatting model output
library(ggformula)   # for creating plots using formulas
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(moderndive)  # for house_price dataset
library(fivethirtyeight) # for fandango dataset
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Outliers

Types of “Unusual” Points in SLR

  • Outlier: a data point that is far from the regression line
  • Influential point: a data point that has a large effect on the regression fit
  • How do we measure “far”?
  • How do we measure “effect on the fit”?

Detecting Unusual Cases: Overview

  1. Compute residuals
    • “raw”, standardized, studentized
  2. Plots of residuals
    • boxplot, scatterplot, normal plot
  3. Leverage
    • unusual values for the predictors

Example: Movie scores

[Logos: Fandango, IMDB, Rotten Tomatoes, Metacritic]

Data prep

  • Rename Rotten Tomatoes columns as critics and audience
  • Rename the dataset as movie_scores
movie_scores <- fandango |>
  rename(critics = rottentomatoes, 
         audience = rottentomatoes_user)

Example: Movie Scores

Code
movie_scores |> 
  gf_point(audience ~ critics) |> 
  gf_lm() |> 
  gf_labs(x = "Critics Score", 
       y = "Audience Score")

Boxplot of Residuals

movie_fit <- lm(audience ~ critics, data = movie_scores)
movie_fit_aug <- augment(movie_fit)

gf_boxplot(.resid ~ "", data = movie_fit_aug, 
           fill = "salmon", ylab = "Residuals", xlab = "")

  • Dots (outliers) indicate data points more than 1.5 IQRs above the third quartile (or below the first quartile)
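The 1.5 IQR rule behind the boxplot dots can be sketched by hand. This is a minimal illustration in base R on simulated data, not the movie data: the names `x`, `y`, `fit`, `fence_lo`, and `fence_hi` are hypothetical, and one unusual response is planted on purpose so there is something to flag.

```r
# Hypothetical simulated data (NOT the movie scores)
set.seed(1)
x <- runif(50, 0, 100)
y <- 30 + 0.5 * x + rnorm(50, sd = 10)
y[1] <- 30 + 0.5 * x[1] + 60   # plant one response far above the trend

fit <- lm(y ~ x)
res <- resid(fit)

# Boxplot fences: 1.5 IQRs beyond the first and third quartiles
q <- quantile(res, c(0.25, 0.75))
fence_lo <- unname(q[1] - 1.5 * IQR(res))
fence_hi <- unname(q[2] + 1.5 * IQR(res))

# Indices of the residuals that a boxplot would draw as dots
flagged <- which(res < fence_lo | res > fence_hi)
flagged
```

The planted case should appear among the flagged indices. Any other flagged cases are simply residuals that happened to land beyond the fences.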

Standardized Residuals

  • Recall: Z-scores
  • Fact: If \(X\) has mean \(\mu\) and standard deviation \(\sigma\), then \((X-\mu)/\sigma\) has mean 0 and standard deviation 1
  • For residuals: mean 0 and standard deviation \(\hat{\sigma}_\epsilon\)
  • Standardized residuals: \(\frac{y_i-\hat{y}_i}{\hat{\sigma}_\epsilon}\)
    • Look for values beyond \(\pm 2\) or \(\pm 3\)
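As a rough check on the formula above, the sketch below computes standardized residuals by hand in base R on simulated data (the names `x`, `y`, `fit`, `z_simple`, and `z_lev` are hypothetical, not from the movie example). One caveat worth knowing: base R's `rstandard()`, and the `.std.resid` column reported by `augment()`, also divide by \(\sqrt{1-h_i}\), where \(h_i\) is the leverage of case \(i\), so they differ slightly from the simple ratio.

```r
# Hypothetical simulated data (NOT the movie scores)
set.seed(2)
x <- runif(40, 0, 100)
y <- 30 + 0.5 * x + rnorm(40, sd = 10)
fit <- lm(y ~ x)

sigma_hat <- sigma(fit)     # residual standard error
h <- hatvalues(fit)         # leverage of each case

# Simple version: residual divided by the residual standard error
z_simple <- resid(fit) / sigma_hat

# Leverage-adjusted version, which is what rstandard() computes
z_lev <- resid(fit) / (sigma_hat * sqrt(1 - h))
all.equal(unname(z_lev), unname(rstandard(fit)))

# Cases beyond +/- 2 under the rule of thumb
which(abs(z_lev) > 2)
```

Because \(\sqrt{1-h_i} < 1\), the adjusted values are never smaller in magnitude than the simple ones. The two versions are close whenever leverage is small.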

Recap: Augment function

movie_fit_aug |> 
  head() |> 
  kable()
audience  critics   .fitted      .resid       .hat    .sigma    .cooksd  .std.resid
      86       74  70.69768   15.302321  0.0081597  12.51615  0.0061774   1.2254688
      80       85  76.40313    3.596866  0.0112688  12.57830  0.0004743   0.2885034
      90       80  73.80975   16.190255  0.0096283  12.50817  0.0081839   1.2975389
      84       18  41.65173   42.348272  0.0207618  12.06226  0.1234982   3.4131653
      28       14  39.57702  -11.577018  0.0234805  12.54373  0.0104964  -0.9343768
      62       63  64.99222   -2.992225  0.0068844  12.57943  0.0001988  -0.2394750

Example: Movie Scores

Code
p1 <- movie_fit_aug |>  # Augmented data
  gf_boxplot("" ~ .std.resid, 
           xlab = "Standardized Residual")

p2 <- movie_fit_aug |>  # Augmented data
  gf_point(.std.resid ~ .fitted, 
           xlab = "Predicted", ylab = "Standardized Residual")

p1 + p2

(Externally) Studentized Residuals

  • Concern: An unusual value may exert great influence on the fit
    • Its residual might be underestimated because the model “moves” a lot to fit it
    • The estimate \(\hat{\sigma}_\epsilon\) may also be inflated because the outlier's own large residual is included in it
  • Studentize: Fit the model without that case, then find new \(\hat{\sigma}_\epsilon\)
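The "fit without that case" idea can be sketched directly. The code below is a minimal illustration in base R on simulated data: `dat`, `loo_sigma`, `sigma_loo`, and `t_manual` are hypothetical names. It refits the model once per case to get the leave-one-out \(\hat{\sigma}_\epsilon\), then divides each original residual by that value times \(\sqrt{1-h_i}\), which is the leverage-adjusted form that `rstudent()` uses.

```r
# Hypothetical simulated data (NOT the movie scores)
set.seed(3)
x <- runif(30, 0, 100)
y <- 30 + 0.5 * x + rnorm(30, sd = 10)
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)
h <- hatvalues(fit)

# Refit without case i and return that fit's residual standard error
loo_sigma <- function(i) sigma(lm(y ~ x, data = dat[-i, ]))
sigma_loo <- vapply(seq_len(nrow(dat)), loo_sigma, numeric(1))

# Externally studentized residual: full-fit residual, leave-one-out sigma
t_manual <- resid(fit) / (sigma_loo * sqrt(1 - h))

# Should agree with the built-in function
all.equal(unname(t_manual), unname(rstudent(fit)))
```

In practice `rstudent()` gets the same numbers without refitting the model for every case. The loop here only makes the definition explicit.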

Example: Movie Scores

movie_fit_aug |>  # Augmented data
  mutate(studentized_residual = rstudent(movie_fit)) |> 
  gf_point(studentized_residual ~ .fitted, 
           xlab = "Predicted", ylab = "Studentized Residual")

What to do with an outlier?

  • Look into it
  • If something is unusual about it and you can make a case that it does not represent the population well, you can throw it out
  • If not, and the value is just unusual, keep it

Influence vs. Leverage

  • High Influence Point: point that DOES impact the regression line
  • High Leverage Point: point with “potential” to impact regression line because \(X\)-value is unusual
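One way to see the difference is to add the same unusual \(X\)-value to a dataset twice, once on the trend and once far off it. The sketch below uses base R's `hatvalues()` for leverage and `cooks.distance()` as one common measure of influence. The data and the names `base`, `on_line`, `off_line`, `fit_a`, and `fit_b` are made up for illustration.

```r
# Hypothetical data: 20 points scattered around a straight line
base <- data.frame(x = 1:20, y = 2 + 3 * (1:20) + rep(c(-1, 1), 10))
base_fit <- lm(y ~ x, data = base)

# Extra point with an unusual x, placed exactly on the line fitted to the others
y_on <- as.numeric(predict(base_fit, newdata = data.frame(x = 60)))
on_line <- data.frame(x = 60, y = y_on)

# Extra point with the same unusual x, placed far off that line
off_line <- data.frame(x = 60, y = 20)

fit_a <- lm(y ~ x, data = rbind(base, on_line))
fit_b <- lm(y ~ x, data = rbind(base, off_line))

# Leverage depends only on x, so it is the same in both fits
lev_a <- unname(hatvalues(fit_a)[21])
lev_b <- unname(hatvalues(fit_b)[21])

# Cook's distance depends on both leverage and residual
cook_a <- unname(cooks.distance(fit_a)[21])
cook_b <- unname(cooks.distance(fit_b)[21])

c(leverage = lev_a, cooks_on_line = cook_a, cooks_off_line = cook_b)
coef(base_fit); coef(fit_a); coef(fit_b)
```

The on-line point has high leverage but leaves the coefficients essentially unchanged. The off-line point has the same leverage and drags the fitted line toward itself.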

High Leverage, Low Influence

High Leverage, High Influence

Low Leverage, Low Influence

Low Leverage, High Influence

Application exercise

📋 AE-09: Outliers