Model comparison

Prof. Eric Friedlander

Oct 23, 2024

Announcements

  • Project: EDA Due Wednesday, October 30th
  • Oral R Quiz (time to start scheduling it)

📋 AE 14 - Model Comparison

  • Open up AE 14 and complete Exercises 0-2

Topics

  • ANOVA for multiple linear regression and sum of squares
  • Comparing models with \(R^2\)

Computational setup

# load packages
library(tidyverse)
library(broom)
library(yardstick)
library(ggformula)
library(supernova)
library(tidymodels)
library(patchwork)
library(knitr)
library(janitor)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

Data: Restaurant tips

Which variables help us predict the amount customers tip at a restaurant?

# A tibble: 169 × 4
     Tip Party Meal   Age   
   <dbl> <dbl> <chr>  <chr> 
 1  2.99     1 Dinner Yadult
 2  2        1 Dinner Yadult
 3  5        1 Dinner SenCit
 4  4        3 Dinner Middle
 5 10.3      2 Dinner SenCit
 6  4.85     2 Dinner Middle
 7  5        4 Dinner Yadult
 8  4        3 Dinner Middle
 9  5        2 Dinner Middle
10  1.58     1 Dinner SenCit
# ℹ 159 more rows

Variables

Predictors:

  • Party: Number of people in the party
  • Meal: Time of day (Lunch, Dinner, Late Night)
  • Age: Age category of person paying the bill (Yadult, Middle, SenCit)

Outcome: Tip: Amount of tip

Outcome: Tip

Predictors

Relevel categorical predictors

tips <- tips |>
  mutate(
    Meal = fct_relevel(Meal, "Lunch", "Dinner", "Late Night"),
    Age  = fct_relevel(Age, "Yadult", "Middle", "SenCit")
  )

Predictors, again

Outcome vs. predictors

Fit and summarize model

term         estimate  std.error  statistic  p.value
(Intercept)    -0.170      0.366     -0.465    0.643
Party           1.837      0.124     14.758    0.000
AgeMiddle       1.009      0.408      2.475    0.014
AgeSenCit       1.388      0.485      2.862    0.005
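
The call that produced this table is not shown on the slide. A minimal sketch of the kind of fit that gives these term names, assuming the model is Tip ~ Party + Age (inferred from the terms in the table). It uses only the ten rows printed on the data slide as a stand-in (the name tips_head is hypothetical, and the printed values are rounded), so the estimates will not match the table:

```r
# Stand-in data: the ten rows printed on the "Data: Restaurant tips" slide,
# NOT the full 169-row tips data. Values are as displayed (rounded).
tips_head <- data.frame(
  Tip   = c(2.99, 2, 5, 4, 10.3, 4.85, 5, 4, 5, 1.58),
  Party = c(1, 1, 1, 3, 2, 2, 4, 3, 2, 1),
  Age   = factor(
    c("Yadult", "Yadult", "SenCit", "Middle", "SenCit",
      "Middle", "Yadult", "Middle", "Middle", "SenCit"),
    levels = c("Yadult", "Middle", "SenCit")  # Yadult is the baseline level
  )
)

# Assumed model formula: Tip explained by Party and Age
tip_fit_head <- lm(Tip ~ Party + Age, data = tips_head)

# Coefficient table: estimate, standard error, t statistic, p-value
coef(summary(tip_fit_head))
```

With the full data, the same formula with data = tips would give the tip_fit object used on the following slides.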


Is this model good?

Another model summary

anova(tip_fit) |>
  tidy() |>
  kable(digits = 2)

term        df    sumsq   meansq  statistic  p.value
Party        1  1188.64  1188.64     285.71     0.00
Age          2    38.03    19.01       4.57     0.01
Residuals  165   686.44     4.16         NA       NA

Analysis of variance (ANOVA)

ANOVA

  • Main Idea: Decompose the total variation of the outcome into:
    • the variation that can be explained by each of the variables in the model
    • the variation that can't be explained by the model (left in the residuals)
  • \(SS_{Total}\): Total sum of squares, variability of outcome, \(\sum_{i = 1}^n (y_i - \bar{y})^2\)
  • \(SS_{Error}\): Residual sum of squares, variability of residuals, \(\sum_{i = 1}^n (y_i - \hat{y}_i)^2\)
  • \(SS_{Model} = SS_{Total} - SS_{Error}\): Variability explained by the model, \(\sum_{i = 1}^n (\hat{y}_i - \bar{y})^2\)
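
The three sums of squares can be computed directly from their definitions. A minimal sketch, using the built-in mtcars data as a stand-in (the predictors wt and cyl are chosen only for illustration and are not part of the tips example):

```r
# Stand-in example: built-in mtcars data, not the tips data from the slides
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

y     <- mtcars$mpg   # observed outcome
y_hat <- fitted(fit)  # predicted outcome
y_bar <- mean(y)      # mean of the outcome

ss_total <- sum((y - y_bar)^2)      # variability of the outcome
ss_error <- sum((y - y_hat)^2)      # variability left in the residuals
ss_model <- sum((y_hat - y_bar)^2)  # variability explained by the model

# For a least-squares fit with an intercept, the pieces add up
all.equal(ss_total, ss_model + ss_error)
```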

Complete Exercise 3.

ANOVA output in R

term        df       sumsq       meansq   statistic   p.value
Party        1  1188.63588  1188.635880  285.711511  0.000000
Age          2    38.02783    19.013916    4.570361  0.011699
Residuals  165   686.44389     4.160266          NA        NA

ANOVA output, with totals

term        df    sumsq   meansq  statistic  p.value
Party        1  1188.64  1188.64     285.71        0
Age          2    38.03    19.01       4.57     0.01
Residuals  165   686.44     4.16
Total      168  1913.11
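
The Total row is not part of the anova() output; it is the column sum of the df and sumsq entries. One possible way to append it, sketched in base R on the built-in mtcars data as a stand-in for the tips model:

```r
# Stand-in example: built-in mtcars data, not the tips data from the slides
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Keep only the degrees of freedom and sums of squares
aov_tab <- as.data.frame(anova(fit))[, c("Df", "Sum Sq")]

# Total row: sum the df and the sums of squares over all rows
total <- data.frame(
  Df       = sum(aov_tab$Df),
  `Sum Sq` = sum(aov_tab$`Sum Sq`),
  check.names = FALSE  # keep the column name "Sum Sq" unchanged
)
rownames(total) <- "Total"

aov_tab <- rbind(aov_tab, total)
aov_tab
```

The total degrees of freedom equal n - 1, which is 168 for the 169 observations in the tips data.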

Sum of squares

term        df    sumsq
Party        1  1188.64
Age          2    38.03
Residuals  165   686.44
Total      168  1913.11

  • \(SS_{Total}\): Total sum of squares, variability of outcome, \(\sum_{i = 1}^n (y_i - \bar{y})^2\)
  • \(SS_{Error}\): Residual sum of squares, variability of residuals, \(\sum_{i = 1}^n (y_i - \hat{y}_i)^2\)
  • \(SS_{Model} = SS_{Total} - SS_{Error}\): Variability explained by the model, \(\sum_{i = 1}^n (\hat{y}_i - \bar{y})^2\)

Sum of squares: \(SS_{Total}\)

term        df    sumsq
Party        1  1188.64
Age          2    38.03
Residuals  165   686.44
Total      168  1913.11


\(SS_{Total}\): Total sum of squares, variability of outcome


\(\sum_{i = 1}^n (y_i - \bar{y})^2\) = 1913.11

Sum of squares: \(SS_{Error}\)

term        df    sumsq
Party        1  1188.64
Age          2    38.03
Residuals  165   686.44
Total      168  1913.11


\(SS_{Error}\): Residual sum of squares, variability of residuals


\(\sum_{i = 1}^n (y_i - \hat{y}_i)^2\) = 686.44

Sum of squares: \(SS_{Model}\)

term        df    sumsq
Party        1  1188.64
Age          2    38.03
Residuals  165   686.44
Total      168  1913.11


\(SS_{Model}\): Variability explained by the model


\(\sum_{i = 1}^n (\hat{y}_i - \bar{y})^2 = SS_{Model} = SS_{Total} - SS_{Error} =\) 1226.67
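
As a quick check of the arithmetic, the value 1226.67 can be recovered from the table in two ways: by subtraction, or by adding the Party and Age rows. The numbers below are copied from the table; the variable names are only for illustration.

```r
# Sums of squares copied from the ANOVA table above
ss_party <- 1188.64
ss_age   <- 38.03
ss_error <- 686.44
ss_total <- 1913.11

# Route 1: total minus residual
ss_model <- ss_total - ss_error
ss_model

# Route 2: add the rows for the individual predictors
ss_party + ss_age
```

Both routes give 1226.67.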

F-Test: Testing the whole model at once

Hypotheses:

\(H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\) vs. \(H_A:\) at least one \(\beta_i \neq 0\)

Test statistic: F-statistic

\[ F = \frac{MS_{Model}}{MS_{Error}} = \frac{SS_{Model}/k}{SS_{Error}/(n-k-1)} \]

p-value: Probability, assuming \(H_0\) is true, of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed

\[ \text{p-value} = P(F > \text{test statistic}), \]

calculated from an \(F\) distribution with \(k\) and \(n - k - 1\) degrees of freedom.
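
A worked sketch of this test for the tips model, using the sums of squares from the ANOVA table. It assumes \(k = 3\) (the three slope terms Party, AgeMiddle, AgeSenCit) and \(n = 169\), which matches the 165 residual degrees of freedom in the table.

```r
# Values taken from the ANOVA table for the tips model
n        <- 169      # number of observations
k        <- 3        # number of slope coefficients (assumed from the term list)
ss_model <- 1226.67
ss_error <- 686.44

ms_model <- ss_model / k            # mean square for the model
ms_error <- ss_error / (n - k - 1)  # mean square error

f_stat <- ms_model / ms_error
f_stat

# Upper-tail probability from an F distribution with k and n - k - 1 df
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
p_value
```

The F statistic comes out at roughly 98.3, with a p-value very close to zero. This overall F is a different quantity from the per-term F statistics reported in the rows of the ANOVA table.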

F-test in R

  • Use the glance() function from the broom package
    • statistic: F-statistic
    • p.value: p-value from F-test
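
A minimal sketch of reading the overall F-test from glance(), again on the built-in mtcars data as a stand-in for the tips model. The comparison with summary() is included only as a sanity check.

```r
library(broom)

# Stand-in example: built-in mtcars data, not the tips data from the slides
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# One-row summary of the model; statistic and p.value are the overall F-test
overall <- glance(fit)
overall[, c("statistic", "p.value")]

# The same F statistic as reported by base R
summary(fit)$fstatistic
```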

R-squared, \(R^2\)

Recall: \(R^2\) is the proportion of the variation in the response variable explained by the regression model.

\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}} \]
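
Both forms of the formula can be checked against the sums of squares from the tips ANOVA table. The numbers are copied from the table; the variable names are only for illustration.

```r
# Sums of squares from the ANOVA table for the tips model
ss_total <- 1913.11
ss_error <- 686.44
ss_model <- ss_total - ss_error

r_sq     <- ss_model / ss_total      # proportion of variation explained
r_sq_alt <- 1 - ss_error / ss_total  # equivalent form

c(r_sq, r_sq_alt)
```

Both expressions give approximately 0.641, so the model with Party and Age explains about 64% of the variation in Tip.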

Complete Exercises 4-7.

Recap

  • ANOVA for multiple linear regression and sum of squares
  • \(R^2\) for multiple linear regression