Model Selection

Prof. Eric Friedlander

Oct 28, 2024

Announcements

  • Project: EDA Due Wednesday
  • Oral R Quiz

📋 AE 16 - Model Selection

  • Open up AE 16 and Complete Exercise 0

Topics

  • ANOVA, revisited
  • Model Selection

Computational setup

# load packages
library(tidyverse)
library(broom)
library(yardstick)
library(ggformula)
library(GGally)
library(tidymodels)
library(patchwork)
library(knitr)
library(janitor)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

Data: Restaurant tips

Which variables help us predict the amount customers tip at a restaurant?

Day Meal Payment Party Age GiftCard Comps Alcohol Bday Bill Tip
Saturday Dinner Credit 1 Yadult No No No No 17.01 2.99
Saturday Dinner Credit 1 Yadult No No No No 14.23 2.00
Tuesday Dinner Credit 1 SenCit No No No No 20.97 5.00
Tuesday Dinner Credit 3 Middle No No No No 20.87 4.00
Tuesday Dinner Cash 2 SenCit No No No No 34.66 10.34
Tuesday Dinner Credit 2 Middle No No No No 25.15 4.85

Variables

Predictors:

  • Party: Number of people in the party
  • Meal: Time of day (Lunch, Dinner, Late Night)
  • Age: Age category of person paying the bill (Yadult, Middle, SenCit)
  • Day: Day of the week
  • Payment: The type of payment used
  • GiftCard: Whether a gift card was used
  • Comps: Whether any food was comped
  • Alcohol: Whether any alcohol was ordered
  • Bday: Whether it was someone's birthday
  • Bill: The size of the bill

Outcome: Tip: Amount of tip

Comparing sets of predictors

Nested Models

  • We say one model is nested inside another model if all of its TERMS are present in the other model
  • Consider three different models:
    • Model 1: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\)
    • Model 2: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon\)
    • Model 3: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon\)
  • Model 2 is nested inside both Model 1 and Model 3.
  • Why isn’t Model 3 nested in Model 1?
  • Smaller model is called the Reduced Model
  • Larger model is called the Full Model (be careful, this term depends on context)
  • Complete Exercises 1-2.

Recall: ANOVA, F-Test

Hypotheses:

\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) vs. \(H_A:\) at least one \(\beta_i \neq 0\)

Test statistic: F-statistic

\[ F = \frac{MSModel}{MSE} = \frac{SSModel/p}{SSE/(n-p-1)} \]

p-value: Probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed, assuming the null hypothesis is true

\[ \text{p-value} = P(F > \text{test statistic}), \]

calculated from an \(F\) distribution with \(p\) and \(n - p - 1\) degrees of freedom.
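The p-value calculation can be reproduced from a fitted model with pf(). A minimal sketch on the built-in mtcars data (not the tips data used later in these slides):

```r
# overall F-test for a multiple regression on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summary() stores the F-statistic and its two degrees of freedom
fs <- summary(fit)$fstatistic

# p-value = P(F > test statistic) on p and n - p - 1 df
p_val <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
```

This is the same p-value printed in the last line of summary(fit).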

Nested F-Test

Suppose the reduced (nested) model has \(k\) predictors and the full model has \(p\) predictors, so \(\beta_{k+1},\ldots, \beta_{p}\) are the coefficients on the added terms.

Hypotheses:

\(H_0: \beta_{k+1} = \beta_{k+2} = \cdots = \beta_p = 0\) vs. \(H_A:\) at least one \(\beta_i \neq 0\) for some \(i > k\)

Test statistic: F-statistic

\[ F = \frac{(SSModel_{full} - SSModel_{reduced})/(p-k)}{SSE_{full}/(n-p-1)} \]

p-value: Probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed, assuming the null hypothesis is true

\[ \text{p-value} = P(F > \text{test statistic}), \]

calculated from an \(F\) distribution with \(p-k\) (the number of predictors being tested) and \(n - p - 1\) degrees of freedom.

Note: Same as regular F-test if reduced model is just \(Y= \beta_0\).

Nested F-Test in R

tip_fit_1 <- lm(Tip ~ Party + Age + Meal, data = tips)
tip_fit_2 <- lm(Tip ~ Party + Age + Meal + Day, data = tips)

anova(tip_fit_1, tip_fit_2) |> # Enter reduced model first
  tidy() |> 
  kable()
| term                           | df.residual | rss      | df | sumsq    | statistic | p.value  |
|--------------------------------|-------------|----------|----|----------|-----------|----------|
| Tip ~ Party + Age + Meal       | 163         | 622.9793 | NA | NA       | NA        | NA       |
| Tip ~ Party + Age + Meal + Day | 158         | 607.3815 | 5  | 15.59778 | 0.8114993 | 0.543086 |

\[ F = \frac{(SSModel_{full} - SSModel_{reduced})/(p-k)}{SSE_{full}/(n-p-1)} = \frac{15.59778/5}{607.3815/158} = 0.8114993 \]
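The arithmetic can be checked directly in R, using the sums of squares from the anova output above:

```r
# nested F-statistic by hand: (change in SSModel / new terms) over MSE_full
f_stat <- (15.59778 / 5) / (607.3815 / 158)

# p-value from the F distribution with p - k = 5 and n - p - 1 = 158 df
p_val <- pf(f_stat, df1 = 5, df2 = 158, lower.tail = FALSE)
```

Both values match the statistic and p.value columns reported by anova().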

Let’s interpret this together.

Complete Exercise 3.

Model Selection

Model Selection

  • So far: We’ve come up with a variety of metrics and tests which help us compare different models
  • How do we choose the models to compare in the first place?
  • Today: Best subset, forward selection, and backward selection

AIC, BIC, Mallows’ \(C_p\)

Estimators of prediction error and relative quality of models:

Akaike’s Information Criterion (AIC): \[AIC = n\log(SS_\text{Error}) - n \log(n) + 2(p+1)\]

Schwarz’s Bayesian Information Criterion (BIC): \[BIC = n\log(SS_\text{Error}) - n\log(n) + \log(n)\times(p+1)\]

Mallows’ \(C_p\): \[C_p = \frac{SSE_{p}}{MSE_{full model}} - n + 2(p+1)\]
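The AIC and BIC formulas can be evaluated directly. A sketch with made-up values of \(n\), \(p\), and \(SS_\text{Error}\); note these slide formulas drop constants, so they differ from R's AIC() and BIC() functions by an additive constant (which doesn't matter when comparing models fit to the same data):

```r
# AIC/BIC from the slide formulas, with illustrative (made-up) values
n <- 100   # sample size
p <- 3     # number of predictors
sse <- 250 # sum of squared errors

aic <- n * log(sse) - n * log(n) + 2 * (p + 1)
bic <- n * log(sse) - n * log(n) + log(n) * (p + 1)
```

Since \(\log(100) \approx 4.6 > 2\), BIC penalizes each extra parameter more heavily than AIC here, which is why BIC tends to pick smaller models.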

Best Subset Selection

  • Computers are great now!
  • Frequently feasible to try out EVERY combination of predictors if the total number of candidate predictors is not too large.

Best Subset Selection in R

library(leaps)

regsubsets(Tip ~ ., data = tips) |> 
  tidy() |> 
  kable()
Each row is the “best” model of a given size; the TRUE/FALSE indicator columns are collapsed below to just the predictors included:

| Size | Predictors included                                                              | r.squared | adj.r.squared | BIC       | mallows_cp |
|------|----------------------------------------------------------------------------------|-----------|---------------|-----------|------------|
| 1    | Bill                                                                             | 0.7661946 | 0.7647946     | -235.3422 | 7.3539038  |
| 2    | BdayYes, Bill                                                                    | 0.7742958 | 0.7715765     | -236.1719 | 3.3819803  |
| 3    | Party, BdayYes, Bill                                                             | 0.7803064 | 0.7763120     | -235.6035 | 0.9511378  |
| 4    | DaySunday, Party, BdayYes, Bill                                                  | 0.7839141 | 0.7786437     | -233.2719 | 0.2916884  |
| 5    | DaySunday, Party, AgeSenCit, BdayYes, Bill                                       | 0.7883724 | 0.7818808     | -231.6653 | -0.9948739 |
| 6    | DaySunday, PaymentCredit/CashTip, Party, AgeSenCit, BdayYes, Bill                | 0.7906664 | 0.7829133     | -228.3773 | -0.6859082 |
| 7    | DaySaturday, DayTuesday, PaymentCredit/CashTip, Party, AgeSenCit, BdayYes, Bill  | 0.7922172 | 0.7831831     | -224.5040 | 0.1709197  |
| 8    | DaySaturday, DaySunday, DayTuesday, PaymentCredit/CashTip, Party, AgeSenCit, BdayYes, Bill | 0.7931116 | 0.7827672 | -220.1032 | 1.5115594  |

Shows you “best” model for every model size.

Backward Elimination

Different model selection technique:

  1. Start by fitting the full model (the model that includes all terms under consideration).
  2. Identify the term with the largest p-value.
    1. If p-value is large (say, greater than 5%), eliminate that term to produce a smaller model. Fit that model and return to the start of Step 2.
    2. If p-value is small (less than 5%), stop since all of the predictors in the model are “significant.”

Note: this can be altered to work with other criteria (e.g. AIC) instead of p-values. This is actually what regsubsets does.
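One round of the p-value version can be sketched with drop1(), shown here on the built-in mtcars data (not the tips data):

```r
# backward elimination, one step: fit the full model under consideration
fit <- lm(mpg ~ wt + hp + cyl + drat, data = mtcars)

# F-tests for dropping each term; the largest p-value is the candidate to remove
drop1(fit, test = "F")
```

Refit without the worst term and repeat until every remaining p-value is below the cutoff.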

Forward Selection

  1. Start with a model with no predictors and add the best single predictor (the predictor with the largest correlation with the response gives the biggest initial \(R^2\)).
  2. Find the unused predictor that would produce the most benefit (biggest increase in \(R^2\)) when added to the existing model, and check its individual p-value:
    1. If the p-value is small (say, less than 5%), add the predictor, refit, and return to Step 2.
    2. If the p-value is large (over 5%), stop and discard this predictor. At this point, no unused predictor would be significant when added to the model and we are done.
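Base R's step() automates this kind of search, using AIC rather than the p-value rule above as its criterion. A sketch on built-in data:

```r
# forward selection by AIC (note: step() uses AIC, not p-values)
null_fit <- lm(mpg ~ 1, data = mtcars)

# scope gives the largest model to consider; trace = 0 suppresses the log
fwd_fit <- step(null_fit, scope = mpg ~ wt + hp + cyl + drat,
                direction = "forward", trace = 0)
```

Print fwd_fit (or summary(fwd_fit)) to see which predictors were added.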

Stepwise Selection

Forward, stepwise selection

  1. Start with a model with no predictors and add the best single predictor (the predictor with the largest correlation with the response gives the biggest initial \(R^2\)).
  2. Find the unused predictor that would produce the most benefit (biggest increase in \(R^2\)) when added to the existing model, and check its individual p-value:
    1. If the p-value is small (say, less than 5%), add the predictor, run backward elimination on the updated model, and return to Step 2.
    2. If the p-value is large (over 5%), stop and discard this predictor. At this point, no unused predictor would be significant when added to the model and we are done.
  • Why? Variables that were significant early on can become insignificant after other new variables are added to the model.

Backward, stepwise selection is the same, except you perform forward selection every time you delete a term from the model.
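step() with direction = "both" implements a stepwise search (again by AIC rather than p-values), allowing terms to be dropped as well as added at each step. Sketched on built-in data:

```r
# stepwise selection (both directions) by AIC, starting from the full model
full_fit <- lm(mpg ~ ., data = mtcars)
both_fit <- step(full_fit, direction = "both", trace = 0)
```

The result is an ordinary lm object, so summary(), anova(), etc. all apply.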

CAUTION

  • These automated methods have fallen out of favor in recent years, but you can still use them and should know what they are.
  • Automated methods ARE NOT a replacement for subject matter expertise
  • Think of the models that come out of these procedures as suggestions
  • The order in which variables are added to a model can help us understand which variables are more important and which are redundant.

Complete Exercise 5.

Recap

  • Comparing subsets of models using nested F-Test
  • Choosing models using:
    • Exhaustive search
    • Forward/Backward/Stepwise selection