Model Selection

Prof. Eric Friedlander

Oct 28, 2024

Announcements

  • Project: EDA Due Wednesday
  • Oral R Quiz

📋 AE 16 - Model Selection

  • Open up AE 16 and Complete Exercise 0

Topics

  • ANOVA, revisited
  • Model Selection

Computational setup

# load packages
library(tidyverse)
library(broom)
library(yardstick)
library(ggformula)
library(GGally)
library(tidymodels)
library(patchwork)
library(knitr)
library(janitor)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

Data: Restaurant tips

Which variables help us predict the amount customers tip at a restaurant?

Day Meal Payment Party Age GiftCard Comps Alcohol Bday Bill Tip
Saturday Dinner Credit 1 Yadult No No No No 17.01 2.99
Saturday Dinner Credit 1 Yadult No No No No 14.23 2.00
Tuesday Dinner Credit 1 SenCit No No No No 20.97 5.00
Tuesday Dinner Credit 3 Middle No No No No 20.87 4.00
Tuesday Dinner Cash 2 SenCit No No No No 34.66 10.34
Tuesday Dinner Credit 2 Middle No No No No 25.15 4.85

Variables

Predictors:

  • Party: Number of people in the party
  • Meal: Time of day (Lunch, Dinner, Late Night)
  • Age: Age category of person paying the bill (Yadult, Middle, SenCit)
  • Day: Day of the week
  • Payment: The type of payment used
  • GiftCard: Whether a gift card was used
  • Comps: Whether any food was comped
  • Alcohol: Whether any alcohol was ordered
  • Bday: Whether it was someone's birthday
  • Bill: The size of the bill

Outcome: Tip: Amount of tip

Comparing sets of predictors

Nested Models

  • We say one model is nested inside another model if all of its TERMS are present in the other model
  • Consider three different models:
    • Model 1: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\)
    • Model 2: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon\)
    • Model 3: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon\)
  • Model 2 is nested inside both Model 1 and Model 3.
  • Why isn’t Model 3 nested in Model 1?
  • Smaller model is called the Reduced Model
  • Larger model is called the Full Model (be careful, this term depends on context)
  • Complete Exercises 1-2.

Recall: ANOVA, F-Test

Hypotheses:

\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) vs. \(H_A:\) at least one \(\beta_i \neq 0\)

Test statistic: F-statistic

\[ F = \frac{MSModel}{MSE} = \frac{SSModel/p}{SSE/(n-p-1)} \]

p-value: Probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed, assuming the null hypothesis is true

\[ \text{p-value} = P(F > \text{test statistic}), \]

calculated from an \(F\) distribution with \(p\) and \(n - p - 1\) degrees of freedom.
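The p-value calculation can be reproduced from a fitted model with pf(). A minimal sketch on the built-in mtcars data (not the tips data used later in these slides):

```r
# overall F-test for a multiple regression on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summary() stores the F-statistic and its two degrees of freedom
fs <- summary(fit)$fstatistic

# p-value = P(F > test statistic) on p and n - p - 1 df
p_val <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
```

This is the same p-value printed in the last line of summary(fit).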

Nested F-Test

Suppose the reduced (nested) model has \(k\) predictors and the full model has \(p\) predictors, so \(\beta_{k+1},\ldots, \beta_{p}\) are the coefficients on the added terms.

Hypotheses:

\(H_0: \beta_{k+1} = \beta_{k+2} = \cdots = \beta_p = 0\) vs. \(H_A:\) at least one \(\beta_i \neq 0\) for some \(i > k\)

Test statistic: F-statistic

\[ F = \frac{(SSModel_{full} - SSModel_{reduced})/(p-k)}{SSE_{full}/(n-p-1)} \]

p-value: Probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed, assuming the null hypothesis is true

\[ \text{p-value} = P(F > \text{test statistic}), \]

calculated from an \(F\) distribution with \(p-k\) (the number of predictors being tested) and \(n - p - 1\) degrees of freedom.

Note: Same as regular F-test if reduced model is just \(Y= \beta_0\).

Nested F-Test in R

tip_fit_1 <- lm(Tip ~ Party + Age + Meal, data = tips)
tip_fit_2 <- lm(Tip ~ Party + Age + Meal + Day, data = tips)

anova(tip_fit_1, tip_fit_2) |> # Enter reduced model first
  tidy() |> 
  kable()
| term                           | df.residual | rss      | df | sumsq    | statistic | p.value  |
|--------------------------------|-------------|----------|----|----------|-----------|----------|
| Tip ~ Party + Age + Meal       | 163         | 622.9793 | NA | NA       | NA        | NA       |
| Tip ~ Party + Age + Meal + Day | 158         | 607.3815 | 5  | 15.59778 | 0.8114993 | 0.543086 |

\[ F = \frac{(SSModel_{full} - SSModel_{reduced})/(p-k)}{SSE_{full}/(n-p-1)} = \frac{15.59778/5}{607.3815/158} = 0.8114993 \]
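The arithmetic can be checked directly in R, using the sums of squares from the anova output above:

```r
# nested F-statistic by hand: (change in SSModel / new terms) over MSE_full
f_stat <- (15.59778 / 5) / (607.3815 / 158)

# p-value from the F distribution with p - k = 5 and n - p - 1 = 158 df
p_val <- pf(f_stat, df1 = 5, df2 = 158, lower.tail = FALSE)
```

Both values match the statistic and p.value columns reported by anova().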

Let’s interpret this together.

Complete Exercise 3.

Model Selection

Model Selection

  • So far: We’ve come up with a variety of metrics and tests which help us compare different models
  • How do we choose the models to compare in the first place?
  • Today: Best subset, forward selection, and backward selection

AIC, BIC, Mallows’ \(C_p\)

Estimators of prediction error and relative quality of models:

Akaike’s Information Criterion (AIC): \[AIC = n\log(SS_\text{Error}) - n \log(n) + 2(p+1)\]

Schwarz’s Bayesian Information Criterion (BIC): \[BIC = n\log(SS_\text{Error}) - n\log(n) + \log(n)\times(p+1)\]

Mallows’ \(C_p\): \[C_p = \frac{SSE_{p}}{MSE_{full model}} - n + 2(p+1)\]
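The AIC and BIC formulas can be evaluated directly. A sketch with made-up values of \(n\), \(p\), and \(SS_\text{Error}\); note these slide formulas drop constants, so they differ from R's AIC() and BIC() functions by an additive constant (which doesn't matter when comparing models fit to the same data):

```r
# AIC/BIC from the slide formulas, with illustrative (made-up) values
n <- 100   # sample size
p <- 3     # number of predictors
sse <- 250 # sum of squared errors

aic <- n * log(sse) - n * log(n) + 2 * (p + 1)
bic <- n * log(sse) - n * log(n) + log(n) * (p + 1)
```

Since \(\log(100) \approx 4.6 > 2\), BIC penalizes each extra parameter more heavily than AIC here, which is why BIC tends to pick smaller models.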

Best Subset Selection

  • Computers are great now!
  • Frequently feasible to try out EVERY combination of predictors if the total number of candidate predictors is not too large.

Best Subset Selection in R

library(leaps)

regsubsets(Tip ~ ., data = tips) |> 
  tidy() |> 
  kable()
Each row is the “best” model of a given size; the TRUE/FALSE indicator columns are collapsed below to just the predictors included:

| Size | Predictors included                                                              | r.squared | adj.r.squared | BIC       | mallows_cp |
|------|----------------------------------------------------------------------------------|-----------|---------------|-----------|------------|
| 1    | Bill                                                                             | 0.7661946 | 0.7647946     | -235.3422 | 7.3539038  |
| 2    | BdayYes, Bill                                                                    | 0.7742958 | 0.7715765     | -236.1719 | 3.3819803  |
| 3    | Party, BdayYes, Bill                                                             | 0.7803064 | 0.7763120     | -235.6035 | 0.9511378  |
| 4    | DaySunday, Party, BdayYes, Bill                                                  | 0.7839141 | 0.7786437     | -233.2719 | 0.2916884  |
| 5    | DaySunday, Party, AgeSenCit, BdayYes, Bill                                       | 0.7883724 | 0.7818808     | -231.6653 | -0.9948739 |
| 6    | DaySunday, PaymentCredit/CashTip, Party, AgeSenCit, BdayYes, Bill                | 0.7906664 | 0.7829133     | -228.3773 | -0.6859082 |
| 7    | DaySaturday, DayTuesday, PaymentCredit/CashTip, Party, AgeSenCit, BdayYes, Bill  | 0.7922172 | 0.7831831     | -224.5040 | 0.1709197  |
| 8    | DaySaturday, DaySunday, DayTuesday, PaymentCredit/CashTip, Party, AgeSenCit, BdayYes, Bill | 0.7931116 | 0.7827672 | -220.1032 | 1.5115594  |

Shows you “best” model for every model size.

Backward Elimination

Different model selection technique:

  1. Start by fitting the full model (the model that includes all terms under consideration).
  2. Identify the term with the largest p-value.
    1. If p-value is large (say, greater than 5%), eliminate that term to produce a smaller model. Fit that model and return to the start of Step 2.
    2. If p-value is small (less than 5%), stop since all of the predictors in the model are “significant.”

Note: this can be altered to work with other criteria (e.g. AIC) instead of p-values. This is actually what regsubsets does.
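One round of the p-value version can be sketched with drop1(), shown here on the built-in mtcars data (not the tips data):

```r
# backward elimination, one step: fit the full model under consideration
fit <- lm(mpg ~ wt + hp + cyl + drat, data = mtcars)

# F-tests for dropping each term; the largest p-value is the candidate to remove
drop1(fit, test = "F")
```

Refit without the worst term and repeat until every remaining p-value is below the cutoff.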

Forward Selection

  1. Start with a model with no predictors and add the best single predictor (the predictor with the largest correlation with the response gives the biggest initial \(R^2\)).
  2. Find the unused predictor that would produce the most benefit (biggest increase in \(R^2\)) when added to the existing model, and check its individual p-value:
    1. If the p-value is small (say, less than 5%), add the predictor, refit, and return to Step 2.
    2. If the p-value is large (over 5%), stop and discard this predictor. At this point, no unused predictor would be significant when added to the model and we are done.
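Base R's step() automates this kind of search, using AIC rather than the p-value rule above as its criterion. A sketch on built-in data:

```r
# forward selection by AIC (note: step() uses AIC, not p-values)
null_fit <- lm(mpg ~ 1, data = mtcars)

# scope gives the largest model to consider; trace = 0 suppresses the log
fwd_fit <- step(null_fit, scope = mpg ~ wt + hp + cyl + drat,
                direction = "forward", trace = 0)
```

Print fwd_fit (or summary(fwd_fit)) to see which predictors were added.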

Stepwise Selection

Forward, stepwise selection

  1. Start with a model with no predictors and add the best single predictor (the predictor with the largest correlation with the response gives the biggest initial \(R^2\)).
  2. Find the unused predictor that would produce the most benefit (biggest increase in \(R^2\)) when added to the existing model, and check its individual p-value:
    1. If the p-value is small (say, less than 5%), add the predictor, run backward elimination on the updated model, and return to Step 2.
    2. If the p-value is large (over 5%), stop and discard this predictor. At this point, no unused predictor would be significant when added to the model and we are done.
  • Why? Variables that were significant early on can become insignificant after other new variables are added to the model.

Backward, stepwise selection is the same, except you perform forward selection every time you delete a term from the model.
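step() with direction = "both" implements a stepwise search (again by AIC rather than p-values), allowing terms to be dropped as well as added at each step. Sketched on built-in data:

```r
# stepwise selection (both directions) by AIC, starting from the full model
full_fit <- lm(mpg ~ ., data = mtcars)
both_fit <- step(full_fit, direction = "both", trace = 0)
```

The result is an ordinary lm object, so summary(), anova(), etc. all apply.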

CAUTION

  • These automated methods have fallen out of favor in recent years, but you can still use them and should know what they are.
  • Automated methods ARE NOT a replacement for subject matter expertise
  • Think of the models that come out of these procedures as suggestions
  • The order in which variables are added to a model can help us understand which variables are more important and which are redundant.

Complete Exercise 5.

Recap

  • Comparing subsets of models using nested F-Test
  • Choosing models using:
    • Exhaustive search
    • Forward/Backward/Stepwise selection