Model comparison continued

Prof. Eric Friedlander

Oct 25, 2024

Announcements

  • Project: EDA Due Wednesday, October 30th
  • Oral R Quiz

📋 AE 15 - Model Comparison 2

  • Open up AE 15

Topics

  • Comparing models with \(R^2\) vs. \(R^2_{adj}\)
  • Comparing models with AIC and BIC
  • Occam’s razor and parsimony

Computational setup

# load packages
library(tidyverse)
library(broom)
library(yardstick)
library(ggformula)
library(supernova)
library(tidymodels)
library(patchwork)
library(knitr)
library(janitor)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

Data: Restaurant tips

Which variables help us predict the amount customers tip at a restaurant?

# A tibble: 169 × 4
     Tip Party Meal   Age   
   <dbl> <dbl> <chr>  <chr> 
 1  2.99     1 Dinner Yadult
 2  2        1 Dinner Yadult
 3  5        1 Dinner SenCit
 4  4        3 Dinner Middle
 5 10.3      2 Dinner SenCit
 6  4.85     2 Dinner Middle
 7  5        4 Dinner Yadult
 8  4        3 Dinner Middle
 9  5        2 Dinner Middle
10  1.58     1 Dinner SenCit
# ℹ 159 more rows

Variables

Predictors:

  • Party: Number of people in the party
  • Meal: Time of day (Lunch, Dinner, Late Night)
  • Age: Age category of person paying the bill (Yadult, Middle, SenCit)

Outcome: Tip: Amount of tip

Outcome: Tip

Predictors

Outcome vs. predictors

Fit and summarize model

term         estimate  std.error  statistic  p.value
(Intercept)     0.838      0.397      2.112    0.036
Party           1.837      0.124     14.758    0.000
AgeSenCit       0.379      0.410      0.925    0.356
AgeYadult      -1.009      0.408     -2.475    0.014
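
A sketch of the code behind this table (the data frame name tips is an assumption; it is not shown on these slides):

# fit a linear model predicting tip amount from party size and age category
tip_fit <- lm(Tip ~ Party + Age, data = tips)

# tidy coefficient summary, rounded to three decimal places
tidy(tip_fit) |>
  kable(digits = 3)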


Is this model good?

Model comparison

R-squared (\(R^2\)) and Overfitting

  • \(R^2\) will always increase as we add more variables to the model
    • If we add enough variables, we can usually achieve \(R^2=100\%\)
    • Eventually our model will begin to fit the noise in our data and become worse at predicting new data… this is called overfitting
  • If we only use \(R^2\) to choose a model, we will be prone to choosing the model with the most predictor variables (illustrated below)
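
A quick illustration, as a sketch: junk1 and junk2 are made-up pure-noise columns appended to the (assumed) tips data.

# R^2 never decreases when predictors are added, even pure noise
set.seed(42)
tips_noise <- tips |>
  mutate(junk1 = rnorm(n()), junk2 = rnorm(n()))

summary(lm(Tip ~ Party + Age, data = tips_noise))$r.squared
summary(lm(Tip ~ Party + Age + junk1 + junk2, data = tips_noise))$r.squared  # slightly higher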

Adjusted \(R^2\)

  • Adjusted \(R^2\): measure that includes a penalty for unnecessary predictor variables
  • Similar to \(R^2\), it is a measure of the amount of variation in the response that is explained by the regression model
  • Differs from \(R^2\) in that it uses mean squares (sum of squares divided by degrees of freedom) rather than sums of squares, thereby adjusting for the number of predictor variables

\(R^2\) and Adjusted \(R^2\)

\[R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}}\]


\[R^2_{adj} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)}\]

where

  • \(n\) is the number of observations used to fit the model

  • \(p\) is the number of terms (not including the intercept) in the model
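
Both quantities can be computed directly from these formulas. A sketch, reusing tip_fit and the assumed tips tibble from earlier:

# sums of squares for the fitted model
ss_error <- sum(residuals(tip_fit)^2)
ss_total <- sum((tips$Tip - mean(tips$Tip))^2)

n <- nobs(tip_fit)               # number of observations
p <- length(coef(tip_fit)) - 1   # number of terms, excluding the intercept

1 - ss_error / ss_total                              # R^2
1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))  # adjusted R^2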

Using \(R^2\) and Adjusted \(R^2\)

  • \(R^2_{adj}\) can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!
  • Use \(R^2\) when describing the relationship between the response and predictor variables

Complete Exercises 1-3.

Comparing models with \(R^2_{adj}\)

tip_fit_1:

r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.6743626  0.6643738      1.954983  67.51136   0        5   -350.0405  714.0811  735.9904  622.9793  163          169

tip_fit_2:

r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.6825157  0.6624218      1.96066   33.96625   0        10  -347.898   719.7959  757.3547  607.3815  158          169

  1. Which model would we choose based on \(R^2\)?
  2. Which model would we choose based on Adjusted \(R^2\)?
  3. Which statistic should we use to choose the final model - \(R^2\) or Adjusted \(R^2\)? Why?
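
These one-row summaries come from glance() in broom, e.g.:

# tip_fit_1 and tip_fit_2 are the two fitted models from AE 15
glance(tip_fit_1) |> select(r.squared, adj.r.squared)
glance(tip_fit_2) |> select(r.squared, adj.r.squared)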

AIC & BIC

Estimators of prediction error and relative quality of models:

Akaike’s Information Criterion (AIC): \[AIC = n\log(SS_\text{Error}) - n \log(n) + 2(p+1)\]

Schwarz’s Bayesian Information Criterion (BIC): \[BIC = n\log(SS_\text{Error}) - n\log(n) + \log(n)\times(p+1)\]
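
Both criteria can be computed by hand from a fitted model. A sketch reusing tip_fit from earlier; note that R's built-in AIC() and BIC() include extra additive constants that are identical for every model fit to the same data, so model rankings agree:

ss_error <- sum(residuals(tip_fit)^2)
n <- nobs(tip_fit)
p <- length(coef(tip_fit)) - 1

n * log(ss_error) - n * log(n) + 2 * (p + 1)       # AIC, up to a constant
n * log(ss_error) - n * log(n) + log(n) * (p + 1)  # BIC, up to a constant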

AIC & BIC

\[ \begin{aligned} & AIC = \color{blue}{n\log(SS_\text{Error})} - n \log(n) + 2(p+1) \\ & BIC = \color{blue}{n\log(SS_\text{Error})} - n\log(n) + \log(n)\times(p+1) \end{aligned} \]


First Term: Decreases as \(p\) increases… why?

AIC & BIC

\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - \color{blue}{n \log(n)} + 2(p+1) \\ & BIC = n\log(SS_\text{Error}) - \color{blue}{n\log(n)} + \log(n)\times(p+1) \end{aligned} \]


Second Term: Fixed for a given sample size \(n\)

AIC & BIC

\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - n\log(n) + \color{blue}{2(p+1)} \\ & BIC = n\log(SS_\text{Error}) - n\log(n) + \color{blue}{\log(n)\times(p+1)} \end{aligned} \]


Third Term: Increases as \(p\) increases

Using AIC & BIC

\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - n \log(n) + \color{red}{2(p+1)} \\ & BIC = n\log(SS_\text{Error}) - n\log(n) + \color{red}{\log(n)\times(p+1)} \end{aligned} \]

  • Choose model with the smaller value of AIC or BIC

  • If \(n \geq 8\), the penalty for BIC is larger than that of AIC, so BIC tends to favor more parsimonious models (i.e. models with fewer terms)
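
In R, both criteria are reported by glance(), e.g. for the AE 15 models:

glance(tip_fit_1) |> select(AIC, BIC)
glance(tip_fit_2) |> select(AIC, BIC)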

Complete Exercise 4.

Comparing models with AIC and BIC

tip_fit_1

AIC      BIC
714.0811 735.9904

tip_fit_2

AIC      BIC
719.7959 757.3547

  1. Which model would we choose based on AIC?

  2. Which model would we choose based on BIC?

Commonalities between criteria

  • \(R^2_{adj}\), AIC, and BIC all apply a penalty for more predictors
  • The penalty for added model complexity attempts to strike a balance between underfitting (too few predictors in the model) and overfitting (too many predictors in the model)
  • Goal: Parsimony

Parsimony and Occam’s razor

  • The principle of parsimony is attributed to William of Occam (early 14th-century English nominalist philosopher), who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation

  • Called Occam’s razor because he “shaved” his explanations down to the bare minimum

  • Parsimony in modeling:

    • models should have as few parameters as possible
    • linear models should be preferred to non-linear models
    • experiments relying on few assumptions should be preferred to those relying on many
    • models should be pared down until they are minimal adequate (i.e. contain the minimum number of predictors required to meet some criterion)
    • simple explanations should be preferred to complex explanations

In pursuit of Occam’s razor

  • Occam’s razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected

  • Model selection follows this principle

  • We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model

  • In other words, we prefer the simplest best model, i.e. the parsimonious model

Alternate views

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.


Radford Neal - Bayesian Learning for Neural Networks

Other concerns with our approach

  • All of the criteria we have considered for model comparison require making predictions for our data and then use the prediction error (\(SS_{Error}\)) somewhere in the formula
  • But we’re making predictions for the very data we used to build the model (i.e., to estimate the coefficients), which can lead to overfitting (a preview of one remedy is sketched below)
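
One remedy, previewed here as a sketch, is to hold out data that play no role in fitting and measure prediction error there. This uses initial_split()/training()/testing() from rsample and rmse_vec() from yardstick (both loaded above); the tips name remains an assumption:

# hold out a test set before fitting
set.seed(1234)
tip_split <- initial_split(tips)   # default: 75% train, 25% test
tip_train <- training(tip_split)
tip_test  <- testing(tip_split)

fit_train <- lm(Tip ~ Party + Age, data = tip_train)

# prediction error on data the model never saw
rmse_vec(truth = tip_test$Tip,
         estimate = predict(fit_train, newdata = tip_test))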

Recap

  • Comparing models with

    • \(R^2\) vs. \(R^2_{adj}\)
    • AIC and BIC
  • Occam’s razor and parsimony

  • Complete Exercise 5.