AE 16: Model Selection

Tips Data

Author

Driver: ____, Reporter: _____, Gopher: _____

Important
  • Open RStudio and create a subfolder in your AE folder called “AE-16”.

  • Go to Canvas and locate your AE-16 assignment to get started.

  • Upload the ae-16.qmd and tip-data.csv files into the folder you just created. The .qmd and PDF responses are due in Canvas. You can check the due date on the Canvas assignment.

Packages + data

library(tidyverse)
library(broom)
library(yardstick)
library(ggformula)
library(patchwork)
library(knitr)
library(kableExtra)

tips <- read_csv("tip-data.csv") |> 
  drop_na(Party) |> 
  select(-`Tip Percentage`, -`W/Tip`)

What factors are associated with the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant.[1]

The variables we’ll focus on for this analysis are

  • Tip: amount of the tip
  • Party: Number of people in the party
  • Meal: Time of day (Lunch, Dinner, Late Night)
  • Age: Age category of person paying the bill (Yadult, Middle, SenCit)
  • Day: Day of the week
  • Payment: the type of payment used
  • GiftCard: whether a gift card was used
  • Comps: Whether any food was comped
  • Alcohol: Whether any alcohol was ordered
  • Bday: whether it was someone’s birthday
  • Bill: the size of the bill

Analysis goal

The goals of this activity are to:

  • Compare subsets of predictors using nested F-tests
  • Use best subset, forward, and backward selection to choose models

Exercise 0

Fit two models:

  1. tip_fit_1: predict Tip from Party, Age, and Meal
  2. tip_fit_2: predict Tip from Party, Age, Meal, and Day.
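
As a syntax reminder, here is a minimal sketch of fitting a smaller and a larger linear model with lm(). It uses R's built-in mtcars data rather than the tips data, and the names mt_fit_small and mt_fit_large are made up for this illustration.

```r
# Illustration on the built-in mtcars data (not the tips data).
# Smaller model: predict mpg from weight and horsepower.
mt_fit_small <- lm(mpg ~ wt + hp, data = mtcars)

# Larger model: the same predictors plus cylinders treated as categorical.
mt_fit_large <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# List the terms in each model to compare them side by side.
attr(terms(mt_fit_small), "term.labels")
attr(terms(mt_fit_large), "term.labels")
```

A categorical predictor with k levels contributes k - 1 coefficients, so the larger model here has two more coefficients than the smaller one.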

Exercise 1

Of the two models you just fit, which is nested inside the other?

Exercise 2

Fit a third model called tip_fit_3 so that tip_fit_1 is nested inside tip_fit_3. Try to choose a model that you think will do a better job than tip_fit_1 of modeling the data. Feel free to include interaction or polynomial terms. Have your reporter write the model on the board.

Exercise 3

Change eval: FALSE to eval: TRUE. Alter the code below so that you’re comparing tip_fit_3 to tip_fit_1.

anova(tip_fit_1, tip_fit_2) |> # Enter reduced model first
  tidy() |> 
  kable()

Write down the hypotheses, test statistic, and p-value of your test. Interpret the output in the context of the problem. Have your reporter write the p-value on the board.
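
To see where the numbers in the anova() output come from, the sketch below computes a nested F-test by hand and checks it against anova(). It uses the built-in mtcars data, not the tips data, and the object names are hypothetical. The null hypothesis is that the coefficients of all terms added in the full model are zero.

```r
# Illustration on the built-in mtcars data (not the tips data).
mt_reduced <- lm(mpg ~ wt + hp, data = mtcars)
mt_full <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Residual sums of squares and residual degrees of freedom for each model.
sse_reduced <- sum(resid(mt_reduced)^2)
sse_full <- sum(resid(mt_full)^2)
df_reduced <- df.residual(mt_reduced)
df_full <- df.residual(mt_full)

# F = [(SSE_reduced - SSE_full) / (df_reduced - df_full)] / [SSE_full / df_full]
f_stat <- ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)

# Upper-tail probability from the F distribution.
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# The same test via anova(), reduced model first.
nested_test <- anova(mt_reduced, mt_full)
all.equal(f_stat, nested_test$F[2])
```

The second row of the anova() table holds the test statistic and p-value, which is why the comparison indexes position 2.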

Exercise 4

Change eval: FALSE to eval: TRUE. First, look at the documentation for regsubsets and determine what nvmax does. Alter the code below so that the largest model regsubsets will consider has 20 (wow!) terms in it. Find the “best” model according to the following criteria, and have your reporter write the model on a white board. Note: arrange will sort the rows so that the optimal model will appear at the top or bottom.

  • Group 1: Find the best model according to \(R^2_{adj}\).
  • Group 2: Find the best model according to \(C_p/AIC\).
  • Group 3: Find the best model according to \(BIC\).
  • Group 4: Find the best model according to \(R^2\). (Don’t write it up but think about if it’s what we would expect)

library(leaps)

best_subsets <- regsubsets(Tip ~ ., data = tips) 

best_subsets |> 
  tidy() |> 
  arrange(desc(_____)) |> # Delete desc() if you want to sort in increasing order
  kable()

coef(best_subsets, ___) # Enter the number of predictors in the best model
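
The same workflow can also be done without tidy(), by reading the criteria from summary(). The sketch below assumes the leaps package is installed and uses the built-in mtcars data, not the tips data. The names mt_subsets and criteria are made up for this illustration.

```r
library(leaps)

# nvmax caps the number of terms in the largest model considered.
# mtcars has 10 candidate predictors, so nvmax = 10 allows all of them.
mt_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
mt_summary <- summary(mt_subsets)

# One row per model size, with the criteria reported by summary().
criteria <- data.frame(
  n_terms = seq_along(mt_summary$bic),
  r2 = mt_summary$rsq,
  adj_r2 = mt_summary$adjr2,
  cp = mt_summary$cp,
  bic = mt_summary$bic
)

# Larger is better for adjusted R^2; smaller is better for Cp and BIC.
best_adj_r2 <- which.max(criteria$adj_r2)
best_cp <- which.min(criteria$cp)
best_bic <- which.min(criteria$bic)

# Coefficients of the model size chosen by BIC.
coef(mt_subsets, best_bic)
```

Because the criteria point in different directions, it helps to decide whether you are looking for a maximum or a minimum before sorting.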

Exercise 5

The regsubsets function has an argument called method. This argument tells regsubsets whether to use exhaustive (i.e. best subsets) selection, forward stepwise, or backward stepwise selection. Re-purpose your code above to find the best model according to the following criteria and have your reporter write down the terms that are included in your model (you do not need to write out the coefficients):

  • Group 1: Forward stepwise selection using \(C_p\).
  • Group 2: Backward stepwise selection using \(C_p\).
  • Group 3: Forward stepwise selection using \(BIC\).
  • Group 4: Backward stepwise selection using \(BIC\).
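
A minimal sketch of switching the search strategy with the method argument, again on the built-in mtcars data rather than the tips data. It assumes the leaps package is installed, and the object names are hypothetical.

```r
library(leaps)

# Forward stepwise: start from the intercept-only model and add terms.
mt_forward <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")

# Backward stepwise: start from the full model and remove terms.
mt_backward <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "backward")

# Model size with the smallest Cp under each search strategy.
fwd_size <- which.min(summary(mt_forward)$cp)
bwd_size <- which.min(summary(mt_backward)$cp)

# Names of the terms in each chosen model (coefficients omitted).
names(coef(mt_forward, fwd_size))
names(coef(mt_backward, bwd_size))
```

The two strategies follow different search paths, so they are not guaranteed to select the same terms.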

To submit the AE

Important
  • Render the document to produce the PDF with all of your work from today’s class.
  • Upload your QMD and PDF files to the Canvas assignment.

Footnotes

  1. Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.