library(tidyverse)
library(broom)
library(ggformula)
library(fivethirtyeight)
library(knitr)
library(yardstick)
# add other packages as neededHomework 06: Candy Competition
Inference for Multiple Linear Regression
Due: Friday, October 20, 11:59pm
Introduction
In today’s homework you will analyze data about candy that was collected from an online experiment conducted at FiveThirtyEight.
Learning goals
By the end of the homework you will be able to
- Fit a linear model with multiple predictors and an interaction term
- Fit a linear model with categorical predictors
- Conduct inference on multiple linear models
Getting started
Packages
The following packages are used in the lab.
Data: Candy
The data from this lab comes from the the article FiveThirtyEight The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect data, Hickey and collaborators at FiveThirtyEight set up an experiment people could vote on a series of randomly generated candy matchups (e.g. Reeses vs. Skittles). Click here to check out some of the match ups.
The data set contains the characteristics and win percentage from 85 candies in the experiment. The variables are
| Variable | Description |
|---|---|
chocolate |
Does it contain chocolate? |
fruity |
Is it fruit flavored? |
caramel |
Is there caramel in the candy? |
peanutalmondy |
Does it contain peanuts, peanut butter or almonds? |
nougat |
Does it contain nougat? |
crispedricewafer |
Does it contain crisped rice, wafers, or a cookie component? |
hard |
Is it a hard candy? |
bar |
Is it a candy bar? |
pluribus |
Is it one of many candies in a bag or box? |
sugarpercent |
The percentile of sugar it falls under within the data set. Values 0 - 1. |
pricepercent |
The unit price percentile compared to the rest of the set. Values 0 - 1. |
winpercent |
The overall win percentage according to 269,000 matchups. Values 0 - 100. |
Use the code below to get a glimpse of the candy_rankings data frame in the fivethirtyeight R package.
glimpse(candy_rankings)Rows: 85
Columns: 13
$ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
Exercises
The goal of this analysis is to use multiple linear regression to understand the factors that make a good candy.
Exercise 1
Notice that the values of pricepercent and sugarpercent are proportions. Change the scale so that they are percentages.
Exercise 2
- Our response variable in this homework will be
winpercentage. Choose two additional variables, one quantitative and one categorical. Generate a SINGLE plot that visualizes all three variables. Hint: remember that you can tie any aesthetic in your plot (e.g. color) to a variable be writingaesthatic = ~variable_name. - Write two observations from your plot.
Exercise 3
Fit a linear model including both variables you chose above and include an interaction term between the two.
Exercise 4
Interpret the following in the context of the data:
Intercept
Coefficient of your quantitative variable
Coefficient(s) of your categorical variables
Coefficient(s) of your interaction term.
Exercise 5
Choose one of the coefficients from your model and write out all the steps in the hypothesis test that corresponds to it’s p-value as in this slide.
Exercise 6
Interpret the other p-values in your model in the context of the data. You do not need to write out the full hypothesis testing framework as you did in Exercise 5.
Exercise 7
Generate 95% confidence intervals for all of the slopes in your model and interpret them in the context of the data. Are these intervals independent of one another?
Exercise 8
Fit a linear model predicting winpercent using chocolate and pricepercent as predictors. Include a interaction term between pricepercent and chocolate. Describe the effect of pricepercent for chocolate candy in the context of the data.
Exercise 9
Consider the model from Exercise 3 “Model 1” and the model fit in Exercise 8 “Model 2”. If you happened to have chosen
pricepercentandchocolatein Exercise 3, please choose two new variables and to complete this problem with.Which model would you choose based on \(R^2\)? Briefly explain your choice.
Which model would you choose based on \(RMSE\)? Briefly explain your choice.
Exercise 10
- Use the model you selected in Exercise 9 to describe what generally makes a good candy, i.e., one with a high win percentage.
Grading
Total points available: 50 points.
| Component | Points |
|---|---|
| Ex 1 | 2 |
| Ex 2 | 3 |
| Ex 3 | 5 |
| Ex 4 | 8 |
| Ex 5 | 6 |
| Ex 6 | 5 |
| Ex 7 | 5 |
| Ex 8 | 4 |
| Ex 9 | 4 |
| Ex 10 | 4 |
| Workflow & formatting | 41 |
Footnotes
The “Workflow & formatting” grade is to assess the reproducible workflow, clarity, and professionalism.↩︎