Simple Linear Regression

Prediction + Using R

Prof. Eric Friedlander

Aug 30, 2024

Finish Wednesday’s AE

📋 AE 01 - Movie Budgets and Revenues

Last Time

Used simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.
Used the least squares method to estimate the slope and intercept.
Interpreted the slope and intercept.
- Slope: For every one-unit increase in \(x\), we expect \(y\) to change by \(\hat{\beta}_1\) units, on average.
- Intercept: If \(x\) is 0, then we expect \(y\) to be \(\hat{\beta}_0\) units.

Topics

Predict the response given a value of the predictor variable.
Use R to fit and summarize regression models.

Computation set up

# load packages
library(tidyverse)       # for data wrangling
library(ggformula)       # for plotting
library(fivethirtyeight) # for the fandango dataset
library(broom)           # for formatting model output
library(knitr)           # for formatting tables

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)

Data

Movie scores

Data behind the FiveThirtyEight story “Be Suspicious Of Online Movie Ratings, Especially Fandango’s”
In the fivethirtyeight package: fandango
Contains every film released in 2014 and 2015 that has at least 30 fan reviews on Fandango, an IMDb score, Rotten Tomatoes critic and user ratings, and Metacritic critic and user scores

Logos: Fandango, IMDb, Rotten Tomatoes, Metacritic

Data prep

Rename Rotten Tomatoes columns as critics and audience
Rename the dataset as movie_scores

movie_scores <- fandango |>
  rename(critics = rottentomatoes, 
         audience = rottentomatoes_user)

Movie scores data

The data set contains the “Tomatometer” score (critics) and audience score (audience) for 146 movies rated on rottentomatoes.com.

movie_scores |> 
  gf_point(audience ~ critics, alpha = 0.5) + 
  labs(x = "Critics Score", 
       y = "Audience Score")

Movie ratings data

Goal: Fit a line to describe the relationship between the critics score and audience score.

Prediction

Recall: Our Model

\[\begin{aligned} \widehat{Y} &= 32.3142 + 0.5187 \times X\\ \widehat{\text{audience}} &= 32.3142 + 0.5187 \times \text{critics} \end{aligned}\]

Making a prediction

Suppose that a movie has a critics score of 70. According to this model, what is the movie’s predicted audience score?

\[\begin{aligned} \widehat{\text{audience}} &= 32.3142 + 0.5187 \times \text{critics} \\ &= 32.3142 + 0.5187 \times 70 \\ &= 68.6232 \end{aligned}\]
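The same arithmetic can be checked in R. This is a minimal sketch that hard-codes the rounded coefficients from the equation above; `predict_audience()` is a hypothetical helper written for illustration, not a function from any package.

```r
# rounded coefficients taken from the model equation above
b0 <- 32.3142  # intercept
b1 <- 0.5187   # slope

# hypothetical helper: plug a critics score into the fitted line
predict_audience <- function(critics) {
  b0 + b1 * critics
}

predict_audience(70)  # 68.6232
```

Because the helper is vectorized, something like `predict_audience(c(50, 70, 90))` returns one prediction per critics score.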

Fitting the model

Fit model & estimate parameters

movie_fit <- lm(audience ~ critics, data = movie_scores)
movie_fit


Call:
lm(formula = audience ~ critics, data = movie_scores)

Coefficients:
(Intercept)      critics  
    32.3155       0.5187

Look at the regression output

movie_fit <- lm(audience ~ critics, data = movie_scores)
movie_fit


Call:
lm(formula = audience ~ critics, data = movie_scores)

Coefficients:
(Intercept)      critics  
    32.3155       0.5187

\[\widehat{\text{audience}} = 32.3155 + 0.5187 \times \text{critics}\]

Note: The intercept differs slightly from the hand-calculated intercept; this is due to rounding in the hand calculation.
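To see that rounding is the only source of the difference, the hand calculation can be redone at full precision. The sketch below assumes the hand calculation used the usual summary-statistic formulas (slope = correlation times the ratio of standard deviations, intercept = mean of y minus slope times mean of x); the variable names are chosen here for illustration.

```r
library(fivethirtyeight) # provides the fandango dataset

x <- fandango$rottentomatoes       # critics score (predictor)
y <- fandango$rottentomatoes_user  # audience score (response)

# least squares estimates from summary statistics, with no intermediate rounding
slope     <- cor(x, y) * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)

# estimates from lm() for comparison
coef(lm(y ~ x))
```

Without intermediate rounding, the two sets of estimates should agree up to floating-point error.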

The regression output

We’ll focus on the estimate column for now…

movie_fit |> 
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

Format output with `kable`

Use the `kable` function from the knitr package to produce a table and specify the number of decimal places

movie_fit |> 
  tidy() |>
  kable(digits = 4)

term	estimate	std.error	statistic	p.value
(Intercept)	32.3155	2.3425	13.7953	0
critics	0.5187	0.0345	15.0281	0
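The p-values print as 0 in this table because `digits = 4` rounds every numeric column to four decimal places. One possible workaround, sketched below, is to convert the p-values to text with base R's `format.pval()` before calling `kable()`; the name `fit_table` is an assumption made for this example.

```r
library(fivethirtyeight) # fandango dataset
library(dplyr)           # rename()
library(broom)           # tidy()
library(knitr)           # kable()

movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)

movie_fit <- lm(audience ~ critics, data = movie_scores)

# convert p-values to character so kable() does not round them to 0
fit_table <- tidy(movie_fit)
fit_table$p.value <- format.pval(fit_table$p.value, digits = 3)

kable(fit_table, digits = 4)
```

Here `digits = 4` still applies to the remaining numeric columns, while very small p-values are reported with a `<` sign instead of 0.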

Visualize Model

movie_scores |> 
  gf_point(audience ~ critics) |> 
  gf_lm()

Prediction

# create a data frame for a new movie
new_movie <- tibble(critics = 70)

# predict the outcome for a new movie
predict(movie_fit, new_movie)

       1 
68.62297
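`predict()` also accepts several new observations at once. The sketch below refits the model so that it runs on its own, predicts for three critics scores chosen here purely for illustration, and checks the result against the fitted equation using the unrounded coefficients.

```r
library(fivethirtyeight) # fandango dataset
library(dplyr)           # rename()

movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)

movie_fit <- lm(audience ~ critics, data = movie_scores)

# illustrative critics scores, one row per new movie
new_movies <- data.frame(critics = c(50, 70, 90))

# one predicted audience score per row of new_movies
preds <- predict(movie_fit, new_movies)
preds

# same predictions from the fitted equation with unrounded coefficients
b <- coef(movie_fit)
by_hand <- b[["(Intercept)"]] + b[["critics"]] * new_movies$critics
all.equal(unname(preds), by_hand)
```

The column name in the new data frame must match the predictor name used in the model formula (`critics`), otherwise `predict()` cannot find the variable.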
