Simple Linear Regression

Prediction + Using R

Prof. Eric Friedlander

Aug 30, 2024

Finish Wednesday’s AE

Last Time

  • Used simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.

  • Used the least squares method to estimate the slope and intercept.

  • Interpreted the slope and intercept.

    • Slope: For every one-unit increase in \(x\), we expect \(y\) to change by \(\hat{\beta}_1\) units, on average.
    • Intercept: If \(x\) is 0, then we expect \(y\) to be \(\hat{\beta}_0\) units.
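
As a reminder of what the least squares method computes, here is a minimal sketch that finds the slope and intercept by hand and checks them against lm(). The data frame toy and its values are made up for illustration (they are not the fandango data). The formulas used are the standard ones for simple linear regression: \(\hat{\beta}_1 = r \, s_y / s_x\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).

```r
# Hypothetical toy data (NOT the fandango data), using the same column names
toy <- data.frame(
  critics  = c(10, 30, 50, 70, 90),
  audience = c(35, 50, 55, 72, 78)
)

# Least squares slope: correlation times the ratio of standard deviations
b1 <- cor(toy$critics, toy$audience) * sd(toy$audience) / sd(toy$critics)

# Least squares intercept: the fitted line passes through (mean x, mean y)
b0 <- mean(toy$audience) - b1 * mean(toy$critics)

# lm() should report the same two numbers
toy_fit <- lm(audience ~ critics, data = toy)
c(intercept = b0, slope = b1)
coef(toy_fit)
```

For these made-up values, both approaches give an intercept of 31 and a slope of 0.54.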

Topics

  • Predict the response given a value of the predictor variable.

  • Use R to fit and summarize regression models.

Computation set up

# load packages
library(tidyverse)       # for data wrangling
library(ggformula)       # for plotting
library(fivethirtyeight) # for the fandango dataset
library(broom)           # for formatting model output
library(knitr)           # for formatting tables

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)

Data

Movie scores

[Logos: Fandango, IMDB, Rotten Tomatoes, Metacritic]

Data prep

  • Rename Rotten Tomatoes columns as critics and audience
  • Rename the dataset as movie_scores
movie_scores <- fandango |>
  rename(critics = rottentomatoes, 
         audience = rottentomatoes_user)

Movie scores data

The data set contains the “Tomatometer” score (critics) and audience score (audience) for 146 movies rated on rottentomatoes.com.

movie_scores |>
  gf_point(audience ~ critics, alpha = 0.5) +
  labs(x = "Critics Score",
       y = "Audience Score")

Movie ratings data

Goal: Fit a line to describe the relationship between the critics score and audience score.

Prediction

Recall: Our Model

\[\begin{aligned} \widehat{Y} &= 32.3142 + 0.5187 \times X\\ \widehat{\text{audience}} &= 32.3142 + 0.5187 \times \text{critics} \end{aligned}\]

Making a prediction

Suppose that a movie has a critics score of 70. According to this model, what is the movie’s predicted audience score?

\[\begin{aligned} \widehat{\text{audience}} &= 32.3142 + 0.5187 \times \text{critics} \\ &= 32.3142 + 0.5187 \times 70 \\ &= 68.6232 \end{aligned}\]
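
The same arithmetic can be checked in R. This is a minimal sketch: the coefficients are copied from the model above, and predict_audience is a hypothetical helper name introduced only for this illustration.

```r
# Coefficients copied from the hand-calculated model above
b0 <- 32.3142  # intercept
b1 <- 0.5187   # slope

# Hypothetical helper: predicted audience score for a given critics score
predict_audience <- function(critics) {
  b0 + b1 * critics
}

predict_audience(70)
# about 68.6232
```

Because the helper is vectorized, predict_audience(c(50, 70, 90)) would return one prediction per critics score.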

Fitting the model

Fit model & estimate parameters

movie_fit <- lm(audience ~ critics, data = movie_scores)
movie_fit

Call:
lm(formula = audience ~ critics, data = movie_scores)

Coefficients:
(Intercept)      critics  
    32.3155       0.5187  

Look at the regression output

\[\widehat{\text{audience}} = 32.3155 + 0.5187 \times \text{critics}\]

Note: The intercept differs slightly from the hand-calculated intercept (32.3142). The difference is due to rounding in the hand calculation.
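
To see how small the discrepancy is, the sketch below compares predictions at a critics score of 70 using the two intercepts that appear in these slides. Only the intercept differs, so the gap is the same for every critics score.

```r
critics <- 70

# Intercept from the hand calculation
hand_calc <- 32.3142 + 0.5187 * critics

# Intercept as printed by lm()
lm_output <- 32.3155 + 0.5187 * critics

lm_output - hand_calc
# about 0.0013 audience-score points
```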

The regression output

We’ll focus on the estimate column for now…

movie_fit |> 
  tidy() 
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

Format output with kable

Use the kable function from the knitr package to produce a table, and use its digits argument to set the number of digits shown after the decimal point.

movie_fit |> 
  tidy() |>
  kable(digits = 4)
term          estimate  std.error  statistic  p.value
(Intercept)    32.3155     2.3425    13.7953        0
critics         0.5187     0.0345    15.0281        0
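
According to the knitr documentation, the digits argument of kable() is passed to round(), so it controls decimal places rather than significant digits. The base R sketch below illustrates the difference using values taken from the output above, and shows why the p.value column prints as 0.

```r
estimate <- 32.3155   # intercept estimate from the table
p_value  <- 4.03e-28  # intercept p-value from the tidy() output

round(estimate, digits = 1)   # 32.3: one digit after the decimal point
signif(estimate, digits = 1)  # 30: one significant digit

round(p_value, digits = 4)    # 0: the p-value is far smaller than 0.0001
```

A p.value of 0 in the formatted table therefore means "rounds to zero at four decimal places", not that the p-value is exactly zero.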

Visualize Model

movie_scores |> 
  gf_point(audience ~ critics) |> 
  gf_lm()

Prediction

# create a data frame for a new movie
new_movie <- tibble(critics = 70)

# predict the outcome for a new movie
predict(movie_fit, new_movie)
       1 
68.62297 
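
predict() also accepts several new observations at once, and its result is the intercept plus the slope times each new predictor value. The sketch below shows this on made-up data: the toy data frame and its values are hypothetical (not the fandango data) and are used only so the example runs with base R.

```r
# Hypothetical toy data (NOT the fandango data), with the same column names
toy <- data.frame(
  critics  = c(10, 30, 50, 70, 90),
  audience = c(35, 50, 55, 72, 78)
)
toy_fit <- lm(audience ~ critics, data = toy)

# New data must use the predictor's column name from the model formula
new_movies <- data.frame(critics = c(25, 50, 70))

# Predictions from predict() ...
from_predict <- predict(toy_fit, newdata = new_movies)

# ... match intercept + slope * critics computed by hand
b <- coef(toy_fit)
by_hand <- b[["(Intercept)"]] + b[["critics"]] * new_movies$critics

from_predict
by_hand
```

If the column in the new data frame is not named critics, predict() cannot match it to the model formula, which is why new_movie above is built with critics = 70.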

Application Exercise

Wrap up

Recap

  • Predicted the response given a value of the predictor variable.

  • Used lm and the broom package to fit and summarize regression models in R.