# load packages
library(tidyverse) # for data wrangling
library(tidymodels) # for modeling
library(fivethirtyeight) # for the fandango dataset
library(knitr) # for formatting tables
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))
# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)
Simple Linear Regression
Questions?
Join sta210-fa23 on GitHub
Topics
Use simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.
Estimate the slope and intercept of the regression line using the least squares method.
Interpret the slope and intercept of the regression line.
Predict the response given a value of the predictor variable.
Use tidymodels to fit and summarize regression models in R.
Computation set up
Data
Movie scores
- Data behind the FiveThirtyEight story Be Suspicious Of Online Movie Ratings, Especially Fandango's
- In the fivethirtyeight package: fandango
  - Contains every film released in 2014 and 2015 that has at least 30 fan reviews on Fandango, an IMDb score, Rotten Tomatoes critic and user ratings, and Metacritic critic and user scores




Data prep
- Rename Rotten Tomatoes columns as critics and audience
- Rename the dataset as movie_scores
movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)
Data overview
glimpse(movie_scores)
Rows: 146
Columns: 23
$ film <chr> "Avengers: Age of Ultron", "Cinderella", "A…
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ critics <int> 74, 85, 80, 18, 14, 63, 42, 86, 99, 89, 84,…
$ audience <int> 86, 80, 90, 84, 28, 62, 53, 64, 82, 87, 77,…
$ metacritic <int> 66, 67, 64, 22, 29, 50, 53, 81, 81, 80, 71,…
$ metacritic_user <dbl> 7.1, 7.5, 8.1, 4.7, 3.4, 6.8, 7.6, 6.8, 8.8…
$ imdb <dbl> 7.8, 7.1, 7.8, 5.4, 5.1, 7.2, 6.9, 6.5, 7.4…
$ fandango_stars <dbl> 5.0, 5.0, 5.0, 5.0, 3.5, 4.5, 4.0, 4.0, 4.5…
$ fandango_ratingvalue <dbl> 4.5, 4.5, 4.5, 4.5, 3.0, 4.0, 3.5, 3.5, 4.0…
$ rt_norm <dbl> 3.70, 4.25, 4.00, 0.90, 0.70, 3.15, 2.10, 4…
$ rt_user_norm <dbl> 4.30, 4.00, 4.50, 4.20, 1.40, 3.10, 2.65, 3…
$ metacritic_norm <dbl> 3.30, 3.35, 3.20, 1.10, 1.45, 2.50, 2.65, 4…
$ metacritic_user_nom <dbl> 3.55, 3.75, 4.05, 2.35, 1.70, 3.40, 3.80, 3…
$ imdb_norm <dbl> 3.90, 3.55, 3.90, 2.70, 2.55, 3.60, 3.45, 3…
$ rt_norm_round <dbl> 3.5, 4.5, 4.0, 1.0, 0.5, 3.0, 2.0, 4.5, 5.0…
$ rt_user_norm_round <dbl> 4.5, 4.0, 4.5, 4.0, 1.5, 3.0, 2.5, 3.0, 4.0…
$ metacritic_norm_round <dbl> 3.5, 3.5, 3.0, 1.0, 1.5, 2.5, 2.5, 4.0, 4.0…
$ metacritic_user_norm_round <dbl> 3.5, 4.0, 4.0, 2.5, 1.5, 3.5, 4.0, 3.5, 4.5…
$ imdb_norm_round <dbl> 4.0, 3.5, 4.0, 2.5, 2.5, 3.5, 3.5, 3.5, 3.5…
$ metacritic_user_vote_count <int> 1330, 249, 627, 31, 88, 34, 17, 124, 62, 54…
$ imdb_user_vote_count <int> 271107, 65709, 103660, 3136, 19560, 39373, …
$ fandango_votes <int> 14846, 12640, 12055, 1793, 1021, 397, 252, …
$ fandango_difference <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5…
Movie scores data
The data set contains the “Tomatometer” score (critics) and audience score (audience) for 146 movies rated on rottentomatoes.com.

Movie ratings data
Goal: Fit a line to describe the relationship between the critics score and audience score.
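A scatterplot with a fitted line for this goal could be drawn along the following lines. This is a sketch, not the original figure's code: the axis labels, the point transparency, and the object name `p` are assumptions, and the data prep is repeated so the chunk runs on its own.

```r
library(fivethirtyeight) # for the fandango dataset
library(dplyr)           # for rename()
library(ggplot2)         # for plotting

# Same data prep as on the earlier slide
movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)

# Scatterplot of audience vs. critics score with a least squares line
# (labels and alpha are illustrative choices)
p <- ggplot(movie_scores, aes(x = critics, y = audience)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Critics score", y = "Audience score")
p
```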

Why fit a line?
We fit a line to accomplish one or both of the following:
Prediction
What is the audience score expected to be for an upcoming movie that received 35% from the critics?
Inference
Is the critics score a useful predictor of the audience score? By how much is the audience score expected to change for each additional point in the critics score?
Terminology
Response, Y: variable describing the outcome of interest
Predictor, X: variable we use to help understand the variability in the response

Regression model
A regression model is a function that describes the relationship between the response, \(Y\), and the predictor, \(X\).
\[\begin{aligned} Y &= \color{black}{\textbf{Model}} + \text{Error} \\[8pt] &= \color{black}{\mathbf{f(X)}} + \epsilon \\[8pt] &= \color{black}{\boldsymbol{\mu_{Y|X}}} + \epsilon \end{aligned}\]
Regression model

\(\mu_{Y|X}\) is the mean value of \(Y\) given a particular value of \(X\).
Regression model
\[ \begin{aligned} Y &= \color{purple}{\textbf{Model}} + \color{blue}{\textbf{Error}} \\[5pt] &= \color{purple}{\mathbf{f(X)}} + \color{blue}{\boldsymbol{\epsilon}} \\[5pt] &= \color{purple}{\boldsymbol{\mu_{Y|X}}} + \color{blue}{\boldsymbol{\epsilon}} \\[5pt] \end{aligned} \]

Simple linear regression (SLR)
SLR: Statistical model
When we have a quantitative response, \(Y\), and a single quantitative predictor, \(X\), we can use a simple linear regression model to describe the relationship between \(Y\) and \(X\). \[\Large{Y = \mathbf{\beta_0 + \beta_1 X} + \epsilon}\]
- \(\beta_1\): True slope of the relationship between \(X\) and \(Y\)
- \(\beta_0\): True intercept of the relationship between \(X\) and \(Y\)
- \(\epsilon\): Error
SLR: Regression equation
\[\Large{\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X}\]
- \(\hat{\beta}_1\): Estimated slope of the relationship between \(X\) and \(Y\)
- \(\hat{\beta}_0\): Estimated intercept of the relationship between \(X\) and \(Y\)
- No error term!
Choosing values for \(\hat{\beta}_1\) and \(\hat{\beta}_0\)

Residuals

\[\text{residual} = \text{observed} - \text{predicted} = y_i - \hat{y}_i\]
Least squares line
- The residual for the \(i^{th}\) observation is
\[e_i = \text{observed} - \text{predicted} = y_i - \hat{y}_i\]
- The sum of squared residuals is
\[e^2_1 + e^2_2 + \dots + e^2_n\]
- The least squares line is the one that minimizes the sum of squared residuals
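To make the minimization concrete, the sketch below compares the sum of squared residuals of the least squares line against two nearby candidate lines. The toy vectors `x` and `y` are hypothetical values chosen for illustration (they are not the movie data), and the helper `ssr()` is a name introduced here.

```r
# Hypothetical toy data for illustration only (not the movie data)
x <- c(10, 25, 40, 55, 70, 85, 95)
y <- c(35, 50, 48, 62, 70, 72, 88)

# Sum of squared residuals for a candidate line with intercept b0 and slope b1
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Least squares estimates from lm()
ls_fit <- lm(y ~ x)
b0_hat <- unname(coef(ls_fit)[1])
b1_hat <- unname(coef(ls_fit)[2])

# SSR at the least squares line and at two slightly different lines
ssr_ls <- ssr(b0_hat, b1_hat)
ssr_shift_intercept <- ssr(b0_hat + 1, b1_hat)
ssr_shift_slope <- ssr(b0_hat, b1_hat + 0.05)

c(least_squares = ssr_ls,
  shifted_intercept = ssr_shift_intercept,
  shifted_slope = ssr_shift_slope)
```

Any other choice of intercept and slope gives a larger sum of squared residuals than the least squares line.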
Slope and intercept
Properties of least squares regression
- The regression line goes through the center of mass point, the coordinates corresponding to average \(X\) and average \(Y\): \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)
- The slope has the same sign as the correlation coefficient: \(\hat{\beta}_1 = r \frac{s_Y}{s_X}\)
- The sum of the residuals is approximately zero: \(\sum_{i = 1}^n e_i \approx 0\)
- The residuals and \(X\) values are uncorrelated
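These properties can be checked numerically. The sketch below uses hypothetical toy vectors (not the movie data) and base R's `lm()`; the object names and the tolerance `1e-8` are assumptions chosen to absorb floating-point error.

```r
# Hypothetical toy data for illustration only (not the movie data)
x <- c(10, 25, 40, 55, 70, 85, 95)
y <- c(35, 50, 48, 62, 70, 72, 88)

fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]
e <- resid(fit)

# 1. The line passes through (mean of x, mean of y)
through_center <- isTRUE(all.equal(b0 + b1 * mean(x), mean(y)))

# 2. The slope equals r * s_Y / s_X, so it shares the sign of r
slope_matches <- isTRUE(all.equal(b1, cor(x, y) * sd(y) / sd(x)))

# 3. The residuals sum to zero, up to floating-point error
resid_sum_zero <- abs(sum(e)) < 1e-8

# 4. The residuals are uncorrelated with x, up to floating-point error
resid_uncorrelated <- abs(cor(e, x)) < 1e-8

c(through_center, slope_matches, resid_sum_zero, resid_uncorrelated)
```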
Estimating the slope
\[\large{\hat{\beta}_1 = r \frac{s_Y}{s_X}}\]
Click here for details on deriving the equations for slope and intercept.
Estimating the intercept
\[\large{\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}}\]
Click here for details on deriving the equations for slope and intercept.
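As a brief sketch of where these two formulas come from (the standard calculus argument, written in the notation above and not taken from the linked notes): the least squares estimates minimize the sum of squared residuals,

\[\text{SSR} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2\]

Setting the partial derivative with respect to \(\hat{\beta}_0\) to zero gives the intercept:

\[-2\sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0 \quad\Rightarrow\quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]

Setting the partial derivative with respect to \(\hat{\beta}_1\) to zero and substituting the expression for \(\hat{\beta}_0\) gives the slope:

\[-2\sum_{i=1}^n x_i\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0 \quad\Rightarrow\quad \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{X})^2} = r \frac{s_Y}{s_X}\]

The last equality holds because \(\sum_{i=1}^n (x_i - \bar{X})(y_i - \bar{Y}) = (n-1)\, r\, s_X s_Y\) and \(\sum_{i=1}^n (x_i - \bar{X})^2 = (n-1)\, s_X^2\).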
Interpretation
Post your answers to the following questions on Ed Discussion:
The slope of the model for predicting audience score from critics score is 0.5187. Which of the following is the best interpretation of this value?
32.3142 is the predicted mean audience score for what type of movies?
Does it make sense to interpret the intercept?
The intercept is meaningful in the context of the data if
- the predictor can feasibly take values equal to or near zero, or
- there are values near zero in the observed data.
Otherwise, the intercept may not be meaningful!
Prediction
Making a prediction
Suppose that a movie has a critics score of 70. According to this model, what is the movie's predicted audience score?
\[\begin{aligned} \widehat{\text{audience}} &= 32.3142 + 0.5187 \times \text{critics} \\ &= 32.3142 + 0.5187 \times 70 \\ &= 68.6232 \end{aligned}\]
Fitting regression models with tidymodels
tidymodels
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6     ✔ rsample      1.2.1
✔ dials        1.2.1     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
Why tidymodels?
- Consistent syntax for different model types (linear, logistic, random forest, Bayesian, etc.)
- Streamline modeling workflow
- Split data into train and test sets
- Transform and create new variables
- Assess model performance
- Use model for prediction and inference
Fitting the model
Step 1: Specify model
linear_reg()
Linear Regression Model Specification (regression)
Computational engine: lm
Step 2: Set model fitting engine
Linear Regression Model Specification (regression)
Computational engine: lm
Step 3: Fit model & estimate parameters
using formula syntax
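The three steps can be written out one at a time, as sketched below. The intermediate object names `slr_spec` and `slr_fit` are introduced here for illustration, and the data prep is repeated so the chunk runs on its own.

```r
library(tidymodels)      # attaches parsnip, dplyr, and related packages
library(fivethirtyeight) # for the fandango dataset

# Same data prep as on the earlier slide
movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)

# Step 1: specify the model type
slr_spec <- linear_reg()

# Step 2: set the model fitting engine
slr_spec <- slr_spec |>
  set_engine("lm")

# Step 3: fit the model and estimate parameters using formula syntax
slr_fit <- slr_spec |>
  fit(audience ~ critics, data = movie_scores)
slr_fit
```

Keeping the specification separate from the fit means the same `slr_spec` can be reused with a different formula or data set.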
A closer look at the regression output
movie_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(audience ~ critics, data = movie_scores)
movie_fit
parsnip model object
Call:
stats::lm(formula = audience ~ critics, data = data)
Coefficients:
(Intercept)      critics
    32.3155       0.5187
\[\widehat{\text{audience}} = 32.3155 + 0.5187 \times \text{critics}\]
Note: The intercept differs slightly from the hand-calculated intercept (32.3142). This is due to rounding in the hand calculation.
The regression output
We'll focus on the first column for now…
Format output with kable
Use the kable function from the knitr package to produce a table, and use its digits argument to set the number of decimal places shown.
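A minimal sketch of producing and formatting the regression output follows. It assumes the output on these slides comes from `tidy()` applied to the fitted model; the choice `digits = 3` and the object name `movie_tidy` are illustrative.

```r
library(tidymodels)      # attaches broom, parsnip, and dplyr
library(fivethirtyeight) # for the fandango dataset
library(knitr)           # for kable()

# Same data prep and model fit as on the earlier slides
movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)

movie_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(audience ~ critics, data = movie_scores)

# tidy() returns one row per coefficient, with columns
# term, estimate, std.error, statistic, and p.value
movie_tidy <- tidy(movie_fit)

# kable() formats the table; digits rounds the numeric columns
kable(movie_tidy, digits = 3)
```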
Prediction
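One way to compute a prediction from the fitted model is `predict()` with a data frame of new predictor values, sketched below. The object names `new_movie` and `movie_pred` are introduced for illustration, and the critics score of 70 reuses the earlier hand-calculated example.

```r
library(tidymodels)      # attaches parsnip, tibble, and dplyr
library(fivethirtyeight) # for the fandango dataset

# Same data prep and model fit as on the earlier slides
movie_scores <- fandango |>
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user)

movie_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(audience ~ critics, data = movie_scores)

# New observation: a movie with a critics score of 70
new_movie <- tibble(critics = 70)

# predict() returns a tibble with the prediction in a column named .pred
movie_pred <- predict(movie_fit, new_data = new_movie)
movie_pred
```

The result should be close to the hand-calculated value of about 68.6; small differences come from rounding the coefficients in the hand calculation.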
Application exercise
Find your ae-03 repo in the course GitHub organization. If you do not see an ae-03 repo, click here to create one.
Wrap up
Recap
Used simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.
Used the least squares method to estimate the slope and intercept.
Interpreted the slope and intercept.
- Slope: For every one unit increase in \(x\), we expect \(y\) to change by \(\hat{\beta}_1\) units, on average.
- Intercept: If \(x\) is 0, then we expect \(y\) to be \(\hat{\beta}_0\) units, on average.
Predicted the response given a value of the predictor variable.
Used tidymodels to fit and summarize regression models in R.