SLR: Model evaluation

Prof. Eric Friedlander

Sep 18, 2024

Questions from last class?

Model conditions

  1. Linearity: There is a linear relationship between the outcome and predictor variables
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other
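These conditions are typically checked with residual diagnostics. A minimal sketch using the packages loaded below (it assumes a fitted model `heb_fit` like the one defined later in these slides); note that independence is usually assessed from the study design rather than a plot:

```r
# residual diagnostics for the model conditions (sketch)
library(broom)
library(ggformula)

heb_aug <- augment(heb_fit)  # adds .fitted and .resid columns

# linearity and constant variance: residuals vs. fitted values
# (look for no pattern and roughly even vertical spread)
gf_point(.resid ~ .fitted, data = heb_aug) |>
  gf_hline(yintercept = ~ 0, linetype = "dashed")

# normality: distribution of the residuals
gf_histogram(~ .resid, data = heb_aug)
```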

Warm-Up: Comparing inferential methods

  • What are the advantages of using simulation-based inference methods? What are the advantages of using inference methods based on mathematical models?

  • Under what scenario(s) would you prefer to use simulation-based methods? Under what scenario(s) would you prefer to use methods based on mathematical models?


Application exercise

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(patchwork)   # arrange plots

# HEB Dataset
heb <- read_csv("data/HEBIncome.csv") |> 
  mutate(Avg_Income_K = Avg_Household_Income/1000)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Regression model, revisited

heb_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)

tidy(heb_fit) |>
  kable(digits = 3)
term          estimate  std.error  statistic  p.value
(Intercept)    -14.719      9.298     -1.583    0.122
Avg_Income_K     0.959      0.128      7.505    0.000
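The fitted line can be used for prediction; a quick sketch (the income value of 80, i.e. $80,000, is purely illustrative, not a value from the data):

```r
# predicted Number_Organic at a hypothetical Avg_Income_K of 80
predict(heb_fit, newdata = data.frame(Avg_Income_K = 80))

# by hand from the coefficient table: -14.719 + 0.959 * 80 = 62.001
```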

Model evaluation

Partitioning Variability

Let's think about variation:

  • DATA = MODEL + ERROR
  • \(\substack{\text{Variation} \\ \text{in Y}} = \substack{\text{Variation explained} \\ \text{by model}} + \substack{\text{Variation not explained} \\ \text{by model}}\)

Partitioning Variability (ANOVA)

  • \(y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i-\hat{y}_i)\)
  • Square and sum: \(\sum(y_i-\bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i-\hat{y}_i)^2\)
  • \(\substack{\text{Sum of squares} \\ \text{Total}} = \substack{\text{Sum of squares} \\ \text{model}} + \substack{\text{Sum of squares} \\ \text{error}}\)
  • \(SSTotal = SSModel + SSE\)
  • \(SST = SSM + SSE\)

ANOVA in R

heb_fit |> 
  anova() |> 
  tidy() |> 
  kable()
term          df      sumsq      meansq  statistic  p.value
Avg_Income_K   1   17175.06  17175.0595   56.32026        0
Residuals     35   10673.37    304.9535         NA       NA
  • More on this later in the semester
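Adding the two rows of the sumsq column recovers the total variability in the outcome:

\[ SST = SSM + SSE = 17175.06 + 10673.37 = 27848.43 \]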

Recall: Correlation Coefficient

  • The correlation coefficient, \(r\), is a number between -1 and +1 that measures the strength of the linear relationship between two variables \(x\) and \(y\).

\[ r = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \]

Two statistics: \(R^2\)

  • R-squared, \(R^2\), the coefficient of determination: the percentage of variability in the outcome explained by the regression model (in the context of SLR, by the predictor) \[ R^2 = Cor(y, \hat{y})^2 \]
    • Also called PRE (Proportional Reduction in Error) because: \[ R^2 = \frac{SSModel}{SSTotal} \]
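Plugging in the sums of squares from the ANOVA table for this model gives:

\[ R^2 = \frac{SSModel}{SSTotal} = \frac{17175.06}{17175.06 + 10673.37} \approx 0.617 \]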

Two statistics: RMSE

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome) \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
    • Sometimes people just care about the numerator (SSE) or the version without the square root (MSE)
    • Sometimes the denominator may have \(n-1\) instead
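Using the SSE from the ANOVA table for this model (residual df of 35, so \(n = 37\) for this SLR):

\[ RMSE = \sqrt{\frac{10673.37}{37}} \approx 17.0 \]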

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)
  • Has no units
  • Calculate with rsq() from yardstick package using the augmented data:
library(yardstick)
heb_aug <- augment(heb_fit)

rsq(heb_aug, truth = Number_Organic, estimate = .fitted) |> kable()
.metric  .estimator  .estimate
rsq      standard    0.6167334
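As a sanity check, the definition \(R^2 = Cor(y, \hat{y})^2\) can be verified directly from the augmented data:

```r
# R^2 as the squared correlation between observed and fitted values
cor(heb_aug$Number_Organic, heb_aug$.fitted)^2
```

This should match the rsq() output above.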

Interpreting \(R^2\)

๐Ÿ—ณ๏ธ Discussion

The \(R^2\) of the model for Number_Organic from Avg_Income_K is 61.7%. Which of the following is the correct interpretation of this value?

  1. Avg_Income_K correctly predicts 61.7% of Number_Organic in San Antonio HEBs.
  2. 61.7% of the variability in Number_Organic can be explained by Avg_Income_K.
  3. 61.7% of the variability in Avg_Income_K can be explained by Number_Organic.
  4. 61.7% of the time Number_Organic can be predicted by Avg_Income_K.

Activity

In groups, at the board, design a simulation-based procedure for producing a p-value for the following hypothesis test.

  • \(H_0: R^2 = 0\)
  • \(H_A: R^2 \neq 0\)

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Calculate with rmse() from yardstick package using the augmented data:

rmse(heb_aug, truth = Number_Organic, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        17.0
  • The value of RMSE is not very meaningful on its own, but it's useful for comparing across models (more on this and ANOVA when we get to regression with multiple predictors)
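RMSE can also be computed by hand from the residuals in the augmented data; a minimal sketch:

```r
# manual RMSE: square root of the mean squared residual
sqrt(mean(heb_aug$.resid^2))
```

This should agree with the rmse() output above.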