SLR: Model evaluation

Prof. Eric Friedlander

Sep 18, 2024

Questions from last class?

Model conditions

  1. Linearity: There is a linear relationship between the outcome and predictor variables
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other
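These conditions are typically checked with residual diagnostics. A minimal sketch using the packages loaded below (it assumes a fitted model `heb_fit` like the one defined later in these slides); note that independence is usually assessed from the study design rather than a plot:

```r
# residual diagnostics for the model conditions (sketch)
library(broom)
library(ggformula)

heb_aug <- augment(heb_fit)  # adds .fitted and .resid columns

# linearity and constant variance: residuals vs. fitted values
# (look for no pattern and roughly even vertical spread)
gf_point(.resid ~ .fitted, data = heb_aug) |>
  gf_hline(yintercept = ~ 0, linetype = "dashed")

# normality: distribution of the residuals
gf_histogram(~ .resid, data = heb_aug)
```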

Warm-Up: Comparing inferential methods

  • What are the advantages of using simulation-based inference methods? What are the advantages of using inference methods based on mathematical models?

  • Under what scenario(s) would you prefer to use simulation-based methods? Under what scenario(s) would you prefer to use methods based on mathematical models?


Application exercise

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(patchwork)   # arrange plots

# HEB Dataset
heb <- read_csv("data/HEBIncome.csv") |> 
  mutate(Avg_Income_K = Avg_Household_Income/1000)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Regression model, revisited

heb_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)

tidy(heb_fit) |>
  kable(digits = 3)
term          estimate  std.error  statistic  p.value
(Intercept)    -14.719      9.298     -1.583    0.122
Avg_Income_K     0.959      0.128      7.505    0.000
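The fitted line can be used for prediction; a quick sketch (the income value of 80, i.e. $80,000, is purely illustrative, not a value from the data):

```r
# predicted Number_Organic at a hypothetical Avg_Income_K of 80
predict(heb_fit, newdata = data.frame(Avg_Income_K = 80))

# by hand from the coefficient table: -14.719 + 0.959 * 80 = 62.001
```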

Model evaluation

Partitioning Variability

Let's think about variation:

  • DATA = MODEL + ERROR
  • \(\substack{\text{Variation} \\ \text{in Y}} = \substack{\text{Variation explained} \\ \text{by model}} + \substack{\text{Variation not explained} \\ \text{by model}}\)

Partitioning Variability (ANOVA)

  • \(y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i-\hat{y}_i)\)
  • Square and sum: \(\sum(y_i-\bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i-\hat{y}_i)^2\)
  • \(\substack{\text{Sum of squares} \\ \text{Total}} = \substack{\text{Sum of squares} \\ \text{model}} + \substack{\text{Sum of squares} \\ \text{error}}\)
  • \(SSTotal = SSModel + SSE\)
  • \(SST = SSM + SSE\)

ANOVA in R

heb_fit |> 
  anova() |> 
  tidy() |> 
  kable()
term          df      sumsq      meansq  statistic  p.value
Avg_Income_K   1   17175.06  17175.0595   56.32026        0
Residuals     35   10673.37    304.9535         NA       NA
  • More on this later in the semester
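Adding the two rows of the sumsq column recovers the total variability in the outcome:

\[ SST = SSM + SSE = 17175.06 + 10673.37 = 27848.43 \]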

Recall: Correlation Coefficient

  • The correlation coefficient, \(r\), is a number between -1 and +1 that measures the strength of the linear relationship between two variables \(x\) and \(y\).

\[ r = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \]

Two statistics: \(R^2\)

  • R-squared, \(R^2\), the coefficient of determination: the percentage of variability in the outcome explained by the regression model (in the context of SLR, by the predictor) \[ R^2 = Cor(y, \hat{y})^2 \]
    • Also called PRE (Proportional Reduction in Error) because: \[ R^2 = \frac{SSModel}{SSTotal} \]
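Plugging in the sums of squares from the ANOVA table for this model gives:

\[ R^2 = \frac{SSModel}{SSTotal} = \frac{17175.06}{17175.06 + 10673.37} \approx 0.617 \]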

Two statistics: RMSE

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome) \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
    • Sometimes people just care about the numerator (SSE) or the version without the square root (MSE)
    • Sometimes the denominator may have \(n-1\) instead
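Using the SSE from the ANOVA table for this model (residual df of 35, so \(n = 37\) for this SLR):

\[ RMSE = \sqrt{\frac{10673.37}{37}} \approx 17.0 \]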

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)
  • Has no units
  • Calculate with rsq() from yardstick package using the augmented data:
library(yardstick)
heb_aug <- augment(heb_fit)

rsq(heb_aug, truth = Number_Organic, estimate = .fitted) |> kable()
.metric  .estimator  .estimate
rsq      standard    0.6167334
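As a sanity check, the definition \(R^2 = Cor(y, \hat{y})^2\) can be verified directly from the augmented data:

```r
# R^2 as the squared correlation between observed and fitted values
cor(heb_aug$Number_Organic, heb_aug$.fitted)^2
```

This should match the rsq() output above.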

Interpreting \(R^2\)

๐Ÿ—ณ๏ธ Discussion

The \(R^2\) of the model for Number_Organic from Avg_Income_K is 61.7%. Which of the following is the correct interpretation of this value?

  1. Avg_Income_K correctly predicts 61.7% of Number_Organic in San Antonio HEBs.
  2. 61.7% of the variability in Number_Organic can be explained by Avg_Income_K.
  3. 61.7% of the variability in Avg_Income_K can be explained by Number_Organic.
  4. 61.7% of the time Number_Organic can be predicted by Avg_Income_K.

Activity

In groups, at the board, design a simulation-based procedure for producing a p-value for the following hypothesis test.

  • \(H_0: R^2 = 0\)
  • \(H_A: R^2 \neq 0\)

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Calculate with rmse() from yardstick package using the augmented data:

rmse(heb_aug, truth = Number_Organic, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        17.0
  • The value of RMSE is not very meaningful on its own, but it's useful for comparing across models (more on this and ANOVA when we get to regression with multiple predictors)
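RMSE can also be computed by hand from the residuals in the augmented data; a minimal sketch:

```r
# manual RMSE: square root of the mean squared residual
sqrt(mean(heb_aug$.resid^2))
```

This should agree with the rmse() output above.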