MLR: Inference and conditions

Prof. Eric Friedlander

Oct 04, 2024

Announcements

  • Midterm next Friday 10/11 (right before fall break)
  • Project proposal also due 10/11 but will accept until 10/14 without penalty

Topics

  • Inference for multiple linear regression
  • Checking model conditions

Computational setup

# load packages
library(tidyverse)
library(broom)
library(mosaic)
library(ISLR2)
library(patchwork)
library(knitr)
library(kableExtra)
library(scales)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Inference for multiple linear regression

Data: Credit Cards

The data is from the Credit data set in the ISLR2 R package. It is a simulated data set of 400 credit card customers.

Rows: 400
Columns: 11
$ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
$ Limit     <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
$ Rating    <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
$ Cards     <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
$ Age       <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
$ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
$ Own       <fct> No, Yes, No, Yes, No, No, Yes, No, Yes, Yes, No, No, Yes, No…
$ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
$ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
$ Region    <fct> South, West, West, West, South, South, East, West, South, Ea…
$ Balance   <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…

Variables

Features (another name for predictors)

  • Income: Annual income (in $1,000s US)
  • Rating: Credit Rating

Outcome

  • Limit: Credit limit

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

gf_point(Limit ~ Rating, data = Credit, alpha = 0.5) |> 
  gf_lm()  |> 
  gf_labs(x = "Credit Rating", y = "Credit Limit") |> 
  gf_refine(scale_y_continuous(labels = dollar_format()))

SLR model summary

income_slr_fit <- lm(Limit ~ Income, data = Credit)

tidy(income_slr_fit) |> kable()
term         estimate    std.error   statistic  p.value
(Intercept)  2389.86941  114.828758  20.81246   0
Income       51.87502    2.003836    25.88785   0

SLR hypothesis test

term         estimate  std.error  statistic  p.value
(Intercept)  2389.87   114.83     20.81      0
Income       51.88     2.00       25.89      0

  1. Set hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_A: \beta_1 \ne 0\)
  2. Calculate test statistic and p-value: The test statistic is \(t = 25.89\). The p-value is calculated using a \(t\) distribution with \(n - 2 = 400 - 2 = 398\) degrees of freedom. The p-value is \(\approx 0\).
  3. State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that income is a helpful predictor of a credit card holder’s credit limit, i.e., there is a linear relationship between income and credit limit.
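The test statistic and p-value above can be recomputed directly from the estimate and standard error in the model summary. A base-R sketch using the rounded table values:

```r
# t statistic = estimate / standard error (values from the tidy() table)
t_stat <- 51.87502 / 2.003836
t_stat                               # approximately 25.89

# two-sided p-value from a t distribution with n - 2 = 398 df
p_val <- 2 * pt(-abs(t_stat), df = 398)
p_val                                # vanishingly small, reported as ~0
```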

Multiple linear regression

credit_fit <- lm(Limit ~ Rating + Income, data = Credit)

tidy(credit_fit) |> kable(digits = 2)
term estimate std.error statistic p.value
(Intercept) -532.47 24.17 -22.03 0.00
Rating 14.77 0.10 153.12 0.00
Income 0.56 0.42 1.32 0.19

Multiple linear regression

The multiple linear regression model assumes \[Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)\]

For a given observation \((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\), we can rewrite the previous statement as

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i}, \hspace{10mm} \epsilon_i \sim N(0,\sigma_{\epsilon}^2)\]
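One way to internalize this model statement is to simulate from it and check that lm() recovers the coefficients. A minimal sketch with \(p = 2\); all coefficient values below are made-up illustration numbers:

```r
set.seed(1)
n   <- 400
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 2)    # epsilon_i ~ N(0, sigma_eps^2)
y   <- 1 + 3 * x1 - 2 * x2 + eps     # beta_0 + beta_1 x_i1 + beta_2 x_i2 + eps_i

coef(lm(y ~ x1 + x2))                # estimates should land near (1, 3, -2)
```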

Estimating \(\sigma_\epsilon\)

For a given observation \((x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)\) the residual is \[ \begin{aligned} e_i &= y_{i} - \hat{y_i}\\ &= y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip}) \end{aligned} \]

The estimate of the regression standard error, \(\sigma_{\epsilon}\), is

\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n-p-1}}\]

As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE_{\hat{\beta}_j}\), the standard error of each coefficient. See Matrix Form of Linear Regression for more detail.
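Both formulas can be checked directly in R. A sketch using the built-in mtcars data (\(p = 2\) predictors), comparing the hand computations to what lm() reports:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# regression standard error by hand: sqrt(sum(e_i^2) / (n - p - 1))
e <- residuals(fit)
n <- nrow(mtcars)
p <- 2
sigma_hat <- sqrt(sum(e^2) / (n - p - 1))
all.equal(sigma_hat, sigma(fit))                    # TRUE

# coefficient standard errors from the matrix form:
# sqrt of the diagonal of sigma_hat^2 * (X'X)^{-1}
X  <- model.matrix(fit)
se <- sqrt(diag(sigma_hat^2 * solve(t(X) %*% X)))
all.equal(unname(se),
          unname(summary(fit)$coefficients[, "Std. Error"]))   # TRUE
```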

MLR hypothesis test: Income

  1. Set hypotheses: \(H_0: \beta_{Income} = 0\) vs. \(H_A: \beta_{Income} \ne 0\), given Rating is in the model
  2. Calculate test statistic and p-value: The test statistic is \(t = 1.32\). The p-value is calculated using a \(t\) distribution with \[(n - p - 1) = 400 - 2 - 1 = 397\] degrees of freedom. The p-value is \(\approx 0.19\).
  3. State the conclusion: The p-value is not small, so we fail to reject \(H_0\). The data do not provide convincing evidence that a borrower’s income is a useful predictor of credit limit in a model that already contains credit rating.
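As a check, the p-value for Income can be recovered directly from the t statistic reported in the table (base-R sketch):

```r
# two-sided p-value, t distribution with 400 - 2 - 1 = 397 df
2 * pt(-abs(1.32), df = 397)    # approximately 0.19
```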

Complete Exercises 1-2.

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

  • The \(C\%\) confidence interval for \(\beta_j\) is \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) is the critical value from a \(t\) distribution with \(n - p - 1\) degrees of freedom.

  • Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).

  • In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, holding all else constant.

Complete Exercise 3.

Confidence interval for \(\beta_j\)

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)  -532.47   24.17      -22.03     0.00     -579.99   -484.95
Rating       14.77     0.10       153.12     0.00     14.58     14.96
Income       0.56      0.42       1.32       0.19     -0.28     1.39
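The Income interval in the table can be reproduced from the CI formula. A sketch using the rounded estimate and standard error (so the endpoints may differ slightly in the last digit):

```r
t_star <- qt(0.975, df = 400 - 2 - 1)   # 95% critical value, 397 df
0.56 + c(-1, 1) * t_star * 0.42         # close to (-0.28, 1.39) above
```

In practice, tidy(credit_fit, conf.int = TRUE) computes these intervals from the unrounded estimates.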

Inference pitfalls

Large sample sizes

Caution

If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even if \(x_j\) has a very small effect on \(y\).

  • Consider the practical significance of the result, not just the statistical significance.

  • Use the confidence interval to draw conclusions instead of relying only on p-values.
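A quick simulation illustrates the point: with a very large \(n\), even a practically negligible slope produces a tiny p-value. (The slope of 0.01 below is a made-up illustration value.)

```r
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- 0.01 * x + rnorm(n)     # true slope is tiny relative to the noise

coefs <- summary(lm(y ~ x))$coefficients
coefs["x", ]                 # estimate near 0.01, p-value far below 0.05
```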

Small sample sizes

Caution

If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).

  • When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.

  • There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.
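Part of the issue is mechanical: with few observations, the \(t\) critical value itself is larger, so the same estimated slope needs a bigger test statistic to reject. A base-R sketch for an SLR test (df = n − 2):

```r
# 5% two-sided critical values
qt(0.975, df = 8)     # n = 10:  need |t| > 2.31
qt(0.975, df = 398)   # n = 400: need |t| > 1.97
```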

Complete Exercise 4.