Multiple linear regression (MLR)

Prof. Eric Friedlander

Sep 30, 2024

Trying something new

Open but do not start AE-10
Notify Dr. F when you’re ready to proceed
The goal is to integrate the activity into the lecture (keep track of how you like this approach and give Dr. F feedback at the end of class)
While you’re waiting feel free to start on Exercise 0

Computational setup

# load packages
library(tidyverse)
library(broom)
library(mosaic)
library(ISLR2)
library(patchwork)
library(knitr)
library(kableExtra)
library(scales)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Considering multiple variables

Data: Credit Cards

The data is from the Credit data set in the ISLR2 R package. It is a simulated data set of 400 credit card customers.

Rows: 400
Columns: 11
$ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
$ Limit     <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
$ Rating    <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
$ Cards     <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
$ Age       <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
$ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
$ Own       <fct> No, Yes, No, Yes, No, No, Yes, No, Yes, Yes, No, No, Yes, No…
$ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
$ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
$ Region    <fct> South, West, West, West, South, South, East, West, South, Ea…
$ Balance   <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
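
A column-by-column listing like the one above is the kind of output `glimpse()` prints. A minimal sketch, assuming the packages from the computational setup are installed:

```r
library(tidyverse)  # glimpse() comes from dplyr, loaded with the tidyverse
library(ISLR2)      # provides the Credit data set

# Print one line per column: name, type, and the first few values
glimpse(Credit)
```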

Variables

Features (another name for predictors)

Income: Annual income (in thousands of US dollars)
Rating: Credit Rating

Outcome

Limit: Credit limit

Complete Exercises 0-1. Please don’t look ahead in the slides.

Outcome: `Limit`

Credit |> 
  gf_density(~Limit, fill = "steelblue") |> 
  gf_labs(title = "Distribution of credit limit",
          x = "Credit Limit") |> 
  gf_refine(scale_x_continuous(labels = dollar_format()))

	min	Q1	median	Q3	max	mean	sd	n	missing
	855	3088	4622.5	5872.75	13913	4735.6	2308.199	400	0
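
The column names in the summary above (min, Q1, median, ..., missing) match what `favstats()` from the mosaic package returns, so the table was probably produced along these lines. This is a sketch under that assumption; `limit_summary` is a name introduced here for illustration.

```r
library(mosaic)  # provides favstats()
library(ISLR2)   # provides the Credit data set

# Five-number summary plus mean, sd, n, and missing count for Limit
limit_summary <- favstats(~Limit, data = Credit)
limit_summary
```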

Predictors

p1 <- Credit |> 
  gf_density(~Limit, fill = "steelblue") |> 
  gf_labs(title = "Distribution of credit limit",
          x = "Credit Limit") |> 
  gf_refine(scale_x_continuous(labels = dollar_format()))

p2 <- Credit |> 
  gf_histogram(~Rating, binwidth = 50) |> 
  gf_labs(title = "",
       x = "Credit Rating")

p3 <- Credit |> 
  gf_histogram(~Income, binwidth = 10) |> 
  gf_labs(title = "",
       x = "Annual income (in $1,000s)") |> 
  gf_refine(scale_x_continuous(labels = dollar_format()))

p1 / (p2 + p3)

Outcome vs. predictors

p4 <- Credit |> 
  gf_point(Limit ~ Rating, color = "steelblue") |> 
  gf_labs(
    y = "Credit Limit",
    x = "Credit Rating"
  ) |> 
  gf_refine(scale_y_continuous(labels = dollar_format()))


p5 <- Credit |> 
  gf_point(Limit ~ Income, color = "steelblue") |> 
  gf_labs(
    y = "Credit Limit",
    x = "Annual income (in $1,000s)"
  ) |> 
  gf_refine(scale_x_continuous(labels = dollar_format()),
            scale_y_continuous(labels = dollar_format()))

p4 / p5

Single vs. multiple predictors

So far we’ve used a single predictor variable to understand variation in a quantitative response variable

Now we want to use multiple predictor variables to understand variation in a quantitative response variable

Multiple linear regression

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

\[ \begin{aligned}\hat{\text{Limit}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{Rating} + \hat{\beta}_2 \text{Income} \end{aligned} \]

Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values of Limit follow a Normal distribution.

Multiple linear regression

Recall: The simple linear regression model assumes

\[ Y|X\sim N(\beta_0 + \beta_1 X, \sigma_{\epsilon}^2) \]

Similarly: The multiple linear regression model assumes

\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_{\epsilon}^2) \]

Multiple linear regression

At any combination of the predictors, the mean value of the response $Y$ is

\[ \mu_{Y|X_1, \ldots, X_p} = \beta_0 + \beta_1 X_{1} + \beta_2 X_2 + \dots + \beta_p X_p \]

Using multiple linear regression, we can estimate the mean response for any combination of predictors

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1} + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_{p} \]
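
To make the model concrete, here is a small simulation sketch. It is not part of the Credit analysis: the coefficient values, sample size, and variable names (`x1`, `x2`, `beta`, `sim_fit`) are all made up for illustration. It draws $Y$ from a Normal distribution whose mean is a linear combination of two predictors, then checks that `lm()` recovers coefficients close to the ones used to generate the data.

```r
set.seed(1)

# Hypothetical "true" parameters, chosen only for this illustration
n     <- 500
beta  <- c(b0 = 2, b1 = 1.5, b2 = -0.8)
sigma <- 1

# Two predictors and the mean response at each combination of them
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
mu <- beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2

# Y | X1, X2 ~ N(mu, sigma^2)
y <- rnorm(n, mean = mu, sd = sigma)

# Estimate the coefficients; they should land near beta
sim_fit <- lm(y ~ x1 + x2)
coef(sim_fit)
```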

Model fit

Complete exercises 2-4 in the application exercise. Don’t look ahead!

Model fit

term	estimate	std.error	statistic	p.value
(Intercept)	-532.471	24.173	-22.028	0.000
Rating	14.771	0.096	153.124	0.000
Income	0.557	0.423	1.316	0.189
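
One way to obtain a table like this is sketched below. The object name `lim_fit` is the one used later in these slides; the exact call that produced the table is not shown, so treat the formula and the use of `tidy()` as an assumption.

```r
library(broom)  # provides tidy()
library(ISLR2)  # provides the Credit data set

# Fit Limit as a linear function of Rating and Income
lim_fit <- lm(Limit ~ Rating + Income, data = Credit)

# One row per term: estimate, std.error, statistic, p.value
tidy(lim_fit)
```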

Model equation

\[ \begin{aligned}\hat{\text{Limit}} = -532.471 &+ 14.771 \times \text{Rating}\\ &+ 0.557 \times \text{Income} \end{aligned} \]

Interpreting $\hat{\beta}_j$

The estimated coefficient $\hat{\beta}_j$ is the expected change in the mean of $y$ when $x_j$ increases by one unit, holding the values of all other predictor variables constant.
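
As a quick numeric check of this interpretation, the sketch below uses the rounded estimates from the model-fit table and compares two predictions that differ only by one unit of Rating. The helper `predict_limit()` is a hypothetical function written for this example, not something from a package.

```r
# Rounded estimates copied from the model-fit table
b0       <- -532.471
b_rating <- 14.771
b_income <- 0.557

# Hypothetical helper: predicted Limit from the rounded coefficients
predict_limit <- function(rating, income) {
  b0 + b_rating * rating + b_income * income
}

# Same income (in $1,000s), ratings one unit apart
diff_rating <- predict_limit(701, 59) - predict_limit(700, 59)
diff_rating
```

The difference equals the Rating coefficient (up to floating-point error), whatever value Income is held at.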

Complete Exercises 5-6.

Prediction

What is the predicted credit limit for a borrower with a credit rating of 700 and an annual income of $59,000?

# Income is recorded in $1,000s, so $59,000 enters as 59
-532.471 + 14.771 * 700 + 0.557 * 59

[1] 9840.092

The predicted credit limit for a borrower with a credit rating of 700 and an annual income of $59,000 is $9,840.09.

Prediction, revisited

Just like with simple linear regression, we can use the predict function in R to calculate the appropriate intervals for our predicted values:

       fit      lwr      upr
1 9840.213 9476.983 10203.44

Note

The predicted value differs slightly from the previous slide because the coefficients used there were rounded.
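
The call behind this output is not shown on the slide. A plausible sketch follows, assuming `lim_fit` is the model of Limit on Rating and Income and `new_borrower` is a one-row data frame (both names appear later in the slides, but their definitions are assumptions here). With no `level` argument, `predict()` uses its default of 0.95.

```r
library(ISLR2)  # provides the Credit data set

lim_fit <- lm(Limit ~ Rating + Income, data = Credit)

# Column names must match the predictors; Income is in $1,000s
new_borrower <- data.frame(Rating = 700, Income = 59)

# Point prediction with lower and upper bounds of a 95% prediction interval
pred_95 <- predict(lim_fit, new_borrower, interval = "prediction")
pred_95
```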

Prediction interval for $\hat{y}$

Calculate a 90% prediction interval for the credit limit of an individual borrower with a credit rating of 700 and an annual income of $59,000.

predict(lim_fit, new_borrower, interval = "prediction", level = 0.90)

       fit      lwr      upr
1 9840.213 9535.599 10144.83

When would you use "confidence"? Would the interval be wider or narrower?
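
To explore that question, the sketch below computes both kinds of interval at the same level and compares their widths. A confidence interval targets the mean credit limit of all borrowers with these predictor values, while a prediction interval targets one individual borrower, so it must also account for individual variation around the mean. The definitions of `lim_fit` and `new_borrower` are assumed, as they are not shown on the slides.

```r
library(ISLR2)  # provides the Credit data set

lim_fit      <- lm(Limit ~ Rating + Income, data = Credit)
new_borrower <- data.frame(Rating = 700, Income = 59)

# Interval for one individual borrower's limit
pi_90 <- predict(lim_fit, new_borrower, interval = "prediction", level = 0.90)
# Interval for the mean limit of all borrowers with these values
ci_90 <- predict(lim_fit, new_borrower, interval = "confidence", level = 0.90)

# Compare the interval widths
width_pi <- unname(pi_90[1, "upr"] - pi_90[1, "lwr"])
width_ci <- unname(ci_90[1, "upr"] - ci_90[1, "lwr"])
c(prediction = width_pi, confidence = width_ci)
```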

Cautions

Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
The multiple regression model only shows association, not causality
- To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study
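
A rough first check against extrapolation is to confirm that each predictor value for a new observation lies within the range seen in the data. The sketch below does this for the borrower used earlier; `new_borrower` and `in_range` are illustrative names. Note the limitation: a point can sit inside every individual range and still be far from the observed combinations of predictors, so this check is necessary but not sufficient.

```r
library(ISLR2)  # provides the Credit data set

new_borrower <- data.frame(Rating = 700, Income = 59)

# TRUE if the new value lies within the observed range of that predictor
in_range <- c(
  Rating = new_borrower$Rating >= min(Credit$Rating) &&
           new_borrower$Rating <= max(Credit$Rating),
  Income = new_borrower$Income >= min(Credit$Income) &&
           new_borrower$Income <= max(Credit$Income)
)
in_range
```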
