Sep 30, 2024
The data is from the Credit data set in the ISLR2 R package. It is a simulated data set of 400 credit card customers.
Rows: 400
Columns: 11
$ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
$ Limit <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
$ Rating <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
$ Cards <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
$ Age <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
$ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
$ Own <fct> No, Yes, No, Yes, No, No, Yes, No, Yes, Yes, No, No, Yes, No…
$ Student <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
$ Married <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
$ Region <fct> South, West, West, West, South, South, East, West, South, Ea…
$ Balance <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
Features (another name for predictors)
Income: Annual income (in thousands of US dollars)
Rating: Credit rating
Outcome
Limit: Credit limit
Complete Exercises 0-1. Please don’t look ahead in the slides.
Limit

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 855 | 3088 | 4622.5 | 5872.75 | 13913 | 4735.6 | 2308.199 | 400 | 0 |
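library(ISLR2)      # Credit data set
library(ggformula)  # gf_density(), gf_histogram(), gf_point(), gf_labs(), gf_refine()
library(scales)     # dollar_format()
library(patchwork)  # combine plots with / and +
# Density of the response, credit limit
p1 <- Credit |>
  gf_density(~Limit, fill = "steelblue") |>
  gf_labs(title = "Distribution of credit limit",
          x = "Credit Limit") |>
  gf_refine(scale_x_continuous(labels = dollar_format()))
# Histograms of the two predictors
p2 <- Credit |>
  gf_histogram(~Rating, binwidth = 50) |>
  gf_labs(title = "",
          x = "Credit Rating")
p3 <- Credit |>
  gf_histogram(~Income, binwidth = 10) |>
  gf_labs(title = "",
          x = "Annual income (in $1,000s)") |>
  gf_refine(scale_x_continuous(labels = dollar_format()))
# Response on top, predictors side by side underneath
p1 / (p2 + p3)
# Credit limit vs. credit rating
p4 <- Credit |>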
  gf_point(Limit ~ Rating, color = "steelblue") |>
  gf_labs(
    y = "Credit Limit",
    x = "Credit Rating"
  ) |>
  gf_refine(scale_y_continuous(labels = dollar_format()))
# Credit limit vs. annual income
p5 <- Credit |>
  gf_point(Limit ~ Income, color = "steelblue") |>
  gf_labs(
    y = "Credit Limit",
    x = "Annual income (in $1,000s)"
  ) |>
  # Both axes are dollar amounts, so both use dollar_format()
  gf_refine(scale_x_continuous(labels = dollar_format()),
            scale_y_continuous(labels = dollar_format()))
# Stack the two scatterplots
p4 / p5
So far we’ve used a single predictor variable to understand variation in a quantitative response variable
Now we want to use multiple predictor variables to understand variation in a quantitative response variable
Based on the analysis goals, we will use a multiple linear regression model of the following form
\[ \begin{aligned}\hat{\text{Limit}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{Rating} + \hat{\beta}_2 \text{Income} \end{aligned} \]
Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values of Limit follow a Normal distribution.
Recall: The simple linear regression model assumes
\[ Y|X\sim N(\beta_0 + \beta_1 X, \sigma_{\epsilon}^2) \]
Similarly: The multiple linear regression model assumes
\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_{\epsilon}^2) \]
At any combination of the predictors, the mean value of the response \(Y\) is
\[ \mu_{Y|X_1, \ldots, X_p} = \beta_0 + \beta_1 X_{1} + \beta_2 X_2 + \dots + \beta_p X_p \]
Using multiple linear regression, we can estimate the mean response for any combination of predictors
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1} + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_{p} \]
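To make the model assumption and the fitted equation concrete, here is a minimal sketch on simulated data. The coefficient values, sample size, seed, and the names `sim` and `fit_sim` are all made up for illustration and are not part of the Credit analysis. It draws responses from a Normal distribution whose mean is linear in two predictors, fits the model with `lm()`, and checks that plugging the estimated coefficients into the formula gives the same value as `predict()`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulate n observations from Y | X1, X2 ~ N(2 + 3*X1 - 1.5*X2, 1)
n  <- 500
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, mean = 0, sd = 1)
sim <- data.frame(y, x1, x2)

# Least-squares estimates of beta_0, beta_1, beta_2
fit_sim <- lm(y ~ x1 + x2, data = sim)
b <- coef(fit_sim)

# Estimated mean response at x1 = 4, x2 = 2, computed from the formula
by_hand <- unname(b["(Intercept)"] + b["x1"] * 4 + b["x2"] * 2)

# The same quantity from predict()
by_predict <- unname(predict(fit_sim, newdata = data.frame(x1 = 4, x2 = 2)))

all.equal(by_hand, by_predict)
```

The true mean at this combination is 2 + 3(4) - 1.5(2) = 11, so the estimate should land close to that value.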
Complete exercises 2-4 in the application exercise. Don’t look ahead!
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -532.471 | 24.173 | -22.028 | 0.000 |
| Rating | 14.771 | 0.096 | 153.124 | 0.000 |
| Income | 0.557 | 0.423 | 1.316 | 0.189 |
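The slides do not show the code behind this table. A table of this shape could be produced roughly as follows, assuming the `Credit` data from `ISLR2` and `tidy()` from the `broom` package; the object name `limit_fit` is hypothetical.

```r
library(ISLR2)  # Credit data set
library(broom)  # tidy() returns the coefficient table as a data frame

# Regress credit limit on credit rating and annual income (in $1,000s)
limit_fit <- lm(Limit ~ Rating + Income, data = Credit)

# One row per term: estimate, std.error, statistic, p.value
tidy(limit_fit)
```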
\[ \begin{aligned}\hat{\text{Limit}} = -532.471 &+ 14.771 \times \text{Rating}\\ &+ 0.557 \times \text{Income} \end{aligned} \]
Complete Exercises 5-6.
What is the predicted credit limit for a borrower with a credit rating of 700 and an annual income of $59,000?
Income is measured in thousands of dollars, so Income = 59. The predicted credit limit for a borrower with a credit rating of 700 and an annual income of $59,000 is $9,840.09.
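One way to check this kind of calculation is to do the arithmetic directly in R. The sketch below uses the rounded coefficients from the regression table; the variable names are made up for illustration.

```r
# Rounded coefficients copied from the regression table
b0       <- -532.471
b_rating <- 14.771
b_income <- 0.557

# Income is recorded in thousands of dollars, so $59,000 enters as 59
rating <- 700
income <- 59

# Plug the values into the fitted equation
predicted_limit <- b0 + b_rating * rating + b_income * income
predicted_limit
```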
Just like with simple linear regression, we can use the predict function in R to calculate the appropriate intervals for our predicted values:
fit lwr upr
1 9840.213 9476.983 10203.44
Note
The predicted value differs from the hand calculation because the coefficients on the previous slide were rounded.
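The call that produced this output is not shown in the slides. The interval is consistent with a 95% prediction interval from `predict()`, so a sketch under that assumption looks like this (`limit_fit`, `new_borrower`, and `pred_95` are hypothetical names):

```r
library(ISLR2)  # Credit data set

limit_fit <- lm(Limit ~ Rating + Income, data = Credit)

# Income is in $1,000s, so $59,000 is entered as 59
new_borrower <- data.frame(Rating = 700, Income = 59)

# interval = "prediction" gives an interval for one borrower's credit limit;
# level defaults to 0.95
pred_95 <- predict(limit_fit, newdata = new_borrower, interval = "prediction")
pred_95
```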
Calculate a 90% prediction interval for the credit limit of an individual borrower with a credit rating of 700 and an annual income of $59,000.
fit lwr upr
1 9840.213 9535.599 10144.83
When would you use "confidence"? Would the interval be wider or narrower?
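To explore this question, the two interval types can be computed side by side. This is a sketch, assuming the same model and borrower as above; the object names are hypothetical. `interval = "prediction"` targets one borrower's credit limit, while `interval = "confidence"` targets the mean credit limit of all borrowers with these predictor values.

```r
library(ISLR2)  # Credit data set

limit_fit    <- lm(Limit ~ Rating + Income, data = Credit)
new_borrower <- data.frame(Rating = 700, Income = 59)

# 90% interval for a single borrower's credit limit
pred_int <- predict(limit_fit, newdata = new_borrower,
                    interval = "prediction", level = 0.90)

# 90% interval for the mean credit limit at these predictor values
conf_int <- predict(limit_fit, newdata = new_borrower,
                    interval = "confidence", level = 0.90)

# Compare interval widths (upr - lwr)
c(prediction = pred_int[1, "upr"] - pred_int[1, "lwr"],
  confidence = conf_int[1, "upr"] - conf_int[1, "lwr"])
```

The prediction interval accounts for the variability of individual borrowers around the mean in addition to the uncertainty in the estimated mean, which is why its width differs from that of the confidence interval.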