Oct 04, 2024
The data is from the Credit data set in the ISLR2 R package. It is a simulated data set of 400 credit card customers.
Rows: 400
Columns: 11
$ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
$ Limit <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
$ Rating <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
$ Cards <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
$ Age <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
$ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
$ Own <fct> No, Yes, No, Yes, No, No, Yes, No, Yes, Yes, No, No, Yes, No…
$ Student <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
$ Married <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
$ Region <fct> South, West, West, West, South, South, East, West, South, Ea…
$ Balance <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
Features (another name for predictors)

- Income: Annual income (in $1,000s of US dollars)
- Rating: Credit rating

Outcome

- Limit: Credit limit

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2389.87 | 114.83 | 20.81 | 0 |
| Income | 51.88 | 2.00 | 25.89 | 0 |
The multiple linear regression model assumes \[Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)\]
For a given observation \((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\), we can rewrite the previous statement as
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i}, \hspace{10mm} \epsilon_i \sim N(0,\sigma_{\epsilon}^2)\]
For a given observation \((x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)\) the residual is \[ \begin{aligned} e_i &= y_{i} - \hat{y_i}\\ &= y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip}) \end{aligned} \]
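The notes use R, but the residual calculation above can be sketched in Python with NumPy on simulated data (the sample size, true coefficients, and error scale below are illustrative assumptions, not values from the Credit data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate n observations from the MLR model with p = 2 predictors
n, p = 200, 2
beta_true = np.array([2.0, 3.0, -1.5])  # beta_0, beta_1, beta_2 (made up)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ beta_true + rng.normal(scale=2.0, size=n)           # epsilon_i ~ N(0, 2^2)

# Least-squares estimates beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: e_i = y_i - y_hat_i
y_hat = X @ beta_hat
e = y - y_hat

# Because the model includes an intercept, the residuals sum to
# (numerically) zero -- a consequence of the normal equations
print(e.sum())
```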
The estimated value of the regression standard error, \(\sigma_{\epsilon}\), is
\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n-p-1}}\]
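As a sketch of this formula (again in Python on simulated data, not the Credit data; the true \(\sigma_\epsilon = 2\) below is an assumption chosen so the estimate can be checked), \(\hat{\sigma}_\epsilon\) can be computed directly from the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values: large n so sigma_hat should land near the truth
n, p, sigma_true = 5000, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# sigma_hat = sqrt( sum(e_i^2) / (n - p - 1) )
sigma_hat = np.sqrt(np.sum(e**2) / (n - p - 1))
print(sigma_hat)  # should be close to sigma_true = 2 for this large n
```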
As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE_{\hat{\beta}_j}\), the standard error of each coefficient. See Matrix Form of Linear Regression for more detail.
Rating is in the model.

Complete Exercises 1-2.
The \(C\%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
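Continuing the same simulated Python setup (scipy is assumed to be available for the \(t\) quantile), the interval \(\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\) can be formed by hand, with the standard errors taken from the diagonal of \(\hat{\sigma}_\epsilon^2 (X^TX)^{-1}\) as in the matrix form referenced earlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Illustrative simulated data (not the Credit data)
n, p = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat
df = n - p - 1
sigma_hat = np.sqrt(e @ e / df)

# SE(beta_hat_j): square root of the diagonal of sigma_hat^2 (X'X)^{-1}
se = sigma_hat * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

# 95% CI: beta_hat_j +/- t* SE, with t* from a t distribution on n - p - 1 df
t_star = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_star * se, beta_hat + t_star * se])
print(ci)  # one (lower, upper) row per coefficient
```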
Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).
In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, holding all else constant.
Complete Exercise 3.
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -532.47 | 24.17 | -22.03 | 0.00 | -579.99 | -484.95 |
| Rating | 14.77 | 0.10 | 153.12 | 0.00 | 14.58 | 14.96 |
| Income | 0.56 | 0.42 | 1.32 | 0.19 | -0.28 | 1.39 |
Caution
If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even when \(x_j\) has a very small effect on \(y\).
Consider the practical significance of the result, not just the statistical significance.
Use the confidence interval to draw conclusions instead of relying only on p-values.
Caution
If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).
When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.
There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.
Complete Exercise 4.