Exam 02 Review

Published

November 25, 2024

Important

Imagine that this is printed out for you and all answers are given in written form.

Note: This in-class review is not exhaustive. Use lecture notes, application exercises, and homework for a comprehensive exam review. It’s length is also not indicative of the length of the exam.

Packages

library(tidyverse)
library(ggformula)
library(Stat2Data)
library(broom)
library(knitr)
library(rms)

Data

The data for this analysis is about credit card customers. The following variables are in the data set:

  • income: Income in $1,000’s
  • limit: Credit limit
  • rating: Credit rating
  • cards: Number of credit cards
  • age: Age in years
  • education: Number of years of education
  • own: A factor with levels No and Yes indicating whether the individual owns their home
  • student: A factor with levels No and Yes indicating whether the individual was a student
  • married: A factor with levels No and Yes indicating whether the individual was married
  • region: A factor with levels South, East, and West indicating the region of the US the individual is from
  • balance: Average credit card balance in $.
credit <- read_csv("data/credit.csv") |>
  mutate(maxed = if_else(balance == 0, 1, 0),
         student = as.factor(student))

Part 1: Linear Regression

The objective of this analysis is to predict a persons average card balance.

Exercise 1

Consider the following models:

model1 <- lm(balance ~ income + limit + age + rating, data = credit)


vif(model1)
    income      limit        age     rating 
  2.754597 161.397384   1.036651 160.830635 

What does VIF stand for? How do you use it? What conclusions can you draw from the output above.

Exercise 2

model2 <- lm(balance ~ income + limit + student + limit*student, data = credit)

tidy(model2) |>  kable()
term estimate std.error statistic p.value
(Intercept) -419.2675685 12.5863697 -33.311239 0.00e+00
income -7.9475646 0.2386001 -33.309145 0.00e+00
limit 0.2652167 0.0036815 72.040373 0.00e+00
studentYes 275.1225320 40.4303482 6.804852 0.00e+00
limit:studentYes 0.0323603 0.0078077 4.144661 4.17e-05

Describe and interpret the interaction term from the above model. Be sure to give the value of the coefficient an describe what it means.

Exercise 3

Consider the following analysis:

model3 <- lm(balance ~ income + limit + age + rating + limit*student, data = credit)

anova(model2, model3)
Analysis of Variance Table

Model 1: balance ~ income + limit + student + limit * student
Model 2: balance ~ income + limit + age + rating + limit * student
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    395 4137079                                  
2    393 3832554  2    304525 15.613 2.986e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Describe the test above. Write down the null and alternative hypotheses, test statistic, p-value, and interpret the results.

Exercise 4

Based on the previous question and the output below, which model is better? Cite at least three forms of evidence.

glance(model2) |> kable()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.9509476 0.9504508 102.3407 1914.402 0 4 -2416.383 4844.765 4868.714 4137079 395 400
glance(model3) |> kable()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.9545582 0.9538645 98.75245 1375.905 0 6 -2401.091 4818.182 4850.114 3832554 393 400

Part 2: Logistic Regression

The objective of this analysis is to predict whether a person has maxed out their credit card, i.e., had $0 average card balance.

Exercise 1

  • Why is logistic regression the best modeling approach for this analysis?

Exercise 2

# A tibble: 2 × 2
  maxed     n
  <dbl> <int>
1     0   310
2     1    90

Let \(\pi\) represent the probability that a person has maxed out their credit card.

  • Compute the empirical probability
  • Compute the empirical odds
  • Compute the empirical log odds

Exercise 3

Consider the following code and output:

credit_rec <- glm(maxed ~ income + age + student, 
                     data = credit, family = "binomial")
credit_rec |> tidy() |> kable()
term estimate std.error statistic p.value
(Intercept) -0.3331885 0.4604783 -0.7235705 0.4693295
income -0.0325537 0.0068189 -4.7740540 0.0000018
age 0.0075144 0.0075360 0.9971344 0.3186993
studentYes -2.6142983 1.0264655 -2.5468935 0.0108687
  • Write out the formula for the probability of “success” that corresponds to this model.
  • Write out the formula for the log-odds of “success” that corresponds to this model.

Exercise 4

What condition of logistic regression can be assessed with this plot. Does it appear that that condition is satisfied?

emplogitplot1(maxed ~ income, data = credit, ngroups = 10)

Exercise 5

What is wrong with the procedure below?

model1 <- glm(maxed ~ income + age + student, data = credit, family = "binomial")
model2 <- glm(maxed ~ income + rating + limit + age, data = credit, family = "binomial")

anova(model1, model2, test = "Chisq")
Analysis of Deviance Table

Model 1: maxed ~ income + age + student
Model 2: maxed ~ income + rating + limit + age
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       396     375.13                          
2       395     107.36  1   267.76 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 6

Consider income below. Interpret the coefficient of income. There are two different answers I will accept here. Once this is done, interpret the p-value associated with income. Make sure to:

  • State the null and alternative hypotheses
  • Identify the test statistic
  • State the distribution used to calculate the p-value
  • State the conclusion of the test at a significance level of \(\alpha = 0.01\)
model2 |> tidy()
# A tibble: 5 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)  9.02      1.65        5.46  0.0000000482
2 income       0.111     0.0210      5.28  0.000000126 
3 rating      -0.0316    0.0217     -1.46  0.145       
4 limit       -0.00172   0.00145    -1.19  0.235       
5 age         -0.00383   0.0147     -0.261 0.794