Categorical Predictors

Prof. Eric Friedlander

Oct 07, 2024

Announcements & Getting Started

  • Exam Review Wednesday
  • Exam Friday
  • Project proposals due Friday!
    • Accepted without penalty through Monday 10/14 at 11:59pm
  • Don’t forget about Oral R Quiz!
    • Happy to administer over Teams during the break if you’d like…

📋 AE 12 - P2P Loans

  • Open up AE 12
  • Complete Exercise 0 if you have time.

Categorical predictors

Topics

  • Understanding categorical predictors
  • Understanding how categorical predictors interact with quantitative predictors

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(mosaic)
library(broom)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

Data

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 3
   annual_income verified_income interest_rate
           <dbl> <fct>                   <dbl>
 1         59000 Not Verified            10.9 
 2         60000 Not Verified             9.92
 3         75000 Verified                26.3 
 4         75000 Not Verified             9.92
 5        254000 Not Verified             9.43
 6         67000 Source Verified          9.92
 7         28800 Source Verified         17.1 
 8         80000 Not Verified             6.08
 9         34000 Not Verified             7.97
10         80000 Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

  • annual_income: Annual income
  • verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Response: interest_rate: Interest rate for the loan

Response: interest_rate

 min  median   max    iqr
5.31    9.93  26.3  5.755
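
A summary like the one above can be reproduced along the following lines. This is only a sketch using dplyr's summarise(); the slide's own code is not shown in the source and may have used a different helper, and the object name rate_summary is illustrative.

```r
library(dplyr)
library(openintro)

# Summary statistics for the response variable, interest_rate
rate_summary <- loan50 |>
  summarise(
    min    = min(interest_rate),
    median = median(interest_rate),
    max    = max(interest_rate),
    iqr    = IQR(interest_rate)
  )

rate_summary
```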

Predictors

p1 <- loan50 |> 
  gf_bar(verified_income~.) |> 
  gf_labs(title = "Verified Income", 
       y = "")

p2 <- loan50 |> 
  gf_histogram(~annual_income, binwidth = 20000) |> 
  gf_labs(title = "",
       x = "Annual income")

p1 / p2

Data manipulation: Rescale income

loan50 <- loan50 |>
  mutate(annual_income_k = annual_income / 1000)

loan50 |> 
  gf_histogram(~annual_income_k, binwidth = 20) |> 
  gf_labs(title = "Annual income (in $1000s)", x = "")

Categorical predictor variables

Complete Exercises 1 and 2.

Indicator variables

  • Suppose there is a categorical variable with \(K\) categories (levels)

  • We can make \(K\) indicator variables - one indicator for each category

  • An indicator variable takes values 1 or 0

    • 1 if the observation belongs to that category
    • 0 if the observation does not belong to that category

Data manipulation: Create indicator variables for verified_income

loan50 <- loan50 |>
  mutate(
    not_verified = if_else(verified_income == "Not Verified", 1, 0),
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0)
  )

loan50 |>
  select(verified_income, not_verified, source_verified, verified) |>
  slice(1, 3, 6)
# A tibble: 3 × 4
  verified_income not_verified source_verified verified
  <fct>                  <dbl>           <dbl>    <dbl>
1 Not Verified               1               0        0
2 Verified                   0               0        1
3 Source Verified            0               1        0

Indicators in the model

  • We will use \(K-1\) of the indicator variables in the model.
  • The reference level or baseline is the category that doesn’t have a term in the model.
  • The coefficient of each indicator variable is interpreted as the expected difference in the response between that category and the baseline, holding all other variables constant.
  • This approach is also called dummy coding, and R will do it for you.
loan50 |>
  select(verified_income, source_verified, verified) |>
  slice(1, 3, 6)
# A tibble: 3 × 3
  verified_income source_verified verified
  <fct>                     <dbl>    <dbl>
1 Not Verified                  0        0
2 Verified                      0        1
3 Source Verified               1        0
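
To see the dummy coding R applies on its own, one option is base R's model.matrix(), which builds the design matrix implied by a model formula. A minimal sketch, assuming the loan50 data from openintro; the object name X is only for illustration.

```r
library(openintro)

# Design matrix implied by the formula: an intercept column plus
# K - 1 = 2 indicator columns (the first factor level is the baseline)
X <- model.matrix(~ verified_income, data = loan50)

# The same three loans shown above, one from each category
X[c(1, 3, 6), ]
```

No column is created for the baseline level: a loan in that category has 0 in both indicator columns.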

Application Exercise

Complete Exercises 3 & 4

Interpreting verified_income

term                            estimate  std.error  statistic  p.value
(Intercept)                        9.541      1.006      9.487    0.000
verified_incomeSource Verified     2.224      1.440      1.544    0.129
verified_incomeVerified            6.312      1.836      3.437    0.001
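
A table of this form can be produced by fitting the model with lm() and tidying the result with broom. This is a sketch rather than the slide's own code, which is not shown in the source; the object name vi_fit is illustrative.

```r
library(openintro)
library(broom)

# Regress interest rate on the categorical predictor; R builds the
# indicator variables and treats the first factor level as the baseline
vi_fit <- lm(interest_rate ~ verified_income, data = loan50)

# One row per term: estimate, std.error, statistic, p.value
tidy(vi_fit)
```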

  • Where do we see each of the estimates in the plot?
  • Where do we see the values we’d predict in the plot?
  • Are verified_income and interest_rate correlated?

Model equation

\[ \begin{aligned}\hat{\text{interest_rate}} = 9.541 &+ 2.224 \times \text{source_verified}\\ &+ 6.312 \times \text{verified} \end{aligned} \]
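
Setting the indicators to 0 or 1 gives one predicted rate per category (for example, 9.541 + 2.224 = 11.765 for Source Verified). With a single categorical predictor, these predictions coincide with the group means. The sketch below checks this; the names vi_fit, new_loans, preds, and group_means are illustrative.

```r
library(openintro)

vi_fit <- lm(interest_rate ~ verified_income, data = loan50)

# One hypothetical loan per category of verified_income
lv <- levels(loan50$verified_income)
new_loans <- data.frame(verified_income = factor(lv, levels = lv))

# Predicted interest rate for each category
preds <- predict(vi_fit, newdata = new_loans)

# Observed mean interest rate within each category
group_means <- tapply(loan50$interest_rate, loan50$verified_income, mean)

# Side-by-side comparison: the two columns should agree
cbind(predicted = preds, group_mean = group_means)
```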

Adding in another predictor

Complete Exercises 5 and 6.

Interest rate vs. annual income: parallel slopes

Parallel slopes interpretation

term                            estimate  std.error  statistic  p.value
(Intercept)                       11.388      1.352      8.423    0.000
annual_income_k                   -0.022      0.011     -1.974    0.054
verified_incomeSource Verified     2.171      1.398      1.553    0.127
verified_incomeVerified            6.792      1.799      3.776    0.000

  • Slope of annual_income_k is -0.022 regardless of verified_income level
  • Change in verified_income corresponds to a shift in the intercept
    • Intercept for Not Verified is 11.388
    • For Source Verified, shift the intercept up by 2.171
      • (i.e., intercept \(= 11.388 + 2.171 = 13.559\))
    • For Verified, shift the intercept up by 6.792 from Not Verified
      • (i.e., intercept \(= 11.388 + 6.792 = 18.180\))
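
The level-specific intercepts can also be read off a fitted model. A minimal sketch, assuming annual_income_k is income in $1000s as in the rescaling step; the names loans, ps_fit, b, and intercepts are illustrative.

```r
library(openintro)
library(dplyr)

# Recreate the rescaled income (in $1000s) so this block runs on its own
loans <- loan50 |>
  mutate(annual_income_k = annual_income / 1000)

# Main effects only: one common slope, a separate intercept per level
ps_fit <- lm(interest_rate ~ annual_income_k + verified_income, data = loans)
b <- coef(ps_fit)

# Intercept of the fitted line for each level of verified_income
intercepts <- c(
  "Not Verified"    = unname(b["(Intercept)"]),
  "Source Verified" = unname(b["(Intercept)"] + b["verified_incomeSource Verified"]),
  "Verified"        = unname(b["(Intercept)"] + b["verified_incomeVerified"])
)

intercepts
b["annual_income_k"]  # common slope shared by all three lines
```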

Interest rate vs. annual income: interaction term

Interpreting interaction terms

term                                            estimate  std.error  statistic  p.value
(Intercept)                                       10.303      1.897      5.432    0.000
annual_income_k                                   -0.009      0.019     -0.471    0.640
verified_incomeSource Verified                     3.423      2.534      1.351    0.184
verified_incomeVerified                            9.788      3.652      2.680    0.010
annual_income_k:verified_incomeSource Verified    -0.015      0.026     -0.591    0.558
annual_income_k:verified_incomeVerified           -0.031      0.033     -0.961    0.342

  • Slope of annual_income_k depends on verified_income level
  • The fitted lines are no different from those obtained by fitting three separate linear models, one on the data at each level of verified_income
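
The equivalence with separate fits can be checked directly. The sketch below compares the Verified line implied by the interaction model with a simple regression fit to the Verified loans only; the names loans, int_fit, sep_fit, and line_verified are illustrative.

```r
library(openintro)
library(dplyr)

# Recreate the rescaled income (in $1000s) so this block runs on its own
loans <- loan50 |>
  mutate(annual_income_k = annual_income / 1000)

# Interaction model: the income slope is allowed to differ by level
int_fit <- lm(interest_rate ~ annual_income_k * verified_income, data = loans)
b <- coef(int_fit)

# Intercept and slope for Verified loans implied by the interaction model
line_verified <- c(
  intercept = unname(b["(Intercept)"] + b["verified_incomeVerified"]),
  slope     = unname(b["annual_income_k"] +
                       b["annual_income_k:verified_incomeVerified"])
)

# Separate simple regression using only the Verified loans
sep_fit <- lm(interest_rate ~ annual_income_k,
              data = filter(loans, verified_income == "Verified"))

# The two rows should match
rbind(interaction_model = line_verified,
      separate_fit      = unname(coef(sep_fit)))
```

The coefficients agree, but the standard errors generally do not, because the interaction model estimates a single residual variance from all 50 loans.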

Understanding the model

\[ \begin{aligned} \hat{interest\_rate} &= 10.303 - 0.009 \times annual\_income\_k \\ &+ 3.423 \times SourceVerified + 9.788 \times Verified \\ &- 0.015 \times annual\_income\_k \times SourceVerified\\ &- 0.031 \times annual\_income\_k \times Verified \end{aligned} \]
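
As a worked example, setting the indicators to 0 or 1 and using the rounded estimates from the coefficient table gives one line per level of verified_income (values are approximate because of rounding):

\[ \begin{aligned} \text{Not Verified:} \quad \hat{interest\_rate} &= 10.303 - 0.009 \times annual\_income\_k \\ \text{Source Verified:} \quad \hat{interest\_rate} &= (10.303 + 3.423) + (-0.009 - 0.015) \times annual\_income\_k \\ &= 13.726 - 0.024 \times annual\_income\_k \\ \text{Verified:} \quad \hat{interest\_rate} &= (10.303 + 9.788) + (-0.009 - 0.031) \times annual\_income\_k \\ &= 20.091 - 0.040 \times annual\_income\_k \end{aligned} \]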

Complete Exercise 7