Categorical Predictors

Prof. Eric Friedlander

Oct 07, 2024

Announcements & Getting Started

Exam Review Wednesday
Exam Friday
Project proposals due Friday!
- Accepted without penalty through Monday 10/14 at 11:59pm
Don’t forget about the Oral R Quiz!
- Happy to administer over Teams during the break if you’d like…

📋 AE 12 - P2P Loans

Open up AE 12
Complete Exercise 0 if you have time.

Categorical predictors

Topics

Understand categorical predictors
Understand how categorical predictors interact with quantitative predictors

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(mosaic)
library(broom)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

Data

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 3
   annual_income verified_income interest_rate
           <dbl> <fct>                   <dbl>
 1         59000 Not Verified            10.9 
 2         60000 Not Verified             9.92
 3         75000 Verified                26.3 
 4         75000 Not Verified             9.92
 5        254000 Not Verified             9.43
 6         67000 Source Verified          9.92
 7         28800 Source Verified         17.1 
 8         80000 Not Verified             6.08
 9         34000 Not Verified             7.97
10         80000 Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

annual_income: Annual income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Response: interest_rate: Interest rate for the loan

Response: `interest_rate`

min	median	max	iqr
5.31	9.93	26.3	5.755

Predictors


p1 <- loan50 |> 
  gf_bar(verified_income~.) |> 
  gf_labs(title = "Verified Income", 
       y = "")

p2 <- loan50 |> 
  gf_histogram(~annual_income, binwidth = 20000) |> 
  gf_labs(title = "",
       x = "Annual income")

p1 / p2

Data manipulation: Rescale income

loan50 <- loan50 |>
  mutate(annual_income_k = annual_income / 1000)

loan50 |> 
  gf_histogram(~annual_income_k, binwidth = 20) |> 
  gf_labs(title = "Annual income (in $1000s)", x = "")

Categorical predictor variables

Complete Exercises 1 and 2.

Indicator variables

Suppose there is a categorical variable with \(K\) categories (levels)
We can make \(K\) indicator variables - one indicator for each category
An indicator variable takes values 1 or 0
- 1 if the observation belongs to that category
- 0 if the observation does not belong to that category
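As a rough base-R sketch of the definition above, `model.matrix()` can build one 0/1 column per level. The `status` vector here is a made-up toy factor that mirrors the levels of `verified_income`; it is not the loan data.

```r
# Toy factor with K = 3 levels (hypothetical values, mirroring verified_income)
status <- factor(
  c("Not Verified", "Verified", "Source Verified", "Not Verified"),
  levels = c("Not Verified", "Source Verified", "Verified")
)

# "~ 0 + status" drops the intercept, so model.matrix() returns
# one 0/1 column per level: K indicator columns in total
indicators <- model.matrix(~ 0 + status)
colnames(indicators) <- levels(status)
indicators

# Each observation belongs to exactly one category,
# so every row contains a single 1
rowSums(indicators)
```

Each row has a 1 in the column for its own category and 0 everywhere else.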

Data manipulation: Create indicator variables for `verified_income`

loan50 <- loan50 |>
  mutate(
    not_verified = if_else(verified_income == "Not Verified", 1, 0),
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0)
  )

loan50 |>
  select(verified_income, not_verified, source_verified, verified) |>
  slice(1, 3, 6)

# A tibble: 3 × 4
  verified_income not_verified source_verified verified
  <fct>                  <dbl>           <dbl>    <dbl>
1 Not Verified               1               0        0
2 Verified                   0               0        1
3 Source Verified            0               1        0

Indicators in the model

We will use \(K-1\) of the indicator variables in the model.
The reference level or baseline is the category that doesn’t have a term in the model.
The coefficient of each indicator variable is interpreted as the expected difference in the response between that category and the baseline, holding all other variables constant.
This approach is also called dummy coding, and R will do this for you.

loan50 |>
  select(verified_income, source_verified, verified) |>
  slice(1, 3, 6)

# A tibble: 3 × 3
  verified_income source_verified verified
  <fct>                     <dbl>    <dbl>
1 Not Verified                  0        0
2 Verified                      0        1
3 Source Verified               1        0
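To see the dummy coding R applies on its own, one option is to inspect `contrasts()` and `model.matrix()` for a factor. This is a minimal sketch on a hypothetical three-value factor named `status`, assuming R's default treatment contrasts are in effect.

```r
# Hypothetical factor with the same levels as verified_income
status <- factor(
  c("Not Verified", "Verified", "Source Verified"),
  levels = c("Not Verified", "Source Verified", "Verified")
)

# Default treatment (dummy) coding: the first level is the baseline,
# so it gets a row of zeros and no column of its own
contrasts(status)

# model.matrix() shows what lm() would build: an intercept plus K - 1 indicators
model.matrix(~ status)

# relevel() picks a different baseline; here "Verified" becomes the reference
status_ref_verified <- relevel(status, ref = "Verified")
levels(status_ref_verified)
```

Changing the baseline changes which comparisons the coefficients report, but not the fitted values.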

Application Exercise

Complete Exercises 3 & 4

Interpreting `verified_income`

term	estimate	std.error	statistic	p.value
(Intercept)	9.541	1.006	9.487	0.000
verified_incomeSource Verified	2.224	1.440	1.544	0.129
verified_incomeVerified	6.312	1.836	3.437	0.001

Where do we see each of the estimates in the plot?
Where do we see the values we’d predict in the plot?
Are verified_income and interest_rate correlated?
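One way to connect the estimates to group averages: with a single categorical predictor, the intercept is the mean response in the baseline group and each indicator coefficient is a difference in group means. The sketch below uses only the ten rows printed earlier (rounded as displayed), so its numbers will not match the table above, which comes from all 50 loans. The data frame name `loans` is made up for illustration.

```r
# The ten rows printed earlier (rounded as displayed), not the full loan50 data
loans <- data.frame(
  interest_rate = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6),
  verified_income = factor(
    c("Not Verified", "Not Verified", "Verified", "Not Verified", "Not Verified",
      "Source Verified", "Source Verified", "Not Verified", "Not Verified",
      "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  )
)

fit <- lm(interest_rate ~ verified_income, data = loans)
group_means <- tapply(loans$interest_rate, loans$verified_income, mean)

# Intercept = mean of the baseline group (Not Verified)
coef(fit)[["(Intercept)"]]
group_means[["Not Verified"]]

# Each indicator coefficient = that group's mean minus the baseline mean
coef(fit)[["verified_incomeVerified"]]
group_means[["Verified"]] - group_means[["Not Verified"]]
```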


Model equation

\[ \begin{align}\hat{\text{interest_rate}} = 9.541 &+ 2.224 \times \text{source_verified}\\ &+ 6.312 \times \text{verified} \end{align} \]
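Plugging the indicator values into the equation gives one predicted interest rate per category. A small sketch, with the coefficients copied from the regression table; the helper `predict_rate()` is a hypothetical name, not part of any package.

```r
# Coefficients copied from the regression table
b0 <- 9.541          # intercept: Not Verified (baseline)
b_source <- 2.224    # Source Verified vs. Not Verified
b_verified <- 6.312  # Verified vs. Not Verified

# Hypothetical helper: evaluate the model equation for given indicator values
predict_rate <- function(source_verified, verified) {
  b0 + b_source * source_verified + b_verified * verified
}

predict_rate(0, 0)  # Not Verified: 9.541
predict_rate(1, 0)  # Source Verified: 9.541 + 2.224 = 11.765
predict_rate(0, 1)  # Verified: 9.541 + 6.312 = 15.853
```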

Adding in another predictor

Complete Exercises 5 and 6.

Interest rate vs. annual income: parallel slopes

Parallel slopes interpretation

term	estimate	std.error	statistic	p.value
(Intercept)	11.388	1.352	8.423	0.000
annual_income_k	-0.022	0.011	-1.974	0.054
verified_incomeSource Verified	2.171	1.398	1.553	0.127
verified_incomeVerified	6.792	1.799	3.776	0.000

Slope of annual_income_k is -0.022 regardless of verified_income level
Change in verified_income corresponds to a shift in the intercept
- Intercept for Not Verified is 11.388
- For Source Verified shift intercept up 2.171 from Not Verified
  - (i.e. intercept \(= 11.388 + 2.171 = 13.559\))
- For Verified shift intercept up 6.792 from Not Verified
  - (i.e. intercept \(= 11.388 + 6.792 = 18.180\))
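The intercept shifts can be checked with a few lines of arithmetic. This sketch copies the estimates from the parallel slopes table; the $50k income used for the prediction is an arbitrary illustrative value, and the object names are made up.

```r
# Estimates copied from the parallel slopes table
coefs <- c(intercept = 11.388, income_k = -0.022, source = 2.171, verified = 6.792)

# One intercept per level of verified_income
intercepts <- c(
  "Not Verified"    = coefs[["intercept"]],
  "Source Verified" = coefs[["intercept"]] + coefs[["source"]],    # 13.559
  "Verified"        = coefs[["intercept"]] + coefs[["verified"]]   # 18.180
)
intercepts

# Same slope for every group: predicted rate at a hypothetical income of $50k
rate_at_50k <- intercepts + coefs[["income_k"]] * 50
rate_at_50k
```

Because the slope is shared, the gaps between the three predictions are the same at every income level.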

Interest rate vs. annual income: interaction term

Interpreting interaction terms

term	estimate	std.error	statistic	p.value
(Intercept)	10.303	1.897	5.432	0.000
annual_income_k	-0.009	0.019	-0.471	0.640
verified_incomeSource Verified	3.423	2.534	1.351	0.184
verified_incomeVerified	9.788	3.652	2.680	0.010
annual_income_k:verified_incomeSource Verified	-0.015	0.026	-0.591	0.558
annual_income_k:verified_incomeVerified	-0.031	0.033	-0.961	0.342

Slope of annual_income_k depends on verified_income level
This gives the same fitted lines as fitting three separate linear models, one on the data for each level of verified_income
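The link between an interaction model and separate per-group fits can be checked numerically. The sketch below uses simulated data (not loan50); the names `sim`, `x`, `g`, and `y` are invented for illustration.

```r
set.seed(1)  # simulated data, purely illustrative (not loan50)
n_per_group <- 20
sim <- data.frame(
  x = runif(3 * n_per_group, 20, 200),
  g = factor(rep(c("A", "B", "C"), each = n_per_group))
)
sim$y <- 10 + 3 * (sim$g == "B") + 8 * (sim$g == "C") -
  0.02 * sim$x + rnorm(nrow(sim))

# One model with an interaction: each level of g gets its own intercept and slope
fit_int <- lm(y ~ x * g, data = sim)

# A separate simple regression using only the rows from group B
fit_b <- lm(y ~ x, data = subset(sim, g == "B"))

# Intercept and slope for group B, recovered from the interaction model
b <- coef(fit_int)
int_b <- b[["(Intercept)"]] + b[["gB"]]
slope_b <- b[["x"]] + b[["x:gB"]]

# Compare the two sets of estimates
all.equal(unname(coef(fit_b)), c(int_b, slope_b))
```

The coefficient estimates agree, but the standard errors generally do not, because the interaction model estimates a single residual variance from all groups together.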

Understanding the model

\[ \begin{aligned} \hat{interest\_rate} &= 10.303 - 0.009 \times annual\_income\_k \\ &+ 3.423 \times SourceVerified + 9.788 \times Verified \\ &- 0.015 \times annual\_income\_k \times SourceVerified\\ &- 0.031 \times annual\_income\_k \times Verified \end{aligned} \]
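Setting the indicators to 0 or 1 splits the model into one line per level of verified_income. A worked version, using the estimates from the interaction table:

\[ \begin{aligned} \text{Not Verified:} \quad \hat{interest\_rate} &= 10.303 - 0.009 \times annual\_income\_k \\ \text{Source Verified:} \quad \hat{interest\_rate} &= (10.303 + 3.423) + (-0.009 - 0.015) \times annual\_income\_k \\ &= 13.726 - 0.024 \times annual\_income\_k \\ \text{Verified:} \quad \hat{interest\_rate} &= (10.303 + 9.788) + (-0.009 - 0.031) \times annual\_income\_k \\ &= 20.091 - 0.040 \times annual\_income\_k \end{aligned} \]

The indicator coefficients shift the intercept, and the interaction coefficients shift the slope, both relative to Not Verified.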

Complete Exercise 7