class: center, middle, inverse, title-slide

# Regression Modeling in People Analytics: Survival Analysis
### Keith McNulty

---
class: left, middle, r-logo

## Note on source

This document is a learning summary developed for the London HR Analytics Meetup on 6 July 2021. It is based on material in the textbook *[Handbook of Regression Modeling in People Analytics](https://www.routledge.com/Handbook-of-Regression-Modeling-in-People-Analytics-With-Examples-in-R/McNulty/p/book/9781032041742)*. Please consult the book for a deeper treatment. The code for this document is [here](https://github.com/keithmcnulty/peopleanalytics-regression-book/blob/master/presentations/london_meetup_2021/index.Rmd).

## Note on languages

This document is coded in R, based on the material in Chapter 9 of the textbook. Instructions for doing this analysis in Python can be found in Section 10.2.5 of the textbook.

---
class: left, middle, r-logo

## Quick poll

> "I am confident that I know about regression modeling."
>
> Yes
>
> No

---
class: left, middle, r-logo

## What is regression?

* Regression is a statistical method that helps relate certain input data to an outcome of interest, based on certain statistical assumptions oriented around the *Central Limit Theorem*.
* It is helpful as a way to explain how certain constructs influence (or do not influence) your outcome, but it *cannot prove causation*, which is much harder to do.
* It is an extremely useful technique in practice. Because it is often too difficult to prove causation, decision makers use the next best thing - proof of influence combined with expert knowledge and judgment - in making business decisions.

---
class: left, middle, r-logo

## How does regression help?

Given data on a set of input variables and an outcome of interest, regression analysis can estimate the following:

1. Which of the input variables have a statistically significant influence on the outcome?
2. 'How big' is the influence of each significant variable?
3. How much of the overall outcome do your input variables explain?

---
class: left, middle, r-logo

## Example 1 - Linear Regression

Circa 1880, Francis Galton did a study on some British children to see how their heights related to those of their parents. Let's grab Galton's data.

```r
# get data from URL
url <- "https://peopleanalytics-regression-book.org/data/Galton.txt"
galton <- read.delim(url)

# view first few rows
head(galton)
```

```
##   Family Father Mother Gender Height Kids
## 1      1   78.5   67.0      M   73.2    4
## 2      1   78.5   67.0      F   69.2    4
## 3      1   78.5   67.0      F   69.0    4
## 4      1   78.5   67.0      F   69.0    4
## 5      2   75.5   66.5      M   73.5    4
## 6      2   75.5   66.5      M   72.5    4
```

---
class: left, middle, r-logo

## What is the relationship between mid-parent height and child height?

Galton simplistically expected the child's height to be perfectly explained by the average height of their parents. We can test that using a simple linear regression model.
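In formula terms, with `midParentHeight` defined in the next chunk as the average of the father's and mother's heights, the model being fitted is

$$
\mathrm{Height} = \beta_0 + \beta_1 \cdot \mathrm{midParentHeight} + \epsilon
$$

and its `\(R^2\)` tells us how much of the variation in child height is explained by mid-parent height alone.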
```r
library(dplyr)

# create midParentHeight column
galton <- galton |> 
  dplyr::mutate(midParentHeight = (Father + Mother)/2)

# simple linear regression of child vs mid-parent height
simplemodel <- lm(formula = Height ~ midParentHeight, data = galton)

# how much does it explain (R-squared)
summary(simplemodel)$r.squared
```

```
## [1] 0.1069774
```

---
class: left, middle, r-logo

## Galton realized that children 'regress' away from their parents' heights towards a mean population height

<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Example 2 - Logistic Regression

In 1986, the Space Shuttle Challenger exploded shortly after takeoff and all on board were killed. This prompted a major investigation by NASA. Video evidence pointed to components called O-rings as the cause of the accident, and there was a strong hypothesis that these components did not function well in colder temperatures, such as on the day of the tragic accident.

Let's get some data on the Space Shuttle launches leading up to the Challenger disaster.

```r
shuttles <- read.table(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-only.data"
)

colnames(shuttles) <- c("total_orings", "distressed_orings", "temp", 
                        "leakcheck_psi", "order")

head(shuttles)
```

```
##   total_orings distressed_orings temp leakcheck_psi order
## 1            6                 0   66            50     1
## 2            6                 1   70            50     2
## 3            6                 0   69            50     3
## 4            6                 0   68            50     4
## 5            6                 0   67            50     5
## 6            6                 0   72            50     6
```

We create a new binary field called `incident` based on whether there were any distressed O-rings:

```r
shuttles <- shuttles |> 
  dplyr::mutate(incident = ifelse(distressed_orings > 0, 1, 0))
```

---
class: left, middle, r-logo

## A simple logistic regression associates the temperature with the likelihood of O-ring failure

Now we run a logistic regression model to relate `temp` and `leakcheck_psi` to the likelihood of there being an incident:

```r
# binomial logistic regression relating temp and leak check pressure to incident likelihood
model <- glm("incident ~ temp + leakcheck_psi", data = shuttles, family = "binomial")

summary(model)
```

```
## 
## Call:
## glm(formula = "incident ~ temp + leakcheck_psi", family = "binomial", 
##     data = shuttles)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.00837  -0.54507  -0.28098   0.02512   2.21488  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   21.843631  11.936459   1.830   0.0673 .
## temp          -0.350098   0.172977  -2.024   0.0430 *
## leakcheck_psi  0.006007   0.009749   0.616   0.5378  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.402  on 22  degrees of freedom
## Residual deviance: 14.033  on 20  degrees of freedom
## AIC: 20.033
## 
## Number of Fisher Scoring iterations: 6
```

---
class: left, middle, r-logo

## Methods covered in textbook

<p>
<img src="www/coverpage.png" style="float:left;margin-right:50px;" width="260" align = "middle">
<ul>
<li>Hypothesis testing
<li>Linear Regression
<li>Binomial Logistic Regression
<li>Multinomial Logistic Regression
<li>Proportional Odds Regression
<li>Multilevel Models
<li>Structural Equation Models
<li>Survival Analysis
<li>Power Analysis
</ul>
<ul>
<li><a href="https://www.routledge.com/Handbook-of-Regression-Modeling-in-People-Analytics-With-Examples-in-R/McNulty/p/book/9781032041742">Pre-order now</a> for print/Kindle.
All author proceeds donated to <a href="https://rladies.org/">R-ladies Global</a>
<li><a href="https://peopleanalytics-regression-book.org">Free online version</a>
</ul>
</p>

---
class: left, middle, r-logo

## Analyzing time-based event outcomes

Most elementary models analyze an outcome at a specific point in time. For example, in a logistic regression we might model the likelihood of high job performance (Yes or No) at a specific performance review cycle. In this case our outcome variable `\(y\)` is simply positive (1) or negative (0). Then if our input variables are `\(x_1, x_2, \dots, x_n\)`, our model estimates the *odds* of `\(y\)` being positive as

$$
\frac{P(y = 1)}{P(y = 0)} = e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n}
$$

where `\(\beta_0, \beta_1, \beta_2, \dots, \beta_n\)` are our coefficients, which we can use to explain the influence of each input variable.

But in People Analytics, many of our outcomes can occur *at any time* after we take measures of our input variables, and we are interested in the time to the event as well as the event itself. Common examples of this are attrition and promotion.

---
class: left, middle, r-logo

## Survival analysis

Survival analysis is a family of regression-based techniques designed to understand outcome events that occur over time. This family of techniques came into existence to research the effect of lifestyle factors, drug treatments and other interventions on adverse clinical outcomes like death or the onset of serious disease. Medically, while it is important to understand *whether* such an outcome occurs, it is also important to understand *when* it occurs.

Key components of a basic survival analysis are:

1. The construction of a survival outcome
2. The calculation of Kaplan-Meier survival rates and curves
3. Running and checking a Cox proportional hazard regression model

---
class: left, middle, r-logo

## Case example: Employee surveys and attrition

*Dropping Like Flies, Inc* is a company that is experiencing very high employee attrition. The Chief People Officer believes that employee experience is not a major factor and that the attrition is instead being driven by aggressive competitors 'poaching' staff. You have been asked to find evidence to support or contradict this claim by analyzing the data on past survey responses and on employee departures.

**Key question:** Is there a reason to believe that poor employee experience is a factor in recent attrition?

---
class: left, middle, r-logo

## Data set 1 - Survey responses from a survey conducted at the end of 2018

```r
survey_responses <- read.csv("https://peopleanalytics-regression-book.org/data/survey_responses.csv")

head(survey_responses)
```

```
##                                iid gender department  level sentiment intention
## 1 dad26b9ee392cb7197bc7df941b92e2b      M         IT   High         3         8
## 2 0fa80c4c486e1d22ef1ae51373c21e56      F    Finance    Low         8         4
## 3 e122762138e5e1b1c9cd09d1f3935c13      M         HR Medium         7         7
## 4 f67f76e945aa8ce99454c66c47b78c87      M    Finance    Low         8         4
## 5 bd3276714f5f60d356309f1533a24920      M    Finance   High         7         6
## 6 e8edf7a19afb5841a8e6997e467dbfbf      F        R&D Medium         6        10
```

* `sentiment` is the level of agreement with the statement 'I am happy in my job' on a Likert scale of 1 to 10.
* `intention` is the level of agreement with the statement 'I intend to find a new job' on a Likert scale of 1 to 10.
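The next slide shows the distribution of these two measures. A minimal sketch for producing a similar view, using `ggplot2` and `tidyr` (the figure on the next slide may have been generated differently), is:

```r
library(ggplot2)
library(tidyr)

# reshape the two Likert-scale items into long format
survey_long <- survey_responses |> 
  tidyr::pivot_longer(
    cols = c("sentiment", "intention"),
    names_to = "measure",
    values_to = "rating"
  )

# side-by-side bar chart of the rating distributions
ggplot(survey_long, aes(x = rating, fill = measure)) +
  geom_bar(position = "dodge") +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Likert rating", y = "Number of responses") +
  theme_minimal()
```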
---
class: left, middle, r-logo

## Quick look at the survey response distribution

<img src="index_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Quick poll

> Do you think sentiment or intention to leave will make a significant difference in departures over the next two years?
>
> None
>
> Sentiment Only
>
> Intention Only
>
> Both of them

---
class: left, middle, r-logo

## Data set 2 - Data on departures during 2019 and 2020

```r
departure_dates <- read.csv("https://peopleanalytics-regression-book.org/data/departure_dates.csv")

head(departure_dates)
```

```
##                                iid departure_date
## 1 dad26b9ee392cb7197bc7df941b92e2b     2019-01-31
## 2 e122762138e5e1b1c9cd09d1f3935c13     2019-11-30
## 3 bd3276714f5f60d356309f1533a24920     2019-02-28
## 4 e8edf7a19afb5841a8e6997e467dbfbf     2019-05-31
## 5 b0f5468f305c719192575909563e7ea6     2019-11-30
## 6 86d665dc11c92142b46db64260223cbe     2019-07-31
```

* Only employees who departed are in this data
* All departure dates fall at the end of a month

---
class: left, middle, r-logo

## Designing the analysis

Survival analysis requires an event and a time, so we will need to define and extract both of these from this data.

* **Event:** We are interested in whether someone departed or not, which corresponds to them having a departure date in our records. So our event is labelled 1 if there is a departure date and 0 otherwise.
* **Time:** Given that the data is captured at the end of every month over a two-year period, it makes sense to count our time in months from 1 to 24. If there is no departure date recorded for an individual, we assume they remained at the company and assign a time of 24 - that is, they are *censored* at the end of the observation window. We are also assuming our data is well-captured, so nobody departed without a departure date being recorded.

---
class: left, middle, r-logo

## Creating the data set - generating the departure event

First we left join the survey responses to the departure data and create the event column.

```r
survival_data <- survey_responses |> 
  dplyr::left_join(departure_dates, by = "iid") |> 
  dplyr::select(-iid) |> 
  dplyr::mutate(departure_event = ifelse(is.na(departure_date), 0, 1))

head(survival_data)
```

```
##   gender department  level sentiment intention departure_date departure_event
## 1      M         IT   High         3         8     2019-01-31               1
## 2      F    Finance    Low         8         4           <NA>               0
## 3      M         HR Medium         7         7     2019-11-30               1
## 4      M    Finance    Low         8         4           <NA>               0
## 5      M    Finance   High         7         6     2019-02-28               1
## 6      F        R&D Medium         6        10     2019-05-31               1
```

---
class: left, middle, r-logo

## Creating the data set - generating the time

To create the time, we will extract the month and year. If the year is 2019, we use the month. If the year is 2020, we add 12 to the month. Finally, if the event is zero, we assign a month value of 24.

```r
library(lubridate)

survival_data <- survival_data |> 
  dplyr::mutate(month = dplyr::case_when(
    year(departure_date) == 2019 ~ month(departure_date),
    year(departure_date) == 2020 ~ month(departure_date) + 12,
    departure_event == 0 ~ 24,
    TRUE ~ NA_real_
  )) |> 
  dplyr::select(-departure_date)

head(survival_data)
```

```
##   gender department  level sentiment intention departure_event month
## 1      M         IT   High         3         8               1     1
## 2      F    Finance    Low         8         4               0    24
## 3      M         HR Medium         7         7               1    11
## 4      M    Finance    Low         8         4               0    24
## 5      M    Finance   High         7         6               1     2
## 6      F        R&D Medium         6        10               1     5
```

---
class: left, middle, r-logo

## Creating the survival outcome

The survival outcome is generated from both the `departure_event` and the `month` columns.
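In standard survival notation, each individual `\(i\)` contributes a pair

$$
y_i = (t_i, \delta_i)
$$

where `\(t_i\)` is the month and `\(\delta_i\)` indicates whether a departure was observed (`\(\delta_i = 1\)`) or the individual was still employed at month 24 and is therefore censored (`\(\delta_i = 0\)`). The `Surv()` function in the next chunk bundles the `month` and `departure_event` columns into exactly this kind of outcome.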
```r
library(survival)

survival_outcome <- Surv(event = survival_data$departure_event, 
                         time = survival_data$month)

unique(survival_outcome)
```

```
##  [1]  1  24+ 11   2   5   7  16   9   6  17  22  21   3  19  12   8  23  13  20
## [20] 15  24  10  18  14   4
```

The `+` suffix (for example `24+`) indicates a censored observation - an individual who was still at the company at the end of month 24.

---
class: left, middle, r-logo

## Kaplan-Meier survival rates

We can calculate survival rates and sketch curves for any *categorical* variable. The survival rate `\(S_i\)` at month `\(i\)` is calculated as

$$
`\begin{align*}
S_i = S_{i - 1}\left(1 - \frac{l_i}{n_i}\right)
\end{align*}`
$$

where `\(l_i\)` is the number reported as departed in month `\(i\)`, and `\(n_i\)` is the number still at the company after month `\(i - 1\)`, with `\(S_0 = 1\)`.

Let's look at how employee sentiment affects survival rates.

---
class: left, middle, r-logo

## How does sentiment relate to survival?

To do this, we convert our sentiment rating into three categories: "High", "Medium" and "Low".

```r
# create a new field for high (>= 7), medium (4-6) and low (< 4) sentiment
survival_data$sentiment_category <- dplyr::case_when(
  survival_data$sentiment >= 7 ~ "High",
  survival_data$sentiment >= 4 ~ "Medium",
  survival_data$sentiment < 4 ~ "Low",
  TRUE ~ NA_character_
)
```

We can then use a simple function in the `survival` R package to relate these categories to the survival rate:

```r
# generate survival rates by sentiment category
kmestimate_sentimentcat <- survival::survfit(
  formula = Surv(event = departure_event, time = month) ~ sentiment_category,
  data = survival_data
)
```

---
class: left, middle, r-logo

## Kaplan-Meier survival curves

```r
library(survminer)

# show survival curves with p-value estimate and confidence intervals
survminer::ggsurvplot(
  kmestimate_sentimentcat,
  pval = TRUE,
  conf.int = TRUE,
  palette = c("blue", "green", "red"),
  linetype = c("solid", "dashed", "dotted"),
  xlab = "Month",
  ylab = "Retention Rate"
)
```

<img src="index_files/figure-html/unnamed-chunk-15-1.png" height="400" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## We can do the same for the `intention` survey response

<img src="index_files/figure-html/unnamed-chunk-16-1.png" height="400" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Quick poll

> What kind of regression model should we use to understand multivariate effects on a survival outcome?
>
> Linear Regression
>
> Structural Equation Model
>
> Cox Proportional Hazard Model
>
> Multinomial Logistic Regression

---
class: left, middle, r-logo

## Regression using Cox's proportional hazard model

We assume that each individual has a 'baseline' risk of departing as a function of time. This is called the 'baseline hazard' function and denoted `\(h_0(t)\)`. As in a logistic regression model, each input variable `\(x_1, x_2, \dots, x_p\)` has an effect on that baseline hazard, increasing or decreasing it by a certain multiple according to a set of coefficients `\(\beta_1, \beta_2, \dots, \beta_p\)`.

For individual A, their personal hazard function will be:

$$
h_A(t) = h_0(t)e^{\beta_1x_1^A + \beta_2x_2^A + \dots + \beta_px_p^A}
$$

If we take another person B and then ask 'what is their hazard function relative to person A?', we get:

$$
`\begin{aligned}
\frac{h_B(t)}{h_A(t)} &= \frac{h_0(t)e^{\beta_1x_1^B + \beta_2x_2^B + \dots + \beta_px_p^B}}{h_0(t)e^{\beta_1x_1^A + \beta_2x_2^A + \dots + \beta_px_p^A}} \\
&= e^{\beta_1(x_1^B-x_1^A) + \beta_2(x_2^B-x_2^A) + \dots + \beta_p(x_p^B-x_p^A)}
\end{aligned}`
$$

Note that time has disappeared from our calculation.
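For example, if person B is identical to person A except that B scores one unit higher on `\(x_1\)`, so that `\(x_1^B - x_1^A = 1\)` and all other differences are zero, the ratio reduces to

$$
\frac{h_B(t)}{h_A(t)} = e^{\beta_1}
$$

so each exponentiated coefficient can be read as a *hazard ratio*: the multiplicative effect on the hazard of a one-unit increase in that input, holding the others constant. This is the `exp(coef)` column in the model output later in this deck.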
This approach works on the assumption that `\(h_A(t)\)` and `\(h_B(t)\)` are *proportional to each other* - in other words, the hazard curves do not cross. This assumption needs to be checked before finalizing your model (see Section 9.2.2 of the textbook; a sketch of the check appears at the end of this deck).

---
class: left, middle, r-logo

## Running a Cox proportional hazard model

We can run a multiple regression to understand the relative effects of many input variables:

```r
# run cox model against survival outcome
cox_model <- survival::coxph(
  formula = Surv(event = departure_event, time = month) ~ gender + department + 
    level + sentiment + intention,
  data = survival_data
)

summary(cox_model)
```

---
class: left, middle, r-logo

## Interpreting model results

```
## Call:
## survival::coxph(formula = Surv(event = departure_event, time = month) ~ 
##     gender + department + level + sentiment + intention, data = survival_data)
## 
##   n= 3770, number of events= 1354 
## 
##                                      coef exp(coef) se(coef)      z Pr(>|z|)    
## genderM                           0.04324   1.04418  0.05899  0.733 0.463556    
## departmentHR                     -0.25350   0.77608  0.06684 -3.792 0.000149 ***
## departmentIT                     -0.14364   0.86620  0.08576 -1.675 0.093978 .  
## departmentR&D                     0.04249   1.04341  0.12699  0.335 0.737920    
## departmentSales/Marketing        -0.12443   0.88300  0.09949 -1.251 0.211053    
## departmentStrategy & Operations  -0.12534   0.88220  0.14328 -0.875 0.381689    
## levelLow                          0.17191   1.18757  0.08976  1.915 0.055450 .  
## levelMedium                       0.12096   1.12858  0.10201  1.186 0.235707    
## sentiment                        -0.02984   0.97060  0.01521 -1.961 0.049830 *  
## intention                         0.20556   1.22821  0.01354 15.181  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                                 exp(coef) exp(-coef) lower .95 upper .95
## genderM                            1.0442     0.9577    0.9302    1.1722
## departmentHR                       0.7761     1.2885    0.6808    0.8847
## departmentIT                       0.8662     1.1545    0.7322    1.0248
## departmentR&D                      1.0434     0.9584    0.8135    1.3383
## departmentSales/Marketing          0.8830     1.1325    0.7266    1.0731
## departmentStrategy & Operations    0.8822     1.1335    0.6662    1.1682
## levelLow                           1.1876     0.8421    0.9960    1.4160
## levelMedium                        1.1286     0.8861    0.9241    1.3784
## sentiment                          0.9706     1.0303    0.9421    1.0000
## intention                          1.2282     0.8142    1.1960    1.2612
## 
## Concordance= 0.638  (se = 0.008 )
## Likelihood ratio test= 309  on 10 df,   p=<2e-16
## Wald test            = 327.2  on 10 df,   p=<2e-16
## Score (logrank) test = 337.3  on 10 df,   p=<2e-16
```

---
class: left, middle, r-logo

## Thank you!

Key textbook chapters to learn this methodology...

* Chapter 3 - Hypothesis testing
* Chapter 4 - Basic elements of multivariate regression
* Chapter 5 - Binomial logistic regression (binary event modeling)
* Chapter 9 - Survival analysis
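---
class: left, middle, r-logo

## Appendix: checking the proportional hazards assumption

For completeness, here is a minimal sketch of the assumption check mentioned earlier, applied to the `cox_model` fitted above (Section 9.2.2 of the textbook covers this in depth). `survival::cox.zph()` tests each model term for evidence against proportional hazards; non-significant p-values suggest the assumption is reasonable.

```r
# test the proportional hazards assumption for each model term
ph_check <- survival::cox.zph(cox_model)
ph_check

# plot scaled Schoenfeld residuals over time for a visual check
plot(ph_check)
```

The `survminer` package loaded earlier also provides `ggcoxzph()` for a ggplot2-styled version of the same diagnostic.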