class: center, middle, inverse, title-slide

# Regression Modeling in People Analytics: Survival Analysis
### Keith McNulty

---
class: left, middle, r-logo

## Note on source

This document is a learning summary developed for the London HR Analytics Meetup on 6 July 2021. It is based on material in the textbook *[Handbook of Regression Modeling in People Analytics](https://www.routledge.com/Handbook-of-Regression-Modeling-in-People-Analytics-With-Examples-in-R/McNulty/p/book/9781032041742)*. Please consult the book for a deeper treatment. The code for this document is [here](https://github.com/keithmcnulty/peopleanalytics-regression-book/blob/master/presentations/london_meetup_2021/index.Rmd).

## Note on languages

This document is coded in R, based on the material in Chapter 9 of the textbook. Instructions for doing this analysis in Python can be found in Section 10.2.5 of the textbook.

---
class: left, middle, r-logo

## Quick poll

> "I am confident that I know about regression modeling."
>
> Yes
>
> No

---
class: left, middle, r-logo

## What is regression?

* Regression is a statistical method that helps relate certain input data to an outcome of interest, based on certain statistical assumptions oriented around the *Central Limit Theorem*.
* It is helpful as a way to explain how certain constructs influence (or do not influence) your outcome, but it *cannot prove causation*, which is much harder to do.
* It is an extremely useful technique in practice. Because it is often too difficult to prove causation, decision makers use the next best thing - proof of influence combined with expert knowledge and judgment - in making business decisions.

---
class: left, middle, r-logo

## How does regression help?

Given data on a set of input variables and an outcome of interest, regression analysis can estimate the following:

1. Which of the input variables have a statistically significant influence on the outcome?
2. 'How big' is the influence of each significant variable?
3. How much of the overall outcome do your input variables explain?

---
class: left, middle, r-logo

## Example 1 - Linear Regression

Circa 1880, Francis Galton did a study on some British children to see how their heights related to those of their parents. Let's grab Galton's data.

```r
# get data from URL
url <- "https://peopleanalytics-regression-book.org/data/Galton.txt"
galton <- read.delim(url)

# view first few rows
head(galton)
```

```
##   Family Father Mother Gender Height Kids
## 1      1   78.5   67.0      M   73.2    4
## 2      1   78.5   67.0      F   69.2    4
## 3      1   78.5   67.0      F   69.0    4
## 4      1   78.5   67.0      F   69.0    4
## 5      2   75.5   66.5      M   73.5    4
## 6      2   75.5   66.5      M   72.5    4
```

---
class: left, middle, r-logo

## What is the relationship between mid-parent height and child height?

Galton simplistically expected the child's height to be perfectly explained by the average height of their parents. We can test that using a simple linear regression model.
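In formula terms, with `midParentHeight` defined in the next chunk as the average of the father's and mother's heights, the model being fitted is

$$
\mathrm{Height} = \beta_0 + \beta_1 \cdot \mathrm{midParentHeight} + \epsilon
$$

and its `\(R^2\)` tells us how much of the variation in child height is explained by mid-parent height alone.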
```r
library(dplyr)

# create midParentHeight column
galton <- galton |> 
  dplyr::mutate(midParentHeight = (Father + Mother)/2)

# simple linear regression of child vs mid-parent height
simplemodel <- lm(formula = Height ~ midParentHeight, data = galton)

# how much does it explain (R-squared)
summary(simplemodel)$r.squared
```

```
## [1] 0.1069774
```

---
class: left, middle, r-logo

## Galton realized that children 'regress' away from their parents' heights towards a mean population height

<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Example 2 - Logistic Regression

In 1986, the Space Shuttle Challenger exploded shortly after takeoff and all on board were killed. This prompted a major investigation by NASA. Video evidence pointed to components called O-rings as the cause of the accident, and there was a strong hypothesis that these components did not function well in colder temperatures, such as on the day of the tragic accident.

Let's get some data on the Space Shuttle launches leading up to the Challenger disaster.

```r
shuttles <- read.table(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-only.data"
)

colnames(shuttles) <- c("total_orings", "distressed_orings", "temp", 
                        "leakcheck_psi", "order")

head(shuttles)
```

```
##   total_orings distressed_orings temp leakcheck_psi order
## 1            6                 0   66            50     1
## 2            6                 1   70            50     2
## 3            6                 0   69            50     3
## 4            6                 0   68            50     4
## 5            6                 0   67            50     5
## 6            6                 0   72            50     6
```

We create a new binary field called `incident` based on whether there were any distressed O-rings:

```r
shuttles <- shuttles |> 
  dplyr::mutate(incident = ifelse(distressed_orings > 0, 1, 0))
```

---
class: left, middle, r-logo

## A simple logistic regression associates the temperature with the likelihood of O-ring failure

Now we run a logistic regression model to relate `temp` and `leakcheck_psi` to the likelihood of there being an incident:

```r
# binomial logistic regression relating temp and leak check pressure to incident likelihood
model <- glm("incident ~ temp + leakcheck_psi", data = shuttles, family = "binomial")

summary(model)
```

```
## 
## Call:
## glm(formula = "incident ~ temp + leakcheck_psi", family = "binomial", 
##     data = shuttles)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.00837  -0.54507  -0.28098   0.02512   2.21488  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   21.843631  11.936459   1.830   0.0673 .
## temp          -0.350098   0.172977  -2.024   0.0430 *
## leakcheck_psi  0.006007   0.009749   0.616   0.5378  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.402  on 22  degrees of freedom
## Residual deviance: 14.033  on 20  degrees of freedom
## AIC: 20.033
## 
## Number of Fisher Scoring iterations: 6
```

---
class: left, middle, r-logo

## Methods covered in textbook

<p>
<img src="www/coverpage.png" style="float:left;margin-right:50px;" width="260" align = "middle">
<ul>
<li>Hypothesis testing
<li>Linear Regression
<li>Binomial Logistic Regression
<li>Multinomial Logistic Regression
<li>Proportional Odds Regression
<li>Multilevel Models
<li>Structural Equation Models
<li>Survival Analysis
<li>Power Analysis
</ul>
<ul>
<li><a href="https://www.routledge.com/Handbook-of-Regression-Modeling-in-People-Analytics-With-Examples-in-R/McNulty/p/book/9781032041742">Pre-order now</a> for print/Kindle.
All author proceeds donated to <a href="https://rladies.org/">R-ladies Global</a>
<li><a href="https://peopleanalytics-regression-book.org">Free online version</a>
</ul>
</p>

---
class: left, middle, r-logo

## Analyzing time-based event outcomes

Most elementary models analyze an outcome at a specific point in time. For example, in a logistic regression we might model the likelihood of high job performance (Yes or No) at a specific performance review cycle. In this case our outcome variable `\(y\)` is simply positive (1) or negative (0). Then if our input variables are `\(x_1, x_2, \dots, x_n\)`, our model estimates the *odds* of `\(y\)` being positive as

$$
\frac{P(y = 1)}{P(y = 0)} = e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n}
$$

where `\(\beta_0, \beta_1, \beta_2, \dots, \beta_n\)` are our coefficients, which we can use to explain the influence of each input variable.

But in People Analytics, many of our outcomes can occur *at any time* after we take measures of our input variables, and we are interested in the time to the event as well as the event itself. Common examples of this are attrition and promotion.

---
class: left, middle, r-logo

## Survival analysis

Survival analysis is a family of regression-based techniques designed to understand outcome events that occur over time. This family of techniques came into existence to research the effect of lifestyle factors, drug treatments and other interventions on adverse clinical outcomes like death or the onset of serious disease. Medically, while it is important to understand *whether* such an outcome occurs, it is also important to understand *when* it occurs.

Key components of a basic survival analysis are:

1. The construction of a survival outcome
2. The calculation of Kaplan-Meier survival rates and curves
3. Running and checking a Cox proportional hazard regression model

---
class: left, middle, r-logo

## Case example: Employee surveys and attrition

*Dropping Like Flies, Inc* is a company that is experiencing very high employee attrition. The Chief People Officer believes that employee experience is not a major factor and that the attrition is instead being driven by aggressive competitors 'poaching' staff. You have been asked to find evidence to support or contradict this claim by analyzing the data on past survey responses and on employee departures.

**Key question:** Is there a reason to believe that poor employee experience is a factor in recent attrition?

---
class: left, middle, r-logo

## Data set 1 - Survey responses from a survey conducted at the end of 2018

```r
survey_responses <- read.csv("https://peopleanalytics-regression-book.org/data/survey_responses.csv")

head(survey_responses)
```

```
##                                iid gender department  level sentiment intention
## 1 dad26b9ee392cb7197bc7df941b92e2b      M         IT   High         3         8
## 2 0fa80c4c486e1d22ef1ae51373c21e56      F    Finance    Low         8         4
## 3 e122762138e5e1b1c9cd09d1f3935c13      M         HR Medium         7         7
## 4 f67f76e945aa8ce99454c66c47b78c87      M    Finance    Low         8         4
## 5 bd3276714f5f60d356309f1533a24920      M    Finance   High         7         6
## 6 e8edf7a19afb5841a8e6997e467dbfbf      F        R&D Medium         6        10
```

* `sentiment` is the level of agreement with the statement 'I am happy in my job' on a Likert scale of 1 to 10.
* `intention` is the level of agreement with the statement 'I intend to find a new job' on a Likert scale of 1 to 10.
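The next slide shows the distribution of these two measures. A minimal sketch for producing a similar view, using `ggplot2` and `tidyr` (the figure on the next slide may have been generated differently), is:

```r
library(ggplot2)
library(tidyr)

# reshape the two Likert-scale items into long format
survey_long <- survey_responses |> 
  tidyr::pivot_longer(
    cols = c("sentiment", "intention"),
    names_to = "measure",
    values_to = "rating"
  )

# side-by-side bar chart of the rating distributions
ggplot(survey_long, aes(x = rating, fill = measure)) +
  geom_bar(position = "dodge") +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Likert rating", y = "Number of responses") +
  theme_minimal()
```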
---
class: left, middle, r-logo

## Quick look at the survey response distribution

<img src="index_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Quick poll

> Do you think sentiment or intention to leave will make a significant difference in departures over the next two years?
>
> None
>
> Sentiment Only
>
> Intention Only
>
> Both of them

---
class: left, middle, r-logo

## Data set 2 - Data on departures during 2019 and 2020

```r
departure_dates <- read.csv("https://peopleanalytics-regression-book.org/data/departure_dates.csv")

head(departure_dates)
```

```
##                                iid departure_date
## 1 dad26b9ee392cb7197bc7df941b92e2b     2019-01-31
## 2 e122762138e5e1b1c9cd09d1f3935c13     2019-11-30
## 3 bd3276714f5f60d356309f1533a24920     2019-02-28
## 4 e8edf7a19afb5841a8e6997e467dbfbf     2019-05-31
## 5 b0f5468f305c719192575909563e7ea6     2019-11-30
## 6 86d665dc11c92142b46db64260223cbe     2019-07-31
```

* Only employees who departed are in this data
* All departure dates fall at the end of a month

---
class: left, middle, r-logo

## Designing the analysis

Survival analysis requires an event and a time, so we will need to define and extract both of these from this data.

* **Event:** We are interested in whether someone departed or not, which corresponds to them having a departure date in our records. So our event is labelled 1 if there is a departure date and 0 otherwise.
* **Time:** Given that the data is captured at the end of every month over a two-year period, it makes sense to count our time in months from 1 to 24. If there is no departure date recorded for an individual, we assume they remained at the company and assign a time of 24 - that is, they are *censored* at the end of the observation window. We are also assuming our data is well-captured, so nobody departed without a departure date being recorded.

---
class: left, middle, r-logo

## Creating the data set - generating the departure event

First we left join the survey responses to the departure data and create the event column.

```r
survival_data <- survey_responses |> 
  dplyr::left_join(departure_dates, by = "iid") |> 
  dplyr::select(-iid) |> 
  dplyr::mutate(departure_event = ifelse(is.na(departure_date), 0, 1))

head(survival_data)
```

```
##   gender department  level sentiment intention departure_date departure_event
## 1      M         IT   High         3         8     2019-01-31               1
## 2      F    Finance    Low         8         4           <NA>               0
## 3      M         HR Medium         7         7     2019-11-30               1
## 4      M    Finance    Low         8         4           <NA>               0
## 5      M    Finance   High         7         6     2019-02-28               1
## 6      F        R&D Medium         6        10     2019-05-31               1
```

---
class: left, middle, r-logo

## Creating the data set - generating the time

To create the time, we will extract the month and year. If the year is 2019, we use the month. If the year is 2020, we add 12 to the month. Finally, if the event is zero, we assign a month value of 24.

```r
library(lubridate)

survival_data <- survival_data |> 
  dplyr::mutate(month = dplyr::case_when(
    year(departure_date) == 2019 ~ month(departure_date),
    year(departure_date) == 2020 ~ month(departure_date) + 12,
    departure_event == 0 ~ 24,
    TRUE ~ NA_real_
  )) |> 
  dplyr::select(-departure_date)

head(survival_data)
```

```
##   gender department  level sentiment intention departure_event month
## 1      M         IT   High         3         8               1     1
## 2      F    Finance    Low         8         4               0    24
## 3      M         HR Medium         7         7               1    11
## 4      M    Finance    Low         8         4               0    24
## 5      M    Finance   High         7         6               1     2
## 6      F        R&D Medium         6        10               1     5
```

---
class: left, middle, r-logo

## Creating the survival outcome

The survival outcome is generated from both the `departure_event` and the `month` columns.
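In standard survival notation, each individual `\(i\)` contributes a pair

$$
y_i = (t_i, \delta_i)
$$

where `\(t_i\)` is the month and `\(\delta_i\)` indicates whether a departure was observed (`\(\delta_i = 1\)`) or the individual was still employed at month 24 and is therefore censored (`\(\delta_i = 0\)`). The `Surv()` function in the next chunk bundles the `month` and `departure_event` columns into exactly this kind of outcome.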
```r
library(survival)

survival_outcome <- Surv(event = survival_data$departure_event, 
                         time = survival_data$month)

unique(survival_outcome)
```

```
##  [1]  1  24+ 11   2   5   7  16   9   6  17  22  21   3  19  12   8  23  13  20
## [20] 15  24  10  18  14   4
```

The `+` suffix (for example `24+`) indicates a censored observation - an individual who was still at the company at the end of month 24.

---
class: left, middle, r-logo

## Kaplan-Meier survival rates

We can calculate survival rates and sketch curves for any *categorical* variable. The survival rate `\(S_i\)` at month `\(i\)` is calculated as

$$
`\begin{align*}
S_i = S_{i - 1}\left(1 - \frac{l_i}{n_i}\right)
\end{align*}`
$$

where `\(l_i\)` is the number reported as departed in month `\(i\)`, and `\(n_i\)` is the number still at the company after month `\(i - 1\)`, with `\(S_0 = 1\)`.

Let's look at how employee sentiment affects survival rates.

---
class: left, middle, r-logo

## How does sentiment relate to survival?

To do this, we convert our sentiment rating into three categories: "High", "Medium" and "Low".

```r
# create a new field for high (>= 7), medium (4-6) and low (< 4) sentiment
survival_data$sentiment_category <- dplyr::case_when(
  survival_data$sentiment >= 7 ~ "High",
  survival_data$sentiment >= 4 ~ "Medium",
  survival_data$sentiment < 4 ~ "Low",
  TRUE ~ NA_character_
)
```

We can then use a simple function in the `survival` R package to relate these categories to the survival rate:

```r
# generate survival rates by sentiment category
kmestimate_sentimentcat <- survival::survfit(
  formula = Surv(event = departure_event, time = month) ~ sentiment_category,
  data = survival_data
)
```

---
class: left, middle, r-logo

## Kaplan-Meier survival curves

```r
library(survminer)

# show survival curves with p-value estimate and confidence intervals
survminer::ggsurvplot(
  kmestimate_sentimentcat,
  pval = TRUE,
  conf.int = TRUE,
  palette = c("blue", "green", "red"),
  linetype = c("solid", "dashed", "dotted"),
  xlab = "Month",
  ylab = "Retention Rate"
)
```

<img src="index_files/figure-html/unnamed-chunk-15-1.png" height="400" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## We can do the same for the `intention` survey response

<img src="index_files/figure-html/unnamed-chunk-16-1.png" height="400" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Quick poll

> What kind of regression model should we use to understand multivariate effects on a survival outcome?
>
> Linear Regression
>
> Structural Equation Model
>
> Cox Proportional Hazard Model
>
> Multinomial Logistic Regression

---
class: left, middle, r-logo

## Regression using Cox's proportional hazard model

We assume that each individual has a 'baseline' risk of departing as a function of time. This is called the 'baseline hazard' function and denoted `\(h_0(t)\)`. As in a logistic regression model, each input variable `\(x_1, x_2, \dots, x_p\)` has an effect on that baseline hazard, increasing or decreasing it by a certain multiple according to a set of coefficients `\(\beta_1, \beta_2, \dots, \beta_p\)`.

For individual A, their personal hazard function will be:

$$
h_A(t) = h_0(t)e^{\beta_1x_1^A + \beta_2x_2^A + \dots + \beta_px_p^A}
$$

If we take another person B and then ask 'what is their hazard function relative to person A?', we get:

$$
`\begin{aligned}
\frac{h_B(t)}{h_A(t)} &= \frac{h_0(t)e^{\beta_1x_1^B + \beta_2x_2^B + \dots + \beta_px_p^B}}{h_0(t)e^{\beta_1x_1^A + \beta_2x_2^A + \dots + \beta_px_p^A}} \\
&= e^{\beta_1(x_1^B-x_1^A) + \beta_2(x_2^B-x_2^A) + \dots + \beta_p(x_p^B-x_p^A)}
\end{aligned}`
$$

Note that time has disappeared from our calculation.
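For example, if person B is identical to person A except that B scores one unit higher on `\(x_1\)`, so that `\(x_1^B - x_1^A = 1\)` and all other differences are zero, the ratio reduces to

$$
\frac{h_B(t)}{h_A(t)} = e^{\beta_1}
$$

so each exponentiated coefficient can be read as a *hazard ratio*: the multiplicative effect on the hazard of a one-unit increase in that input, holding the others constant. This is the `exp(coef)` column in the model output later in this deck.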
This approach works on the assumption that `\(h_A(t)\)` and `\(h_B(t)\)` are *proportional to each other* - in other words, the hazard curves do not cross. This assumption needs to be checked before finalizing your model (see Section 9.2.2 of the textbook; a sketch of the check appears at the end of this deck).

---
class: left, middle, r-logo

## Running a Cox proportional hazard model

We can run a multiple regression to understand the relative effects of many input variables:

```r
# run cox model against survival outcome
cox_model <- survival::coxph(
  formula = Surv(event = departure_event, time = month) ~ gender + department + 
    level + sentiment + intention,
  data = survival_data
)

summary(cox_model)
```

---
class: left, middle, r-logo

## Interpreting model results

```
## Call:
## survival::coxph(formula = Surv(event = departure_event, time = month) ~ 
##     gender + department + level + sentiment + intention, data = survival_data)
## 
##   n= 3770, number of events= 1354 
## 
##                                      coef exp(coef) se(coef)      z Pr(>|z|)    
## genderM                           0.04324   1.04418  0.05899  0.733 0.463556    
## departmentHR                     -0.25350   0.77608  0.06684 -3.792 0.000149 ***
## departmentIT                     -0.14364   0.86620  0.08576 -1.675 0.093978 .  
## departmentR&D                     0.04249   1.04341  0.12699  0.335 0.737920    
## departmentSales/Marketing        -0.12443   0.88300  0.09949 -1.251 0.211053    
## departmentStrategy & Operations  -0.12534   0.88220  0.14328 -0.875 0.381689    
## levelLow                          0.17191   1.18757  0.08976  1.915 0.055450 .  
## levelMedium                       0.12096   1.12858  0.10201  1.186 0.235707    
## sentiment                        -0.02984   0.97060  0.01521 -1.961 0.049830 *  
## intention                         0.20556   1.22821  0.01354 15.181  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                                 exp(coef) exp(-coef) lower .95 upper .95
## genderM                            1.0442     0.9577    0.9302    1.1722
## departmentHR                       0.7761     1.2885    0.6808    0.8847
## departmentIT                       0.8662     1.1545    0.7322    1.0248
## departmentR&D                      1.0434     0.9584    0.8135    1.3383
## departmentSales/Marketing          0.8830     1.1325    0.7266    1.0731
## departmentStrategy & Operations    0.8822     1.1335    0.6662    1.1682
## levelLow                           1.1876     0.8421    0.9960    1.4160
## levelMedium                        1.1286     0.8861    0.9241    1.3784
## sentiment                          0.9706     1.0303    0.9421    1.0000
## intention                          1.2282     0.8142    1.1960    1.2612
## 
## Concordance= 0.638  (se = 0.008 )
## Likelihood ratio test= 309  on 10 df,   p=<2e-16
## Wald test            = 327.2  on 10 df,   p=<2e-16
## Score (logrank) test = 337.3  on 10 df,   p=<2e-16
```

---
class: left, middle, r-logo

## Thank you!

Key textbook chapters to learn this methodology...

* Chapter 3 - Hypothesis testing
* Chapter 4 - Basic elements of multivariate regression
* Chapter 5 - Binomial logistic regression (binary event modeling)
* Chapter 9 - Survival analysis
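---
class: left, middle, r-logo

## Appendix: checking the proportional hazards assumption

For completeness, here is a minimal sketch of the assumption check mentioned earlier, applied to the `cox_model` fitted above (Section 9.2.2 of the textbook covers this in depth). `survival::cox.zph()` tests each model term for evidence against proportional hazards; non-significant p-values suggest the assumption is reasonable.

```r
# test the proportional hazards assumption for each model term
ph_check <- survival::cox.zph(cox_model)
ph_check

# plot scaled Schoenfeld residuals over time for a visual check
plot(ph_check)
```

The `survminer` package loaded earlier also provides `ggcoxzph()` for a ggplot2-styled version of the same diagnostic.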