class: center, middle, inverse, title-slide # Regression Analysis: The ‘Swiss Army Knife’ of Advanced People Analytics ### Keith McNulty --- class: left, middle, r-logo ## Note on source This document is a learning summary developed for the DFW People Analytics Meetup on 19 February 2021. It is based on material in the open source book *[Handbook of Regression Modeling in People Analytics](http://peopleanalytics-regression-book.org)* and other named sources. Please consult the book for a deeper treatment. The code for this document is [here](https://github.com/keithmcnulty/peopleanalytics-regression-book/tree/master/presentations/dfw_meetup.Rmd). ## Note on languages This document is coded in R. Alternative Python code for some of the methods is provided in the appendix. Some of the model types used in this document are not currently well supported in Python. --- class: left, middle, r-logo ## Skills in regression are key to operating at higher levels of the people analytics maturity staircase <img src="SVGZ-How-to-be-great-ex1.svg" width="80%" style="display: block; margin: auto;" /> .footnote[<i>Source: How To Be Great At People Analytics</i>, <a href = "https://www.mckinsey.com/business-functions/organization/our-insights/how-to-be-great-at-people-analytics">mckinsey.com</a>, October 2020] --- class: left, middle, r-logo ## Inferential vs Predictive Modeling * An *inferential model* is primarily concerned with *describing* a relationship between a set of input variables and an outcome variable. It is particularly useful for answering questions related to *why* something is happening. * A *predictive model* is primarily concerned with accurately predicting *whether* an event will occur. Often these models are complex and not particularly suited to answering 'why?'. * Regression models can act as both inferential models and as predictive models. 
* Questions in people analytics are most commonly concerned with why something is happening. --- class: left, middle, r-logo ## Fun exercise - the first-ever model to be called a 'regression model' In the late 19th century, Francis Galton did a study on some British children to see how their heights related to those of their parents. Let's grab Galton's data. ```r # get data from URL url <- "https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/data/Galton.txt" galton <- read.delim(url) # view first few rows head(galton) ``` ``` ## Family Father Mother Gender Height Kids ## 1 1 78.5 67.0 M 73.2 4 ## 2 1 78.5 67.0 F 69.2 4 ## 3 1 78.5 67.0 F 69.0 4 ## 4 1 78.5 67.0 F 69.0 4 ## 5 2 75.5 66.5 M 73.5 4 ## 6 2 75.5 66.5 M 72.5 4 ``` --- class: left, middle, r-logo ## What is the relationship between mid-parent height and child height? Galton simplistically expected the child's height to be perfectly explained by the average height of their parents. We can test that using a simple linear regression model. 
```r library(dplyr) # create midParentHeight column galton <- galton %>% dplyr::mutate(midParentHeight = (Father + Mother)/2) # simple linear regression of child vs mid-parent height simplemodel <- lm(formula = Height ~ midParentHeight, data = galton) # how much does it explain (R-squared) summary(simplemodel)$r.squared ``` ``` ## [1] 0.1069774 ``` --- class: left, middle, r-logo ## Galton realized that children 'regress' away from their parents' height towards a mean population height <img src="dfw_meetup_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- class: left, middle, r-logo ## Today, we would do a multiple regression ```r multimodel <- lm(formula = Height ~ Father + Mother + Gender, data = galton) summary(multimodel) ``` ``` ## ## Call: ## lm(formula = Height ~ Father + Mother + Gender, data = galton) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.523 -1.440 0.117 1.473 9.114 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 15.34476 2.74696 5.586 3.08e-08 *** ## Father 0.40598 0.02921 13.900 < 2e-16 *** ## Mother 0.32150 0.03128 10.277 < 2e-16 *** ## GenderM 5.22595 0.14401 36.289 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.154 on 894 degrees of freedom ## Multiple R-squared: 0.6397, Adjusted R-squared: 0.6385 ## F-statistic: 529 on 3 and 894 DF, p-value: < 2.2e-16 ``` --- class: left, middle, r-logo ## This model explains a lot about the *why* of a child's height * We can see which variables are significant explainers of a child's height - all of them are! * We can see the relative influence of each variable on a child's height - each *coefficient* tells us how many inches of height you gain for every extra unit of input, assuming all other inputs are the same. For example: a male child will be over 5 inches taller than a female child of the same parents. 
For two children of the same gender and mother height, the height difference will be about 40% of the height difference of their fathers. * We can see how much of the overall variation in child height is explained by these three variables: `\(R^2 = 0.64\)`. --- class: left, middle, r-logo ## We now have many different regression methods available <table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Outcome </th> <th style="text-align:left;"> Example </th> <th style="text-align:left;"> Approach </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Continuous </td> <td style="text-align:left;width: 30em; "> Compensation in $ </td> <td style="text-align:left;width: 30em; "> Linear Regression </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Binary </td> <td style="text-align:left;width: 30em; "> Promoted or Not </td> <td style="text-align:left;width: 30em; "> Binomial Logistic Regression </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Nominal Category </td> <td style="text-align:left;width: 30em; "> Choice of Health Cover Product </td> <td style="text-align:left;width: 30em; "> Multinomial Logistic Regression </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;background-color: yellow !important;"> Ordered Category </td> <td style="text-align:left;width: 30em; background-color: yellow !important;"> Performance Rating </td> <td style="text-align:left;width: 30em; background-color: yellow !important;"> Proportional Odds Regression </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Explicit grouping in data </td> <td style="text-align:left;width: 30em; "> Individuals in sales teams </td> <td style="text-align:left;width: 
30em; "> Mixed/Multilevel Models </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Latent grouping in data </td> <td style="text-align:left;width: 30em; "> Large survey instruments </td> <td style="text-align:left;width: 30em; "> Sturctural Equation Models </td> </tr> <tr> <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Time-associated events </td> <td style="text-align:left;width: 30em; "> Attrition </td> <td style="text-align:left;width: 30em; "> Survival Analysis/Cox Proportional Hazard Regression </td> </tr> </tbody> </table> --- class: left, middle, r-logo ## Example data set: salespeople performance ratings Let's look at some data that relates to the performance ratings of sales people in an organization on a scale of 1 to 3 of increasing performance: ```r # load peopleanalyticsdata package library(peopleanalyticsdata) # look at first rows of the employee_performance data set head(employee_performance) ``` ``` ## sales new_customers region gender rating ## 1 10.10 9 North M 2 ## 2 11.67 17 West F 3 ## 3 8.91 13 North M 3 ## 4 7.73 10 South F 2 ## 5 11.36 12 North M 2 ## 6 8.06 12 North M 2 ``` Sales is in $m. We are being asked which of the other four factors influence the performance rating and how do they influence it. --- class: left, middle, r-logo ## What does our data look like? The first data step in any modeling process is to make sure your data is in good shape. ```r ## look at the data types - are they appropriate? str(employee_performance) ``` ``` ## 'data.frame': 366 obs. of 5 variables: ## $ sales : num 10.1 11.67 8.91 7.73 11.36 ... ## $ new_customers: int 9 17 13 10 12 12 10 12 9 11 ... ## $ region : chr "North" "West" "North" "South" ... ## $ gender : chr "M" "F" "M" "F" ... ## $ rating : int 2 3 3 2 2 2 3 2 2 2 ... 
``` --- class: left, middle, r-logo ## Adjusting data types for appropriate modeling We need to make sure that `region` and `gender` are understood as **unordered** categorical columns of data. We also need to make sure that the `rating` column is understood as an **ordered** categorical column. ```r ## unordered categories cat_columns <- c("region", "gender") employee_performance[cat_columns] <- lapply(employee_performance[cat_columns], as.factor) ## ordered categories employee_performance$rating <- ordered(employee_performance$rating, levels = 1:3) str(employee_performance) ``` ``` ## 'data.frame': 366 obs. of 5 variables: ## $ sales : num 10.1 11.67 8.91 7.73 11.36 ... ## $ new_customers: int 9 17 13 10 12 12 10 12 9 11 ... ## $ region : Factor w/ 4 levels "East","North",..: 2 4 2 3 2 2 2 2 1 2 ... ## $ gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 2 1 2 2 ... ## $ rating : Ord.factor w/ 3 levels "1"<"2"<"3": 2 3 3 2 2 2 3 2 2 2 ... ``` --- class: left, middle, r-logo ## Quick visualizations of bivariate relationships in the data ```r library(GGally) ## create a pairplot GGally::ggpairs(employee_performance) ``` <img src="dfw_meetup_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- class: left, middle, r-logo ## We want to answer a few questions as precisely as we can 1. Which variables are meaningful in explaining employee performance? 2. To what extent does each meaningful variable influence performance? 3. How much of the entire variance of performance do these variables explain? --- class: left, middle, r-logo ## First we need to select the right type of model By asking a few questions about the data we have, we can determine an appropriate model to use. **Question 1:** What type of outcome are we studying? *Ordered categories* **Question 2:** Can we assume that each input acts similarly on each level of the outcome? *Yes for now; we can check this later. This is called the proportional odds assumption*. 
Then we should use a proportional odds logistic regression model. This model will tell us how each input variable affects the *odds of someone having a higher performance rating*. See [Section 7.1](http://peopleanalytics-regression-book.org/ord-reg.html#when-to-use-it-2) for more details. --- class: left, middle, r-logo ## Run the model We can use the `polr()` function from the `MASS` package to run the model. ```r library(MASS) # formula for the model our_formula <- "rating ~ ." # run model model <- polr(data = employee_performance, formula = our_formula) ``` Now we have a model ready to be viewed. --- class: left, middle, r-logo ## Viewing and clarifying the results I like to use the `broom` package to view model results in a tidy way and to easily add columns to the results. ```r library(dplyr) library(broom) (model_results <- tidy(model) %>% filter(coef.type == "coefficient")) ``` ``` ## # A tibble: 6 x 5 ## term estimate std.error statistic coef.type ## <chr> <dbl> <dbl> <dbl> <chr> ## 1 sales 0.385 0.0477 8.07 coefficient ## 2 new_customers 0.387 0.0413 9.37 coefficient ## 3 regionNorth 0.182 0.322 0.564 coefficient ## 4 regionSouth 0.0944 0.335 0.281 coefficient ## 5 regionWest 0.303 0.321 0.944 coefficient ## 6 genderM -0.0222 0.230 -0.0965 coefficient ``` See [Section 10.1.1](http://peopleanalytics-regression-book.org/alt-approaches.html#the-broom-package) for more information on tidy output and `broom`. --- class: left, middle, r-logo ## Are variables significant in explaining performance? To determine this we need to convert our `statistic` into a p-value, and check whether the p-value is less than a chosen alpha, typically 0.05. See [Section 3.3](http://peopleanalytics-regression-book.org/found-stats.html#hyp-tests). 
```r ## add p-value (model_results <- model_results %>% dplyr::mutate( pval = (1 - pnorm(abs(statistic), 0, 1))*2 )) ``` ``` ## # A tibble: 6 x 6 ## term estimate std.error statistic coef.type pval ## <chr> <dbl> <dbl> <dbl> <chr> <dbl> ## 1 sales 0.385 0.0477 8.07 coefficient 6.66e-16 ## 2 new_customers 0.387 0.0413 9.37 coefficient 0. ## 3 regionNorth 0.182 0.322 0.564 coefficient 5.73e- 1 ## 4 regionSouth 0.0944 0.335 0.281 coefficient 7.78e- 1 ## 5 regionWest 0.303 0.321 0.944 coefficient 3.45e- 1 ## 6 genderM -0.0222 0.230 -0.0965 coefficient 9.23e- 1 ``` We can safely drop everything except `sales` and `new_customers`, because `region` and `gender` have no significant effect. --- class: left, middle, r-logo ## Simplify the model ```r simpler_formula <- "rating ~ sales + new_customers" ## create simpler model simpler_model <- polr(data = employee_performance, formula = simpler_formula) ## generate tidy results (simpler_results <- tidy(simpler_model) %>% filter(coef.type == "coefficient")) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic coef.type ## <chr> <dbl> <dbl> <dbl> <chr> ## 1 sales 0.380 0.0471 8.07 coefficient ## 2 new_customers 0.389 0.0411 9.48 coefficient ``` --- class: left, middle, r-logo ## How does each variable affect performance? We need to take exponents of the `estimate` to get an interpretable *odds ratio*. See [Section 7.2](http://peopleanalytics-regression-book.org/ord-reg.html#modeling-ordinal-outcomes-under-the-assumption-of-proportional-odds). 
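Why exponentiating the estimate works can be sketched from the model itself. Writing \(x_1\) for `sales`, \(x_2\) for `new_customers` and \(\theta_k\) for the threshold of rating level \(k\), the proportional odds model makes the log-odds of exceeding level \(k\) linear in the inputs, so a one-unit increase in \(x_1\) multiplies those odds by a constant factor:

$$\frac{\text{odds}(y > k \mid x_1 + 1)}{\text{odds}(y > k \mid x_1)} = \frac{e^{\beta_1 (x_1 + 1) + \beta_2 x_2 - \theta_k}}{e^{\beta_1 x_1 + \beta_2 x_2 - \theta_k}} = e^{\beta_1}$$

This factor is the same for every level \(k\) — that is the proportional odds assumption at work.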
```r ## create odds ratio column (simpler_results <- simpler_results %>% dplyr::mutate( odds_ratio = exp(estimate) )) ``` ``` ## # A tibble: 2 x 6 ## term estimate std.error statistic coef.type odds_ratio ## <chr> <dbl> <dbl> <dbl> <chr> <dbl> ## 1 sales 0.380 0.0471 8.07 coefficient 1.46 ## 2 new_customers 0.389 0.0411 9.48 coefficient 1.48 ``` The odds ratio is the multiplier of the odds of a higher category associated with a one-unit increase in the input variable **assuming all other variables are equal**. Here's how we interpret this: 1. For individuals with a similar number of new customers, each additional $1m of sales is associated with a ~46% increase in the odds of higher performance. 2. For individuals with similar sales, each additional new customer is associated with a ~48% increase in the odds of higher performance. --- class: left, middle, r-logo ## How much of the overall variance in performance is explained by this model? In a linear model we would use an `\(R^2\)` value to gauge this. In a logistic model there are several variants of this called *pseudo-* `\(R^2\)`. Let's look at a few popular variants. See [Section 5.3.2](http://peopleanalytics-regression-book.org/bin-log-reg.html#logistic-gof) for more information on these metrics. ```r library(DescTools) PseudoR2(simpler_model, which = c("CoxSnell", "Nagelkerke")) ``` ``` ## CoxSnell Nagelkerke ## 0.4956424 0.5600221 ``` This statistic should be interpreted and referenced with care. For example: >"It is estimated that sales and new customer acquisition explain somewhere around half of the variance in performance" --- class: left, middle, r-logo ## Graphing the model Because it has only three dimensions, this model can be visualized. Different surfaces can represent the probability of each performance level. See appendix for code.
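These surfaces come directly from the fitted cumulative logits. Writing \(x_1\) for `sales`, \(x_2\) for `new_customers` and \(\theta_k\) for the threshold of rating level \(k\):

$$P(y \le k) = \frac{1}{1 + e^{-(\theta_k - \beta_1 x_1 - \beta_2 x_2)}}, \qquad P(y = k) = P(y \le k) - P(y \le k - 1)$$

so the probability surface for each performance level is a difference of two logistic surfaces.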
--- class: left, middle, r-logo ## Which chapters of the book should I read to fully understand this model? 1. Chapter 3 on foundational statistics 2. Chapter 4 on the foundations of regression 3. Chapter 5 on logistic regression 4. Chapter 7 on proportional odds --- class: center, middle, r-logo # Thank you! Questions? --- class: left, middle, r-logo ## Appendix: Checking our assumption of proportional odds See [Section 7.3](http://peopleanalytics-regression-book.org/ord-reg.html#testing-the-proportional-odds-assumption) for more info on how to test your proportional odds assumption. The quickest way is to do a Brant-Wald test on your model, looking for p-values less than 0.05 to indicate failure of the test. ```r library(brant) brant(simpler_model) ``` ``` ## -------------------------------------------- ## Test for X2 df probability ## -------------------------------------------- ## Omnibus 5.52 2 0.06 ## sales 3.87 1 0.05 ## new_customers 2.29 1 0.13 ## -------------------------------------------- ## ## H0: Parallel Regression Assumption holds ``` Here we can see that we pass the Brant test, but only just. If the Brant test fails, you would need to model each level of the rating separately using different binomial models. --- class: left, middle, r-logo ## Appendix: Testing goodness of fit of the model The goodness-of-fit test indicates whether we have used an appropriate model to explain the data. For a proportional odds model, this test fails if the p-value is less than 0.05. Here is an example that confirms a good fit for our model. See [Section 7.3.2](http://peopleanalytics-regression-book.org/ord-reg.html#model-diagnostics) for further information. 
```r library(generalhoslem) lipsitz.test(simpler_model) ``` ``` ## ## Lipsitz goodness of fit test for ordinal response models ## ## data: formula: rating ~ sales + new_customers ## LR statistic = 12.366, df = 9, p-value = 0.1935 ``` --- class: left, middle, r-logo ## Appendix: Code for 3D-plot ```r library(plotly) plot_ly(data = employee_performance) %>% add_trace(z = simpler_model$fitted.values[ , 1], x = ~sales, y = ~new_customers, type = "mesh3d", name = "P(Low Performance)") %>% add_trace(z = simpler_model$fitted.values[ , 2], x = ~sales, y = ~new_customers, type = "mesh3d", name = "P(Middle Performance)") %>% add_trace(z = simpler_model$fitted.values[ , 3], x = ~sales, y = ~new_customers, type = "mesh3d", name = "P(High Performance)") %>% layout(scene = list(xaxis = list(title = 'sales'), yaxis = list(title = 'new_customers'), zaxis = list(title = 'Probability'), aspectmode='cube')) ``` --- class: left, middle, r-logo ## Appendix: Running Galton regression in Python ```python import pandas as pd import statsmodels.formula.api as smf # get data url = "https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/data/Galton.txt" galton = pd.read_csv(url, delimiter = "\t") # define model model = smf.ols(formula = "Height ~ Father + Mother + Gender", data = galton) # fit model galton_model = model.fit() # see results summary print(galton_model.summary()) ```