class: center, middle, inverse, title-slide

# Explaining Performance Ratings Using Proportional Odds Regression

### Keith McNulty

---
class: left, middle, r-logo

## Note on source

This document is a learning summary developed for the NY Strategic HR Analytics Meetup on 15 January 2021.  It is based on material in the open source book *[Handbook of Regression Modeling in People Analytics](http://peopleanalytics-regression-book.org)*.  Please consult this book for a deeper treatment.

The code for this document is [here](https://github.com/keithmcnulty/peopleanalytics-regression-book/tree/master/presentations).

## Note on languages

This document is coded in R.  The model type used in this document (proportional odds logistic regression) is not currently well supported in Python.

---
class: left, middle, r-logo

## Example data set: salespeople performance ratings

Let's look at some data relating to the performance ratings of salespeople in an organization, on a scale of 1 to 3 of increasing performance:

```r
## download data
url <- "https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/presentations/employee_performance.csv"
employee_performance <- read.csv(url)

## look at the data
head(employee_performance)
```

```
##   sales new_customers region gender rating
## 1 10.10             9  North      M      2
## 2 11.67            17   West      F      3
## 3  8.91            13  North      M      3
## 4  7.73            10  South      F      2
## 5 11.36            12  North      M      2
## 6  8.06            12  North      M      2
```

Sales is in $m.  We are asked which of the other four variables influence the performance rating, and how they influence it.

---
class: left, middle, r-logo

## What does our data look like?

The first step in any modeling process is to make sure your data is in good shape.

```r
## look at the data types - are they appropriate?
str(employee_performance)
```

```
## 'data.frame':	366 obs. of  5 variables:
##  $ sales        : num  10.1 11.67 8.91 7.73 11.36 ...
##  $ new_customers: int  9 17 13 10 12 12 10 12 9 11 ...
##  $ region       : chr  "North" "West" "North" "South" ...
##  $ gender       : chr  "M" "F" "M" "F" ...
##  $ rating       : int  2 3 3 2 2 2 3 2 2 2 ...
```

---
class: left, middle, r-logo

## Adjusting data types for appropriate modeling

We need to make sure that `region` and `gender` are understood as **unordered** categorical columns of data.  We also need to make sure that the `rating` column is understood as an **ordered** categorical column.

```r
## unordered categories
cat_columns <- c("region", "gender")
employee_performance[cat_columns] <- lapply(employee_performance[cat_columns], as.factor)

## ordered categories
employee_performance$rating <- ordered(employee_performance$rating, levels = 1:3)

str(employee_performance)
```

```
## 'data.frame':	366 obs. of  5 variables:
##  $ sales        : num  10.1 11.67 8.91 7.73 11.36 ...
##  $ new_customers: int  9 17 13 10 12 12 10 12 9 11 ...
##  $ region       : Factor w/ 4 levels "East","North",..: 2 4 2 3 2 2 2 2 1 2 ...
##  $ gender       : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 2 1 2 2 ...
##  $ rating       : Ord.factor w/ 3 levels "1"<"2"<"3": 2 3 3 2 2 2 3 2 2 2 ...
```

---
class: left, middle, r-logo

## Quick visualizations of bivariate relationships in the data

```r
library(GGally)

## create a pairplot
GGally::ggpairs(employee_performance)
```

<img src="polr_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## We want to answer a few questions as precisely as we can

1. Which variables are meaningful in explaining employee performance?
2. To what extent does each meaningful variable influence performance?
3. How much of the entire variance of performance do these variables explain?

---
class: left, middle, r-logo

## First we need to select the right type of model

By asking some questions about the data we have, we can determine an appropriate model to use.

**Question 1:** What type of outcome are we studying?
*Ordered categories*

**Question 2:** Can we assume that each input acts similarly on each level of the outcome?

*Yes for now, and we can check this later.  This is called the proportional odds assumption.*

Therefore we should use a proportional odds logistic regression model.  This model will tell us how each input variable affects the *odds of someone having a higher performance rating*.  See [Section 7.1](http://peopleanalytics-regression-book.org/ord-reg.html#when-to-use-it-2) for more details.

---
class: left, middle, r-logo

## Run the model

We can use the `polr()` function from the `MASS` package to run the model.

```r
library(MASS)

# formula for the model
our_formula <- "rating ~ ."

# run model
model <- polr(data = employee_performance, formula = our_formula)
```

Now we have a model ready to be viewed.

---
class: left, middle, r-logo

## Viewing and clarifying the results

I like to use the `broom` package to view model results in a tidy way and to easily add columns to the results.

```r
library(dplyr)
library(broom)

(model_results <- tidy(model) %>% 
    filter(coef.type == "coefficient"))
```

```
## # A tibble: 6 x 5
##   term          estimate std.error statistic coef.type  
##   <chr>            <dbl>     <dbl>     <dbl> <chr>      
## 1 sales           0.385     0.0477    8.07   coefficient
## 2 new_customers   0.387     0.0413    9.37   coefficient
## 3 regionNorth     0.182     0.322     0.564  coefficient
## 4 regionSouth     0.0944    0.335     0.281  coefficient
## 5 regionWest      0.303     0.321     0.944  coefficient
## 6 genderM        -0.0222    0.230    -0.0965 coefficient
```

See [Section 10.1.1](http://peopleanalytics-regression-book.org/alt-approaches.html#the-broom-package) for more information on tidy output and `broom`.

---
class: left, middle, r-logo

## Are variables significant in explaining performance?

To determine this, we need to convert our `statistic` into a p-value and determine whether the p-value is less than a chosen significance level (alpha), usually 0.05.  See [Section 3.3](http://peopleanalytics-regression-book.org/found-stats.html#hyp-tests).
```r
## add p-value
(model_results <- model_results %>% 
    dplyr::mutate(
      pval = (1 - pnorm(abs(statistic), 0, 1))*2
    ))
```

```
## # A tibble: 6 x 6
##   term          estimate std.error statistic coef.type       pval
##   <chr>            <dbl>     <dbl>     <dbl> <chr>          <dbl>
## 1 sales           0.385     0.0477    8.07   coefficient 6.66e-16
## 2 new_customers   0.387     0.0413    9.37   coefficient 0.      
## 3 regionNorth     0.182     0.322     0.564  coefficient 5.73e- 1
## 4 regionSouth     0.0944    0.335     0.281  coefficient 7.78e- 1
## 5 regionWest      0.303     0.321     0.944  coefficient 3.45e- 1
## 6 genderM        -0.0222    0.230    -0.0965 coefficient 9.23e- 1
```

We can safely drop everything except `sales` and `new_customers`, because `region` and `gender` have no significant effect.

---
class: left, middle, r-logo

## Simplify the model

```r
simpler_formula <- "rating ~ sales + new_customers"

## create simpler model
simpler_model <- polr(data = employee_performance, formula = simpler_formula)

## generate tidy results
(simpler_results <- tidy(simpler_model) %>% 
    filter(coef.type == "coefficient"))
```

```
## # A tibble: 2 x 5
##   term          estimate std.error statistic coef.type  
##   <chr>            <dbl>     <dbl>     <dbl> <chr>      
## 1 sales            0.380    0.0471      8.07 coefficient
## 2 new_customers    0.389    0.0411      9.48 coefficient
```

---
class: left, middle, r-logo

## How does each variable affect performance?

We need to take exponents of the `estimate` to get an interpretable *odds ratio*.  See [Section 7.2](http://peopleanalytics-regression-book.org/ord-reg.html#modeling-ordinal-outcomes-under-the-assumption-of-proportional-odds).
```r
## create odds ratio column
(simpler_results <- simpler_results %>% 
    dplyr::mutate(
      odds_ratio = exp(estimate)
    ))
```

```
## # A tibble: 2 x 6
##   term          estimate std.error statistic coef.type   odds_ratio
##   <chr>            <dbl>     <dbl>     <dbl> <chr>            <dbl>
## 1 sales            0.380    0.0471      8.07 coefficient       1.46
## 2 new_customers    0.389    0.0411      9.48 coefficient       1.48
```

The odds ratio is the multiplier of the odds of a higher category associated with a one unit increase in the input variable, **assuming all other variables are equal**.  Here's how we interpret this:

1. For individuals with similar numbers of new customers, each additional $1m of sales is associated with a ~46% increase in the odds of higher performance.
2. For individuals with similar sales, each additional new customer is associated with a ~48% increase in the odds of higher performance.

---
class: left, middle, r-logo

## How much of the overall variance in performance is explained by this model?

In a linear model we would use an `\(R^2\)` value to answer this question.  In a logistic model there are several variants of this, called *pseudo-* `\(R^2\)`.  Let's look at a couple of popular variants.  See [Section 5.3.2](http://peopleanalytics-regression-book.org/bin-log-reg.html#logistic-gof) for more information on these metrics.

```r
library(DescTools)

PseudoR2(simpler_model, which = c("CoxSnell", "Nagelkerke"))
```

```
##   CoxSnell Nagelkerke 
##  0.4956424  0.5600221
```

This statistic should be interpreted and referenced with care.  For example:

> "It is estimated that sales and new customer acquisition explain somewhere around half of the variance in performance."

---
class: left, middle, r-logo

## Graphing the model

Because this model has only two input variables, it can be visualized in three dimensions.  Different surfaces can represent the probability of each performance level.  See appendix for code.
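---
class: left, middle, r-logo

## Using the model for prediction

Once we are happy with the simplified model, we can also use it to estimate rating probabilities for new observations via `predict()`.  This is a minimal sketch — the salesperson values in `new_employees` are hypothetical, invented purely for illustration:

```r
## hypothetical new salespeople (values assumed for illustration only)
new_employees <- data.frame(
  sales = c(8, 12),
  new_customers = c(10, 15)
)

## predicted probability of each rating level (one column per level)
predict(simpler_model, newdata = new_employees, type = "probs")

## most likely rating level for each individual
predict(simpler_model, newdata = new_employees, type = "class")
```

For each row, the three probabilities returned by `type = "probs"` sum to 1, and `type = "class"` simply picks the most probable level.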
---
class: left, middle, r-logo

## Which chapters of the book should I read to fully understand this model?

1. Chapter 3 on foundational statistics
2. Chapter 4 on the foundations of regression
3. Chapter 5 on logistic regression
4. Chapter 7 on proportional odds

---
class: center, middle, r-logo

# Thank you!  Questions?

---
class: left, middle, r-logo

## Appendix: Checking our assumption of proportional odds

See [Section 7.3](http://peopleanalytics-regression-book.org/ord-reg.html#testing-the-proportional-odds-assumption) for more information on how to test the proportional odds assumption.  The quickest way is to run a Brant-Wald test on your model, looking for p-values less than 0.05 to indicate failure of the test.

```r
library(brant)

brant(simpler_model)
```

```
## -------------------------------------------- 
## Test for	X2	df	probability 
## -------------------------------------------- 
## Omnibus		5.52	2	0.06
## sales		3.87	1	0.05
## new_customers	2.29	1	0.13
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds
```

Here we can see that we pass the Brant-Wald test, but only just.  If the test had failed, we would need to model each level of the rating separately using different binomial models.

---
class: left, middle, r-logo

## Appendix: Testing goodness of fit of the model

The goodness of fit test is an indicator of whether we have used an appropriate model to explain the data.  For a proportional odds model, this test fails if the p-value is less than 0.05.  Here is an example which confirms a good fit for our model.  See [Section 7.3.2](http://peopleanalytics-regression-book.org/ord-reg.html#model-diagnostics) for further information.
```r
library(generalhoslem)

lipsitz.test(simpler_model)
```

```
## 
## 	Lipsitz goodness of fit test for ordinal response models
## 
## data:  formula:  rating ~ sales + new_customers
## LR statistic = 12.366, df = 9, p-value = 0.1935
```

---
class: left, middle, r-logo

## Appendix: Code for 3D-plot

```r
library(plotly)

plot_ly(data = employee_performance) %>%
  add_trace(z = simpler_model$fitted.values[ , 1], 
            x = ~sales, 
            y = ~new_customers, 
            type = "mesh3d", 
            name = "P(Low Performance)") %>%
  add_trace(z = simpler_model$fitted.values[ , 2], 
            x = ~sales, 
            y = ~new_customers, 
            type = "mesh3d", 
            name = "P(Middle Performance)") %>%
  add_trace(z = simpler_model$fitted.values[ , 3], 
            x = ~sales, 
            y = ~new_customers, 
            type = "mesh3d", 
            name = "P(High Performance)") %>%
  layout(scene = list(xaxis = list(title = 'sales'), 
                      yaxis = list(title = 'new_customers'), 
                      zaxis = list(title = 'Probability'), 
                      aspectmode = 'cube'))
```
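As a quick sanity check before plotting, note that the `fitted.values` matrix of a `polr` model has one column per rating level, so each row should sum to 1.  A minimal sketch, assuming `simpler_model` has been fitted as in the main slides:

```r
## each row of fitted.values is a probability distribution over the
## three rating levels, so the row totals should all be (numerically) 1
row_totals <- rowSums(simpler_model$fitted.values)
all(abs(row_totals - 1) < 1e-8)
```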