This document is a summary learning developed for the R Ladies Tunis Meetup on 26 February 2021. It is based on material in the open source book Handbook of Regression Modeling in People Analytics and other named sources. Please consult this book for a deeper understanding/treatment.
This document is coded in R Markdown using the xaringan package. The code for this document is here.
Approach | Objective | Examples | Considerations |
---|---|---|---|
Inferential Modeling (Why?) | Accurately describe a relationship between inputs and an outcome | Regression models | No need for train/test split. Priority is to understand and interpret coefficient and fit parameters. |
Predictive Modeling (What if?) | Accurately predict the outcome for new observations | Machine learning models (decision trees, neural networks, some regression models) | Train/test split required. Priority is accurate prediction of test observations (see the sketch below). |
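As a rough illustration of the train/test workflow in the predictive row above (a sketch added here, not from the original slides, using the built-in mtcars data):

```r
# illustrative sketch of a train/test split for predictive modeling
set.seed(123)

# randomly assign 75% of rows to a training set
train_rows <- sample(nrow(mtcars), size = round(0.75 * nrow(mtcars)))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# fit on the training data only
fit <- lm(mpg ~ wt + hp, data = train)

# judge the model by how well it predicts the held-out test observations
preds <- predict(fit, newdata = test)
sqrt(mean((test$mpg - preds)^2))  # test set RMSE
```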
The peopleanalyticsdata package

```r
# install package if needed
# install.packages("peopleanalyticsdata")
library(peopleanalyticsdata)
data(package = "peopleanalyticsdata")
```

```
Data sets in package 'peopleanalyticsdata':

charity_donation       Charity donation data
employee_performance   Employee performance data
employee_survey        Employee survey data
graduates              Graduate salary data
health_insurance       Health insurance data
job_retention          Job retention data
learning               Learning program feedback data
managers               Manager performance data
politics_survey        Politics survey data
promotion              Promotion data
recruiting             Recruiting data
salespeople            Salespeople promotion data
soccer                 Soccer discipline data
sociological_data      Sociological survey data
speed_dating           Speed dating data
ugtests                Undergraduate examination data
```
In statistics and analytics, we are usually asked questions about a population, but we only have a sample of data.
Example: imagine a research study to determine whether vaccination with two doses of different COVID vaccines produces a different outcome from vaccination with two doses of the same vaccine.
The question refers to everyone - past, present and future. We want to make a general conclusion that can guide future vaccination strategy.
But our data will not be data for everyone - it will only be a sample. Generally, bigger samples give us more certainty about the true answer to the question, but we can never be 100% certain of the truth.
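A quick simulation (my illustration, not from the original slides) shows why bigger samples give more certainty: estimates from larger samples cluster more tightly around the true population value.

```r
# how much do sample means vary around the true mean for different sample sizes?
set.seed(123)

sd_of_sample_means <- function(n, reps = 1000) {
  # simulate many samples of size n from a hypothetical population
  # (made-up mean of 500 and sd of 150, purely for illustration)
  sd(replicate(reps, mean(rnorm(n, mean = 500, sd = 150))))
}

# the spread of the sample mean shrinks as the sample size grows
sapply(c(10, 100, 1000), sd_of_sample_means)
```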
The salespeople data set

```r
# view first few rows of salespeople data
head(salespeople)
```

```
##   promoted sales customer_rate performance
## 1        0   594          3.94           2
## 2        0   446          4.06           3
## 3        1   674          3.83           4
## 4        0   525          3.62           2
## 5        1   657          4.40           3
## 6        1   918          4.54           2
```

```r
# view summary
summary(salespeople)
```

```
##     promoted          sales       customer_rate    performance 
##  Min.   :0.0000   Min.   :151.0   Min.   :1.000   Min.   :1.0  
##  1st Qu.:0.0000   1st Qu.:389.2   1st Qu.:3.000   1st Qu.:2.0  
##  Median :0.0000   Median :475.0   Median :3.620   Median :3.0  
##  Mean   :0.3219   Mean   :527.0   Mean   :3.608   Mean   :2.5  
##  3rd Qu.:1.0000   3rd Qu.:667.2   3rd Qu.:4.290   3rd Qu.:3.0  
##  Max.   :1.0000   Max.   :945.0   Max.   :5.000   Max.   :4.0  
##                   NA's   :1       NA's   :1       NA's   :1    
```
```r
library(dplyr)

# reduce to complete cases to remove NAs
salespeople <- salespeople %>% 
  dplyr::filter(complete.cases(.))
```

```r
# high performer sales
high <- salespeople %>% 
  dplyr::filter(performance == 4) %>% 
  dplyr::pull(sales)

# low performer sales
low <- salespeople %>% 
  dplyr::filter(performance == 1) %>% 
  dplyr::pull(sales)

# mean difference
(mean_diff <- mean(high) - mean(low))
```

```
## [1] 154.9742
```
Let's assume that high performers generate the same or lower sales than low performers. We call this the null hypothesis.
We now need to examine the likelihood that our sample would look the way it does under the null hypothesis.
If we find that this is very unlikely, we reject the null hypothesis and confirm the alternative hypothesis: that high performers generate higher sales.
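To build intuition for this logic, here is a small permutation sketch (my illustration, not the method used on the following slides; it assumes the high, low and mean_diff objects created above): shuffle the sales figures between the two groups many times and see how often a mean difference at least as large as ours arises by chance.

```r
# permutation sketch: shuffle sales figures between the high and low groups
# and record the mean difference each time
set.seed(123)
combined <- c(high, low)

perm_diffs <- replicate(10000, {
  shuffled <- sample(combined)
  mean(shuffled[seq_along(high)]) - mean(shuffled[-seq_along(high)])
})

# proportion of shuffles producing a difference at least as large as the one
# observed: an approximate p-value under the null hypothesis
mean(perm_diffs >= mean_diff)
```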
The mean sales difference is a statistic estimated from a sample, and is itself a random variable. It follows a t-distribution, an approximation of the normal distribution appropriate for sample data. The standard deviation of this distribution is known as the standard error of the statistic. In this case, the standard error of the mean difference can be calculated as follows:
$$\sqrt{\frac{s_{\mathrm{high}}^2}{n_{\mathrm{high}}} + \frac{s_{\mathrm{low}}^2}{n_{\mathrm{low}}}}$$

where $s_{\mathrm{high}}$ and $s_{\mathrm{low}}$ are the standard deviations of the high and low sales samples, and $n_{\mathrm{high}}$ and $n_{\mathrm{low}}$ are the two sample sizes.
```r
# standard error
(se <- sqrt(sd(high)^2/length(high) + sd(low)^2/length(low)))
```

```
## [1] 33.47554
```
```r
library(ggplot2)

(g <- ggplot(data.frame(x = c(-5, 5)), aes(x = x)) +
    stat_function(fun = dt, args = list(df = 100.98), color = "blue") +
    geom_vline(xintercept = -mean_diff/se, linetype = "dashed", color = "red") +
    labs(x = "Standard errors around sample mean difference", y = "Density") +
    theme_minimal())
```
We will use a t-distribution with 100.98 degrees of freedom [1]. Zero on the x-axis represents our sample mean difference of 154.97, and each unit of x is one standard error (33.48). A mean difference of zero or less equates to the area to the left of the red dashed line, which is very unlikely.
[1] I have calculated this manually via the Welch–Satterthwaite equation, but it is calculated automatically as part of the t.test() function in R. You can see my manual calculation in the source code of this document.
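For reference, a reconstruction of that manual calculation might look like this (a sketch assuming the high and low vectors created earlier; the exact code in the source document may differ):

```r
# Welch–Satterthwaite approximation to the degrees of freedom
v_high <- sd(high)^2 / length(high)
v_low  <- sd(low)^2  / length(low)

(welch_df <- (v_high + v_low)^2 /
    (v_high^2 / (length(high) - 1) + v_low^2 / (length(low) - 1)))
```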
We calculated in our sample that the mean sales difference was 154.97. We know from our t-distribution that there is a range of likelihood around this for the actual mean sales difference for the population.
We can convert our standard errors into probability estimates. For example, suppose we want to know how many standard errors below our sample estimate we need to go so that the resulting range has a 95% chance of containing the true population statistic:
```r
# get se multiple for 95% confidence
(conf_mult <- qt(p = 0.05, df = 100.98))
```

```
## [1] -1.660084
```

```r
# now we can create a lower bound for 95% confidence interval
(lower_bound <- mean_diff + conf_mult*se)
```

```
## [1] 99.40205
```

```r
g + geom_vline(xintercept = conf_mult, color = "purple")
```
The red dashed line, which represents the boundary for our null hypothesis, is far outside the 95% confidence interval. We can be more than 95% confident in the alternative hypothesis: that high performers generate higher sales.
We can find where the red-dashed line falls in the probability density of our t-distribution and ask 'what is the maximum probability of such a difference occurring?'.
```r
# get lower-tail probability
(pval <- pt(-mean_diff/se, df = 100.98, lower.tail = TRUE))
```

```
## [1] 5.46606e-06
```
We look for the p-value to be less than a predetermined threshold called α in order to reject the null hypothesis; usually α = 0.05. Again we reject the null hypothesis in favour of the alternative hypothesis.
The t.test() function in R does all this for you:

```r
t.test(high, low, alternative = "greater")
```

```
## 
##  Welch Two Sample t-test
## 
## data:  high and low
## t = 4.6295, df = 100.98, p-value = 5.466e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  99.40204      Inf
## sample estimates:
## mean of x mean of y 
##  619.8909  464.9167
```
cor.test()

cor.test() tests the alternative hypothesis that there is a non-zero correlation between two variables in a sample.

```r
cor.test(salespeople$sales, salespeople$customer_rate)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  salespeople$sales and salespeople$customer_rate
## t = 6.6952, df = 348, p-value = 8.648e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2415282 0.4274964
## sample estimates:
##      cor 
## 0.337805
```
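As a check on the mechanics (a sketch added here, not part of the original slides), the t statistic reported above can be reconstructed from the sample correlation and the sample size alone:

```r
# reconstruct the test statistic: t = r * sqrt((n - 2) / (1 - r^2)),
# which follows a t-distribution with n - 2 degrees of freedom
r <- cor(salespeople$sales, salespeople$customer_rate)
n <- nrow(salespeople)

(t_stat <- r * sqrt((n - 2) / (1 - r^2)))
```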
chisq.test()

chisq.test() tests the alternative hypothesis that there is a difference in the distribution of a variable between categories in the data set.

```r
# contingency table of promoted versus performance
(contingency <- table(salespeople$promoted, salespeople$performance))
```

```
##    
##      1  2  3  4
##   0 50 85 77 25
##   1 10 25 48 30
```

```r
# chisq test on contingency table
chisq.test(contingency)
```

```
## 
##  Pearson's Chi-squared test
## 
## data:  contingency
## X-squared = 25.895, df = 3, p-value = 1.003e-05
```
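To illustrate what the statistic measures (a sketch added here, not in the original slides), it compares the observed counts with the counts we would expect if promotion were independent of performance:

```r
# expected counts if promotion were independent of performance
expected <- outer(rowSums(contingency), colSums(contingency)) / sum(contingency)

# chi-squared statistic: squared deviations from expectation, scaled by expectation
(chi_sq <- sum((contingency - expected)^2 / expected))
```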
In 1885, Francis Galton did a study on some British children to see how their heights related to that of their parents. Let's grab Galton's data.
```r
# get data from URL
url <- "https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/data/Galton.txt"
galton <- read.delim(url)

# view first few rows
head(galton)
```

```
##   Family Father Mother Gender Height Kids
## 1      1   78.5   67.0      M   73.2    4
## 2      1   78.5   67.0      F   69.2    4
## 3      1   78.5   67.0      F   69.0    4
## 4      1   78.5   67.0      F   69.0    4
## 5      2   75.5   66.5      M   73.5    4
## 6      2   75.5   66.5      M   72.5    4
```
Galton simplistically expected the child's height to be perfectly explained by the average height of their parents. We can test that using a simple linear regression model.
```r
library(dplyr)

# create midParentHeight column
galton <- galton %>% 
  dplyr::mutate(midParentHeight = (Father + Mother)/2)

# simple linear regression of child vs mid-parent height
simplemodel <- lm(formula = Height ~ midParentHeight, data = galton)

# how much does it explain (R-squared)
summary(simplemodel)$r.squared
```

```
## [1] 0.1069774
```
```r
ggplot(galton, aes(x = midParentHeight, y = Height)) +
  geom_point(color = "blue") +
  geom_jitter(color = "blue") +
  geom_function(fun = function(x) {y = x}) +
  geom_function(fun = function(x) {y = mean(galton$Height)}, linetype = "dotdash") +
  geom_smooth(method = 'lm', formula = y~x, se = FALSE, color = "red", linetype = "dashed") +
  xlab("Mid-Parent Height (inches)") +
  ylab("Child Height (inches)") +
  theme_minimal()
```
```r
multimodel <- lm(formula = Height ~ Father + Mother + Gender, data = galton)
summary(multimodel)
```

```
## 
## Call:
## lm(formula = Height ~ Father + Mother + Gender, data = galton)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.523 -1.440  0.117  1.473  9.114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.34476    2.74696   5.586 3.08e-08 ***
## Father       0.40598    0.02921  13.900  < 2e-16 ***
## Mother       0.32150    0.03128  10.277  < 2e-16 ***
## GenderM      5.22595    0.14401  36.289  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.154 on 894 degrees of freedom
## Multiple R-squared:  0.6397, Adjusted R-squared:  0.6385 
## F-statistic:   529 on 3 and 894 DF,  p-value: < 2.2e-16
```
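Once fitted, the model can also be used to predict new cases. For example (illustrative parent heights, not from the original slides):

```r
# predicted height in inches for a son of a 70-inch father and a 65-inch mother
predict(multimodel, newdata = data.frame(Father = 70, Mother = 65, Gender = "M"))
```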
Outcome | Example | Approach |
---|---|---|
Continuous | Compensation in $ | Linear Regression |
Binary | Promoted or Not | Binomial Logistic Regression |
Nominal Category | Choice of Health Cover Product | Multinomial Logistic Regression |
Ordered Category | Performance Rating | Proportional Odds Regression |
Explicit grouping in data | Individuals in sales teams | Mixed/Multilevel Models |
Latent grouping in data | Large survey instruments | Structural Equation Models |
Time-associated events | Attrition | Survival Analysis/Cox Proportional Hazard Regression |
$$P(y = 1) = \frac{1}{1 + e^{-kx}}$$
```r
ggplot() +
  xlim(-5, 5) +
  xlab("x") +
  ylab("P (cumulative)") +
  geom_function(fun = pnorm, color = "red") +
  geom_function(fun = plogis, color = "blue", linetype = "dashed")
```
$$\ln\left(\frac{P(y=1)}{P(y=0)}\right) = \beta_0 + \beta_1 x$$

The log odds of the event are linear in our input variables. If we exponentiate this, we get:

$$\frac{P(y=1)}{P(y=0)} = e^{\beta_0 + \beta_1 x} = e^{\beta_0}\left(e^{\beta_1}\right)^x$$

If $x = 0$, the base odds of the event are $e^{\beta_0}$. For every unit increase in $x$, the odds are multiplied by $e^{\beta_1}$; this is called an odds ratio.
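A quick numeric sketch (with made-up coefficients, not from the original slides) shows why exponentiating a coefficient gives a constant multiplier on the odds:

```r
# made-up coefficients for illustration
b0 <- -2
b1 <- 0.5

odds <- function(x) exp(b0 + b1 * x)

# each unit increase in x multiplies the odds by the same factor, exp(b1)
odds(1:3) / odds(0:2)
exp(b1)
```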
The salespeople data set

We want to answer the question: how do the variables of sales, customer rating and performance rating influence the likelihood of a salesperson getting promoted?
```r
head(salespeople)
```

```
##   promoted sales customer_rate performance
## 1        0   594          3.94           2
## 2        0   446          4.06           3
## 3        1   674          3.83           4
## 4        0   525          3.62           2
## 5        1   657          4.40           3
## 6        1   918          4.54           2
```
What structure does our data have?
```r
str(salespeople)
```

```
## 'data.frame':    350 obs. of  4 variables:
##  $ promoted     : int  0 0 1 0 1 1 0 0 0 0 ...
##  $ sales        : int  594 446 674 525 657 918 318 364 342 387 ...
##  $ customer_rate: num  3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
##  $ performance  : int  2 3 4 2 3 2 3 1 3 3 ...
```
Our promoted and performance columns are currently integers, but they should be categorical factors. Let's convert them.
```r
# let's use some of the new dplyr functions
salespeople <- salespeople %>% 
  mutate(across(starts_with("p"), ~as.factor(.x)))

str(salespeople)
```

```
## 'data.frame':    350 obs. of  4 variables:
##  $ promoted     : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 1 1 ...
##  $ sales        : int  594 446 674 525 657 918 318 364 342 387 ...
##  $ customer_rate: num  3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
##  $ performance  : Factor w/ 4 levels "1","2","3","4": 2 3 4 2 3 2 3 1 3 3 ...
```
```r
library(GGally)

ggpairs(salespeople)
```
```r
model <- glm(formula = promoted ~ ., data = salespeople, family = "binomial")
summary(model)
```

```
## 
## Call:
## glm(formula = promoted ~ ., family = "binomial", data = salespeople)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.11890  -0.08910  -0.01981   0.00867   2.98107  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -19.858932   3.444079  -5.766 8.11e-09 ***
## sales           0.040124   0.006576   6.101 1.05e-09 ***
## customer_rate  -1.112131   0.482682  -2.304   0.0212 *  
## performance2    0.263000   1.021980   0.257   0.7969    
## performance3    0.684955   0.982167   0.697   0.4856    
## performance4    0.734493   1.071964   0.685   0.4932    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 440.303  on 349  degrees of freedom
## Residual deviance:  64.374  on 344  degrees of freedom
## AIC: 76.374
## 
## Number of Fisher Scoring iterations: 8
```
```r
exp(model$coefficients) %>% round(2)
```

```
##   (Intercept)         sales customer_rate  performance2  performance3 
##          0.00          1.04          0.33          1.30          1.98 
##  performance4 
##          2.08
```
Interpretation: holding the other variables constant, each additional unit of sales multiplies the odds of promotion by about 1.04, while each additional point of customer rating multiplies the odds by about 0.33 (i.e. reduces them). The performance coefficients are not statistically significant.
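A convenient way to express these odds ratios (a sketch added here, not in the original slides) is as a percentage change in the odds per unit increase in each input:

```r
# odds ratios re-expressed as the percentage change in the odds of promotion
# for a unit increase in each input, holding the others constant
round((exp(coef(model)) - 1) * 100, 2)
```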
```r
library(DescTools)

PseudoR2(model, which = c("CoxSnell", "Nagelkerke", "Tjur"))
```

```
##  CoxSnell Nagelkerke      Tjur 
## 0.6583888 0.9198193 0.8795290
```
Interpretation: our model explains more than two thirds of the variation in the promotion outcome.
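Another common variant, not shown in the output above, is McFadden's pseudo-R-squared, which can be computed directly from the deviances stored in the model object (a sketch for reference):

```r
# McFadden's pseudo R-squared: 1 minus the ratio of residual to null deviance
1 - model$deviance / model$null.deviance
```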
plotly

```r
library(plotly)

plot_ly(data = salespeople[complete.cases(salespeople), ]) %>% 
  add_trace(x = ~sales, y = ~customer_rate, z = ~promoted, 
            mode = "markers", type = "scatter3d",
            marker = list(size = 5, color = "blue", symbol = 104),
            name = "Observations") %>% 
  add_trace(z = model$fitted.values, x = ~sales, y = ~customer_rate, 
            type = "mesh3d", name = "Fitted values") %>% 
  layout(scene = list(xaxis = list(title = 'sales'),
                      yaxis = list(title = 'customer_rate', nticks = 5),
                      camera = list(eye = list(x = -0.5, y = 2, z = 0)),
                      zaxis = list(title = 'promoted'),
                      aspectmode = 'cube'))
```