class: center, middle, inverse, title-slide

# Statistical Inference and Regression Analysis using R
### Keith McNulty

---
class: left, middle, r-logo

## Note on source

This document is a learning summary developed for the R Ladies Tunis Meetup on 26 February 2021.  It is based on material in the open source book *[Handbook of Regression Modeling in People Analytics](http://peopleanalytics-regression-book.org)* and other named sources.  Please consult this book for a deeper understanding/treatment.

This document is coded in R Markdown using the `xaringan` package.  The code for this document is [here](https://github.com/keithmcnulty/peopleanalytics-regression-book/tree/master/presentations/r_ladies_tunis.Rmd).

---
class: left, middle, r-logo

## Inference versus Prediction

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Approach </th>
   <th style="text-align:left;"> Objective </th>
   <th style="text-align:left;"> Examples </th>
   <th style="text-align:left;"> Considerations </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;background-color: yellow !important;"> Inferential Modeling (Why?) </td>
   <td style="text-align:left;width: 30em; background-color: yellow !important;"> Accurately describe a relationship between inputs and an outcome </td>
   <td style="text-align:left;width: 30em; background-color: yellow !important;"> Regression models </td>
   <td style="text-align:left;background-color: yellow !important;"> No need for train/test split.  Priority is to understand and interpret coefficient and fit parameters. </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Predictive Modeling (What if?) </td>
   <td style="text-align:left;width: 30em; "> Accurately predict the outcome for new observations </td>
   <td style="text-align:left;width: 30em; "> Machine learning models (decision trees, neural networks, some regression models) </td>
   <td style="text-align:left;"> Train/test split required.  Priority is accurate prediction of test observations. </td>
  </tr>
</tbody>
</table>

---
class: left, middle, r-logo

## Examples of inferential methods we will look at today

* **Hypothesis testing**: a common approach to deciding if a statement is true for a population based on a sample of data (Chapter 3 of the book)

* **Linear regression**: a way of describing a linear relationship between a set of input variables and a *continuous* outcome variable (for example, height, weight, money) (Chapter 4 of the book)

* **Binomial logistic regression**: a way of describing the relationship between a set of input variables and the *likelihood of a binary event occurring* (Chapter 5 of the book)

---
class: left, middle, r-logo

## We will use datasets in the `peopleanalyticsdata` package

```r
# install package if needed
# install.packages("peopleanalyticsdata")
library(peopleanalyticsdata)

data(package = "peopleanalyticsdata")
```

```
Data sets in package ‘peopleanalyticsdata’:

charity_donation        Charity donation data
employee_performance    Employee performance data
employee_survey         Employee survey data
graduates               Graduate salary data
health_insurance        Health insurance data
job_retention           Job retention data
learning                Learning program feedback data
managers                Manager performance data
politics_survey         Politics survey data
promotion               Promotion data
recruiting              Recruiting data
salespeople             Salespeople promotion data
soccer                  Soccer discipline data
sociological_data       Sociological survey data
speed_dating            Speed dating data
ugtests                 Undergraduate examination data
```

---
class: left, middle, r-logo

## Populations and samples

In statistics and analytics, we are usually asked questions about a population, but we only have a sample of data.

**Example:** Imagine a research study to determine whether vaccination with two doses of different COVID vaccines produces a different outcome from vaccination with two doses of the same vaccine.

The question refers to *everyone* - past, present and future.  We want to make a general conclusion that can guide future vaccination strategy.  But our data will not be data for everyone - it will only be a sample.

Generally, bigger samples give us more certainty about the true answer to the question, but we can never be 100% certain of the truth.

---
class: left, middle, r-logo

# Example 1: Hypothesis testing

---
class: left, middle, r-logo

## The `salespeople` data set

```r
# view first few rows of salespeople data
head(salespeople)
```

```
##   promoted sales customer_rate performance
## 1        0   594          3.94           2
## 2        0   446          4.06           3
## 3        1   674          3.83           4
## 4        0   525          3.62           2
## 5        1   657          4.40           3
## 6        1   918          4.54           2
```

```r
# view summary
summary(salespeople)
```

```
##     promoted          sales       customer_rate    performance 
##  Min.   :0.0000   Min.   :151.0   Min.   :1.000   Min.   :1.0  
##  1st Qu.:0.0000   1st Qu.:389.2   1st Qu.:3.000   1st Qu.:2.0  
##  Median :0.0000   Median :475.0   Median :3.620   Median :3.0  
##  Mean   :0.3219   Mean   :527.0   Mean   :3.608   Mean   :2.5  
##  3rd Qu.:1.0000   3rd Qu.:667.2   3rd Qu.:4.290   3rd Qu.:3.0  
##  Max.   :1.0000   Max.   :945.0   Max.   :5.000   Max.   :4.0  
##                   NA's   :1       NA's   :1       NA's   :1    
```

---
class: left, middle, r-logo

## Do high-performing salespeople generate more sales than low-performing salespeople?
```r
library(dplyr)

# reduce to complete cases to remove NAs
salespeople <- salespeople %>% 
  dplyr::filter(complete.cases(.))
```

```r
# high performer sales
high <- salespeople %>% 
  dplyr::filter(performance == 4) %>% 
  dplyr::pull(sales)

# low performer sales
low <- salespeople %>% 
  dplyr::filter(performance == 1) %>% 
  dplyr::pull(sales)

# mean difference
(mean_diff <- mean(high) - mean(low))
```

```
## [1] 154.9742
```

---
class: left, middle, r-logo

## Yes for our sample, but what about for the entire population?

Let's assume that high performers generate the same or lower sales than low performers.  We call this the *null hypothesis*.

We now need to examine the likelihood that our sample would look the way it does under the null hypothesis.

If we find that this is very unlikely, we reject the null hypothesis and confirm the *alternative hypothesis*: that high performers generate higher sales.

---
class: left, middle, r-logo

## Any sampled statistic of a random variable is a random variable

The mean sales difference is an estimated statistic of a random variable.  This is itself a random variable with a `\(t\)`-distribution - an approximation of the normal distribution for sample data.  The standard deviation of this distribution is known as the *standard error* of the statistic.

In this case the standard error of the mean difference can be calculated as follows:

`$$\sqrt{\frac{s_{\mathrm{high}}^2}{n_{\mathrm{high}}} + \frac{s_{\mathrm{low}}^2}{n_{\mathrm{low}}}}$$`

where `\(s_{\mathrm{high}}\)` and `\(s_{\mathrm{low}}\)` are the standard deviations of the high and low sales samples, and `\(n_{\mathrm{high}}\)` and `\(n_{\mathrm{low}}\)` are the two sample sizes.

```r
# standard error
(se <- sqrt(sd(high)^2/length(high) + sd(low)^2/length(low)))
```

```
## [1] 33.47554
```

---
class: left, middle, r-logo

## Graph of the `\(t\)`-distribution

```r
library(ggplot2)

(g <- ggplot(data.frame(x = c(-5, 5)), aes(x = x)) +
  stat_function(fun = dt, args = list(df = 100.98), color = "blue") +
  geom_vline(xintercept = -mean_diff/se, linetype = "dashed", color = "red") +
  labs(x = "Standard errors around sample mean difference", y = "Density") +
  theme_minimal())
```

<img src="r_ladies_tunis_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

We will use a `\(t\)`-distribution with 100.98 degrees of freedom<sup>1</sup>.  Zero on this graph represents our sample mean difference of 154.97, and each unit of `\(x\)` represents one standard error of 33.48.  A mean difference of zero or less would equate to the area to the left of the red dashed line.  *Very unlikely*.

.footnote[[1] I have calculated this manually via the <a href = "https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation">Welch–Satterthwaite equation</a>, but it is automatically calculated as part of the `t.test()` function in R.  You can see my manual calculation in the source code of this document.]
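---
class: left, middle, r-logo

## Where did 100.98 degrees of freedom come from?

As a side note, here is a sketch of the manual Welch–Satterthwaite calculation mentioned in the footnote, using the `high` and `low` vectors defined earlier.  It should agree with the degrees of freedom that `t.test()` reports.

```r
# Welch–Satterthwaite approximation:
# df = (v_high + v_low)^2 / (v_high^2/(n_high - 1) + v_low^2/(n_low - 1)),
# where v = s^2/n is the squared standard error contribution of each sample
v_high <- var(high)/length(high)
v_low <- var(low)/length(low)

(v_high + v_low)^2 / 
  (v_high^2/(length(high) - 1) + v_low^2/(length(low) - 1))
```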
---
class: left, middle, r-logo

## Confidence intervals

We calculated in our sample that the mean sales difference was 154.97.  We know from our `\(t\)`-distribution that there is a range of likelihood around this for the actual mean sales difference for the population.

We can convert our standard errors into probability estimates.  For example, we can ask how many standard errors to the left of our sample estimate correspond to a 95% chance of containing the real population statistic:

```r
# get se multiple for 95% confidence
(conf_mult <- qt(p = 0.05, df = 100.98))
```

```
## [1] -1.660084
```

```r
# now we can create a lower bound for 95% confidence interval
(lower_bound <- mean_diff + conf_mult*se)
```

```
## [1] 99.40205
```

---
class: left, middle, r-logo

## We can be 95% confident that high performers generate higher sales

```r
g + geom_vline(xintercept = conf_mult, color = "purple")
```

<img src="r_ladies_tunis_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

The red dashed line, which represents the boundary for our null hypothesis, is far outside the 95% confidence interval.  We can be more than 95% confident in the alternative hypothesis, that high performers generate higher sales.

---
class: left, middle, r-logo

## Another route is to find the *p-value* of a zero population difference

We can find where the red dashed line falls in the probability density of our `\(t\)`-distribution and ask 'what is the maximum probability of such a difference occurring?'.

```r
# get lower-tail probability
(pval <- pt(-mean_diff/se, df = 100.98, lower.tail = TRUE))
```

```
## [1] 5.46606e-06
```

We look for the p-value to be less than a certain standard called `\(\alpha\)` in order to reject the null hypothesis.  Usually `\(\alpha = 0.05\)`.  Again we reject the null hypothesis in favour of the alternative hypothesis.

---
class: left, middle, r-logo

## The `t.test()` function in R does all this for you

```r
t.test(high, low, alternative = "greater")
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  high and low
## t = 4.6295, df = 100.98, p-value = 5.466e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  99.40204      Inf
## sample estimates:
## mean of x mean of y 
##  619.8909  464.9167
```

---
class: left, middle, r-logo

## Other common hypothesis tests: `cor.test()`

`cor.test()` tests the alternative hypothesis that there is a non-zero correlation between two variables in a sample.

```r
cor.test(salespeople$sales, salespeople$customer_rate)
```

```
## 
## 	Pearson's product-moment correlation
## 
## data:  salespeople$sales and salespeople$customer_rate
## t = 6.6952, df = 348, p-value = 8.648e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2415282 0.4274964
## sample estimates:
##      cor 
## 0.337805
```

---
class: left, middle, r-logo

## Other common hypothesis tests: `chisq.test()`

`chisq.test()` tests the alternative hypothesis that the distribution of a variable differs between categories in the data set.

```r
# contingency table of promoted versus performance
(contingency <- table(salespeople$promoted, salespeople$performance))
```

```
##    
##      1  2  3  4
##   0 50 85 77 25
##   1 10 25 48 30
```

```r
# chisq test on contingency table
chisq.test(contingency)
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  contingency
## X-squared = 25.895, df = 3, p-value = 1.003e-05
```

---
class: left, middle, r-logo

# Example 2: Linear Regression

---
class: left, middle, r-logo

## Fun exercise - the first-ever model to be called a 'regression model'

In 1885, Francis Galton did a study on some British children to see how their heights related to those of their parents.

Let's grab Galton's data.
```r
# get data from URL
url <- "https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/data/Galton.txt"
galton <- read.delim(url)

# view first few rows
head(galton)
```

```
##   Family Father Mother Gender Height Kids
## 1      1   78.5   67.0      M   73.2    4
## 2      1   78.5   67.0      F   69.2    4
## 3      1   78.5   67.0      F   69.0    4
## 4      1   78.5   67.0      F   69.0    4
## 5      2   75.5   66.5      M   73.5    4
## 6      2   75.5   66.5      M   72.5    4
```

---
class: left, middle, r-logo

## What is the relationship between mid-parent height and child height?

Galton simplistically expected the child's height to be perfectly explained by the average height of their parents.  We can test that using a simple linear regression model.

```r
library(dplyr)

# create midParentHeight column
galton <- galton %>% 
  dplyr::mutate(midParentHeight = (Father + Mother)/2)

# simple linear regression of child vs mid-parent height
simplemodel <- lm(formula = Height ~ midParentHeight, data = galton)

# how much does it explain (R-squared)
summary(simplemodel)$r.squared
```

```
## [1] 0.1069774
```

---
class: left, middle, r-logo

## Galton realized that children 'regress' away from their parents' height towards a mean population height

```r
ggplot(galton, aes(x = midParentHeight, y = Height)) +
  geom_point(color = "blue") +
  geom_jitter(color = "blue") +
  geom_function(fun = function(x) {y = x}) +
  geom_function(fun = function(x) {y = mean(galton$Height)}, linetype = "dotdash") +
  geom_smooth(method = 'lm', formula = y~x, se = FALSE, color = "red", linetype = "dashed") +
  xlab("Mid-Parent Height (inches)") +
  ylab("Child Height (inches)") +
  theme_minimal()
```

<img src="r_ladies_tunis_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Today, we would do a multiple regression

```r
multimodel <- lm(formula = Height ~ Father + Mother + Gender, data = galton)
summary(multimodel)
```

```
## 
## Call:
## lm(formula = Height ~ Father + Mother + Gender, data = galton)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.523 -1.440  0.117  1.473  9.114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.34476    2.74696   5.586 3.08e-08 ***
## Father       0.40598    0.02921  13.900  < 2e-16 ***
## Mother       0.32150    0.03128  10.277  < 2e-16 ***
## GenderM      5.22595    0.14401  36.289  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.154 on 894 degrees of freedom
## Multiple R-squared:  0.6397, Adjusted R-squared:  0.6385 
## F-statistic:   529 on 3 and 894 DF,  p-value: < 2.2e-16
```

---
class: left, middle, r-logo

## This model explains a lot about the *why* of a child's height

* We can see which variables are significant explainers of a child's height - all of them are!

* We can see the relative influence of each variable on a child's height - each *coefficient* tells us how many inches of height you gain for every extra unit of input, assuming all other inputs are the same.  For example: a male child will be over 5 inches taller than a female child of the same parents.  For two children of the same gender whose mothers are the same height, the expected height difference is about 40% of their fathers' height difference.

* We can see how much of the overall variation in child height is explained by these three variables: `\(R^2 = 0.64\)`.
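---
class: left, middle, r-logo

## Interval estimates for the coefficients

In inferential modeling we usually want interval estimates as well as point estimates.  As a quick sketch using the `multimodel` object we just fitted, base R's `confint()` returns a 95% confidence interval for each coefficient - from the estimates and standard errors above, the interval for `Father` works out to roughly 0.35 to 0.46 inches of child height per inch of father height.

```r
# 95% confidence intervals for the coefficients of multimodel
confint(multimodel, level = 0.95)
```

A coefficient whose interval excludes zero is significant at the 5% level, consistent with the p-values in the model summary.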
---
class: left, middle, r-logo

## We now have many different regression methods available

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Outcome </th>
   <th style="text-align:left;"> Example </th>
   <th style="text-align:left;"> Approach </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Continuous </td>
   <td style="text-align:left;width: 30em; "> Compensation in $ </td>
   <td style="text-align:left;width: 30em; "> Linear Regression </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;background-color: yellow !important;"> Binary </td>
   <td style="text-align:left;width: 30em; background-color: yellow !important;"> Promoted or Not </td>
   <td style="text-align:left;width: 30em; background-color: yellow !important;"> Binomial Logistic Regression </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Nominal Category </td>
   <td style="text-align:left;width: 30em; "> Choice of Health Cover Product </td>
   <td style="text-align:left;width: 30em; "> Multinomial Logistic Regression </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Ordered Category </td>
   <td style="text-align:left;width: 30em; "> Performance Rating </td>
   <td style="text-align:left;width: 30em; "> Proportional Odds Regression </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Explicit grouping in data </td>
   <td style="text-align:left;width: 30em; "> Individuals in sales teams </td>
   <td style="text-align:left;width: 30em; "> Mixed/Multilevel Models </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Latent grouping in data </td>
   <td style="text-align:left;width: 30em; "> Large survey instruments </td>
   <td style="text-align:left;width: 30em; "> Structural Equation Models </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Time-associated events </td>
   <td style="text-align:left;width: 30em; "> Attrition </td>
   <td style="text-align:left;width: 30em; "> Survival Analysis/Cox Proportional Hazard Regression </td>
  </tr>
</tbody>
</table>

---
class: left, middle, r-logo

# Example 3: Binomial Logistic Regression

---
class: left, middle, r-logo

## The logistic function is a pretty good approximation of the cumulative normal distribution

$$
P(y = 1) = \frac{1}{1 + e^{-kx}}
$$

```r
ggplot() +
  xlim(-5, 5) +
  xlab("x") +
  ylab("P (cumulative)") +
  geom_function(fun = pnorm, color = "red") +
  geom_function(fun = plogis, color = "blue", linetype = "dashed")
```

<img src="r_ladies_tunis_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## The logistic function helps us interpret how input variables affect the odds of an event

$$
\mathrm{ln}\left(\frac{P(y = 1)}{P(y = 0)}\right) = \beta_0 + \beta_1x
$$

The *log odds* of the event are linear in our input variables.  If we exponentiate this, we get:

$$
`\begin{aligned}
\frac{P(y = 1)}{P(y = 0)} &= e^{\beta_0 + \beta_1x} \\
&= e^{\beta_0}(e^{\beta_1})^x
\end{aligned}`
$$

If `\(x = 0\)`, the base odds of the event are `\(e^{\beta_0}\)`.  For every unit increase in `\(x\)`, the odds are multiplied by `\(e^{\beta_1}\)` - this is called an *odds ratio*.
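---
class: left, middle, r-logo

## Odds and odds ratios: a quick numeric sketch

Before we fit a real model, here is a minimal sketch of the algebra above.  The coefficients `b0` and `b1` are made up purely for illustration.

```r
# hypothetical coefficients for illustration only
b0 <- -2    # intercept: log odds of the event when x = 0
b1 <- 0.5   # change in the log odds per unit increase in x

exp(b0)   # base odds of the event at x = 0
exp(b1)   # odds ratio: the odds multiply by this for each unit of x

# odds at x = 3, calculated two equivalent ways
exp(b0 + b1*3)
exp(b0) * exp(b1)^3

# converting odds back to a probability
odds <- exp(b0 + b1*3)
odds/(1 + odds)   # identical to plogis(b0 + b1*3)
```

`plogis()` is the logistic function from the first formula, so we can move freely between probabilities, odds and log odds.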
---
class: left, middle, r-logo

## Let's go back to our `salespeople` data set

We want to answer the question: *how do the variables of sales, customer rating and performance rating influence the likelihood of a salesperson getting promoted*?

```r
head(salespeople)
```

```
##   promoted sales customer_rate performance
## 1        0   594          3.94           2
## 2        0   446          4.06           3
## 3        1   674          3.83           4
## 4        0   525          3.62           2
## 5        1   657          4.40           3
## 6        1   918          4.54           2
```

What structure does our data have?

```r
str(salespeople)
```

```
## 'data.frame':	350 obs. of  4 variables:
##  $ promoted     : int  0 0 1 0 1 1 0 0 0 0 ...
##  $ sales        : int  594 446 674 525 657 918 318 364 342 387 ...
##  $ customer_rate: num  3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
##  $ performance  : int  2 3 4 2 3 2 3 1 3 3 ...
```

---
class: left, middle, r-logo

## Transform the data

Our `promoted` and `performance` columns are currently integers, but they should be categorical factors.  Let's convert them.

```r
# let's use some of the newer dplyr functions
salespeople <- salespeople %>% 
  mutate(across(starts_with("p"), ~as.factor(.x)))

str(salespeople)
```

```
## 'data.frame':	350 obs. of  4 variables:
##  $ promoted     : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 1 1 ...
##  $ sales        : int  594 446 674 525 657 918 318 364 342 387 ...
##  $ customer_rate: num  3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
##  $ performance  : Factor w/ 4 levels "1","2","3","4": 2 3 4 2 3 2 3 1 3 3 ...
```

---
class: left, middle, r-logo

## Visualize relationships in the data using a pairplot

```r
library(GGally)
ggpairs(salespeople)
```

<img src="r_ladies_tunis_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Run a binomial logistic regression model to try to answer our question

```r
model <- glm(formula = promoted ~ ., data = salespeople, family = "binomial")
summary(model)
```

```
## 
## Call:
## glm(formula = promoted ~ ., family = "binomial", data = salespeople)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.11890  -0.08910  -0.01981   0.00867   2.98107  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -19.858932   3.444079  -5.766 8.11e-09 ***
## sales           0.040124   0.006576   6.101 1.05e-09 ***
## customer_rate  -1.112131   0.482682  -2.304   0.0212 *  
## performance2    0.263000   1.021980   0.257   0.7969    
## performance3    0.684955   0.982167   0.697   0.4856    
## performance4    0.734493   1.071964   0.685   0.4932    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 440.303  on 349  degrees of freedom
## Residual deviance:  64.374  on 344  degrees of freedom
## AIC: 76.374
## 
## Number of Fisher Scoring iterations: 8
```

---
class: left, middle, r-logo

## Exponentiate the coefficients to get odds ratios

```r
exp(model$coefficients) %>% round(2)
```

```
##   (Intercept)         sales customer_rate  performance2  performance3 
##          0.00          1.04          0.33          1.30          1.98 
##  performance4 
##          2.08
```

Interpretation:

1. For two salespeople with the same customer rating and the same performance, each additional thousand dollars in sales increases the odds of promotion by 4%.

2. For two salespeople with the same sales and performance, each additional point in customer rating *decreases* the odds of promotion by 67%.

3. For two salespeople with the same sales and customer rating, performance rating has no significant effect on the odds of promotion.
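---
class: left, middle, r-logo

## Sanity check: a fitted probability

Although our goal here is inference rather than prediction, we can still query the fitted model.  As a sketch, here is the fitted probability of promotion for a hypothetical salesperson - the values in `new_obs` are invented for illustration.

```r
# a hypothetical salesperson: 600 in sales, customer rating of 4,
# performance rating of 3 (a factor with the same levels as the data)
new_obs <- data.frame(sales = 600,
                      customer_rate = 4,
                      performance = factor(3, levels = 1:4))

# fitted probability of promotion from our binomial model
predict(model, newdata = new_obs, type = "response")
```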
---
class: left, middle, r-logo

## How much of the likelihood of promotion does my model explain?

```r
library(DescTools)
PseudoR2(model, which = c("CoxSnell", "Nagelkerke", "Tjur"))
```

```
##  CoxSnell Nagelkerke      Tjur 
## 0.6583888 0.9198193 0.8795290
```

Interpretation: our model explains more than two thirds of the variation in the promotion outcome.

---
class: left, middle, r-logo

## Visualize the model using `plotly`

```r
library(plotly)

plot_ly(data = salespeople[complete.cases(salespeople), ]) %>%
  add_trace(x = ~sales, y = ~customer_rate, z = ~promoted,
            mode = "markers", type = "scatter3d",
            marker = list(size = 5, color = "blue", symbol = 104),
            name = "Observations") %>%
  add_trace(z = model$fitted.values, x = ~sales, y = ~customer_rate,
            type = "mesh3d", name = "Fitted values") %>%
  layout(scene = list(xaxis = list(title = 'sales'),
                      yaxis = list(title = 'customer_rate', nticks = 5),
                      camera = list(eye = list(x = -0.5, y = 2, z = 0)),
                      zaxis = list(title = 'promoted'),
                      aspectmode = 'cube'))
```

---
class: left, middle, r-logo

## Output of previous command

*(The interactive 3D plot renders here in the live slides.)*
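---
class: left, middle, r-logo

## Confidence intervals for the odds ratios

Like any inferential estimate, odds ratios come with uncertainty.  As a sketch, assuming the `model` object is still in scope, `confint.default()` gives Wald-type 95% intervals on the log odds scale, which we exponentiate onto the odds ratio scale.

```r
# Wald-type 95% confidence intervals, exponentiated to odds ratios
exp(confint.default(model)) %>% round(2)
```

An odds ratio whose interval contains 1 is consistent with 'no effect' - exactly what we expect for the non-significant performance dummy variables above.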
---
class: left, middle, r-logo

# Thank you! Questions?