Central Limit Theorem

For an interactive example of the Central Limit Theorem, check out the Distributions Applet by Brown University.

In probability theory, the central limit theorem (CLT) establishes that, for the most commonly studied scenarios, when independent random variables are added, their sum tends toward a normal distribution (commonly known as a bell curve) even if the original variables themselves are not normally distributed. In more precise terms, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined (finite) expected value and finite variance, will be approximately normally distributed, regardless of the underlying distribution.

— Wikipedia: Central limit theorem

This requires that two conditions are met: each random variable must have a well-defined (finite) expected value, and each must have a finite variance.
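In symbols, the classical (i.i.d.) form of the theorem states that if $X_1, X_2, \ldots$ are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2$, then the standardized sample mean converges in distribution to a standard normal:

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1), \qquad \text{where } \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$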

Practical examples of what this means:

Example 1

Drawing from three distributions.

Example of summing samples from three distributions. We see what happens for 15 summed samples, 150 summed samples, and 1500 summed samples.

Example of subtracting samples from two distributions. We see what happens for 15 subtracted samples, 150 subtracted samples, and 1500 subtracted samples.

Example of taking the mean of 30 samples from a uniform distribution, repeated 150 times.

These illustrate the Central Limit Theorem in action. As the number of samples increases, the bell-curve shape emerges.
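As a rough sketch of the third example, here is one way to take the mean of 30 samples from a uniform distribution 150 times in R and plot the result (this code is illustrative rather than the original, and the seed is an arbitrary choice):

library(tidyverse)

# Take the mean of 30 uniform samples, repeated 150 times
set.seed(1)                                      # arbitrary seed, for reproducibility
sample.means <- replicate(150, mean(runif(30)))

# Histogram of the 150 sample means; a rough bell shape centered near 0.5
ggplot(tibble(m = sample.means)) +
  geom_histogram(mapping = aes(x = m), bins = 20,
                 fill = "cyan2", color = "cyan4")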

Gender discrimination study simulation

Running the simulation for the gender discrimination study using R

First, we label men as 0 and women as 1.

library(tidyverse)  # provides tibble(), summarize(), and ggplot()
library(statsr)     # assumed source of the inference() function used below

candidates <- c(rep(0, 24), rep(1, 24))

We then run the simulation assuming the null hypothesis is true: we take the list of candidates, randomly select 35 of them without replacement for promotion, and compute the difference between the fractions of promotions that went to men and to women.

simulation.list <- rep(0.0, 100)  # pre-allocate a vector of 100 zeros
# Loop to run simulation 100 times
for(i in 1:100){
  # Sample 35 candidates at random without replacement
  tmp <- sample(candidates, 35, replace = FALSE)
  # Fraction of the 35 promotions that went to men
  p1 <- length(tmp[tmp == 0])/35
  # Fraction of the 35 promotions that went to women
  p2 <- length(tmp[tmp == 1])/35
  # Store difference of p1 - p2 in a vector
  simulation.list[i] <- p1 - p2
}
# Convert vector of simulation outcomes to a Tibble for plotting
simulation.list <- tibble(p = simulation.list)

Check the summary statistics for the simulation:

cat("Summary statistics for 100 run simulation of null hypothesis",
    "\nfor the Gender Discrimination study\n\n")
## Summary statistics for 100 run simulation of null hypothesis 
## for the Gender Discrimination study
summarize(simulation.list, mean = mean(p), median = median(p), sd = sd(p))
##        mean     median         sd
##   0.0148571  0.0285714   0.081268

Finally, let’s plot a histogram of the simulation outcomes and add a red vertical line at the result obtained in the gender discrimination study (0.875 - 0.583 ≈ 0.292).

ggplot(simulation.list) +
  geom_histogram(mapping = aes(x = p, y = ..density..), bins = 30,
                 fill = "cyan2", color = "cyan4") +
  xlim(c(-0.4, 0.4)) +
  # Red line marks the observed difference from the study
  geom_vline(xintercept = 0.875 - 0.583, color = "indianred")
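As a quick sanity check (not part of the original analysis), we can also estimate an empirical two-sided p-value directly from the simulation by counting the fraction of simulated differences at least as extreme as the observed one:

# Illustrative check only: fraction of simulated differences at least
# as extreme as the observed difference of 0.875 - 0.583
observed <- 0.875 - 0.583
mean(abs(simulation.list$p) >= observed)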

Using inference()

Now let’s try it again with the inference function. First, set up the categorical data.

# Promotion outcomes: of the 24 men, 21 were promoted and 3 were not;
# of the 24 women, 14 were promoted and 10 were not
full_results <- tibble(sex = as.factor(candidates),
                       promoted = as.factor(c(rep(1, 21), rep(0, 3), rep(1, 14), rep(0, 10))))
# Simulation-based hypothesis test for the difference in promotion
# proportions between women (1) and men (0)
inference(y = promoted, x = sex, data = full_results, type = "ht", statistic = "proportion",
          success = 1, order = c(1, 0), method = "simulation", boot_method = "se",
          alternative = "twosided")
## Warning: Missing null value, set to 0
## Response variable: categorical (2 levels, success: 1)
## Explanatory variable: categorical (2 levels) 
## n_1 = 24, p_hat_1 = 0.5833
## n_0 = 24, p_hat_0 = 0.875
## H0: p_1 =  p_0
## HA: p_1 != p_0
## p_value = 0.0545
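Since the simulated two-sided p-value of 0.0545 is just above the conventional 0.05 threshold, we would narrowly fail to reject the null hypothesis that men and women are promoted at the same rate.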