For an interactive example of the Central Limit Theorem, check out the Distributions Applet by Brown University.
In probability theory, the central limit theorem (CLT) establishes that, in the most commonly studied scenarios, when independent random variables are added, their sum tends toward a normal distribution (commonly known as a bell curve) even if the original variables themselves are not normally distributed. More precisely, given certain conditions, the arithmetic mean of a sufficiently large number of independent random variables, each with a well-defined (finite) expected value and finite variance, will be approximately normally distributed, regardless of the underlying distribution.
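In symbols, the classical i.i.d. form of the theorem (added here for reference) says: if \(X_1, X_2, \ldots\) are independent and identically distributed with mean \(\mu\) and finite variance \(\sigma^2\), then

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \qquad \text{where } \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

so for large \(n\), the sample mean \(\bar{X}_n\) is approximately \(\mathcal{N}(\mu, \sigma^2/n)\).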
Means (and sums) of samples from a population will, under common circumstances, follow an approximately normal distribution
This justifies the use of the normal distribution in statistical modeling
This requires that two conditions are met:
Sampled observations of a population must be independent.
Independence is guaranteed when we take a random sample from a population (using a random number generator, for example). It can also be guaranteed if we randomly divide individuals into treatment and control groups.
The sample size must be large enough.
Practical examples of what this means:
Start with three different distributions (they don’t need to be normal). Draw one sample from each, and add the numbers together. Repeat this procedure many times, and plot a histogram of the sums. You will get an approximately normal distribution.
Use one distribution (it doesn’t have to be normal) and draw multiple samples from it. Compute the mean of those samples. Then repeat the procedure again and again, and plot a histogram of the means. You will get an approximately normal distribution.
Draw samples from two populations and subtract the numbers. Do this many times and plot a histogram of the differences. You will get an approximately normal distribution.
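A minimal R sketch of these three procedures (the specific distributions, counts, and repetitions below are illustrative assumptions, not taken from the original figures):

# Sketch of the three procedures above; distribution choices are arbitrary
set.seed(42)
n_reps <- 1500

# 1. Sum one draw each from three different, non-normal distributions
sums <- replicate(n_reps, runif(1) + rexp(1) + rpois(1, lambda = 3))

# 2. Take the mean of 30 draws from a single uniform distribution
means <- replicate(n_reps, mean(runif(30)))

# 3. Subtract draws from two different populations
diffs <- replicate(n_reps, runif(1, 0, 2) - rexp(1, rate = 2))

# Histograms of all three come out roughly bell-shaped
par(mfrow = c(1, 3))
hist(sums,  breaks = 30, main = "Sums")
hist(means, breaks = 30, main = "Means")
hist(diffs, breaks = 30, main = "Differences")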
Drawing from three distributions.
Example of summing samples from three distributions. We see what happens for 15 summed samples, 150 summed samples, and 1500 summed samples.
Example of subtracting samples from two distributions. We see what happens for 15 subtracted samples, 150 subtracted samples, and 1500 subtracted samples.
Example of taking the mean of 30 samples from a uniform distribution, repeated 150 times.
These illustrate the Central Limit Theorem in action. As the number of samples increases, the bell curve shape emerges.
Running the simulation for the gender discrimination study using R
First, we label Men as 0 and Women as 1
library(tidyverse)  # for tibble(), summarize(), and ggplot()
library(statsr)     # assumed source of the inference() function used below

# Men are labeled 0, women are labeled 1; 24 candidates of each
candidates <- c(rep(0, 24), rep(1, 24))
We then run the simulation assuming the null hypothesis is true. We take the list of candidates and randomly select 35 without replacement for promotion. We then compute the difference between the promotion rate for men and the promotion rate for women, each out of the 24 candidates of that gender.
# Pre-allocate a numeric vector to hold the 100 simulated differences
simulation.list <- rep(0.0, 100)
# Loop to run simulation 100 times
for(i in 1:100){
# Sample 35 candidates at random without replacement
tmp <- sample(candidates, 35, replace = FALSE)
# Promotion rate for men: promoted men out of the 24 male candidates
p1 <- length(tmp[tmp == 0])/24
# Promotion rate for women: promoted women out of the 24 female candidates
p2 <- length(tmp[tmp == 1])/24
# Store difference of p1 - p2 in a vector
simulation.list[i] <- p1 - p2
}
# Convert vector of simulation outcomes to a Tibble for plotting
simulation.list <- tibble(p = simulation.list)
Check the summary statistics for the simulation:
cat("Summary statistics for 100 run simulation of null hypothesis",
"\nfor the Gender Discrimination study\n\n")
## Summary statistics for 100 run simulation of null hypothesis
## for the Gender Discrimination study
summarize(simulation.list, mean = mean(p), median = median(p), sd = sd(p))
mean | median | sd |
---|---|---|
0.0216667 | 0.0416667 | 0.1185158 |
Finally, let’s plot a histogram of the simulation outcomes and add a red vertical line at the difference observed in the gender discrimination study (0.875 − 0.583 ≈ 0.292).
# Histogram of the simulated differences under the null hypothesis
ggplot(simulation.list) +
geom_histogram(mapping = aes(x = p, y = ..density..), bins = 30,
fill = "cyan2", color = "cyan4") +
xlim(c(-0.4, 0.4)) +
# Red line at the observed difference in promotion rates: 0.875 - 0.583
geom_vline(xintercept = 0.875 - 0.583, color = "indianred")
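As a quick added check (not part of the original write-up), we can also estimate a two-sided p-value directly from the simulation by counting how often a simulated difference is at least as extreme as the observed one:

# Added check: empirical two-sided p-value from the 100 simulated differences
observed <- 0.875 - 0.583
mean(abs(simulation.list$p) >= observed)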
inference()
Now let’s try it again with the inference function. First, set up the categorical data.
# Full study data: sex (0 = man, 1 = woman) and promotion status per candidate
full_results <- tibble(sex = as.factor(candidates),
promoted = as.factor(c(rep(1, 21), rep(0, 3), rep(1, 14), rep(0, 10))))
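Before running the test, a quick cross-tabulation (an added sanity check, not in the original) confirms the study's counts: 21 of 24 men and 14 of 24 women were promoted.

# Added sanity check: should show 21/3 for men (0) and 14/10 for women (1)
table(full_results$sex, full_results$promoted)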
inference(y = promoted, x = sex, data = full_results, type = "ht", statistic = "proportion",
success = 1, order = c(1, 0), method = "simulation", boot_method = "se",
alternative = "twosided")
## Warning: Missing null value, set to 0
## Response variable: categorical (2 levels, success: 1)
## Explanatory variable: categorical (2 levels)
## n_1 = 24, p_hat_1 = 0.5833
## n_0 = 24, p_hat_0 = 0.875
## H0: p_1 = p_0
## HA: p_1 != p_0
## p_value = 0.0545