class: center, middle, title-slide

.upper-right[
![](cds-101-logo.png)<!-- -->
]

# Class 24 <br/> Normal distribution

.title-hline[
## November 20, 2017
]

---
class: middle, center, inverse

# Applying the normal distribution with R

---
layout: true

# SAT scores example revisited

---

* How can we compare an SAT score to an ACT score?

--

* Compare the percentiles

--

* Pam's SAT score was 1800 and Jim's ACT score was 24.

--

* SAT scores are normally distributed with a mean of 1500 and a standard deviation of 300.

--

* ACT scores are normally distributed with a mean of 21 and a standard deviation of 5.

--

We can use `pnorm()` to compute each percentile.

---

Pam's percentile is:

--

```r
pam_score <- 1800
pam_percentile <- pnorm(q = pam_score, mean = 1500, sd = 300)
```

--

```
## [1] 0.8413447
```

---

Visually, this corresponds to the following area under the normal distribution:

<img src="class24_slides_files/figure-html/pam-normal-sat-density-1.png" width="80%" style="display: block; margin: auto;" />

---

Jim's percentile is:

--

```r
jim_score <- 24
jim_percentile <- pnorm(q = jim_score, mean = 21, sd = 5)
```

--

```
## [1] 0.7257469
```

---

Visually, this corresponds to the following area under the normal distribution:

<img src="class24_slides_files/figure-html/jim-normal-act-density-1.png" width="80%" style="display: block; margin: auto;" />

---
layout: true

# .font70[Cumulative distribution function of normal distribution]

---

* Comparing percentiles is a useful way to check whether two distributions are the same

--

* Instead of calculating percentiles one by one, use the cumulative distribution function (CDF)

--

* Maps each value in the data set to its percentile.
--

* For the normal distribution model (not for imported datasets), `pnorm()` evaluates the CDF and its inverse, `qnorm()`, maps percentiles back to values

---

The CDF for the SAT scores is generated by passing evenly spaced percentiles through `qnorm()`:

--

```r
library(tidyverse)  # provides tibble() and ggplot()

# Stop at 0.99: qnorm(1) is Inf, which cannot be plotted
sat_score_percentiles <- seq(0.01, 0.99, 0.01)
sat_score_cdf <- tibble(
  CDF = sat_score_percentiles,
  score = qnorm(p = sat_score_percentiles, mean = 1500, sd = 300))
ggplot(sat_score_cdf) +
  geom_line(mapping = aes(x = score, y = CDF))
```

---

The resulting CDF for the SAT scores:

<img src="class24_slides_files/figure-html/sat-score-cdf-1.png" width="80%" style="display: block; margin: auto;" />

--

This shows how the CDF maps Pam's SAT score to a percentile.

---
layout: true

# Q-Q Plots

---

Load the dataset on children's heights.

```r
heights <- read_csv(file = "child_height_data.csv")
```

The first few lines in the dataset look like the following:

```
## # A tibble: 6 x 2
##   sex   height_inches
##   <chr>         <dbl>
## 1 M              73.2
## 2 F              69.2
## 3 F              69.0
## 4 F              69.0
## 5 M              73.5
## 6 M              72.5
```

---

Plot the PMF histogram:

```r
ggplot(heights) +
  geom_histogram(mapping = aes(x = height_inches, y = ..density..),
                 binwidth = 1)
```

<img src="class24_slides_files/figure-html/child-heights-pmf-histogram-1.png" width="70%" style="display: block; margin: auto;" />

---
layout: true

# Q-Q Plots

First, let's compute the theoretical line for ideal agreement:

---

* Find the 1st and 3rd quartiles

```r
qq_y <- quantile(heights$height_inches, c(0.25, 0.75))
```

--

* Find the matching standard normal quantiles for the x-axis

```r
qq_x <- qnorm(c(0.25, 0.75))
```

--

* Compute the line slope

```r
qq_slope <- diff(qq_y) / diff(qq_x)
```

--

* Compute the line intercept

```r
qq_int <- qq_y[1] - qq_slope * qq_x[1]
```

---
layout: true

# Q-Q Plots

---

Now create the plot:

```r
ggplot(heights) +
  stat_qq(mapping = aes(sample = height_inches)) +
  geom_abline(intercept = qq_int, slope = qq_slope)
```

--

<img src="class24_slides_files/figure-html/child-heights-qqplot-show-visual-1.png" width="70%" style="display: block; margin: auto;"
/>

---

Check the histograms with males and females separated.

```r
ggplot(heights) +
  geom_histogram(
    mapping = aes(x = height_inches, y = ..density.., fill = sex),
    binwidth = 1, position = "identity", alpha = 0.5)
```

<img src="class24_slides_files/figure-html/child-height-male-female-pmf-1.png" width="70%" style="display: block; margin: auto;" />

---
layout: true

# Q-Q Plots

Re-run the Q-Q plot with males and females separated:

---

```r
# Split the heights by sex
heights_male <- filter(heights, sex == "M")
heights_female <- filter(heights, sex == "F")

# First, compute theoretical line for ideal agreement
# Find the 1st and 3rd quartiles
qq_y_male <- quantile(heights_male$height_inches, c(0.25, 0.75))
qq_y_female <- quantile(heights_female$height_inches, c(0.25, 0.75))

# Find the matching normal values on the x-axis
qq_x <- qnorm(c(0.25, 0.75))
```

---

```r
# Compute line slope
qq_slope_male <- diff(qq_y_male) / diff(qq_x)
qq_slope_female <- diff(qq_y_female) / diff(qq_x)

# Compute line intercept
qq_int_male <- qq_y_male[1] - qq_slope_male * qq_x[1]
qq_int_female <- qq_y_female[1] - qq_slope_female * qq_x[1]

# Make the plot
ggplot(heights) +
  stat_qq(mapping = aes(sample = height_inches, color = sex)) +
  geom_abline(intercept = qq_int_male, slope = qq_slope_male) +
  geom_abline(intercept = qq_int_female, slope = qq_slope_female)
```

---

<img src="class24_slides_files/figure-html/child-height-male-female-qqplot-1.png" width="80%" style="display: block; margin: auto;" />
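
---
layout: true

# Q-Q Plots

---

The quartile, slope, and intercept steps are repeated for each group, so they could be wrapped in a small helper. The sketch below is one possible way to do that. `qq_line_params()` is a hypothetical name (it is not part of the original slides or of ggplot2), and the sample is simulated rather than taken from the heights dataset.

```r
# Hypothetical helper, not part of the original slides: returns the intercept
# and slope of the line through the 1st and 3rd quartiles of a Q-Q plot.
qq_line_params <- function(x, probs = c(0.25, 0.75)) {
  # Sample quantiles (y-axis of the Q-Q plot), ignoring missing values
  qq_y <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Matching standard normal quantiles (x-axis of the Q-Q plot)
  qq_x <- qnorm(probs)
  slope <- diff(qq_y) / diff(qq_x)
  intercept <- qq_y[1] - slope * qq_x[1]
  c(intercept = intercept, slope = slope)
}

# Simulated stand-in data (an assumption for illustration only)
set.seed(101)
simulated_heights <- rnorm(n = 500, mean = 67, sd = 3)
qq_line_params(simulated_heights)
```

Assuming `heights_male` from the earlier slide is available, `qq_line_params(heights_male$height_inches)` should give the same values as `qq_int_male` and `qq_slope_male`. Recent ggplot2 releases also provide `stat_qq_line()`, which by default draws a line through the same two quartiles.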