In the GitHub Classroom repository that you copied, you should have a template file named
assignment_3.Rmd. Use this file to answer the questions below. Also fill out any missing information at the top of the R Markdown document. Once you’re done, be sure to save, commit, and push to GitHub. Then, submit a pull request.
November 29, 2017 @ 11:59pm
This homework uses the National Survey of Family Growth, Cycle 6 dataset in the file
2002FemPreg.rds, published by the National Center for Health Statistics. Complete descriptions of all the variables can be found in the NSFG Cycle 6: Female Pregnancy File Codebook. Below are selected descriptions of variables that will be used for this homework assignment:
||integer ID of the respondent|
|prglngth|integer duration of the pregnancy in weeks|
||integer code for the outcome of the pregnancy, with a 1 indicating a live birth|
||serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank|
||the birth weight of the baby in pounds|
|postsmks|integer code indicating whether or not the respondent smoked a cigarette at some point during the pregnancy|
This homework assignment revolves around answering two questions using this dataset:
Do first born children either arrive early or late when compared with subsequently born children?
Do children born to mothers that reported smoking during their pregnancy either weigh more or weigh less than children born to non-smoking mothers?
Most of the questions step you through the procedure of answering the first question using inference methods. You will then use that procedure as a template for answering the second question on your own.
To get started with answering the question of whether or not first born children either arrive early or late, we need to do some basic data wrangling to clean up the dataset. Begin by filtering the dataset so that it only contains outcomes with live births. Assign this result to the variable
live_births. Then, continuing with
live_births, filter again and create two additional variables:
Apply a filter to extract all first births. Then, use
select() to grab the
prglngth column and discard all other columns. Pipe this into
mutate(birth_order = "first"), and then assign the output to the variable
first_births.
Apply a filter to extract all births other than first births. Then, use
select() to grab the
prglngth column and discard all other columns. Pipe this into
mutate(birth_order = "other"), and then assign the output to the variable
other_births.
Use
bind_rows() to combine the results in
first_births and
other_births into a single, tidy
tibble. Pipe this tibble into
remove_missing() to remove any rows containing
NA entries and assign the result to the variable name
pregnancy_length.
Take
pregnancy_length and plot a probability mass function (PMF) histogram of the pregnancy length in weeks that shows first births and other births on the same plot. Choose a reasonable binwidth for the histogram and add
coord_cartesian(xlim = c(27, 46)) to your plot so that the window focuses on where most of the data is. After creating the plot, describe the shape, center, and spread of the two distributions. Based on the visualization, do you think the data support the statement that “first born children either arrive early or arrive late”?
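The wrangling and plotting steps above can be sketched on a small made-up dataset. This is only an illustration under stated assumptions: the column names outcome_code and birth_number are hypothetical stand-ins (the real names are in the codebook), the values are invented, and after_stat() assumes ggplot2 3.3.0 or later.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the pregnancy file. `outcome_code` and `birth_number`
# are hypothetical column names; the values are invented for illustration.
toy <- tibble(
  outcome_code = c(1, 1, 1, 2, 1, 1, 1, 4, 1, 1),
  birth_number = c(1, 2, 1, NA, 3, 1, 2, NA, 1, 2),
  prglngth     = c(39, 40, 41, 9, 38, NA, 39, 12, 40, 39)
)

# Keep live births only (a code of 1 indicates a live birth).
toy_live <- toy %>% filter(outcome_code == 1)

# Build one labelled tibble per group, keeping only the pregnancy length.
toy_first <- toy_live %>%
  filter(birth_number == 1) %>%
  select(prglngth) %>%
  mutate(birth_order = "first")

toy_other <- toy_live %>%
  filter(birth_number != 1) %>%
  select(prglngth) %>%
  mutate(birth_order = "other")

# Stack the two groups and drop rows containing NA.
# na.rm = TRUE only silences the warning; rows are dropped either way.
toy_lengths <- bind_rows(toy_first, toy_other) %>%
  remove_missing(na.rm = TRUE)

# PMF-style histogram: with binwidth = 1, the per-group density of each
# bin equals the proportion of that group falling in the bin.
pmf_plot <- ggplot(toy_lengths) +
  geom_histogram(
    mapping = aes(x = prglngth, y = after_stat(density), fill = birth_order),
    binwidth = 1,
    position = "dodge"
  ) +
  coord_cartesian(xlim = c(27, 46))
```

Dodging the bars keeps both groups visible in every bin; overlapping semi-transparent bars would be another reasonable choice.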
Using
summarize(), compute the summary statistics (mean, median, standard deviation, interquartile range) of the variable
prglngth for
first births and
other births in
pregnancy_length. How do the summary statistics compare between the two distributions? Based on these initial summary statistics, does it look like there may be a statistically significant difference between the two distributions? If so, why?
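A grouped summary of this kind might look like the following sketch. The data are made up, and the output column names (mean_weeks and so on) are arbitrary choices for illustration rather than anything the assignment requires.

```r
library(dplyr)

# Made-up pregnancy lengths in weeks, for illustration only.
toy_lengths <- tibble(
  prglngth    = c(39, 41, 40, 38, 40, 38, 39, 39),
  birth_order = rep(c("first", "other"), each = 4)
)

# One row of summary statistics per birth-order group.
toy_summary <- toy_lengths %>%
  group_by(birth_order) %>%
  summarize(
    mean_weeks   = mean(prglngth),
    median_weeks = median(prglngth),
    sd_weeks     = sd(prglngth),
    iqr_weeks    = IQR(prglngth)
  )
```

Grouping before summarizing puts both groups side by side in one tibble, which makes the comparison easier to read than two separate calls.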
Determine whether the distribution of pregnancy lengths follows a nearly normal distribution. Do this by creating two separate Q-Q plots for the variable
prglngth in
pregnancy_length. The first plot is for when
birth_order is
"first" and the second plot is for when
birth_order is
"other".
To get started, filter
pregnancy_length so that it only contains
first births and assign it to the variable named
first_births. Similarly, filter
pregnancy_length to only contain
other births and assign it to the variable named
other_births.
After filtering, adapt the following code to create a Q-Q plot for
first_births and also for
other_births. The additional code helps to plot the “ideal” reference line to show any deviations from normality:
qq_x <- qnorm(p = c(0.25, 0.75))
qq_y <- quantile(x = dataset$variable, probs = c(0.25, 0.75), type = 1)
qq_slope <- diff(qq_y) / diff(qq_x)
qq_int <- qq_y - qq_slope * qq_x
ggplot(data = dataset) +
  geom_qq(mapping = aes(sample = variable)) +
  geom_abline(slope = qq_slope, intercept = qq_int, size = 0.75)
Please note that you will have to change the
dataset and
variable parameters. Based on your plots, does it appear that the pregnancy length distribution is nearly normal for both
first and
other births? Why or why not?
Returning to the question of whether or not “first babies arrive early or arrive late”, let’s plot the cumulative distribution functions (CDFs) of the pregnancy lengths for
first and
other births. Plot the CDF for both distributions on the same figure so that we can directly compare them. You can either use the procedure outlined in Cumulative distribution functions from reading assignment 15 or use
stat_ecdf to do this. How do the distributions compare? Does it look like there is a meaningful difference between the two distributions?
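As a rough sketch of the stat_ecdf route, the example below overlays two empirical CDFs on made-up data and uses base R's ecdf() to read off a single value. The data and object names are invented for illustration.

```r
library(ggplot2)

# Made-up pregnancy lengths in weeks, for illustration only.
toy_lengths <- data.frame(
  prglngth    = c(37, 39, 39, 40, 41, 38, 39, 39, 40, 40),
  birth_order = rep(c("first", "other"), each = 5)
)

# One step function per group, drawn on the same axes.
cdf_plot <- ggplot(toy_lengths, aes(x = prglngth, colour = birth_order)) +
  stat_ecdf(geom = "step") +
  labs(x = "Pregnancy length (weeks)", y = "Cumulative probability")

# Base R's ecdf() returns a function, which is handy for spot checks.
cdf_first <- ecdf(toy_lengths$prglngth[toy_lengths$birth_order == "first"])
share_by_39 <- cdf_first(39)  # fraction of "first" births lasting <= 39 weeks
```

Where one curve sits to the left of the other, that group tends to have shorter pregnancies at that percentile.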
If we want to determine whether or not the difference between two distributions is statistically significant, we need to run a hypothesis test. Before going further, for the question of “do first babies arrive early or arrive late”, formalize your analysis by writing down the null hypothesis.
Next, use the pre-loaded
inference() function to run a hypothesis test for the difference in means of the
prglngth variable between the
first and
other births:
inference(y = prglngth, x = birth_order, data = pregnancy_length, type = "ht", statistic = "mean", null = 0, alternative = "twosided", order = c("first", "other"), method = "simulation", show_eda_plot = FALSE)
Assume that we set our significance level to \(\alpha = 0.05\). Based on the outputted p-value, can we reject the null hypothesis?
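To build intuition for where a simulation-based p-value comes from, here is a minimal permutation-test sketch in base R. It is not the implementation of the pre-loaded inference() function, which may work differently; the data, the number of shuffles, and all object names are assumptions made for illustration.

```r
set.seed(1)  # make the shuffles reproducible

# Made-up pregnancy lengths in weeks, for illustration only.
toy <- data.frame(
  prglngth    = c(38, 39, 40, 41, 42, 39, 37, 39, 40, 38, 39, 40),
  birth_order = rep(c("first", "other"), each = 6)
)

# Difference in group means (first minus other) for a given labelling.
diff_in_means <- function(values, labels) {
  mean(values[labels == "first"]) - mean(values[labels == "other"])
}

observed <- diff_in_means(toy$prglngth, toy$birth_order)

# Under the null hypothesis the labels are exchangeable, so shuffle them
# many times and recompute the difference each time.
perm_diffs <- replicate(
  2000,
  diff_in_means(toy$prglngth, sample(toy$birth_order))
)

# Two-sided p-value: share of shuffled differences at least as extreme
# as the observed one.
p_value <- mean(abs(perm_diffs) >= abs(observed))
reject_null <- p_value < 0.05
```

The comparison against 0.05 mirrors the significance level chosen above; a smaller p-value means the observed difference is rarer under random relabelling.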
For completeness, also compute the 95% confidence interval for the difference in means:
inference(y = prglngth, x = birth_order, data = pregnancy_length, type = "ci", statistic = "mean", null = 0, method = "simulation", order = c("first", "other"), boot_method = "perc", show_eda_plot = FALSE)
Does the confidence interval contain the null value of 0?
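The idea behind a percentile bootstrap interval can be sketched in base R as follows. This is an illustration under assumptions, not the internals of inference(): the data are made up and the number of resamples is an arbitrary choice.

```r
set.seed(1)  # make the resampling reproducible

# Made-up pregnancy lengths in weeks, for illustration only.
first_weeks <- c(38, 39, 40, 41, 42, 39)
other_weeks <- c(37, 39, 40, 38, 39, 40)

# Resample each group with replacement and record the difference in means.
boot_diffs <- replicate(
  2000,
  mean(sample(first_weeks, replace = TRUE)) -
    mean(sample(other_weeks, replace = TRUE))
)

# Percentile method: the middle 95% of the bootstrap differences.
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))

# Check whether the null value of 0 lies inside the interval.
contains_null <- ci[[1]] <= 0 && 0 <= ci[[2]]
```

If 0 falls inside the interval, the data are compatible with no difference in means at the matching significance level.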
In addition to hypothesis tests and confidence intervals, we should also consider the effect size, which measures the relative difference between two distributions. The effect size helps us judge how important a given result actually is, not just whether or not we can reject the null hypothesis. One measure of the effect size is called Cohen’s d, which we will use to compute the effect size between the pregnancy lengths for first births and other births. The different ranges of d can be interpreted using the following table:
The following set of functions should also be preloaded for you:
plot_ci(). These functions will use bootstrap simulations to compute the confidence interval for the Cohen’s d parameter. Run the bootstrap simulation as follows:
cohens_d_bootstrap(data = pregnancy_length, model = prglngth ~ birth_order)
Be sure to assign the results to a variable, for example
bootstrap_results. Note that the input
model specifies how to split the data into categories. You put the response variable (
prglngth) on the left side of a tilde
~, and the categorical explanatory variable (
birth_order) on the right side.
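The formula convention described above can be illustrated with base R. The sketch below splits made-up data by a formula and computes Cohen's d using the common pooled-standard-deviation definition. The helper cohens_d() is hypothetical, and the pre-loaded cohens_d_bootstrap() may define or compute d differently.

```r
# Made-up pregnancy lengths in weeks, for illustration only.
toy <- data.frame(
  prglngth    = c(38, 39, 40, 41, 37, 39, 40, 41),
  birth_order = rep(c("first", "other"), each = 4)
)

# model.frame() evaluates the formula: the response comes first, then the
# explanatory variable. split() then groups the response by category.
mf <- model.frame(prglngth ~ birth_order, data = toy)
groups <- split(mf[[1]], mf[[2]])

# Hypothetical helper: Cohen's d with a pooled standard deviation.
cohens_d <- function(a, b) {
  pooled_var <- ((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
    (length(a) + length(b) - 2)
  (mean(a) - mean(b)) / sqrt(pooled_var)
}

d <- cohens_d(groups$first, groups$other)
```

Because d divides the difference in means by a standard deviation, it is unitless, which lets effect sizes be compared across variables measured on different scales.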
To print a report for the bootstrap simulation, run:
To visualize the bootstrap distribution and confidence interval, run:
Interpret the outputs from the Cohen’s d bootstrap simulation. How large is the effect size between the pregnancy length of
first births and
other births? Based on this and the previous hypothesis test, provide an answer to the question “do first born children arrive early or late compared to other children?”
Use the inference tools and procedures that you practiced in the previous exercises to obtain an answer to the question “Do children born to mothers that reported smoking during their pregnancy either weigh more or weigh less than children born to non-smoking mothers?” To answer this question, you will need to start from the dataset stored in
live_births and work with the birth weight column and
postsmks. Please review the About the dataset section for information on what the values in those columns mean.
In order to properly answer the question, you will need to include:
A visual comparison of the two data distributions
Use of the
inference() function to run a hypothesis test and to compute the confidence interval
Use of
cohens_d_bootstrap() to compute and check the effect size.
At the end, write a paragraph that summarizes your results and provides an answer as to whether there is a connection between birth weights and smoking. Write the summary as if you are an academic or scientific journalist, focusing on how you can answer the question clearly, precisely, and honestly.