In the GitHub Classroom repository that you copied, you should have a template file named `assignment_3.Rmd`. Use this file to answer the questions below. Also fill out any missing information at the top of the R Markdown document. Once you’re done, be sure to save, commit, and push to GitHub. Then, submit a pull request.
November 29, 2017 @ 11:59pm
This homework uses the National Survey of Family Growth, Cycle 6 dataset in the file `2002FemPreg.rds`, published by the National Center for Health Statistics. Complete descriptions of all the variables can be found in the NSFG Cycle 6: Female Pregnancy File Codebook. Below are selected descriptions of variables that will be used for this homework assignment:
Variable | Description |
---|---|
`caseid` | integer ID of the respondent |
`prglngth` | integer duration of the pregnancy in weeks |
`outcome` | integer code for the outcome of the pregnancy, with a 1 indicating a live birth |
`birthord` | serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank |
`totalwgt_lb` | the birth weight of the baby in pounds |
`postsmks` | integer code for whether or not the respondent smoked a cigarette at some point during the pregnancy (value labels: 1 = YES, 5 = NO) |
This homework assignment revolves around answering two questions using this dataset:
Do first born children either arrive early or late when compared with subsequently born children?
Do children born to mothers that reported smoking during their pregnancy either weigh more or weigh less than children born to non-smoking mothers?
The majority of the questions will step you through the procedure of answering the first question using inference methods. You will then use that procedure as a template for answering the second question on your own.
To get started with answering the question of whether first born children arrive early or late, we need to do some basic data wrangling to clean up the dataset. Begin by filtering the dataset so that it contains only live births. Assign this result to the variable `live_births`. Then, continuing with `live_births`, filter again and create two additional variables:
Apply a filter to extract all first births. Then, use `select()` to grab the `prglngth` column and discard all other columns. Pipe this into `mutate(birth_order = "first")`, and then assign the output to the variable `first_births`.
Apply a filter to extract all other births except first births. Then, use `select()` to grab the `prglngth` column and discard all other columns. Pipe this into `mutate(birth_order = "other")`, and then assign the output to the variable `other_births`.
Use `bind_rows()` to combine the results in `first_births` and `other_births` into a single, tidy `tibble`. Pipe this tibble into `remove_missing()` to remove any rows containing `NA` entries and assign the result to the variable name `pregnancy_length`.
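As a rough sketch of the wrangling steps above, the pipeline might look like the following. The small `toy_preg` tibble is invented purely for illustration and only mimics the relevant columns; in the assignment you would start from the data read from `2002FemPreg.rds` instead.

```r
library(dplyr)
library(ggplot2)  # loaded here only because it provides remove_missing()

# Hypothetical stand-in for the real dataset (values invented for illustration)
toy_preg <- tibble(
  caseid   = c(1, 1, 2, 2, 3, 4),
  prglngth = c(39, 40, 38, NA, 41, 12),
  outcome  = c(1, 1, 1, 1, 1, 4),
  birthord = c(1, 2, 1, 2, 1, NA)
)

# Keep only live births (outcome code 1)
live_births <- toy_preg %>%
  filter(outcome == 1)

# First births, labelled "first"
first_births <- live_births %>%
  filter(birthord == 1) %>%
  select(prglngth) %>%
  mutate(birth_order = "first")

# All remaining live births, labelled "other"
other_births <- live_births %>%
  filter(birthord != 1) %>%
  select(prglngth) %>%
  mutate(birth_order = "other")

# Stack the two tibbles and drop rows with missing values
# (na.rm = TRUE only silences the warning about removed rows)
pregnancy_length <- bind_rows(first_births, other_births) %>%
  remove_missing(na.rm = TRUE)
```

The result is a two-column tibble with one row per live birth, which is the tidy shape that the later plotting and summary steps expect.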
Take `pregnancy_length` and plot a probability mass function (PMF) histogram of the pregnancy length in weeks that shows first births and other births on the same plot. Choose a reasonable binwidth for the histogram and add `coord_cartesian(xlim = c(27, 46))` to your plot so that the window focuses on where most of the data is. After creating the plot, describe the shape, center, and spread of the two distributions. Based on the visualization, does the data appear to support the statement that “first born children either arrive early or arrive late”?
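One possible way to build a PMF is to compute the per-group probabilities by hand and plot them as bars, which amounts to a histogram with one-week bins. This is only a sketch on made-up data; the names `pregnancy_pmf` and `pmf_plot` are illustrative, and a `geom_histogram()`-based approach would be equally valid.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for pregnancy_length
pregnancy_length <- tibble(
  prglngth    = c(38, 39, 39, 40, 39, 39, 40, 41),
  birth_order = rep(c("first", "other"), each = 4)
)

# PMF: within each group, the fraction of births at each pregnancy length
pregnancy_pmf <- pregnancy_length %>%
  count(birth_order, prglngth) %>%
  group_by(birth_order) %>%
  mutate(prob = n / sum(n)) %>%
  ungroup()

# Side-by-side bars so both groups are visible on the same axes
pmf_plot <- ggplot(pregnancy_pmf) +
  geom_col(
    mapping = aes(x = prglngth, y = prob, fill = birth_order),
    position = "dodge"
  ) +
  coord_cartesian(xlim = c(27, 46)) +
  labs(x = "Pregnancy length (weeks)", y = "Probability")
```

Normalizing within each group matters here because the two groups have different sizes in the real data, so raw counts would not be directly comparable.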
Using `group_by()` and `summarize()`, compute the different summary statistics (mean, median, standard deviation, inter-quartile range) of the variable `prglngth` for `first` and `other` births in `pregnancy_length`. How do the different summary statistics compare between the two distributions? Based on the initial summary statistics, does it look like there may be a statistically significant difference between the two distributions? If so, why?
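A minimal sketch of the grouped summary, again on invented toy data rather than the real survey values. The output column names (`mean_weeks` and so on) are arbitrary choices for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for pregnancy_length
pregnancy_length <- tibble(
  prglngth    = c(38, 39, 39, 40, 39, 39, 40, 42),
  birth_order = rep(c("first", "other"), each = 4)
)

# One row of summary statistics per birth-order group
length_summary <- pregnancy_length %>%
  group_by(birth_order) %>%
  summarize(
    mean_weeks   = mean(prglngth),
    median_weeks = median(prglngth),
    sd_weeks     = sd(prglngth),
    iqr_weeks    = IQR(prglngth)
  )
```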
Determine whether the distribution of pregnancy lengths follows a nearly normal distribution. Do this by creating two separate Q-Q plots for the variable `prglngth` in `pregnancy_length`. The first plot is for when `birth_order` equals `"first"` and the second plot is for when `birth_order` equals `"other"`.
To get started, filter `pregnancy_length` so that it only contains `first` births and assign it to the variable named `first_births`. Similarly, filter `pregnancy_length` to only contain `other` births and assign it to the variable named `other_births`.
After filtering, adapt the following code to create a Q-Q plot for `first_births` and also for `other_births`. The additional code helps to plot the “ideal” reference line to show any deviations from normality:
```r
# Theoretical and sample quartiles that anchor the reference line
qq_x <- qnorm(p = c(0.25, 0.75))
qq_y <- quantile(x = dataset$variable, probs = c(0.25, 0.75), type = 1)
# Slope and intercept of the line through the two quartile points
qq_slope <- diff(qq_y) / diff(qq_x)
qq_int <- qq_y[1] - qq_slope * qq_x[1]
ggplot(data = dataset) +
  geom_qq(mapping = aes(sample = variable)) +
  geom_abline(slope = qq_slope, intercept = qq_int, size = 0.75)
```
Please note that you will have to replace the `dataset` and `variable` placeholders. Based on your plots, does it appear that the pregnancy length distribution is nearly normal for both `first` and `other` births? Why or why not?
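To make the reference-line arithmetic concrete, here is the same quartile calculation applied to a made-up sample. The data frame `toy_births` is hypothetical: it holds 101 evenly spaced quantiles of a normal distribution with mean 39 and standard deviation 2, so the line should come out close to those values.

```r
# Hypothetical sample: evenly spaced normal quantiles, mean 39 and sd 2
toy_births <- data.frame(prglngth = qnorm(ppoints(101), mean = 39, sd = 2))

# Theoretical and sample quartiles
qq_x <- qnorm(p = c(0.25, 0.75))
qq_y <- quantile(x = toy_births$prglngth, probs = c(0.25, 0.75), type = 1)

# Line through the two quartile points
qq_slope <- diff(qq_y) / diff(qq_x)
qq_int <- qq_y[1] - qq_slope * qq_x[1]

# For a nearly normal sample, the slope approximates the standard
# deviation and the intercept approximates the mean
qq_slope
qq_int
```

Points on a Q-Q plot that bend away from this line, especially in the tails, are the deviations from normality to look for.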
Returning to the question of whether or not “first babies arrive early or arrive late”, let’s plot the cumulative distribution functions (CDFs) of the pregnancy lengths for `first` and `other` births. Plot the CDF for both distributions on the same figure so that we can directly compare them. You can either use the procedure outlined in Cumulative distribution functions from reading assignment 15 or use `stat_ecdf` to do this. How do the distributions compare? Does it look like there is a meaningful difference between the two distributions?
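A short sketch of the `stat_ecdf` route on invented data, with base R's `ecdf()` used alongside it to show what the plotted curve represents. The names `cdf_plot` and `first_cdf` are illustrative only.

```r
library(ggplot2)

# Hypothetical toy data standing in for pregnancy_length
pregnancy_length <- data.frame(
  prglngth    = c(37, 38, 39, 40, 38, 39, 40, 41),
  birth_order = rep(c("first", "other"), each = 4)
)

# One empirical CDF per group, drawn on the same axes
cdf_plot <- ggplot(pregnancy_length) +
  stat_ecdf(mapping = aes(x = prglngth, color = birth_order), geom = "step") +
  labs(x = "Pregnancy length (weeks)", y = "CDF")

# The same quantity computed directly: the fraction of first births
# lasting 39 weeks or less
is_first <- pregnancy_length$birth_order == "first"
first_cdf <- ecdf(pregnancy_length$prglngth[is_first])
first_cdf(39)
```

If one group's curve sits consistently to the left of the other's, that group tends to have shorter pregnancies.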
If we want to determine whether or not the difference between two distributions is statistically significant, we need to run a hypothesis test. Before going further, for the question of “do first babies arrive early or arrive late”, formalize your analysis by writing down the null hypothesis.
Next, use the pre-loaded `inference()` function to run a hypothesis test for the difference in means of the `prglngth` variable between the `first_births` and `other_births` datasets:
```r
inference(y = prglngth, x = birth_order, data = pregnancy_length, type = "ht",
          statistic = "mean", null = 0, alternative = "twosided",
          order = c("first", "other"), method = "simulation",
          show_eda_plot = FALSE)
```
Assume that we set our significance level to \(\alpha = 0.05\). Based on the outputted p-value, can we reject the null hypothesis?
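To build intuition for what a simulation-based test of a difference in means does, here is a hand-rolled permutation test in base R. This is a sketch under stated assumptions, not necessarily what the pre-loaded `inference()` function does internally; the toy data and the helper `diff_in_means()` are invented for illustration.

```r
set.seed(2017)  # fixed seed so the simulation is reproducible

# Hypothetical toy data standing in for pregnancy_length
pregnancy_length <- data.frame(
  prglngth    = c(38, 39, 39, 40, 41, 39, 38, 39, 40, 39),
  birth_order = rep(c("first", "other"), each = 5)
)

# Difference in group means (first minus other)
diff_in_means <- function(values, groups) {
  mean(values[groups == "first"]) - mean(values[groups == "other"])
}

observed_diff <- diff_in_means(
  pregnancy_length$prglngth,
  pregnancy_length$birth_order
)

# Shuffle the group labels many times to simulate a world in which
# birth order has no effect on pregnancy length
null_diffs <- replicate(
  5000,
  diff_in_means(pregnancy_length$prglngth, sample(pregnancy_length$birth_order))
)

# Two-sided p-value: share of shuffled differences at least as extreme
# as the observed one
p_value <- mean(abs(null_diffs) >= abs(observed_diff))
```

The null hypothesis is rejected at the chosen significance level only when `p_value` falls below it.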
For completeness, also compute the 95% confidence interval for the difference in means:
```r
inference(y = prglngth, x = birth_order, data = pregnancy_length,
          type = "ci", statistic = "mean", null = 0, method = "simulation",
          order = c("first", "other"), boot_method = "perc",
          show_eda_plot = FALSE)
```
Does the confidence interval overlap with the null value?
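The percentile bootstrap behind a simulation-based confidence interval can be sketched in a few lines of base R. As before, this is an illustration on made-up data and only an approximation of the idea; the internals of `inference()` may differ.

```r
set.seed(2017)  # fixed seed so the simulation is reproducible

# Hypothetical toy data standing in for the two groups
first_weeks <- c(38, 39, 39, 40, 41)
other_weeks <- c(39, 38, 39, 40, 39)

# Resample each group with replacement and record the difference in means
boot_diffs <- replicate(
  5000,
  mean(sample(first_weeks, replace = TRUE)) -
    mean(sample(other_weeks, replace = TRUE))
)

# Percentile method: the middle 95% of the bootstrap differences
ci_95 <- quantile(boot_diffs, probs = c(0.025, 0.975))

# Check whether the null value of 0 lies inside the interval
contains_null <- ci_95[[1]] <= 0 && 0 <= ci_95[[2]]
```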
In addition to hypothesis tests and confidence intervals, we should also consider the effect size, which measures the relative difference between two distributions. The effect size helps us judge how important a given result actually is, not just whether or not we can reject the null hypothesis. One measure of the effect size is called Cohen’s d, which we will use to compute the effect size between the pregnancy lengths for first births and other births. The different ranges of d can be interpreted using the following table:
Effect size | d |
---|---|
Very small | 0.01 |
Small | 0.20 |
Medium | 0.50 |
Large | 0.80 |
Very large | 1.20 |
Huge | 2.00 |
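For reference, a common definition of Cohen’s d is the difference between the two group means divided by their pooled standard deviation. The function `cohens_d()` below is a hypothetical helper written for this illustration; it is not the preloaded `cohens_d_bootstrap()`, which may differ in its details.

```r
# Cohen's d using a pooled standard deviation (one common definition)
cohens_d <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Hypothetical toy values: means of 40 and 39, each group with variance 4
first_weeks <- c(38, 40, 42)
other_weeks <- c(37, 39, 41)

d <- cohens_d(first_weeks, other_weeks)
```

In this toy case the means differ by one week and the pooled standard deviation is two weeks, so `d` is 0.5, a medium effect according to the table above.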
The following set of functions should also be preloaded for you: `cohens_d_bootstrap()`, `bootstrap_report()`, and `plot_ci()`. These functions will use bootstrap simulations to compute the confidence interval for the Cohen’s d parameter. Run the bootstrap simulation as follows:
```r
cohens_d_bootstrap(data = pregnancy_length, model = prglngth ~ birth_order)
```
Be sure to assign the results to a variable, for example `bootstrap_results`. Note that the input `model` specifies how to split the data into categories. You put the response variable (`prglngth`) on the left side of a tilde `~`, and the categorical explanatory variable (`birth_order`) on the right side.
To print a report for the bootstrap simulation, run:
```r
bootstrap_report(bootstrap_results)
```
To visualize the bootstrap distribution and confidence interval, run:
```r
plot_ci(bootstrap_results)
```
Interpret the outputs from the Cohen’s d bootstrap simulation. How large is the effect size between the pregnancy lengths of `first` births and `other` births? Based on this and the previous hypothesis test, provide an answer to the question “do first born children arrive early or late compared to other children?”
Use the inference tools and procedures that you practiced in the previous exercises to obtain an answer to the question “Do children born to mothers that reported smoking during their pregnancy either weigh more or weigh less than children born to non-smoking mothers?” To answer this question, you will need to start from the dataset stored in `live_births` and work with the columns `totalwgt_lb` and `postsmks`. Please review the About the dataset section for information on what the values in those columns mean.
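As a starting point, the data preparation for this second question might look roughly like the sketch below. It assumes that only the `postsmks` codes 1 (YES) and 5 (NO) are of interest and that rows with missing weights should be dropped; the toy values and the names `birth_weight` and `smoker` are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for live_births
live_births <- tibble(
  totalwgt_lb = c(7.5, 6.8, NA, 8.1, 7.0, 6.5),
  postsmks    = c(5, 1, 5, 5, 1, NA)
)

# Keep rows with a recorded weight and a YES/NO smoking code, then
# replace the integer code with a readable label
birth_weight <- live_births %>%
  select(totalwgt_lb, postsmks) %>%
  filter(postsmks %in% c(1, 5), !is.na(totalwgt_lb)) %>%
  mutate(smoker = if_else(postsmks == 1, "smoker", "non-smoker")) %>%
  select(totalwgt_lb, smoker)
```

With the data in this shape, the same plotting, testing, and effect-size steps used for pregnancy length carry over with `totalwgt_lb` as the response and `smoker` as the grouping variable.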
In order to properly answer the question, you will need to include:
A visual comparison of the two data distributions
A hypothesis test and a confidence interval, both computed with the `inference()` function
An effect size, computed and checked with `cohens_d_bootstrap()`
At the end, write a paragraph that summarizes your results and provides an answer as to whether there is a connection between birth weights and smoking. Write the summary as if you are an academic or scientific journalist, focusing on how you can answer the question clearly, precisely, and honestly.