Materials for Class 2


Sign up for GitHub!

  • If you don’t have a GitHub account, start signing up for one now
    • If on class computers, open Firefox or Chromium and navigate to the GitHub website
    • Click sign up, and create an account using your Mason email address


  • Complete the Try R tutorial and upload your proof of completion
  • Office hours reminder:
    • Mondays 1:00pm - 2:00pm
    • Tuesdays 11:00am - 12:00pm
    • By appointment
  • Course materials are currently on Blackboard and will move to GitHub next week

Class agenda

  • Motivation for “Reproducible research”
  • R and RStudio for computation and basic scripting
  • GitHub for version control and feedback
  • R Markdown for documenting, analyzing, and reporting
  • Live demo

Motivation for Reproducible Research

The scientific method

  1. Review evidence
  2. Hypothesis
  3. Formulate predictive test
  4. Design/run experiment
  5. Validate or revise hypothesis
  • Key point: create a hypothesis and test it out
  • Validation by the natural world (“Nature”)
  • Anyone can double check!

Reproducibility in practice

  • Sometimes easier said than done, for various reasons:
    • Lack of funding sources
    • Lack of data sharing
    • Lack of interest
    • “Top-tier” journals won’t publish
    • Vague methods
  • It’s very important that we have reproduced research, because…

The Reproducibility Project

“Brian Nosek of University of Virginia and colleagues sought out to replicate 100 different studies that all were published in 2008. The project pulled these studies from three different [psychology] journals… to see if they could get the same results as the initial findings. […] Only 36.1% of the studies [were] replicated.”

Source: Reproducibility Project Wikipedia entry

Science retracts gay marriage paper without agreement of lead author LaCour

  • In May 2015, Science retracted a study, published just 5 months earlier, of how canvassers can sway people’s opinions about gay marriage.

  • Science Editor-in-Chief Marcia McNutt:
    • Original survey data not made available for independent reproduction of results.
    • Survey incentives misrepresented.
    • Sponsorship statement false.
  • Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.

  • Methods we’ll discuss today can’t prevent this, but they can make it easier to discover issues.


Seizure study retracted after authors realize data got “terribly mixed”

“The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”

Source: authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates


Bad spreadsheet merge kills depression paper, quick fix resurrects it

  • The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.

  • Original conclusion: “Lower levels of CSF IL-6 were associated with current depression and with future depression […]”.

  • Revised conclusion: “Higher levels of CSF IL-6 and IL-8 were associated with current depression […]”.
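
An identifier mix-up of this kind can often be caught with a few checks around the merge step. The following is a minimal sketch in base R, not the authors' actual workflow: the data frames `lab` and `survey`, the `id` column, and all values are made up for illustration.

```r
# Hypothetical example data; none of these values come from the study.
lab <- data.frame(id = c(1, 2, 3, 4), il6 = c(1.2, 0.8, 2.5, 1.9))
survey <- data.frame(id = c(1, 2, 3, 5),
                     depressed = c(TRUE, FALSE, TRUE, FALSE))

# Identifiers should be unique in each table before merging.
stopifnot(!anyDuplicated(lab$id), !anyDuplicated(survey$id))

# Merge on the shared identifier, keeping only rows present in both tables.
merged <- merge(lab, survey, by = "id")

# Record identifiers that failed to match instead of dropping them silently.
unmatched_lab <- setdiff(lab$id, survey$id)
unmatched_survey <- setdiff(survey$id, lab$id)

# The merged table should never have more rows than either input.
stopifnot(nrow(merged) <= min(nrow(lab), nrow(survey)))
```

Because the checks live in a script rather than in a manual spreadsheet operation, they run again every time the analysis is repeated.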


Reproducibility: why should we care?

Two-pronged approach

  • Convince researchers to adopt a reproducible research workflow

  • Train new researchers who don’t have any other workflow

Reproducible data analysis

  • Scriptability → R

  • Literate programming → R Markdown

  • Version control → Git / GitHub

Scripting and literate programming

Donald Knuth, Literate Programming (1983)

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Donald Knuth in Literate Programming (1983)

  • These ideas have been around for years!
  • Tools for putting them into practice have also been around
  • But they have never been as accessible as the current tools

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other data?
  • Can you extend the code to do other things?
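
One way to work toward several checklist items at once is to wrap an analysis step in a function. The sketch below is only an illustration: `summarise_column` and its `trim` argument are hypothetical names, and `mtcars` and `iris` are example data sets that ship with R.

```r
# Hypothetical helper: summarise one numeric column of any data frame.
# The trimming fraction is an explicit argument, so the parameter choice
# is visible in the code rather than hidden in a manual step.
summarise_column <- function(data, column, trim = 0) {
  stopifnot(is.data.frame(data), column %in% names(data))
  values <- data[[column]]
  data.frame(
    column = column,
    n = sum(!is.na(values)),
    mean = mean(values, trim = trim, na.rm = TRUE)
  )
}

# The same code runs unchanged on different data sets.
summarise_column(mtcars, "mpg")
summarise_column(iris, "Sepal.Length", trim = 0.1)
```

Keeping the settings as arguments documents what was chosen, and a short comment next to each call can record why.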


Start up RStudio

  • Launch RStudio on the class desktops

Live R/RStudio demo

  • R as a calculator
    2 + 2
  • Working with variables
    x <- 2
    x * 3
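
The two demo steps can also be saved together as a short script and rerun at any time. This is a minimal sketch; the variable `y` is an assumption added for illustration and was not part of the original demo.

```r
# R as a calculator: arithmetic follows the usual operator precedence.
2 + 2
2 + 2 * 3

# Working with variables: assign with <- and reuse the stored value.
x <- 2
y <- x * 3
y
```

Typing a variable name on its own line, as with `y` here, prints its current value in the console.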

Working with GitHub

Cloning the repository

  • Go to RStudio

  • File -> New Project
    • Version Control: Checkout a project from a version control repository
    • Git: Clone a project from a repository
    • Fill in the info:
      • URL: use HTTPS address
      • Create as a subdirectory of: Browse and create a new folder called cds101
  • Note for the future: Each course component you work on (an application exercise, a homework assignment, project, etc.) should be its own repository, and should be fully contained in a folder inside the folder cds101.


Slides adapted from these course notes by Mine Çetinkaya-Rundel.