Materials for Class 2
- Download for offline use (open these using your web browser)
General
Sign up for GitHub!
- If you don’t have a GitHub account, sign up for one now
- If on class computers, open Firefox or Chromium and navigate to https://github.com
- Click “Sign up” and create an account using your Mason email address
Announcements
- Complete the Try R tutorial and upload your proof of completion
- Office hours reminder:
- Mondays 1:00pm - 2:00pm
- Tuesdays 11:00am - 12:00pm
- By appointment
- Course materials are currently on Blackboard and will move to GitHub next week
Class agenda
- Motivation for “Reproducible research”
- R and RStudio for computation and basic scripting
- GitHub for version control and feedback
- RMarkdown for documenting, analyzing, and reporting
- Live demo
Motivation for Reproducible Research
The scientific method
- Review evidence
- Hypothesis
- Formulate predictive test
- Design/run experiment
- Validate or revise hypothesis
- Key point: create a hypothesis and test it out
- Validation by the natural world (“Nature”)
- Anyone can double check!
Reproducibility in practice
- Sometimes easier said than done, various reasons why
- Lack of funding sources
- Lack of data sharing
- Lack of interest
- “Top-tier” journals won’t publish
- Vague methods
- It’s very important that research can be reproduced, because…
The Reproducibility Project
Brian Nosek of University of Virginia and colleagues sought out to replicate 100 different studies that all were published in 2008. The project pulled these studies from three different [psychology] journals… to see if they could get the same results as the initial findings. […] Only 36.1% of the studies [were] replicated.
Source: Reproducibility Project Wikipedia entry
Science retracts gay marriage paper without agreement of lead author LaCour
In May 2015, Science retracted a study of how canvassers can sway people’s opinions about gay marriage, published just 5 months earlier.
- Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results.
- Survey incentives misrepresented.
- Sponsorship statement false.
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we’ll discuss today can’t prevent this, but they can make it easier to discover issues.
Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent
Seizure study retracted after authors realize data got “terribly mixed”
The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.
Authors of “Low Dose Lidocaine for Refractory Seizures in Preterm Neonates”
Source: http://retractionwatch.com/2013/02/01/seizure-study-retracted-after-authors-realize-data-got-terribly-mixed/
Bad spreadsheet merge kills depression paper, quick fix resurrects it
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: “Lower levels of CSF IL-6 were associated with current depression and with future depression […]”.
Revised conclusion: “Higher levels of CSF IL-6 and IL-8 were associated with current depression […]”.
Source: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/
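The retraction notice does not describe exactly how the identification codes went wrong, so the following is only a generic sketch of how a scripted merge can guard against this kind of error. The data frames `lab` and `survey`, their columns, and all values are hypothetical and made up for illustration.

```r
# Hypothetical example data: lab results and survey answers keyed by subject ID.
# Note that the two tables list the subjects in a different order.
lab <- data.frame(id = c(101, 102, 103), il6 = c(1.2, 3.4, 2.1))
survey <- data.frame(id = c(103, 101, 102), depressed = c(TRUE, FALSE, TRUE))

# Check the keys before merging: no duplicate IDs in either table,
# and both tables cover exactly the same subjects.
stopifnot(anyDuplicated(lab$id) == 0, anyDuplicated(survey$id) == 0)
stopifnot(setequal(lab$id, survey$id))

# Merge explicitly on the ID column instead of pasting columns side by side,
# which would silently pair rows by position.
merged <- merge(lab, survey, by = "id")

# Confirm that the merge neither dropped nor duplicated rows.
stopifnot(nrow(merged) == nrow(lab))
merged
```

Because the checks live in the script, they run again every time the analysis is rerun, which is harder to guarantee with a manual spreadsheet merge.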
Reproducibility: why should we care?
Two-pronged approach
Reproducible data analysis
Scripting and literate programming
Donald Knuth Literate Programming (1983)
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
Donald Knuth in Literate Programming (1983)
- These ideas have been around for years!
- and tools for putting them into practice have also been around
- but they have never been as accessible as the current tools
Reproducibility checklist
- Are the tables and figures reproducible from the code and data?
- Does the code actually do what you think it does?
- In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
- Can the code be used for other data?
- Can you extend the code to do other things?
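As a rough illustration of the checklist, here is a minimal sketch of a script written with those questions in mind. The function `summarize_values`, its `trim` parameter, and the example data are hypothetical and not part of the course materials.

```r
# Fix the random seed so that results involving random numbers are repeatable.
set.seed(101)

# Hypothetical helper: summarize any numeric vector, not just one data set.
# The trim value is an argument rather than a hard-coded number, and the
# comment below records why the default was chosen.
summarize_values <- function(values, trim = 0.1) {
  # trim = 0.1 drops the most extreme 10% on each side to limit the
  # influence of outliers on the trimmed mean
  c(n = length(values),
    mean = mean(values),
    trimmed_mean = mean(values, trim = trim))
}

# The same code runs unchanged on the original data and on other data.
original <- c(2, 4, 4, 5, 7, 9, 30)
other <- rnorm(20, mean = 5, sd = 2)

summarize_values(original)
summarize_values(other)
```

Keeping the parameter choice and its reason next to the code addresses the "why it was done" item, and wrapping the steps in a function makes it easier to reuse or extend the analysis.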
Demo
Start up RStudio
- Launch RStudio on the class desktops
Live R/RStudio demo
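2 + 2          # basic arithmetic
factorial(20)  # call a built-in function
x <- 2         # assign the value 2 to the variable x
x * 3          # use x in an expression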
Working with GitHub
Cloning the repository
Go to RStudio
- File -> New Project
- Version Control: Checkout a project from a version control repository
- Git: Clone a project from a repository
- Fill in the info:
- URL: use HTTPS address
- Create as a subdirectory of: Browse and create a new folder called cds101
Note for the future: Each course component you work on (an application exercise, a homework assignment, project, etc.) should be its own repository, and should be fully contained in a folder inside the folder cds101.
Credits
Slides adapted from these course notes by Mine Çetinkaya-Rundel.