September 27, 2017

General

Announcements

  • Don't forget: Assignment 1, Part B due on Friday, September 29

Tidy data

Principles

  1. Each variable must have its own column.

  2. Each observation (case) must have its own row.

  3. Each value must have its own cell.
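As a minimal sketch of the three principles, the same made-up data can be stored in an untidy and a tidy layout. The country names, years, and case counts below are hypothetical and chosen only for illustration.

```r
library(tibble)

# Untidy: the variable "year" is hidden in the column names,
# so each row holds two observations.
untidy <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",           10,      20,
  "B",           30,      40
)

# Tidy: one column per variable, one row per observation,
# one value per cell.
tidy <- tribble(
  ~country, ~year, ~cases,
  "A",       1999,     10,
  "A",       2000,     20,
  "B",       1999,     30,
  "B",       2000,     40
)
```

Both tables carry the same information; only the tidy one has a column for every variable (country, year, cases).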

Why should we care?

First, according to R for Data Science,

  1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

  2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

    • Translation: Getting data into this form allows you to work on entire columns at a time using short and memorable commands

    • If you've programmed before, you are probably familiar with loops. In other languages, you sometimes have to explicitly tell your computer to move over the tabular dataset one cell at a time. R can do this, but it's slow. The "vectorized" tools of the tidyverse are both faster and easier to understand.
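To make the loop versus vectorized contrast concrete, here is a small sketch that converts a column of heights from centimetres to metres both ways. The `people` table and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: one row per person, height in centimetres.
people <- tibble(
  name      = c("Ann", "Bob", "Cy"),
  height_cm = c(160, 175, 182)
)

# Loop version: visit the column one cell at a time.
height_m_loop <- numeric(nrow(people))
for (i in seq_len(nrow(people))) {
  height_m_loop[i] <- people$height_cm[i] / 100
}

# Vectorized version: transform the whole column in one step.
people <- mutate(people, height_m = height_cm / 100)
```

Both versions produce the same numbers, but the `mutate()` call states the intent in a single line.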

Why should we care?

  • There's a theoretical foundation to this, actually

  • Closely related to the formalism of relational databases

  • If you follow these rules, your data will be in Codd's 3rd normal form (if this means anything to you)

  • Helpful if your dataset grows large enough that you need to store it in a formal database, such as a SQL database (PostgreSQL, MySQL)

Why should we care?

  • Practically speaking, the tidying process makes the categories in your data clearer

  • It makes analysis much easier too, because you can easily subdivide your data by category, and apply transformations where needed

  • Provides a standardized, "best practices" way to structure and store our datasets

    • Note that you may not collect or input your data straight into tidy format
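As a sketch of what "subdivide your data by category" can look like once the data is tidy, the example below totals a value within each group using dplyr. The table and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical tidy data: one row per country-year observation.
tidy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  cases   = c(10, 20, 30, 40)
)

# Split the rows by country, then total the cases within each group.
cases_by_country <- tidy %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases))
```

Because `country` is its own column, grouping by it needs no reshaping first.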

Final note

  • Data tidying does not encompass the entire data cleaning process

  • Data tidying refers only to reshaping the data, such as moving columns and rows around

  • Cleaning operations, such as correcting spelling errors, renaming variables, etc., are a separate topic

tidyr package

  • Functions (commands) that allow you to reshape data

  • Oriented towards the kinds of datasets we've worked with previously, where each column may be a different data type (numeric, string, logical, etc.)

  • Functions (commands) are typed in a way that's very similar to the dplyr verbs, such as filter() and mutate()

tidyr verbs

  • gather(): transforms wide data to narrow data

  • spread(): transforms narrow data to wide data

  • separate(): makes multiple columns out of a single column

  • unite(): makes a single column out of multiple columns
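The four verbs can be sketched on a small made-up table. All data values and column names below are hypothetical and are not the textbook's examples.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per year.
wide <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",           10,      20,
  "B",           30,      40
)

# gather(): wide -> narrow. The old column names become values of `year`.
narrow <- gather(wide, key = "year", value = "cases", `1999`, `2000`)

# spread(): narrow -> wide. Undoes the gather() above.
wide_again <- spread(narrow, key = year, value = cases)

# separate(): split one column into several at a separator.
rates <- tibble(country = c("A", "B"), rate = c("10/1000", "30/2000"))
split_rates <- separate(rates, rate, into = c("cases", "population"), sep = "/")

# unite(): paste several columns back into one.
reunited <- unite(split_rates, rate, cases, population, sep = "/")
```

Note that `separate()` returns character columns unless you ask it to convert them. `gather()` and `spread()` still work in current tidyr, although newer releases recommend `pivot_longer()` and `pivot_wider()` for the same jobs.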

Simple examples from textbook

Follow along in RStudio