September 27, 2017

## Announcements

• Don't forget: Assignment 1, Part B due on Friday, September 29

## Principles

1. Each variable must have its own column.

2. Each observation (case) must have its own row.

3. Each value must have its own cell.
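As a rough illustration of the three rules, here is the same small dataset stored two ways. The cities and rainfall figures are made up for this sketch; only base R is used.

```r
# Hypothetical data: yearly rainfall (mm) for two made-up cities.

# Untidy: the variable "year" is hidden in the column names.
untidy <- data.frame(
  city   = c("Springfield", "Shelbyville"),
  `2016` = c(810, 640),
  `2017` = c(775, 702),
  check.names = FALSE   # keep "2016"/"2017" as literal column names
)

# Tidy: each variable (city, year, rainfall_mm) has its own column,
# each city-year observation has its own row,
# and each value sits in its own cell.
tidy <- data.frame(
  city        = c("Springfield", "Springfield", "Shelbyville", "Shelbyville"),
  year        = c(2016, 2017, 2016, 2017),
  rainfall_mm = c(810, 775, 640, 702)
)

print(untidy)
print(tidy)
```

Both tables hold the same four measurements; only the tidy one names `year` as a variable you can filter or group on.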

## Why should we care?

First, according to R for Data Science,

1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

• Translation: Getting data into this form allows you to work on entire columns at a time using short and memorable commands

• If you've programmed before, you are probably familiar with loops. In other languages, you sometimes have to explicitly tell your computer to move over the tabular dataset one cell at a time. R can do this, but it's slow. The "vectorized" tools of the tidyverse are both faster and easier to understand.
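To make the loop-versus-vectorized contrast concrete, here is a minimal sketch in base R. The `heights` data frame and its values are invented for illustration; dplyr's mutate() would express the vectorized step in much the same way.

```r
# Hypothetical data: heights in centimetres.
heights <- data.frame(height_cm = c(150, 162.5, 171, 180))

# Loop version: visit the column one cell at a time.
height_in_loop <- numeric(nrow(heights))
for (i in seq_len(nrow(heights))) {
  height_in_loop[i] <- heights$height_cm[i] / 2.54
}

# Vectorized version: one expression operates on the whole column.
heights$height_in <- heights$height_cm / 2.54

# Both approaches produce the same numbers.
stopifnot(isTRUE(all.equal(height_in_loop, heights$height_in)))
print(heights)
```

The vectorized line only works this cleanly because the variable lives in a single column, which is exactly what tidy data guarantees.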

## Why should we care?

• There's a theoretical foundation to this, actually

• Closely related to the formalism of relational databases

• If you follow these rules, your data will be in Codd's 3rd normal form (if this means anything to you)

• Helpful if your dataset grows large enough that you need to store it in a formal database, such as a SQL database (PostgreSQL, MySQL)

## Why should we care?

• Practically speaking, the tidying process makes the categories in your data clearer

• It makes analysis much easier too, because you can easily subdivide your data by category, and apply transformations where needed

• Provides a standardized, "best practices" way to structure and store our datasets

• Note that you may not collect or input your data straight into tidy format
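The point about subdividing data by category can be sketched with base R's aggregate(). The data below are made up for this example, and dplyr's group_by() and summarise() would be an alternative way to write the same step.

```r
# Hypothetical tidy data: one row per city-year observation.
rain <- data.frame(
  city        = c("Springfield", "Springfield", "Shelbyville", "Shelbyville"),
  year        = c(2016, 2017, 2016, 2017),
  rainfall_mm = c(810, 775, 640, 702)
)

# Because "city" is its own column, splitting by category is one call:
# compute the mean rainfall separately for each city.
mean_by_city <- aggregate(rainfall_mm ~ city, data = rain, FUN = mean)

print(mean_by_city)
```

If the years were spread across column names instead, the same summary would first require reshaping the table.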

## Final note

• Data tidying does not encompass the entire data cleaning process

• Data tidying refers only to reshaping, such as moving columns and rows around

• Cleaning operations, such as correcting spelling errors, renaming variables, etc., are a separate topic

## tidyr package

• Functions (commands) that allow you to reshape data

• Oriented towards the kinds of datasets we've worked with previously, where each column may be a different data type (numeric, string, logical, etc.)

• Functions (commands) are written in a way that's very similar to the dplyr verbs, such as filter() and mutate()
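A small side-by-side sketch of that similarity, assuming both dplyr and tidyr are installed. The `scores` data frame is hypothetical, and the `%>%` pipe is used here on the assumption that it is already familiar from working with dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two quiz scores per student.
scores <- data.frame(
  student = c("Ana", "Ben"),
  quiz1   = c(8, 6),
  quiz2   = c(9, 7),
  stringsAsFactors = FALSE
)

# dplyr verb: data frame first, then bare column names.
passed_quiz1 <- filter(scores, quiz1 >= 7)

# tidyr verb: the same call shape, data frame first, bare column names after.
long_scores <- gather(scores, key = "quiz", value = "score", quiz1, quiz2)

# Since both take the data frame as the first argument, they chain together.
high_scores <- scores %>%
  gather(key = "quiz", value = "score", quiz1, quiz2) %>%
  filter(score >= 7)

print(high_scores)
```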

## tidyr verbs

• gather(): transforms wide data to narrow data

• spread(): transforms narrow data to wide data

• separate(): makes multiple columns out of a single column

• unite(): makes a single column out of multiple columns
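The four verbs can be tried on toy data as follows. This is only a sketch: the data frames and column names are invented, and the calls use the gather()/spread() interface named above (later tidyr releases also offer pivot_longer() and pivot_wider() for the same reshaping).

```r
library(tidyr)

# Hypothetical wide data: one column per year.
wide <- data.frame(
  city   = c("Springfield", "Shelbyville"),
  `2016` = c(810, 640),
  `2017` = c(775, 702),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather(): wide -> narrow. The year columns become key/value pairs.
# Note that the new "year" column holds the old column names as text.
narrow <- gather(wide, key = "year", value = "rainfall_mm", `2016`, `2017`)

# spread(): narrow -> wide. Undoes the gather() above.
wide_again <- spread(narrow, key = year, value = rainfall_mm)

# Hypothetical data with two values packed into one column.
rates <- data.frame(
  city = c("Springfield", "Shelbyville"),
  rate = c("12/400", "30/750"),
  stringsAsFactors = FALSE
)

# separate(): one column -> several columns, split at the "/".
split_rates <- separate(rates, rate, into = c("cases", "population"), sep = "/")

# unite(): several columns -> one column, joined with "/".
joined_rates <- unite(split_rates, rate, cases, population, sep = "/")

print(narrow)
print(wide_again)
print(split_rates)
print(joined_rates)
```

By default separate() leaves the new columns as character strings; its `convert = TRUE` argument asks tidyr to guess a better type.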

## Simple examples from textbook

Follow along in RStudio