CDS-101-001 Class 9 Data wrangling III

September 27, 2017

General

First, according to R for Data Science,

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
- Translation: Getting data into this form allows you to work on entire columns at a time using short and memorable commands
- If you've programmed before, you are probably familiar with loops. In other languages, you sometimes have to explicitly tell your computer to move over the tabular dataset one cell at a time. R can do this, but it's slow. The "vectorized" tools of the tidyverse are both faster and easier to understand.
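
To make the loop-versus-vectorized contrast concrete, here is a minimal sketch in base R. The vector `heights_cm` and its values are made up for illustration and are not from the course materials.

```r
# Hypothetical data: heights in centimeters (illustrative values only)
heights_cm <- c(150, 162.5, 171, 180)

# Loop version: visit one element at a time
heights_m_loop <- numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
  heights_m_loop[i] <- heights_cm[i] / 100
}

# Vectorized version: operate on the whole vector at once
heights_m_vec <- heights_cm / 100
```

Both versions produce the same values; the vectorized form simply states the operation once for the entire column.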

There's a theoretical foundation to this, actually
Closely related to the formalism of relational databases
If you follow these rules, your data will be in Codd's 3rd normal form (if this means anything to you)
Helpful if your dataset grows large enough that you need to store it in a formal database, such as a SQL database (PostgreSQL, MySQL)

Practically speaking, the tidying process makes the categories in your data clearer
It makes analysis much easier too, because you can easily subdivide your data by category, and apply transformations where needed
Provides a standardized, "best practices" way to structure and store our datasets
- Note that you may not collect or input your data straight into tidy format
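
As a rough sketch of what "subdividing by category" can look like once data is tidy, the example below groups a small table by one column and summarizes another. The `measurements` table, its column names, and its values are hypothetical, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical tidy dataset: one row per observation, one column per variable
measurements <- tibble(
  site  = c("A", "A", "B", "B"),
  year  = c(2016, 2017, 2016, 2017),
  count = c(10, 14, 7, 9)
)

# Split the rows by site, then compute the mean count within each site
site_means <- measurements %>%
  group_by(site) %>%
  summarise(mean_count = mean(count))
```

Because `site` sits in its own column, the grouping step needs no special handling of the table's layout.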

Data tidying does not encompass the entire data cleaning process
Data tidying refers only to reshaping things, such as moving columns and rows around
Cleaning operations, such as correcting spelling errors, renaming variables, etc., are a separate topic

Functions (commands) that allow you to reshape data
Oriented towards the kinds of datasets we've worked with previously, where each column may be a different data type (numeric, string, logical, etc.)
Functions (commands) are typed in a way that's very similar to the dplyr verbs, such as filter() and mutate()
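
The sketch below shows one reshaping call written in the same piped style as the dplyr verbs. It assumes tidyr version 1.0 or later, where the function is `pivot_longer()`; older materials use `gather()` for the same job. The `wide` table and its values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy table: the year values are spread across column names
wide <- tibble(
  country = c("X", "Y"),
  `1999`  = c(100, 200),
  `2000`  = c(110, 210)
)

# Reshape so that year becomes a variable stored in its own column
long <- wide %>%
  pivot_longer(
    cols      = c(`1999`, `2000`),
    names_to  = "year",
    values_to = "cases"
  )
```

After the reshape, each row of `long` holds a single country-year observation, so verbs such as `filter()` and `mutate()` can be applied to `year` and `cases` directly.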

Follow along in RStudio