September 27, 2017
Each variable must have its own column.
Each observation (case) must have its own row.
Each value must have its own cell.
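As a small illustration of the three rules, here is a sketch of a tidy table built by hand in base R. The column names and values are made up for this example and do not come from any real dataset.

```r
# A made-up tidy table: one variable per column, one observation per row.
tidy_example <- data.frame(
  country = c("A", "A", "B", "B"),        # variable 1
  year    = c(1999, 2000, 1999, 2000),    # variable 2
  cases   = c(10, 15, 20, 30),            # variable 3
  stringsAsFactors = FALSE
)

# Each row is one observation (a country in a given year),
# and each cell holds exactly one value.
print(tidy_example)
```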
First, according to R for Data Science,
There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
Translation: Getting data into this form allows you to work on entire columns at a time using short and memorable commands
If you've programmed before, you are probably familiar with loops. In other languages, you sometimes have to explicitly tell your computer to move over the tabular dataset one cell at a time. R can do this, but it's slow; the "vectorized" tools of the tidyverse are both faster and easier to understand.
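To make the contrast between loops and vectorized code concrete, here is a minimal sketch in base R. The data frame and its height_cm column are invented for the example.

```r
# Made-up data: heights in centimetres.
heights <- data.frame(height_cm = c(150, 160, 170, 180))

# Loop version: visit one cell at a time.
height_m_loop <- numeric(nrow(heights))
for (i in seq_len(nrow(heights))) {
  height_m_loop[i] <- heights$height_cm[i] / 100
}

# Vectorized version: operate on the whole column in one step.
height_m_vec <- heights$height_cm / 100

# Check that both approaches give the same result.
print(identical(height_m_loop, height_m_vec))
```

With dplyr, the same whole-column operation would typically be written inside mutate(), for example mutate(heights, height_m = height_cm / 100).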
There's a theoretical foundation to this, actually
Closely related to the formalism of relational databases
If you follow these rules, your data will be in Codd's 3rd normal form (if this means anything to you)
Helpful if your dataset grows large enough that you need to store it in a formal database, such as a SQL database (PostgreSQL, MySQL)
Practically speaking, the tidying process makes the categories in your data clearer
It makes analysis much easier too, because you can easily subdivide your data by category, and apply transformations where needed
Provides a standardized, "best practices" way to structure and store our datasets
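As a rough sketch of what subdividing by category looks like once data is tidy, the example below uses dplyr on a made-up table. The table, its column names, and the particular summary chosen are assumptions for illustration only.

```r
library(dplyr)

# Made-up tidy data: one row per country-year observation.
cases <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  count   = c(10, 15, 20, 30)
)

# Subdivide by category (country) and summarise each group.
totals <- cases %>%
  group_by(country) %>%
  summarise(total = sum(count))

# Apply a transformation only where needed (here, only to rows from 2000).
cases_2000 <- cases %>%
  filter(year == 2000) %>%
  mutate(count_per_10 = count / 10)

print(totals)
print(cases_2000)
```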
Data tidying does not encompass the entire data cleaning process
Data tidying only refers to reshaping things, such as moving columns and rows around
Cleaning operations, such as correcting spelling errors, renaming variables, etc., are a separate topic
Functions (commands) that allow you to reshape data
Oriented towards the kinds of datasets we've worked with previously, where each column may be a different data type (numeric, string, logical, etc.)
The functions (commands) are typed in a way that's very similar to the dplyr verbs. The four main ones:
gather(): transforms wide data to narrow data
spread(): transforms narrow data to wide data
separate(): makes multiple columns out of a single column
unite(): makes a single column out of multiple columns
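The four functions above can be sketched on a tiny made-up dataset as follows. The data frames and column names here are hypothetical and exist only to show the shape of each call.

```r
library(tidyr)

# Made-up wide data: one column per year.
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 30),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather(): wide -> narrow. The year columns become key/value pairs.
narrow <- gather(wide, key = "year", value = "cases", `1999`, `2000`)

# spread(): narrow -> wide, reversing the step above.
wide_again <- spread(narrow, key = year, value = cases)

# separate(): split one column into several.
rates <- data.frame(rate = c("10/100", "20/200"), stringsAsFactors = FALSE)
split_rates <- separate(rates, rate, into = c("cases", "population"), sep = "/")

# unite(): paste several columns back into one.
joined <- unite(split_rates, "rate", cases, population, sep = "/")

print(narrow)
print(wide_again)
print(split_rates)
print(joined)
```

Note that separate() returns character columns by default. In more recent tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(), though they remain available.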
Follow along in RStudio