Stats blog 2 (Data wrangling)

Mar 19, 2021

Today, I would like to share what I have learned about tidy data and data wrangling from my previous statistics classes and STA303.

Tidy data:

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

If a dataset is not tidy, we can often use the function pivot_longer() (from the tidyr package) to reshape it so that it follows the three rules. This is the usual fix when several column names are really values of one variable.
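
To make this concrete, here is a small sketch of pivot_longer(). The data and the column names (country, year, cases) are made up for illustration; they are not from the course material.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: the year columns are values, not variables
untidy <- tibble(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(15, 25)
)

# Gather the year columns into one "year" column and one "cases" column
tidy <- pivot_longer(
  untidy,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "cases"
)

tidy  # 4 rows: one per country and year
```

Each row is now one observation (a country in a year), and each variable has its own column.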

Data wrangling: (Some useful functions in R)

Glimpse: glimpse() lets us take a quick look at the data and its variables and see the type of each variable.
Head: head() lets us look at the top few rows.
Str: str() gives a more detailed view of the structure, which is quite good if we know there are some complicated structures in our data.
View: View() opens the data in a new spreadsheet-style tab (in RStudio).
nrow(): it returns the number of rows.
distinct(): it returns a tibble with only the distinct rows from the original data.
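
A quick sketch of these functions on a small tibble. The data here (scores, student, mark) are hypothetical and only meant to show what each call does.

```r
library(dplyr)
library(tibble)

# Hypothetical data with one duplicated row
scores <- tibble(
  student = c("Ann", "Ben", "Ben", "Cal"),
  mark    = c(80, 72, 72, 91)
)

glimpse(scores)   # column names, types and the first few values
head(scores, 2)   # first two rows
str(scores)       # detailed structure of the object
nrow(scores)      # 4

# View(scores) would open the spreadsheet view; it only makes sense interactively

unique_scores <- distinct(scores)  # drops the repeated "Ben" row
nrow(unique_scores)                # 3
```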

We can use janitor to clean our data: janitor is a great package with lots of convenience functions for cleaning up data for use in analysis. One of them is clean_names(), which makes all the column names consistent: unique, lowercase, with spaces replaced by underscores and special characters simplified or removed.
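
Here is a minimal example of clean_names(), assuming a made-up data frame with messy column names.

```r
library(janitor)

# Hypothetical data frame; check.names = FALSE keeps the messy names as typed
messy <- data.frame(
  `First Name`  = c("Ann", "Ben"),
  `TOTAL SCORE` = c(80, 72),
  check.names = FALSE
)

clean <- clean_names(messy)
names(clean)  # "first_name" "total_score"
```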

Replace NAs with 0s:

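mutate: mutate() lets us create a new variable or change an existing one.
mutate_if: it applies our instructions to every column that meets our criteria (for example, every numeric column). In newer versions of dplyr, mutate() with across() does the same job.
filter(is.na()): filter(is.na(x)) keeps only the rows where x is NA, so we can check which part is missing; filter(!is.na(x)) takes out the rows where x is NA.
replace_na: replace_na() (from tidyr) replaces NA with a value we choose, such as 0.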
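
Putting these together, a short sketch with hypothetical clinic data (visits, week1 and week2 are invented names) could look like this.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data with some missing counts
visits <- tibble(
  clinic = c("north", "south", "east"),
  week1  = c(12, NA, 7),
  week2  = c(NA, 5, 9)
)

# Check which rows have a missing week1
missing_week1 <- filter(visits, is.na(week1))

# Replace NA with 0 in a single column
one_col <- mutate(visits, week1 = replace_na(week1, 0))

# Replace NA with 0 in every numeric column
all_cols <- mutate_if(visits, is.numeric, ~ replace_na(.x, 0))

# Equivalent in newer dplyr:
# mutate(visits, across(where(is.numeric), ~ replace_na(.x, 0)))
```

Whether 0 is a sensible replacement depends on the data; it only makes sense when a missing value really means "none".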

Join two datasets together (x and y):

full_join: it returns all rows and columns from both datasets, joined by a common variable. Where there are no matching values, it returns NA for the missing side.
left_join: it returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join: it returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
inner_join: it returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
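
The four joins are easiest to compare side by side. In this sketch x and y are two tiny made-up tibbles that share an id column.

```r
library(dplyr)
library(tibble)

# Hypothetical datasets sharing the key column "id"
x <- tibble(id = c(1, 2, 3), name = c("Ann", "Ben", "Cal"))
y <- tibble(id = c(2, 3, 4), mark = c(72, 91, 65))

full  <- full_join(x, y, by = "id")   # ids 1, 2, 3, 4
left  <- left_join(x, y, by = "id")   # ids 1, 2, 3 (mark is NA for id 1)
right <- right_join(x, y, by = "id")  # ids 2, 3, 4 (name is NA for id 4)
inner <- inner_join(x, y, by = "id")  # ids 2, 3 only
```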

The functions which are used to fix typos: The stringr package is a part of the tidyverse and provides lots of options for working with character strings.

str_replace: it allows us to replace the first part of each string that matches a certain pattern with a replacement.
str_detect: it returns TRUE or FALSE depending on whether the specified pattern is detected.
str_to_sentence: it changes a string to ‘sentence case’, which starts with a capital letter and has the rest of the letters in lowercase.
str_replace_all: it replaces all instances of the specified pattern, while str_replace() only replaces the first instance.
str_remove: it is a shortcut for str_replace() where the replacement is ‘nothing’ (an empty string).
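
As a small illustration, here is how these functions behave on a few made-up strings (the city names and the typo "Tornto" are invented for this example).

```r
library(stringr)

# Hypothetical entries with inconsistent case and one typo
cities <- c("toronto", "TORONTO", "Tornto")

has_typo <- str_detect(cities, "Tornto")           # FALSE FALSE TRUE
fixed    <- str_replace(cities, "Tornto", "Toronto")
cleaned  <- str_to_sentence(fixed)                 # "Toronto" "Toronto" "Toronto"

first_only <- str_replace("a-b-c", "-", "_")       # "a_b-c"
every_one  <- str_replace_all("a-b-c", "-", "_")   # "a_b_c"
removed    <- str_remove("a-b-c", "-")             # "ab-c"
```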


Written by Helen Li