Stats blog 2 (Data wrangling)

Helen Li
Mar 19, 2021

Today, I would like to share what I learned about tidy data and data wrangling in my previous statistics classes and in STA303.

Tidy data:

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

If a dataset is not tidy, we can often use the function pivot_longer() to reshape it so that it follows the three rules. This helps most when some of the column names are really values of a variable.
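To make this concrete, here is a minimal sketch of pivot_longer() on a small made-up dataset. The data frame grades_wide and its column names are my own illustration, not from the course material.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: the quiz number is hidden in the column names,
# so each row holds two observations instead of one.
grades_wide <- tibble(
  student = c("A", "B"),
  quiz1 = c(8, 6),
  quiz2 = c(9, 7)
)

# Move the quiz columns into two new columns:
# "quiz" holds the old column names, "score" holds the values.
grades_long <- grades_wide %>%
  pivot_longer(
    cols = c(quiz1, quiz2),
    names_to = "quiz",
    values_to = "score"
  )

print(grades_long)
```

After reshaping, each row is one student-quiz observation, so the data follows the three rules.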

Data wrangling: (Some useful R functions)

  1. glimpse(): it gives a quick look at the data and its variables, including the type of each variable.
  2. head(): it shows the top few rows (six by default).
  3. str(): it shows the structure of an object in more detail than glimpse(), which is helpful when the data has complicated structures such as nested lists.
  4. View(): it opens the data as a spreadsheet in a new tab (in RStudio).
  5. nrow(): it returns the number of rows.
  6. distinct(): it returns a tibble with only the distinct rows from the original data.
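A short sketch of these functions on a made-up tibble (the pets data is only an assumption for illustration). I left out View() because it needs an interactive session.

```r
library(dplyr)

# Hypothetical data with one duplicated row
pets <- tibble(
  name = c("Milo", "Luna", "Milo"),
  species = c("cat", "dog", "cat"),
  age = c(3, 5, 3)
)

glimpse(pets)   # one line per variable, with its type
head(pets, 2)   # first two rows
str(pets)       # base R view of the object's structure

n_rows <- nrow(pets)           # number of rows, duplicates included
unique_pets <- distinct(pets)  # duplicated row is dropped
```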

We can use janitor to clean our data. janitor is a great package with lots of convenience functions for cleaning up data for use in analysis. One of them is clean_names(), which makes all the column names consistent: unique, lowercase, with spaces replaced by underscores and special characters simplified or removed.
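As a rough sketch, here is clean_names() applied to a small data frame with messy column names. The survey data and its column names are made up for this example.

```r
library(janitor)

# Hypothetical data with inconsistent column names.
# check.names = FALSE stops data.frame() from altering the names itself.
survey <- data.frame(
  "First Name" = c("Ann", "Bob"),
  "AGE (years)" = c(21, 34),
  check.names = FALSE
)

names(survey)  # original, messy names

# clean_names() rewrites the column names in snake_case
survey_clean <- clean_names(survey)
names(survey_clean)
```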

Replace NAs with 0s:

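  1. mutate: mutate lets us create a new variable or change an existing one.
  2. mutate_if: it applies our instructions to every column that meets our criteria, for example every numeric column. (In newer versions of dplyr, mutate() with across() does the same job.)
  3. filter: combined with is.na(), it can find the rows where a value is NA, and with !is.na() it can take out those rows.
  4. replace_na: it can replace NA with a value that we specify, such as 0.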
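The functions above can be combined as in the sketch below. The scores tibble is an assumption I made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values in two numeric columns
scores <- tibble(
  student = c("A", "B", "C"),
  quiz = c(8, NA, 6),
  test = c(NA, 75, 90)
)

# mutate() + replace_na(): fix one column at a time
quiz_fixed <- scores %>%
  mutate(quiz = replace_na(quiz, 0))

# mutate_if(): apply the same replacement to every numeric column
all_fixed <- scores %>%
  mutate_if(is.numeric, ~ replace_na(.x, 0))

# filter() + is.na(): look at the rows with a missing quiz score
missing_quiz <- scores %>% filter(is.na(quiz))

# filter() + !is.na(): drop those rows instead of replacing the NA
complete_quiz <- scores %>% filter(!is.na(quiz))
```

Replacing NA with 0 changes things like the mean, so it is worth checking first that 0 is a sensible value for the missing entries.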

Join two datasets together (x and y):

  1. full_join: it returns all rows and all columns from both datasets, matched on a common variable. Where there are no matching values, it returns NA for the missing ones.
  2. left_join: it returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
  3. right_join: it returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
  4. inner_join: it returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
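A small sketch of the four joins, using two made-up tibbles x and y that share an id column (the data is only for illustration):

```r
library(dplyr)

# Hypothetical datasets: ids 2 and 3 appear in both
x <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
y <- tibble(id = c(2, 3, 4), grade = c("A", "B", "C"))

full  <- full_join(x, y, by = "id")   # ids 1, 2, 3, 4
left  <- left_join(x, y, by = "id")   # ids 1, 2, 3 (grade is NA for id 1)
right <- right_join(x, y, by = "id")  # ids 2, 3, 4 (name is NA for id 4)
inner <- inner_join(x, y, by = "id")  # ids 2, 3 only

print(full)
```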

Functions used to fix typos: the stringr package is part of the tidyverse and provides lots of options for working with character strings.

  1. str_replace: it replaces the first part of each string that matches a certain pattern with a replacement string.
  2. str_detect: it returns TRUE or FALSE for each string, depending on whether the specified pattern is detected.
  3. str_to_sentence: it changes a string to ‘sentence case’, which starts with a capital letter while the rest of the letters are lowercase.
  4. str_replace_all: it replaces all instances of the specified pattern, while str_replace() only replaces the first instance.
  5. str_remove: it is a shortcut for str_replace() where the replacement is ‘nothing’ (an empty string).
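Here is a minimal sketch of these stringr functions. The city names and the code string are made-up examples.

```r
library(stringr)

# Hypothetical column with inconsistent case and one typo
cities <- c("toronto", "TORONTO", "Tornto")

has_typo <- str_detect(cities, "Tornto")            # TRUE only for the typo
fixed <- str_replace(cities, "Tornto", "Toronto")   # correct the typo
tidy_cities <- str_to_sentence(fixed)               # all become "Toronto"

# First match only versus every match
code <- "a-b-c"
first_only <- str_replace(code, "-", "_")      # "a_b-c"
every_one  <- str_replace_all(code, "-", "_")  # "a_b_c"
removed    <- str_remove(code, "-")            # "ab-c"
```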

