Stats blog 2 (Data wrangling)

Helen Li
Mar 19, 2021

Today, I would like to share what I have learned about tidy data and data wrangling from my previous statistics classes and STA303.

Tidy data:

There are three interrelated rules that make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

If a dataset is not tidy, we can often reshape it with the function pivot_longer() from the tidyr package, for example when several column names are really values of one variable, so that it follows the three interrelated rules.
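As a small illustration of the reshaping described above, here is a minimal sketch. The scores_wide table and its column names are made up for this example; they do not come from the course material.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: the year columns are really values of a "year" variable
scores_wide <- tibble(
  student = c("A", "B"),
  `2019`  = c(80, 75),
  `2020`  = c(85, 90)
)

# Gather the year columns into one "year" column and one "score" column
scores_tidy <- pivot_longer(
  scores_wide,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "score"
)

scores_tidy
```

After reshaping, each row is one student in one year, so every variable has its own column and every observation has its own row.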

Data wrangling: (Some R functions)

  1. glimpse(): lets us take a quick look at the data and the variables in it, and shows the type of each variable.
  2. head(): lets us look at the top few rows (six by default).
  3. str(): a base R function that shows the structure of an object in more detail than glimpse(); it is quite good if we know there are some complicated structures in our data.
  4. View(): opens a new tab in RStudio showing the data as a spreadsheet (note the capital V).
  5. nrow(): returns the number of rows.
  6. distinct(): returns a tibble with only the distinct rows from the original data.
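Here is a short sketch of these functions on a toy dataset. The pets tibble is invented for illustration, and View() is commented out because it only makes sense in an interactive session.

```r
library(dplyr)

# Hypothetical data with one duplicated row
pets <- tibble(
  name = c("Milo", "Luna", "Milo"),
  kind = c("cat", "dog", "cat")
)

glimpse(pets)    # columns, their types, and the first few values
head(pets, 2)    # the first two rows
str(pets)        # structure of the object
n_rows <- nrow(pets)

# Keep only the distinct rows (the repeated "Milo" row is dropped)
pets_unique <- distinct(pets)

# View(pets)     # opens a spreadsheet-style tab in RStudio
```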

We can use janitor to clean our data: janitor is a great package with lots of convenience functions for cleaning up data for use in analysis. One of them is clean_names(), which makes all the column names consistent: unique, lowercase, with spaces replaced by underscores and special characters simplified or removed.
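A minimal sketch of clean_names(), assuming the janitor package is installed. The messy data frame and its column names are hypothetical.

```r
library(janitor)

# Hypothetical data frame with inconsistent column names
messy <- data.frame(
  `First Name`     = c("Ann", "Bo"),
  `Student ID`     = c(101, 102),
  `Test Score (%)` = c(90, 85),
  check.names = FALSE
)

clean <- clean_names(messy)

# Compare the names before and after cleaning
names(messy)
names(clean)
```

The cleaned names are all lowercase and contain no spaces, which makes the columns much easier to type in later code.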

Replace NAs with 0s:

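  1. mutate: lets us create a new variable or overwrite an existing one.
  2. mutate_if: applies our instructions to every single column that meets our criteria (for example, every numeric column). Newer versions of dplyr recommend mutate() with across() instead, but mutate_if still works.
  3. filter(is.na()): filter(is.na(x)) keeps the rows where x is NA, so we can check which part is missing; filter(!is.na(x)) takes out the rows where x is NA.
  4. replace_na: replaces NA with a value we choose, such as 0.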
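The functions above can be combined as in the following sketch. The visits table is made up for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical monthly counts with some missing values
visits <- tibble(
  clinic = c("North", "South", "East"),
  jan    = c(12, NA, 7),
  feb    = c(NA, 4, 9)
)

# Check which rows have a missing January count
missing_jan <- filter(visits, is.na(jan))

# Replace NA with 0 in a single column
visits_jan <- mutate(visits, jan = replace_na(jan, 0))

# Replace NA with 0 in every numeric column
visits_filled <- mutate_if(visits, is.numeric, ~ replace_na(.x, 0))

visits_filled
```

Whether 0 is a sensible replacement depends on the data: it is reasonable when a missing count really means "none", but not when the value is simply unknown.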

Join two datasets together (x and y):

  1. full_join: returns all rows and all columns from both datasets, matched on a common variable. Where there are no matching values, it returns NA for the missing side.
  2. left_join: returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
  3. right_join: returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
  4. inner_join: returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
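To see how the four joins differ, here is a small sketch with two toy tables. The contents of x and y are invented; only id 2 and id 3 appear in both.

```r
library(dplyr)

# Hypothetical tables sharing the key column "id"
x <- tibble(id = c(1, 2, 3), name  = c("Ann", "Bo", "Cy"))
y <- tibble(id = c(2, 3, 4), score = c(88, 92, 75))

full  <- full_join(x, y, by = "id")   # ids 1, 2, 3 and 4
left  <- left_join(x, y, by = "id")   # ids 1, 2 and 3 (every row of x)
right <- right_join(x, y, by = "id")  # ids 2, 3 and 4 (every row of y)
inner <- inner_join(x, y, by = "id")  # ids 2 and 3 (matches only)

left
```

In left, the row for id 1 has NA in score because y has no row with that id.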

Functions used to fix typos: the stringr package is part of the tidyverse and provides lots of options for working with character strings.

  1. str_replace: replaces the first part of a string that matches a certain pattern with a replacement string.
  2. str_detect: returns TRUE or FALSE depending on whether the specified pattern is detected.
  3. str_to_sentence: changes a string to ‘sentence case’, which starts with a capitalized letter while the rest of the letters are lowercase.
  4. str_replace_all: replaces all instances of the specified pattern, while str_replace() only replaces the first instance.
  5. str_remove: a shortcut for str_replace() where the replacement is ‘nothing’ (an empty string).
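The following sketch tries each of these functions on a few made-up strings. The city names and the typo "Tornto" are only examples.

```r
library(stringr)

# Hypothetical values with inconsistent case and one typo
cities <- c("toronto", "TORONTO", "Tornto")

# Which values contain the typo?
has_typo <- str_detect(cities, "Tornto")

# Fix the typo, then put everything in sentence case
fixed       <- str_replace(cities, "Tornto", "Toronto")
fixed_cases <- str_to_sentence(fixed)

# First match only versus every match
first_only <- str_replace("a-b-c", "-", "_")
every_one  <- str_replace_all("a-b-c", "-", "_")

# Remove the first match
removed <- str_remove("a-b-c", "-")
```

Here has_typo is FALSE, FALSE, TRUE, and fixed_cases contains "Toronto" three times. The last three results are "a_b-c", "a_b_c" and "ab-c".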
