Stats blog 2 (Data wrangling)

Today, I would like to share what I have learned about tidy data and data wrangling from my previous statistics classes and STA303.

Tidy data:

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

~ If a dataset is not tidy, for example because the values of one variable are spread across several column names, we can use pivot_longer() from the tidyr package to reshape it so that it follows the three interrelated rules.
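
Here is a small sketch of pivot_longer() in action. The dataset scores_wide and its column names are made up for illustration; the point is only that the years sit in the column names when they should be values of a single variable.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: the variable "year" is spread across column names
scores_wide <- tibble(
  student = c("A", "B"),
  `2019` = c(70, 85),
  `2020` = c(75, 90)
)

# Gather the year columns into two new columns: year and score
scores_long <- pivot_longer(
  scores_wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "score"
)

print(scores_long)
```

After reshaping, each row is one student in one year, so every variable has its own column.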

Data wrangling: (Some useful R functions)

  1. glimpse: glimpse() lets us take a quick look at the data and the variables in it, and see what type each variable is.
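
As a quick illustration, here is glimpse() on a tiny made-up dataset (the students tibble is hypothetical). It prints one line per variable with its type and the first few values.

```r
library(dplyr)
library(tibble)

# Hypothetical example data with three different variable types
students <- tibble(
  name = c("Ana", "Ben"),
  age = c(20L, 22L),
  enrolled = c(TRUE, FALSE)
)

# Prints the number of rows and columns, then each variable with its type
glimpse(students)
```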

We can use janitor to clean our data: janitor is a great package with lots of convenience functions for cleaning up data for use in analysis. Its clean_names() function makes all the column names consistent: unique, lowercase, with spaces replaced by underscores and special characters simplified or removed.
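
A minimal sketch of clean_names(), assuming a made-up survey data frame whose column names contain spaces and capital letters:

```r
library(janitor)

# Hypothetical data with messy column names (check.names = FALSE keeps them as typed)
survey <- data.frame(
  `Student ID` = 1:2,
  `Final Grade` = c(81, 92),
  check.names = FALSE
)

# clean_names() returns the same data with tidied-up column names
survey_clean <- clean_names(survey)

names(survey_clean)
```

With the default settings, the names come back in lowercase with underscores instead of spaces.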

Replace NAs with 0s:

  1. mutate: mutate() lets us create a new variable or overwrite an existing one.
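
One way to replace NAs with 0s is to combine mutate() with replace_na() from tidyr. This is only a sketch: the visits data is made up, and pairing mutate() with replace_na() is my assumption about a reasonable approach, not the only option.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data with one missing count
visits <- tibble(
  clinic = c("North", "South", "East"),
  n_visits = c(12, NA, 7)
)

# Overwrite n_visits, swapping each NA for 0
visits_filled <- visits %>%
  mutate(n_visits = replace_na(n_visits, 0))

print(visits_filled)
```

This only makes sense when a missing value really does mean zero; otherwise it is safer to leave the NA in place.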

Join two datasets together (x and y):

  1. full_join: full_join() joins two datasets by a common variable and returns all rows and columns from both. Where there are no matching values, it returns NA for the missing ones.
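
Here is a small sketch of full_join() with two made-up datasets x and y that share an id column. The values are invented just to show where the NAs appear.

```r
library(dplyr)
library(tibble)

# Hypothetical datasets: id 1 appears only in x, id 4 only in y
x <- tibble(id = c(1, 2, 3), height = c(160, 172, 181))
y <- tibble(id = c(2, 3, 4), weight = c(65, 80, 74))

# Keep every row from both datasets, matching on id
joined <- full_join(x, y, by = "id")

print(joined)
```

The result has four rows: id 1 has NA for weight and id 4 has NA for height, because neither has a match in the other dataset.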

Functions used to fix typos: the stringr package is part of the tidyverse and provides lots of options for working with character strings.

  1. str_replace: str_replace() allows us to replace the first part of each string that matches a certain pattern with a replacement string.
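
A quick sketch of fixing a typo with str_replace(), using a made-up vector of city names:

```r
library(stringr)

# Hypothetical data with a misspelled city name
cities <- c("Tornto", "Toronto", "Tornto East")

# Replace the misspelling wherever it first appears in each string
fixed <- str_replace(cities, "Tornto", "Toronto")

print(fixed)
```

Note that str_replace() only changes the first match in each string. If the same typo can appear more than once in one string, str_replace_all() replaces every match.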