Today, I would like to share some notes on tidy data and data wrangling that I learned in previous statistics classes and in STA303.
Tidy data:
Three interrelated rules make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
~ If a dataset is not tidy, we can reshape it so that it follows the three rules. A common case is when some column names are really values of a variable; the function pivot_longer() from the tidyr package gathers those columns into rows.
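As a minimal sketch of that reshaping, assume a small made-up table in which the columns 2019 and 2020 hold values of a "year" variable. The data and the names scores_wide, year and score are hypothetical, chosen only for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: the year columns are values, not variables
scores_wide <- tibble(
  student = c("A", "B"),
  `2019` = c(71, 85),
  `2020` = c(78, 90)
)

# Gather the year columns into two new columns: year and score
scores_tidy <- pivot_longer(
  scores_wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "score"
)

scores_tidy
```

After the reshape, each row is one student in one year, so every variable has its own column and every observation has its own row.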
Data wrangling (some useful R functions):
- glimpse(): gives a quick look at the data, listing each variable along with its type and its first few values.
- head(): shows the first few rows (six by default).
- str(): a base R function that displays the structure of an object; it is quite good when we know the data contain some complicated structures.
- View(): opens the data in a spreadsheet-style viewer (a new tab in RStudio).
- nrow(): returns the number of rows.
- distinct(): returns a tibble with only the distinct rows from the original data.
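The inspection functions above could be tried on a small example like the one below. The pets table is made up for illustration, and View() is commented out because it only makes sense in an interactive session.

```r
library(dplyr)
library(tibble)

# Hypothetical data with one duplicated row
pets <- tibble(
  name = c("Rex", "Tom", "Rex"),
  species = c("dog", "cat", "dog"),
  age = c(3, 5, 3)
)

glimpse(pets)        # variables, their types, and first values
head(pets, 2)        # first two rows
str(pets)            # base R view of the structure
n_rows <- nrow(pets) # number of rows, including the duplicate
unique_pets <- distinct(pets)  # duplicated row removed
# View(pets)         # interactive only: opens a spreadsheet-style tab
```

Comparing nrow(pets) with nrow(unique_pets) is a quick way to check whether a dataset contains duplicated rows.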
We can use janitor to clean our data. janitor is a package with many convenience functions for cleaning up data before analysis. One of them is clean_names(), which makes all the column names consistent: unique, lowercase, with spaces replaced by underscores and special characters simplified or removed.
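A short sketch of clean_names() on a made-up data frame with messy column names (the survey data here is hypothetical):

```r
library(janitor)

# Hypothetical data with spaces and special characters in the names
survey <- data.frame(
  `First Name` = c("Ana", "Ben"),
  `Age (years)` = c(21, 34),
  check.names = FALSE  # keep the messy names as typed
)

survey_clean <- clean_names(survey)
names(survey_clean)
```

Running clean_names() right after reading in a file means the rest of the analysis can refer to columns without backticks or quoting.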
Replace NAs with 0s:
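- mutate(): creates a new variable or overwrites an existing one.
- mutate_if(): applies our instructions to every column that meets a condition, for example every numeric column.
- filter(is.na()): keeps only the rows where the given variable is NA, so we can see which values are missing; filter(!is.na()) does the opposite and drops those rows.
- replace_na(): replaces NA with a value we specify, such as 0.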
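Putting these together, one way to replace NAs with 0s might look like the sketch below. The visits table and its column names are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data with missing counts
visits <- tibble(
  clinic = c("north", "south", "east"),
  jan = c(12, NA, 7),
  feb = c(NA, 4, 9)
)

# Look at the rows where jan is missing
missing_jan <- filter(visits, is.na(jan))

# Replace NA with 0 in a single column
visits_jan <- mutate(visits, jan = replace_na(jan, 0))

# Replace NA with 0 in every numeric column
visits_filled <- mutate_if(visits, is.numeric, ~ replace_na(.x, 0))
```

Note that replacing NA with 0 is only appropriate when a missing value really means "none"; otherwise it can bias summaries such as the mean.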
Join two datasets together (x and y):
- full_join: it can return all rows and all columns from both datasets, matched on a common variable. Where a row has no match in the other dataset, the missing columns are filled with NA.
- left_join: it can return all rows from x and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
- right_join: it can return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
- inner_join: it can return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
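The difference between the four joins is easiest to see on two tiny tables that only partly overlap. The tables x and y below are made up, with id as the common variable.

```r
library(dplyr)
library(tibble)

# Hypothetical tables: ids 2 and 3 appear in both
x <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
y <- tibble(id = c(2, 3, 4), score = c(80, 95, 70))

full  <- full_join(x, y, by = "id")   # ids 1, 2, 3, 4
left  <- left_join(x, y, by = "id")   # ids 1, 2, 3 (score is NA for id 1)
right <- right_join(x, y, by = "id")  # ids 2, 3, 4 (name is NA for id 4)
inner <- inner_join(x, y, by = "id")  # ids 2, 3 only
```

Checking nrow() before and after a join is a simple way to confirm that it kept or dropped the rows we expected.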
Functions used to fix typos:
The stringr package is part of the tidyverse and provides many functions for working with character strings.
- str_replace: it replaces the first part of each string that matches a pattern with a replacement string.
- str_detect: it returns TRUE or FALSE depending on whether the specified pattern is found in each string.
- str_to_sentence: it changes a string to ‘sentence case’, which starts with a capitalized letter and has the rest of the letters in lowercase.
- str_replace_all: it replaces all instances of the specified pattern, while str_replace() only replaces the first instance.
- str_remove: it is a shortcut for str_replace() where the replacement is the empty string, so it deletes the first match.
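A small sketch of these stringr functions, using a made-up vector of city names with inconsistent case and one typo:

```r
library(stringr)

# Hypothetical values with inconsistent case and a typo
cities <- c("toronto", "TORONTO", "Tornto")

has_typo <- str_detect(cities, "Tornto")            # FALSE FALSE TRUE
fixed <- str_replace(cities, "Tornto", "Toronto")   # repair the typo
tidy_cities <- str_to_sentence(fixed)               # all become "Toronto"

first_only <- str_replace("a-b-c", "-", "_")        # "a_b-c"
every_one  <- str_replace_all("a-b-c", "-", "_")    # "a_b_c"
removed    <- str_remove("a-b-c", "-")              # "ab-c"
```

Using str_detect() first to see which values match a pattern is a safe habit before replacing anything.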