Setup

To find other tutorials for this class, go to the main website, https://ds112-lendway.netlify.app/.

Welcome to your second tutorial for this class, COMP/STAT 112: Introduction to Data Science! It will be similar to the first, although in this one I opted for introducing material right in the tutorial rather than creating separate videos and slide decks. There are still demo videos and files embedded in this document.

As most of our files do, we start this one with three R code chunks: 1. options, 2. libraries and settings, 3. data.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)     # for data cleaning and plotting
library(gardenR)       # for Lisa's garden data
library(lubridate)     # for date manipulation
library(palmerpenguins)# for Palmer penguin data
# Lisa's garden data
data("garden_harvest")

# Seeds/plants (and other garden supply) costs
data("garden_spending")

# Planting dates and locations
data("garden_planting")

# Palmer penguins
data("penguins")

# US tuition by state data
us_avg_tuition <- read_csv("https://www.dropbox.com/s/9e0paradcwvuzll/us_avg_tuition.csv?dl=1") %>% 
   mutate(across(starts_with("20"), parse_number))

Learning Goals

After this tutorial, you should be able to do the following:

  • Use pivot_longer() and pivot_wider()to change the way the data are laid out.
  • Join tables of data together using the dplyr join functions and understand the differences among the different types of joins.
  • Use various forcats functions, including ones not covered in the tutorial, to change the order or values of levels of categorical variables.
  • Use the stringr functions covered in this tutorial (plus separate()) and know where to find information about other stringr functions (HINT: the cheatsheet is a great start).

Changing the data layout with pivot functions

This part of the tutorial will introduce you to two functions: pivot_longer() and pivot_wider(). These functions are used to change the way the data are laid out. The GIF below illustrates what the functions do. I encourage you to revisit this illustration after reading through the more detailed descriptions.

Image credit: Mara Averick (tweet from 2019-10-04)

pivot_longer()

pivot_longer(): makes the dataset longer, reducing the number of columns and increasing the number of rows. Often used when column names should be values of a variable.

The data below shows average college tuition costs in the US by state. Notice that years are column names.

Now, we would like to change this so there is a variable called year that would indicate the year and the tuition values would be a sinle variable rather than spread across multiple variables - pivot_longer() to the rescue!

The generic code for pivot_longer() is shown here:

data %>% 
  pivot_longer(cols = ___________,
               names_to = "name_of_cols_variable",
               values_to = "name_of_values_variable")

Let’s try it with the tuition data:

us_avg_tuition %>% 
  pivot_longer(cols = starts_with("20"),
               names_to = "year",
               values_to = "avg_tuition")

Now there is a row for each unique state and year combination and year and avg_tuition are variables. This dataset has more rows and fewer columns than the original dataset.

Let’s go over each argument in the function in more detail.

us_avg_tuition %>% 
  pivot_longer(cols = starts_with("20"),
               names_to = "year",
               values_to = "avg_tuition")

The cols argument indicates which columns should be pivoted so that these column names become values of a new variable. You can make a list of column names or use helper functions to select columns (see the select() function on the dplyr cheatsheet for more detail or search for tidy-select in the Help tab). I used the starts_with() helper function in this example.

us_avg_tuition %>% 
  pivot_longer(cols = starts_with("20"),
               names_to = "year",
               values_to = "avg_tuition")

The names_to argument is what you want to name the new variable where the column names will be stored. This needs to be in quotes.

us_avg_tuition %>% 
  pivot_longer(cols = starts_with("20"),
               names_to = "year",
               values_to = "avg_tuition")

The values_to argument is what you want to name the new variable where the values that used to be spread across multiple columns will now be stored in one variable.

pivot_wider()

pivot_wider(): makes the dataset wider, reducing the number of rows and increasing the number of columns. Often used when observations are spread over multiple rows and the values for one variable should actually be their own variables.

Here is an example where the values of the third column (Population annual rate of increase (percent), Total fertility rate (children per women), etc.) should each be their own variable.

Now, let’s look at a similar example. I created a new dataset called penguins_fake which is a reorganization of the penguins data.

penguins_fake

In penguins_fake, the column called measurement has all the names of the measurements. We would like to return those to column names so there is once again only one row for each penguin. We will use pivot_wider() to do that! Notice there is also a column called obs that identifies each unique observation from the original data - this is very important!

The generic code for pivot_wider() is shown here:

data %>% 
  pivot_wider(id_cols = ___________,
              names_from = variable_with_names,
              values_from = variable_with_values)

Let’s do this with the penguins_fake data:

penguins_fake %>% 
  pivot_wider(id_cols = species:obs,
              names_from = measurement,
              values_from = value)

Now the four measurement variables each have their own column again. This dataset has more columns and fewer rows than the penguins_fake dataset.

Let’s go over each argument in the function in more detail.

penguins_fake %>% 
  pivot_wider(id_cols = species:obs,
              names_from = measurement,
              values_from = value)

The id_cols argument is the set of columns that uniquely identifies each observation. By default it will be all columns that are not in the names_from and values_from arguments. Like cols from the pivot_longer() function, you can make a list of column names or use helper functions to select columns (see the select() function on the dplyr cheatsheet for more detail or search for tidy-select in the Help tab).

!!CAUTION!!: It is easy to make a mistake on the id_cols argument. For example, in the code below, I forgot to include obs. The result is something weird and unexpected with only 35 rows.

penguins_fake %>% 
  pivot_wider(id_cols = species:year,
              names_from = measurement,
              values_from = value)
penguins_fake %>% 
  pivot_wider(id_cols = species:obs,
              names_from = measurement,
              values_from = value)

The names_from argument is the variable (or variables) that contain values that you want to be turned into their own columns. This is not in quotes.

penguins_fake %>% 
  pivot_wider(id_cols = species:obs,
              names_from = measurement,
              values_from = value)

The values_from argument is the variable (or variables) that should be values of the new variables.

Demo video

Now that you’ve learned the basics of pivoting, watch the video below that will walk you through some coding examples and download the R Markdown files to follow along. This is the same file you will use for the other topics.

Voicethread: pivoting demo

Resources

Your turn!

Exercise 1: pivot_wider()

Summarize the garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week. Display the results so that the vegetables are rows but the days of the week are columns.

Exercise 2: pivot_longer()

Use the billboard dataset (search for it in help or type ?billboard in the console). It has rankings of songs for each week they entered the Billboard Top 100. The weeks are column names. Use pivot_longer() to make weeks a single column and remove rows with missing values for rank (HINT: use values_drop_na argument in pivot_longer()).

Joining datasets

When analyzing data, it is common to need to combine together datasets that are related. The join verbs will give us a way to do this. For all joins we must establish a correspondance or match between each case in the left table and zero or more cases in the right table.

A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.

  • You specify which pairs to use.
  • A pair is a variable from the left table and a variable from the right table or a set of variables from the left and right table.
  • Cases must have exactly equal values in the pair for a match to be made.

When we join datasets, the general format is

left_dataset %>% 
  <JOIN>(right_dataset, 
         by=<HOW TO JOIN>)

where left_dataset and right_dataset are datasets, <JOIN> is the specific type of join, and <HOW TO JOIN> gives detailed information for how to do it.

The by argument tells it how to join the two datasets together, specifically which variables it should match. If the variables have the same names, we only need to write the name of that variable, in quotes: by = "variable_name".

If the two variables to match have different names in the two datasets, we can write by=c("name1"="name2"), where name1 is the variable in the left dataset to be matched to the name2 variable in the right dataset.

We can also match on multiple variables using by=c("name1"="name2", "name1a" = "name2a"), where the names to the left of the equals are variables from the left dataset and those on the right of the equals are from the right dataset.

If the by= is omitted from a join, then R will perform a natural join, which matches the two datasets by all variables they have in common. It is good practice to always include the by=.

Let’s discuss the different types of joins.

Mutating joins

The first class of joins are mutating joins, which add new variables (columns) to the left data table from matching observations in the right table.

The main difference in the three mutating join options in this class is how they answer the following questions:

  1. What happens when a case in the right table has no matches in the left table?
  2. What happens when a case in the left table has no matches in the right table?

Three mutating join functions:

left_join(): the output has all cases from the left, regardless if there is a match in the right, but discards any cases in the right that do not have a match in the left. (There is also a right_join() function which which does the opposite.)

Image credit: Wickham, R for Data Science

Credit: Garrick Aden-Buie – @grrrck

inner_join(): the output has only the cases from the left with a match in the right.

Image credit: Wickham, R for Data Science

Credit: Garrick Aden-Buie – @grrrck

full_join(): the output has all cases from the left and the right. This is less common than the first two join operators.

Image credit: Wickham, R for Data Science