Slides

Here are the introduction slides for this practical on data wrangling!

Overview

In this practical you’ll practice “data wrangling” with the dplyr package (part of the `tidyverse collection of packages). Data wrangling refers to modifying and summarizing data. For example, sorting, adding columns, recoding values, and calculating summary statistics.

By the end of this practical you will know how to:

Change column names, select specific columns.
Create new columns based on existing ones
Select specific rows of data based on multiple criteria.
Group data and calculate summary statistics
Combine multiple datasets through key columns

Cheatsheet

Data wrangling with dplyr Cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf.

Glossary

Here are the main verbs you will be using in dplyr:

verb	action	example
`filter()`	Select rows based on some criteria	data %>% filter(age > 40 & sex == “m”)
`arrange()`	Sort rows	data %>% arrange(date, group)
`select()`	Select columns (and ignore all others)	data %>% select(age, sex)
`rename()`	Rename columns	data %>% rename(new = old)
`mutate()`	Add new columns	data %>% mutate(height.m = height.cm / 100)
`case_when()`	Recode values of a column	data %>% mutate(sex_n = case_when(sex == 0 ~ “m”, sex == 1 ~ “f”))
`group_by(), summarise()`	Group data and then calculate summary statistics	data %>% group_by(…) %>% summarise(…)
`left_join()`	Combine multiple datasets using a key column	data %>% left_join(data2, by = “id”)

Examples

The following examples will take you through the steps of doing data wrangling with dplyr. Try to go through each line of code and see how it works!

# -----------------------------------------------
# Examples of using dplyr on the ChickWeight data
# ------------------------------------------------

library(tidyverse)         # Load tidyverse

chick <- as_tibble(ChickWeight)   # Save a copy of the ChickWeight data as chick

# Change some column names with rename()

chick <- chick %>%
  rename(arm = Diet,         # New = arm, old = Diet
         weight_g = weight)  # new = weight_g, old = weight

# Select rows with filter()

chick %>% 
  filter(weight_g > 40 & arm == 1)   # Only rows where weight_g > 40 and arm == 1

# Add columns with mutate()

chick <- chick %>%
  mutate(
    weight_g_kg = weight_g / 1000,  # Add new column of weight_g in kilograms
    time_week = Time / 7        # Add time_week as time in weeks
  )

# Sort rows with arrange()

chick <- chick %>%
  arrange(arm, weight_g)   # sort rows by arm and then weight_g

# Recode variables with case_when()

chick <- chick %>%
  mutate(
    arm_name = case_when(      # Rename arms with strings
      arm == 1 ~ "vegetables",
      arm == 2 ~ "fruit",
      arm == 3 ~ "candy",
      arm == 4 ~ "meat"
      
    )
  )

# Grouped statistics with group_by() and summarise()

chick %>%                             # Start with chick data...
  group_by(arm) %>%                   # Group by arm...
  summarise(                          # Summarise the following
    weight_g_mean = mean(weight_g),   # Mean weight
    weight_g_max = max(weight_g),     # Max weight
    time_mean = mean(Time),           # Mean time
    N = n()) %>%                      # Number of cases...
  arrange(weight_g_mean)              # Sort by mean weight


# Many sequential functions

chick %>%
  rename(weeks = Time) %>%            # Change column name
  filter(Chick > 10) %>%              # Only chicks with values larger than 10
  group_by(week, arm) %>%             # Group by Time and arm
  summarise(                          # Calculate summary statistics
    weight_g_median = median(weight_g),
    weight_g_sd = sd(weight_g),
    N = n()                           # Counts of cases
  )

Tasks

Datasets

You’ll use one dataset in this practical: ACTG175.csv. It is available in the data_BaselRBootcamp_Day2.zip file available through the main course page. If you haven’t already, download the data_BaselRBootcamp_Day2.zip folder and unzip it to get the ACTG175.csv file.

Getting setup

A. In RStudio, Create a new project called BaselRBootcamp. In the project root directory, create two folders data and R.

# Done!

B. Outside of RStudio (on your computer), move the ACTG175.csv file you just downloaded in to the data folder you just created.

# Done!

C. Open a new R script and save it as a new file called wrangling_practical.R in the R folder in your working directory. At the top of the script, using comments, write your name and the date. The, load the set of tidyverse and speff2trial packages with library(). Here’s how the top of your script should look:

## NAME
## DATE
## Wrangling Practical

library(tidyverse)     # For tidyverse
library(speff2trial)   # For the ACTG175 data documentation

B. For this practical, we’ll use the ACTG175 data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act.

# Load ACTG175.csv from the data folder 
# The ACTG175.csv file MUST be in a folder called data in your working directory!!

trial_act <- read_csv(file = "data/ACTG175.csv")

C. The trial_act data is actually a copy of a dataset from the speff2trial package called ACTG175. We can make Look at the help menu for the ACTG175 data by running ?ACTG175 (If you become really interested in the data, you can also read an article discussing the trial here: http://www.nejm.org/doi/full/10.1056/nejm199610103351501#t=article)

# Look at documentation for ACTG175 data (contained in the speff2trial package)
?ACTG175

D. First thing’s first, take a look at the first few rows of the data by printing the trial_act object. It should look like this:

# Print trial_act object
trial_act

# A tibble: 2,139 x 27
   pidnum   age  wtkg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <int> <int> <dbl> <int> <int> <int>  <int>  <int> <int>  <int>   <int>
 1  10056    48  89.8     0     0     0    100      0     0      1       0
 2  10059    61  49.4     0     0     0     90      0     1      1     895
 3  10089    45  88.5     0     1     1     90      0     1      1     707
 4  10093    47  85.3     0     1     0    100      0     1      1    1399
 5  10124    43  66.7     0     1     0    100      0     1      1    1352
 6  10140    46  88.9     0     1     1    100      0     1      1    1181
 7  10165    31  73.0     0     1     0    100      0     1      1     930
 8  10190    41  66.2     0     1     1    100      0     1      1    1329
 9  10198    40  82.6     0     1     0     90      0     1      1    1074
10  10229    35  78.0     0     1     0    100      0     1      1     964
# ... with 2,129 more rows, and 16 more variables: race <int>,
#   gender <int>, str2 <int>, strat <int>, symptom <int>, treat <int>,
#   offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>, r <int>,
#   cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>

`rename()`

Lets change some of the column names in trial_act. Using rename(), change wtkg to weightkg and age to age_years (to specify that age is in years)

trial_act <- trial_act %>%
  rename(weightkg = wtkg,
         age_years = age)

Oops, someone in your company just told you that the name age_years conflicts with another table in their database. To fix this, change age_years to agey.

trial_act <- trial_act %>%
  rename(agey = age_years)

`mutate()`

Add a new column to trial_act called agem that shows each patient’s age in months instead of years (Hint: Just multiply agey by 12!).

trial_act <- trial_act %>% 
  mutate(
  agem = agey * 12
)

The column karnof shows each patient’s Karnofsky score on a scale from 0 to 100, add a new column to trial_act called karnof_b that shows the score on a scale from 0 to 1 (hint: just divide the original column by 100)

trial_act <- trial_act %>% 
  mutate(
         karnof_b = karnof / 100
        )

Now, do the previous two questions in one chunk of code. That is, using one call to mutate() add both agem and karnof_b columns to trial_act

trial_act <- trial_act %>% 
 mutate(
    agem = agey * 12,
    karnof_b = karnof / 100
    )

A physician wants to see a new score called the sparrow which is equal to the Karnofsky score divided by a person’s age plus the person’s weight in kg. Add each participant’s sparrow score as a new column to trial_act called sparrow (Hint: Just take `karnof / agey + weightkg)

trial_act <- trial_act %>% 
 mutate(
    sparrow = karnof / agey + weightkg
    )

`arrange()`

Arrange the trial_act data in ascending order of agey (from lowest to highest). After, look the data to make sure it worked!

trial_act <- trial_act %>% 
 arrange(agey)

trial_act

# A tibble: 2,139 x 30
   pidnum  agey weightkg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>    <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 940533    12     41.4     1     0     0    100      0     0      1
 2 950037    12     53.1     1     0     0    100      0     1      1
 3 950056    12     31.0     1     0     0    100      0     1      1
 4 910034    13     32.7     1     0     0    100      0     1      1
 5 940534    13     62.9     1     0     0    100      0     0      1
 6 960014    13     48.5     1     0     0    100      0     0      1
 7 310767    14     65.0     1     0     0    100      0     1      1
 8 920050    14     54.2     1     0     0    100      0     1      1
 9 940544    14     41.1     1     0     0    100      0     0      1
10 950061    14     64.3     1     0     0    100      0     1      1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   agem <dbl>, karnof_b <dbl>, sparrow <dbl>

Now arrange the data in descending order of agey (from highest to lowest). After, look the data to make sure it worked.

trial_act <- trial_act %>% 
 arrange(desc(agey))

trial_act

# A tibble: 2,139 x 30
   pidnum  agey weightkg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>    <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1  11438    70     73.9     0     1     0    100      0     1      1
 2 211360    70     63.1     0     1     0     90      0     0      1
 3 211284    69     81.6     0     1     0    100      0     0      1
 4  50580    68     90.5     0     1     1    100      0     1      1
 5  81127    68     70.8     0     1     0     90      0     1      1
 6  10924    67     71.0     0     1     0    100      0     0      1
 7 241150    67     82.1     0     1     0     90      0     0      1
 8  50662    66     84.4     0     1     0    100      0     0      1
 9  11987    65     77.2     0     1     0     90      0     0      1
10 140797    65     60.5     0     1     0     90      0     0      1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   agem <dbl>, karnof_b <dbl>, sparrow <dbl>

To arrange data in descending order, just include desc() around the variable. E.g.; data %>% arrrange(desc(height))

You can sort the rows of dataframes with multiple columns by including many arguments to arrange(). Now sort the data by karnof and then agey (agey), and then arms (arms)

trial_act <- trial_act %>% 
 arrange(karnof, agey, arms)

trial_act

# A tibble: 2,139 x 30
   pidnum  agey weightkg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>    <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 610015    25     58.7     0     1     0     70      0     0      1
 2 990018    27     80.3     1     0     0     70      0     1      1
 3 150147    32     90.2     0     1     0     70      0     1      1
 4 230109    34     64.4     0     1     0     70      0     1      1
 5 640022    36     83.9     0     1     1     70      0     0      1
 6 280278    40     79.4     0     0     1     70      0     1      1
 7 220418    43     64.4     0     1     0     70      0     1      1
 8  50632    45     72.1     0     1     1     70      0     0      1
 9  11013    54     71.2     0     1     1     70      0     1      1
10 230456    23     54.0     0     1     0     80      0     1      1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   agem <dbl>, karnof_b <dbl>, sparrow <dbl>

`case_when()`

Create a new column gender_char that shows gender as a string.

Look at the help file with ?ACTG175 to see how gender is coded.

trial_act <- trial_act %>%
  mutate(
  gender_char = case_when(
    gender == 0 ~ "female",
    gender == 1 ~ "male"
  )
  )

Create a new column over50 that is 1 when patients are older than 50, and 0 when they are younger than or equal to 50

trial_act <- trial_act %>%
  mutate(
  over50 = case_when(
    agey > 50 ~ 1,
    agey <= 50 ~ 0
  )
  )

Now, repeat the previous two questions, but do them both in the same call to mutate(). That is, in one block of code, create gender_char and over50

trial_act <- trial_act %>%
  mutate(
    gender_char = case_when(
    gender == 0 ~ "female",
    gender == 1 ~ "male"),
  over50 = case_when(
    agey > 50 ~ 1,
    agey <= 50 ~ 0)
  )

`filter()`

Create a new tibble called trial_act_B that only contains data from males (gender == 1)

trial_act_B <- trial_act %>%
  filter(gender == 1)

Create a new tibble called trial_act_C that only contains data from people who have not used intravenous drugs (drugs == 0) and are over 40 years old and are female

trial_act_C <- trial_act %>%
  filter(drugs == 0 & agey > 40 & gender == 0)

`left_join()`

You can use left_join() to combine multiple dataframes. At the following link https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/patient_demo.csv I have some additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Here’s how the first few rows of the data look:

# A tibble: 2,139 x 3
   pidnum exercise education
    <int>    <int> <chr>    
 1 250091        1 HS       
 2  70444        3 BA       
 3 220425        4 HS       
 4  50300        3 PHD      
 5 630013        3 <HS      
 6  60972        1 MS       
 7  81105        3 BA       
 8 211333        1 HS       
 9 170648        0 HS       
10 220488        1 HS       
# ... with 2,129 more rows

Using left_join() to add the demographic data to the trial_act data. To do this, first load the demographic data into R as a new dataframe called trial_act_demo. Then, use the following template to combine the data (Hint: Replace XXX with the correct values.)

# Get trial_act_demo data

trial_act_demo <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/patient_demo.csv")

Parsed with column specification:
cols(
  pidnum = col_integer(),
  exercise = col_integer(),
  education = col_character()
)

# Combine trial_act and trial_act_demo data
trial_act <- trial_act %>%
  left_join(trial_act_demo)        # The new dataframe to combine with trial_act

Joining, by = "pidnum"

Using your new trial_act tibble, which should contain the exercise and education data, calculate the mean number of days of exercise that patients reported.

mean(trial_act$exercise)

[1] 2.288453

`group_by()` and `summarise()`

Group the data by arms. Then, for each arm, calculate the mean participant age.

trial_act %>% 
  group_by(arms) %>%
  summarise(
    agey.mean = mean(agey)
  )

Group the data by arms.For each arm, calculate the mean participant age and the median Karnofsky score.

trial_act %>% 
  group_by(arms) %>%
  summarise(
    agey_mean = mean(agey),
    karnof_median = median(karnof)
  )

# A tibble: 4 x 3
   arms agey_mean karnof_median
  <int>     <dbl>         <dbl>
1     0      35.2          100.
2     1      35.2          100.
3     2      35.4          100.
4     3      35.1          100.

Group the data by gender. Then, separately for male and female patients, calculate the percent who have a history of intravenous drug use.

To calculate a percent of a binary variable with 0s and 1s, just calculate the mean

trial_act %>%
  group_by(gender) %>%
  summarise(
    drug_percent = mean(drugs)
  )

# A tibble: 2 x 2
  gender drug_percent
   <int>        <dbl>
1      0        0.236
2      1        0.110

Group the data by gender. Then calculate the percent of male and female patients who have a history of intravenous drug use (drugs), and the mean number of days until a major negative event days

trial_act %>%
  group_by(gender) %>%
  summarise(
    drug_percent = mean(drugs),
    homo_percent = mean(homo)
  )

# A tibble: 2 x 3
  gender drug_percent homo_percent
   <int>        <dbl>        <dbl>
1      0        0.236       0.0299
2      1        0.110       0.792

Separately for all combinations of gender and race, calculate the mean age and mean CD4 T cell count at baseline (cd40) (Hint: group by gender and race)

trial_act %>%
  group_by(gender, race) %>%
  summarise(
    agey_mean = mean(agey),
    cd40_mean = mean(cd40)
  )

# A tibble: 4 x 4
# Groups:   gender [?]
  gender  race agey_mean cd40_mean
   <int> <int>     <dbl>     <dbl>
1      0     0      35.1      356.
2      0     1      33.7      360.
3      1     0      35.9      350.
4      1     1      34.0      345.

Challenges

Now let’s check the major differences between the treatment arms. For each arm, calculate the following:

Mean days until a a major negative event (days)
Mean CD4 T cell count at baseline.
Mean CD4 T cell count at 20 weeks.
Mean CD4 T cell count at 96 weeks.
Mean change in CD4 T cell count between baseline and 96 weeks
Number of patients (Hint: use N = n())

trial_act %>%
  group_by(arms) %>%
  summarise(
    days_mean = mean(days),
    cd4_bl = mean(cd40),
    cd4_20 = mean(cd420),
    cd4_96 = mean(cd496, na.rm = TRUE),
    cd4_change = mean(cd496 - cd40, na.rm = TRUE),
    N = n()
  )

# A tibble: 4 x 7
   arms days_mean cd4_bl cd4_20 cd4_96 cd4_change     N
  <int>     <dbl>  <dbl>  <dbl>  <dbl>      <dbl> <int>
1     0      801.   353.   336.   288.     -77.5    532
2     1      916.   349.   403.   341.      -6.90   522
3     2      906.   353.   372.   355.      -4.36   524
4     3      893.   347.   374.   329.     -19.4    561

Repeat the previous analysis, but before you do the grouping and summary statistics, create a new variable called arms_char that shows the values of arms as characters that reflect what the values actually represent (hint: use mutate() and case_when()). For example, looking at the help file ?ACTG175, I can see that the treatment arm of 0 is “zidovudine”. I might call this arm "Z". Do this in the all in the same chunk of code.

trial_act %>%
  mutate(
    arms_char = case_when(
      arms == 0 ~ "Z",
      arms == 1 ~ "ZD",
      arms == 2 ~ "ZZ",
      arms == 3 ~ "D"
    )
  ) %>%
  group_by(arms_char) %>%
  summarise(
    days_mean = mean(days),
    cd4_bl = mean(cd40),
    cd4_20 = mean(cd420),
    cd4_96 = mean(cd496, na.rm = TRUE),
    cd4_change = mean(cd496 - cd40, na.rm = TRUE)
  )

# A tibble: 4 x 6
  arms_char days_mean cd4_bl cd4_20 cd4_96 cd4_change
  <chr>         <dbl>  <dbl>  <dbl>  <dbl>      <dbl>
1 D              893.   347.   374.   329.     -19.4 
2 Z              801.   353.   336.   288.     -77.5 
3 ZD             916.   349.   403.   341.      -6.90
4 ZZ             906.   353.   372.   355.      -4.36

Repeat the previous analysis, but only include patients with a Karnofsky score equal to 100, and who did not use zidovudine in the 30 days prior to the treatment initiation (z30)

trial_act %>%
  filter(karnof == 100 & z30 == 0) %>%
  mutate(
    arms_char = case_when(
      arms == 0 ~ "Z",
      arms == 1 ~ "ZD",
      arms == 2 ~ "ZZ",
      arms == 3 ~ "D"
    )
  ) %>%
  group_by(arms_char) %>%
  summarise(
    days_mean = mean(days),
    cd4_bl = mean(cd40),
    cd4_20 = mean(cd420),
    cd4_96 = mean(cd496, na.rm = TRUE),
    cd4_change = mean(cd496 - cd40, na.rm = TRUE)
  )

# A tibble: 4 x 6
  arms_char days_mean cd4_bl cd4_20 cd4_96 cd4_change
  <chr>         <dbl>  <dbl>  <dbl>  <dbl>      <dbl>
1 D              883.   364.   409.   359.      -2.22
2 Z              822.   373.   375.   318.     -63.1 
3 ZD             890.   359.   446.   381.      16.8 
4 ZZ             910.   383.   416.   421.      41.2

In one block of code, complete the following tasks
- Change the name of the column weightkg to weight
- Create a new column called drugs_char that uses strings instead of numbers to indicate drug use
- Filter the data to only include male patients with id numbers greater than 10100
- Group the data by drugs_char and arms
- For each group calculate the mean age, percentage of patients that are male, and mean number of days until a major negative event.

trial_act %>%
  rename(
    weight = weightkg
  ) %>%
  mutate(drugs_char = case_when(
    drugs == 0 ~ "No drug use",
    drugs == 1 ~ "Prior drug use"
  )) %>%
  filter(gender_char == "male" & pidnum > 10100) %>%
  group_by(drugs_char, arms) %>%
  summarise(
    age_mean = mean(agey),
    male = mean(gender_char == "male"),
    days_mean = mean(days)
  )

# A tibble: 8 x 5
# Groups:   drugs_char [?]
  drugs_char      arms age_mean  male days_mean
  <chr>          <int>    <dbl> <dbl>     <dbl>
1 No drug use        0     35.6    1.      803.
2 No drug use        1     35.1    1.      914.
3 No drug use        2     35.4    1.      910.
4 No drug use        3     34.7    1.      912.
5 Prior drug use     0     35.7    1.      741.
6 Prior drug use     1     39.2    1.      955.
7 Prior drug use     2     36.8    1.      905.
8 Prior drug use     3     37.7    1.      802.

Additional reading

For more details on data wrangling with R, check out the chapter in YaRrr! The Pirate’s Guide to R YaRrr! Chapter Link
Hadley Wickham, the author of dplyr, also has great examples in the dplyr vignette here: dplyr vignette link

Data wrangling with dplyr

BaselRBootcamp April 2018