Overview

In this practical you’ll practice “data wrangling” with the dplyr and tidyr packages (part of the `tidyverse collection of packages).

By the end of this practical you will know how to:

Change column names, select specific columns
Create new columns based on existing ones
Select specific rows of data based on multiple criteria
Group data and calculate summary statistics
Combine multiple data sets through key columns
Convert data between wide and long formats

Datasets

You’ll need the following datasets for this practical:

library(tidyverse)
trial_act <- read_csv("../_data/baselrbootcamp_data/trial_act.csv")
trial_act_demo <- read_csv("../_data/baselrbootcamp_data/trial_act_demo_fake.csv")

File	Rows	Columns
trial_act.csv	2139	27
trial_act_demo_fake	2139	3

Packages

Package	Installation
`tidyverse`	`install.packages("tidyverse")`

Glossary

Function	Package	Description
`rename()`	`dplyr`	Rename columns
`select()`	`dplyr`	Select columns based on name or index
`filter()`	`dplyr`	Select rows based on some logical criteria
`arrange()`	`dplyr`	Sort rows
`mutate()`	`dplyr`	Add new columns
`case_when()`	`dplyr`	Recode values of a column
`group_by(), summarise()`	`dplyr`	Group data and then calculate summary statistics
`left_join()`	`dplyr`	Combine multiple data sets using a key column
`spread()`	`tidyr`	Convert long data to wide format - from rows to columns
`gather()`	`tidyr`	Convert wide data to long format - from columns to rows

Cheatsheet

Trulli — https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Examples

# Wrangling with dplyr and tidyr ---------------------------

library(tidyverse)    # Load tidyverse for dplyr and tidyr

# Load baselers data
baselers <- read_csv("https://raw.githubusercontent.com/therbootcamp/baselers/master/inst/extdata/baselers.txt")

# Perform many dplyr operations in a row

baselers %>%
  
  # Change some names
  rename(age_y = age,
         swimming = rhine) %>%
  
  # Only include people over 30
  filter(age_y > 30) %>%
  
  # Calculate some new columns
  mutate(weight_lbs = weight * 2.22,
         height_m = height / 100,
         BMI = weight / height_m ^ 2,
         
         # Make binary version of sex
         sex_bin = case_when(
                      sex == "male" ~ 0,
                      sex == "female" ~ 1),

        # Show when height is greater than 150
        height_lt_150 = case_when(
                                height < 150 ~ 1,
                                height >= 150 ~ 0)) %>%
  
  # Sort in ascending order of sex, then
  #  descending order of age
  arrange(sex, desc(age_y)))


# Calculate grouped summary statistics

baselers_agg <- baselers %>%
  group_by(sex, education) %>%
  summarise(
    age_mean = mean(age_y, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

Tasks

A - Setup

Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder

# Done!

Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called wrangling_practical.R in the 2_Code folder.
Using library() load the set of packages for this practical listed in the packages section above.
For this practical, we’ll use the trial_act data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act.

# Load trial_act.csv from the data folder in your working directory

trial_act <- read_csv(file = "XX")

trial_act <- read_csv(file = "1_Data/trial_act.csv")

Using the same code structure, load the trial_act_demo_fake.csv data as a new dataframe called trial_act_demo_fake

trial_act_demo_fake <- read_csv(file = "1_Data/trial_act_demo_fake")

Take a look at the first few rows of the datasets by printing them to the console.

# Print trial_act object
trial_act

# A tibble: 2,139 x 27
   pidnum   age  wtkg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <int> <int> <dbl> <int> <int> <int>  <int>  <int> <int>  <int>   <int>
 1  10056    48  89.8     0     0     0    100      0     0      1       0
 2  10059    61  49.4     0     0     0     90      0     1      1     895
 3  10089    45  88.5     0     1     1     90      0     1      1     707
 4  10093    47  85.3     0     1     0    100      0     1      1    1399
 5  10124    43  66.7     0     1     0    100      0     1      1    1352
 6  10140    46  88.9     0     1     1    100      0     1      1    1181
 7  10165    31  73.0     0     1     0    100      0     1      1     930
 8  10190    41  66.2     0     1     1    100      0     1      1    1329
 9  10198    40  82.6     0     1     0     90      0     1      1    1074
10  10229    35  78.0     0     1     0    100      0     1      1     964
# ... with 2,129 more rows, and 16 more variables: race <int>,
#   gender <int>, str2 <int>, strat <int>, symptom <int>, treat <int>,
#   offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>, r <int>,
#   cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>

Use the the summary() function to get more details on the datasets.

summary(trial_act)

     pidnum            age            wtkg            hemo      
 Min.   : 10056   Min.   :12.0   Min.   : 31.0   Min.   :0.000  
 1st Qu.: 81446   1st Qu.:29.0   1st Qu.: 66.7   1st Qu.:0.000  
 Median :190566   Median :34.0   Median : 74.4   Median :0.000  
 Mean   :248778   Mean   :35.2   Mean   : 75.1   Mean   :0.084  
 3rd Qu.:280277   3rd Qu.:40.0   3rd Qu.: 82.6   3rd Qu.:0.000  
 Max.   :990077   Max.   :70.0   Max.   :159.9   Max.   :1.000  
                                                                
      homo           drugs           karnof          oprior     
 Min.   :0.000   Min.   :0.000   Min.   : 70.0   Min.   :0.000  
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.: 90.0   1st Qu.:0.000  
 Median :1.000   Median :0.000   Median :100.0   Median :0.000  
 Mean   :0.661   Mean   :0.131   Mean   : 95.4   Mean   :0.022  
 3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:100.0   3rd Qu.:0.000  
 Max.   :1.000   Max.   :1.000   Max.   :100.0   Max.   :1.000  
                                                                
      z30           zprior     preanti          race           gender     
 Min.   :0.00   Min.   :1   Min.   :   0   Min.   :0.000   Min.   :0.000  
 1st Qu.:0.00   1st Qu.:1   1st Qu.:   0   1st Qu.:0.000   1st Qu.:1.000  
 Median :1.00   Median :1   Median : 142   Median :0.000   Median :1.000  
 Mean   :0.55   Mean   :1   Mean   : 379   Mean   :0.288   Mean   :0.828  
 3rd Qu.:1.00   3rd Qu.:1   3rd Qu.: 740   3rd Qu.:1.000   3rd Qu.:1.000  
 Max.   :1.00   Max.   :1   Max.   :2851   Max.   :1.000   Max.   :1.000  
                                                                          
      str2           strat         symptom          treat      
 Min.   :0.000   Min.   :1.00   Min.   :0.000   Min.   :0.000  
 1st Qu.:0.000   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:1.000  
 Median :1.000   Median :2.00   Median :0.000   Median :1.000  
 Mean   :0.586   Mean   :1.98   Mean   :0.173   Mean   :0.751  
 3rd Qu.:1.000   3rd Qu.:3.00   3rd Qu.:0.000   3rd Qu.:1.000  
 Max.   :1.000   Max.   :3.00   Max.   :1.000   Max.   :1.000  
                                                               
     offtrt           cd40          cd420          cd496     
 Min.   :0.000   Min.   :   0   Min.   :  49   Min.   :   0  
 1st Qu.:0.000   1st Qu.: 264   1st Qu.: 269   1st Qu.: 209  
 Median :0.000   Median : 340   Median : 353   Median : 321  
 Mean   :0.363   Mean   : 351   Mean   : 371   Mean   : 329  
 3rd Qu.:1.000   3rd Qu.: 423   3rd Qu.: 460   3rd Qu.: 440  
 Max.   :1.000   Max.   :1199   Max.   :1119   Max.   :1190  
                                               NA's   :797   
       r              cd80          cd820           cens      
 Min.   :0.000   Min.   :  40   Min.   : 124   Min.   :0.000  
 1st Qu.:0.000   1st Qu.: 654   1st Qu.: 632   1st Qu.:0.000  
 Median :1.000   Median : 893   Median : 865   Median :0.000  
 Mean   :0.627   Mean   : 987   Mean   : 935   Mean   :0.244  
 3rd Qu.:1.000   3rd Qu.:1207   3rd Qu.:1146   3rd Qu.:0.000  
 Max.   :1.000   Max.   :5011   Max.   :6035   Max.   :1.000  
                                                              
      days           arms     
 Min.   :  14   Min.   :0.00  
 1st Qu.: 727   1st Qu.:1.00  
 Median : 997   Median :2.00  
 Mean   : 879   Mean   :1.52  
 3rd Qu.:1091   3rd Qu.:3.00  
 Max.   :1231   Max.   :3.00

summary(trial_act_demo_fake)

     pidnum          exercise    education        
 Min.   : 10056   Min.   :0.0   Length:2139       
 1st Qu.: 81446   1st Qu.:1.0   Class :character  
 Median :190566   Median :1.0   Mode  :character  
 Mean   :248778   Mean   :2.2                     
 3rd Qu.:280277   3rd Qu.:3.0                     
 Max.   :990077   Max.   :7.0

B - Change column names with rename()

Print the names of the trial_act data with names()

names(trial_act)

 [1] "pidnum"  "age"     "wtkg"    "hemo"    "homo"    "drugs"   "karnof" 
 [8] "oprior"  "z30"     "zprior"  "preanti" "race"    "gender"  "str2"   
[15] "strat"   "symptom" "treat"   "offtrt"  "cd40"    "cd420"   "cd496"  
[22] "r"       "cd80"    "cd820"   "cens"    "days"    "arms"

Using rename(), change the column name wtkg in the trial_act dataframe to weight_kg. Be sure to assign the result back to trial_act to change it!

# Change the name to weight_kg from wtkg

trial_act <- trial_act %>%
  rename(XX = XX)

trial_act <- trial_act %>%
  rename(weight_kg = wtkg)

Look at the names of your trial_act dataframe again, do you now see the column weight_kg?

names(trial_act)

 [1] "pidnum"    "age"       "weight_kg" "hemo"      "homo"     
 [6] "drugs"     "karnof"    "oprior"    "z30"       "zprior"   
[11] "preanti"   "race"      "gender"    "str2"      "strat"    
[16] "symptom"   "treat"     "offtrt"    "cd40"      "cd420"    
[21] "cd496"     "r"         "cd80"      "cd820"     "cens"     
[26] "days"      "arms"

Change the column name age to age_y (to specify that age is in years).

trial_act <- trial_act %>%
  rename(age_y = age)

C - Select columns with select()

Using the select() function, select only the column age_y and print the result (but don’t assign it to anything). Do you see only the age column now?

trial_act %>% 
  select(age_y)

# A tibble: 2,139 x 1
   age_y
   <int>
 1    48
 2    61
 3    45
 4    47
 5    43
 6    46
 7    31
 8    41
 9    40
10    35
# ... with 2,129 more rows

Create a new dataframe called CD4_wide that only contains the columns pidnum, arms, cd40, cd420, and cd496. The cd40, cd420, and cd496 columns show patient’s CD4 T cell counts at baseline, 20 weeks, and 96 weeks. Print the result to make sure it worked!

XX <- trial_act %>% 
  select(XX, XX, XX, XX, ...)

CD4_wide <- trial_act %>%
  select(`pidnum`, `arms`, `cd40`, `cd420`, `cd496`)

Did you know you can easily select all columns that start with specific characters using starts_with()? Try adapting the following code to get the same result you got before.

CD4_wide <- trial_act %>% 
  select(pidnum, arms, starts_with("XXX"))

CD4_wide <- trial_act %>% 
  select(pidnum, arms, starts_with("cd"))

D - Add new columns with mutate()

Using the mutate() function, add the column age_m which shows each patient’s age in months instead of years (Hint: Just multiply age_y by 12!)

trial_act <- trial_act %>%
  mutate(XX = XX * 12)

trial_act <- trial_act %>%
  mutate(age_m = age_y * 12)

Add the following new columns to trial_act. Try combining these into one call to the mutate() function!

weight_lb: Weight in lbs instead of kilograms. You can do this by multiplying weight_kg by 2.2.
cd_change_20: Change in CD4 T cell count from baseline to 20 weeks. You can do this by taking cd420 - cd40
cd_change_960: Change in CD4 T cell count from baseline to 96 weeks. You can do this by taking cd496 - cd40

trial_act <- trial_act %>% 
  mutate(weight_lb = XXX,
         cd_change_20 = XXX,
         XX = XX)

trial_act <- trial_act %>% 
  mutate(weight_lb = weight_kg * 2.2,
         cd_change_20 = cd420 - cd40,
         cd_change_960 = cd496 - cd40)

If you look at the gender column, you will see that it is numeric. Change the column so it shows gender as a string, where 0 = “female” and 1 = “male”. To do this, use a combination of mutate() and case_when:

# Create gender_char which shows gender as a stringh
trial_act <- trial_act %>%
  mutate(
  gender_char = case_when(
    gender == XX ~ "XX",
    gender == XX ~ "XX"))

trial_act <- trial_act %>%
  mutate(
  gender_char = case_when(
    gender == 0 ~ "female",
    gender == 1 ~ "male"))

The column arms is also numeric and not very meaningful. Create a new column arms_char contains the names of the arms. Here is a table of the mapping

arms	arms_char
0	zidovudine
1	zidovudine and didanosine
2	zidovudine and zalcitabine
3	didanosine

trial_act <- trial_act %>%
  mutate(
  arms_char = case_when(
    arms == 0 ~ "zidovudine",
    arms == 1 ~ "zidovudine and didanosine",
    arms == 2 ~ "zidovudine and zalcitabine",
    arms == 3 ~ "didanosine"))

If you haven’t already, put the code for your previous questions in one call to mutate(). That is, in one block of code, create agem, weight_lb, cd_change_20, cd_change_960, gender_char and over50 using the mutate() function only once. Here’s how your code should look:

trial_act <- trial_act %>%
  mutate(
    agem = XXX,
    weight_lb = XXX,
    cd_change_20 = XXX,
    cd_change_960 = XXX,
    gender_char = case_when(XXX),
    arms_char = case_when(XXX)
  )

trial_act <- trial_act %>%
  mutate(
    agem = age_y * 12,
    weight_lb = weight_kg * 2.2,
    cd_change_20 = cd420 - cd40,
    cd_change_960 = cd496 - cd40,
    gender_char = case_when(
                      gender == 0 ~ "female",
                      gender == 1 ~ "male"),
   arms_char = case_when(
    arms == 0 ~ "zidovudine",
    arms == 1 ~ "zidovudine and didanosine",
    arms == 2 ~ "zidovudine and zalcitabine",
    arms == 3 ~ "didanosine")
  )

E - Arrange rows with arrange()

Using the arrange()function, arrange the trial_act data in ascending order of age_y (from lowest to highest). After you do, look the data to make sure it worked!

trial_act <- trial_act %>% 
 arrange(age_y)

trial_act

# A tibble: 2,139 x 33
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 940533    12      41.4     1     0     0    100      0     0      1
 2 950037    12      53.1     1     0     0    100      0     1      1
 3 950056    12      31       1     0     0    100      0     1      1
 4 910034    13      32.7     1     0     0    100      0     1      1
 5 940534    13      62.9     1     0     0    100      0     0      1
 6 960014    13      48.5     1     0     0    100      0     0      1
 7 310767    14      65       1     0     0    100      0     1      1
 8 920050    14      54.2     1     0     0    100      0     1      1
 9 940544    14      41.1     1     0     0    100      0     0      1
10 950061    14      64.3     1     0     0    100      0     1      1
# ... with 2,129 more rows, and 23 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
#   cd_change_20 <int>, cd_change_960 <int>

Now arrange the data in descending order of age_y (from highest to lowest). After, look the data to make sure it worked. To arrange data in descending order, just include desc() around the variable. E.g.; data %>% arrrange(desc(height))

trial_act <- trial_act %>% 
 arrange(desc(age_y))

trial_act

# A tibble: 2,139 x 33
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1  11438    70      73.9     0     1     0    100      0     1      1
 2 211360    70      63.1     0     1     0     90      0     0      1
 3 211284    69      81.6     0     1     0    100      0     0      1
 4  50580    68      90.5     0     1     1    100      0     1      1
 5  81127    68      70.8     0     1     0     90      0     1      1
 6  10924    67      71       0     1     0    100      0     0      1
 7 241150    67      82.1     0     1     0     90      0     0      1
 8  50662    66      84.4     0     1     0    100      0     0      1
 9  11987    65      77.2     0     1     0     90      0     0      1
10 140797    65      60.5     0     1     0     90      0     0      1
# ... with 2,129 more rows, and 23 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
#   cd_change_20 <int>, cd_change_960 <int>

You can sort the rows of dataframes with multiple columns by including many arguments to arrange(). Now sort the data by arms (arms) and then age (age_y).

trial_act <- trial_act %>% 
 arrange(arms, age_y)

trial_act

# A tibble: 2,139 x 33
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 960014    13      48.5     1     0     0    100      0     0      1
 2 960031    14      48.3     1     0     0    100      0     0      1
 3 990071    14      60       1     0     0    100      0     0      1
 4 980042    16      63       1     0     0    100      0     1      1
 5 171040    17      51.3     0     0     0     90      0     0      1
 6 990026    17     103.      1     0     0    100      0     1      1
 7 310234    18      57.3     1     0     0    100      0     1      1
 8 940519    18      56.8     1     0     0    100      0     1      1
 9 211314    19      50.8     0     0     0     90      0     0      1
10 340767    19      74.8     0     1     0    100      0     0      1
# ... with 2,129 more rows, and 23 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
#   cd_change_20 <int>, cd_change_960 <int>

F - Filter specific rows with `filter()`

Using the filter() function, create a new dataframe called trial_act_m that only contains data from males (gender_char == "male"). After you finish, print your new dataframe to make sure it looks correct!

trial_act_m <- trial_act %>%
  filter(gender_char == "male")

A colleague of yours named Tracy wants a datafame only containing data from females over the age of 40. Create this dataframe with filter() and call it trial_act_Tracy

trial_act_Tracy <- trial_act %>%
  filter(age_y > XX & gender == XX)

trial_act_Tracy <- trial_act %>%
  filter(age_y > 40 & gender_char == "female")

G - Combine dataframes with `left_join()`

The trial_act_demo_fake.csv file contains additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Use the left_join() function to combine the trial_act and trial_act_demo_fake datasets, set the by argument to the name of the column that is common in both data sets. Assign the result to trial_act.

trial_act <- trial_act %>%
  left_join(XX, by = "XX")

trial_act <- trial_act %>%
  left_join(trial_act_demo_fake, by = "pidnum")

Print your new trial_act dataframe. Do you now see the demographic data?

trial_act

# A tibble: 2,139 x 35
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 960014    13      48.5     1     0     0    100      0     0      1
 2 960031    14      48.3     1     0     0    100      0     0      1
 3 990071    14      60       1     0     0    100      0     0      1
 4 980042    16      63       1     0     0    100      0     1      1
 5 171040    17      51.3     0     0     0     90      0     0      1
 6 990026    17     103.      1     0     0    100      0     1      1
 7 310234    18      57.3     1     0     0    100      0     1      1
 8 940519    18      56.8     1     0     0    100      0     1      1
 9 211314    19      50.8     0     0     0     90      0     0      1
10 340767    19      74.8     0     1     0    100      0     0      1
# ... with 2,129 more rows, and 25 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
#   cd_change_20 <int>, cd_change_960 <int>, exercise <int>,
#   education <chr>

H - Calculate grouped statistics with `group_by()` and `summarise()`

In this code we’ll calculate summary statistics for each of the trial arms. Start with the trial_act dataframe. Then, group the data by arms. Then, for each arm, calculate the mean participant age (in years) as a new column called age_mean. Also, using N = n(), calculate the number of cases for each group. Assign the result to a new object called trial_arm.

trial_arm <- trial_act %>% 
  group_by(XX) %>%
  summarise(
    N = n(),
    XX = mean(XX)
  )

trial_arm <- trial_act %>% 
  group_by(arms) %>%
  summarise(
    N = n(),
    age_mean = mean(age_y)
  )

Adjust your previous code to calculate the standard deviation of age in addition to the mean.

trial_arm <- trial_act %>% 
  group_by(arms) %>%
  summarise(
    N = n(),
    age_mean = mean(age_y),
    age_sd = sd(age_y)
  )

Adjust your previous code to calculate the median number of days until the first major negative event (days) for each arm.

trial_arm <- trial_act %>% 
  group_by(arms) %>%
  summarise(
    N = n(),
    age_mean = mean(age_y),
    age_sd = sd(age_y),
    days_med = median(days)
  )

Create a new dataframe called trial_sex groups the data based on gender and calculates the same summary statistics as you did for trial_arm

trial_gender <- trial_act %>% 
  group_by(gender) %>%
  summarise(
    N = n(),
    age_mean = mean(age_y),
    age_sd = sd(age_y)
  )

trial_gender

# A tibble: 2 x 4
  gender     N age_mean age_sd
   <int> <int>    <dbl>  <dbl>
1      0   368     34.3   8.24
2      1  1771     35.4   8.79

Create a new dataframe called trial_arms_gender that calculates the same summary statistics for all the groups gender and arms. Hint: Just add a second grouping variable!

trial_arms_gender <- trial_act %>% 
  group_by(gender, arms) %>%
  summarise(
    N = n(),
    age_mean = mean(age_y),
    age_sd = sd(age_y)
  )

trial_arms_gender

# A tibble: 8 x 5
# Groups:   gender [?]
  gender  arms     N age_mean age_sd
   <int> <int> <int>    <dbl>  <dbl>
1      0     0   100     33.6   7.75
2      0     1    88     33.5   8.71
3      0     2    89     35.0   8.37
4      0     3    91     35.2   8.15
5      1     0   432     35.6   9.05
6      1     1   434     35.6   8.67
7      1     2   435     35.5   8.90
8      1     3   470     35.1   8.58

I - Reshaping with gather() and spread()

Remember the CD4_wide dataframe you created before? Currently it is in the wide format, where key data (different CD4 T cell counts) are in different columns. Now we will try to convert it to a long format. Our goal is to get the data in the ‘long’ format. Using the spread() function, create a new dataframe called CD4_long that shows the data in the ‘long’ format. To do this, use the following template. Set the grouping column to time and the new data column to value.

CD4_long <- CD4_wide %>% 
  gather(XX,  # New grouping column
         XX,  # New data column
         -pidnum, -arms)  # Names of columns to replicate

CD4_long <- CD4_wide %>% 
  gather(time,  # New grouping column
         value,  # New data column
         -pidnum, -arms)  # Names of columns to replicate

Print your CD4_long dataframe and compare it to the original CD4_wide

CD4_long

# A tibble: 6,417 x 4
   pidnum  arms time  value
    <int> <int> <chr> <int>
 1  10056     2 cd40    422
 2  10059     3 cd40    162
 3  10089     3 cd40    326
 4  10093     3 cd40    287
 5  10124     0 cd40    504
 6  10140     1 cd40    235
 7  10165     0 cd40    244
 8  10190     0 cd40    401
 9  10198     3 cd40    214
10  10229     0 cd40    221
# ... with 6,407 more rows

Now that your data are in the wide format, it should be easy to calculate grouped summary statistics! For each time point and trial arm, calculate the mean CD4 T cell count using group_by() and summarise().

CD4_long %>%
  group_by(time, arms) %>%
  summarise(count_mean = mean(value, na.rm = TRUE))

# A tibble: 12 x 3
# Groups:   time [?]
   time   arms count_mean
   <chr> <int>      <dbl>
 1 cd40      0       353.
 2 cd40      1       349.
 3 cd40      2       353.
 4 cd40      3       347.
 5 cd420     0       336.
 6 cd420     1       403.
 7 cd420     2       372.
 8 cd420     3       374.
 9 cd496     0       288.
10 cd496     1       341.
11 cd496     2       355.
12 cd496     3       329.

Now it’s time to practice moving data from the long to the wide format. Using the following template, use the spread() function to convert CD4_long back the wide format. Assign the result to a new object called CD4_wide_2.

CD4_wide_2 <- CD4_long %>% 
  spread(XX,   # old group column
         XX)   # old target column

CD4_wide_2 <- CD4_long %>% 
  spread(time,   # old group column
         value)   # old target column

Compare CD4_wide_2 to CD4_wide do they look the same?

CD4_wide_2

# A tibble: 2,139 x 5
   pidnum  arms  cd40 cd420 cd496
    <int> <int> <int> <int> <int>
 1  10056     2   422   477   660
 2  10059     3   162   218    NA
 3  10089     3   326   274   122
 4  10093     3   287   394    NA
 5  10124     0   504   353   660
 6  10140     1   235   339   264
 7  10165     0   244   225   106
 8  10190     0   401   366   453
 9  10198     3   214   107     8
10  10229     0   221   132    NA
# ... with 2,129 more rows

Advanced

X - Play around with “Scoped” functions

Many common dplyr functions like mutate() and summarise() have ‘scoped’ versions with suffixes like _if and _all. that allow you do some pretty cool stuff easily (look at the help menu with ?scoped for details). Try running the following chunk with summarise_if() and see what happens:

# See how summarise_if() works!
trial_act %>%
  group_by(gender) %>%
  summarise_if(is.numeric, mean)

# See how summarise_if() works!
trial_act %>%
  group_by(gender) %>%
  summarise_if(is.numeric, mean)

# A tibble: 2 x 32
  gender  pidnum age_y weight_kg   hemo   homo drugs karnof  oprior   z30
   <int>   <dbl> <dbl>     <dbl>  <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl>
1      0 258094.  34.3      68.1 0.0136 0.0299 0.236   95.6 0.00815 0.590
2      1 246842.  35.4      76.6 0.0988 0.792  0.110   95.4 0.0248  0.542
# ... with 22 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
#   str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
#   cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>,
#   cd820 <dbl>, cens <dbl>, days <dbl>, arms <dbl>, agem <dbl>,
#   weight_lb <dbl>, cd_change_20 <dbl>, cd_change_960 <dbl>,
#   exercise <dbl>

Now, in the trial_act dataset, group the data by arms and calculate the mean of all numeric columns using summarise_if().

trial_act %>%
  group_by(arms) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

# A tibble: 4 x 32
   arms  pidnum age_y weight_kg   hemo  homo drugs karnof oprior   z30
  <int>   <dbl> <dbl>     <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1     0 252884.  35.2      76.1 0.0789 0.641 0.118   95.4 0.0301 0.547
2     1 232609.  35.2      74.9 0.0824 0.663 0.140   95.5 0.0172 0.552
3     2 257593.  35.4      74.7 0.0878 0.664 0.145   95.7 0.0248 0.561
4     3 251696.  35.1      74.9 0.0873 0.676 0.123   95.1 0.0160 0.542
# ... with 22 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
#   gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
#   offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
#   cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
#   weight_lb <dbl>, cd_change_20 <dbl>, cd_change_960 <dbl>,
#   exercise <dbl>

Here’s another scoped function in action mutate_if():

# use mutate_if() to round all numeric variables to 2 digits
trial_act %>%
  mutate_if(is.numeric, round, 2)

Using mutate_if(), round all of your results from the previous question to 0 decimal places (to the nearest integer)

trial_act %>%
  group_by(arms) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  mutate_if(is.numeric, round, 2)

# A tibble: 4 x 32
   arms  pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30
  <dbl>   <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1     0 252884.  35.2      76.1  0.08  0.64  0.12   95.4   0.03  0.55
2     1 232609.  35.2      74.9  0.08  0.66  0.14   95.5   0.02  0.55
3     2 257593.  35.4      74.7  0.09  0.66  0.15   95.7   0.02  0.56
4     3 251696.  35.1      74.9  0.09  0.68  0.12   95.1   0.02  0.54
# ... with 22 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
#   gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
#   offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
#   cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
#   weight_lb <dbl>, cd_change_20 <dbl>, cd_change_960 <dbl>,
#   exercise <dbl>

Y - Combine many functions

For each arm, calculate the following:

Mean days until a a major negative event (days)
Mean CD4 T cell count at baseline. (cd40)
Mean CD4 T cell count at 20 weeks. (cd420)
Mean CD4 T cell count at 96 weeks. (cd496)
Mean change in CD4 T cell count between baseline and 96 weeks
Number of patients in each arm

trial_act %>%
  group_by(arms) %>%
  summarise(
    days_mean = mean(days),
    cd4_bl = mean(cd40),
    cd4_20 = mean(cd420),
    cd4_96 = mean(cd496, na.rm = TRUE),
    cd4_change = mean(cd496 - cd40, na.rm = TRUE),
    N = n()
  )

# A tibble: 4 x 7
   arms days_mean cd4_bl cd4_20 cd4_96 cd4_change     N
  <int>     <dbl>  <dbl>  <dbl>  <dbl>      <dbl> <int>
1     0      801.   353.   336.   288.     -77.5    532
2     1      916.   349.   403.   341.      -6.90   522
3     2      906.   353.   372.   355.      -4.36   524
4     3      893.   347.   374.   329.     -19.4    561

Create the following dataframe that shows patient’s mean CD8 T cell count (from columns cd80 and cd820), where the data are grouped by time and trial arm. (Hint: use the following functions in order: select(), gather(), mutate(), group_by(), summarise())

trial_act %>%
  mutate(
    arms_char = case_when(
      arms == 0 ~ "Z",
      arms == 1 ~ "ZD",
      arms == 2 ~ "ZZ",
      arms == 3 ~ "D"
    )
  ) %>% 
  select(pidnum, arms_char, starts_with("cd8")) %>%
  gather(time, measure, -pidnum, -arms_char) %>%
  mutate(time = case_when(time == "cd80" ~ "baseline",
                          time == "cd820" ~ "week 20")) %>%
  group_by(time, arms_char) %>%
  summarise(N = n(),
            cd8_mean = mean(measure),
            cd8_median = median(measure))

# A tibble: 8 x 5
# Groups:   time [?]
  time     arms_char     N cd8_mean cd8_median
  <chr>    <chr>     <int>    <dbl>      <dbl>
1 baseline D           561     972.       890 
2 baseline Z           532     987.       881 
3 baseline ZD          522    1004.       917 
4 baseline ZZ          524     984.       898.
5 week 20  D           561     943.       871 
6 week 20  Z           532     928.       818 
7 week 20  ZD          522     968.       903 
8 week 20  ZZ          524     902.       862

Additional Resources

See https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html for the full dplyr vignette with lots of wrangling tips and tricks.

Wrangling

Basel R Bootcamp July 2018www.therbootcamp.com@therbootcamp