Overview

In this practical you’ll practice “data wrangling” with the dplyr and tidyr packages (part of the tidyverse collection of packages).

By the end of this practical you will know how to:

Change column names, select specific columns
Create new columns based on existing ones
Select specific rows of data based on multiple criteria
Group data and calculate summary statistics
Combine multiple data sets through key columns
Convert data between wide and long formats

Cheatsheet

Data wrangling with dplyr and tidyr Cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf.

Glossary

Here are the main functions you will be using in dplyr:

Function	Description	Example
`filter()`	Select rows based on some criteria	data %>% filter(age > 40 & sex == "m")
`arrange()`	Sort rows	data %>% arrange(date, group)
`select()`	Select columns	data %>% select(age, sex); data %>% select(age, sex, everything())
`rename()`	Rename columns	data %>% rename(new = old)
`mutate()`	Add new columns	data %>% mutate(height.m = height.cm / 100)
`case_when()`	Recode values of a column	data %>% mutate(sex_n = case_when(sex == 0 ~ "m", sex == 1 ~ "f"))
`group_by(), summarise()`	Group data and then calculate summary statistics	data %>% group_by(…) %>% summarise(…)
`left_join()`	Combine multiple data sets using a key column	data %>% left_join(data2, by = "id")

Here are the two functions you will be using from the tidyr package:

Function	Description	Example
`spread()`	Convert long data to wide format - from rows to columns	data %>% spread(time, data)
`gather()`	Convert wide data to long format - from columns to rows	data %>% gather(time, data, -id)

Examples

The following examples will take you through the steps of doing data wrangling with dplyr. Try to go through each line of code and see how it works!

# -----------------------------------------------
# Examples of using dplyr on the baselers data
# ------------------------------------------------

library(tidyverse)         # Load tidyverse for dplyr and tidyr
library(skimr)

# Load baselers data
baselers <- read_csv("https://raw.githubusercontent.com/therbootcamp/baselers/master/inst/extdata/baselers.txt")

# Skim the data
skim(baselers)

baselers <- baselers %>%
  
  # Change some names
  rename(age_y = age,
         swimming = rhine) %>%
  
  # Only include people over 30
  filter(age_y > 30) %>%
  
  # Calculate some new columns
  mutate(weight_lbs = weight * 2.2,
         height_m = height / 100,
         BMI = weight / height_m ^ 2,
         
         # Make binary version of sex
         sex_bin = case_when(
                      sex == "male" ~ 0,
                      sex == "female" ~ 1,
                      TRUE ~ NA_real_),

        # Flag when height is less than 150
        height_lt_150 = case_when(
          height < 150 ~ 1,
          height >= 150 ~ 0,
          TRUE ~ NA_real_
        )) %>%
  
  # Sort in ascending order of sex, then
  #  descending order of age
  arrange(sex, desc(age_y))


# Calculate grouped summary statistics

baselers_agg <- baselers %>%
  group_by(sex, education) %>%
  summarise(
    age_mean = mean(age_y, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

Packages

Package	Installation
`tidyverse`	`install.packages("tidyverse")`
`skimr`	`install.packages("skimr")`

Datasets

File	Rows	Columns
`trial_act.csv`	2139	27
`trial_act_demo_fake.csv`	2139	3

Tasks

Getting setup

A. Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that the trial_act.csv and trial_act_demo_fake.csv files are in your 1_Data folder.

# Done!

B. Open a new R script and save it as a new file called wrangling_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load the set of packages for this practical with library(). Here’s how the top of your script should look:

## NAME
## DATE
## Wrangling Practical

library(XX)     
library(XX)
#...

C. For this practical, we’ll use the trial_act data, which comes from a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus (HIV). Using the following template, load the data into R and store it as a new object called trial_act.

# Load trial_act.csv from the data folder in your working directory

trial_act <- read_csv(file = "XXX/XXX")

trial_act <- read_csv(file = "1_Data/trial_act.csv")

Parsed with column specification:
cols(
  .default = col_integer(),
  wtkg = col_double()
)

See spec(...) for full column specifications.

D. Using the same code structure, load the trial_act_demo_fake.csv data as a new dataframe called trial_act_demo_fake.

E. The trial_act data is actually a copy of a dataset from the speff2trial package called ACTG175. Look at the help menu for the ACTG175 data by running ?ACTG175 (If you become really interested in the data, you can also read an article discussing the trial here: http://www.nejm.org/doi/full/10.1056/nejm199610103351501#t=article)

# Look at documentation for ACTG175 data (contained in the speff2trial package)
?ACTG175

F. Take a look at the first few rows of the datasets by printing them to the console.

# Print trial_act object
trial_act

G. Use the skim() function (from the skimr package) to get more details on the datasets.

skim(trial_act)

Skim summary statistics
 n obs: 2139 
 n variables: 27 

── Variable type:integer ────────────────────────────────────────────────────────────
 variable missing complete    n       mean        sd    p0      p25    p50
      age       0     2139 2139     35.25       8.71    12    29        34
     arms       0     2139 2139      1.52       1.13     0     1         2
     cd40       0     2139 2139    350.5      118.57     0   263.5     340
    cd420       0     2139 2139    371.31     144.63    49   269       353
    cd496     797     1342 2139    328.57     174.66     0   209.25    321
     cd80       0     2139 2139    986.63     480.2     40   654       893
    cd820       0     2139 2139    935.37     444.98   124   631.5     865
     cens       0     2139 2139      0.24       0.43     0     0         0
     days       0     2139 2139    879.1      292.27    14   727       997
    drugs       0     2139 2139      0.13       0.34     0     0         0
   gender       0     2139 2139      0.83       0.38     0     1         1
     hemo       0     2139 2139      0.084      0.28     0     0         0
     homo       0     2139 2139      0.66       0.47     0     0         1
   karnof       0     2139 2139     95.45       5.9     70    90       100
   offtrt       0     2139 2139      0.36       0.48     0     0         0
   oprior       0     2139 2139      0.022      0.15     0     0         0
   pidnum       0     2139 2139 248778.25  234237.29 10056 81446.5  190566
  preanti       0     2139 2139    379.18     468.66     0     0       142
        r       0     2139 2139      0.63       0.48     0     0         1
     race       0     2139 2139      0.29       0.45     0     0         0
     str2       0     2139 2139      0.59       0.49     0     0         1
    strat       0     2139 2139      1.98       0.9      1     1         2
  symptom       0     2139 2139      0.17       0.38     0     0         0
    treat       0     2139 2139      0.75       0.43     0     1         1
      z30       0     2139 2139      0.55       0.5      0     0         1
   zprior       0     2139 2139      1          0        1     1         1
      p75   p100     hist
     40       70 ▁▂▇▇▃▁▁▁
      3        3 ▇▁▇▁▁▇▁▇
    423     1199 ▁▆▇▃▁▁▁▁
    460     1119 ▂▇▇▅▂▁▁▁
    440     1190 ▃▇▇▅▁▁▁▁
   1207     5011 ▃▇▂▁▁▁▁▁
   1146.5   6035 ▇▇▁▁▁▁▁▁
      0        1 ▇▁▁▁▁▁▁▂
   1091     1231 ▁▂▂▂▂▂▇▇
      0        1 ▇▁▁▁▁▁▁▁
      1        1 ▂▁▁▁▁▁▁▇
      0        1 ▇▁▁▁▁▁▁▁
      1        1 ▅▁▁▁▁▁▁▇
    100      100 ▁▁▁▁▁▅▁▇
      1        1 ▇▁▁▁▁▁▁▅
      0        1 ▇▁▁▁▁▁▁▁
 280277   990077 ▇▇▅▁▁▁▁▂
    739.5   2851 ▇▂▂▁▁▁▁▁
      1        1 ▅▁▁▁▁▁▁▇
      1        1 ▇▁▁▁▁▁▁▃
      1        1 ▆▁▁▁▁▁▁▇
      3        3 ▇▁▁▃▁▁▁▇
      0        1 ▇▁▁▁▁▁▁▂
      1        1 ▂▁▁▁▁▁▁▇
      1        1 ▆▁▁▁▁▁▁▇
      1        1 ▁▁▁▇▁▁▁▁

── Variable type:numeric ────────────────────────────────────────────────────────────
 variable missing complete    n  mean    sd p0   p25   p50   p75   p100
     wtkg       0     2139 2139 75.13 13.26 31 66.68 74.39 82.56 159.94
     hist
 ▁▂▇▅▁▁▁▁

Notes

In this practical you’ll be doing lots of sequential operations on the same data. Whenever you can, try to keep the pipe %>% going to connect each task to the one before it. See below for an example.

# Version A) Try to convert this...

baselers <- baselers %>% rename(age_y = age)
baselers <- baselers %>% rename(swimming = rhine)
baselers <- baselers %>% filter(age_y > 30)

# Version B) To this version where you keep the pipe going from the beginning!

baselers <- baselers %>% 
  
  rename(age_y = age,
         swimming = rhine) %>%
  
  filter(age_y > 30)

Change column names with rename()

Let’s change some of the column names in trial_act. Using rename(), change the column name wtkg to weight_kg (to specify that weight is in kilograms). Be sure to assign the result back to trial_act to change it!
In the same call, change the column name age to age_y (to specify that age is in years).

trial_act <- trial_act %>%
  rename(weight_kg = wtkg,
         age_y = age)

Select columns with select()

Using the select() function, create a new dataframe called CD4_wide that only contains the columns pidnum, arms, cd40, cd420, and cd496. The cd40, cd420, and cd496 columns show patient’s CD4 T cell counts at baseline, 20 weeks, and 96 weeks. Print the result to make sure it worked!
Did you know you can easily select all columns that start with specific characters using starts_with()? Try adapting the following code to get the same result you got before.

CD4_wide <- trial_act %>% 
  select(pidnum, arms, starts_with("XXX"))

CD4_wide <- trial_act %>% 
  select(pidnum, arms, starts_with("cd4"))
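The choice of prefix matters here, because trial_act also has CD8 columns (cd80 and cd820) that begin with the same two letters. The following is a minimal sketch on a made-up two-row tibble: the values are invented and the object name toy is hypothetical; only the column names mirror trial_act. It compares which columns two different prefixes pick up.

```r
library(dplyr)

# Toy data: column names mirror trial_act, values are made up
toy <- tibble(
  pidnum = c(1, 2),
  arms   = c(0, 1),
  cd40   = c(300, 350),
  cd420  = c(320, 400),
  cd496  = c(310, 380),
  cd80   = c(900, 950),
  cd820  = c(880, 940)
)

# A short prefix matches every column whose name begins with "cd"
cols_cd <- toy %>%
  select(pidnum, arms, starts_with("cd")) %>%
  names()

# A longer prefix narrows the match to the CD4 columns only
cols_cd4 <- toy %>%
  select(pidnum, arms, starts_with("cd4")) %>%
  names()

cols_cd
cols_cd4
```

Printing names() after a select() call is a quick way to check that a helper like starts_with() picked up exactly the columns you intended.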

Add new columns with mutate()

Using the mutate() function, add the following new columns to trial_act. Try combining these into one call to the mutate() function!
- agem: Each patient’s age in months instead of years (Hint: Just multiply age_y by 12!).
- weight_lb: Weight in lbs instead of kilograms. You can do this by multiplying weight_kg by 2.2.
- cd_change_20: Change in CD4 T cell count from baseline to 20 weeks. You can do this by taking cd420 - cd40
- cd_change_960: Change in CD4 T cell count from baseline to 96 weeks. You can do this by taking cd496 - cd40

trial_act <- trial_act %>% 
  mutate(agem = age_y * 12)

mutate(case_when())

Create a new column gender_char that shows gender as a character string. To do this, use a combination of mutate() and case_when. The original gender data is stored in the gender column, where 0 = “female” and 1 = “male”.

trial_act <- trial_act %>%
  mutate(
  gender_char = case_when(
    gender == 0 ~ "female",
    gender == 1 ~ "male"
  )
  )

Create a new column over50 that is 1 when patients are older than 50, and 0 when they are younger than or equal to 50 (hint: Use logical comparisons > and <=)

trial_act <- trial_act %>%
  mutate(
  over50 = case_when(
    age_y > 50 ~ 1,
    age_y <= 50 ~ 0
  )
  )
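It can help to see what case_when() does with rows that match none of the conditions. The sketch below uses a small invented tibble (the object toy and the column age_group are hypothetical, not part of the practical) that includes a missing age. Rows that match no condition get NA, unless a final catch-all branch is supplied.

```r
library(dplyr)

# Toy data with one missing age (values are made up)
toy <- tibble(age_y = c(25, 50, 51, NA))

toy <- toy %>%
  mutate(
    # Conditions are checked in order; rows matching none become NA
    over50 = case_when(
      age_y > 50  ~ 1,
      age_y <= 50 ~ 0
    ),
    # A final TRUE ~ ... branch catches every row left over
    age_group = case_when(
      age_y > 50  ~ "over 50",
      age_y <= 50 ~ "50 or under",
      TRUE        ~ "unknown"
    )
  )

toy
```

Note that all right-hand sides within one case_when() call need to be of the same type (all numbers or all strings), which is why the baselers example earlier uses NA_real_ rather than a bare NA.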

If you haven’t already, put the code for your previous questions in one call to mutate(). That is, in one block of code, create agem, weight_lb, cd_change_20, cd_change_960, gender_char and over50 using the mutate() function only once. Here’s how your code should look:

trial_act <- trial_act %>%
  mutate(
    agem = XXX,
    weight_lb = XXX,
    cd_change_20 = XXX,
    cd_change_960 = XXX,
    gender_char = case_when(XXX),
    over50 = case_when(XXX)
  )

trial_act <- trial_act %>%
  mutate(
    agem = age_y * 12,
    weight_lb = weight_kg * 2.2,
    cd_change_20 = cd420 - cd40,
    cd_change_960 = cd496 - cd40,
    gender_char = case_when(
      gender == 0 ~ "female",
      gender == 1 ~ "male"
    ),
    over50 = case_when(
      age_y > 50 ~ 1,
      age_y <= 50 ~ 0
    )
  )

Arrange rows with arrange()

Using the arrange() function, arrange the trial_act data in ascending order of age_y (from lowest to highest). After you do, look at the data to make sure it worked!

trial_act <- trial_act %>% 
 arrange(age_y)

trial_act

# A tibble: 2,139 x 30
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 940533    12      41.4     1     0     0    100      0     0      1
 2 950037    12      53.1     1     0     0    100      0     1      1
 3 950056    12      31       1     0     0    100      0     1      1
 4 910034    13      32.7     1     0     0    100      0     1      1
 5 940534    13      62.9     1     0     0    100      0     0      1
 6 960014    13      48.5     1     0     0    100      0     0      1
 7 310767    14      65       1     0     0    100      0     1      1
 8 920050    14      54.2     1     0     0    100      0     1      1
 9 940544    14      41.1     1     0     0    100      0     0      1
10 950061    14      64.3     1     0     0    100      0     1      1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   agem <dbl>, gender_char <chr>, over50 <dbl>

Now arrange the data in descending order of age_y (from highest to lowest). Afterwards, look at the data to make sure it worked. To arrange data in descending order, just wrap the variable in desc(), e.g. data %>% arrange(desc(height)).

trial_act <- trial_act %>% 
 arrange(desc(age_y))

trial_act

# A tibble: 2,139 x 30
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1  11438    70      73.9     0     1     0    100      0     1      1
 2 211360    70      63.1     0     1     0     90      0     0      1
 3 211284    69      81.6     0     1     0    100      0     0      1
 4  50580    68      90.5     0     1     1    100      0     1      1
 5  81127    68      70.8     0     1     0     90      0     1      1
 6  10924    67      71       0     1     0    100      0     0      1
 7 241150    67      82.1     0     1     0     90      0     0      1
 8  50662    66      84.4     0     1     0    100      0     0      1
 9  11987    65      77.2     0     1     0     90      0     0      1
10 140797    65      60.5     0     1     0     90      0     0      1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   agem <dbl>, gender_char <chr>, over50 <dbl>

You can sort the rows of a dataframe by multiple columns by passing several arguments to arrange(). Now sort the data by trial arm (arms) and then age (age_y).

trial_act <- trial_act %>% 
 arrange(arms, age_y)

trial_act

# A tibble: 2,139 x 30
   pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30 zprior
    <int> <int>     <dbl> <int> <int> <int>  <int>  <int> <int>  <int>
 1 960014    13      48.5     1     0     0    100      0     0      1
 2 960031    14      48.3     1     0     0    100      0     0      1
 3 990071    14      60       1     0     0    100      0     0      1
 4 980042    16      63       1     0     0    100      0     1      1
 5 171040    17      51.3     0     0     0     90      0     0      1
 6 990026    17     103.      1     0     0    100      0     1      1
 7 310234    18      57.3     1     0     0    100      0     1      1
 8 940519    18      56.8     1     0     0    100      0     1      1
 9 211314    19      50.8     0     0     0     90      0     0      1
10 340767    19      74.8     0     1     0    100      0     0      1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
#   race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
#   treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
#   r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
#   agem <dbl>, gender_char <chr>, over50 <dbl>
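Ascending and descending orders can also be mixed within a single arrange() call. Here is a minimal sketch on an invented four-row tibble (the object names toy and sorted are hypothetical) that sorts by arm in ascending order and, within each arm, by age in descending order.

```r
library(dplyr)

# Toy data (values are made up)
toy <- tibble(
  arms  = c(1, 0, 1, 0),
  age_y = c(30, 45, 52, 28)
)

# Ascending arms; within each arm, oldest patients first
sorted <- toy %>%
  arrange(arms, desc(age_y))

sorted
```

The order of the arguments sets the priority: later columns only break ties in the earlier ones.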

Filter specific rows with `filter()`

Using the filter() function, create a new dataframe called trial_act_m that only contains data from males (gender_char == "male")

trial_act_m <- trial_act %>%
  filter(gender_char == "male")

A colleague of yours named Tracy wants a dataframe only containing data from females over the age of 40. Create this dataframe with filter() and call it trial_act_Tracy.

trial_act_Tracy <- trial_act %>%
  filter(age_y > 40 & gender == 0)
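There is more than one way to combine criteria in filter(). The sketch below, on an invented tibble (toy and the with_* objects are hypothetical names), shows that separating conditions with a comma behaves like &, while | keeps rows that meet at least one condition.

```r
library(dplyr)

# Toy data (values are made up); gender coded 0 = female, 1 = male
toy <- tibble(
  age_y  = c(35, 42, 47, 61),
  gender = c(0, 0, 1, 0)
)

# Both conditions must hold
with_and <- toy %>% filter(age_y > 40 & gender == 0)

# Comma-separated conditions are combined with "and" as well
with_comma <- toy %>% filter(age_y > 40, gender == 0)

# At least one condition must hold
with_or <- toy %>% filter(age_y > 40 | gender == 0)

nrow(with_and)
nrow(with_comma)
nrow(with_or)
```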

Combine dataframes with `left_join()`

The trial_act_demo_fake.csv file contains additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week and their highest level of education. Use the left_join() function to combine the trial_act and trial_act_demo_fake datasets. Set the by argument to the name of the column that is common to both data sets. This will be the key! Assign the result to trial_act. When you are done, look at the trial_act dataframe to make sure your code worked!

trial_act_demo_fake <- read_csv("1_Data/trial_act_demo_fake.csv")

Parsed with column specification:
cols(
  pidnum = col_integer(),
  exercise = col_integer(),
  education = col_character()
)

trial_act <- trial_act %>%
  left_join(trial_act_demo_fake, by = "pidnum")

Using your new trial_act dataframe, which should contain the exercise data, calculate the mean number of days of exercise that patients reported.

mean(trial_act$exercise)

[1] 2.203834
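The mean above is a number rather than NA, so the joined data has no missing exercise values. That will not always be the case: left_join() keeps every row of the left dataframe and fills in NA wherever the key has no match on the right. A minimal sketch with two invented tibbles (patients, demo and joined are hypothetical names):

```r
library(dplyr)

# Toy data (values are made up): patient 3 has no demographic record
patients <- tibble(pidnum = c(1, 2, 3), age_y = c(30, 40, 50))
demo     <- tibble(pidnum = c(1, 2), exercise = c(3, 1))

# All three patients are kept; the unmatched one gets NA for exercise
joined <- patients %>%
  left_join(demo, by = "pidnum")

joined

# A single NA makes the plain mean NA, so drop missing values explicitly
mean(joined$exercise)
mean(joined$exercise, na.rm = TRUE)
```

Checking for NA values right after a join is a cheap way to find keys that failed to match.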

Calculate grouped statistics with `group_by()` and `summarise()`

In this code we’ll calculate summary statistics for each of the trial arms. Start with the trial_act dataframe. Then, group the data by arms. Then, for each arm, calculate the mean participant age (in years) as a new column called age_mean. Also, using N = n(), calculate the number of cases for each group. Assign the result to a new object called trial_arm.

trial_arm <- trial_act %>% 
  group_by(arms) %>%
  summarise(
    age_mean = mean(age_y),
    N = n()
  )

Now adjust your previous code to calculate the median Karnofsky score (karnof) in addition to the mean age, as a new column called karnof_median.

trial_act %>% 
  group_by(arms) %>%
  summarise(
    age_mean = mean(age_y),
    karnof_median = median(karnof)
  )

# A tibble: 4 x 3
   arms age_mean karnof_median
  <int>    <dbl>         <dbl>
1     0     35.2           100
2     1     35.2           100
3     2     35.4           100
4     3     35.1           100

Add code to calculate the median number of days until the first major negative event (days) for each arm.

trial_act %>% 
  group_by(arms) %>%
  summarise(
    age_mean = mean(age_y),
    karnof_median = median(karnof),
    days_median = median(days)
  )

# A tibble: 4 x 4
   arms age_mean karnof_median days_median
  <int>    <dbl>         <dbl>       <dbl>
1     0     35.2           100        934.
2     1     35.2           100       1014 
3     2     35.4           100       1012 
4     3     35.1           100       1000

Your code so far only creates groups based on arm. Now adjust the code so it creates groups based on gender_char only (that is, forget about the trial arm). Keep calculating all of the same summary statistics as you did before. Assign the result to a new dataframe called trial_gender.

trial_gender <- trial_act %>% 
  group_by(gender_char) %>%
  summarise(
    age_mean = mean(age_y),
    karnof_median = median(karnof),
    days_median = median(days)
  )

trial_gender

# A tibble: 2 x 4
  gender_char age_mean karnof_median days_median
  <chr>          <dbl>         <dbl>       <dbl>
1 female          34.3           100         982
2 male            35.4           100        1002

Now group the data by both arms and gender_char and calculate the same summary statistics! Assign the result to a new object called trial_arm_gen.

trial_arm_gen <- trial_act %>% 
  group_by(gender_char, arms) %>%
  summarise(
    age_mean = mean(age_y),
    karnof_median = median(karnof),
    days_median = median(days)
  )

trial_arm_gen

# A tibble: 8 x 5
# Groups:   gender_char [?]
  gender_char  arms age_mean karnof_median days_median
  <chr>       <int>    <dbl>         <dbl>       <dbl>
1 female          0     33.6           100        936.
2 female          1     33.5           100        995 
3 female          2     35.0           100       1004 
4 female          3     35.2           100        974 
5 male            0     35.6           100        932.
6 male            1     35.6           100       1024 
7 male            2     35.5           100       1013 
8 male            3     35.1           100       1010
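Notice the "# Groups:" line in the output above: after summarising over two grouping columns, the result is still grouped by the first one. The sketch below, on an invented tibble (toy, by_both and by_both_flat are hypothetical names), shows how to inspect the remaining grouping with group_vars() and remove it with ungroup().

```r
library(dplyr)

# Toy data (values are made up)
toy <- tibble(
  gender_char = c("female", "female", "male", "male"),
  arms        = c(0, 1, 0, 0),
  age_y       = c(30, 40, 50, 60)
)

# summarise() peels off only the last grouping column (arms)
by_both <- toy %>%
  group_by(gender_char, arms) %>%
  summarise(age_mean = mean(age_y), N = n())

# The result is still grouped by gender_char
group_vars(by_both)

# ungroup() removes any remaining grouping
by_both_flat <- by_both %>% ungroup()
group_vars(by_both_flat)
```

Ungrouping matters when later steps such as mutate() or arrange() should operate on the whole table rather than within each remaining group.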

Reshaping with gather() and spread()

Remember the CD4_wide dataframe you created before? Currently it is in the wide format, where key data (different CD4 T cell counts) are in different columns. Our goal is to convert it to the ‘long’ format. Using the gather() function, create a new dataframe called CD4_long that shows the data in the ‘long’ format. To do this, use the following template. Set the grouping column to time and the new data column to value.

CD4_long <- CD4_wide %>% 
  gather(XX,  # New grouping column
         XX,  # New data column
         -pidnum, -arms)  # Names of columns to replicate

CD4_long <- CD4_wide %>% 
  gather(time,  # New grouping column
         value,  # New data column
         -pidnum, -arms)  # Names of columns to replicate

Now that your data are in the long format, it should be easy to calculate grouped summary statistics! For each time point and trial arm, calculate the mean CD4 T cell count using group_by() and summarise().

CD4_long %>%
  group_by(time, arms) %>%
  summarise(
    cd4_mean = mean(value, na.rm = TRUE)
  )

# A tibble: 12 x 3
# Groups:   time [?]
   time   arms cd4_mean
   <chr> <int>    <dbl>
 1 cd40      0     353.
 2 cd40      1     349.
 3 cd40      2     353.
 4 cd40      3     347.
 5 cd420     0     336.
 6 cd420     1     403.
 7 cd420     2     372.
 8 cd420     3     374.
 9 cd496     0     288.
10 cd496     1     341.
11 cd496     2     355.
12 cd496     3     329.

Now it’s time to practice moving data from the long to the wide format. Using the following template, use the spread() function to convert CD4_long back to the wide format. Assign the result to a new object called CD4_wide_2. It should look exactly like CD4_wide!

CD4_wide_2 <- CD4_long %>% 
  spread(time,   # old group column
         value)   # old target column

CD4_wide_2
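To convince yourself that the round trip really returns the original data, you can try it on a tiny example first. The sketch below uses an invented tibble (wide, long and the other object names are hypothetical). It also shows pivot_longer() and pivot_wider(), which do the same job; this part assumes tidyr 1.0.0 or later, where gather() and spread() still work but have been superseded by these functions.

```r
library(dplyr)
library(tidyr)

# Toy wide data (values are made up)
wide <- tibble(
  pidnum = c(1, 2),
  arms   = c(0, 1),
  cd40   = c(300, 350),
  cd420  = c(320, 400)
)

# Wide to long: one row per patient and time point
long <- wide %>%
  gather(time, value, -pidnum, -arms)

# Long back to wide: one column per time point again
wide_again <- long %>%
  spread(time, value)

# Same reshaping with the newer functions (assumes tidyr >= 1.0.0)
long_2 <- wide %>%
  pivot_longer(c(cd40, cd420), names_to = "time", values_to = "value")

wide_2 <- long_2 %>%
  pivot_wider(names_from = time, values_from = value)

long
wide_again
wide_2
```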

Play around with “Scoped” functions

Many common dplyr functions like mutate() and summarise() have ‘scoped’ versions with suffixes like _if and _all that allow you to do some pretty cool stuff easily (look at the help menu with ?scoped for details). Try running the following chunk with summarise_if() and see what happens:

baselers <- read_csv("1_Data/baselers.csv")

Parsed with column specification:
cols(
  .default = col_integer(),
  sex = col_character(),
  height = col_double(),
  weight = col_double(),
  income = col_double(),
  education = col_character(),
  confession = col_character(),
  food = col_double(),
  fasnacht = col_character(),
  eyecor = col_character()
)

See spec(...) for full column specifications.

# See how summarise_if() works!
baselers %>%
  group_by(sex) %>%
  summarise_if(is.numeric, mean)

# A tibble: 2 x 16
  sex       id   age height weight income children happiness fitness  food
  <chr>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>    <dbl>     <dbl>   <dbl> <dbl>
1 female 5016.  45.4   164.     NA     NA     1.81      6.91    5.12    NA
2 male   4985.  43.8   178.     NA     NA     1.82      6.90    5.13    NA
# ... with 6 more variables: alcohol <dbl>, tattoos <dbl>, rhine <dbl>,
#   datause <dbl>, consultations <dbl>, hiking <dbl>

Now, in the trial_act dataset, group the data by arms and calculate the mean of all numeric columns using summarise_if(). Some columns contain missing values, so pass na.rm = TRUE along to mean().

trial_act %>%
  group_by(arms) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

# A tibble: 4 x 30
   arms  pidnum age_y weight_kg   hemo  homo drugs karnof oprior   z30
  <int>   <dbl> <dbl>     <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1     0 252884.  35.2      76.1 0.0789 0.641 0.118   95.4 0.0301 0.547
2     1 232609.  35.2      74.9 0.0824 0.663 0.140   95.5 0.0172 0.552
3     2 257593.  35.4      74.7 0.0878 0.664 0.145   95.7 0.0248 0.561
4     3 251696.  35.1      74.9 0.0873 0.676 0.123   95.1 0.0160 0.542
# ... with 20 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
#   gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
#   offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
#   cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
#   over50 <dbl>, exercise <dbl>

Here’s another scoped function, mutate_if(), in action:

# use mutate_if() to round all numeric variables to 2 digits
baselers %>%
  mutate_if(is.numeric, round, 2)

Using mutate_if(), round all of your results from the previous question to 0 decimal places (to the nearest integer)

trial_act %>%
  group_by(arms) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  mutate_if(is.numeric, round, 0)

# A tibble: 4 x 30
   arms pidnum age_y weight_kg  hemo  homo drugs karnof oprior   z30
  <dbl>  <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1     0 252884    35        76     0     1     0     95      0     1
2     1 232609    35        75     0     1     0     96      0     1
3     2 257593    35        75     0     1     0     96      0     1
4     3 251696    35        75     0     1     0     95      0     1
# ... with 20 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
#   gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
#   offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
#   cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
#   over50 <dbl>, exercise <dbl>
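If you are working with dplyr 1.0.0 or later (an assumption about your installation), the same pattern can be written with across(), which the dplyr documentation recommends over the scoped functions in new code. The sketch below runs both styles on an invented tibble (toy, scoped and with_across are hypothetical names) so you can compare them side by side.

```r
library(dplyr)

# Toy data (values are made up); site is a non-numeric column
toy <- tibble(
  arms      = c(0, 0, 1, 1),
  age_y     = c(30, 42, 50, NA),
  weight_kg = c(70.26, 80.14, 65.5, 90.1),
  site      = c("a", "b", "a", "b")
)

# Scoped style, as used in this practical
scoped <- toy %>%
  group_by(arms) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  mutate_if(is.numeric, round, 0)

# across() style (assumes dplyr >= 1.0.0)
with_across <- toy %>%
  group_by(arms) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  mutate(across(where(is.numeric), ~ round(.x, 0)))

scoped
with_across
```

In both versions the non-numeric site column is left out of the summary, and the grouping column arms is kept as the row identifier.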

More practice

Now let’s check the major differences between the treatment arms. For each arm, calculate the following:

Mean days until a major negative event (days)
Mean CD4 T cell count at baseline. (cd40)
Mean CD4 T cell count at 20 weeks. (cd420)
Mean CD4 T cell count at 96 weeks. (cd496)
Mean change in CD4 T cell count between baseline and 96 weeks
Number of patients in each arm

trial_act %>%
  group_by(arms) %>%
  summarise(
    days_mean = mean(days),
    cd4_bl = mean(cd40),
    cd4_20 = mean(cd420),
    cd4_96 = mean(cd496, na.rm = TRUE),
    cd4_change = mean(cd496 - cd40, na.rm = TRUE),
    N = n()
  )

# A tibble: 4 x 7
   arms days_mean cd4_bl cd4_20 cd4_96 cd4_change     N
  <int>     <dbl>  <dbl>  <dbl>  <dbl>      <dbl> <int>
1     0      801.   353.   336.   288.     -77.5    532
2     1      916.   349.   403.   341.      -6.90   522
3     2      906.   353.   372.   355.      -4.36   524
4     3      893.   347.   374.   329.     -19.4    561

Repeat the previous analysis, but before you do the grouping and summary statistics, create a new variable called arms_char that shows the values of arms as characters that reflect what the values actually represent (hint: use mutate() and case_when()). For example, looking at the help file ?ACTG175, I can see that the treatment arm of 0 is “zidovudine”. I might call this arm "Z". Do this all in the same chunk of code.

trial_act %>%
  mutate(
    arms_char = case_when(
      arms == 0 ~ "Z",
      arms == 1 ~ "ZD",
      arms == 2 ~ "ZZ",
      arms == 3 ~ "D"
    )
  ) %>%
  group_by(arms_char) %>%
  summarise(
    days_mean = mean(days),
    cd4_bl = mean(cd40),
    cd4_20 = mean(cd420),
    cd4_96 = mean(cd496, na.rm = TRUE),
    cd4_change = mean(cd496 - cd40, na.rm = TRUE)
  )

# A tibble: 4 x 6
  arms_char days_mean cd4_bl cd4_20 cd4_96 cd4_change
  <chr>         <dbl>  <dbl>  <dbl>  <dbl>      <dbl>
1 D              893.   347.   374.   329.     -19.4 
2 Z              801.   353.   336.   288.     -77.5 
3 ZD             916.   349.   403.   341.      -6.90
4 ZZ             906.   353.   372.   355.      -4.36

Create the following dataframe that shows patients’ mean and median CD8 T cell counts (from columns cd80 and cd820), where the data are grouped by time and trial arm. (Hint: use the following functions in order: mutate(), select(), gather(), mutate(), group_by(), summarise())

trial_act %>%
  mutate(
    arms_char = case_when(
      arms == 0 ~ "Z",
      arms == 1 ~ "ZD",
      arms == 2 ~ "ZZ",
      arms == 3 ~ "D"
    )
  ) %>% 
  select(pidnum, arms_char, starts_with("cd8")) %>%
  gather(time, measure, -pidnum, -arms_char) %>%
  mutate(time = case_when(time == "cd80" ~ "baseline",
                          time == "cd820" ~ "week 20")) %>%
  group_by(time, arms_char) %>%
  summarise(N = n(),
            cd8_mean = mean(measure),
            cd8_median = median(measure))

# A tibble: 8 x 5
# Groups:   time [?]
  time     arms_char     N cd8_mean cd8_median
  <chr>    <chr>     <int>    <dbl>      <dbl>
1 baseline D           561     972.       890 
2 baseline Z           532     987.       881 
3 baseline ZD          522    1004.       917 
4 baseline ZZ          524     984.       898.
5 week 20  D           561     943.       871 
6 week 20  Z           532     928.       818 
7 week 20  ZD          522     968.       903 
8 week 20  ZZ          524     902.       862

Data wrangling

The R Bootcamp @ Erfurt | www.therbootcamp.com | @therbootcamp