In this practical you’ll practice “data wrangling” with the dplyr and tidyr packages (part of the tidyverse collection of packages).
By the end of this practical you will know how to:
Data wrangling with dplyr and tidyr
Here are the main functions you will be using in dplyr:
Function | Description | Example |
---|---|---|
filter() | Select rows based on some criteria | data %>% filter(age > 40 & sex == "m") |
arrange() | Sort rows | data %>% arrange(date, group) |
select() | Select columns | data %>% select(age, sex); data %>% select(age, sex, everything()) |
rename() | Rename columns | data %>% rename(new = old) |
mutate() | Add new columns | data %>% mutate(height.m = height.cm / 100) |
case_when() | Recode values of a column | data %>% mutate(sex_n = case_when(sex == 0 ~ "m", sex == 1 ~ "f")) |
group_by(), summarise() | Group data and then calculate summary statistics | data %>% group_by(…) %>% summarise(…) |
left_join() | Combine multiple data sets using a key column | data %>% left_join(data2, by = "id") |
Here are the two functions you will be using from the tidyr package:
Function | Description | Example |
---|---|---|
spread() | Convert long data to wide format - from rows to columns | data %>% spread(time, data) |
gather() | Convert wide data to long format - from columns to rows | data %>% gather(time, data, -id) |
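To make the two tidyr functions concrete, here is a minimal sketch on a made-up tibble. The object and column names (`scores_wide`, `t1`, `t2`, `score`) are hypothetical and are not part of the practical's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per person, one column per time point
scores_wide <- tibble(id = c(1, 2, 3),
                      t1 = c(10, 20, 30),
                      t2 = c(11, 21, 31))

# Wide to long: old column names go into `time`, cell values into `score`
scores_long <- scores_wide %>%
  gather(time, score, -id)

# Long to wide: the reverse operation recovers the original layout
scores_back <- scores_long %>%
  spread(time, score)
```

After gather() the data has one row per person and time point (six rows here), and spread() should bring it back to three rows with columns `id`, `t1`, and `t2`.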
The following examples will take you through the steps of doing data wrangling with dplyr. Try to go through each line of code and see how it works!
# -----------------------------------------------
# Examples of using dplyr on the baselers data
# ------------------------------------------------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(skimr)
# Load baselers data
baselers <- read_csv("https://raw.githubusercontent.com/therbootcamp/baselers/master/inst/extdata/baselers.txt")
# Skim the data
skim(baselers)
baselers <- baselers %>%
  # Change some names
  rename(age_y = age,
         swimming = rhine) %>%
  # Only include people over 30
  filter(age_y > 30) %>%
  # Calculate some new columns
  mutate(weight_lbs = weight * 2.2,
         height_m = height / 100,
         BMI = weight / height_m ^ 2,
         # Make binary version of sex
         sex_bin = case_when(
           sex == "male" ~ 0,
           sex == "female" ~ 1,
           TRUE ~ NA_real_),
         # Flag when height is less than 150
         height_lt_150 = case_when(
           height < 150 ~ 1,
           height >= 150 ~ 0,
           TRUE ~ NA_real_
         )) %>%
  # Sort in ascending order of sex, then
  # descending order of age
  arrange(sex, desc(age_y))
# Calculate grouped summary statistics
baselers_agg <- baselers %>%
group_by(sex, education) %>%
summarise(
age_mean = mean(age_y, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
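The `TRUE ~ NA_real_` line in the case_when() calls above acts as a catch-all for rows that match none of the earlier conditions. The sketch below shows this on a made-up tibble; the `people` object and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one unexpected value and one missing value
people <- tibble(sex = c("male", "female", "other", NA))

people <- people %>%
  mutate(sex_bin = case_when(
    sex == "male"   ~ 0,
    sex == "female" ~ 1,
    TRUE            ~ NA_real_))  # everything else becomes NA
```

Here the rows with "other" and with a missing `sex` both end up with `sex_bin` equal to NA, because neither of the first two conditions is TRUE for them.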
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |

File | Rows | Columns |
---|---|---|
trial_act.csv | 2139 | 27 |
trial_act_demo_fake.csv | 2139 | 3 |
A. Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that the trial_act.csv and trial_act_demo_fake.csv files are in your 1_Data folder.
# Done!
B. Open a new R script and save it as a new file called wrangling_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load the set of packages for this practical with library(). Here’s how the top of your script should look:
## NAME
## DATE
## Wrangling Practical
library(XX)
library(XX)
#...
C. For this practical, we’ll use the trial_act data. This is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act.
# Load trial_act.csv from the data folder in your working directory
trial_act <- read_csv(file = "XXX/XXX")
trial_act <- read_csv(file = "1_Data/trial_act.csv")
Parsed with column specification:
cols(
.default = col_integer(),
wtkg = col_double()
)
See spec(...) for full column specifications.
D. Using the same code structure, load the trial_act_demo_fake.csv data as a new dataframe called trial_act_demo_fake.
E. The trial_act data is actually a copy of a dataset from the speff2trial package called ACTG175. Look at the help menu for the ACTG175 data by running ?ACTG175. (If you become really interested in the data, you can also read an article discussing the trial here: http://www.nejm.org/doi/full/10.1056/nejm199610103351501#t=article)
# Look at documentation for ACTG175 data (contained in the speff2trial package)
?ACTG175
F. Take a look at the first few rows of the datasets by printing them to the console.
# Print trial_act object
trial_act
G. Use the skim() function (from the skimr package) to get more details on the datasets.
skim(trial_act)
Skim summary statistics
n obs: 2139
n variables: 27
── Variable type:integer ────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50
age 0 2139 2139 35.25 8.71 12 29 34
arms 0 2139 2139 1.52 1.13 0 1 2
cd40 0 2139 2139 350.5 118.57 0 263.5 340
cd420 0 2139 2139 371.31 144.63 49 269 353
cd496 797 1342 2139 328.57 174.66 0 209.25 321
cd80 0 2139 2139 986.63 480.2 40 654 893
cd820 0 2139 2139 935.37 444.98 124 631.5 865
cens 0 2139 2139 0.24 0.43 0 0 0
days 0 2139 2139 879.1 292.27 14 727 997
drugs 0 2139 2139 0.13 0.34 0 0 0
gender 0 2139 2139 0.83 0.38 0 1 1
hemo 0 2139 2139 0.084 0.28 0 0 0
homo 0 2139 2139 0.66 0.47 0 0 1
karnof 0 2139 2139 95.45 5.9 70 90 100
offtrt 0 2139 2139 0.36 0.48 0 0 0
oprior 0 2139 2139 0.022 0.15 0 0 0
pidnum 0 2139 2139 248778.25 234237.29 10056 81446.5 190566
preanti 0 2139 2139 379.18 468.66 0 0 142
r 0 2139 2139 0.63 0.48 0 0 1
race 0 2139 2139 0.29 0.45 0 0 0
str2 0 2139 2139 0.59 0.49 0 0 1
strat 0 2139 2139 1.98 0.9 1 1 2
symptom 0 2139 2139 0.17 0.38 0 0 0
treat 0 2139 2139 0.75 0.43 0 1 1
z30 0 2139 2139 0.55 0.5 0 0 1
zprior 0 2139 2139 1 0 1 1 1
p75 p100 hist
40 70 ▁▂▇▇▃▁▁▁
3 3 ▇▁▇▁▁▇▁▇
423 1199 ▁▆▇▃▁▁▁▁
460 1119 ▂▇▇▅▂▁▁▁
440 1190 ▃▇▇▅▁▁▁▁
1207 5011 ▃▇▂▁▁▁▁▁
1146.5 6035 ▇▇▁▁▁▁▁▁
0 1 ▇▁▁▁▁▁▁▂
1091 1231 ▁▂▂▂▂▂▇▇
0 1 ▇▁▁▁▁▁▁▁
1 1 ▂▁▁▁▁▁▁▇
0 1 ▇▁▁▁▁▁▁▁
1 1 ▅▁▁▁▁▁▁▇
100 100 ▁▁▁▁▁▅▁▇
1 1 ▇▁▁▁▁▁▁▅
0 1 ▇▁▁▁▁▁▁▁
280277 990077 ▇▇▅▁▁▁▁▂
739.5 2851 ▇▂▂▁▁▁▁▁
1 1 ▅▁▁▁▁▁▁▇
1 1 ▇▁▁▁▁▁▁▃
1 1 ▆▁▁▁▁▁▁▇
3 3 ▇▁▁▃▁▁▁▇
0 1 ▇▁▁▁▁▁▁▂
1 1 ▂▁▁▁▁▁▁▇
1 1 ▆▁▁▁▁▁▁▇
1 1 ▁▁▁▇▁▁▁▁
── Variable type:numeric ────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50 p75 p100
wtkg 0 2139 2139 75.13 13.26 31 66.68 74.39 82.56 159.94
hist
▁▂▇▅▁▁▁▁
In this practical you’ll be doing lots of sequential operations on the same data. Whenever you can, try to keep the pipe %>% going to connect each task to the one before it. See below for an example:
# Version A) Try to convert this...
baselers <- baselers %>% rename(age_y = age)
baselers <- baselers %>% rename(swimming = rhine)
baselers <- baselers %>% filter(age_y > 30)
# Version B) To this version where you keep the pipe going from the beginning!
baselers <- baselers %>%
rename(age_y = age,
swimming = rhine) %>%
filter(age_y > 30)
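If it helps to see what the pipe is doing, `x %>% f(y)` is another way of writing `f(x, y)`. The sketch below compares a nested call with the piped version on a made-up tibble (`toy` is hypothetical; it only borrows two column names from baselers).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(age = c(25, 35, 45), rhine = c(1, 0, 3))

# Nested function calls read from the inside out
nested <- filter(rename(toy, age_y = age), age_y > 30)

# The same steps with the pipe read from top to bottom
piped <- toy %>%
  rename(age_y = age) %>%
  filter(age_y > 30)
```

Both objects should contain the same two rows; the piped version is usually easier to read and extend.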
Let’s change some of the column names in trial_act. Using rename(), change the column name wtkg to weight_kg (to specify that weight is in kilograms). Be sure to assign the result back to trial_act to change it!
Change the column name age to age_y (to specify that age is in years).
trial_act <- trial_act %>%
rename(weight_kg = wtkg,
age_y = age)
Using the select() function, create a new dataframe called CD4_wide that only contains the columns pidnum, arms, cd40, cd420, and cd496. The cd40, cd420, and cd496 columns show patients’ CD4 T cell counts at baseline, 20 weeks, and 96 weeks. Print the result to make sure it worked!
Did you know you can easily select all columns that start with specific characters using starts_with()? Try adapting the following code to get the same result you got before.
CD4_wide <- trial_act %>%
select(pidnum, arms, starts_with("XXX"))
CD4_wide <- trial_act %>%
select(pidnum, arms, starts_with("cd"))
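Note that starts_with() matches on the literal prefix you give it, so the length of the prefix matters. The following sketch uses a made-up one-row tibble (`toy` is hypothetical) that reuses some of the trial_act column names to show the difference between two prefixes.

```r
library(dplyr)

# Hypothetical one-row tibble that reuses some trial_act column names
toy <- tibble(pidnum = 1, arms = 0, cd40 = 350, cd420 = 370,
              cd496 = 330, cd80 = 980, cd820 = 930)

# A short prefix matches more columns than a longer one
cd_all <- names(select(toy, starts_with("cd")))   # all five cd columns
cd4    <- names(select(toy, starts_with("cd4")))  # only the three CD4 columns
```

It is worth printing the column names after any starts_with() selection to check that you kept exactly the columns you intended.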
Using the mutate() function, add the following new columns to trial_act. Try combining these into one call to the mutate() function!
agem: Each patient’s age in months instead of years (Hint: Just multiply age_y by 12!).
weight_lb: Weight in lbs instead of kilograms. You can do this by multiplying weight_kg by 2.2.
cd_change_20: Change in CD4 T cell count from baseline to 20 weeks. You can do this by taking cd420 - cd40.
cd_change_960: Change in CD4 T cell count from baseline to 96 weeks. You can do this by taking cd496 - cd40.
trial_act <- trial_act %>%
mutate(agem = age_y * 12)
Create a new column called gender_char that shows gender as a character string. To do this, use a combination of mutate() and case_when(). The original gender data is stored in the gender column, where 0 = “female” and 1 = “male”.
trial_act <- trial_act %>%
mutate(
gender_char = case_when(
gender == 0 ~ "female",
gender == 1 ~ "male"
)
)
Create a new column called over50 that is 1 when patients are older than 50, and 0 when they are younger than or equal to 50 (hint: use the logical comparisons > and <=).
trial_act <- trial_act %>%
mutate(
over50 = case_when(
age_y > 50 ~ 1,
age_y <= 50 ~ 0
)
)
Now do all of the previous steps in a single call to mutate(). That is, in one block of code, create agem, weight_lb, cd_change_20, cd_change_960, gender_char and over50 using the mutate() function only once. Here’s how your code should look:
trial_act <- trial_act %>%
mutate(
agem = XXX,
weight_lb = XXX,
cd_change_20 = XXX,
cd_change_960 = XXX,
gender_char = case_when(XXX),
over50 = case_when(XXX)
)
trial_act <- trial_act %>%
mutate(
agem = age_y * 12,
weight_lb = weight_kg * 2.2,
cd_change_20 = cd420 - cd40,
cd_change_960 = cd496 - cd40,
gender_char = case_when(
gender == 0 ~ "female",
gender == 1 ~ "male"
),
over50 = case_when(
age_y > 50 ~ 1,
age_y <= 50 ~ 0
)
)
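One reason a single mutate() call works well is that later columns can use columns created earlier in the same call, as the BMI example at the top of this practical does with height_m. A minimal sketch on made-up data (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: weight in kg, height in cm
toy <- tibble(weight = c(60, 80), height = c(150, 200))

toy <- toy %>%
  mutate(height_m = height / 100,        # created first...
         BMI = weight / height_m ^ 2)    # ...and used right away
```

The order of the arguments matters here: `BMI` has to come after `height_m` inside the call.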
Using the arrange() function, arrange the trial_act data in ascending order of age_y (from lowest to highest). After you do, look at the data to make sure it worked!
trial_act <- trial_act %>%
arrange(age_y)
trial_act
# A tibble: 2,139 x 30
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 940533 12 41.4 1 0 0 100 0 0 1
2 950037 12 53.1 1 0 0 100 0 1 1
3 950056 12 31 1 0 0 100 0 1 1
4 910034 13 32.7 1 0 0 100 0 1 1
5 940534 13 62.9 1 0 0 100 0 0 1
6 960014 13 48.5 1 0 0 100 0 0 1
7 310767 14 65 1 0 0 100 0 1 1
8 920050 14 54.2 1 0 0 100 0 1 1
9 940544 14 41.1 1 0 0 100 0 0 1
10 950061 14 64.3 1 0 0 100 0 1 1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# agem <dbl>, gender_char <chr>, over50 <dbl>
Now arrange the data in descending order of age_y (from highest to lowest). Afterwards, look at the data to make sure it worked. To arrange data in descending order, just include desc() around the variable, e.g., data %>% arrange(desc(height)).
trial_act <- trial_act %>%
arrange(desc(age_y))
trial_act
# A tibble: 2,139 x 30
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 11438 70 73.9 0 1 0 100 0 1 1
2 211360 70 63.1 0 1 0 90 0 0 1
3 211284 69 81.6 0 1 0 100 0 0 1
4 50580 68 90.5 0 1 1 100 0 1 1
5 81127 68 70.8 0 1 0 90 0 1 1
6 10924 67 71 0 1 0 100 0 0 1
7 241150 67 82.1 0 1 0 90 0 0 1
8 50662 66 84.4 0 1 0 100 0 0 1
9 11987 65 77.2 0 1 0 90 0 0 1
10 140797 65 60.5 0 1 0 90 0 0 1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# agem <dbl>, gender_char <chr>, over50 <dbl>
You can sort by multiple columns by passing several column names to arrange(). Now sort the data by arms (arms) and then age (age_y).
trial_act <- trial_act %>%
arrange(arms, age_y)
trial_act
# A tibble: 2,139 x 30
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 960014 13 48.5 1 0 0 100 0 0 1
2 960031 14 48.3 1 0 0 100 0 0 1
3 990071 14 60 1 0 0 100 0 0 1
4 980042 16 63 1 0 0 100 0 1 1
5 171040 17 51.3 0 0 0 90 0 0 1
6 990026 17 103. 1 0 0 100 0 1 1
7 310234 18 57.3 1 0 0 100 0 1 1
8 940519 18 56.8 1 0 0 100 0 1 1
9 211314 19 50.8 0 0 0 90 0 0 1
10 340767 19 74.8 0 1 0 100 0 0 1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# agem <dbl>, gender_char <chr>, over50 <dbl>
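Ascending and descending sorts can also be mixed in one arrange() call. Here is a small sketch on made-up data (`toy` is hypothetical and only borrows two column names from trial_act), sorting by arm first and then from oldest to youngest within each arm.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(arms = c(1, 0, 1, 0), age_y = c(30, 50, 40, 20))

# Ascending by arms, then descending by age within each arm
sorted <- toy %>%
  arrange(arms, desc(age_y))
```

The first column listed has priority; later columns only break ties among rows that share the same value in the earlier ones.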
filter()
Using the filter() function, create a new dataframe called trial_act_m that only contains data from males (gender_char == "male").
trial_act_m <- trial_act %>%
  filter(gender_char == "male")
Create a new dataframe that only contains data from females (gender == 0) older than 40 using filter() and call it trial_act_Tracy.
trial_act_Tracy <- trial_act %>%
  filter(age_y > 40 & gender == 0)
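When combining conditions in filter(), `&` means AND and `|` means OR; separating conditions with a comma behaves like `&`. The sketch below illustrates this on a made-up tibble (`toy` and the result names are hypothetical).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(age_y = c(35, 45, 55, 65), gender = c(0, 1, 0, 1))

# Conditions separated by commas are combined with AND, just like &
with_and   <- toy %>% filter(age_y > 40 & gender == 0)
with_comma <- toy %>% filter(age_y > 40, gender == 0)

# | means OR: keep rows that satisfy at least one condition
with_or <- toy %>% filter(age_y > 60 | gender == 0)
```

In this toy data the AND versions keep one row each, while the OR version keeps three.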
left_join()
The trial_act_demo_fake.csv file contains additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Use the left_join() function to combine the trial_act and trial_act_demo_fake datasets. Set the by argument to the name of the column that is common to both datasets. This will be the key! Assign the result to trial_act. When you are done, look at the trial_act dataframe to make sure your code worked!
trial_act_demo_fake <- read_csv("1_Data/trial_act_demo_fake.csv")
Parsed with column specification:
cols(
pidnum = col_integer(),
exercise = col_integer(),
education = col_character()
)
trial_act <- trial_act %>%
left_join(trial_act_demo_fake, by = "pidnum")
Using the trial_act dataframe, which should now contain the exercise data, calculate the mean number of days of exercise that patients reported.
mean(trial_act$exercise)
[1] 2.203834
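left_join() keeps every row of the left-hand dataframe, so a key without a match in the right-hand dataframe gets NA in the new columns. That does not appear to happen in this practical's data, but it is a common surprise elsewhere. A minimal sketch with made-up tables (`patients` and `demo` are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: patient 3 has no row in the demographic table
patients <- tibble(pidnum = c(1, 2, 3))
demo     <- tibble(pidnum = c(1, 2), exercise = c(2, 4))

joined <- patients %>%
  left_join(demo, by = "pidnum")

mean_raw   <- mean(joined$exercise)                # NA, patient 3 has no match
mean_clean <- mean(joined$exercise, na.rm = TRUE)  # 3
```

If a mean comes back as NA after a join, checking for unmatched keys is a good first step.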
group_by() and summarise()
Start with the trial_act dataframe. Then, group the data by arms. Then, for each arm, calculate the mean participant age (in years) as a new column called age_mean. Also, using N = n(), calculate the number of cases for each group. Assign the result to a new object called trial_arm.
trial_arm <- trial_act %>%
  group_by(arms) %>%
  summarise(
    age_mean = mean(age_y),
    N = n()
  )
trial_act %>%
group_by(arms) %>%
summarise(
age_mean = mean(age_y),
karnof_median = median(karnof)
)
# A tibble: 4 x 3
arms age_mean karnof_median
<int> <dbl> <dbl>
1 0 35.2 100
2 1 35.2 100
3 2 35.4 100
4 3 35.1 100
Now also calculate the median of the days column (days) for each arm.
trial_act %>%
group_by(arms) %>%
summarise(
age_mean = mean(age_y),
karnof_median = median(karnof),
days_median = median(days)
)
# A tibble: 4 x 4
arms age_mean karnof_median days_median
<int> <dbl> <dbl> <dbl>
1 0 35.2 100 934.
2 1 35.2 100 1014
3 2 35.4 100 1012
4 3 35.1 100 1000
Instead of grouping by arms, now adjust the code so it creates groups based on gender_char only (that is, forget about the trial arm). Keep calculating all of the same summary statistics as you did before. Assign the result to a new dataframe called trial_gender.
trial_gender <- trial_act %>%
group_by(gender_char) %>%
summarise(
age_mean = mean(age_y),
karnof_median = median(karnof),
days_median = median(days)
)
trial_gender
# A tibble: 2 x 4
gender_char age_mean karnof_median days_median
<chr> <dbl> <dbl> <dbl>
1 female 34.3 100 982
2 male 35.4 100 1002
Now group the data by both arms and gender_char and calculate the same summary statistics! Assign the result to a new object called trial_arm_gen.
trial_arm_gen <- trial_act %>%
  group_by(gender_char, arms) %>%
  summarise(
    age_mean = mean(age_y),
    karnof_median = median(karnof),
    days_median = median(days)
  )
trial_arm_gen
# A tibble: 8 x 5
# Groups: gender_char [?]
gender_char arms age_mean karnof_median days_median
<chr> <int> <dbl> <dbl> <dbl>
1 female 0 33.6 100 936.
2 female 1 33.5 100 995
3 female 2 35.0 100 1004
4 female 3 35.2 100 974
5 male 0 35.6 100 932.
6 male 1 35.6 100 1024
7 male 2 35.5 100 1013
8 male 3 35.1 100 1010
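With two grouping columns, summarise() returns one row per combination that actually occurs in the data. The sketch below shows this on a made-up tibble (`toy` and `toy_agg` are hypothetical) and adds ungroup() at the end, which is one way to make sure later steps are not affected by leftover grouping.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(gender_char = c("female", "female", "male", "male", "male"),
              arms        = c(0, 1, 0, 0, 1),
              age_y       = c(30, 40, 20, 40, 50))

toy_agg <- toy %>%
  group_by(gender_char, arms) %>%
  summarise(age_mean = mean(age_y),  # mean age per gender and arm
            N = n()) %>%             # number of rows per gender and arm
  ungroup()
```

Here `toy_agg` has four rows, and the male patients in arm 0 are the only group with N = 2.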
Remember the CD4_wide dataframe you created before? Currently it is in the wide format, where key data (different CD4 T cell counts) are in different columns. Now we will convert it to the ‘long’ format. Using the gather() function, create a new dataframe called CD4_long that shows the data in the ‘long’ format. To do this, use the following template. Set the grouping column to time and the new data column to value.
CD4_long <- CD4_wide %>%
gather(XX, # New grouping column
XX, # New data column
-pidnum, -arms) # Names of columns to replicate
CD4_long <- CD4_wide %>%
gather(time, # New grouping column
value, # New data column
-pidnum, -arms) # Names of columns to replicate
Now calculate the mean CD4 T cell count for each time point and trial arm from CD4_long using group_by() and summarise().
CD4_long %>%
group_by(time, arms) %>%
summarise(
cd4_mean = mean(value, na.rm = TRUE)
)
# A tibble: 20 x 3
# Groups: time [?]
time arms cd4_mean
<chr> <int> <dbl>
1 cd40 0 353.
2 cd40 1 349.
3 cd40 2 353.
4 cd40 3 347.
5 cd420 0 336.
6 cd420 1 403.
7 cd420 2 372.
8 cd420 3 374.
9 cd496 0 288.
10 cd496 1 341.
11 cd496 2 355.
12 cd496 3 329.
13 cd80 0 987.
14 cd80 1 1004.
15 cd80 2 984.
16 cd80 3 972.
17 cd820 0 928.
18 cd820 1 968.
19 cd820 2 902.
20 cd820 3 943.
Now use the spread() function to convert CD4_long back to the wide format. Assign the result to a new object called CD4_wide_2. It should look exactly like CD4_wide!
CD4_wide_2 <- CD4_long %>%
spread(time, # old group column
value) # old target column
CD4_wide_2
Many common dplyr functions like mutate() and summarise() have ‘scoped’ versions with suffixes like _if and _all that allow you to do some pretty cool stuff easily (look at the help menu with ?scoped for details). Try running the following chunk with summarise_if() and see what happens:
baselers <- read_csv("1_Data/baselers.csv")
Parsed with column specification:
cols(
.default = col_integer(),
sex = col_character(),
height = col_double(),
weight = col_double(),
income = col_double(),
education = col_character(),
confession = col_character(),
food = col_double(),
fasnacht = col_character(),
eyecor = col_character()
)
See spec(...) for full column specifications.
# See how summarise_if() works!
baselers %>%
group_by(sex) %>%
summarise_if(is.numeric, mean)
# A tibble: 2 x 16
sex id age height weight income children happiness fitness food
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 female 5016. 45.4 164. NA NA 1.81 6.91 5.12 NA
2 male 4985. 43.8 178. NA NA 1.82 6.90 5.13 NA
# ... with 6 more variables: alcohol <dbl>, tattoos <dbl>, rhine <dbl>,
# datause <dbl>, consultations <dbl>, hiking <dbl>
Using the trial_act dataset, group the data by arms and calculate the mean of all numeric columns using summarise_if().
trial_act %>%
group_by(arms) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
# A tibble: 4 x 30
arms pidnum age_y weight_kg hemo homo drugs karnof oprior z30
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 252884. 35.2 76.1 0.0789 0.641 0.118 95.4 0.0301 0.547
2 1 232609. 35.2 74.9 0.0824 0.663 0.140 95.5 0.0172 0.552
3 2 257593. 35.4 74.7 0.0878 0.664 0.145 95.7 0.0248 0.561
4 3 251696. 35.1 74.9 0.0873 0.676 0.123 95.1 0.0160 0.542
# ... with 20 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
# gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
# offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
# cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
# over50 <dbl>, exercise <dbl>
Here’s another scoped function, mutate_if(), in action:
# use mutate_if() to round all numeric variables to 2 digits
baselers %>%
mutate_if(is.numeric, round, 2)
Using mutate_if(), round all of your results from the previous question to 0 decimal places (to the nearest integer).
trial_act %>%
group_by(arms) %>%
summarise_if(is.numeric, mean, na.rm = TRUE) %>%
mutate_if(is.numeric, round, 0)
# A tibble: 4 x 30
arms pidnum age_y weight_kg hemo homo drugs karnof oprior z30
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 252884 35 76 0 1 0 95 0 1
2 1 232609 35 75 0 1 0 96 0 1
3 2 257593 35 75 0 1 0 96 0 1
4 3 251696 35 75 0 1 0 95 0 1
# ... with 20 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
# gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
# offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
# cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
# over50 <dbl>, exercise <dbl>
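Extra arguments given to summarise_if() and mutate_if() are passed on to the function being applied, which is how `na.rm = TRUE` and the number of digits reach mean() and round() above. A small sketch on made-up data (`toy` and `toy_means` are hypothetical) with one missing weight:

```r
library(dplyr)

# Hypothetical toy data with a missing weight
toy <- tibble(sex    = c("f", "f", "m"),
              height = c(160.24, 170.24, 180.52),
              weight = c(60, NA, 80))

toy_means <- toy %>%
  group_by(sex) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%  # na.rm goes to mean()
  mutate_if(is.numeric, round, 1)                   # 1 goes to round() as digits
```

Without `na.rm = TRUE`, the mean weight for the "f" group would be NA, which is the same reason some columns show NA in the baselers output earlier in this section.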
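For each trial arm, calculate the following summary statistics:
Mean number of days (days)
Mean CD4 T cell count at baseline (cd40)
Mean CD4 T cell count at 20 weeks (cd420)
Mean CD4 T cell count at 96 weeks (cd496)
trial_act %>%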
group_by(arms) %>%
summarise(
days_mean = mean(days),
cd4_bl = mean(cd40),
cd4_20 = mean(cd420),
cd4_96 = mean(cd496, na.rm = TRUE),
cd4_change = mean(cd496 - cd40, na.rm = TRUE),
N = n()
)
# A tibble: 4 x 7
arms days_mean cd4_bl cd4_20 cd4_96 cd4_change N
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 801. 353. 336. 288. -77.5 532
2 1 916. 349. 403. 341. -6.90 522
3 2 906. 353. 372. 355. -4.36 524
4 3 893. 347. 374. 329. -19.4 561
Now create a new column called arms_char that shows the values of arms as characters that reflect what the values actually represent (hint: use mutate() and case_when()). For example, looking at the help file ?ACTG175, I can see that the treatment arm of 0 is “zidovudine”. I might call this arm "Z". Do this all in the same chunk of code as the summary statistics.
trial_act %>%
mutate(
arms_char = case_when(
arms == 0 ~ "Z",
arms == 1 ~ "ZD",
arms == 2 ~ "ZZ",
arms == 3 ~ "D"
)
) %>%
group_by(arms_char) %>%
summarise(
days_mean = mean(days),
cd4_bl = mean(cd40),
cd4_20 = mean(cd420),
cd4_96 = mean(cd496, na.rm = TRUE),
cd4_change = mean(cd496 - cd40, na.rm = TRUE)
)
# A tibble: 4 x 6
arms_char days_mean cd4_bl cd4_20 cd4_96 cd4_change
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D 893. 347. 374. 329. -19.4
2 Z 801. 353. 336. 288. -77.5
3 ZD 916. 349. 403. 341. -6.90
4 ZZ 906. 353. 372. 355. -4.36
Create a table of summary statistics for the CD8 T cell counts (cd80 and cd820), where the data are grouped by time and trial arm. (Hint: use the following functions in order: select(), gather(), mutate(), group_by(), summarise())
trial_act %>%
mutate(
arms_char = case_when(
arms == 0 ~ "Z",
arms == 1 ~ "ZD",
arms == 2 ~ "ZZ",
arms == 3 ~ "D"
)
) %>%
select(pidnum, arms_char, starts_with("cd8")) %>%
gather(time, measure, -pidnum, -arms_char) %>%
mutate(time = case_when(time == "cd80" ~ "baseline",
time == "cd820" ~ "week 20")) %>%
group_by(time, arms_char) %>%
summarise(N = n(),
cd8_mean = mean(measure),
cd8_median = median(measure))
# A tibble: 8 x 5
# Groups: time [?]
time arms_char N cd8_mean cd8_median
<chr> <chr> <int> <dbl> <dbl>
1 baseline D 561 972. 890
2 baseline Z 532 987. 881
3 baseline ZD 522 1004. 917
4 baseline ZZ 524 984. 898.
5 week 20 D 561 943. 871
6 week 20 Z 532 928. 818
7 week 20 ZD 522 968. 903
8 week 20 ZZ 524 902. 862