In this practical you’ll practice “data wrangling” with the dplyr
and tidyr
packages (part of the `tidyverse collection of packages).
By the end of this practical you will know how to:
You’ll need the following datasets for this practical:
library(tidyverse)
trial_act <- read_csv("../_data/baselrbootcamp_data/trial_act.csv")
trial_act_demo <- read_csv("../_data/baselrbootcamp_data/trial_act_demo_fake.csv")
File | Rows | Columns |
---|---|---|
trial_act.csv | 2139 | 27 |
trial_act_demo_fake | 2139 | 3 |
Package | Installation |
---|---|
tidyverse |
install.packages("tidyverse") |
Function | Package | Description |
---|---|---|
rename() |
dplyr |
Rename columns |
select() |
dplyr |
Select columns based on name or index |
filter() |
dplyr |
Select rows based on some logical criteria |
arrange() |
dplyr |
Sort rows |
mutate() |
dplyr |
Add new columns |
case_when() |
dplyr |
Recode values of a column |
group_by(), summarise() |
dplyr |
Group data and then calculate summary statistics |
left_join() |
dplyr |
Combine multiple data sets using a key column |
spread() |
tidyr |
Convert long data to wide format - from rows to columns |
gather() |
tidyr |
Convert wide data to long format - from columns to rows |
# Wrangling with dplyr and tidyr ---------------------------
library(tidyverse) # Load tidyverse for dplyr and tidyr
# Load baselers data
baselers <- read_csv("https://raw.githubusercontent.com/therbootcamp/baselers/master/inst/extdata/baselers.txt")
# Perform many dplyr operations in a row
baselers %>%
# Change some names
rename(age_y = age,
swimming = rhine) %>%
# Only include people over 30
filter(age_y > 30) %>%
# Calculate some new columns
mutate(weight_lbs = weight * 2.22,
height_m = height / 100,
BMI = weight / height_m ^ 2,
# Make binary version of sex
sex_bin = case_when(
sex == "male" ~ 0,
sex == "female" ~ 1),
# Show when height is greater than 150
height_lt_150 = case_when(
height < 150 ~ 1,
height >= 150 ~ 0)) %>%
# Sort in ascending order of sex, then
# descending order of age
arrange(sex, desc(age_y)))
# Calculate grouped summary statistics
baselers_agg <- baselers %>%
group_by(sex, education) %>%
summarise(
age_mean = mean(age_y, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
baselrbootcamp
R project. It should already have the folders 1_Data
and 2_Code
. Make sure that the data files listed in the Datasets
section above are in your 1_Data
folder# Done!
wrangling_practical.R
in the 2_Code
folder.Using library()
load the set of packages for this practical listed in the packages section above.
For this practical, we’ll use the trial_act
data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act
.
# Load trial_act.csv from the data folder in your working directory
trial_act <- read_csv(file = "XX")
trial_act <- read_csv(file = "1_Data/trial_act.csv")
trial_act_demo_fake.csv
data as a new dataframe called trial_act_demo_fake
trial_act_demo_fake <- read_csv(file = "1_Data/trial_act_demo_fake")
# Print trial_act object
trial_act
# A tibble: 2,139 x 27
pidnum age wtkg hemo homo drugs karnof oprior z30 zprior preanti
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 10056 48 89.8 0 0 0 100 0 0 1 0
2 10059 61 49.4 0 0 0 90 0 1 1 895
3 10089 45 88.5 0 1 1 90 0 1 1 707
4 10093 47 85.3 0 1 0 100 0 1 1 1399
5 10124 43 66.7 0 1 0 100 0 1 1 1352
6 10140 46 88.9 0 1 1 100 0 1 1 1181
7 10165 31 73.0 0 1 0 100 0 1 1 930
8 10190 41 66.2 0 1 1 100 0 1 1 1329
9 10198 40 82.6 0 1 0 90 0 1 1 1074
10 10229 35 78.0 0 1 0 100 0 1 1 964
# ... with 2,129 more rows, and 16 more variables: race <int>,
# gender <int>, str2 <int>, strat <int>, symptom <int>, treat <int>,
# offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>, r <int>,
# cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>
summary()
function to get more details on the datasets.summary(trial_act)
pidnum age wtkg hemo
Min. : 10056 Min. :12.0 Min. : 31.0 Min. :0.000
1st Qu.: 81446 1st Qu.:29.0 1st Qu.: 66.7 1st Qu.:0.000
Median :190566 Median :34.0 Median : 74.4 Median :0.000
Mean :248778 Mean :35.2 Mean : 75.1 Mean :0.084
3rd Qu.:280277 3rd Qu.:40.0 3rd Qu.: 82.6 3rd Qu.:0.000
Max. :990077 Max. :70.0 Max. :159.9 Max. :1.000
homo drugs karnof oprior
Min. :0.000 Min. :0.000 Min. : 70.0 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 90.0 1st Qu.:0.000
Median :1.000 Median :0.000 Median :100.0 Median :0.000
Mean :0.661 Mean :0.131 Mean : 95.4 Mean :0.022
3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:100.0 3rd Qu.:0.000
Max. :1.000 Max. :1.000 Max. :100.0 Max. :1.000
z30 zprior preanti race gender
Min. :0.00 Min. :1 Min. : 0 Min. :0.000 Min. :0.000
1st Qu.:0.00 1st Qu.:1 1st Qu.: 0 1st Qu.:0.000 1st Qu.:1.000
Median :1.00 Median :1 Median : 142 Median :0.000 Median :1.000
Mean :0.55 Mean :1 Mean : 379 Mean :0.288 Mean :0.828
3rd Qu.:1.00 3rd Qu.:1 3rd Qu.: 740 3rd Qu.:1.000 3rd Qu.:1.000
Max. :1.00 Max. :1 Max. :2851 Max. :1.000 Max. :1.000
str2 strat symptom treat
Min. :0.000 Min. :1.00 Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:1.000
Median :1.000 Median :2.00 Median :0.000 Median :1.000
Mean :0.586 Mean :1.98 Mean :0.173 Mean :0.751
3rd Qu.:1.000 3rd Qu.:3.00 3rd Qu.:0.000 3rd Qu.:1.000
Max. :1.000 Max. :3.00 Max. :1.000 Max. :1.000
offtrt cd40 cd420 cd496
Min. :0.000 Min. : 0 Min. : 49 Min. : 0
1st Qu.:0.000 1st Qu.: 264 1st Qu.: 269 1st Qu.: 209
Median :0.000 Median : 340 Median : 353 Median : 321
Mean :0.363 Mean : 351 Mean : 371 Mean : 329
3rd Qu.:1.000 3rd Qu.: 423 3rd Qu.: 460 3rd Qu.: 440
Max. :1.000 Max. :1199 Max. :1119 Max. :1190
NA's :797
r cd80 cd820 cens
Min. :0.000 Min. : 40 Min. : 124 Min. :0.000
1st Qu.:0.000 1st Qu.: 654 1st Qu.: 632 1st Qu.:0.000
Median :1.000 Median : 893 Median : 865 Median :0.000
Mean :0.627 Mean : 987 Mean : 935 Mean :0.244
3rd Qu.:1.000 3rd Qu.:1207 3rd Qu.:1146 3rd Qu.:0.000
Max. :1.000 Max. :5011 Max. :6035 Max. :1.000
days arms
Min. : 14 Min. :0.00
1st Qu.: 727 1st Qu.:1.00
Median : 997 Median :2.00
Mean : 879 Mean :1.52
3rd Qu.:1091 3rd Qu.:3.00
Max. :1231 Max. :3.00
summary(trial_act_demo_fake)
pidnum exercise education
Min. : 10056 Min. :0.0 Length:2139
1st Qu.: 81446 1st Qu.:1.0 Class :character
Median :190566 Median :1.0 Mode :character
Mean :248778 Mean :2.2
3rd Qu.:280277 3rd Qu.:3.0
Max. :990077 Max. :7.0
trial_act
data with names()
names(trial_act)
[1] "pidnum" "age" "wtkg" "hemo" "homo" "drugs" "karnof"
[8] "oprior" "z30" "zprior" "preanti" "race" "gender" "str2"
[15] "strat" "symptom" "treat" "offtrt" "cd40" "cd420" "cd496"
[22] "r" "cd80" "cd820" "cens" "days" "arms"
rename()
, change the column name wtkg
in the trial_act
dataframe to weight_kg
. Be sure to assign the result back to trial_act
to change it!# Change the name to weight_kg from wtkg
trial_act <- trial_act %>%
rename(XX = XX)
trial_act <- trial_act %>%
rename(weight_kg = wtkg)
trial_act
dataframe again, do you now see the column weight_kg
?names(trial_act)
[1] "pidnum" "age" "weight_kg" "hemo" "homo"
[6] "drugs" "karnof" "oprior" "z30" "zprior"
[11] "preanti" "race" "gender" "str2" "strat"
[16] "symptom" "treat" "offtrt" "cd40" "cd420"
[21] "cd496" "r" "cd80" "cd820" "cens"
[26] "days" "arms"
age
to age_y
(to specify that age is in years).trial_act <- trial_act %>%
rename(age_y = age)
select()
function, select only the column age_y
and print the result (but don’t assign it to anything). Do you see only the age column now?trial_act %>%
select(age_y)
# A tibble: 2,139 x 1
age_y
<int>
1 48
2 61
3 45
4 47
5 43
6 46
7 31
8 41
9 40
10 35
# ... with 2,129 more rows
CD4_wide
that only contains the columns pidnum
, arms
, cd40
, cd420
, and cd496
. The cd40
, cd420
, and cd496
columns show patient’s CD4 T cell counts at baseline, 20 weeks, and 96 weeks. Print the result to make sure it worked!XX <- trial_act %>%
select(XX, XX, XX, XX, ...)
CD4_wide <- trial_act %>%
select(`pidnum`, `arms`, `cd40`, `cd420`, `cd496`)
starts_with()
? Try adapting the following code to get the same result you got before.CD4_wide <- trial_act %>%
select(pidnum, arms, starts_with("XXX"))
CD4_wide <- trial_act %>%
select(pidnum, arms, starts_with("cd"))
mutate()
function, add the column age_m
which shows each patient’s age in months instead of years (Hint: Just multiply age_y
by 12!)trial_act <- trial_act %>%
mutate(XX = XX * 12)
trial_act <- trial_act %>%
mutate(age_m = age_y * 12)
trial_act
. Try combining these into one call to the mutate()
function!weight_lb
: Weight in lbs instead of kilograms. You can do this by multiplying weight_kg
by 2.2.cd_change_20
: Change in CD4 T cell count from baseline to 20 weeks. You can do this by taking cd420 - cd40
cd_change_960
: Change in CD4 T cell count from baseline to 96 weeks. You can do this by taking cd496 - cd40
trial_act <- trial_act %>%
mutate(weight_lb = XXX,
cd_change_20 = XXX,
XX = XX)
trial_act <- trial_act %>%
mutate(weight_lb = weight_kg * 2.2,
cd_change_20 = cd420 - cd40,
cd_change_960 = cd496 - cd40)
gender
column, you will see that it is numeric. Change the column so it shows gender as a string, where 0 = “female” and 1 = “male”. To do this, use a combination of mutate()
and case_when
:# Create gender_char which shows gender as a stringh
trial_act <- trial_act %>%
mutate(
gender_char = case_when(
gender == XX ~ "XX",
gender == XX ~ "XX"))
trial_act <- trial_act %>%
mutate(
gender_char = case_when(
gender == 0 ~ "female",
gender == 1 ~ "male"))
arms
is also numeric and not very meaningful. Create a new column arms_char
contains the names of the arms. Here is a table of the mappingarms | arms_char |
---|---|
0 | zidovudine |
1 | zidovudine and didanosine |
2 | zidovudine and zalcitabine |
3 | didanosine |
trial_act <- trial_act %>%
mutate(
arms_char = case_when(
arms == 0 ~ "zidovudine",
arms == 1 ~ "zidovudine and didanosine",
arms == 2 ~ "zidovudine and zalcitabine",
arms == 3 ~ "didanosine"))
mutate()
. That is, in one block of code, create agem
, weight_lb
, cd_change_20
, cd_change_960
, gender_char
and over50
using the mutate()
function only once. Here’s how your code should look:trial_act <- trial_act %>%
mutate(
agem = XXX,
weight_lb = XXX,
cd_change_20 = XXX,
cd_change_960 = XXX,
gender_char = case_when(XXX),
arms_char = case_when(XXX)
)
trial_act <- trial_act %>%
mutate(
agem = age_y * 12,
weight_lb = weight_kg * 2.2,
cd_change_20 = cd420 - cd40,
cd_change_960 = cd496 - cd40,
gender_char = case_when(
gender == 0 ~ "female",
gender == 1 ~ "male"),
arms_char = case_when(
arms == 0 ~ "zidovudine",
arms == 1 ~ "zidovudine and didanosine",
arms == 2 ~ "zidovudine and zalcitabine",
arms == 3 ~ "didanosine")
)
arrange()
function, arrange the trial_act
data in ascending order of age_y
(from lowest to highest). After you do, look the data to make sure it worked!trial_act <- trial_act %>%
arrange(age_y)
trial_act
# A tibble: 2,139 x 33
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 940533 12 41.4 1 0 0 100 0 0 1
2 950037 12 53.1 1 0 0 100 0 1 1
3 950056 12 31 1 0 0 100 0 1 1
4 910034 13 32.7 1 0 0 100 0 1 1
5 940534 13 62.9 1 0 0 100 0 0 1
6 960014 13 48.5 1 0 0 100 0 0 1
7 310767 14 65 1 0 0 100 0 1 1
8 920050 14 54.2 1 0 0 100 0 1 1
9 940544 14 41.1 1 0 0 100 0 0 1
10 950061 14 64.3 1 0 0 100 0 1 1
# ... with 2,129 more rows, and 23 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
# cd_change_20 <int>, cd_change_960 <int>
age_y
(from highest to lowest). After, look the data to make sure it worked. To arrange data in descending order, just include desc()
around the variable. E.g.; data %>% arrrange(desc(height))
trial_act <- trial_act %>%
arrange(desc(age_y))
trial_act
# A tibble: 2,139 x 33
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 11438 70 73.9 0 1 0 100 0 1 1
2 211360 70 63.1 0 1 0 90 0 0 1
3 211284 69 81.6 0 1 0 100 0 0 1
4 50580 68 90.5 0 1 1 100 0 1 1
5 81127 68 70.8 0 1 0 90 0 1 1
6 10924 67 71 0 1 0 100 0 0 1
7 241150 67 82.1 0 1 0 90 0 0 1
8 50662 66 84.4 0 1 0 100 0 0 1
9 11987 65 77.2 0 1 0 90 0 0 1
10 140797 65 60.5 0 1 0 90 0 0 1
# ... with 2,129 more rows, and 23 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
# cd_change_20 <int>, cd_change_960 <int>
arrange()
. Now sort the data by arms (arms
) and then age (age_y
).trial_act <- trial_act %>%
arrange(arms, age_y)
trial_act
# A tibble: 2,139 x 33
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 960014 13 48.5 1 0 0 100 0 0 1
2 960031 14 48.3 1 0 0 100 0 0 1
3 990071 14 60 1 0 0 100 0 0 1
4 980042 16 63 1 0 0 100 0 1 1
5 171040 17 51.3 0 0 0 90 0 0 1
6 990026 17 103. 1 0 0 100 0 1 1
7 310234 18 57.3 1 0 0 100 0 1 1
8 940519 18 56.8 1 0 0 100 0 1 1
9 211314 19 50.8 0 0 0 90 0 0 1
10 340767 19 74.8 0 1 0 100 0 0 1
# ... with 2,129 more rows, and 23 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
# cd_change_20 <int>, cd_change_960 <int>
filter()
filter()
function, create a new dataframe called trial_act_m
that only contains data from males (gender_char == "male"
). After you finish, print your new dataframe to make sure it looks correct!trial_act_m <- trial_act %>%
filter(gender_char == "male")
filter()
and call it trial_act_Tracy
trial_act_Tracy <- trial_act %>%
filter(age_y > XX & gender == XX)
trial_act_Tracy <- trial_act %>%
filter(age_y > 40 & gender_char == "female")
left_join()
trial_act_demo_fake.csv
file contains additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Use the left_join()
function to combine the trial_act
and trial_act_demo_fake
datasets, set the by
argument to the name of the column that is common in both data sets. Assign the result to trial_act
.trial_act <- trial_act %>%
left_join(XX, by = "XX")
trial_act <- trial_act %>%
left_join(trial_act_demo_fake, by = "pidnum")
trial_act
dataframe. Do you now see the demographic data?trial_act
# A tibble: 2,139 x 35
pidnum age_y weight_kg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 960014 13 48.5 1 0 0 100 0 0 1
2 960031 14 48.3 1 0 0 100 0 0 1
3 990071 14 60 1 0 0 100 0 0 1
4 980042 16 63 1 0 0 100 0 1 1
5 171040 17 51.3 0 0 0 90 0 0 1
6 990026 17 103. 1 0 0 100 0 1 1
7 310234 18 57.3 1 0 0 100 0 1 1
8 940519 18 56.8 1 0 0 100 0 1 1
9 211314 19 50.8 0 0 0 90 0 0 1
10 340767 19 74.8 0 1 0 100 0 0 1
# ... with 2,129 more rows, and 25 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# gender_char <chr>, arms_char <chr>, agem <dbl>, weight_lb <dbl>,
# cd_change_20 <int>, cd_change_960 <int>, exercise <int>,
# education <chr>
group_by()
and summarise()
trial_act
dataframe. Then, group the data by arms
. Then, for each arm, calculate the mean participant age (in years) as a new column called age_mean
. Also, using N = n()
, calculate the number of cases for each group. Assign the result to a new object called trial_arm
.trial_arm <- trial_act %>%
group_by(XX) %>%
summarise(
N = n(),
XX = mean(XX)
)
trial_arm <- trial_act %>%
group_by(arms) %>%
summarise(
N = n(),
age_mean = mean(age_y)
)
trial_arm <- trial_act %>%
group_by(arms) %>%
summarise(
N = n(),
age_mean = mean(age_y),
age_sd = sd(age_y)
)
days
) for each arm.trial_arm <- trial_act %>%
group_by(arms) %>%
summarise(
N = n(),
age_mean = mean(age_y),
age_sd = sd(age_y),
days_med = median(days)
)
trial_sex
groups the data based on gender
and calculates the same summary statistics as you did for trial_arm
trial_gender <- trial_act %>%
group_by(gender) %>%
summarise(
N = n(),
age_mean = mean(age_y),
age_sd = sd(age_y)
)
trial_gender
# A tibble: 2 x 4
gender N age_mean age_sd
<int> <int> <dbl> <dbl>
1 0 368 34.3 8.24
2 1 1771 35.4 8.79
trial_arms_gender
that calculates the same summary statistics for all the groups gender
and arms
. Hint: Just add a second grouping variable!trial_arms_gender <- trial_act %>%
group_by(gender, arms) %>%
summarise(
N = n(),
age_mean = mean(age_y),
age_sd = sd(age_y)
)
trial_arms_gender
# A tibble: 8 x 5
# Groups: gender [?]
gender arms N age_mean age_sd
<int> <int> <int> <dbl> <dbl>
1 0 0 100 33.6 7.75
2 0 1 88 33.5 8.71
3 0 2 89 35.0 8.37
4 0 3 91 35.2 8.15
5 1 0 432 35.6 9.05
6 1 1 434 35.6 8.67
7 1 2 435 35.5 8.90
8 1 3 470 35.1 8.58
CD4_wide
dataframe you created before? Currently it is in the wide format, where key data (different CD4 T cell counts) are in different columns. Now we will try to convert it to a long format. Our goal is to get the data in the ‘long’ format. Using the spread()
function, create a new dataframe called CD4_long
that shows the data in the ‘long’ format. To do this, use the following template. Set the grouping column to time
and the new data column to value
.CD4_long <- CD4_wide %>%
gather(XX, # New grouping column
XX, # New data column
-pidnum, -arms) # Names of columns to replicate
CD4_long <- CD4_wide %>%
gather(time, # New grouping column
value, # New data column
-pidnum, -arms) # Names of columns to replicate
CD4_long
dataframe and compare it to the original CD4_wide
CD4_long
# A tibble: 6,417 x 4
pidnum arms time value
<int> <int> <chr> <int>
1 10056 2 cd40 422
2 10059 3 cd40 162
3 10089 3 cd40 326
4 10093 3 cd40 287
5 10124 0 cd40 504
6 10140 1 cd40 235
7 10165 0 cd40 244
8 10190 0 cd40 401
9 10198 3 cd40 214
10 10229 0 cd40 221
# ... with 6,407 more rows
group_by()
and summarise()
.CD4_long %>%
group_by(time, arms) %>%
summarise(count_mean = mean(value, na.rm = TRUE))
# A tibble: 12 x 3
# Groups: time [?]
time arms count_mean
<chr> <int> <dbl>
1 cd40 0 353.
2 cd40 1 349.
3 cd40 2 353.
4 cd40 3 347.
5 cd420 0 336.
6 cd420 1 403.
7 cd420 2 372.
8 cd420 3 374.
9 cd496 0 288.
10 cd496 1 341.
11 cd496 2 355.
12 cd496 3 329.
spread()
function to convert CD4_long
back the wide format. Assign the result to a new object called CD4_wide_2
.CD4_wide_2 <- CD4_long %>%
spread(XX, # old group column
XX) # old target column
CD4_wide_2 <- CD4_long %>%
spread(time, # old group column
value) # old target column
CD4_wide_2
to CD4_wide
do they look the same?CD4_wide_2
# A tibble: 2,139 x 5
pidnum arms cd40 cd420 cd496
<int> <int> <int> <int> <int>
1 10056 2 422 477 660
2 10059 3 162 218 NA
3 10089 3 326 274 122
4 10093 3 287 394 NA
5 10124 0 504 353 660
6 10140 1 235 339 264
7 10165 0 244 225 106
8 10190 0 401 366 453
9 10198 3 214 107 8
10 10229 0 221 132 NA
# ... with 2,129 more rows
mutate()
and summarise()
have ‘scoped’ versions with suffixes like _if
and _all
. that allow you do some pretty cool stuff easily (look at the help menu with ?scoped
for details). Try running the following chunk with summarise_if()
and see what happens:# See how summarise_if() works!
trial_act %>%
group_by(gender) %>%
summarise_if(is.numeric, mean)
# See how summarise_if() works!
trial_act %>%
group_by(gender) %>%
summarise_if(is.numeric, mean)
# A tibble: 2 x 32
gender pidnum age_y weight_kg hemo homo drugs karnof oprior z30
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 258094. 34.3 68.1 0.0136 0.0299 0.236 95.6 0.00815 0.590
2 1 246842. 35.4 76.6 0.0988 0.792 0.110 95.4 0.0248 0.542
# ... with 22 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
# str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>, offtrt <dbl>,
# cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>, cd80 <dbl>,
# cd820 <dbl>, cens <dbl>, days <dbl>, arms <dbl>, agem <dbl>,
# weight_lb <dbl>, cd_change_20 <dbl>, cd_change_960 <dbl>,
# exercise <dbl>
trial_act
dataset, group the data by arms
and calculate the mean of all numeric columns using summarise_if()
.trial_act %>%
group_by(arms) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
# A tibble: 4 x 32
arms pidnum age_y weight_kg hemo homo drugs karnof oprior z30
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 252884. 35.2 76.1 0.0789 0.641 0.118 95.4 0.0301 0.547
2 1 232609. 35.2 74.9 0.0824 0.663 0.140 95.5 0.0172 0.552
3 2 257593. 35.4 74.7 0.0878 0.664 0.145 95.7 0.0248 0.561
4 3 251696. 35.1 74.9 0.0873 0.676 0.123 95.1 0.0160 0.542
# ... with 22 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
# gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
# offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
# cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
# weight_lb <dbl>, cd_change_20 <dbl>, cd_change_960 <dbl>,
# exercise <dbl>
mutate_if()
:# use mutate_if() to round all numeric variables to 2 digits
trial_act %>%
mutate_if(is.numeric, round, 2)
mutate_if()
, round all of your results from the previous question to 0 decimal places (to the nearest integer)trial_act %>%
group_by(arms) %>%
summarise_if(is.numeric, mean, na.rm = TRUE) %>%
mutate_if(is.numeric, round, 2)
# A tibble: 4 x 32
arms pidnum age_y weight_kg hemo homo drugs karnof oprior z30
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 252884. 35.2 76.1 0.08 0.64 0.12 95.4 0.03 0.55
2 1 232609. 35.2 74.9 0.08 0.66 0.14 95.5 0.02 0.55
3 2 257593. 35.4 74.7 0.09 0.66 0.15 95.7 0.02 0.56
4 3 251696. 35.1 74.9 0.09 0.68 0.12 95.1 0.02 0.54
# ... with 22 more variables: zprior <dbl>, preanti <dbl>, race <dbl>,
# gender <dbl>, str2 <dbl>, strat <dbl>, symptom <dbl>, treat <dbl>,
# offtrt <dbl>, cd40 <dbl>, cd420 <dbl>, cd496 <dbl>, r <dbl>,
# cd80 <dbl>, cd820 <dbl>, cens <dbl>, days <dbl>, agem <dbl>,
# weight_lb <dbl>, cd_change_20 <dbl>, cd_change_960 <dbl>,
# exercise <dbl>
days
)cd40
)cd420
)cd496
)trial_act %>%
group_by(arms) %>%
summarise(
days_mean = mean(days),
cd4_bl = mean(cd40),
cd4_20 = mean(cd420),
cd4_96 = mean(cd496, na.rm = TRUE),
cd4_change = mean(cd496 - cd40, na.rm = TRUE),
N = n()
)
# A tibble: 4 x 7
arms days_mean cd4_bl cd4_20 cd4_96 cd4_change N
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 801. 353. 336. 288. -77.5 532
2 1 916. 349. 403. 341. -6.90 522
3 2 906. 353. 372. 355. -4.36 524
4 3 893. 347. 374. 329. -19.4 561
cd80
and cd820
), where the data are grouped by time and trial arm. (Hint: use the following functions in order: select(), gather(), mutate(), group_by(), summarise())trial_act %>%
mutate(
arms_char = case_when(
arms == 0 ~ "Z",
arms == 1 ~ "ZD",
arms == 2 ~ "ZZ",
arms == 3 ~ "D"
)
) %>%
select(pidnum, arms_char, starts_with("cd8")) %>%
gather(time, measure, -pidnum, -arms_char) %>%
mutate(time = case_when(time == "cd80" ~ "baseline",
time == "cd820" ~ "week 20")) %>%
group_by(time, arms_char) %>%
summarise(N = n(),
cd8_mean = mean(measure),
cd8_median = median(measure))
# A tibble: 8 x 5
# Groups: time [?]
time arms_char N cd8_mean cd8_median
<chr> <chr> <int> <dbl> <dbl>
1 baseline D 561 972. 890
2 baseline Z 532 987. 881
3 baseline ZD 522 1004. 917
4 baseline ZZ 524 984. 898.
5 week 20 D 561 943. 871
6 week 20 Z 532 928. 818
7 week 20 ZD 522 968. 903
8 week 20 ZZ 524 902. 862