In this practical you’ll practice “data wrangling” with the dplyr
package (part of the `tidyverse collection of packages). Data wrangling refers to modifying and summarizing data. For example, sorting, adding columns, recoding values, and calculating summary statistics.
By the end of this practical you will know how to:
Here are the main verbs you will be using in dplyr
:
verb | action | example |
---|---|---|
filter() |
Select rows based on some criteria | data %>% filter(age > 40 & sex == “m”) |
arrange() |
Sort rows | data %>% arrange(date, group) |
select() |
Select columns (and ignore all others) | data %>% select(age, sex) |
rename() |
Rename columns | data %>% rename(new = old) |
mutate() |
Add new columns | data %>% mutate(height.m = height.cm / 100) |
case_when() |
Recode values of a column | data %>% mutate(sex_n = case_when(sex == 0 ~ “m”, sex == 1 ~ “f”)) |
group_by(), summarise() |
Group data and then calculate summary statistics | data %>% group_by(…) %>% summarise(…) |
left_join() |
Combine multiple datasets using a key column | data %>% left_join(data2, by = “id”) |
# -----------------------------------------------
# Examples of using dplyr on the ChickWeight data
# ------------------------------------------------
library(tidyverse) # Load tidyverse
chick <- as_tibble(ChickWeight) # Save a copy of the ChickWeight data as chick
# Change some column names with rename()
chick <- chick %>%
rename(arm = Diet, # New = arm, old = Diet
weight_g = weight) # new = weight_g, old = weight
# Select rows with filter()
chick %>%
filter(weight_g > 40 & arm == 1) # Only rows where weight_g > 40 and arm == 1
# Add columns with mutate()
chick <- chick %>%
mutate(
weight_g_kg = weight_g / 1000, # Add new column of weight_g in kilograms
time_week = Time / 7 # Add time_week as time in weeks
)
# Sort rows with arrange()
chick <- chick %>%
arrange(arm, weight_g) # sort rows by arm and then weight_g
# Recode variables with case_when()
chick <- chick %>%
mutate(
arm_name = case_when( # Rename arms with strings
arm == 1 ~ "vegetables",
arm == 2 ~ "fruit",
arm == 3 ~ "candy",
arm == 4 ~ "meat"
)
)
# Grouped statistics with group_by() and summarise()
chick %>% # Start with chick data...
group_by(arm) %>% # Group by arm...
summarise( # Summarise the following
weight_g_mean = mean(weight_g), # Mean weight
weight_g_max = max(weight_g), # Max weight
time_mean = mean(Time), # Mean time
N = n()) %>% # Number of cases...
arrange(weight_g_mean) # Sort by mean weight
# Many sequential functions
chick %>%
rename(weeks = Time) %>% # Change column name
filter(Chick > 10) %>% # Only chicks with values larger than 10
group_by(week, arm) %>% # Group by Time and arm
summarise( # Calculate summary statistics
weight_g_median = median(weight_g),
weight_g_sd = sd(weight_g),
N = n() # Counts of cases
)
You’ll use one dataset in this practical: ACTG175.csv
. It is available in the data_BaselRBootcamp_Day2.zip
file available through the main course page. If you haven’t already, download the data_BaselRBootcamp_Day2.zip
folder and unzip it to get the ACTG175.csv
file.
A. In RStudio, Create a new project called BaselRBootcamp
. In the project root directory, create two folders data
and R
.
# Done!
B. Outside of RStudio (on your computer), move the ACTG175.csv
file you just downloaded in to the data
folder you just created.
# Done!
C. Open a new R script and save it as a new file called wrangling_practical.R
in the R
folder in your working directory. At the top of the script, using comments, write your name and the date. The, load the set of tidyverse
and speff2trial
packages with library()
. Here’s how the top of your script should look:
## NAME
## DATE
## Wrangling Practical
library(tidyverse) # For tidyverse
library(speff2trial) # For the ACTG175 data documentation
B. For this practical, we’ll use the ACTG175
data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act
.
# Load ACTG175.csv from the data folder
# The ACTG175.csv file MUST be in a folder called data in your working directory!!
trial_act <- read_csv(file = "data/ACTG175.csv")
C. The trial_act
data is actually a copy of a dataset from the speff2trial
package called ACTG175
. We can make Look at the help menu for the ACTG175
data by running ?ACTG175
(If you become really interested in the data, you can also read an article discussing the trial here: http://www.nejm.org/doi/full/10.1056/nejm199610103351501#t=article)
# Look at documentation for ACTG175 data (contained in the speff2trial package)
?ACTG175
D. First thing’s first, take a look at the first few rows of the data by printing the trial_act
object. It should look like this:
# Print trial_act object
trial_act
# A tibble: 2,139 x 27
pidnum age wtkg hemo homo drugs karnof oprior z30 zprior preanti
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 10056 48 89.8 0 0 0 100 0 0 1 0
2 10059 61 49.4 0 0 0 90 0 1 1 895
3 10089 45 88.5 0 1 1 90 0 1 1 707
4 10093 47 85.3 0 1 0 100 0 1 1 1399
5 10124 43 66.7 0 1 0 100 0 1 1 1352
6 10140 46 88.9 0 1 1 100 0 1 1 1181
7 10165 31 73.0 0 1 0 100 0 1 1 930
8 10190 41 66.2 0 1 1 100 0 1 1 1329
9 10198 40 82.6 0 1 0 90 0 1 1 1074
10 10229 35 78.0 0 1 0 100 0 1 1 964
# ... with 2,129 more rows, and 16 more variables: race <int>,
# gender <int>, str2 <int>, strat <int>, symptom <int>, treat <int>,
# offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>, r <int>,
# cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>
rename()
trial_act
. Using rename()
, change wtkg
to weightkg
and age
to age_years
(to specify that age is in years)trial_act <- trial_act %>%
rename(weightkg = wtkg,
age_years = age)
age_years
conflicts with another table in their database. To fix this, change age_years
to agey
.trial_act <- trial_act %>%
rename(agey = age_years)
mutate()
trial_act
called agem
that shows each patient’s age in months instead of years (Hint: Just multiply agey
by 12!).trial_act <- trial_act %>%
mutate(
agem = agey * 12
)
karnof
shows each patient’s Karnofsky score on a scale from 0 to 100, add a new column to trial_act
called karnof_b
that shows the score on a scale from 0 to 1 (hint: just divide the original column by 100)trial_act <- trial_act %>%
mutate(
karnof_b = karnof / 100
)
mutate()
add both agem
and karnof_b
columns to trial_act
trial_act <- trial_act %>%
mutate(
agem = agey * 12,
karnof_b = karnof / 100
)
sparrow
which is equal to the Karnofsky score divided by a person’s age plus the person’s weight in kg. Add each participant’s sparrow score as a new column to trial_act
called sparrow
(Hint: Just take `karnof / agey + weightkg)trial_act <- trial_act %>%
mutate(
sparrow = karnof / agey + weightkg
)
arrange()
trial_act
data in ascending order of agey (from lowest to highest). After, look the data to make sure it worked!trial_act <- trial_act %>%
arrange(agey)
trial_act
# A tibble: 2,139 x 30
pidnum agey weightkg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 940533 12 41.4 1 0 0 100 0 0 1
2 950037 12 53.1 1 0 0 100 0 1 1
3 950056 12 31.0 1 0 0 100 0 1 1
4 910034 13 32.7 1 0 0 100 0 1 1
5 940534 13 62.9 1 0 0 100 0 0 1
6 960014 13 48.5 1 0 0 100 0 0 1
7 310767 14 65.0 1 0 0 100 0 1 1
8 920050 14 54.2 1 0 0 100 0 1 1
9 940544 14 41.1 1 0 0 100 0 0 1
10 950061 14 64.3 1 0 0 100 0 1 1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# agem <dbl>, karnof_b <dbl>, sparrow <dbl>
trial_act <- trial_act %>%
arrange(desc(agey))
trial_act
# A tibble: 2,139 x 30
pidnum agey weightkg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 11438 70 73.9 0 1 0 100 0 1 1
2 211360 70 63.1 0 1 0 90 0 0 1
3 211284 69 81.6 0 1 0 100 0 0 1
4 50580 68 90.5 0 1 1 100 0 1 1
5 81127 68 70.8 0 1 0 90 0 1 1
6 10924 67 71.0 0 1 0 100 0 0 1
7 241150 67 82.1 0 1 0 90 0 0 1
8 50662 66 84.4 0 1 0 100 0 0 1
9 11987 65 77.2 0 1 0 90 0 0 1
10 140797 65 60.5 0 1 0 90 0 0 1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# agem <dbl>, karnof_b <dbl>, sparrow <dbl>
desc()
around the variable. E.g.; data %>% arrrange(desc(height))
arrange()
. Now sort the data by karnof
and then agey (agey
), and then arms (arms
)trial_act <- trial_act %>%
arrange(karnof, agey, arms)
trial_act
# A tibble: 2,139 x 30
pidnum agey weightkg hemo homo drugs karnof oprior z30 zprior
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 610015 25 58.7 0 1 0 70 0 0 1
2 990018 27 80.3 1 0 0 70 0 1 1
3 150147 32 90.2 0 1 0 70 0 1 1
4 230109 34 64.4 0 1 0 70 0 1 1
5 640022 36 83.9 0 1 1 70 0 0 1
6 280278 40 79.4 0 0 1 70 0 1 1
7 220418 43 64.4 0 1 0 70 0 1 1
8 50632 45 72.1 0 1 1 70 0 0 1
9 11013 54 71.2 0 1 1 70 0 1 1
10 230456 23 54.0 0 1 0 80 0 1 1
# ... with 2,129 more rows, and 20 more variables: preanti <int>,
# race <int>, gender <int>, str2 <int>, strat <int>, symptom <int>,
# treat <int>, offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>,
# r <int>, cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>,
# agem <dbl>, karnof_b <dbl>, sparrow <dbl>
case_when()
gender_char
that shows gender as a string.?ACTG175
to see how gender is coded.trial_act <- trial_act %>%
mutate(
gender_char = case_when(
gender == 0 ~ "female",
gender == 1 ~ "male"
)
)
over50
that is 1 when patients are older than 50, and 0 when they are younger than or equal to 50trial_act <- trial_act %>%
mutate(
over50 = case_when(
agey > 50 ~ 1,
agey <= 50 ~ 0
)
)
mutate()
. That is, in one block of code, create gender_char
and over50
trial_act <- trial_act %>%
mutate(
gender_char = case_when(
gender == 0 ~ "female",
gender == 1 ~ "male"),
over50 = case_when(
agey > 50 ~ 1,
agey <= 50 ~ 0)
)
filter()
trial_act_B
that only contains data from males (gender == 1
)trial_act_B <- trial_act %>%
filter(gender == 1)
trial_act_C
that only contains data from people who have not used intravenous drugs (drugs == 0
) and are over 40 years old and are femaletrial_act_C <- trial_act %>%
filter(drugs == 0 & agey > 40 & gender == 0)
left_join()
left_join()
to combine multiple dataframes. At the following link https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/patient_demo.csv I have some additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Here’s how the first few rows of the data look:# A tibble: 2,139 x 3
pidnum exercise education
<int> <int> <chr>
1 250091 1 HS
2 70444 3 BA
3 220425 4 HS
4 50300 3 PHD
5 630013 3 <HS
6 60972 1 MS
7 81105 3 BA
8 211333 1 HS
9 170648 0 HS
10 220488 1 HS
# ... with 2,129 more rows
Using left_join()
to add the demographic data to the trial_act
data. To do this, first load the demographic data into R as a new dataframe called trial_act_demo
. Then, use the following template to combine the data (Hint: Replace XXX with the correct values.)
# Get trial_act_demo data
trial_act_demo <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/patient_demo.csv")
Parsed with column specification:
cols(
pidnum = col_integer(),
exercise = col_integer(),
education = col_character()
)
# Combine trial_act and trial_act_demo data
trial_act <- trial_act %>%
left_join(trial_act_demo) # The new dataframe to combine with trial_act
Joining, by = "pidnum"
trial_act
tibble, which should contain the exercise and education data, calculate the mean number of days of exercise that patients reported.mean(trial_act$exercise)
[1] 2.288453
group_by()
and summarise()
arms
. Then, for each arm, calculate the mean participant age.trial_act %>%
group_by(arms) %>%
summarise(
agey.mean = mean(agey)
)
arms
.For each arm, calculate the mean participant age and the median Karnofsky score.trial_act %>%
group_by(arms) %>%
summarise(
agey_mean = mean(agey),
karnof_median = median(karnof)
)
# A tibble: 4 x 3
arms agey_mean karnof_median
<int> <dbl> <dbl>
1 0 35.2 100.
2 1 35.2 100.
3 2 35.4 100.
4 3 35.1 100.
gender
. Then, separately for male and female patients, calculate the percent who have a history of intravenous drug use.trial_act %>%
group_by(gender) %>%
summarise(
drug_percent = mean(drugs)
)
# A tibble: 2 x 2
gender drug_percent
<int> <dbl>
1 0 0.236
2 1 0.110
gender
. Then calculate the percent of male and female patients who have a history of intravenous drug use (drugs
), and the mean number of days until a major negative event days
trial_act %>%
group_by(gender) %>%
summarise(
drug_percent = mean(drugs),
homo_percent = mean(homo)
)
# A tibble: 2 x 3
gender drug_percent homo_percent
<int> <dbl> <dbl>
1 0 0.236 0.0299
2 1 0.110 0.792
cd40
) (Hint: group by gender
and race
)trial_act %>%
group_by(gender, race) %>%
summarise(
agey_mean = mean(agey),
cd40_mean = mean(cd40)
)
# A tibble: 4 x 4
# Groups: gender [?]
gender race agey_mean cd40_mean
<int> <int> <dbl> <dbl>
1 0 0 35.1 356.
2 0 1 33.7 360.
3 1 0 35.9 350.
4 1 1 34.0 345.
days
)N = n()
)trial_act %>%
group_by(arms) %>%
summarise(
days_mean = mean(days),
cd4_bl = mean(cd40),
cd4_20 = mean(cd420),
cd4_96 = mean(cd496, na.rm = TRUE),
cd4_change = mean(cd496 - cd40, na.rm = TRUE),
N = n()
)
# A tibble: 4 x 7
arms days_mean cd4_bl cd4_20 cd4_96 cd4_change N
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 801. 353. 336. 288. -77.5 532
2 1 916. 349. 403. 341. -6.90 522
3 2 906. 353. 372. 355. -4.36 524
4 3 893. 347. 374. 329. -19.4 561
arms_char
that shows the values of arms
as characters that reflect what the values actually represent (hint: use mutate()
and case_when()
). For example, looking at the help file ?ACTG175
, I can see that the treatment arm of 0 is “zidovudine”. I might call this arm "Z"
. Do this in the all in the same chunk of code.trial_act %>%
mutate(
arms_char = case_when(
arms == 0 ~ "Z",
arms == 1 ~ "ZD",
arms == 2 ~ "ZZ",
arms == 3 ~ "D"
)
) %>%
group_by(arms_char) %>%
summarise(
days_mean = mean(days),
cd4_bl = mean(cd40),
cd4_20 = mean(cd420),
cd4_96 = mean(cd496, na.rm = TRUE),
cd4_change = mean(cd496 - cd40, na.rm = TRUE)
)
# A tibble: 4 x 6
arms_char days_mean cd4_bl cd4_20 cd4_96 cd4_change
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D 893. 347. 374. 329. -19.4
2 Z 801. 353. 336. 288. -77.5
3 ZD 916. 349. 403. 341. -6.90
4 ZZ 906. 353. 372. 355. -4.36
z30
)trial_act %>%
filter(karnof == 100 & z30 == 0) %>%
mutate(
arms_char = case_when(
arms == 0 ~ "Z",
arms == 1 ~ "ZD",
arms == 2 ~ "ZZ",
arms == 3 ~ "D"
)
) %>%
group_by(arms_char) %>%
summarise(
days_mean = mean(days),
cd4_bl = mean(cd40),
cd4_20 = mean(cd420),
cd4_96 = mean(cd496, na.rm = TRUE),
cd4_change = mean(cd496 - cd40, na.rm = TRUE)
)
# A tibble: 4 x 6
arms_char days_mean cd4_bl cd4_20 cd4_96 cd4_change
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D 883. 364. 409. 359. -2.22
2 Z 822. 373. 375. 318. -63.1
3 ZD 890. 359. 446. 381. 16.8
4 ZZ 910. 383. 416. 421. 41.2
weightkg
to weight
drugs_char
that uses strings instead of numbers to indicate drug usedrugs_char
and arms
trial_act %>%
rename(
weight = weightkg
) %>%
mutate(drugs_char = case_when(
drugs == 0 ~ "No drug use",
drugs == 1 ~ "Prior drug use"
)) %>%
filter(gender_char == "male" & pidnum > 10100) %>%
group_by(drugs_char, arms) %>%
summarise(
age_mean = mean(agey),
male = mean(gender_char == "male"),
days_mean = mean(days)
)
# A tibble: 8 x 5
# Groups: drugs_char [?]
drugs_char arms age_mean male days_mean
<chr> <int> <dbl> <dbl> <dbl>
1 No drug use 0 35.6 1. 803.
2 No drug use 1 35.1 1. 914.
3 No drug use 2 35.4 1. 910.
4 No drug use 3 34.7 1. 912.
5 Prior drug use 0 35.7 1. 741.
6 Prior drug use 1 39.2 1. 955.
7 Prior drug use 2 36.8 1. 905.
8 Prior drug use 3 37.7 1. 802.
For more details on data wrangling with R, check out the chapter in YaRrr! The Pirate’s Guide to R YaRrr! Chapter Link
Hadley Wickham, the author of dplyr, also has great examples in the dplyr vignette here: dplyr vignette link