In this practical you’ll practice “data wrangling” with the dplyr
package (part of the `tidyverse collection of packages). Data wrangling refers to modifying and summarizing data. For example, sorting, adding columns, recoding values, and calculating summary statistics.
By the end of this practical you will know how to:
Here are the main verbs you will be using in dplyr
:
verb | action | example |
---|---|---|
filter() |
Select rows based on some criteria | data %>% filter(age > 40 & sex == “m”) |
arrange() |
Sort rows | data %>% arrange(date, group) |
select() |
Select columns (and ignore all others) | data %>% select(age, sex) |
rename() |
Rename columns | data %>% rename(new = old) |
mutate() |
Add new columns | data %>% mutate(height.m = height.cm / 100) |
case_when() |
Recode values of a column | data %>% mutate(sex_n = case_when(sex == 0 ~ “m”, sex == 1 ~ “f”)) |
group_by(), summarise() |
Group data and then calculate summary statistics | data %>% group_by(…) %>% summarise(…) |
left_join() |
Combine multiple datasets using a key column | data %>% left_join(data2, by = “id”) |
# -----------------------------------------------
# Examples of using dplyr on the ChickWeight data
# ------------------------------------------------
library(tidyverse) # Load tidyverse
chick <- as_tibble(ChickWeight) # Save a copy of the ChickWeight data as chick
# Change some column names with rename()
chick <- chick %>%
rename(arm = Diet, # New = arm, old = Diet
weight_g = weight) # new = weight_g, old = weight
# Select rows with filter()
chick %>%
filter(weight_g > 40 & arm == 1) # Only rows where weight_g > 40 and arm == 1
# Add columns with mutate()
chick <- chick %>%
mutate(
weight_g_kg = weight_g / 1000, # Add new column of weight_g in kilograms
time_week = Time / 7 # Add time_week as time in weeks
)
# Sort rows with arrange()
chick <- chick %>%
arrange(arm, weight_g) # sort rows by arm and then weight_g
# Recode variables with case_when()
chick <- chick %>%
mutate(
arm_name = case_when( # Rename arms with strings
arm == 1 ~ "vegetables",
arm == 2 ~ "fruit",
arm == 3 ~ "candy",
arm == 4 ~ "meat"
)
)
# Grouped statistics with group_by() and summarise()
chick %>% # Start with chick data...
group_by(arm) %>% # Group by arm...
summarise( # Summarise the following
weight_g_mean = mean(weight_g), # Mean weight
weight_g_max = max(weight_g), # Max weight
time_mean = mean(Time), # Mean time
N = n()) %>% # Number of cases...
arrange(weight_g_mean) # Sort by mean weight
# Many sequential functions
chick %>%
rename(weeks = Time) %>% # Change column name
filter(Chick > 10) %>% # Only chicks with values larger than 10
group_by(week, arm) %>% # Group by Time and arm
summarise( # Calculate summary statistics
weight_g_median = median(weight_g),
weight_g_sd = sd(weight_g),
N = n() # Counts of cases
)
You’ll use one dataset in this practical: ACTG175.csv
. It is available in the data_BaselRBootcamp_Day2.zip
file available through the main course page. If you haven’t already, download the data_BaselRBootcamp_Day2.zip
folder and unzip it to get the ACTG175.csv
file.
A. In RStudio, Create a new project called BaselRBootcamp
. In the project root directory, create two folders data
and R
.
B. Outside of RStudio (on your computer), move the ACTG175.csv
file you just downloaded in to the data
folder you just created.
C. Open a new R script and save it as a new file called wrangling_practical.R
in the R
folder in your working directory. At the top of the script, using comments, write your name and the date. The, load the set of tidyverse
and speff2trial
packages with library()
. Here’s how the top of your script should look:
## NAME
## DATE
## Wrangling Practical
library(tidyverse) # For tidyverse
library(speff2trial) # For the ACTG175 data documentation
B. For this practical, we’ll use the ACTG175
data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act
.
# Load ACTG175.csv from the data folder
# The ACTG175.csv file MUST be in a folder called data in your working directory!!
trial_act <- read_csv(file = "data/ACTG175.csv")
C. The trial_act
data is actually a copy of a dataset from the speff2trial
package called ACTG175
. We can make Look at the help menu for the ACTG175
data by running ?ACTG175
(If you become really interested in the data, you can also read an article discussing the trial here: http://www.nejm.org/doi/full/10.1056/nejm199610103351501#t=article)
# Look at documentation for ACTG175 data (contained in the speff2trial package)
?ACTG175
D. First thing’s first, take a look at the first few rows of the data by printing the trial_act
object. It should look like this:
# Print trial_act object
trial_act
# A tibble: 2,139 x 27
pidnum age wtkg hemo homo drugs karnof oprior z30 zprior preanti
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 10056 48 89.8 0 0 0 100 0 0 1 0
2 10059 61 49.4 0 0 0 90 0 1 1 895
3 10089 45 88.5 0 1 1 90 0 1 1 707
4 10093 47 85.3 0 1 0 100 0 1 1 1399
5 10124 43 66.7 0 1 0 100 0 1 1 1352
6 10140 46 88.9 0 1 1 100 0 1 1 1181
7 10165 31 73.0 0 1 0 100 0 1 1 930
8 10190 41 66.2 0 1 1 100 0 1 1 1329
9 10198 40 82.6 0 1 0 90 0 1 1 1074
10 10229 35 78.0 0 1 0 100 0 1 1 964
# ... with 2,129 more rows, and 16 more variables: race <int>,
# gender <int>, str2 <int>, strat <int>, symptom <int>, treat <int>,
# offtrt <int>, cd40 <int>, cd420 <int>, cd496 <int>, r <int>,
# cd80 <int>, cd820 <int>, cens <int>, days <int>, arms <int>
rename()
Lets change some of the column names in trial_act
. Using rename()
, change wtkg
to weightkg
and age
to age_years
(to specify that age is in years)
Oops, someone in your company just told you that the name age_years
conflicts with another table in their database. To fix this, change age_years
to agey
.
mutate()
Add a new column to trial_act
called agem
that shows each patient’s age in months instead of years (Hint: Just multiply agey
by 12!).
The column karnof
shows each patient’s Karnofsky score on a scale from 0 to 100, add a new column to trial_act
called karnof_b
that shows the score on a scale from 0 to 1 (hint: just divide the original column by 100)
Now, do the previous two questions in one chunk of code. That is, using one call to mutate()
add both agem
and karnof_b
columns to trial_act
A physician wants to see a new score called the sparrow
which is equal to the Karnofsky score divided by a person’s age plus the person’s weight in kg. Add each participant’s sparrow score as a new column to trial_act
called sparrow
(Hint: Just take `karnof / agey + weightkg)
arrange()
Arrange the trial_act
data in ascending order of agey (from lowest to highest). After, look the data to make sure it worked!
Now arrange the data in descending order of agey (from highest to lowest). After, look the data to make sure it worked.
desc()
around the variable. E.g.; data %>% arrrange(desc(height))
arrange()
. Now sort the data by karnof
and then agey (agey
), and then arms (arms
)case_when()
gender_char
that shows gender as a string.?ACTG175
to see how gender is coded.Create a new column over50
that is 1 when patients are older than 50, and 0 when they are younger than or equal to 50
Now, repeat the previous two questions, but do them both in the same call to mutate()
. That is, in one block of code, create gender_char
and over50
filter()
Create a new tibble called trial_act_B
that only contains data from males (gender == 1
)
Create a new tibble called trial_act_C
that only contains data from people who have not used intravenous drugs (drugs == 0
) and are over 40 years old and are female
left_join()
left_join()
to combine multiple dataframes. At the following link https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/patient_demo.csv I have some additional (fictional) demographic data about the patients, namely the number of days of exercise they get per week, and their highest level of education. Here’s how the first few rows of the data look:# A tibble: 2,139 x 3
pidnum exercise education
<int> <int> <chr>
1 250091 1 HS
2 70444 3 BA
3 220425 4 HS
4 50300 3 PHD
5 630013 3 <HS
6 60972 1 MS
7 81105 3 BA
8 211333 1 HS
9 170648 0 HS
10 220488 1 HS
# ... with 2,129 more rows
Using left_join()
to add the demographic data to the trial_act
data. To do this, first load the demographic data into R as a new dataframe called trial_act_demo
. Then, use the following template to combine the data (Hint: Replace XXX with the correct values.)
# Get trial_act_demo data
trial_act_demo <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/patient_demo.csv")
Parsed with column specification:
cols(
pidnum = col_integer(),
exercise = col_integer(),
education = col_character()
)
# Combine trial_act and trial_act_demo data
trial_act <- trial_act %>%
left_join(trial_act_demo) # The new dataframe to combine with trial_act
Joining, by = "pidnum"
trial_act
tibble, which should contain the exercise and education data, calculate the mean number of days of exercise that patients reported.group_by()
and summarise()
Group the data by arms
. Then, for each arm, calculate the mean participant age.
Group the data by arms
.For each arm, calculate the mean participant age and the median Karnofsky score.
Group the data by gender
. Then, separately for male and female patients, calculate the percent who have a history of intravenous drug use.
Group the data by gender
. Then calculate the percent of male and female patients who have a history of intravenous drug use (drugs
), and the mean number of days until a major negative event days
Separately for all combinations of gender and race, calculate the mean age and mean CD4 T cell count at baseline (cd40
) (Hint: group by gender
and race
)
days
)N = n()
)Repeat the previous analysis, but before you do the grouping and summary statistics, create a new variable called arms_char
that shows the values of arms
as characters that reflect what the values actually represent (hint: use mutate()
and case_when()
). For example, looking at the help file ?ACTG175
, I can see that the treatment arm of 0 is “zidovudine”. I might call this arm "Z"
. Do this in the all in the same chunk of code.
Repeat the previous analysis, but only include patients with a Karnofsky score equal to 100, and who did not use zidovudine in the 30 days prior to the treatment initiation (z30
)
weightkg
to weight
drugs_char
that uses strings instead of numbers to indicate drug usedrugs_char
and arms
For more details on data wrangling with R, check out the chapter in YaRrr! The Pirate’s Guide to R YaRrr! Chapter Link
Hadley Wickham, the author of dplyr, also has great examples in the dplyr vignette here: dplyr vignette link