Overview

In this case study, we will look at the results of a clinical trial exploring the effectiveness of a new medication called dimarta on reducing histamine in patients with a disease that leads to chronically high histamine levels. In the study, 300 patients were assigned to one of three different treatment arms. One arm was given a placebo. The other arm was given adiclax – the standard of care for the disease. Finally, the third arm was given dimarta. There were two main measures of interest in the trial: patient’s changes in histamine from the beginning to the end of the trial, and their change in quality of life (measured by self report).

In addition to exploring the effects of the three medications, the researchers are interested in the extent to which three different biomarkers, dw, ms, and np, are correlated with therapeutic outcomes. In other words, to patients that express one or more of these biomarkers have better, or worse, outcomes that those that do not express these biomarkers?

Data Description

The data for this case study are in three separate files: dimarta_biomarker.csv, dimarta_demographics.csv, and dimarta_trial.csv. The files are located in the data_BaselRBootcamp_Day2.zip folder available through the schedule. Here are descriptions of the columns in these files:

dimarta_trial.csv

Table1. “dimarta_trial.csv” variable description:
Variable Description
PatientID Unique patient id
arm Treatment arm, either 1 = placebo, 2 = adiclax (the standard of treatment), or 3 = dimarta (the target drug)
histamine_start histamine value at the start of the trial
histamine_end histamine value at the end of the trial
qol_start Patient’s rated quality of life at the start of the trial
qol_end Patient’s rated quality of life at the end of the trial

dimarta_demographics.csv

Table2. “dimarta_demographics.csv” variable description:
Variable Description
PatientID Unique patient id
age Patient age
gender Patient gender, 0 = male, 1 = female
site Site where the clinical trial was conducted
diseasestatus Status of the patient’s disease at start of trial

dimarta_biomarker.csv

Table3. “dimarta_biomarker.csv” variable description:
Variable Description
PatientID Unique patient id
Biomarker One of three biomarkers: dw, ms, and np
BiomarkerStatus Result of the test for the biomarker.

Data I/O

  1. Create a new R project called dimarta. In that project, create two folders: R and data.

  2. Outside of R (e.g.; on your computer) save local copies of the three data files, dimarta_trial.csv, dimarta_demographics.csv, anddimarta_biomarker.csv in the data folder of your project.

  3. Open a new R script and save it as a new file in your R folder called dimarta_casestudy.R. At the top of the script, using comments, write your name and the date. Then, load the tidyverse package. Here’s how the top of your script should look:

## My Name
## The Date
## Dimarta - Case Study

library(tidyverse)
  1. Using read_csv(), load the dimarta_trial.csv, dimarta_demographics.csv, and dimarta_biomarker.csv datasets as three new objects called trial_df, demographics_df, and biomarker_df.
trial_df <- read_csv(file = "data/dimarta_trial.csv")
demographics_df <- read_csv(file = "data/dimarta_demographics.csv")
biomarker_df <- read_csv(file = "data/dimarta_biomarker.csv")
  1. Get a first impression of the objects you just created by exploring them with a mixture of the View(), head(), names(), and str() functions. Were they all loaded correctly?
trial_df
# A tibble: 300 x 6
   PatientID   arm histamine_start histamine_end qol_start qol_end
   <chr>     <dbl>           <dbl>         <dbl>     <dbl>   <dbl>
 1 txdjezeo     1.            58.6          67.0        3.      3.
 2 htxfjlxk     3.            36.1          26.0        3.      4.
 3 vkdqhyez     1.            57.7          57.3        2.      2.
 4 dbuvrwfq     3.            56.6          55.4        2.      3.
 5 ydaitaah     2.            64.7          62.9        5.      7.
 6 omhxokdr     1.            37.4          41.7        2.      2.
 7 dsybafny     1.            88.4          95.9        4.      6.
 8 fdfmcoto     2.            20.2          13.9        2.      2.
 9 rwsbykxe     2.            48.3          38.0        3.      3.
10 xocueqqe     3.            48.6          53.2        1.      1.
# ... with 290 more rows
demographics_df
# A tibble: 300 x 5
   PatientID   age gender site   diseasestatus
   <chr>     <dbl>  <dbl> <chr>  <chr>        
 1 pkyivajv    36.     0. Tokyo  Mid          
 2 dbuvrwfq    39.     0. Paris  Late         
 3 jhuztppp    30.     0. Tokyo  Mid          
 4 qejexgza    34.     1. Tokyo  Late         
 5 cszrjxju    41.     1. Tokyo  Late         
 6 uhvgttqh    31.     0. London Late         
 7 cflnybdw    45.     1. Tokyo  Mid          
 8 igobmmvj    48.     0. Tokyo  Late         
 9 lcrtmerg    35.     0. London Late         
10 fjjrnsnt    43.     1. London Mid          
# ... with 290 more rows
biomarker_df
# A tibble: 900 x 3
   PatientID Biomarker BiomarkerStatus
   <chr>     <chr>     <lgl>          
 1 ygazqssv  dw        FALSE          
 2 qosueuyw  ms        FALSE          
 3 bhhykjvw  ms        FALSE          
 4 ifajorty  np        TRUE           
 5 gxnsybdt  ms        FALSE          
 6 igobmmvj  ms        FALSE          
 7 knnzlzun  ms        FALSE          
 8 glzcbmby  ms        FALSE          
 9 gxnsybdt  dw        TRUE           
10 fzrhdpdu  np        FALSE          
# ... with 890 more rows

Data Wrangling

  1. Using rename(), change the name of the column arm in the trial_df data to StudyArm.
trial_df <- trial_df %>%
  rename(StudyArm = arm)
  1. Using the table() function, look at the values of the StudyArm column in trial_df. You’ll notice the values are 1, 2, and 3. Using mutate() and case_when() change these values to the appropriate names of the study arms (look at the variable descriptions to see which is which!)
trial_df <- trial_df %>%
  mutate(StudyArm = case_when(
    StudyArm == 1 ~ "placebo",
    StudyArm == 2 ~ "adiclax",
    StudyArm == 3 ~ "dimarta"
  ))
Warning: package 'bindrcpp' was built under R version 3.4.4
  1. In the demographics_df data, you’ll see that gender is coded as 0 and 1. Using mutate() create a new column in demographics_df called gender_c that shows gender as a string, where 0 = “male”, and 1 = “female”.
demographics_df <- demographics_df %>%
  mutate(gender_c = case_when(
    gender == 0 ~ "male",
    gender == 1 ~ "female"
  ))
  1. Now let’s create a new object called dimarta_df that combines data from trial_df and demographics_df. To do this, use left_join() to combine the trial_df data with the demographics_df data. This will merge the two datasets so you can have the study results and demographic data in the same dataframe. Make sure to assign the result to a new object called dimarta_df
# Create a new dataframe called dimarta_df that contains both trial_df and demographics_df
dimarta_df <- trial_df %>%
  left_join(demographics_df)
Joining, by = "PatientID"
  1. You’ll notice that the biomarker_df dataframe is in the ‘long’ format, where each row is a patient’s biomarker result. Using the code below, making use of the spread() function, create a new dataframe called biomarker_wide_df where each row is a patient, and the results from different biomarkers are in different columns. When you finish, look at biomarker_wide_df to see how it looks!
# Convert biomarker_df to a wide format using spread()
biomarker_wide_df <- biomarker_df %>%
  spread(Biomarker, BiomarkerStatus)

biomarker_wide_df
# A tibble: 300 x 4
   PatientID dw    ms    np   
   <chr>     <lgl> <lgl> <lgl>
 1 aiadwpbo  FALSE FALSE FALSE
 2 ajshuufj  FALSE FALSE FALSE
 3 ammweoxu  FALSE FALSE FALSE
 4 amzacliz  FALSE FALSE FALSE
 5 apeddxgo  TRUE  FALSE FALSE
 6 aqadnoup  TRUE  FALSE FALSE
 7 arinkxww  FALSE FALSE FALSE
 8 arpvtvxl  FALSE FALSE FALSE
 9 asquxnty  FALSE FALSE TRUE 
10 atzluemr  FALSE FALSE FALSE
# ... with 290 more rows
  1. Now, using the left_join function, add the biomarker_wide_df data to the dimarta_df data! Now you should hve all of the data in a single dataframe called dimarta_df
dimarta_df <- dimarta_df %>% 
  left_join(biomarker_wide_df)
Joining, by = "PatientID"
  1. View dimarta_df to make sure the data look correct! The data should have one row for each patient, and 13 separate columns, including dw, ms, and np
dimarta_df
# A tibble: 300 x 14
   PatientID StudyArm histamine_start histamine_end qol_start qol_end
   <chr>     <chr>              <dbl>         <dbl>     <dbl>   <dbl>
 1 txdjezeo  placebo             58.6          67.0        3.      3.
 2 htxfjlxk  dimarta             36.1          26.0        3.      4.
 3 vkdqhyez  placebo             57.7          57.3        2.      2.
 4 dbuvrwfq  dimarta             56.6          55.4        2.      3.
 5 ydaitaah  adiclax             64.7          62.9        5.      7.
 6 omhxokdr  placebo             37.4          41.7        2.      2.
 7 dsybafny  placebo             88.4          95.9        4.      6.
 8 fdfmcoto  adiclax             20.2          13.9        2.      2.
 9 rwsbykxe  adiclax             48.3          38.0        3.      3.
10 xocueqqe  dimarta             48.6          53.2        1.      1.
# ... with 290 more rows, and 8 more variables: age <dbl>, gender <dbl>,
#   site <chr>, diseasestatus <chr>, gender_c <chr>, dw <lgl>, ms <lgl>,
#   np <lgl>
  1. Using write_csv(), save dimarta_df in a new .csv file in your data folder called dimarta.csv
write_csv(x = dimarta_df,
          path = "data/dimarta.csv")
  1. Using the mean() function, calculate the mean age of all patients.
mean(dimarta_df$age)
[1] 39.93333
  1. Using the following template, find out how many male and female patients were in the trial.
dimarta_df %>%
  group_by(XXX) %>%
  summarise(
    Counts = n()    
  )
dimarta_df %>%
  group_by(gender_c) %>%
  summarise(
    Counts = n()    
  )
# A tibble: 2 x 2
  gender_c Counts
  <chr>     <int>
1 female      140
2 male        160
  1. Now, using similar code, find out how many patients were assigned to each study arm.
dimarta_df %>%
  group_by(StudyArm) %>%
  summarise(
    Counts = n()    
  )
# A tibble: 3 x 2
  StudyArm Counts
  <chr>     <int>
1 adiclax     100
2 dimarta     100
3 placebo     100
  1. Find out how many men and women were assigned to each study arm (Hint: You can use very similar code to what you used above, just add a second grouping variable!)
dimarta_df %>%
  group_by(StudyArm, gender_c) %>%
  summarise(
    Counts = n()    
  )
# A tibble: 6 x 3
# Groups:   StudyArm [?]
  StudyArm gender_c Counts
  <chr>    <chr>     <int>
1 adiclax  female       41
2 adiclax  male         59
3 dimarta  female       49
4 dimarta  male         51
5 placebo  female       50
6 placebo  male         50
  1. Using mutate(), add a new column to the dimarta_df data called histamine_change that shows the change in patient’s histamine levels from the start to the end of the trial (Hint: just subtract histamine_start from histamine_end!)
dimarta_df <- dimarta_df %>%
  mutate(
    histamine_change = histamine_end - histamine_start
  )

# Look at result
dimarta_df %>% 
  select(histamine_change)
# A tibble: 300 x 1
   histamine_change
              <dbl>
 1            8.40 
 2          -10.1  
 3           -0.390
 4           -1.24 
 5           -1.75 
 6            4.34 
 7            7.47 
 8           -6.35 
 9          -10.3  
10            4.58 
# ... with 290 more rows
  1. Using mutate() again, add a new column to dimarta_df called qol_change that shows the change in patient’s quality of life.
dimarta_df <- dimarta_df %>%
  mutate(
    qol_change = qol_end - qol_start
  )

# Look at result
dimarta_df %>% 
  select(qol_change)
# A tibble: 300 x 1
   qol_change
        <dbl>
 1         0.
 2         1.
 3         0.
 4         1.
 5         2.
 6         0.
 7         2.
 8         0.
 9         0.
10         0.
# ... with 290 more rows
  1. Calculate the percentage of patients who tested positive for each of the three biomarkers using the following template (Hint: If you calculate the mean() of a logical vector, you will get the percentage of TRUE values!)
# Calculate percent of patients with positive biomarkers

dimarta_df %>%
  summarise(
    dw_mean = mean(XXX),
    ms_percent = mean(XXX),
    np_percent = mean(XXX)
  )
# Calculate percent of patients with positive biomarkers

dimarta_df %>%
  summarise(
    dw_mean = mean(dw),
    ms_percent = mean(ms),
    np_percent = mean(np)
  )
# A tibble: 1 x 3
  dw_mean ms_percent np_percent
    <dbl>      <dbl>      <dbl>
1   0.257      0.190      0.233
  1. Were there different distributions of age in the different trial sites? To answer this, separately calculate the mean and standard deviations of patient ages in each site. (Hint: group the data by site, then calculate two separate summary statistics: age_mean = mean(age), and age_sd = sd(age).
# Calculate the mean change in histamine for each study site
dimarta_df %>%
  group_by(site) %>%
  summarise(
    age_mean = mean(age),
    age_sd = sd(age)
  )
# A tibble: 3 x 3
  site   age_mean age_sd
  <chr>     <dbl>  <dbl>
1 London     39.9   5.86
2 Paris      39.8   4.84
3 Tokyo      40.1   4.50
  1. Calculate the mean change in histamine results separately for each study site
# Calculate the mean change in histamine for each study site
dimarta_df %>%
  group_by(site) %>%
  summarise(
    histamine_change_mean = mean(histamine_change, na.rm = TRUE)
  )
# A tibble: 3 x 2
  site   histamine_change_mean
  <chr>                  <dbl>
1 London               -0.352 
2 Paris                 0.0180
3 Tokyo                -1.10  
  1. Calculate the mean change in histamine results (histamine_change) for each study arm. Which study arm had a largest decrease in histamine?
# Calculate the mean change in histamine for each study site
dimarta_df %>%
  group_by(StudyArm) %>%
  summarise(
    histamine_change_mean = mean(histamine_change, na.rm = TRUE)
  )
# A tibble: 3 x 2
  StudyArm histamine_change_mean
  <chr>                    <dbl>
1 adiclax                 -4.79 
2 dimarta                  0.513
3 placebo                  2.90 
  1. Calculate the mean change in quality of life (qol_change) for each study arm. Do the results match what you found with the histamine results?
# Calculate the mean change in histamine for each study site
dimarta_df %>%
  group_by(StudyArm) %>%
  summarise(
    qol_change_mean = mean(qol_change, na.rm = TRUE)
  )
# A tibble: 3 x 2
  StudyArm qol_change_mean
  <chr>              <dbl>
1 adiclax           0.0600
2 dimarta           0.0100
3 placebo          -0.150 

Plotting

  1. Run the following code block to visualise the relationship between study arm and histamine change. What do you find?
ggplot(data = dimarta_df,
       mapping = aes(x = StudyArm, y = histamine_change)) +
  geom_boxplot()

Statistics

  1. Using t.test() conduct a t-test comparing the change in histamine results between the placebo and dimarta. Did dimarta differ from the placebo?
# T.test comparing change in histamine between placebo and dimarta

t.test(formula = XX ~ XX,
       data = XXX %>%
              filter(XXX %in% c(XXX, XXX)))  # Only include placebo and dimarta
# T.test comparing change in histamine between placebo and dimarta

t.test(formula = histamine_change ~ StudyArm,
       data = dimarta_df %>%
              filter(StudyArm %in% c("placebo", "dimarta")))  # Only include placebo and dimarta

    Welch Two Sample t-test

data:  histamine_change by StudyArm
t = -3.3406, df = 197.63, p-value = 0.0009995
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.788957 -0.976043
sample estimates:
mean in group dimarta mean in group placebo 
               0.5130                2.8955 
  1. Using t.test() conduct a t-test comparing the change in histamine results between the adiclax (the standard of care) and dimarta. Did dimarta improve over the standard of care?
# T.test comparing change in histamine between placebo and dimarta

t.test(formula = histamine_change ~ StudyArm,
       data = dimarta_df %>%
              filter(StudyArm %in% c("adiclax", "dimarta")))  # Only include placebo and dimarta

    Welch Two Sample t-test

data:  histamine_change by StudyArm
t = -7.4914, df = 197.31, p-value = 2.233e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.698596 -3.906804
sample estimates:
mean in group adiclax mean in group dimarta 
              -4.7897                0.5130 
  1. Using glm() conduct a regression analysis predicting histamine_change as a function of 4 variables: treatment arm, histamine test results at the start of the trial, age, and quality of life at the start of the trial. Save the result as an object called histamine_change_glm. Once you do, apply the summary() function to the histamine_change_glm object to explore the results. Which variables reliably predict changes in test results?
histamine_change_glm <- glm(histamine_change ~ StudyArm + histamine_start + qol_start + age,
                            data = dimarta_df)

summary(histamine_change_glm)

Call:
glm(formula = histamine_change ~ StudyArm + histamine_start + 
    qol_start + age, data = dimarta_df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-12.7313   -3.5004   -0.1632    3.6559   14.2557  

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -7.560739   2.682141  -2.819  0.00515 ** 
StudyArmdimarta  5.324078   0.704761   7.554 5.36e-13 ***
StudyArmplacebo  7.648549   0.706447  10.827  < 2e-16 ***
histamine_start -0.001245   0.017221  -0.072  0.94242    
qol_start       -0.231201   0.285169  -0.811  0.41817    
age              0.088735   0.056956   1.558  0.12032    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 24.80221)

    Null deviance: 10464.4  on 299  degrees of freedom
Residual deviance:  7291.8  on 294  degrees of freedom
AIC: 1822.6

Number of Fisher Scoring iterations: 2
  1. Repeat your previous regression analysis, but now predict change in quality of life. Do you get different results compared to your previous analysis?
qol_change_glm <- glm(qol_change ~ StudyArm + histamine_start + qol_start + age,
                            data = dimarta_df)

summary(qol_change_glm)

Call:
glm(formula = qol_change ~ StudyArm + histamine_start + qol_start + 
    age, data = dimarta_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0477  -0.7925   0.0022   0.8319   2.9564  

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.508547   0.513842   0.990    0.323
StudyArmdimarta -0.050578   0.135018  -0.375    0.708
StudyArmplacebo -0.209041   0.135340  -1.545    0.124
histamine_start -0.002300   0.003299  -0.697    0.486
qol_start        0.036483   0.054633   0.668    0.505
age             -0.011144   0.010912  -1.021    0.308

(Dispersion parameter for gaussian family taken to be 0.9103048)

    Null deviance: 271.79  on 299  degrees of freedom
Residual deviance: 267.63  on 294  degrees of freedom
AIC: 831.11

Number of Fisher Scoring iterations: 2
  1. Now it’s time to see if the patient’s biomarkers were related to treatment success. For the dw biomarker, calculate the mean change in test results (histamine_change) separately for patients with different outcomes on the dw biomarker (hint: just group the data by dw and use summarise() to calculate the mean histamine_change).
dimarta_df %>%
  group_by(dw) %>%
  summarise(
    histamine_change_mean = mean(histamine_change, na.rm = TRUE)
  )
# A tibble: 2 x 2
  dw    histamine_change_mean
  <lgl>                 <dbl>
1 FALSE                -0.464
2 TRUE                 -0.450
  1. Do the same analysis for the other two biomarkers ms and np. Do either of these biomarkers seem to predict changes in test results?
dimarta_df %>%
  group_by(ms) %>%
  summarise(
    histamine_change_mean = mean(histamine_change, na.rm = TRUE)
  )
# A tibble: 2 x 2
  ms    histamine_change_mean
  <lgl>                 <dbl>
1 FALSE               -0.549 
2 TRUE                -0.0812
dimarta_df %>%
  group_by(np) %>%
  summarise(
    histamine_change_mean = mean(histamine_change, na.rm = TRUE)
  )
# A tibble: 2 x 2
  np    histamine_change_mean
  <lgl>                 <dbl>
1 FALSE                -0.635
2 TRUE                  0.113
  1. Using glm(), create a new regression object called histamine_change_bio_glm predicting histamine_change as a function of the 3 biomarkers. Explore the results with summary(). Do you find that any of these biomarkers predict changes in histamine?
histamine_change_bio_glm <- glm(histamine_change ~ dw + np + ms,
                                data = dimarta_df)

summary(histamine_change_bio_glm)

Call:
glm(formula = histamine_change ~ dw + np + ms, data = dimarta_df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-16.0259   -4.4108    0.1053    4.1112   13.4225  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.72247    0.47032  -1.536    0.126
dwTRUE       0.02633    0.78484   0.034    0.973
npTRUE       0.73427    0.81059   0.906    0.366
msTRUE       0.44201    0.87436   0.506    0.614

(Dispersion parameter for gaussian family taken to be 35.22088)

    Null deviance: 10464  on 299  degrees of freedom
Residual deviance: 10425  on 296  degrees of freedom
AIC: 1925.8

Number of Fisher Scoring iterations: 2
  1. Did some drugs work better for patients with some biomarkers than others? For example, did patients who expressed the dw biomarker have better results when given dimarta compared to patients who do not express the dw biomarker? To answer this, start by calculating the descriptive statistics by calculating mean change in histamine histamine_change for all groups of dw and StudyArm (Hint: Just use group_by(dw, StudyArm) and summarise(histamine_change_mean = mean(histamine_change))).
dimarta_df %>%
  group_by(dw, StudyArm) %>%
  summarise(
    histamine_change_mean = mean(histamine_change, na.rm = TRUE)
  )
# A tibble: 6 x 3
# Groups:   dw [?]
  dw    StudyArm histamine_change_mean
  <lgl> <chr>                    <dbl>
1 FALSE adiclax                 -4.95 
2 FALSE dimarta                  0.518
3 FALSE placebo                  2.88 
4 TRUE  adiclax                 -4.37 
5 TRUE  dimarta                  0.495
6 TRUE  placebo                  2.94 
  1. Once you’ve looked at the descriptive statistics, conduct a regression analysis predicting histamine_change based on the interaction between dw and StudyArm. Remember to calculate an interaction term in regression, use the * symbol in the formula. What do the results show?
dw_arm_glm <- glm(histamine_change ~ dw * StudyArm,
                  data = dimarta_df)

summary(dw_arm_glm)

Call:
glm(formula = histamine_change ~ dw * StudyArm, data = dimarta_df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-12.2783   -3.3999   -0.2432    3.6346   14.5618  

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -4.9518     0.5898  -8.396 1.98e-15 ***
dwTRUE                   0.5789     1.1145   0.519    0.604    
StudyArmdimarta          5.4701     0.8204   6.668 1.28e-10 ***
StudyArmplacebo          7.8321     0.8284   9.455  < 2e-16 ***
dwTRUE:StudyArmdimarta  -0.6020     1.6298  -0.369    0.712    
dwTRUE:StudyArmplacebo  -0.5204     1.5949  -0.326    0.744    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25.04207)

    Null deviance: 10464.4  on 299  degrees of freedom
Residual deviance:  7362.4  on 294  degrees of freedom
AIC: 1825.5

Number of Fisher Scoring iterations: 2