In this case study, we will look at the results of a clinical trial exploring the effectiveness of a new medication called dimarta on reducing histamine in patients with a disease that leads to chronically high histamine levels. In the study, 300 patients were assigned to one of three different treatment arms. One arm was given a placebo. The other arm was given adiclax – the standard of care for the disease. Finally, the third arm was given dimarta. There were two main measures of interest in the trial: patient’s changes in histamine from the beginning to the end of the trial, and their change in quality of life (measured by self report).

In addition to exploring the effects of the three medications, the researchers are interested in the extent to which three different biomarkers, dw, ms, and np, are correlated with therapeutic outcomes. In other words, to patients that express one or more of these biomarkers have better, or worse, outcomes that those that do not express these biomarkers?

The data for this case study are in three separate files: `dimarta_biomarker.csv`

, `dimarta_demographics.csv`

, and `dimarta_trial.csv`

. The files are located in the `data_BaselRBootcamp_Day2.zip`

folder available through the schedule. Here are descriptions of the columns in these files:

**dimarta_trial.csv**

Variable | Description |
---|---|

PatientID | Unique patient id |

arm | Treatment arm, either 1 = placebo, 2 = adiclax (the standard of treatment), or 3 = dimarta (the target drug) |

histamine_start | histamine value at the start of the trial |

histamine_end | histamine value at the end of the trial |

qol_start | Patient’s rated quality of life at the start of the trial |

qol_end | Patient’s rated quality of life at the end of the trial |

**dimarta_demographics.csv**

Variable | Description |
---|---|

PatientID | Unique patient id |

age | Patient age |

gender | Patient gender, 0 = male, 1 = female |

site | Site where the clinical trial was conducted |

diseasestatus | Status of the patient’s disease at start of trial |

**dimarta_biomarker.csv**

Variable | Description |
---|---|

PatientID | Unique patient id |

Biomarker | One of three biomarkers: dw, ms, and np |

BiomarkerStatus | Result of the test for the biomarker. |

Create a new R project called

`dimarta`

. In that project, create two folders:`R`

and`data`

.Outside of R (e.g.; on your computer) save local copies of the three data files,

`dimarta_trial.csv`

,`dimarta_demographics.csv`

, and`dimarta_biomarker.csv`

in the`data`

folder of your project.Open a new R script and save it as a new file in your

`R`

folder called`dimarta_casestudy.R`

. At the top of the script, using comments, write your name and the date. Then, load the`tidyverse`

package. Here’s how the top of your script should look:

```
## My Name
## The Date
## Dimarta - Case Study
library(tidyverse)
```

- Using
`read_csv()`

, load the`dimarta_trial.csv`

,`dimarta_demographics.csv`

, and`dimarta_biomarker.csv`

datasets as*three*new objects called`trial_df`

,`demographics_df`

, and`biomarker_df`

.

```
trial_df <- read_csv(file = "data/dimarta_trial.csv")
demographics_df <- read_csv(file = "data/dimarta_demographics.csv")
biomarker_df <- read_csv(file = "data/dimarta_biomarker.csv")
```

- Get a first impression of the objects you just created by exploring them with a mixture of the
`View()`

,`head()`

,`names()`

, and`str()`

functions. Were they all loaded correctly?

`trial_df`

```
# A tibble: 300 x 6
PatientID arm histamine_start histamine_end qol_start qol_end
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 txdjezeo 1. 58.6 67.0 3. 3.
2 htxfjlxk 3. 36.1 26.0 3. 4.
3 vkdqhyez 1. 57.7 57.3 2. 2.
4 dbuvrwfq 3. 56.6 55.4 2. 3.
5 ydaitaah 2. 64.7 62.9 5. 7.
6 omhxokdr 1. 37.4 41.7 2. 2.
7 dsybafny 1. 88.4 95.9 4. 6.
8 fdfmcoto 2. 20.2 13.9 2. 2.
9 rwsbykxe 2. 48.3 38.0 3. 3.
10 xocueqqe 3. 48.6 53.2 1. 1.
# ... with 290 more rows
```

`demographics_df`

```
# A tibble: 300 x 5
PatientID age gender site diseasestatus
<chr> <dbl> <dbl> <chr> <chr>
1 pkyivajv 36. 0. Tokyo Mid
2 dbuvrwfq 39. 0. Paris Late
3 jhuztppp 30. 0. Tokyo Mid
4 qejexgza 34. 1. Tokyo Late
5 cszrjxju 41. 1. Tokyo Late
6 uhvgttqh 31. 0. London Late
7 cflnybdw 45. 1. Tokyo Mid
8 igobmmvj 48. 0. Tokyo Late
9 lcrtmerg 35. 0. London Late
10 fjjrnsnt 43. 1. London Mid
# ... with 290 more rows
```

`biomarker_df`

```
# A tibble: 900 x 3
PatientID Biomarker BiomarkerStatus
<chr> <chr> <lgl>
1 ygazqssv dw FALSE
2 qosueuyw ms FALSE
3 bhhykjvw ms FALSE
4 ifajorty np TRUE
5 gxnsybdt ms FALSE
6 igobmmvj ms FALSE
7 knnzlzun ms FALSE
8 glzcbmby ms FALSE
9 gxnsybdt dw TRUE
10 fzrhdpdu np FALSE
# ... with 890 more rows
```

- Using
`rename()`

, change the name of the column`arm`

in the`trial_df`

data to`StudyArm`

.

```
trial_df <- trial_df %>%
rename(StudyArm = arm)
```

- Using the
`table()`

function, look at the values of the`StudyArm`

column in`trial_df`

. You’ll notice the values are 1, 2, and 3. Using`mutate()`

and`case_when()`

change these values to the appropriate names of the study arms (look at the variable descriptions to see which is which!)

```
trial_df <- trial_df %>%
mutate(StudyArm = case_when(
StudyArm == 1 ~ "placebo",
StudyArm == 2 ~ "adiclax",
StudyArm == 3 ~ "dimarta"
))
```

`Warning: package 'bindrcpp' was built under R version 3.4.4`

- In the
`demographics_df`

data, you’ll see that gender is coded as 0 and 1. Using`mutate()`

create a*new*column in`demographics_df`

called`gender_c`

that shows gender as a string, where 0 = “male”, and 1 = “female”.

```
demographics_df <- demographics_df %>%
mutate(gender_c = case_when(
gender == 0 ~ "male",
gender == 1 ~ "female"
))
```

- Now let’s create a new object called
`dimarta_df`

that combines data from`trial_df`

and`demographics_df`

. To do this, use`left_join()`

to combine the`trial_df`

data with the`demographics_df`

data. This will merge the two datasets so you can have the study results and demographic data in the same dataframe. Make sure to assign the result to a new object called`dimarta_df`

```
# Create a new dataframe called dimarta_df that contains both trial_df and demographics_df
dimarta_df <- trial_df %>%
left_join(demographics_df)
```

`Joining, by = "PatientID"`

- You’ll notice that the
`biomarker_df`

dataframe is in the ‘long’ format, where each row is a patient’s biomarker result. Using the code below, making use of the`spread()`

function, create a new dataframe called`biomarker_wide_df`

where each row is a patient, and the results from different biomarkers are in different columns. When you finish, look at`biomarker_wide_df`

to see how it looks!

```
# Convert biomarker_df to a wide format using spread()
biomarker_wide_df <- biomarker_df %>%
spread(Biomarker, BiomarkerStatus)
biomarker_wide_df
```

```
# A tibble: 300 x 4
PatientID dw ms np
<chr> <lgl> <lgl> <lgl>
1 aiadwpbo FALSE FALSE FALSE
2 ajshuufj FALSE FALSE FALSE
3 ammweoxu FALSE FALSE FALSE
4 amzacliz FALSE FALSE FALSE
5 apeddxgo TRUE FALSE FALSE
6 aqadnoup TRUE FALSE FALSE
7 arinkxww FALSE FALSE FALSE
8 arpvtvxl FALSE FALSE FALSE
9 asquxnty FALSE FALSE TRUE
10 atzluemr FALSE FALSE FALSE
# ... with 290 more rows
```

- Now, using the
`left_join`

function, add the`biomarker_wide_df`

data to the`dimarta_df`

data! Now you should hve all of the data in a single dataframe called`dimarta_df`

```
dimarta_df <- dimarta_df %>%
left_join(biomarker_wide_df)
```

`Joining, by = "PatientID"`

- View
`dimarta_df`

to make sure the data look correct! The data should have one row for each patient, and 13 separate columns, including`dw`

,`ms`

, and`np`

`dimarta_df`

```
# A tibble: 300 x 14
PatientID StudyArm histamine_start histamine_end qol_start qol_end
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 txdjezeo placebo 58.6 67.0 3. 3.
2 htxfjlxk dimarta 36.1 26.0 3. 4.
3 vkdqhyez placebo 57.7 57.3 2. 2.
4 dbuvrwfq dimarta 56.6 55.4 2. 3.
5 ydaitaah adiclax 64.7 62.9 5. 7.
6 omhxokdr placebo 37.4 41.7 2. 2.
7 dsybafny placebo 88.4 95.9 4. 6.
8 fdfmcoto adiclax 20.2 13.9 2. 2.
9 rwsbykxe adiclax 48.3 38.0 3. 3.
10 xocueqqe dimarta 48.6 53.2 1. 1.
# ... with 290 more rows, and 8 more variables: age <dbl>, gender <dbl>,
# site <chr>, diseasestatus <chr>, gender_c <chr>, dw <lgl>, ms <lgl>,
# np <lgl>
```

- Using
`write_csv()`

, save`dimarta_df`

in a new`.csv`

file in your`data`

folder called`dimarta.csv`

```
write_csv(x = dimarta_df,
path = "data/dimarta.csv")
```

- Using the
`mean()`

function, calculate the mean age of all patients.

`mean(dimarta_df$age)`

`[1] 39.93333`

- Using the following template, find out how many male and female patients were in the trial.

```
dimarta_df %>%
group_by(XXX) %>%
summarise(
Counts = n()
)
```

```
dimarta_df %>%
group_by(gender_c) %>%
summarise(
Counts = n()
)
```

```
# A tibble: 2 x 2
gender_c Counts
<chr> <int>
1 female 140
2 male 160
```

- Now, using similar code, find out how many patients were assigned to each study arm.

```
dimarta_df %>%
group_by(StudyArm) %>%
summarise(
Counts = n()
)
```

```
# A tibble: 3 x 2
StudyArm Counts
<chr> <int>
1 adiclax 100
2 dimarta 100
3 placebo 100
```

- Find out how many men and women were assigned to each study arm (Hint: You can use very similar code to what you used above, just add a second grouping variable!)

```
dimarta_df %>%
group_by(StudyArm, gender_c) %>%
summarise(
Counts = n()
)
```

```
# A tibble: 6 x 3
# Groups: StudyArm [?]
StudyArm gender_c Counts
<chr> <chr> <int>
1 adiclax female 41
2 adiclax male 59
3 dimarta female 49
4 dimarta male 51
5 placebo female 50
6 placebo male 50
```

- Using
`mutate()`

, add a new column to the`dimarta_df`

data called`histamine_change`

that shows the*change*in patient’s histamine levels from the start to the end of the trial (Hint: just subtract`histamine_start`

from`histamine_end`

!)

```
dimarta_df <- dimarta_df %>%
mutate(
histamine_change = histamine_end - histamine_start
)
# Look at result
dimarta_df %>%
select(histamine_change)
```

```
# A tibble: 300 x 1
histamine_change
<dbl>
1 8.40
2 -10.1
3 -0.390
4 -1.24
5 -1.75
6 4.34
7 7.47
8 -6.35
9 -10.3
10 4.58
# ... with 290 more rows
```

- Using
`mutate()`

again, add a new column to`dimarta_df`

called`qol_change`

that shows the change in patient’s quality of life.

```
dimarta_df <- dimarta_df %>%
mutate(
qol_change = qol_end - qol_start
)
# Look at result
dimarta_df %>%
select(qol_change)
```

```
# A tibble: 300 x 1
qol_change
<dbl>
1 0.
2 1.
3 0.
4 1.
5 2.
6 0.
7 2.
8 0.
9 0.
10 0.
# ... with 290 more rows
```

- Calculate the percentage of patients who tested positive for each of the three biomarkers using the following template (Hint: If you calculate the
`mean()`

of a logical vector, you will get the percentage of TRUE values!)

```
# Calculate percent of patients with positive biomarkers
dimarta_df %>%
summarise(
dw_mean = mean(XXX),
ms_percent = mean(XXX),
np_percent = mean(XXX)
)
```

```
# Calculate percent of patients with positive biomarkers
dimarta_df %>%
summarise(
dw_mean = mean(dw),
ms_percent = mean(ms),
np_percent = mean(np)
)
```

```
# A tibble: 1 x 3
dw_mean ms_percent np_percent
<dbl> <dbl> <dbl>
1 0.257 0.190 0.233
```

- Were there different distributions of age in the different trial sites? To answer this, separately calculate the mean and standard deviations of patient ages in each site. (Hint: group the data by
`site`

, then calculate two separate summary statistics:`age_mean = mean(age)`

, and`age_sd = sd(age)`

.

```
# Calculate the mean change in histamine for each study site
dimarta_df %>%
group_by(site) %>%
summarise(
age_mean = mean(age),
age_sd = sd(age)
)
```

```
# A tibble: 3 x 3
site age_mean age_sd
<chr> <dbl> <dbl>
1 London 39.9 5.86
2 Paris 39.8 4.84
3 Tokyo 40.1 4.50
```

- Calculate the mean change in histamine results separately for each study site

```
# Calculate the mean change in histamine for each study site
dimarta_df %>%
group_by(site) %>%
summarise(
histamine_change_mean = mean(histamine_change, na.rm = TRUE)
)
```

```
# A tibble: 3 x 2
site histamine_change_mean
<chr> <dbl>
1 London -0.352
2 Paris 0.0180
3 Tokyo -1.10
```

- Calculate the mean change in histamine results (
`histamine_change`

) for each study arm. Which study arm had a largest decrease in histamine?

```
# Calculate the mean change in histamine for each study site
dimarta_df %>%
group_by(StudyArm) %>%
summarise(
histamine_change_mean = mean(histamine_change, na.rm = TRUE)
)
```

```
# A tibble: 3 x 2
StudyArm histamine_change_mean
<chr> <dbl>
1 adiclax -4.79
2 dimarta 0.513
3 placebo 2.90
```

- Calculate the mean change in quality of life (
`qol_change`

) for each study arm. Do the results match what you found with the histamine results?

```
# Calculate the mean change in histamine for each study site
dimarta_df %>%
group_by(StudyArm) %>%
summarise(
qol_change_mean = mean(qol_change, na.rm = TRUE)
)
```

```
# A tibble: 3 x 2
StudyArm qol_change_mean
<chr> <dbl>
1 adiclax 0.0600
2 dimarta 0.0100
3 placebo -0.150
```

- Run the following code block to visualise the relationship between study arm and histamine change. What do you find?

```
ggplot(data = dimarta_df,
mapping = aes(x = StudyArm, y = histamine_change)) +
geom_boxplot()
```

- Using
`t.test()`

conduct a t-test comparing the change in histamine results between the placebo and dimarta. Did dimarta differ from the placebo?

```
# T.test comparing change in histamine between placebo and dimarta
t.test(formula = XX ~ XX,
data = XXX %>%
filter(XXX %in% c(XXX, XXX))) # Only include placebo and dimarta
```

```
# T.test comparing change in histamine between placebo and dimarta
t.test(formula = histamine_change ~ StudyArm,
data = dimarta_df %>%
filter(StudyArm %in% c("placebo", "dimarta"))) # Only include placebo and dimarta
```

```
Welch Two Sample t-test
data: histamine_change by StudyArm
t = -3.3406, df = 197.63, p-value = 0.0009995
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.788957 -0.976043
sample estimates:
mean in group dimarta mean in group placebo
0.5130 2.8955
```

- Using
`t.test()`

conduct a t-test comparing the change in histamine results between the adiclax (the standard of care) and dimarta. Did dimarta improve over the standard of care?

```
# T.test comparing change in histamine between placebo and dimarta
t.test(formula = histamine_change ~ StudyArm,
data = dimarta_df %>%
filter(StudyArm %in% c("adiclax", "dimarta"))) # Only include placebo and dimarta
```

```
Welch Two Sample t-test
data: histamine_change by StudyArm
t = -7.4914, df = 197.31, p-value = 2.233e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.698596 -3.906804
sample estimates:
mean in group adiclax mean in group dimarta
-4.7897 0.5130
```

- Using
`glm()`

conduct a regression analysis predicting`histamine_change`

as a function of 4 variables: treatment arm, histamine test results at the start of the trial, age, and quality of life at the start of the trial. Save the result as an object called`histamine_change_glm`

. Once you do, apply the`summary()`

function to the`histamine_change_glm`

object to explore the results. Which variables reliably predict changes in test results?

```
histamine_change_glm <- glm(histamine_change ~ StudyArm + histamine_start + qol_start + age,
data = dimarta_df)
summary(histamine_change_glm)
```

```
Call:
glm(formula = histamine_change ~ StudyArm + histamine_start +
qol_start + age, data = dimarta_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-12.7313 -3.5004 -0.1632 3.6559 14.2557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.560739 2.682141 -2.819 0.00515 **
StudyArmdimarta 5.324078 0.704761 7.554 5.36e-13 ***
StudyArmplacebo 7.648549 0.706447 10.827 < 2e-16 ***
histamine_start -0.001245 0.017221 -0.072 0.94242
qol_start -0.231201 0.285169 -0.811 0.41817
age 0.088735 0.056956 1.558 0.12032
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 24.80221)
Null deviance: 10464.4 on 299 degrees of freedom
Residual deviance: 7291.8 on 294 degrees of freedom
AIC: 1822.6
Number of Fisher Scoring iterations: 2
```

- Repeat your previous regression analysis, but now predict change in quality of life. Do you get different results compared to your previous analysis?

```
qol_change_glm <- glm(qol_change ~ StudyArm + histamine_start + qol_start + age,
data = dimarta_df)
summary(qol_change_glm)
```

```
Call:
glm(formula = qol_change ~ StudyArm + histamine_start + qol_start +
age, data = dimarta_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0477 -0.7925 0.0022 0.8319 2.9564
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.508547 0.513842 0.990 0.323
StudyArmdimarta -0.050578 0.135018 -0.375 0.708
StudyArmplacebo -0.209041 0.135340 -1.545 0.124
histamine_start -0.002300 0.003299 -0.697 0.486
qol_start 0.036483 0.054633 0.668 0.505
age -0.011144 0.010912 -1.021 0.308
(Dispersion parameter for gaussian family taken to be 0.9103048)
Null deviance: 271.79 on 299 degrees of freedom
Residual deviance: 267.63 on 294 degrees of freedom
AIC: 831.11
Number of Fisher Scoring iterations: 2
```

- Now it’s time to see if the patient’s biomarkers were related to treatment success. For the
`dw`

biomarker, calculate the mean change in test results (`histamine_change`

) separately for patients with different outcomes on the`dw`

biomarker (hint: just group the data by`dw`

and use`summarise()`

to calculate the mean`histamine_change`

).

```
dimarta_df %>%
group_by(dw) %>%
summarise(
histamine_change_mean = mean(histamine_change, na.rm = TRUE)
)
```

```
# A tibble: 2 x 2
dw histamine_change_mean
<lgl> <dbl>
1 FALSE -0.464
2 TRUE -0.450
```

- Do the same analysis for the other two biomarkers
`ms`

and`np`

. Do either of these biomarkers seem to predict changes in test results?

```
dimarta_df %>%
group_by(ms) %>%
summarise(
histamine_change_mean = mean(histamine_change, na.rm = TRUE)
)
```

```
# A tibble: 2 x 2
ms histamine_change_mean
<lgl> <dbl>
1 FALSE -0.549
2 TRUE -0.0812
```

```
dimarta_df %>%
group_by(np) %>%
summarise(
histamine_change_mean = mean(histamine_change, na.rm = TRUE)
)
```

```
# A tibble: 2 x 2
np histamine_change_mean
<lgl> <dbl>
1 FALSE -0.635
2 TRUE 0.113
```

- Using
`glm()`

, create a new regression object called`histamine_change_bio_glm`

predicting`histamine_change`

as a function of the 3 biomarkers. Explore the results with`summary()`

. Do you find that any of these biomarkers predict changes in histamine?

```
histamine_change_bio_glm <- glm(histamine_change ~ dw + np + ms,
data = dimarta_df)
summary(histamine_change_bio_glm)
```

```
Call:
glm(formula = histamine_change ~ dw + np + ms, data = dimarta_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-16.0259 -4.4108 0.1053 4.1112 13.4225
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.72247 0.47032 -1.536 0.126
dwTRUE 0.02633 0.78484 0.034 0.973
npTRUE 0.73427 0.81059 0.906 0.366
msTRUE 0.44201 0.87436 0.506 0.614
(Dispersion parameter for gaussian family taken to be 35.22088)
Null deviance: 10464 on 299 degrees of freedom
Residual deviance: 10425 on 296 degrees of freedom
AIC: 1925.8
Number of Fisher Scoring iterations: 2
```

- Did some drugs work better for patients with some biomarkers than others? For example, did patients who expressed the
`dw`

biomarker have better results when given dimarta compared to patients who do not express the`dw`

biomarker? To answer this, start by calculating the descriptive statistics by calculating mean change in histamine`histamine_change`

for all groups of`dw`

and`StudyArm`

(Hint: Just use`group_by(dw, StudyArm)`

and`summarise(histamine_change_mean = mean(histamine_change))`

).

```
dimarta_df %>%
group_by(dw, StudyArm) %>%
summarise(
histamine_change_mean = mean(histamine_change, na.rm = TRUE)
)
```

```
# A tibble: 6 x 3
# Groups: dw [?]
dw StudyArm histamine_change_mean
<lgl> <chr> <dbl>
1 FALSE adiclax -4.95
2 FALSE dimarta 0.518
3 FALSE placebo 2.88
4 TRUE adiclax -4.37
5 TRUE dimarta 0.495
6 TRUE placebo 2.94
```

- Once you’ve looked at the descriptive statistics, conduct a regression analysis predicting
`histamine_change`

based on the*interaction*between`dw`

and`StudyArm`

. Remember to calculate an interaction term in regression, use the`*`

symbol in the formula. What do the results show?

```
dw_arm_glm <- glm(histamine_change ~ dw * StudyArm,
data = dimarta_df)
summary(dw_arm_glm)
```

```
Call:
glm(formula = histamine_change ~ dw * StudyArm, data = dimarta_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-12.2783 -3.3999 -0.2432 3.6346 14.5618
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9518 0.5898 -8.396 1.98e-15 ***
dwTRUE 0.5789 1.1145 0.519 0.604
StudyArmdimarta 5.4701 0.8204 6.668 1.28e-10 ***
StudyArmplacebo 7.8321 0.8284 9.455 < 2e-16 ***
dwTRUE:StudyArmdimarta -0.6020 1.6298 -0.369 0.712
dwTRUE:StudyArmplacebo -0.5204 1.5949 -0.326 0.744
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 25.04207)
Null deviance: 10464.4 on 299 degrees of freedom
Residual deviance: 7362.4 on 294 degrees of freedom
AIC: 1825.5
Number of Fisher Scoring iterations: 2
```