In this practical you’ll conduct machine learning analyses on a dataset on heart disease. You will see how well many different machine learning models can predict new data. By the end of this practical you will know how to:

- Create separate training and test data
- Fit a model to data
- Explore a model
- Make predictions from a model
- Compare models in how well they can predict new data.

Here are the main functions and packages you’ll be using. For more information about the specific models, click on the link in *Additional Details*.

Algorithm | Function | Package | Additional Details |
---|---|---|---|

Regression | `glm()` |
Base R | https://bookdown.org/ndphillips/YaRrr/regression.html#the-linear-model |

Fast-and-Frugal decision trees | `FFTrees()` |
FFTrees | https://cran.r-project.org/web/packages/FFTrees/vignettes/guide.html |

Random Forests | `randomForest()` |
`randomForest` |
http://www.blopig.com/blog/2017/04/a-very-basic-introduction-to-random-forests-using-r/ |

You’ll use two datasets in this practical: `heartdisease.csv`

and `ACTG175.csv`

. They available in the `data_BaselRBootcamp_Day2.zip`

file available through the main course page. If you haven’t already, download the `data_BaselRBootcamp_Day2.zip`

folder and unzip it to get the two files.

- The following examples will take you through all steps of the machine learning process, from creating training and test data, to fitting models, to making predictions. Follow along and try to see how piece of code works!

```
# -----------------------------------------------
# A step-by-step tutorial for conducting machine learning
# In this tutorial, we'll see how well 3 different models can
# predict medical data
# ------------------------------------------------
# -----------------------
# Part A:
# Load libraries
# -----------------------
library(tidyverse) # for dplyr and ggplot2
library(randomForest) # for randomForest()
library(FFTrees) # for FFTrees and the heartdisease data
# -----------------------
# Part B: Create datasets
# heart_train, heart_test
# -----------------------
heartdisease <- read_csv(file = "data/heartdisease.csv") # Save a copy of the heartdisease data as heart
set.seed(100) # To fix the training / test randomization
heartdisease <- heartdisease %>%
mutate_if(is.character, factor) %>% # Convert character to factor
sample_frac(1) # Randomly sort rows
# Savew first 100 rows as heart_train and remaining as heart_test
heart_train <- heartdisease %>%
slice(1:100)
heart_test <- heartdisease %>%
slice(101:nrow(heartdisease))
# ------------------------------
# Part I: Build Models
# ------------------------------
# Build FFTrees_model
FFTrees_model <- FFTrees(formula = sex ~ .,
data = heart_train)
# Build glm_model
glm_model <- glm(formula = factor(sex) ~ .,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_model <- randomForest(formula = factor(sex) ~ .,
data = heart_train)
# ------------------------------
# Part II: Explore Models
# ------------------------------
print(FFTrees_model)
summary(FFTrees_model)
print(glm_model)
summary(glm_model)
print(randomForest_model)
summary(randomForest_model)
# ------------------------------
# Part III: Training Accuracy
# ------------------------------
# FFTrees training decisions
FFTrees_fit <- predict(object = FFTrees_model,
newdata = heart_train)
# Regression training decisions
# Positive values are predicted to be 1, negative values are 0
glm_fit <- predict(object = glm_model,
newdata = heart_train) > 0
# randomForest training decisions
randomForest_fit <- predict(object = randomForest_model,
newdata = heart_train)
# Now calculate fitting accuracies and put in dataframe
# Truth value for training data is heart_train$sex
train_truth <- heart_train$sex
# Put training results together
training_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_fit == train_truth),
mean(glm_fit == train_truth),
mean(randomForest_fit == train_truth)))
# Plot training results
ggplot(data = training_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Training Accuracy",
y = "Correct Classifications")
# ------------------------------
# Part IV: Prediction Accuacy!
# ------------------------------
# Calculate predictions for each model for heart_test
# FFTrees testing decisions
FFTrees_pred <- predict(object = FFTrees_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_pred <- predict(object = glm_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_pred <- predict(object = randomForest_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$sex
test_truth <- heart_test$sex
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
```

- Note, most of this practical will be copying and pasting code from the Examples and only making small changes.
- You should start by copying and pasting all of the code in the examples into a new .R file.
- Try running pieces of the code line by line and understand what it’s doing!

A. Open your `BaselRBootcamp`

project. This project should have the folders `R`

and `data`

in the working directory.

B. Open a new R script and save it in the `R`

folder you just created as a new file called `machinelearning_practical.R`

. At the top of the script, using comments, write your name and the date. The, load the `tidyverse`

, `randomForest`

and `FFTrees`

packages. Here’s how the top of your script should look:

```
## NAME
## DATE
## Machine Learning Practical
library(tidyverse) # for dplyr and ggplot2
library(randomForest) # for randomForest()
library(FFTrees) # for FFTrees and the heartdisease data
```

C. Make sure the `heartdisease.csv`

file is in the `data`

in your working directory. Then, using `read_csv()`

, read the `heartdisease.csv`

data and assign it to a new object in R called `heartdisease`

.

D. Explore the `heartdisease`

dataset using `summary()`

, and `View()`

E. There is a help menu for the `heartdisease`

dataset in the `FFTrees`

package. Look at the help menu for `heartdisease`

by running `?heartdisease`

F. Create separate training dataframes `heart_train`

for model training and `heart_test`

for model testing. Print each of these dataframes to make sure they look ok.

`diagnosis`

- In the example code, we predicted patient’s sex. Now, in our analyses, we will try to predict each patient’s diagnosis using the column
`diagnosis`

. To do this, create three new model objects`FFTrees_model`

,`glm_model`

, and`randomForest_model`

, each predicting`diagnosis`

using the training data`heart_train`

- Calculate fits for each model with
`predict()`

, then create`training_results`

containing the fitting accuracy of each model in a dataframe. The code will be almost identical to what is in the Example. All you need to do is change the value of`truth_train`

to the correct column in`heart_train`

. Afterwards, plot the results using`ggplot`

. Which model had the best training accuracy?

Explore each of the three models by applying

`print()`

,`summary()`

,`plot()`

, and`names()`

to the objects. Which of these methods work for each object? Can you interpret any of the outputs?You can look at a variable’s

*importance*in each of these models in different ways. In the decision tree, you can look at which variables show up in the tree. In regression, you can look at the significance of the predictors. In random forests, you can look at look at a variable’s importance in terms of what is called*mean decrease in gini*. Using the following template, explore how each model rates the importance of variables.

```
# ----------------------
# Explore the importance of different variables in models
# ----------------------
# Print the elements of the FFTrees model
FFTrees_model
# Look at significance of regression
summary(glm_model)
# Look at importance of variables in randomforests
randomForest_model$importance
```

- Calculate predictions for each model based on
`heart_test`

, and then calculate the prediction accuracies. Don’t forget to change the value of`truth_test`

to reflect the true value for the current analysis! Then plot the results. Which model predicted each patient’s diagnosis the best?

By default, random forests creates 500 trees. You can change this using the

`ntree`

argument in`randomForest()`

. Try creating 2 new randomForest objects, one based on either a small number of trees (e.g. 10), and one based on 10,000 trees. How much better (or worse) do these models do compared to the default model with 500 trees?You can use the

`my.tree`

argument in`FFTrees()`

to create your own custom tree ‘in words’. To see how this works, look at the “Specifying FFTs directly” vignette in the`FFTrees`

package by running`vignette("FFTrees_mytree", package = "FFTrees")`

. Look through the vignette to see how the`my.tree`

argument works. Then, try creating a new`FFTrees`

object called`custom_FFTrees_model`

trained on the`heart_train`

data using the following rule: “If sex = 0, predict False. If chol > 250, predict True. Otherwise, predict True.”. Plot the object to see how the tree did!A colleague of yours thinks that support vector machines should perform better than the models you used. Is she right? Test her prediction by including support vector machines using the

`svm()`

function from the`e1071`

package in all of your analyses. You’ll need to add code for support vector machines at each stage of the machine learning process, model building, data fitting, and data prediction. Was she right?In all of our machine learning, we have allowed all models to use all data in the

`heartdisease`

dataset. However, some of the data is more expensive to collect than other data. What do you think would happen if we only let the models use a few cheap predictors like`age`

,`sex`

and`chol`

? Test your prediction by replicating the machine learning process, but*only*allow the models to make predictions based on the three variables`age`

,`sex`

and`chol`

. Does one model substantially outperform the others? (Hint: You can easily tell a model what specific variables to include using the`formula`

argument. For example,`formula = y ~ a + b`

will tell a model to model a variable y, but*only*using variables a and b.)How do you think these algorithms would perform on a randomly generated dataset? Let’s test this by creating a random training and test dataset, and then see how well the algorithms do. Run the code below to add a random column of data called

`random`

to`heart_train`

and`heart_test`

. Then, run your machine learning analysis, but now train and test the models to predict the new random data column. How well do the models do in training and testing?

```
# Add a new column random to heart_train and heart_test
heart_train <- heart_train %>%
mutate(
random = sample(c(0, 1), size = nrow(.), replace = TRUE)
)
heart_test <- heart_test %>%
mutate(
random = sample(c(0, 1), size = nrow(.), replace = TRUE)
)
```

For more advanced machine learning functionality in R, check out the

`caret`

package caret documentation link and the`mlr`

package mlr documentation link. Both of these packages contain functions that automate most aspects of the machine learning process.To read more about the fundamentals of machine learning and statistical prediction, check out An Introduction to Statistical Learning by James et al.