In this practical you’ll conduct machine learning analyses on a dataset on heart disease. You will see how well several different machine learning models can predict new data. By the end of this practical you will know how to train machine learning models on one set of data, explore the fitted models, and compare how accurately they classify new data.
Here are the main functions and packages you’ll be using. For more information about the specific models, click on the link in Additional Details.
Algorithm | Function | Package | Additional Details |
---|---|---|---|
Regression | glm() | Base R | https://bookdown.org/ndphillips/YaRrr/regression.html#the-linear-model |
Fast-and-Frugal decision trees | FFTrees() | FFTrees | https://cran.r-project.org/web/packages/FFTrees/vignettes/guide.html |
Random Forests | randomForest() | randomForest | http://www.blopig.com/blog/2017/04/a-very-basic-introduction-to-random-forests-using-r/ |
You’ll use two datasets in this practical: heartdisease.csv and ACTG175.csv. They are available in the data_BaselRBootcamp_Day2.zip file on the main course page. If you haven’t already, download data_BaselRBootcamp_Day2.zip and unzip it to get the two files.
# -----------------------------------------------
# A step-by-step tutorial for conducting machine learning
# In this tutorial, we'll see how well 3 different models can
# predict medical data
# ------------------------------------------------
# -----------------------
# Part A:
# Load libraries
# -----------------------
library(tidyverse) # for dplyr and ggplot2
library(randomForest) # for randomForest()
library(FFTrees) # for FFTrees and the heartdisease data
# -----------------------
# Part B: Create datasets
# heart_train, heart_test
# -----------------------
heartdisease <- read_csv(file = "data/heartdisease.csv") # Read in the heartdisease data
set.seed(100) # To fix the training / test randomization
heartdisease <- heartdisease %>%
mutate_if(is.character, factor) %>% # Convert character to factor
sample_frac(1) # Randomly sort rows
# Save first 100 rows as heart_train and remaining as heart_test
heart_train <- heartdisease %>%
slice(1:100)
heart_test <- heartdisease %>%
slice(101:nrow(heartdisease))
# ------------------------------
# Part I: Build Models
# ------------------------------
# Build FFTrees_model
FFTrees_model <- FFTrees(formula = sex ~ .,
data = heart_train)
# Build glm_model
glm_model <- glm(formula = factor(sex) ~ .,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_model <- randomForest(formula = factor(sex) ~ .,
data = heart_train)
# ------------------------------
# Part II: Explore Models
# ------------------------------
print(FFTrees_model)
summary(FFTrees_model)
print(glm_model)
summary(glm_model)
print(randomForest_model)
summary(randomForest_model)
# ------------------------------
# Part III: Training Accuracy
# ------------------------------
# FFTrees training decisions
FFTrees_fit <- predict(object = FFTrees_model,
newdata = heart_train)
# Regression training decisions
# Positive values are predicted to be 1, negative values are 0
glm_fit <- predict(object = glm_model,
newdata = heart_train) > 0
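# Note (illustrative sketch, not required for the practical): the same decisions
# can be made on the probability scale. predict() with type = "response" returns
# fitted probabilities, so a 0.5 cutoff is equivalent to the 0 cutoff on the
# logit scale used above. The object name glm_fit_prob is just for this example.
glm_fit_prob <- predict(object = glm_model,
                        newdata = heart_train,
                        type = "response") > 0.5
mean(glm_fit_prob == glm_fit)  # should be 1: both cutoffs give the same decisions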
# randomForest training decisions
randomForest_fit <- predict(object = randomForest_model,
newdata = heart_train)
# Now calculate fitting accuracies and put in dataframe
# Truth value for training data is heart_train$sex
train_truth <- heart_train$sex
# Put training results together
training_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_fit == train_truth),
mean(glm_fit == train_truth),
mean(randomForest_fit == train_truth)))
# Plot training results
ggplot(data = training_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Training Accuracy",
y = "Correct Classifications")
# ------------------------------
# Part IV: Prediction Accuracy!
# ------------------------------
# Calculate predictions for each model for heart_test
# FFTrees testing decisions
FFTrees_pred <- predict(object = FFTrees_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_pred <- predict(object = glm_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_pred <- predict(object = randomForest_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$sex
test_truth <- heart_test$sex
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
A. Open your BaselRBootcamp project. This project should have the folders R and data in the working directory.
B. Open a new R script and save it in the R folder as a new file called machinelearning_practical.R. At the top of the script, using comments, write your name and the date. Then, load the tidyverse, randomForest and FFTrees packages. Here’s how the top of your script should look:
## NAME
## DATE
## Machine Learning Practical
library(tidyverse) # for dplyr and ggplot2
── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.4
✔ tibble 1.4.2 ✔ dplyr 0.7.4
✔ tidyr 0.8.0 ✔ stringr 1.3.0
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(randomForest) # for randomForest()
Warning: package 'randomForest' was built under R version 3.4.4
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
library(FFTrees) # for FFTrees and the heartdisease data
O
/ \
F O
/ \
F T
FFTrees v1.3.5. Email: Nathaniel.D.Phillips.is@gmail.com
FFTrees.guide() opens the package guide. Citation info at citation('FFTrees')
C. Make sure the heartdisease.csv file is in the data folder in your working directory. Then, using read_csv(), read the heartdisease.csv data and assign it to a new object in R called heartdisease.
heartdisease <- read_csv(file = "data/heartdisease.csv")
D. Explore the heartdisease dataset using summary() and View().
summary(heartdisease)
age sex cp trestbps
Min. :29.00 Min. :0.0000 Length:303 Min. : 94.0
1st Qu.:48.00 1st Qu.:0.0000 Class :character 1st Qu.:120.0
Median :56.00 Median :1.0000 Mode :character Median :130.0
Mean :54.44 Mean :0.6799 Mean :131.7
3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:140.0
Max. :77.00 Max. :1.0000 Max. :200.0
chol fbs restecg thalach
Min. :126.0 Min. :0.0000 Length:303 Min. : 71.0
1st Qu.:211.0 1st Qu.:0.0000 Class :character 1st Qu.:133.5
Median :241.0 Median :0.0000 Mode :character Median :153.0
Mean :246.7 Mean :0.1485 Mean :149.6
3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:166.0
Max. :564.0 Max. :1.0000 Max. :202.0
exang oldpeak slope ca
Min. :0.0000 Min. :0.00 Length:303 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00 Class :character 1st Qu.:0.0000
Median :0.0000 Median :0.80 Mode :character Median :0.0000
Mean :0.3267 Mean :1.04 Mean :0.6766
3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:1.0000
Max. :1.0000 Max. :6.20 Max. :3.0000
thal diagnosis
Length:303 Min. :0.0000
Class :character 1st Qu.:0.0000
Mode :character Median :0.0000
Mean :0.4587
3rd Qu.:1.0000
Max. :1.0000
# View(heartdisease)
E. There is a help menu for the heartdisease dataset in the FFTrees package. Look at the help menu for heartdisease by running ?heartdisease.
?heartdisease
F. Create separate dataframes heart_train for model training and heart_test for model testing. Print each of these dataframes to make sure they look ok.
set.seed(100) # To fix the training / test randomization
heartdisease <- heartdisease %>%
mutate_if(is.character, factor) %>% # Convert character to factor
sample_frac(1) # Randomly sort rows
Warning: package 'bindrcpp' was built under R version 3.4.4
# Save first 100 rows as heart_train and remaining as heart_test
heart_train <- heartdisease %>%
slice(1:100)
heart_test <- heartdisease %>%
slice(101:nrow(heartdisease))
heart_train
# A tibble: 100 x 14
age sex cp trestbps chol fbs restecg thalach exang oldpeak
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 44. 0. np 108. 141. 0. normal 175. 0. 0.600
2 51. 0. np 140. 308. 0. hypertrop… 142. 0. 1.50
3 52. 1. np 138. 223. 0. normal 169. 0. 0.
4 48. 1. aa 110. 229. 0. normal 168. 0. 1.00
5 59. 1. aa 140. 221. 0. normal 164. 1. 0.
6 58. 1. np 105. 240. 0. hypertrop… 154. 1. 0.600
7 41. 0. aa 126. 306. 0. normal 163. 0. 0.
8 39. 1. a 118. 219. 0. normal 140. 0. 1.20
9 77. 1. a 125. 304. 0. hypertrop… 162. 1. 0.
10 41. 0. aa 105. 198. 0. normal 168. 0. 0.
# ... with 90 more rows, and 4 more variables: slope <fct>, ca <dbl>,
# thal <fct>, diagnosis <dbl>
heart_test
# A tibble: 203 x 14
age sex cp trestbps chol fbs restecg thalach exang oldpeak
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 60. 1. np 140. 185. 0. hypertrop… 155. 0. 3.00
2 48. 1. aa 130. 245. 0. hypertrop… 180. 0. 0.200
3 64. 0. a 130. 303. 0. normal 122. 0. 2.00
4 62. 1. a 120. 267. 0. normal 99. 1. 1.80
5 43. 0. a 132. 341. 1. hypertrop… 136. 1. 3.00
6 55. 0. aa 135. 250. 0. hypertrop… 161. 0. 1.40
7 51. 1. a 140. 298. 0. normal 122. 1. 4.20
8 62. 1. aa 120. 281. 0. hypertrop… 103. 0. 1.40
9 67. 1. a 120. 229. 0. hypertrop… 129. 1. 2.60
10 71. 0. np 110. 265. 1. hypertrop… 130. 0. 0.
# ... with 193 more rows, and 4 more variables: slope <fct>, ca <dbl>,
# thal <fct>, diagnosis <dbl>
Now it’s time to build models that predict each patient’s diagnosis. To do this, create three new model objects FFTrees_model, glm_model, and randomForest_model, each predicting diagnosis using the training data heart_train.
# ------------------------------
# Part I: Build Models
# ------------------------------
# Build FFTrees_model
FFTrees_model <- FFTrees(formula = diagnosis ~ .,
data = heart_train)
Growing FFTs with ifan
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
# Build glm_model
glm_model <- glm(formula = factor(diagnosis) ~ .,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_model <- randomForest(formula = factor(diagnosis) ~ .,
data = heart_train)
Calculate training fits for each model using predict(), then create training_results containing the fitting accuracy of each model in a dataframe. The code will be almost identical to what is in the Example; all you need to do is change the value of train_truth to the correct column in heart_train. Afterwards, plot the results using ggplot. Which model had the best training accuracy?
# FFTrees training decisions
FFTrees_fit <- predict(object = FFTrees_model,
newdata = heart_train)
# Regression training decisions
# Positive values are predicted to be 1, negative values are 0
glm_fit <- predict(object = glm_model,
newdata = heart_train) > 0
# randomForest training decisions
randomForest_fit <- predict(object = randomForest_model,
newdata = heart_train)
# Now calculate fitting accuracies and put in dataframe
# Truth value for training data is heart_train$diagnosis
train_truth <- heart_train$diagnosis
# Put training results together
training_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_fit == train_truth),
mean(glm_fit == train_truth),
mean(randomForest_fit == train_truth)))
# Plot training results
ggplot(data = training_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Training Accuracy",
y = "Correct Classifications")
Explore the three models by applying print(), summary(), plot(), and names() to the model objects. Which of these methods work for each object? Can you interpret any of the outputs?
print(FFTrees_model)
FFT #1 predicts diagnosis using 4 cues: {cp,thal,ca,oldpeak}
[1] If cp != {a}, predict False.
[2] If thal = {rd,fd}, predict True.
[3] If ca > 0, predict True.
[4] If oldpeak <= 0.9, predict False, otherwise, predict True.
train
cases :n 100.00
speed :mcu 1.69
frugality :pci 0.88
accuracy :acc 0.83
weighted :wacc 0.82
sensitivity :sens 0.73
specificity :spec 0.91
pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4
print(glm_model)
Call: glm(formula = factor(diagnosis) ~ ., family = "binomial", data = heart_train)
Coefficients:
(Intercept) age sex
-42.74285 0.10042 3.89776
cpaa cpnp cpta
-4.81183 -9.60073 -11.51812
trestbps chol fbs
0.05502 0.01829 0.11396
restecghypertrophy restecgnormal thalach
-7.16047 -7.38367 0.17836
exang oldpeak slopeflat
-3.96418 1.31366 4.26435
slopeup ca thalnormal
-2.07748 5.23272 1.36110
thalrd
5.06249
Degrees of Freedom: 99 Total (i.e. Null); 81 Residual
Null Deviance: 137.6
Residual Deviance: 30.26 AIC: 68.26
print(randomForest_model)
Call:
randomForest(formula = factor(diagnosis) ~ ., data = heart_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 18%
Confusion matrix:
0 1 class.error
0 48 7 0.1272727
1 11 34 0.2444444
summary(FFTrees_model)
$train
tree n hi mi fa cr sens spec ppv npv far
1 1 100 33 12 5 50 0.7333333 0.9090909 0.8684211 0.8064516 0.09090909
2 2 100 31 14 3 52 0.6888889 0.9454545 0.9117647 0.7878788 0.05454545
3 3 100 40 5 14 41 0.8888889 0.7454545 0.7407407 0.8913043 0.25454545
4 4 100 38 7 12 43 0.8444444 0.7818182 0.7600000 0.8600000 0.21818182
5 5 100 42 3 22 33 0.9333333 0.6000000 0.6562500 0.9166667 0.40000000
6 6 100 21 24 1 54 0.4666667 0.9818182 0.9545455 0.6923077 0.01818182
7 7 100 45 0 34 21 1.0000000 0.3818182 0.5696203 1.0000000 0.61818182
8 8 100 17 28 0 55 0.3777778 1.0000000 1.0000000 0.6626506 0.00000000
acc bacc wacc bpv dprime cost pci mcu
1 0.83 0.8212121 0.8212121 0.8374363 1.958103 0.17 0.8792857 1.69
2 0.83 0.8171717 0.8171717 0.8498217 2.094996 0.17 0.8864286 1.59
3 0.81 0.8171717 0.8171717 0.8160225 1.880894 0.19 0.8700000 1.82
4 0.81 0.8131313 0.8131313 0.8100000 1.791242 0.19 0.8771429 1.72
5 0.75 0.7666667 0.7666667 0.7864583 1.754433 0.25 0.8507143 2.09
6 0.75 0.7242424 0.7242424 0.8234266 2.009186 0.25 0.8764286 1.73
7 0.66 0.6909091 0.6909091 0.7848101 1.999716 0.34 0.8407143 2.23
8 0.72 0.6888889 0.6888889 0.8313253 2.064228 0.28 0.8607143 1.95
$test
NULL
summary(glm_model)
Call:
glm(formula = factor(diagnosis) ~ ., family = "binomial", data = heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.90474 -0.07549 -0.00268 0.04191 1.85561
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.74285 25.08901 -1.704 0.08845 .
age 0.10042 0.09606 1.045 0.29583
sex 3.89776 2.69543 1.446 0.14816
cpaa -4.81183 2.52717 -1.904 0.05691 .
cpnp -9.60073 4.31506 -2.225 0.02609 *
cpta -11.51812 4.85146 -2.374 0.01759 *
trestbps 0.05502 0.05502 1.000 0.31737
chol 0.01829 0.01733 1.055 0.29137
fbs 0.11396 1.97646 0.058 0.95402
restecghypertrophy -7.16047 12.51323 -0.572 0.56717
restecgnormal -7.38367 12.38725 -0.596 0.55113
thalach 0.17836 0.07913 2.254 0.02420 *
exang -3.96418 2.91634 -1.359 0.17405
oldpeak 1.31366 1.31935 0.996 0.31940
slopeflat 4.26435 3.42511 1.245 0.21312
slopeup -2.07748 2.64305 -0.786 0.43186
ca 5.23272 1.81873 2.877 0.00401 **
thalnormal 1.36110 6.36321 0.214 0.83062
thalrd 5.06249 6.46121 0.784 0.43332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.628 on 99 degrees of freedom
Residual deviance: 30.255 on 81 degrees of freedom
AIC: 68.255
Number of Fisher Scoring iterations: 9
summary(randomForest_model)
Length Class Mode
call 3 -none- call
type 1 -none- character
predicted 100 factor numeric
err.rate 1500 -none- numeric
confusion 6 -none- numeric
votes 200 matrix numeric
oob.times 100 -none- numeric
classes 2 -none- character
importance 13 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 100 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
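The output above comes from print() and summary(). A minimal sketch of the other two methods mentioned in the task (assuming the model objects built above; their output is not shown here):
plot(FFTrees_model)         # draws the best fast-and-frugal tree along with its accuracy statistics
names(randomForest_model)   # lists the components stored in the randomForest object
names(glm_model)            # lists the components stored in the glm object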
# ----------------------
# Explore the importance of different variables in models
# ----------------------
# Print the elements of the FFTrees model
FFTrees_model
FFT #1 predicts diagnosis using 4 cues: {cp,thal,ca,oldpeak}
[1] If cp != {a}, predict False.
[2] If thal = {rd,fd}, predict True.
[3] If ca > 0, predict True.
[4] If oldpeak <= 0.9, predict False, otherwise, predict True.
train
cases :n 100.00
speed :mcu 1.69
frugality :pci 0.88
accuracy :acc 0.83
weighted :wacc 0.82
sensitivity :sens 0.73
specificity :spec 0.91
pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4
# Look at significance of regression
summary(glm_model)
Call:
glm(formula = factor(diagnosis) ~ ., family = "binomial", data = heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.90474 -0.07549 -0.00268 0.04191 1.85561
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.74285 25.08901 -1.704 0.08845 .
age 0.10042 0.09606 1.045 0.29583
sex 3.89776 2.69543 1.446 0.14816
cpaa -4.81183 2.52717 -1.904 0.05691 .
cpnp -9.60073 4.31506 -2.225 0.02609 *
cpta -11.51812 4.85146 -2.374 0.01759 *
trestbps 0.05502 0.05502 1.000 0.31737
chol 0.01829 0.01733 1.055 0.29137
fbs 0.11396 1.97646 0.058 0.95402
restecghypertrophy -7.16047 12.51323 -0.572 0.56717
restecgnormal -7.38367 12.38725 -0.596 0.55113
thalach 0.17836 0.07913 2.254 0.02420 *
exang -3.96418 2.91634 -1.359 0.17405
oldpeak 1.31366 1.31935 0.996 0.31940
slopeflat 4.26435 3.42511 1.245 0.21312
slopeup -2.07748 2.64305 -0.786 0.43186
ca 5.23272 1.81873 2.877 0.00401 **
thalnormal 1.36110 6.36321 0.214 0.83062
thalrd 5.06249 6.46121 0.784 0.43332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.628 on 99 degrees of freedom
Residual deviance: 30.255 on 81 degrees of freedom
AIC: 68.255
Number of Fisher Scoring iterations: 9
# Look at importance of variables in randomforests
randomForest_model$importance
MeanDecreaseGini
age 4.8117369
sex 1.0653393
cp 7.8707268
trestbps 3.6289871
chol 3.8782158
fbs 0.4243759
restecg 0.7612203
thalach 4.0227932
exang 1.1082324
oldpeak 5.2384905
slope 2.2656301
ca 6.6686235
thal 6.8666014
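As an aside (a small sketch, not part of the original solution), the randomForest package also provides importance() and varImpPlot() for working with these importance values:
importance(randomForest_model)   # the importance values as a matrix
varImpPlot(randomForest_model)   # dot chart of MeanDecreaseGini for each predictor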
Now calculate predictions for each model on the test data heart_test, and then calculate the prediction accuracies. Don’t forget to change the value of test_truth to reflect the true value for the current analysis! Then plot the results. Which model predicted each patient’s diagnosis the best?
# ------------------------------
# Part IV: Prediction Accuracy!
# ------------------------------
# Calculate predictions for each model for heart_test
# FFTrees testing decisions
FFTrees_pred <- predict(object = FFTrees_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_pred <- predict(object = glm_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_pred <- predict(object = randomForest_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
You can control how many trees a random forest uses with the ntree argument in randomForest(). Try creating 2 new randomForest objects, one based on a small number of trees (e.g. 10), and one based on 10,000 trees. How much better (or worse) do these models do compared to the default model with 500 trees?
# Build randomForest models with different numbers of trees
randomForest_10_model <- randomForest(formula = factor(diagnosis) ~ .,
data = heart_train,
ntree = 10)
randomForest_10000_model <- randomForest(formula = factor(diagnosis) ~ .,
data = heart_train,
ntree = 10000)
# randomForest testing decisions
randomForest_10_pred <- predict(object = randomForest_10_model,
newdata = heart_test)
randomForest_10000_pred <- predict(object = randomForest_10000_model,
newdata = heart_test)
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("RF_10", "RF_10000"),
result = c(mean(randomForest_10_pred == test_truth),
mean(randomForest_10000_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
You can use the my.tree argument in FFTrees() to create your own custom tree ‘in words’. To see how this works, look at the “Specifying FFTs directly” vignette in the FFTrees package by running vignette("FFTrees_mytree", package = "FFTrees"). Look through the vignette to see how the my.tree argument works. Then, try creating a new FFTrees object called custom_FFTrees_model trained on the heart_train data using the following rule: “If sex = 0, predict False. If chol > 250, predict True. Otherwise, predict True.” Plot the object to see how the tree did!
custom_FFTrees_model <- FFTrees(formula = diagnosis ~ .,
data = heart_train,
my.tree = "If sex = 0, predict False.
If chol > 250, predict TRUE.
Otherwise, predict TRUE.")
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
plot(custom_FFTrees_model, main = "My custom FFT")
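To double-check that the fitted tree matches the verbal rule, you could also describe it in words with FFTrees’ inwords() function (a small sketch, assuming custom_FFTrees_model from above):
inwords(custom_FFTrees_model)   # describe the custom tree in plain language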
A colleague claims that support vector machines will outperform all of your other models, and suggests you add them to your analyses using the svm() function from the e1071 package. You’ll need to add code for support vector machines at each stage of the machine learning process: model building, data fitting, and data prediction. Was she right?
library(e1071)
# Build svm model
svm_model <- svm(formula = factor(diagnosis) ~ .,
data = heart_train)
# svm testing decisions
svm_pred <- predict(object = svm_model,
newdata = heart_test)
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest", "svm"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth),
mean(svm_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
So far, the models have been allowed to use all of the variables in the heartdisease dataset. However, some of the data is more expensive to collect than other data. What do you think would happen if we only let the models use a few cheap predictors like age, sex and chol? Test your prediction by replicating the machine learning process, but only allow the models to make predictions based on the three variables age, sex and chol. Does one model substantially outperform the others? (Hint: You can easily tell a model which specific variables to include using the formula argument. For example, formula = y ~ a + b will tell a model to predict a variable y using only the variables a and b.)
# Build FFTrees_model
FFTrees_asc_model <- FFTrees(formula = diagnosis ~ age + sex + chol,
data = heart_train)
Growing FFTs with ifan
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
# Build glm_model
glm_asc_model <- glm(formula = factor(diagnosis) ~ age + sex + chol,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_asc_model <- randomForest(formula = factor(diagnosis) ~ age + sex + chol,
data = heart_train)
# FFTrees testing decisions
FFTrees_asc_pred <- predict(object = FFTrees_asc_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_asc_pred <- predict(object = glm_asc_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_asc_pred <- predict(object = randomForest_asc_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_asc_pred == test_truth),
mean(glm_asc_pred == test_truth),
mean(randomForest_asc_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
Add a new column called random, containing randomly sampled values of 0 and 1, to heart_train and heart_test. Then, run your machine learning analysis, but now train and test the models to predict the new random data column. How well do the models do in training and testing?
# Add a new column random to heart_train and heart_test
heart_train <- heart_train %>%
mutate(
random = sample(c(0, 1), size = nrow(.), replace = TRUE)
)
heart_test <- heart_test %>%
mutate(
random = sample(c(0, 1), size = nrow(.), replace = TRUE)
)
# Build FFTrees_model
FFTrees_random_model <- FFTrees(formula = random ~ age + sex + chol,
data = heart_train)
Growing FFTs with ifan
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
# Build glm_model
glm_random_model <- glm(formula = factor(random) ~ age + sex + chol,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_random_model <- randomForest(formula = factor(random) ~ age + sex + chol,
data = heart_train)
# FFTrees testing decisions
FFTrees_random_pred <- predict(object = FFTrees_random_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_random_pred <- predict(object = glm_random_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_random_pred <- predict(object = randomForest_random_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$random
test_truth <- heart_test$random
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_random_pred == test_truth),
mean(glm_random_pred == test_truth),
mean(randomForest_random_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
For more advanced machine learning functionality in R, check out the caret package and the mlr package. Both of these packages contain functions that automate most aspects of the machine learning process.
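As a brief, hedged illustration of what caret automates (this specific analysis is not part of the practical), a single train() call can handle resampling and model fitting for the same heart disease data:
library(caret)
# 10-fold cross-validation of a random forest on the training data
rf_caret <- train(factor(diagnosis) ~ .,
                  data = heart_train,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10))
rf_caret                                  # cross-validated accuracy for each mtry value
predict(rf_caret, newdata = heart_test)   # class predictions for the test data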
To read more about the fundamentals of machine learning and statistical prediction, check out An Introduction to Statistical Learning by James et al.