The almighty Caret!


Cheatsheet

Packages

Package     Installation
tidyverse   install.packages("tidyverse")
caret       install.packages("caret")
GGally      install.packages("GGally")
skimr       install.packages("skimr")

Datasets

library(tidyverse)

house_train <- read_csv("../_data/house_train.csv")
house_test <- read_csv("../_data/house_test.csv")

heartdisease_train <- read_csv("../_data/heartdisease_train.csv")
heartdisease_test <- read_csv("../_data/heartdisease_test.csv")

attrition_train <- read_csv("../_data/attrition_train.csv")
attrition_test <- read_csv("../_data/attrition_test.csv")
File                     Rows    Columns  Criterion  Source
house_train.csv          1000    21       price      Kaggle: House Sales Prediction
house_test.csv           15000   21       price      Kaggle: House Sales Prediction
heartdisease_train.csv   150     14       diagnosis  UCI ML Heartdisease
heartdisease_test.csv    150     14       diagnosis  UCI ML Heartdisease
attrition_train.csv      500     35       Attrition  Kaggle Attrition
attrition_test.csv       900     35       Attrition  Kaggle Attrition

Examples

The following set of example code will take you through the basic steps of machine learning using the amazing caret package.

# Load packages
library(tidyverse)
library(GGally)
library(skimr)
library(caret)

# ------------------------------------
# Step 0: Create training and test data
#  Only necessary if you don't already have training
#   and test data
# ------------------------------------

# Randomly shuffle the rows of the diamonds data before splitting
diamonds <- diamonds %>%
  sample_frac(1)              

train_v <- createDataPartition(y = diamonds$price, 
                               times = 1,
                               p = .1)    # use 10% of cases for training

# Create separate training and test data

diamonds_train <- diamonds %>%
  slice(train_v$Resample1)

diamonds_test <- diamonds %>%
  slice(-train_v$Resample1)

# ---------------
# Explore
# ---------------

# Explore columns with skim
skim(diamonds_train)

# Visualise relationships with ggpairs
ggpairs(diamonds_train)

# ---------------
# Step 1: Define control parameters
# ---------------

# Set up control values
ctr <- trainControl(method = "repeatedcv",
                    number = 10,
                    repeats = 2)

# ---------------
# Step 2: Train model
# ---------------

# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ ., 
                           data = diamonds_train,
                           method = "lm",
                           trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------

class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Look at final model object
diamonds_lm_train$finalModel

# ---------------
# Step 4: Predict
# ---------------

diamonds_lm_predictions <- predict(diamonds_lm_train, 
                                   newdata = diamonds_test)
# ---------------
# Step 5: Evaluate
# ---------------

# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions, 
             obs = diamonds_test$price)

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                          criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

A. Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that all of the datasets you need for this practical are in your 1_Data folder.

# Done!

B. Open a new R script and save it as a new file called ml_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all of the packages you’ll need for this practical. Here’s how the top of your script should look:

## NAME
## DATE
## Machine Learning Practical

library(XX)     
library(XX)   
library(XX)   

Modelling procedure

In this practical you will conduct machine learning analyses on several data sets. For each of the tasks, go through the following steps.

A. Load the training data XXX_train.csv as a new dataframe called XXX_train and the test data XXX_test.csv as a new dataframe called XXX_test
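
For example, a minimal sketch for the heartdisease data, assuming the files are stored in the 1_Data folder used elsewhere in this practical:

heartdisease_train <- read_csv("1_Data/heartdisease_train.csv")
heartdisease_test <- read_csv("1_Data/heartdisease_test.csv")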

B. Explore the XXX_train dataframe with a combination of skim(), names(), summary() and other similar functions
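
For example, with a placeholder dataframe XXX_train:

skim(XXX_train)      # column-by-column summaries
names(XXX_train)     # column names
summary(XXX_train)   # basic descriptive statistics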

C. Define control parameters with trainControl(). Use 10-fold cross validation with 2 repetitions.
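
A minimal sketch of such a control object (the same settings used in the diamonds example above):

ctr <- trainControl(method = "repeatedcv",   # repeated k-fold cross validation
                    number = 10,             # 10 folds
                    repeats = 2)             # repeated 2 times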

D. If you are conducting a classification analysis, be sure to convert the criterion to a factor to tell the function you are doing classification instead of regression. Do this for both the training and test datasets. Here’s how to do it for a dataframe called df

# Convert a column called criterion to a factor
df <- df %>%
  mutate(criterion = factor(criterion))

E. Train one or more models on the training data. Start with one model, then gradually try more. For each model, assign the result to a new training object called XX_train (e.g., rf_train for random forests, glm_train for standard regression).
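
Here is a sketch of training two candidate models, where XXX_train and criterion are placeholders for your own dataframe and criterion column:

# Standard regression
glm_train <- train(form = criterion ~ .,
                   data = XXX_train,
                   method = "glm",
                   trControl = ctr)

# Random forests
rf_train <- train(form = criterion ~ .,
                  data = XXX_train,
                  method = "rf",
                  trControl = ctr)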

Regression Tasks

For regression tasks, your criterion should be numeric

Classification Tasks

For classification tasks, your criterion should be a factor

F. Explore your training objects by printing them, looking at their named elements with names(), and accessing a few of those elements with XX_train$ to see what the outputs look like.

G. Look at the final model with XX_train$finalModel. Try applying generic functions like summary(), plot(), and broom::tidy() to the object. Do these help you to understand the final model?
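
For example, continuing the sketch above with the placeholder rf_train object:

rf_train                       # print the training object
names(rf_train)                # see its named elements
rf_train$results               # resampling performance for each tuning value
rf_train$finalModel            # the final fitted model
summary(rf_train$finalModel)   # generic functions applied to the final model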

H. Predict the criterion values of the test data XXX_test using predict() and store the results in a vector called XXX_pred (e.g., rf_pred for predictions from random forests).
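
A sketch, again with placeholder object names:

rf_pred <- predict(rf_train, newdata = XXX_test)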

I. Evaluate the model’s prediction accuracy. For regression tasks, use postResample(); for classification tasks, use confusionMatrix(). In both cases, also create an appropriate plot.
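
A sketch of both cases, where XX_pred and XXX_test stand in for your own prediction vector and test dataframe:

# Regression: compare numeric predictions to the true values
postResample(pred = XX_pred, obs = XXX_test$criterion)

# Classification: cross-tabulate predicted and true classes (both factors)
confusionMatrix(data = XX_pred, reference = XXX_test$criterion)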

Tasks

Regression

  1. Create your best possible model for the house_train dataset predicting housing price. Which model does the best in predicting house_test$price?
# Here is what I got with glmnet!

house_train <- read_csv("1_Data/house_train.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
house_test <- read_csv("1_Data/house_test.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
house_train <- house_train %>%
  select(-id, -date)

house_test <- house_test %>%
  select(-id, -date)


# Set up control values
ctr <- trainControl(method = "repeatedcv",
                    number = 10,
                    repeats = 2)

# ---------------
# Step 2: Train model
# ---------------

# Predict price with regularized regression (glmnet)
house_glmnet_train <- train(form = price ~ ., 
                        data = house_train,
                        method = "glmnet",
                        trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------

class(house_glmnet_train)
[1] "train"         "train.formula"
house_glmnet_train
glmnet 

1000 samples
  18 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
Resampling results across tuning parameters:

  alpha  lambda      RMSE      Rsquared   MAE     
  0.10     594.8134  226481.9  0.7023734  141116.7
  0.10    5948.1336  225982.7  0.7035682  139971.8
  0.10   59481.3363  227122.7  0.7010853  133738.6
  0.55     594.8134  226531.7  0.7022322  141281.7
  0.55    5948.1336  225589.7  0.7039135  138292.7
  0.55   59481.3363  240695.3  0.6692907  140590.3
  1.00     594.8134  226448.5  0.7024254  141089.6
  1.00    5948.1336  225762.9  0.7032940  137233.6
  1.00   59481.3363  253948.8  0.6417456  149695.8

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.55 and lambda
 = 5948.134.
names(house_glmnet_train)
 [1] "method"       "modelInfo"    "modelType"    "results"     
 [5] "pred"         "bestTune"     "call"         "dots"        
 [9] "metric"       "control"      "finalModel"   "preProcess"  
[13] "trainingData" "resample"     "resampledCM"  "perfNames"   
[17] "maximize"     "yLimits"      "times"        "levels"      
[21] "terms"        "coefnames"    "xlevels"     
summary(house_glmnet_train)
            Length Class      Mode     
a0            74   -none-     numeric  
beta        1332   dgCMatrix  S4       
df            74   -none-     numeric  
dim            2   -none-     numeric  
lambda        74   -none-     numeric  
dev.ratio     74   -none-     numeric  
nulldev        1   -none-     numeric  
npasses        1   -none-     numeric  
jerr           1   -none-     numeric  
offset         1   -none-     logical  
call           5   -none-     call     
nobs           1   -none-     numeric  
lambdaOpt      1   -none-     numeric  
xNames        18   -none-     character
problemType    1   -none-     character
tuneValue      2   data.frame list     
obsLevels      1   -none-     logical  
param          0   -none-     list     
# Look at variable importance with varImp
varImp(house_glmnet_train)
glmnet variable importance

                Overall
waterfront    1.000e+02
lat           7.311e+01
long          3.295e+01
grade         1.315e+01
bedrooms      8.217e+00
view          5.038e+00
bathrooms     3.964e+00
floors        2.907e+00
condition     1.682e+00
yr_built      2.552e-01
zipcode       1.009e-01
sqft_living   2.361e-02
sqft_above    5.167e-03
sqft_lot      2.632e-05
sqft_living15 0.000e+00
sqft_lot15    0.000e+00
sqft_basement 0.000e+00
yr_renovated  0.000e+00
# Look at final model object

# Printing the glmnet final model produces a very long coefficient table,
# so the call is left commented out here
# house_glmnet_train$finalModel

# ---------------
# Step 4: Predict
# ---------------

house_glmnet_predictions <- predict(house_glmnet_train, 
                                   newdata = house_test)
# ---------------
# Step 5: Evaluate
# ---------------

# Look at final prediction performance!
postResample(pred = house_glmnet_predictions, 
             obs = house_test$price)
        RMSE     Rsquared          MAE 
2.096501e+05 6.814879e-01 1.349668e+05 
# Plot relationship between predictions and truth
performance_data <- tibble(predictions = house_glmnet_predictions,
                          criterion = house_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

Classification

Make sure your criterion values are factors!!

  1. Create your best possible model for the heartdisease_train dataset predicting diagnosis. Which model does the best in predicting heartdisease_test$diagnosis?
# Here is what I got with random forests! (rf)

heartdisease_train <- read_csv("1_Data/heartdisease_train.csv")
Parsed with column specification:
cols(
  age = col_integer(),
  sex = col_integer(),
  cp = col_character(),
  trestbps = col_integer(),
  chol = col_integer(),
  fbs = col_integer(),
  restecg = col_character(),
  thalach = col_integer(),
  exang = col_integer(),
  oldpeak = col_double(),
  slope = col_character(),
  ca = col_integer(),
  thal = col_character(),
  diagnosis = col_integer()
)
heartdisease_test <- read_csv("1_Data/heartdisease_test.csv")
Parsed with column specification:
cols(
  age = col_integer(),
  sex = col_integer(),
  cp = col_character(),
  trestbps = col_integer(),
  chol = col_integer(),
  fbs = col_integer(),
  restecg = col_character(),
  thalach = col_integer(),
  exang = col_integer(),
  oldpeak = col_double(),
  slope = col_character(),
  ca = col_integer(),
  thal = col_character(),
  diagnosis = col_integer()
)
heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis))

heartdisease_test <- heartdisease_test %>%
  mutate(diagnosis = factor(diagnosis))

# Set up control values
ctr <- trainControl(method = "repeatedcv",
                    number = 10,
                    repeats = 2)

# ---------------
# Step 2: Train model
# ---------------

# Predict diagnosis with random forests
heartdisease_rf_train <- train(form = diagnosis ~ ., 
                        data = heartdisease_train,
                        method = "rf",
                        trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------

class(heartdisease_rf_train)
[1] "train"         "train.formula"
heartdisease_rf_train
Random Forest 

150 samples
 13 predictor
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 135, 135, 135, 134, 135, 134, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8479464  0.6906826
  10    0.8305655  0.6576454
  18    0.8205357  0.6365990

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
names(heartdisease_rf_train)
 [1] "method"       "modelInfo"    "modelType"    "results"     
 [5] "pred"         "bestTune"     "call"         "dots"        
 [9] "metric"       "control"      "finalModel"   "preProcess"  
[13] "trainingData" "resample"     "resampledCM"  "perfNames"   
[17] "maximize"     "yLimits"      "times"        "levels"      
[21] "terms"        "coefnames"    "contrasts"    "xlevels"     
summary(heartdisease_rf_train)
                Length Class      Mode     
call               4   -none-     call     
type               1   -none-     character
predicted        150   factor     numeric  
err.rate        1500   -none-     numeric  
confusion          6   -none-     numeric  
votes            300   matrix     numeric  
oob.times        150   -none-     numeric  
classes            2   -none-     character
importance        18   -none-     numeric  
importanceSD       0   -none-     NULL     
localImportance    0   -none-     NULL     
proximity          0   -none-     NULL     
ntree              1   -none-     numeric  
mtry               1   -none-     numeric  
forest            14   -none-     list     
y                150   factor     numeric  
test               0   -none-     NULL     
inbag              0   -none-     NULL     
xNames            18   -none-     character
problemType        1   -none-     character
tuneValue          1   data.frame list     
obsLevels          2   -none-     character
param              0   -none-     list     
# Look at variable importance with varImp
varImp(heartdisease_rf_train)
rf variable importance

                   Overall
thalach            100.000
ca                  96.844
thalnormal          95.936
thalrd              93.867
oldpeak             81.529
chol                72.519
trestbps            65.422
age                 65.198
slopeup             63.535
slopeflat           63.253
exang               58.606
cpaa                20.780
cpta                16.627
sex                 14.200
cpnp                 6.659
restecgnormal        5.779
restecghypertrophy   4.210
fbs                  0.000
# Look at final model object
heartdisease_rf_train$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 14.67%
Confusion matrix:
   0  1 class.error
0 74  8  0.09756098
1 14 54  0.20588235
# ---------------
# Step 4: Predict
# ---------------

heartdisease_rf_predictions <- predict(heartdisease_rf_train, 
                                   newdata = heartdisease_test)
# ---------------
# Step 5: Evaluate
# ---------------

# Look at final prediction performance!
postResample(pred = heartdisease_rf_predictions, 
             obs = heartdisease_test$diagnosis)
 Accuracy     Kappa 
0.7733333 0.5397112 
  1. Create your best possible model for the attrition_train dataset predicting attrition. Which model does the best in predicting attrition_test$attrition?
# on your own!

Exploring the parameter fitting process

  1. Repeat one of your previous analyses (with just one model), but instead of doing cross validation with 2 repetitions, just fit the model to the entire training dataset with trainControl(method = "none"). Then predict the test data again. How does the prediction accuracy of this model, fit without any cross validation, compare to your original analysis?
# The accuracy of all models should decrease (but perhaps not by much)
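
For instance, here is a minimal sketch using the house data and a plain linear regression; the object names ctr_none, house_lm_nocv, and house_lm_nocv_pred are placeholders, not part of the original practical:

# Fit once to the full training data, with no resampling
ctr_none <- trainControl(method = "none")

house_lm_nocv <- train(form = price ~ .,
                       data = house_train,
                       method = "lm",
                       trControl = ctr_none)

# Predict and evaluate on the test data as before
house_lm_nocv_pred <- predict(house_lm_nocv, newdata = house_test)
postResample(pred = house_lm_nocv_pred, obs = house_test$price)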

Try different models

  1. Go to http://topepo.github.io/caret/available-models.html and find the craziest looking model you can find. Then, try it on one of your datasets and see how well it works!

Reducing data

  1. Repeat one of your original analyses, but instead of allowing the models to use all of the predictors in the training data, force them to use only three predictors. For example, in the heartdisease_train data, you could have the model(s) use only age, sex and cp as predictors with the formula form = diagnosis ~ age + sex + cp. How do the models compare to each other when they each only get access to a few predictors? Are they all the same, or is one much better than the others?
# on your own!
  1. Select one dataset, and using createDataPartition(), create your own new training and test datasets based on the XX_train datasets (that is, pretend the XX_train datasets are all possible data available, and split them into new XX_train2 and XX_test2 datasets). Now you have a new world of training and test data! Repeat your analyses and see if you get similar models and prediction performance as before.
# on your own!
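
For example, here is a sketch that re-splits the heartdisease training data; split_v, heartdisease_train2, and heartdisease_test2 are placeholder names:

# Create an index for a new 50/50 split of heartdisease_train
split_v <- createDataPartition(y = heartdisease_train$diagnosis,
                               times = 1,
                               p = .5)

heartdisease_train2 <- heartdisease_train %>%
  slice(split_v$Resample1)

heartdisease_test2 <- heartdisease_train %>%
  slice(-split_v$Resample1)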

Tips

If you want to plot the results of multiple models, you can try using the following code template:

# Some fake prediction data
#  include your real model prediction data here

XX_pred <- rnorm(100, mean = 100, sd = 10)
YY_pred <- rnorm(100, mean = 100, sd = 10)
ZZ_pred <- rnorm(100, mean = 100, sd = 10)

# Some fake true test values
#   Get these from your XX_test objects

truth <- rnorm(100, mean = 100, sd = 10)

# Put results together in a tibble

N <- length(truth)

model_results <- tibble(model = rep(c("XX", "YY", "ZZ"), each = N),
                        pred = c(XX_pred, YY_pred, ZZ_pred),
                        truth = rep(truth, times = 3))

# Add error and absolute error

model_results <- model_results %>%
  mutate(error = pred - truth,
         abserr = abs(error))

# Plot Distribution of errors for each model

ggplot(model_results, 
       aes(x = model, y = error, col = model)) +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.y = mean, 
               fun.ymin = min, 
               fun.ymax = max, 
               colour = "black") + 
  labs(title = "Model Prediction Errors",
       subtitle = "Dots represent means",
       caption = "Caret is awesome!") +
  theme(legend.position = "none")

# Plot relationship between truth and predictions

ggplot(model_results, 
       aes(x = truth, y = pred, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "XX model predictions",
       subtitle = "Diagonal represents perfect performance",
       caption = "Caret is awesome!",
       x = "True Values",
       y = "Model Predictions")

Further Reading