The almighty Caret!

Cheatsheet

Packages

Package     Installation
tidyverse   install.packages("tidyverse")
caret       install.packages("caret")
GGally      install.packages("GGally")
skimr       install.packages("skimr")

Datasets

File                    Rows    Columns  Criterion  Source
house_train.csv         1000    21       price      Kaggle: House Sales Prediction
house_test.csv          15000   21       price      Kaggle: House Sales Prediction
heartdisease_train.csv  150     14       diagnosis  UCI ML Heartdisease
heartdisease_test.csv   150     14       diagnosis  UCI ML Heartdisease
attrition_train.csv     500     35       Attrition  Kaggle Attrition
attrition_test.csv      900     35       Attrition  Kaggle Attrition

Examples

The following example code takes you through the basic steps of machine learning using the amazing caret package.

# Load packages
library(tidyverse)
library(GGally)
library(skimr)
library(caret)

# ------------------------------------
# Step 0: Create training and test data
#  Only necessary if you don't already have training
#   and test data
# ------------------------------------

# Randomly shuffle the rows of the diamonds data
diamonds <- diamonds %>%
  sample_frac(1)

# Select 10% of the rows for training
train_v <- createDataPartition(y = diamonds$price, 
                               times = 1,
                               p = .1)

# Create separate training and test data

diamonds_train <- diamonds %>%
  slice(train_v$Resample1)

diamonds_test <- diamonds %>%
  slice(-train_v$Resample1)

# ---------------
# Explore
# ---------------

# Explore columns with skim
skim(diamonds_train)

# Visualise relationships with ggpairs
ggpairs(diamonds_train)

# ---------------
# Step 1: Define control parameters
# ---------------

# Set up control values
ctr <- trainControl(method = "repeatedcv",
                    number = 10,
                    repeats = 2)

# ---------------
# Step 2: Train model
# ---------------

# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ ., 
                           data = diamonds_train,
                           method = "lm",
                           trControl = ctr)

# ---------------
# Step 3: Explore
# ---------------

class(diamonds_lm_train)    # Object class
diamonds_lm_train           # Print resampling results
names(diamonds_lm_train)    # Named elements you can access
summary(diamonds_lm_train)  # Summary of the final model

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Look at final model object
diamonds_lm_train$finalModel

# ---------------
# Step 4: Predict
# ---------------

diamonds_lm_predictions <- predict(diamonds_lm_train, 
                                   newdata = diamonds_test)

# ---------------
# Step 5: Evaluate
# ---------------

# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions, 
             obs = diamonds_test$price)

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

A. Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that all of the datasets you need for this practical are in your 1_Data folder.

B. Open a new R script and save it as a new file called ml_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all of the packages you’ll need for this practical. Here’s how the top of your script should look:

## NAME
## DATE
## Machine Learning Practical

library(XX)     
library(XX)   
library(XX)   

Modelling procedure

In this practical you will conduct machine learning analyses on several data sets. For each of the tasks, go through the following steps.

A. Load the training data XXX_train.csv as a new dataframe called XXX_train and the test data XXX_test.csv as a new dataframe called XXX_test
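
For example, here is a minimal sketch (using heartdisease as a placeholder name, and assuming the files are in your 1_Data folder):

# Load training and test data from the 1_Data folder
heartdisease_train <- read_csv(file = "1_Data/heartdisease_train.csv")
heartdisease_test <- read_csv(file = "1_Data/heartdisease_test.csv")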

B. Explore the XXX_train dataframe with a combination of skim(), names(), summary() and other similar functions

C. Define control parameters with trainControl(). Use 10-fold cross validation with 2 repetitions.
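
For example, mirroring the cheatsheet above:

# 10-fold cross validation with 2 repetitions
ctr <- trainControl(method = "repeatedcv",
                    number = 10,
                    repeats = 2)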

D. If you are conducting a classification analysis, be sure to convert the criterion to a factor to tell the function that you are doing classification instead of regression. Do this for both the training and test datasets. Here’s how to do it for a dataframe called df:

# Convert a column called criterion to a factor
df <- df %>%
  mutate(criterion = factor(criterion))

E. Train one or more models on the training data. Start with one model, then gradually try more. For each model, assign the result to a new training object called XX_train (e.g., rf_train for random forests, glm_train for standard regression); a sketch follows the notes below.

Regression Tasks

For regression tasks, your criterion should be numeric

Classification Tasks

For classification tasks, your criterion should be a factor
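
For example, here is a sketch of training one model of each type (object and dataset names are placeholders; the rf method requires the randomForest package):

# Classification: predict diagnosis with a random forest
rf_train <- train(form = diagnosis ~ .,
                  data = heartdisease_train,
                  method = "rf",
                  trControl = ctr)

# Regression: predict price with a standard regression
glm_train <- train(form = price ~ .,
                   data = house_train,
                   method = "glm",
                   trControl = ctr)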

F. Explore your training objects by printing them, looking at (and printing) their named elements with names(), try accessing a few of the named elements with XX_train$ and see what the outputs look like.

G. Look at the final model with XX_train$finalModel. Try applying generic functions like summary(), plot(), and broom::tidy() to the object. Do these help you to understand the final model?

H. Predict the criterion values of the test data XXX_test using predict() and store the results in a vector called XXX_pred (e.g., rf_pred for predictions from random forests)
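
For example, continuing the random forest sketch from step E:

# Predict the criterion values of the test data
rf_pred <- predict(rf_train, newdata = heartdisease_test)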

I. Evaluate the model’s prediction accuracy. For regression tasks, use postResample(); for classification tasks, use confusionMatrix(). In both cases, also create an appropriate plot.
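
For example (glm_pred is a placeholder for regression predictions created in step H):

# Regression: compare numeric predictions to true values
postResample(pred = glm_pred, obs = house_test$price)

# Classification: confusion matrix of predicted vs true factor values
confusionMatrix(data = rf_pred, reference = heartdisease_test$diagnosis)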

Tasks

Regression

  1. Create your best possible model for the house_train dataset predicting housing price. Which model does the best in predicting house_test$price?

Classification

Make sure your criterion values are factors!!

  1. Create your best possible model for the heartdisease_train dataset predicting diagnosis. Which model does the best in predicting heartdisease_test$diagnosis?

  2. Create your best possible model for the attrition_train dataset predicting Attrition. Which model does the best in predicting attrition_test$Attrition?

Exploring the parameter fitting process

  1. Repeat one of your previous analyses (with just one model), but instead of doing cross validation with 2 repetitions, just fit the model to the entire training dataset with trainControl(method = "none"). Then predict the testing data again. How does the prediction accuracy compare to your original analysis, now that you have not done any cross validation?
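
A sketch of the "no resampling" control, using glm because it has no tuning parameters (methods that do have tuning parameters also need a single-row tuneGrid when method = "none"):

# Fit once to the full training data, no cross validation
ctr_none <- trainControl(method = "none")

glm_train_none <- train(form = diagnosis ~ .,
                        data = heartdisease_train,
                        method = "glm",
                        trControl = ctr_none)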

Try different models

  1. Go to http://topepo.github.io/caret/available-models.html and find the craziest looking model you can. Then, try it on one of your datasets and see how well it works!
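
Whichever model you pick, trying it is just a matter of changing the method argument; for example (xgbTree is shown purely as an illustration and requires the xgboost package):

# Try a gradient boosted tree model
xgb_train <- train(form = diagnosis ~ .,
                   data = heartdisease_train,
                   method = "xgbTree",
                   trControl = ctr)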

Reducing data

  1. Repeat one of your original analyses, but instead of allowing the models to use all of the training data, force them to use only three predictors. For example, in the heartdisease_train data, you could have the model(s) use only age, sex and cp as predictors with the formula form = diagnosis ~ age + sex + cp. How do the models compare to each other when they each only get access to a few predictors? Are they all the same, or is one much better than the others? (See the sketches after this list.)

  2. Select one dataset, and using createDataPartition(), create your own new training and test datasets based on the XX_train datasets (that is, pretend the XX_train datasets are all possible data available, and split them into new XX_train2 and XX_test2 datasets), as sketched below. Now you have a new world of training and test data! Repeat your analyses and see if you get similar models and prediction performance as before.
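
Here is a sketch for each task (dataset and object names are placeholders):

# Task 1: force the model to use only age, sex, and cp as predictors
rf_small_train <- train(form = diagnosis ~ age + sex + cp,
                        data = heartdisease_train,
                        method = "rf",
                        trControl = ctr)

# Task 2: re-split heartdisease_train into new training and test data
split_v <- createDataPartition(y = heartdisease_train$diagnosis,
                               times = 1,
                               p = .5)

heartdisease_train2 <- heartdisease_train %>%
  slice(split_v$Resample1)

heartdisease_test2 <- heartdisease_train %>%
  slice(-split_v$Resample1)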

Tips

If you want to plot the results of multiple models, you can try using the following code template:

# Some fake prediction data
#  include your real model prediction data here

XX_pred <- rnorm(100, mean = 100, sd = 10)
YY_pred <- rnorm(100, mean = 100, sd = 10)
ZZ_pred <- rnorm(100, mean = 100, sd = 10)

# Some fake true test values
#   Get these from your XX_test objects

truth <- rnorm(100, mean = 100, sd = 10)

# Put results together in a tibble

N <- length(truth)

model_results <- tibble(model = rep(c("XX", "YY", "ZZ"), each = N),
                        pred = c(XX_pred, YY_pred, ZZ_pred),
                        truth = rep(truth, times = 3))

# Add error and absolute error

model_results <- model_results %>%
  mutate(error = pred - truth,
         abserr = abs(error))
# Plot Distribution of errors for each model

ggplot(model_results, 
       aes(x = model, y = error, col = model)) +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.y = mean, 
               fun.ymin = min, 
               fun.ymax = max, 
               colour = "black") + 
  labs(title = "Model Prediction Errors",
       subtitle = "Black points show means; lines span min to max",
       caption = "Caret is awesome!") +
  theme(legend.position = "none")

# Plot relationship between truth and predictions

ggplot(model_results, 
       aes(x = truth, y = pred, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "XX model predictions",
       subtitle = "Diagonal represents perfect performance",
       caption = "Caret is awesome!",
       x = "True Values",
       y = "Model Predictions")

Further Reading