Trulli
www.rstudio.com

Overview

In this practical you’ll practice advanced machine learning using the caret package

By the end of this practical you will know how to:

  1. Create partitions of data with createDataPartition()
  2. Set up control parameters for the training process with trainControl()
  3. Train (pretty much) any machine learning model you want with train()
  4. Get variable importance rankings with varImp()
  5. Predict new data with predict()
  6. Evaluate model performance with postResample()

Packages

Package Installation
tidyverse install.packages("tidyverse")
caret install.packages("caret")

Examples

The following set of example code will take you through the basic steps of machine learning using the amazing caret package.

# Load packages
library(tidyverse)
library(caret)

# ------------------------------------
# Step 0: Create training and test data
#  Only necessary if you don't already have training
#   and test data
# ------------------------------------

# Split diamonds data into separate training and test datasets
diamonds <- diamonds %>%
  sample_frac(1)              

train_v <- createDataPartition(y = diamonds$price, 
                               times = 1,
                               p = .5)

# Create separate training and test data

diamonds_train <- diamonds %>%
  slice(train_v$Resample1)

diamonds_test <- diamonds %>%
  slice(-train_v$Resample1)

# ---------------
# Step 1: Define control parameters
# ---------------

# Set up control values
ctr <- trainControl(method = "none")

# ---------------
# Step 2: Train model
# ---------------

# Predict price with linear regression
#    Only using 4 numeric features

diamonds_lm_train <- train(form = price ~ ., 
                           data = diamonds_train %>%
                                      select(carat, depth, table, price),
                           method = "lm",
                           trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------

class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Look at final model object
diamonds_lm_train$finalModel

# Look at fitting performance!
postResample(pred = predict(diamonds_lm_train$finalModel, diamonds_train), 
             obs = diamonds_train$price)

# ---------------
# Step 4: Predict
# ---------------

diamonds_lm_predictions <- predict(diamonds_lm_train, 
                                   newdata = diamonds_test)
# ---------------
# Step 5: Evaluate
# ---------------

# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions, 
             obs = diamonds_test$price)

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

Models

Here are three models you can try for both regression and classification tasks

Regression Tasks

For regression tasks, your criterion should be numeric

  • Decision Trees with rpart
  • Random forests with rf
  • Regularised regression with glmnet

Classification Tasks

For classification tasks, your criterion should be a factor

  • Decision Trees with rpart
  • Random forests with rf
  • K nearest neighbors with knn

Modelling procedure

In this practical you will conduct machine learning analyses on several data sets. For each of the tasks, go through the following steps.

Step 0: Setup

  • Load the training data XXX_train.csv as a new dataframe called XXX_train and the test data XXX_test.csv as a new dataframe called XXX_test.

  • Explore the XXX_train dataframe with a combination of names(), summary() and other similar functions.

  • If there are any predictors in the data that you don’t want in the model, consider excluding them when you train the model in Step 2 (train())

Step 1: Define control parameters

  • Define control parameters with trainControl(). To start, use method = "none"

Step 2: Train model

  • Train one or more models on the training data using train(). For each model, assign the result to a new training object called XX_train (e.g.; rf_train for random forests, glm_train for standard regression.)

  • Make sure to exclude any predictors in the data you don’t want to use!

Step 3: Explore model

  • Explore your training objects by printing them, looking at (and printing) their named elements with names(), try accessing a few of the named elements with XX_train$ and see what the outputs look like.

  • Look at the final model with XX_train$finalModel. Try applying generic functions like summary(), plot(), and broom:tidy() to the object. Do these help you to understand the final model?

Step 4: Predict new data

  • Predict the criterion values of the test data XXX_test using predict() and store the results in a vector called XXX_pred (e.g.; rf_pred for predictions from random forests)

Step 5: Evaluate performance

  • Evaluate the model’s prediction accuracy. For regression tasks, use postResample(), and for classification tasks, use confusionMatrix().

Tips

If you want to plot the results of multiple models, you can try using the following code template:

Regression Models

# Some fake prediction data
#  include your real model prediction data here

XX_pred <- rnorm(100, mean = 100, sd = 10)
YY_pred <- rnorm(100, mean = 100, sd = 10)
ZZ_pred <- rnorm(100, mean = 100, sd = 10)

# Some fake true test values
#   Get these from your XX_test objects

truth <- rnorm(100, mean = 100, sd = 10)

# Put results together in a tibble

model_results <- tibble(XX = XX_pred,
                        YY = YY_pred,
                        ZZ = ZZ_pred,
                        truth = truth) %>%
  gather(model, pred, -truth) %>%
  mutate(error = pred - truth,
         abserr = abs(error))

# Plot Distribution of errors for each model

ggplot(model_results, 
       aes(x = model, y = error, col = model)) +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.y = mean, 
               fun.ymin = min, 
               fun.ymax = max, 
               colour = "black") + 
  labs(title = "Model Prediction Errors",
       subtitle = "Dots represent means",
       caption = "Caret is awesome!") +
  theme(legend.position = "none")

# Plot relationship between truth and predictions

ggplot(model_results, 
       aes(x = truth, y = pred, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "XX model predictions",
       subtitle = "Diagonal represents perfect performance",
       caption = "Caret is awesome!",
       x = "True Values",
       y = "Model Predictions")

Further Reading

Tasks

A - Setup

  1. Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that the all of the datasets you need for this practical are in your 1_Data folder

  2. Open a new R script and save it as a new file called ml2_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load the all of the packages you’ll need for this practical. Here’s how the top of your script should look:

## NAME
## DATE
## Machine Learning 2 Practical

library(XX)     
library(XX)   
library(XX)   
  1. Load each of the datasets listed in the Datasets section above as new objects using read_csv(). Call them XX_train and XX_test.

B - Regression task with house

In this task, you’ll create the best possible model to predict the housing prices of houses in King County Washington. The training data is stored in house_train and the test data is stored in house_test. The variable you are trying to predict is price!

  1. Follow Steps 1 through 3 in the Modelling Procedure section above above to create and explore your best decision tree model predicting house_train$price using method = "rpart". Call the model rpart_price

  2. Do the same with random forests using method = "rf". Call the model rf_price

  3. Do the same with regularised regression using method = "glmnet". Call the model glmnet_price.

  4. Make sure you have explored each of your models!

  5. Follow steps 4 and 5 to see which model did the best in predicting house_test$price

  6. Which of the three models was the best at predicting price in the test data? Were they very different? Is one easier to interpret than the other?

  7. Go to http://topepo.github.io/caret/available-models.html and find the craziest looking model you can find that works for regression tasks. Then, try it on the price dataset and see how well it works! Does it do much better than the models you already tried?

  8. Repeat your analyses, but now train your models using 10 fold cross validation by specifying trainControl(method = "cv", number = 10, repeats = 1). After you do so, be sure complete the remaining steps to fit the model using train() and evaluate its prediction performance.

  9. How do the accuracy of the models change when you use 10 fold cross validation? Do they become more accurate when you use 10 fold cross validation compared to when you didn’t?

C - Classification with heartdisease

Make sure your criterion values are factors when fitting your model(s)!!

When you are conducting a classification analysis, you need to make sure the criterion is stored as a factor. This tells the function you are doing classification instead of regression. So before doing anything, do this for both the training and test datasets. Here’s how to do it for the heartdisease datasets:

# Convert the diagnosis column to a factor

heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis))

heartdisease_test <- heartdisease_test %>%
  mutate(diagnosis = factor(diagnosis))

In this task, you’ll create the best possible model to predict whether or not a person is having a heart attack. These the training data is stored in heartdisease_train and the test data is stored in housedisease_test. The variable you are trying to predict is diagnosis!

  1. Follow Steps 1 through 3 above in the Modelling Procedure section above, to create your best decision tree model predicting diagnosis. Call the model rpart_diagnosis

  2. Do the same with random forests using method = "rf". Call the model rf_diagnosis

  3. Do the same with nearest neighbors using method = "knn". Call the model knn_diagnosis

  4. Make sure you have explored each of your models!

  5. Follow steps 4 and 5 to see which model did the best in predicting house_test$price

  6. Which of the three models was the best at predicting diagnosis? Were they very different? Is one easier to interpret than the other?

  7. Go to http://topepo.github.io/caret/available-models.html and find the craziest looking model you can find that works for classification tasks. Then, try it on the heartdisease dataset and see how well it works! Does it do much better than the models you already tried?

  8. Repeat your analyses, but now train your models using 10 fold cross validation by specifying trainControl(method = "cv", number = 10, repeats = 1). After you do so, be sure complete the remaining steps to fit the model using train() and evaluate its prediction performance.

  9. How do the accuracy of the models change when you use 10 fold cross validation? Do they become more accurate when you use 10 fold cross validation compared to when you didn’t?

D - Classification with attrition

  1. Try creating your best model to predict whether or not an employee will leave his/her job with the attrition_train dataset. The variable you want to predict is Attrition! Use all of the techniques you have learned so far!