In this practical you’ll practice advanced machine learning using the caret package.
By the end of this practical you will know how to:
- Split data into training and test sets with createDataPartition()
- Define control parameters with trainControl()
- Train models with train()
- Assess variable importance with varImp()
- Predict new data with predict()
- Evaluate prediction performance with postResample()
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
caret | install.packages("caret") |
File | Rows | Columns | Criterion | Source |
---|---|---|---|---|
house_train.csv | 1000 | 21 | price | Kaggle: House Sales Prediction |
house_test.csv | 15000 | 21 | price | Kaggle: House Sales Prediction |
heartdisease_train.csv | 150 | 14 | diagnosis | UCI ML Heartdisease |
heartdisease_test.csv | 150 | 14 | diagnosis | UCI ML Heartdisease |
attrition_train.csv | 500 | 35 | Attrition | Kaggle Attrition |
attrition_test.csv | 900 | 35 | Attrition | Kaggle Attrition |
The following set of example code will take you through the basic steps of machine learning using the amazing caret package.
# Load packages
library(tidyverse)
library(caret)
# ------------------------------------
# Step 0: Create training and test data
# Only necessary if you don't already have training
# and test data
# ------------------------------------
# Split diamonds data into separate training and test datasets
# First, randomly shuffle the rows
diamonds <- diamonds %>%
  sample_frac(1)

# Select 50% of rows for the training set
train_v <- createDataPartition(y = diamonds$price,
                               times = 1,
                               p = .5)

# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(train_v$Resample1)

diamonds_test <- diamonds %>%
  slice(-train_v$Resample1)
# ---------------
# Step 1: Define control parameters
# ---------------
# Set up control values
ctr <- trainControl(method = "none")
# ---------------
# Step 2: Train model
# ---------------
# Predict price with linear regression,
# using only three numeric features (carat, depth, table)
diamonds_lm_train <- train(form = price ~ .,
                           data = diamonds_train %>%
                             select(carat, depth, table, price),
                           method = "lm",
                           trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------
class(diamonds_lm_train)    # object class
diamonds_lm_train           # default print method
names(diamonds_lm_train)    # named elements of the training object
summary(diamonds_lm_train)  # summary of the final model

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Look at final model object
diamonds_lm_train$finalModel

# Look at fitting performance!
postResample(pred = predict(diamonds_lm_train$finalModel, diamonds_train),
             obs = diamonds_train$price)
# ---------------
# Step 4: Predict
# ---------------
diamonds_lm_predictions <- predict(diamonds_lm_train,
                                   newdata = diamonds_test)
# ---------------
# Step 5: Evaluate
# ---------------
# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions,
             obs = diamonds_test$price)

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +  # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +  # Perfect-prediction line
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
Here are three models you can try for both regression and classification tasks.

Regression Tasks

For regression tasks, your criterion should be numeric.
- rpart (decision trees)
- rf (random forests)
- glmnet (regularised regression)

Classification Tasks

For classification tasks, your criterion should be a factor.
- rpart (decision trees)
- rf (random forests)
- knn (nearest neighbors)
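To switch between models, you only need to change the method argument of train(). Here is a minimal sketch (for illustration only), reusing the diamonds_train data and ctr control object from the walkthrough above:

# A minimal sketch: swap the method string to change models
# (method = "rf" requires the randomForest package to be installed)
diamonds_rf_train <- train(form = price ~ carat + depth + table,
                           data = diamonds_train,
                           method = "rf",  # try "rpart" or "glmnet" too
                           trControl = ctr)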
In this practical you will conduct machine learning analyses on several data sets. For each of the tasks, go through the following steps.
Step 0: Setup
- Load the training data XXX_train.csv as a new dataframe called XXX_train, and the test data XXX_test.csv as a new dataframe called XXX_test.
- Explore the XXX_train dataframe with a combination of names(), summary(), and other similar functions.
- If there are any predictors in the data that you don’t want in the model, consider excluding them when you train the model in Step 2 (train()).
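For example, a minimal exploration sketch (house_train is an assumed example object you would have loaded already):

# A minimal exploration sketch (assumed object name)
names(house_train)    # column names
summary(house_train)  # summary statistics for every column
glimpse(house_train)  # compact column-by-column overview (tidyverse)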
Step 1: Define control parameters
- Define your control parameters using trainControl(). To start, use method = "none".
Step 2: Train model
- Train one or more models on the training data using train(). For each model, assign the result to a new training object called XX_train (e.g., rf_train for random forests, glm_train for standard regression).
- Make sure to exclude any predictors in the data you don’t want to use!
Step 3: Explore model
- Explore your training objects by printing them, looking at (and printing) their named elements with names(), and accessing a few of the named elements with XX_train$ to see what the outputs look like.
- Look at the final model with XX_train$finalModel. Try applying generic functions like summary(), plot(), and broom::tidy() to the object. Do these help you to understand the final model?
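A minimal sketch of this step, assuming a training object called glm_train whose final model is an lm fit:

# A minimal model-exploration sketch (assumed object name)
glm_train$finalModel               # the fitted model itself
summary(glm_train$finalModel)      # coefficient table for an lm fit
plot(glm_train$finalModel)         # base-R diagnostic plots (lm only)
broom::tidy(glm_train$finalModel)  # the same coefficients as a tidy tibble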
Step 4: Predict new data
- Calculate predictions for the test data XXX_test using predict() and store the results in a vector called XXX_pred (e.g., rf_pred for predictions from random forests).
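For example, a minimal sketch (rf_train and house_test are assumed example objects):

# A minimal prediction sketch (assumed object names)
rf_pred <- predict(rf_train, newdata = house_test)
head(rf_pred)  # first few predictions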
Step 5: Evaluate performance
- Evaluate each model’s prediction performance using postResample(), and for classification tasks, use confusionMatrix() (see the sketch after the plotting template below).
- If you want to plot the results of multiple models, you can try using the following code template:
# Some fake prediction data
# include your real model prediction data here
XX_pred <- rnorm(100, mean = 100, sd = 10)
YY_pred <- rnorm(100, mean = 100, sd = 10)
ZZ_pred <- rnorm(100, mean = 100, sd = 10)

# Some fake true test values
# Get these from your XX_test objects
truth <- rnorm(100, mean = 100, sd = 10)

# Put results together in a tibble
model_results <- tibble(XX = XX_pred,
                        YY = YY_pred,
                        ZZ = ZZ_pred,
                        truth = truth) %>%
  gather(model, pred, -truth) %>%
  mutate(error = pred - truth,
         abserr = abs(error))

# Plot distribution of errors for each model
ggplot(model_results,
       aes(x = model, y = error, col = model)) +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun = mean,    # fun.y / fun.ymin / fun.ymax are deprecated
               fun.min = min,
               fun.max = max,
               colour = "black") +
  labs(title = "Model Prediction Errors",
       subtitle = "Dots represent means",
       caption = "Caret is awesome!") +
  theme(legend.position = "none")

# Plot relationship between truth and predictions
ggplot(model_results,
       aes(x = truth, y = pred, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "XX model predictions",
       subtitle = "Diagonal represents perfect performance",
       caption = "Caret is awesome!",
       x = "True Values",
       y = "Model Predictions")
Max Kuhn, the author of caret, has a fantastic overview of the package at http://topepo.github.io/caret/index.html. If you like the caret package as much as we do, be sure to go through this page in detail.

Max Kuhn is also the co-author of a fantastic book on machine learning called Applied Predictive Modeling - http://appliedpredictivemodeling.com/.
Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that all of the datasets you need for this practical are in your 1_Data folder.

Open a new R script and save it as a new file called ml2_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all of the packages you’ll need for this practical. Here’s how the top of your script should look:
## NAME
## DATE
## Machine Learning 2 Practical
library(XX)
library(XX)
library(XX)
Load the datasets listed in the Datasets section above as new objects using read_csv(). Call them XX_train and XX_test.
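For example, a minimal loading sketch for the house data (assuming the files are in your 1_Data folder):

# A minimal loading sketch, assuming the files sit in 1_Data
house_train <- read_csv(file = "1_Data/house_train.csv")
house_test <- read_csv(file = "1_Data/house_test.csv")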
house
In this task, you’ll create the best possible model to predict the prices of houses in King County, Washington. The training data is stored in house_train and the test data is stored in house_test. The variable you are trying to predict is price!
- Follow Steps 1 through 3 in the Modelling Procedure section above to create and explore your best decision tree model predicting house_train$price using method = "rpart". Call the model rpart_price.
- Do the same with random forests using method = "rf". Call the model rf_price.
- Do the same with regularised regression using method = "glmnet". Call the model glmnet_price.
Make sure you have explored each of your models!
Follow Steps 4 and 5 to see which model did the best in predicting house_test$price.

Which of the three models was the best at predicting price in the test data? Were they very different? Is one easier to interpret than the others?

Go to http://topepo.github.io/caret/available-models.html and find the craziest-looking model you can that works for regression tasks. Then, try it on the house data and see how well it works! Does it do much better than the models you already tried?
Repeat your analyses, but now train your models using 10-fold cross validation by specifying trainControl(method = "cv", number = 10). (The repeats argument only applies to method = "repeatedcv", so it is not needed here.) After you do so, be sure to complete the remaining steps to fit the model using train() and evaluate its prediction performance.

How does the accuracy of the models change when you use 10-fold cross validation? Do the models become more accurate with cross validation than without it?
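A minimal cross-validation sketch for this task (object names follow the conventions above):

# A minimal 10-fold cross-validation sketch
ctr_cv <- trainControl(method = "cv", number = 10)

rf_price <- train(form = price ~ .,
                  data = house_train,
                  method = "rf",
                  trControl = ctr_cv)

rf_price$resample  # performance on each of the 10 folds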
heartdisease
Make sure your criterion values are factors when fitting your model(s)!!
When you are conducting a classification analysis, you need to make sure the criterion is stored as a factor. This tells the function you are doing classification instead of regression. So before doing anything, do this for both the training and test datasets. Here’s how to do it for the heartdisease datasets:
# Convert the diagnosis column to a factor
heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis))

heartdisease_test <- heartdisease_test %>%
  mutate(diagnosis = factor(diagnosis))
In this task, you’ll create the best possible model to predict whether or not a person is having a heart attack. The training data is stored in heartdisease_train and the test data is stored in heartdisease_test. The variable you are trying to predict is diagnosis!
- Follow Steps 1 through 3 in the Modelling Procedure section above to create your best decision tree model predicting diagnosis. Call the model rpart_diagnosis.
- Do the same with random forests using method = "rf". Call the model rf_diagnosis.
- Do the same with nearest neighbors using method = "knn". Call the model knn_diagnosis.
Make sure you have explored each of your models!
Follow Steps 4 and 5 to see which model did the best in predicting heartdisease_test$diagnosis.

Which of the three models was the best at predicting diagnosis? Were they very different? Is one easier to interpret than the others?

Go to http://topepo.github.io/caret/available-models.html and find the craziest-looking model you can that works for classification tasks. Then, try it on the heartdisease dataset and see how well it works! Does it do much better than the models you already tried?
Repeat your analyses, but now train your models using 10-fold cross validation by specifying trainControl(method = "cv", number = 10). After you do so, be sure to complete the remaining steps to fit the model using train() and evaluate its prediction performance.

How does the accuracy of the models change when you use 10-fold cross validation? Do the models become more accurate with cross validation than without it?
attrition
In this task, create the best possible model you can using the attrition_train dataset. The variable you want to predict is Attrition! Use all of the techniques you have learned so far!