In this practical you’ll use the caret (Classification And REgression Training) package to automate many aspects of the machine learning process. By the end of this practical you will know how to use the caret package to train and evaluate machine learning models.
You’ll use four datasets in this practical: boston_train.csv and boston_test.csv, and heartdisease_train.csv and heartdisease_test.csv. They are available in the data_BaselRBootcamp_Day3.zip file on the main course page. If you haven’t already, download data_BaselRBootcamp_Day3.zip and unzip it to get the files. We also recommend saving them in a folder called data in your working directory!
For this practical, you will need the following packages. If you don’t have them installed, install them with install.packages().
# Packages you'll need for this practical
library(caret)
library(mlr)
library(GGally)
library(mlbench)
library(tidyverse)
The following set of example code will take you through the basic steps of machine learning using the amazing caret
package.
# Load packages
library(caret)
library(mlr)
library(GGally)
library(mlbench)
library(tidyverse)
# ---------------
# Setup
# ---------------
# Split diamonds data into separate training and test datasets
diamonds <- diamonds %>%
  sample_frac(1) # Randomly sort
# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(1:500)

diamonds_test <- diamonds %>%
  slice(501:2000)
# ---------------
# Explore
# ---------------
# Explore columns with summarizeColumns
summarizeColumns(diamonds_train)
# Visualise relationships with ggpairs
ggpairs(diamonds_train)
# ---------------
# Train
# ---------------
# Set up control values
rep10_control <- trainControl(method = "repeatedcv",
                              number = 10, # 10 folds
                              repeats = 5) # Repeat 5 times
# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ .,
                           data = diamonds_train,
                           method = "lm",
                           trControl = rep10_control)
# Explore the train object
class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)
# Look at variable importance with varImp
varImp(diamonds_lm_train)
# Save final predictions
diamonds_lm_predictions <- predict(diamonds_lm_train,
                                   newdata = diamonds_test)
# ---------------
# Evaluate
# ---------------
# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() + # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions,
             obs = diamonds_test$price)
The boston
dataset contains housing data for 506 census tracts of Boston from the 1970 census.
Here is a description of each of the variables in the dataset:
Column | Description |
---|---|
crim | per capita crime rate by town |
zn | proportion of residential land zoned for lots over 25,000 sq.ft |
indus | proportion of non-retail business acres per town |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
nox | nitric oxides concentration (parts per 10 million) |
rm | average number of rooms per dwelling |
age | proportion of owner-occupied units built prior to 1940 |
dis | weighted distances to five Boston employment centres |
rad | index of accessibility to radial highways |
tax | full-value property-tax rate per USD 10,000 |
ptratio | pupil-teacher ratio by town |
b | 1000(B - 0.63)^2 where B is the proportion of blacks by town |
lstat | percentage of lower status of the population |
medv | median value of owner-occupied homes in USD 1000’s |
Load the training data boston_train.csv
into R as a new object with the name boston_train
using read_csv()
.
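A minimal sketch, assuming you saved the file in a data folder in your working directory (adjust the path if you saved it somewhere else):
# Load the Boston training data
boston_train <- read_csv("data/boston_train.csv")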
Explore the data using the standard functions head()
, View()
, names()
and summary().
Use the summarizeColumns()
function from the mlr
package to create numerical summaries of each column. Do you notice any strange columns?
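For example, the exploration step might look something like this:
# Explore the training data
head(boston_train)               # First few rows
names(boston_train)              # Column names
View(boston_train)               # Open the data in a spreadsheet-like viewer
summary(boston_train)            # Basic numerical summaries
summarizeColumns(boston_train)   # Column summaries from the mlr package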
Can you think of a reason why you should remove any of the columns? If so, remove them using select()
! There’s no need to include unnecessary data in a machine learning algorithm :)
Visualise the data using the ggpairs()
function from the GGally
package. Do you notice any interesting patterns?
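Here is a rough sketch of these two steps. The column dropped below is purely hypothetical; substitute whichever column(s) you decided to remove, if any:
# Remove a column you judged unnecessary (hypothetical example)
# boston_train <- boston_train %>%
#   select(-some_unneeded_column)

# Visualise pairwise relationships between all columns
ggpairs(boston_train)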
We will use 10-fold cross validation to train the models. To set up the fitting process, we’ll need to use the trainControl
function. Look at the help menu for this function and look through the examples to see how it works.
Using the trainControl()
function, create an object called rep10_control
that will conduct 10-fold cross validation.
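One way to set this up is sketched below (the diamonds example above uses repeated cross validation instead; plain 10-fold cross validation is enough here):
# Control object for 10-fold cross validation
rep10_control <- trainControl(method = "cv",  # cross validation
                              number = 10)    # 10 folds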
We’re almost ready to train models! To do so, we’ll use the almighty train()
function. Look at the help menu for the train()
function to see all of its lovely arguments and examples.
Now it’s time to actually train a model. We’ll start with simple linear regression. Create an object called boston_lm_train
where you predict a town’s median home value medv
based on all other variables. In order to conduct linear regression, include the arguments method = 'lm'
and trControl = rep10_control
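Following the diamonds example, the call might look like this:
# Predict medv from all other variables with linear regression
boston_lm_train <- train(form = medv ~ .,
                         data = boston_train,
                         method = "lm",
                         trControl = rep10_control)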
Explore your boston_lm_train
object using the generic functions names()
, summary()
, and plot()
. What is stored in this object?
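For example:
# What is stored in the train object?
names(boston_lm_train)     # Elements of the train object
summary(boston_lm_train)   # Summary of the final fitted model
# plot(boston_lm_train)    # Note: for "lm" there are no tuning parameters to plot,
                           # so plot() may throw an error here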
Using the varImp()
function, look at the variable importance of each predictor. Which predictors seem to be the most important? After you’ve run the function, try plotting the object with plot()
to visualise the results!
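Something like the following, assuming your model object is called boston_lm_train:
# Variable importance of each predictor
boston_lm_varimp <- varImp(boston_lm_train)
boston_lm_varimp         # Print the importance values
plot(boston_lm_varimp)   # Visualise them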
Load the test dataset as a new object called boston_test
into R using read_csv()
Now it’s time to make predictions from your model! Using predict()
, predict the criterion values for the test dataset boston_test
. In the object argument, use your boston_lm_train model; in the newdata argument, use your boston_test test data. Save the results as the vector boston_lm_predictions.
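A minimal sketch of the loading and prediction steps (the data path is again an assumption):
# Load the Boston test data
boston_test <- read_csv("data/boston_test.csv")

# Predict medv in the test data
boston_lm_predictions <- predict(object = boston_lm_train,
                                 newdata = boston_test)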
# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = boston_lm_predictions,
                           criterion = boston_test$medv)
# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() + # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
Now it’s time to quantify how well your model did. To do this, we’ll evaluate its performance with postResample()
. Start by looking at the help menu for the postResample()
function to see its arguments.
Now evaluate the model’s predictions using postResample()
. You should only specify the pred
and obs
arguments as pred = boston_lm_predictions
and obs = boston_test$medv
.
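That is, something like:
# Compare predictions to the true test values
postResample(pred = boston_lm_predictions,
             obs = boston_test$medv)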
How well did your model do? What was the overall level of accuracy? If you don’t understand the results, look at the Details section of the help menu!
Now let’s try a new model! Instead of using simple linear regression with method = 'lm'
, we’ll use so-called ‘ridge regression’. Ridge regression is a variation of ‘standard’ linear regression that tries to avoid overfitting and thus make better predictions. How can you use ridge regression in caret? Try to answer this yourself by looking at the list of all available models on the caret help site: http://topepo.github.io/caret/available-models.html. Search for ‘ridge’ and find the method
value.
Repeat steps 9 through 16 to create a new set of objects using ridge regression instead of standard linear regression. When creating your objects, you may want to use the name _ridge_ instead of _regression_ so you know which object is which! As you are creating your new objects, make sure to stop and explore them (e.g., with summary(), plot(), and print()) to see how they differ from the previous objects.
Which model predicted the data better? Standard regression or ridge regression? Was using ridge regression able to improve your predictions?
When making predictions from your ridge regression model, be sure to use the boston_test data.
Now it’s time to try a classification problem. We’ll do this with the heartdisease data. The heartdisease dataset contains data from 303 patients suspected of having heart disease. The objective is to predict the diagnosis column, indicating whether or not a patient has heart disease.
Here is a description of each of the variables in the dataset:
Column | Description |
---|---|
age | Age |
sex | Sex, 1 = male, 0 = female |
cp | Chest pain type: ta = typical angina, aa = atypical angina, np = non-anginal pain, a = asymptomatic |
trestbps | Resting blood pressure (in mm Hg on admission to the hospital) |
chol | Serum cholesterol in mg/dl |
fbs | Fasting blood sugar > 120 mg/dl: 1 = true, 0 = false |
restecg | Resting electrocardiographic results. “normal” = normal, “abnormal” = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), “hypertrophy” = showing probable or definite left ventricular hypertrophy by Estes’ criteria. |
thalach | Maximum heart rate achieved |
exang | Exercise induced angina: 1 = yes, 0 = no |
oldpeak | ST depression induced by exercise relative to rest |
slope | The slope of the peak exercise ST segment. |
ca | Number of major vessels (0-3) colored by fluoroscopy |
thal | “normal” = normal, “fd” = fixed defect, “rd” = reversible defect |
diagnosis | 1 = Heart disease, 0 = No Heart disease |
Load the training data heartdisease_train.csv
into R as a new object with the name heartdisease_train
using read_csv()
.
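As before, a minimal sketch assuming the file sits in a data folder:
# Load the heart disease training data
heartdisease_train <- read_csv("data/heartdisease_train.csv")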
Before you do anything else, convert the diagnosis
column to a factor. This will be necessary for some machine learning models to work!
# Convert diagnosis to a factor
# This is important for some machine learning models to do
# classification analyses
heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis)) # Convert diagnosis to a factor
Follow tasks 2 - 5 to explore the data. Remember our goal is to predict diagnosis
!
Now it’s time to train a model on the data. For this data, we can’t use linear regression because the criterion is a nominal class. Instead, we’ll start with a type of decision tree called CART. Look at http://topepo.github.io/caret/available-models.html to find the correct method for using CART.
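If the method value you find for CART is "rpart", the training and prediction steps might look like this sketch (the prediction step assumes you have already loaded the heartdisease_test data and converted its diagnosis column to a factor, as described next):
# Train a CART decision tree predicting diagnosis
heartdisease_rpart_train <- train(form = diagnosis ~ .,
                                  data = heartdisease_train,
                                  method = "rpart",
                                  trControl = rep10_control)

# Predict diagnoses in the test data
heartdisease_rpart_predictions <- predict(object = heartdisease_rpart_train,
                                          newdata = heartdisease_test)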
When you load the heartdisease_test data, make sure to convert the diagnosis column to a factor like you did for the heartdisease_train data.
# -----------------------------------------------------------
# Visualise the relationship between two nominal variables
# -----------------------------------------------------------
# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = heartdisease_rpart_predictions,
                           criterion = heartdisease_test$diagnosis)
# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = ..count..)) +
  geom_bar(aes(fill = criterion), position = "dodge") +
  labs(title = "CART prediction accuracy")
Now that you’ve fit CART, try using random forests! How well do random forests compare to plain old decision trees with CART?
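A rough sketch, assuming the method value for random forests is "rf" (caret may prompt you to install an additional package):
# Train a random forest predicting diagnosis
heartdisease_rf_train <- train(form = diagnosis ~ .,
                               data = heartdisease_train,
                               method = "rf",
                               trControl = rep10_control)

# Predict diagnoses in the test data and evaluate
heartdisease_rf_predictions <- predict(object = heartdisease_rf_train,
                                       newdata = heartdisease_test)

postResample(pred = heartdisease_rf_predictions,
             obs = heartdisease_test$diagnosis)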
If you’ve gotten this far, then you’ve really gotten the hang of the package! Try exploring the list of all models available in caret
(again check out http://topepo.github.io/caret/available-models.html to see them all). Look for the craziest looking models you can find, and see how well they do on the two datasets in this practical. Can you find one that does much better than regression, random forests, or decision trees?
Max Kuhn, the author of caret, has a fantastic overview of the package at http://topepo.github.io/caret/index.html. If you like the caret package as much as we do, be sure to go through this page in detail.
Max Kuhn is also the co-author of a fantastic book on machine learning called Applied Predictive Modeling: http://appliedpredictivemodeling.com/.