The almighty Caret!

Slides

Here are the introduction slides for this practical on machine learning II!

Overview

In this practical you’ll use the caret (Classification And REgression Training) package to automate many aspects of the machine learning process. By the end of this practical you will know:

Select different ‘learners’ available in the caret package
Fit parameters of a learner to a dataset
Explore learners to understand how they work
Use cross validation techniques to estimate the accuracy of a learner and find best fitting parameters 5 Make predictions and evaluate the performance of learners!

Datasets

You’ll use four datasets in this practical: boston_train_csv and boston_test.csv, heartdisease_train.csv and heartdisease_test.csv. They available in the data_BaselRBootcamp_Day3.zip file available through the main course page. If you haven’t already, download the data_BaselRBootcamp_Day3.zip folder and unzip it to get the files. We also recommend saving them in a folder called data in your working directory!

Packages

For this practical, you will need the following packages. If you don’t have any installed, you’ll need to install them with install.packages()

# Packages you'll need for this practical
library(caret)
library(mlr)        
library(GGally)
library(mlbench)
library(tidyverse)

Examples

The following set of example code will take you through the basic steps of machine learning using the amazing caret package.

# Load packages
library(caret)
library(mlr)        
library(GGally)
library(mlbench)
library(tidyverse)

# ---------------
# Setup
# ---------------

# Split diamonds data into separate training and test datasets
diamonds <- diamonds %>%
  sample_frac(1)                  # Randomly sort

# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(1:500)

diamonds_test <- diamonds %>%
  slice(501:2000)

# ---------------
# Explore
# ---------------

# Explore columns with summarizeColumns
summarizeColumns(diamonds_train)

# Visualise relationships with ggpairs
ggpairs(diamonds_train)

# ---------------
# Train
# ---------------

# Set up control values
rep10_control <- trainControl(method = "repeatedcv",
                              number = 10, # 10 folds
                              repeats = 5) # Repeat 5 times

# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ ., 
                           data = diamonds_train,
                           method = "lm",
                           trControl = rep10_control)

# Explore the train object
class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Save final predictions
diamonds_lm_predictions <- predict(diamonds_lm_train, 
                                   newdata = diamonds_test)

# ---------------
# Evaluate
# ---------------

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                          criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions, 
             obs = diamonds_test$price)

boston – predicting median home value (medv)

The boston dataset contains housing data for 506 census tracts of Boston from the 1970 census.

Here is a description of each of the variables in the dataset

Column	Description
crim	per capita crime rate by town
zn	proportion of residential land zoned for lots over 25,000 sq.ft
indus	proportion of non-retail business acres per town
chas	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox	nitric oxides concentration (parts per 10 million)
rm	average number of rooms per dwelling
age	proportion of owner-occupied units built prior to 1940
dis	weighted distances to five Boston employment centres
rad	index of accessibility to radial highways
tax	full-value property-tax rate per USD 10,000
ptratio	pupil-teacher ratio by town
b	1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat	percentage of lower status of the population
medv	median value of owner-occupied homes in USD 1000’s

(E)xplore

Load the training data boston_train.csv into R as a new object with the name boston_train using read_csv().

boston_train <- read_csv("data/boston_train.csv")

Explore the data using the standard functions head(), View(), names() and summary()

head(boston_train)

# A tibble: 6 x 14
    crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>   <dbl>
1 3.54      0. 19.6      1 0.871  6.15  82.6  1.75     5   403    14.7
2 0.340     0. 21.9      0 0.624  6.46  98.9  2.12     4   437    21.2
3 0.0798   40.  6.41     0 0.447  6.48  32.1  4.14     4   254    17.6
4 0.773     0.  8.14     0 0.538  6.50  94.4  4.45     4   307    21.0
5 0.330     0.  6.20     0 0.507  6.09  61.5  3.65     8   307    17.4
6 0.103    30.  4.93     0 0.428  6.36  52.9  7.04     6   300    16.6
# ... with 3 more variables: b <dbl>, lstat <dbl>, medv <dbl>

Use the summarizeColumns() function from the mlr package to create numerical summaries of each column. Do you notice any strange columns?

mlr::summarizeColumns(boston_train)

      name    type na       mean        disp     median         mad
1     crim numeric  0   4.017157  10.6362176   0.239745   0.3043481
2       zn numeric  0  12.835000  24.4253196   0.000000   0.0000000
3    indus numeric  0  11.644100   7.5021709   9.900000  11.4011940
4     chas integer  0   0.090000   0.2876235   0.000000   0.0000000
5      nox numeric  0   0.568709   0.1260641   0.544000   0.1319514
6       rm numeric  0   6.346150   0.7002630   6.278000   0.4885167
7      age numeric  0  70.893000  27.4259748  82.200000  24.2405100
8      dis numeric  0   3.493716   2.0120118   2.800100   1.4820070
9      rad integer  0   9.670000   8.8740400   5.000000   2.9652000
10     tax integer  0 415.710000 173.7767809 364.000000 139.3644000
11 ptratio numeric  0  18.289000   2.2834448  18.600000   2.3721600
12       b numeric  0 362.635300  84.7712980 390.720000   9.1624680
13   lstat numeric  0  12.349100   7.1234023  10.585000   6.1083120
14    medv numeric  0  23.326000   9.6243758  21.800000   5.9304000
         min      max nlevs
1    0.01096  73.5341     0
2    0.00000  95.0000     0
3    0.46000  27.7400     0
4    0.00000   1.0000     0
5    0.38900   0.8710     0
6    4.13800   8.3980     0
7    7.80000 100.0000     0
8    1.17420   9.2203     0
9    1.00000  24.0000     0
10 188.00000 711.0000     0
11  12.60000  21.2000     0
12  16.45000 396.9000     0
13   2.47000  34.7700     0
14   5.00000  50.0000     0

# Nope they all look ok!

Can you think of a reason why you should remove any of the columns? If so, remove them using select()! There’s no need to include unnecessary data in a machine learning algorithm :)
Visualise the data using the ggpairs() function from the GGally package. Do you notice any interesting patterns?

ggpairs(boston_train)

(T)rain

We will use 10-fold cross validation to train the models. To set up the fitting process, we’ll need to use the trainControl function. Look at the help menu for this function and look through the examples to see how it works.
Using the trainControl() function, create an object called rep10_control that will conduct 10-fold cross validation.

rep10_control <- caret::trainControl(method = "repeatedcv",
                                    number = 10,
                                    search = "grid")          # Find optimal parameters

We’re almost ready to train models! To do so, we’ll use the almighty train() function. Look at the help menu for the train() function to see all of its lovely arguments and examples.

?train

Now it’s time to actually train a model. We’ll start with simple linear regression. Create an object called boston_lm_train where you predict a town’s median home value medv based on all other variables. In order to conduct linear regression, include the arguments method = 'lm' and trControl = rep10_control

boston_lm_train <- caret::train(form = medv ~ ., 
                                        data = boston_train,
                                        method = "lm",
                                        trControl = rep10_control)

Explore your boston_lm_train argument using the generic functions names(), summary(), and plot(). What is stored in this object?

boston_lm_train

Linear Regression 

100 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 90, 90, 91, 89, 90, 91, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  5.896511  0.667127  4.522078

Tuning parameter 'intercept' was held constant at a value of TRUE

summary(boston_lm_train)


Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.548  -2.999  -0.452   2.584  18.646 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  53.236754  11.969055   4.448 2.58e-05 ***
crim         -0.150641   0.062897  -2.395 0.018790 *  
zn            0.097746   0.036441   2.682 0.008767 ** 
indus        -0.064170   0.139644  -0.460 0.647019    
chas          0.871467   1.984693   0.439 0.661694    
nox         -18.843021   9.448875  -1.994 0.049297 *  
rm            1.983625   1.006865   1.970 0.052045 .  
age           0.039229   0.043431   0.903 0.368918    
dis          -2.214633   0.579946  -3.819 0.000253 ***
rad           0.308793   0.139984   2.206 0.030056 *  
tax          -0.004699   0.008022  -0.586 0.559580    
ptratio      -1.254161   0.332930  -3.767 0.000302 ***
b             0.012235   0.007149   1.712 0.090587 .  
lstat        -0.756288   0.118553  -6.379 8.61e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.017 on 86 degrees of freedom
Multiple R-squared:  0.764, Adjusted R-squared:  0.7283 
F-statistic: 21.41 on 13 and 86 DF,  p-value: < 2.2e-16

Using the varImp() function, look at the variable importance of each predictor. Which predictors seem to be the most important? After you’ve run the function, try plotting the object with plot() to visualise the results!

caret::varImp(boston_lm_train)

lm variable importance

         Overall
lstat   100.0000
dis      56.8933
ptratio  56.0239
zn       37.7630
crim     32.9271
rad      29.7434
nox      26.1794
rm       25.7735
b        21.4207
age       7.8137
tax       2.4688
indus     0.3439
chas      0.0000

Load the test dataset as a new object called boston_test into R using read_csv()

boston_test <- read_csv("data/boston_test.csv")

Now it’s time to make predictions form your model! Using predict(), predict the criterion values for the test dataset boston_test. In the object argument, use your boston_lm_train model, in the newdata argument, use your boston_test test data. Save the results as the vector boston_lm_predictions

boston_lm_predictions <- predict(boston_lm_train, 
                                 newdata = boston_test)

(E)valuate model performance

How well did your model do in predicting the true data? To start, plot the relationship between your model predictions and the true criterion using the following template

# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = boston_lm_predictions,
                          criterion = boston_test$medv)

# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

Now it’s time to quantify how well your model did. To do this, we’ll evaluate its performance with postResample(). Start by looking at the help menu for the postResample() function to see its arguments.

?postResample

Now evaluate the model’s predictions using postResample(). You should only specify the pred and obs arguments as pred = boston_lm_predictions and obs = boston_test$medv.

postResample(pred = boston_lm_predictions, 
             obs = boston_test$medv)

     RMSE  Rsquared       MAE 
5.1690046 0.6859913 3.8336613

How well did your model do? What was the overall level of accuracy? If you don’t understand the results, look at the details in the help menu for help!

Ridge regression

Now let’s try a new model! Instead of using simple linear regression with method = 'lm', we’ll use so–called ‘ridge regression’. Ridge regression is a variation of ‘standard’ linear regression that tries to avoid overfitting and thus make better predictions. How can you use ridge regression in caret? Try to answer yourself by looking at the list of all available models on the caret help site: http://topepo.github.io/caret/available-models.html. Search for ‘ridge’ and find the method value.

# It's method = "ridge" :)

Repeat steps 9 through 16 to create a new set of objects using ridge regression instead of standard linear regression. When creating your objects, you may want to use the name _ridge_ instead of _regression_ so you know which object is which!. As you are creating your new objects, make sure to stop and explore them (i.e.; with summary(), plot(), print() to see how they look different from the previous objects.

boston_ridge_train <- caret::train(form = medv ~ ., 
                                   data = boston_train,
                                   method = "ridge",
                                   trControl = rep10_control)

# VarImp() doesn't work for ridge regression
# caret::varImp(boston_ridge_train)

boston_ridge_predictions <- predict(boston_ridge_train, 
                                 newdata = boston_test)

# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = boston_ridge_predictions,
                          criterion = boston_test$medv)

# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Ridge Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

 postResample(pred = boston_ridge_predictions, 
              obs = boston_test$medv)

     RMSE  Rsquared       MAE 
4.9313886 0.7125253 3.5123396

Which model predicted the data better? Standard regression or ridge regression? Was using ridge regression able to improve your predictions?

# Looks like ridge regression did better with an RMSE of 4.93 comapred to 5.19 for standard regression

Random forests

Now you’re getting the hang of it! Instead of regression, let’s try a totally different model….random forests! Repeat all of your steps to see how well random forests do in predicting the boston_test data.

boston_rf_train <- caret::train(form = medv ~ ., 
                                   data = boston_train,
                                   method = "rf",
                                   trControl = rep10_control)

# VarImp() doesn't work for rf regression
# caret::varImp(boston_rf_train)

boston_rf_predictions <- predict(boston_rf_train, 
                                 newdata = boston_test)

# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = boston_rf_predictions,
                          criterion = boston_test$medv)

# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Random Forests prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

 postResample(pred = boston_rf_predictions, 
              obs = boston_test$medv)

     RMSE  Rsquared       MAE 
4.2500196 0.7846074 2.9331980

heartdisease - Predicting heartdisease status

Now it’s time to try a classification problem. We’ll do this with the heartdisease data. The heartdisease dataset contains data from 303 patients suspected of having heart disease The objective is to predict the diagnosis column indicating whether a patient has heartdisease or not