The almighty Caret!

Overview

In this practical you'll use the caret (Classification And REgression Training) package to automate many aspects of the machine learning process. By the end of this practical you will know how to:

  1. Select different ‘learners’ available in the caret package
  2. Fit parameters of a learner to a dataset
  3. Explore learners to understand how they work
  4. Use cross validation techniques to estimate the accuracy of a learner and find the best fitting parameters
  5. Make predictions and evaluate the performance of learners!

Datasets

You'll use four datasets in this practical: boston_train.csv, boston_test.csv, heartdisease_train.csv, and heartdisease_test.csv. They are available in the data_BaselRBootcamp_Day3.zip file on the main course page. If you haven't already, download data_BaselRBootcamp_Day3.zip and unzip it to get the files. We also recommend saving them in a folder called data in your working directory!
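
For example, assuming you saved the files in a data folder in your working directory, you could load them like this (the file paths below are an assumption; adjust them to wherever you put the files):

# Load the practical datasets with read_csv() from the tidyverse
# The "data/" paths assume a data folder in your working directory
library(tidyverse)

boston_train       <- read_csv("data/boston_train.csv")
boston_test        <- read_csv("data/boston_test.csv")
heartdisease_train <- read_csv("data/heartdisease_train.csv")
heartdisease_test  <- read_csv("data/heartdisease_test.csv")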

Packages

For this practical, you will need the following packages. If you don't have them installed, you'll need to install them with install.packages().

# Packages you'll need for this practical
library(caret)      # Machine learning: train(), trainControl(), varImp(), postResample()
library(mlr)        # summarizeColumns()
library(GGally)     # ggpairs()
library(mlbench)    # Benchmark datasets
library(tidyverse)  # Data wrangling and plotting

# Note: both caret and mlr provide a train() function. If caret's train()
# ever appears to be masked, call it explicitly as caret::train()

Examples

The following set of example code will take you through the basic steps of machine learning using the amazing caret package.

# Load packages
library(caret)
library(mlr)        
library(GGally)
library(mlbench)
library(tidyverse)

# ---------------
# Setup
# ---------------

# Split diamonds data into separate training and test datasets
diamonds <- diamonds %>%
  sample_frac(1)                  # Randomly sort

# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(1:500)

diamonds_test <- diamonds %>%
  slice(501:2000)

# ---------------
# Explore
# ---------------

# Explore columns with summarizeColumns
summarizeColumns(diamonds_train)

# Visualise relationships with ggpairs
ggpairs(diamonds_train)

# ---------------
# Train
# ---------------

# Set up control values
rep10_control <- trainControl(method = "repeatedcv",
                              number = 10, # 10 folds
                              repeats = 5) # Repeat 5 times

# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ ., 
                           data = diamonds_train,
                           method = "lm",
                           trControl = rep10_control)

# Explore the train object
class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Save final predictions
diamonds_lm_predictions <- predict(diamonds_lm_train, 
                                   newdata = diamonds_test)

# ---------------
# Evaluate
# ---------------

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                          criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")

# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions, 
             obs = diamonds_test$price)

boston – predicting median home value (medv)

The boston dataset contains housing data for 506 census tracts of Boston from the 1970 census.

Here is a description of each of the variables in the dataset

Column   Description
crim     per capita crime rate by town
zn       proportion of residential land zoned for lots over 25,000 sq. ft.
indus    proportion of non-retail business acres per town
chas     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox      nitric oxides concentration (parts per 10 million)
rm       average number of rooms per dwelling
age      proportion of owner-occupied units built prior to 1940
dis      weighted distances to five Boston employment centres
rad      index of accessibility to radial highways
tax      full-value property-tax rate per USD 10,000
ptratio  pupil-teacher ratio by town
b        1000(B - 0.63)^2, where B is the proportion of blacks by town
lstat    percentage of lower status of the population
medv     median value of owner-occupied homes in USD 1000's

(E)xplore

  1. Load the training data boston_train.csv into R as a new object with the name boston_train using read_csv().

  2. Explore the data using the standard functions head(), View(), names(), and summary().

  3. Use the summarizeColumns() function from the mlr package to create numerical summaries of each column. Do you notice any strange columns?

  4. Can you think of a reason why you should remove any of the columns? If so, remove them using select()! There’s no need to include unnecessary data in a machine learning algorithm :)

  5. Visualise the data using the ggpairs() function from the GGally package. Do you notice any interesting patterns? (If you get stuck on any of these steps, a sketch follows this list.)
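
Here is a minimal sketch of one way these exploration steps could look. It assumes boston_train.csv sits in a data folder, and the commented-out select() line uses a hypothetical column name (some_column) as a placeholder for whatever you decide to drop:

# Minimal sketch of the (E)xplore steps (adjust the path to your setup)
boston_train <- read_csv("data/boston_train.csv")

head(boston_train)               # First few rows
names(boston_train)              # Column names
summary(boston_train)            # Basic numerical summaries
summarizeColumns(boston_train)   # Column summaries from the mlr package

# If you decide a column isn't needed, drop it with select()
# boston_train <- boston_train %>% select(-some_column)   # some_column is a placeholder

ggpairs(boston_train)            # Pairwise relationships from the GGally package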

(T)rain

  6. We will use 10-fold cross validation to train the models. To set up the fitting process, we'll need to use the trainControl() function. Look at the help menu for this function and look through the examples to see how it works.

  7. Using the trainControl() function, create an object called rep10_control that will conduct 10-fold cross validation.

  8. We're almost ready to train models! To do so, we'll use the almighty train() function. Look at the help menu for the train() function to see all of its lovely arguments and examples.

  9. Now it's time to actually train a model. We'll start with simple linear regression. Create an object called boston_lm_train where you predict a town's median home value medv based on all other variables. In order to conduct linear regression, include the arguments method = 'lm' and trControl = rep10_control.

  10. Explore your boston_lm_train object using the generic functions names(), summary(), and plot(). What is stored in this object?

  11. Using the varImp() function, look at the variable importance of each predictor. Which predictors seem to be the most important? After you've run the function, try plotting the object with plot() to visualise the results!

  12. Load the test dataset as a new object called boston_test into R using read_csv().

  13. Now it's time to make predictions from your model! Using predict(), predict the criterion values for the test dataset boston_test. In the object argument, use your boston_lm_train model; in the newdata argument, use your boston_test test data. Save the results as the vector boston_lm_predictions. (If you get stuck, a sketch of these training steps follows this list.)
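
The following is a minimal sketch of how these training steps might look. It assumes a plain 10-fold cross validation control object (the repeated version from the Examples section works too), and it calls caret::train() explicitly in case mlr's train() is masking caret's:

# Minimal sketch of the (T)rain steps for the boston data
rep10_control <- trainControl(method = "cv",
                              number = 10)    # 10-fold cross validation

# Linear regression predicting medv from all other variables
boston_lm_train <- caret::train(medv ~ .,
                                data = boston_train,
                                method = "lm",
                                trControl = rep10_control)

summary(boston_lm_train)         # What did the model learn?
varImp(boston_lm_train)          # Variable importance
plot(varImp(boston_lm_train))    # ...plotted

# Load the test data and generate predictions
boston_test <- read_csv("data/boston_test.csv")

boston_lm_predictions <- predict(boston_lm_train,
                                 newdata = boston_test)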

(E)valuate model performance

  14. How well did your model do in predicting the true data? To start, plot the relationship between your model predictions and the true criterion using the following code template:
# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = boston_lm_predictions,
                          criterion = boston_test$medv)

# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() +   # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
  15. Now it's time to quantify how well your model did. To do this, we'll evaluate its performance with postResample(). Start by looking at the help menu for the postResample() function to see its arguments.

  16. Now evaluate the model's predictions using postResample(). You only need to specify the pred and obs arguments, as pred = boston_lm_predictions and obs = boston_test$medv.

  17. How well did your model do? What was the overall level of accuracy? If you don't understand the results, look at the Details section of the help menu. (A short sketch of this step follows below.)
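
As a reference point, a minimal sketch of the evaluation step might look like this (the exact measures reported depend on your caret version, but they will include RMSE and R-squared for a regression model):

# Minimal sketch of the (E)valuate step for the regression model
postResample(pred = boston_lm_predictions,
             obs = boston_test$medv)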

Ridge regression

  18. Now let's try a new model! Instead of using simple linear regression with method = 'lm', we'll use so-called 'ridge regression'. Ridge regression is a variation of 'standard' linear regression that tries to avoid overfitting and thus make better predictions. How can you use ridge regression in caret? Try to answer this yourself by looking at the list of all available models on the caret help site: http://topepo.github.io/caret/available-models.html. Search for 'ridge' and find the method value.

  19. Repeat steps 9 through 16 to create a new set of objects using ridge regression instead of standard linear regression. When creating your objects, you may want to use the name ridge instead of lm so you know which object is which! As you are creating your new objects, make sure to stop and explore them (e.g. with summary(), plot(), and print()) to see how they differ from the previous objects.

  20. Which model predicted the data better, standard regression or ridge regression? Was ridge regression able to improve your predictions? (If you want to check your approach against something concrete, a sketch follows below.)
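
Here is a minimal sketch of the ridge workflow. It assumes the method value you found is "ridge" (one of the ridge regression entries on the available-models page, which relies on the elasticnet package being installed):

# Minimal sketch: the same workflow with ridge regression
# Assumes method = "ridge" (requires the elasticnet package)
boston_ridge_train <- caret::train(medv ~ .,
                                   data = boston_train,
                                   method = "ridge",
                                   trControl = rep10_control)

boston_ridge_train               # Cross-validation results across lambda values

boston_ridge_predictions <- predict(boston_ridge_train,
                                    newdata = boston_test)

postResample(pred = boston_ridge_predictions,
             obs = boston_test$medv)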

Random forests

  21. Now you're getting the hang of it! Instead of regression, let's try a totally different model: random forests! Repeat all of your steps to see how well random forests do in predicting the boston_test data. (One possible starting point is sketched below.)
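
The sketch below assumes method = "rf" (the randomForest implementation listed on the available-models page); note that random forests can take noticeably longer to fit than regression:

# Minimal sketch: random forests on the boston data
# Assumes method = "rf" (requires the randomForest package)
boston_rf_train <- caret::train(medv ~ .,
                                data = boston_train,
                                method = "rf",
                                trControl = rep10_control)

boston_rf_predictions <- predict(boston_rf_train,
                                 newdata = boston_test)

postResample(pred = boston_rf_predictions,
             obs = boston_test$medv)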

heartdisease – predicting heart disease status (diagnosis)

Now it's time to try a classification problem. We'll do this with the heartdisease data. The heartdisease dataset contains data from 303 patients suspected of having heart disease. The objective is to predict the diagnosis column, which indicates whether or not a patient has heart disease.

Here is a description of each of the variables in the dataset

Column    Description
age       Age
sex       Sex: 1 = male, 0 = female
cp        Chest pain type: ta = typical angina, aa = atypical angina, np = non-anginal pain, a = asymptomatic
trestbps  Resting blood pressure (in mm Hg on admission to the hospital)
chol      Serum cholesterol in mg/dl
fbs       Fasting blood sugar > 120 mg/dl: 1 = true, 0 = false
restecg   Resting electrocardiographic results: "normal" = normal, "abnormal" = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), "hypertrophy" = showing probable or definite left ventricular hypertrophy by Estes' criteria
thalach   Maximum heart rate achieved
exang     Exercise induced angina: 1 = yes, 0 = no
oldpeak   ST depression induced by exercise relative to rest
slope     The slope of the peak exercise ST segment
ca        Number of major vessels (0-3) colored by fluoroscopy
thal      "normal" = normal, "fd" = fixed defect, "rd" = reversible defect
diagnosis 1 = heart disease, 0 = no heart disease

  1. Load the training data heartdisease_train.csv into R as a new object with the name heartdisease_train using read_csv().

  2. Before you do anything else, convert the diagnosis column to a factor. This will be necessary for some machine learning models to work!

# Convert diagnosis to a factor
#   This is important for some machine learning models to do
#   classification analyses

heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis))      # Convert diagnosis to a factor

  3. Follow tasks 2 - 5 (the boston exploration tasks) to explore the data. Remember, our goal is to predict diagnosis!

  4. Now it's time to train a model on the data. For this data, we can't use linear regression because the criterion is a nominal class. Instead, we'll start with a type of decision tree called CART. Look at http://topepo.github.io/caret/available-models.html to find the correct method for using CART.

  5. Follow tasks 6 - 17 (the boston training and evaluation tasks) to fit your model and evaluate its performance. (If you get stuck, a sketch follows the plotting template below.)
    • When you read in the heartdisease_test data, make sure to convert the diagnosis column to a factor, like you did for the heartdisease_train data.
    • In task 14, where you visualise the performance of the model, you'll need to tweak the plotting code a bit. To help, try using the following code template:
# -----------------------------------------------------------
# Visualise the relationship between two nominal variables
# -----------------------------------------------------------

# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = heartdisease_rpart_predictions,
                          criterion = heartdisease_test$diagnosis)

# Plot results
ggplot(data = performance_data, 
       aes(x = predictions, ..count..)) + 
  geom_bar(aes(fill = criterion), position = "dodge") +
  labs(title = "CART prediction accuracy")
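
If you get stuck, here is a minimal sketch of the classification workflow. It assumes the CART method value you found is "rpart" (which matches the heartdisease_rpart_predictions name used in the template above) and that diagnosis has been converted to a factor in both the training and test data:

# Minimal sketch: CART classification on the heartdisease data
# Assumes method = "rpart" (requires the rpart package)
heartdisease_rpart_train <- caret::train(diagnosis ~ .,
                                         data = heartdisease_train,
                                         method = "rpart",
                                         trControl = rep10_control)

heartdisease_rpart_predictions <- predict(heartdisease_rpart_train,
                                          newdata = heartdisease_test)

# Overall accuracy and Kappa
postResample(pred = heartdisease_rpart_predictions,
             obs = heartdisease_test$diagnosis)

# A more detailed breakdown of hits and misses
confusionMatrix(data = heartdisease_rpart_predictions,
                reference = heartdisease_test$diagnosis)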

  6. Now that you've fit CART, try using random forests! How well do random forests compare to plain old decision trees with CART?

  7. If you've gotten this far, then you've really gotten the hang of the package! Try exploring the list of all models available in caret (again, check out http://topepo.github.io/caret/available-models.html to see them all). Look for the craziest-looking models you can find and see how well they do on the two datasets in this practical. Can you find one that does much better than regression, random forests, or decision trees? (One way to swap in a new learner is sketched below.)
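
Swapping in a different learner mostly means changing the method value. The sketch below uses k-nearest neighbours (method = "knn") purely as an example; substitute any method value you find on the available-models page (some require additional packages to be installed):

# Minimal sketch: trying another learner by changing `method`
# "knn" is only an example; swap in any method value from the available-models page
heartdisease_knn_train <- caret::train(diagnosis ~ .,
                                       data = heartdisease_train,
                                       method = "knn",
                                       trControl = rep10_control)

heartdisease_knn_predictions <- predict(heartdisease_knn_train,
                                        newdata = heartdisease_test)

postResample(pred = heartdisease_knn_predictions,
             obs = heartdisease_test$diagnosis)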

Further Reading