In this practical you’ll use the caret (Classification And REgression Training) package to automate many aspects of the machine learning process. By the end of this practical you will know how to use the caret package to train and evaluate machine learning models.
You’ll use four datasets in this practical: boston_train.csv and boston_test.csv, and heartdisease_train.csv and heartdisease_test.csv. They are available in the data_BaselRBootcamp_Day3.zip file on the main course page. If you haven’t already, download data_BaselRBootcamp_Day3.zip and unzip it to get the files. We also recommend saving them in a folder called data in your working directory!
For this practical, you will need the following packages. If you don’t have them installed, install them with install.packages().
# Packages you'll need for this practical
library(caret)
library(mlr)
library(GGally)
library(mlbench)
library(tidyverse)
The following set of example code will take you through the basic steps of machine learning using the amazing caret
package.
# Load packages
library(caret)
library(mlr)
library(GGally)
library(mlbench)
library(tidyverse)
# ---------------
# Setup
# ---------------
# Split diamonds data into separate training and test datasets
diamonds <- diamonds %>%
  sample_frac(1) # Randomly sort
# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(1:500)

diamonds_test <- diamonds %>%
  slice(501:2000)
# ---------------
# Explore
# ---------------
# Explore columns with summarizeColumns
summarizeColumns(diamonds_train)
# Visualise relationships with ggpairs
ggpairs(diamonds_train)
# ---------------
# Train
# ---------------
# Set up control values
rep10_control <- trainControl(method = "repeatedcv",
                              number = 10, # 10 folds
                              repeats = 5) # Repeat 5 times
# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ .,
                           data = diamonds_train,
                           method = "lm",
                           trControl = rep10_control)
# Explore the train object
class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)
# Look at variable importance with varImp
varImp(diamonds_lm_train)
# Save final predictions
diamonds_lm_predictions <- predict(diamonds_lm_train,
                                   newdata = diamonds_test)
# ---------------
# Evaluate
# ---------------
# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() + # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions,
             obs = diamonds_test$price)
The boston
dataset contains housing data for 506 census tracts of Boston from the 1970 census.
Here is a description of each of the variables in the dataset:
Column | Description |
---|---|
crim | per capita crime rate by town |
zn | proportion of residential land zoned for lots over 25,000 sq.ft |
indus | proportion of non-retail business acres per town |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
nox | nitric oxides concentration (parts per 10 million) |
rm | average number of rooms per dwelling |
age | proportion of owner-occupied units built prior to 1940 |
dis | weighted distances to five Boston employment centres |
rad | index of accessibility to radial highways |
tax | full-value property-tax rate per USD 10,000 |
ptratio | pupil-teacher ratio by town |
b | 1000(B - 0.63)^2 where B is the proportion of blacks by town |
lstat | percentage of lower status of the population |
medv | median value of owner-occupied homes in USD 1000’s |
Load the training data boston_train.csv
into R as a new object with the name boston_train
using read_csv()
.
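A minimal sketch, assuming you saved the file in a data folder in your working directory (adjust the path if you saved it somewhere else):
# Load the Boston training data
boston_train <- read_csv("data/boston_train.csv")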
Explore the data using the standard functions head()
, View()
, names()
and summary().
Use the summarizeColumns()
function from the mlr
package to create numerical summaries of each column. Do you notice any strange columns?
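For example, the exploration step might look something like this:
# Explore the training data
head(boston_train)               # First few rows
names(boston_train)              # Column names
View(boston_train)               # Open the data in a spreadsheet-like viewer
summary(boston_train)            # Basic numerical summaries
summarizeColumns(boston_train)   # Column summaries from the mlr package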
Can you think of a reason why you should remove any of the columns? If so, remove them using select()
! There’s no need to include unnecessary data in a machine learning algorithm :)
Visualise the data using the ggpairs()
function from the GGally
package. Do you notice any interesting patterns?
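Here is a rough sketch of these two steps. The column dropped below is purely hypothetical; substitute whichever column(s) you decided to remove, if any:
# Remove a column you judged unnecessary (hypothetical example)
# boston_train <- boston_train %>%
#   select(-some_unneeded_column)

# Visualise pairwise relationships between all columns
ggpairs(boston_train)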
We will use 10-fold cross validation to train the models. To set up the fitting process, we’ll need to use the trainControl
function. Look at the help menu for this function and look through the examples to see how it works.
Using the trainControl()
function, create an object called rep10_control
that will conduct 10-fold cross validation.
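One way to set this up is sketched below (the diamonds example above uses repeated cross validation instead; plain 10-fold cross validation is enough here):
# Control object for 10-fold cross validation
rep10_control <- trainControl(method = "cv",  # cross validation
                              number = 10)    # 10 folds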
We’re almost ready to train models! To do so, we’ll use the almighty train()
function. Look at the help menu for the train()
function to see all of its lovely arguments and examples.
Now it’s time to actually train a model. We’ll start with simple linear regression. Create an object called boston_lm_train
where you predict a town’s median home value medv
based on all other variables. In order to conduct linear regression, include the arguments method = 'lm'
and trControl = rep10_control
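Following the diamonds example, the call might look like this:
# Predict medv from all other variables with linear regression
boston_lm_train <- train(form = medv ~ .,
                         data = boston_train,
                         method = "lm",
                         trControl = rep10_control)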
Explore your boston_lm_train
object using the generic functions names()
, summary()
, and plot()
. What is stored in this object?
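For example:
# What is stored in the train object?
names(boston_lm_train)     # Elements of the train object
summary(boston_lm_train)   # Summary of the final fitted model
# plot(boston_lm_train)    # Note: for "lm" there are no tuning parameters to plot,
                           # so plot() may throw an error here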
Using the varImp()
function, look at the variable importance of each predictor. Which predictors seem to be the most important? After you’ve run the function, try plotting the object with plot()
to visualise the results!
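Something like the following, assuming your model object is called boston_lm_train:
# Variable importance of each predictor
boston_lm_varimp <- varImp(boston_lm_train)
boston_lm_varimp         # Print the importance values
plot(boston_lm_varimp)   # Visualise them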
Load the test dataset as a new object called boston_test
into R using read_csv()
Now it’s time to make predictions from your model! Using predict()
, predict the criterion values for the test dataset boston_test
. In the object argument, use your boston_lm_train model; in the newdata argument, use your boston_test test data. Save the results as the vector boston_lm_predictions.
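A minimal sketch of the loading and prediction steps (the data path is again an assumption):
# Load the Boston test data
boston_test <- read_csv("data/boston_test.csv")

# Predict medv in the test data
boston_lm_predictions <- predict(object = boston_lm_train,
                                 newdata = boston_test)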
# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = boston_lm_predictions,
                           criterion = boston_test$medv)
# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point() + # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
Now it’s time to quantify how well your model did. To do this, we’ll evaluate its performance with postResample()
. Start by looking at the help menu for the postResample()
function to see its arguments.
Now evaluate the model’s predictions using postResample()
. You should only specify the pred
and obs
arguments as pred = boston_lm_predictions
and obs = boston_test$medv
.
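That is, something like:
# Compare predictions to the true test values
postResample(pred = boston_lm_predictions,
             obs = boston_test$medv)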
How well did your model do? What was the overall level of accuracy? If you don’t understand the results, look at the Details section of the help menu!
Now let’s try a new model! Instead of using simple linear regression with method = 'lm'
, we’ll use so-called ‘ridge regression’. Ridge regression is a variation of ‘standard’ linear regression that tries to avoid overfitting and thus make better predictions. How can you use ridge regression in caret? Try to answer this yourself by looking at the list of all available models on the caret help site: http://topepo.github.io/caret/available-models.html. Search for ‘ridge’ and find the method
value.
Repeat steps 9 through 16 to create a new set of objects using ridge regression instead of standard linear regression. When creating your objects, you may want to use the name _ridge_ instead of _regression_ so you know which object is which! As you are creating your new objects, make sure to stop and explore them (e.g., with summary(), plot(), and print()) to see how they differ from the previous objects.
Which model predicted the data better? Standard regression or ridge regression? Was using ridge regression able to improve your predictions?
When making predictions from your ridge regression model, be sure to use the boston_test data.
Now it’s time to try a classification problem. We’ll do this with the heartdisease data. The heartdisease dataset contains data from 303 patients suspected of having heart disease. The objective is to predict the diagnosis column, indicating whether or not a patient has heart disease.
Here is a description of each of the variables in the dataset:
Column | Description |
---|---|
age | Age |
sex | Sex, 1 = male, 0 = female |
cp | Chest pain type: ta = typical angina, aa = atypical angina, np = non-anginal pain, a = asymptomatic |
trestbps | Resting blood pressure (in mm Hg on admission to the hospital) |
chol | Serum cholesterol in mg/dl |
fbs | Fasting blood sugar > 120 mg/dl: 1 = true, 0 = false |
restecg | Resting electrocardiographic results. “normal” = normal, “abnormal” = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), “hypertrophy” = showing probable or definite left ventricular hypertrophy by Estes’ criteria. |
thalach | Maximum heart rate achieved |
exang | Exercise induced angina: 1 = yes, 0 = no |
oldpeak | ST depression induced by exercise relative to rest |
slope | The slope of the peak exercise ST segment. |
ca | Number of major vessels (0-3) colored by fluoroscopy |
thal | “normal” = normal, “fd” = fixed defect, “rd” = reversible defect |
diagnosis | 1 = Heart disease, 0 = No Heart disease |
Load the training data heartdisease_train.csv
into R as a new object with the name heartdisease_train
using read_csv()
.
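As before, a minimal sketch assuming the file sits in a data folder:
# Load the heart disease training data
heartdisease_train <- read_csv("data/heartdisease_train.csv")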
Before you do anything else, convert the diagnosis
column to a factor. This will be necessary for some machine learning models to work!
# Convert diagnosis to a factor
# This is important for some machine learning models to do
# classification analyses
heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis)) # Convert diagnosis to a factor
Follow tasks 2 - 5 to explore the data. Remember our goal is to predict diagnosis
!
Now it’s time to train a model on the data. For this data, we can’t use linear regression because the criterion is a nominal class. Instead, we’ll start with a type of decision tree called CART. Look at http://topepo.github.io/caret/available-models.html to find the correct method for using CART.
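If the method value you find for CART is "rpart", the training and prediction steps might look like this sketch (the prediction step assumes you have already loaded the heartdisease_test data and converted its diagnosis column to a factor, as described next):
# Train a CART decision tree predicting diagnosis
heartdisease_rpart_train <- train(form = diagnosis ~ .,
                                  data = heartdisease_train,
                                  method = "rpart",
                                  trControl = rep10_control)

# Predict diagnoses in the test data
heartdisease_rpart_predictions <- predict(object = heartdisease_rpart_train,
                                          newdata = heartdisease_test)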
When you load the heartdisease_test data, make sure to convert the diagnosis column to a factor like you did for the heartdisease_train data.
# -----------------------------------------------------------
# Visualise the relationship between two nominal variables
# -----------------------------------------------------------
# Combine predictions and criterion in one tibble
performance_data <- tibble(predictions = heartdisease_rpart_predictions,
                           criterion = heartdisease_test$diagnosis)
# Plot results
ggplot(data = performance_data,
       aes(x = predictions, y = ..count..)) +
  geom_bar(aes(fill = criterion), position = "dodge") +
  labs(title = "CART prediction accuracy")
Now that you’ve fit CART, try using random forests! How well do random forests compare to plain old decision trees with CART?
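A rough sketch, assuming the method value for random forests is "rf" (caret may prompt you to install an additional package):
# Train a random forest predicting diagnosis
heartdisease_rf_train <- train(form = diagnosis ~ .,
                               data = heartdisease_train,
                               method = "rf",
                               trControl = rep10_control)

# Predict diagnoses in the test data and evaluate
heartdisease_rf_predictions <- predict(object = heartdisease_rf_train,
                                       newdata = heartdisease_test)

postResample(pred = heartdisease_rf_predictions,
             obs = heartdisease_test$diagnosis)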
If you’ve gotten this far, then you’ve really gotten the hang of the package! Try exploring the list of all models available in caret
(again check out http://topepo.github.io/caret/available-models.html to see them all). Look for the craziest looking models you can find, and see how well they do on the two datasets in this practical. Can you find one that does much better than regression, random forests, or decision trees?
Max Kuhn, the author of caret, has a fantastic overview of the package at http://topepo.github.io/caret/index.html. If you like the caret package as much as we do, be sure to go through this page in detail.
Max Kuhn is also the co-author of a fantastic book on machine learning called Applied Predictive Modeling: http://appliedpredictivemodeling.com/.