In this practical you’ll practice advanced machine learning using the caret package.
By the end of this practical you will know how to:
- Split data into training and test sets with createDataPartition()
- Define control parameters with trainControl()
- Train models with train()
- Assess variable importance with varImp()
- Predict new data with predict()
- Evaluate prediction performance with postResample()
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
caret | install.packages("caret") |
File | Rows | Columns | Criterion | Source |
---|---|---|---|---|
house_train.csv | 1000 | 21 | price | Kaggle: House Sales Prediction |
house_test.csv | 15000 | 21 | price | Kaggle: House Sales Prediction |
heartdisease_train.csv | 150 | 14 | diagnosis | UCI ML Heartdisease |
heartdisease_test.csv | 150 | 14 | diagnosis | UCI ML Heartdisease |
attrition_train.csv | 500 | 35 | Attrition | Kaggle Attrition |
attrition_test.csv | 900 | 35 | Attrition | Kaggle Attrition |
The following set of example code will take you through the basic steps of machine learning using the amazing caret package.
# Load packages
library(tidyverse)
library(caret)
# ------------------------------------
# Step 0: Create training and test data
# Only necessary if you don't already have training
# and test data
# ------------------------------------
# Split diamonds data into separate training and test datasets
# First, randomly shuffle the rows
diamonds <- diamonds %>%
  sample_frac(1)

# Select 50% of rows for the training set
train_v <- createDataPartition(y = diamonds$price,
                               times = 1,
                               p = .5)

# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(train_v$Resample1)

diamonds_test <- diamonds %>%
  slice(-train_v$Resample1)
# ---------------
# Step 1: Define control parameters
# ---------------
# Set up control values
ctr <- trainControl(method = "none")
# ---------------
# Step 2: Train model
# ---------------
# Predict price with linear regression,
# using only three numeric features (carat, depth, table)
diamonds_lm_train <- train(form = price ~ .,
                           data = diamonds_train %>%
                             select(carat, depth, table, price),
                           method = "lm",
                           trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------
class(diamonds_lm_train)    # object class
diamonds_lm_train           # default print method
names(diamonds_lm_train)    # named elements of the training object
summary(diamonds_lm_train)  # summary of the final model

# Look at variable importance with varImp
varImp(diamonds_lm_train)

# Look at final model object
diamonds_lm_train$finalModel

# Look at fitting performance!
postResample(pred = predict(diamonds_lm_train$finalModel, diamonds_train),
             obs = diamonds_train$price)
# ---------------
# Step 4: Predict
# ---------------
diamonds_lm_predictions <- predict(diamonds_lm_train,
                                   newdata = diamonds_test)
# ---------------
# Step 5: Evaluate
# ---------------
# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions,
             obs = diamonds_test$price)

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +  # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +  # Perfect-prediction line
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
Here are three models you can try for both regression and classification tasks.

Regression Tasks

For regression tasks, your criterion should be numeric.
- rpart (decision trees)
- rf (random forests)
- glmnet (regularised regression)

Classification Tasks

For classification tasks, your criterion should be a factor.
- rpart (decision trees)
- rf (random forests)
- knn (nearest neighbors)
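To switch between models, you only need to change the method argument of train(). Here is a minimal sketch (for illustration only), reusing the diamonds_train data and ctr control object from the walkthrough above:

# A minimal sketch: swap the method string to change models
# (method = "rf" requires the randomForest package to be installed)
diamonds_rf_train <- train(form = price ~ carat + depth + table,
                           data = diamonds_train,
                           method = "rf",  # try "rpart" or "glmnet" too
                           trControl = ctr)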
In this practical you will conduct machine learning analyses on several data sets. For each of the tasks, go through the following steps.
Step 0: Setup
- Load the training data XXX_train.csv as a new dataframe called XXX_train, and the test data XXX_test.csv as a new dataframe called XXX_test.
- Explore the XXX_train dataframe with a combination of names(), summary(), and other similar functions.
- If there are any predictors in the data that you don’t want in the model, consider excluding them when you train the model in Step 2 (train()).
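For example, a minimal exploration sketch (house_train is an assumed example object you would have loaded already):

# A minimal exploration sketch (assumed object name)
names(house_train)    # column names
summary(house_train)  # summary statistics for every column
glimpse(house_train)  # compact column-by-column overview (tidyverse)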
Step 1: Define control parameters
- Define your control parameters using trainControl(). To start, use method = "none".
Step 2: Train model
- Train one or more models on the training data using train(). For each model, assign the result to a new training object called XX_train (e.g., rf_train for random forests, glm_train for standard regression).
- Make sure to exclude any predictors in the data you don’t want to use!
Step 3: Explore model
- Explore your training objects by printing them, looking at (and printing) their named elements with names(), and accessing a few of the named elements with XX_train$ to see what the outputs look like.
- Look at the final model with XX_train$finalModel. Try applying generic functions like summary(), plot(), and broom::tidy() to the object. Do these help you to understand the final model?
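A minimal sketch of this step, assuming a training object called glm_train whose final model is an lm fit:

# A minimal model-exploration sketch (assumed object name)
glm_train$finalModel               # the fitted model itself
summary(glm_train$finalModel)      # coefficient table for an lm fit
plot(glm_train$finalModel)         # base-R diagnostic plots (lm only)
broom::tidy(glm_train$finalModel)  # the same coefficients as a tidy tibble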
Step 4: Predict new data
- Calculate predictions for the test data XXX_test using predict() and store the results in a vector called XXX_pred (e.g., rf_pred for predictions from random forests).
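For example, a minimal sketch (rf_train and house_test are assumed example objects):

# A minimal prediction sketch (assumed object names)
rf_pred <- predict(rf_train, newdata = house_test)
head(rf_pred)  # first few predictions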
Step 5: Evaluate performance
- Evaluate each model’s prediction performance using postResample(), and for classification tasks, use confusionMatrix() (see the sketch after the plotting template below).
- If you want to plot the results of multiple models, you can try using the following code template:
# Some fake prediction data
# include your real model prediction data here
XX_pred <- rnorm(100, mean = 100, sd = 10)
YY_pred <- rnorm(100, mean = 100, sd = 10)
ZZ_pred <- rnorm(100, mean = 100, sd = 10)

# Some fake true test values
# Get these from your XX_test objects
truth <- rnorm(100, mean = 100, sd = 10)

# Put results together in a tibble
model_results <- tibble(XX = XX_pred,
                        YY = YY_pred,
                        ZZ = ZZ_pred,
                        truth = truth) %>%
  gather(model, pred, -truth) %>%
  mutate(error = pred - truth,
         abserr = abs(error))

# Plot distribution of errors for each model
ggplot(model_results,
       aes(x = model, y = error, col = model)) +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun = mean,    # fun.y / fun.ymin / fun.ymax are deprecated
               fun.min = min,
               fun.max = max,
               colour = "black") +
  labs(title = "Model Prediction Errors",
       subtitle = "Dots represent means",
       caption = "Caret is awesome!") +
  theme(legend.position = "none")

# Plot relationship between truth and predictions
ggplot(model_results,
       aes(x = truth, y = pred, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "XX model predictions",
       subtitle = "Diagonal represents perfect performance",
       caption = "Caret is awesome!",
       x = "True Values",
       y = "Model Predictions")
Max Kuhn, the author of caret, has a fantastic overview of the package at http://topepo.github.io/caret/index.html. If you like the caret package as much as we do, be sure to go through this page in detail.

Max Kuhn is also the co-author of a fantastic book on machine learning called Applied Predictive Modeling - http://appliedpredictivemodeling.com/.
Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that all of the datasets you need for this practical are in your 1_Data folder.

Open a new R script and save it as a new file called ml2_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all of the packages you’ll need for this practical. Here’s how the top of your script should look:
## NAME
## DATE
## Machine Learning 2 Practical
library(XX)
library(XX)
library(XX)
Load the datasets listed in the Datasets section above as new objects using read_csv(). Call them XX_train and XX_test.
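For example, a minimal loading sketch for the house data (assuming the files are in your 1_Data folder):

# A minimal loading sketch, assuming the files sit in 1_Data
house_train <- read_csv(file = "1_Data/house_train.csv")
house_test <- read_csv(file = "1_Data/house_test.csv")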
house
In this task, you’ll create the best possible model to predict the prices of houses in King County, Washington. The training data is stored in house_train and the test data is stored in house_test. The variable you are trying to predict is price!
- Follow Steps 1 through 3 in the Modelling Procedure section above to create and explore your best decision tree model predicting house_train$price using method = "rpart". Call the model rpart_price.
- Do the same with random forests using method = "rf". Call the model rf_price.
- Do the same with regularised regression using method = "glmnet". Call the model glmnet_price.
Make sure you have explored each of your models!
Follow Steps 4 and 5 to see which model did the best in predicting house_test$price.

Which of the three models was the best at predicting price in the test data? Were they very different? Is one easier to interpret than the others?

Go to http://topepo.github.io/caret/available-models.html and find the craziest-looking model you can that works for regression tasks. Then, try it on the house data and see how well it works! Does it do much better than the models you already tried?
Repeat your analyses, but now train your models using 10-fold cross validation by specifying trainControl(method = "cv", number = 10). (The repeats argument only applies to method = "repeatedcv", so it is not needed here.) After you do so, be sure to complete the remaining steps to fit the model using train() and evaluate its prediction performance.

How does the accuracy of the models change when you use 10-fold cross validation? Do the models become more accurate with cross validation than without it?
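A minimal cross-validation sketch for this task (object names follow the conventions above):

# A minimal 10-fold cross-validation sketch
ctr_cv <- trainControl(method = "cv", number = 10)

rf_price <- train(form = price ~ .,
                  data = house_train,
                  method = "rf",
                  trControl = ctr_cv)

rf_price$resample  # performance on each of the 10 folds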
heartdisease
Make sure your criterion values are factors when fitting your model(s)!!
When you are conducting a classification analysis, you need to make sure the criterion is stored as a factor. This tells the function you are doing classification instead of regression. So before doing anything, do this for both the training and test datasets. Here’s how to do it for the heartdisease datasets:
# Convert the diagnosis column to a factor
heartdisease_train <- heartdisease_train %>%
  mutate(diagnosis = factor(diagnosis))

heartdisease_test <- heartdisease_test %>%
  mutate(diagnosis = factor(diagnosis))
In this task, you’ll create the best possible model to predict whether or not a person is having a heart attack. The training data is stored in heartdisease_train and the test data is stored in heartdisease_test. The variable you are trying to predict is diagnosis!
- Follow Steps 1 through 3 in the Modelling Procedure section above to create your best decision tree model predicting diagnosis. Call the model rpart_diagnosis.
- Do the same with random forests using method = "rf". Call the model rf_diagnosis.
- Do the same with nearest neighbors using method = "knn". Call the model knn_diagnosis.
Make sure you have explored each of your models!
Follow Steps 4 and 5 to see which model did the best in predicting heartdisease_test$diagnosis.

Which of the three models was the best at predicting diagnosis? Were they very different? Is one easier to interpret than the others?

Go to http://topepo.github.io/caret/available-models.html and find the craziest-looking model you can that works for classification tasks. Then, try it on the heartdisease dataset and see how well it works! Does it do much better than the models you already tried?
Repeat your analyses, but now train your models using 10-fold cross validation by specifying trainControl(method = "cv", number = 10). After you do so, be sure to complete the remaining steps to fit the model using train() and evaluate its prediction performance.

How does the accuracy of the models change when you use 10-fold cross validation? Do the models become more accurate with cross validation than without it?
attrition
In this task, create the best possible model you can using the attrition_train dataset. The variable you want to predict is Attrition! Use all of the techniques you have learned so far!