Package | Installation
---|---
tidyverse | install.packages("tidyverse")
caret | install.packages("caret")
GGally | install.packages("GGally")
skimr | install.packages("skimr")
File | Rows | Columns | Criterion | Source
---|---|---|---|---
house_train.csv | 1000 | 21 | price | Kaggle: House Sales Prediction
house_test.csv | 15000 | 21 | price | Kaggle: House Sales Prediction
heartdisease_train.csv | 150 | 14 | diagnosis | UCI ML Heartdisease
heartdisease_test.csv | 150 | 14 | diagnosis | UCI ML Heartdisease
attrition_train.csv | 500 | 35 | Attrition | Kaggle Attrition
attrition_test.csv | 900 | 35 | Attrition | Kaggle Attrition
The following set of example code will take you through the basic steps of machine learning using the amazing caret
package.
# Load packages
library(tidyverse)
library(GGally)
library(skimr)
library(caret)
# ------------------------------------
# Step 0: Create training and test data
# Only necessary if you don't already have training
# and test data
# ------------------------------------
# Split diamonds data into separate training and test datasets

# Randomly shuffle the rows
diamonds <- diamonds %>%
  sample_frac(1)

# Select 10% of rows for the training set with createDataPartition
train_v <- createDataPartition(y = diamonds$price,
                               times = 1,
                               p = .1)

# Create separate training and test data
diamonds_train <- diamonds %>%
  slice(train_v$Resample1)

diamonds_test <- diamonds %>%
  slice(-train_v$Resample1)
# ---------------
# Explore
# ---------------
# Explore columns with skim
skim(diamonds_train)
# Visualise relationships with ggpairs
ggpairs(diamonds_train)
# ---------------
# Step 1: Define control parameters
# ---------------
# Set up control values
ctr <- trainControl(method = "repeatedcv",
                    number = 10,
                    repeats = 2)
# ---------------
# Step 2: Train model
# ---------------
# Predict price with linear regression
diamonds_lm_train <- train(form = price ~ .,
                           data = diamonds_train,
                           method = "lm",
                           trControl = ctr)
# ---------------
# Step 3: Explore
# ---------------
# Print the training object and inspect its elements
class(diamonds_lm_train)
diamonds_lm_train
names(diamonds_lm_train)
summary(diamonds_lm_train)
# Look at variable importance with varImp
varImp(diamonds_lm_train)
# Look at final model object
diamonds_lm_train$finalModel
# ---------------
# Step 4: Predict
# ---------------
diamonds_lm_predictions <- predict(diamonds_lm_train,
                                   newdata = diamonds_test)
# ---------------
# Step 5: Evaluate
# ---------------
# Look at final prediction performance!
postResample(pred = diamonds_lm_predictions,
             obs = diamonds_test$price)

# Plot relationship between predictions and truth
performance_data <- tibble(predictions = diamonds_lm_predictions,
                           criterion = diamonds_test$price)

ggplot(data = performance_data,
       aes(x = predictions, y = criterion)) +
  geom_point(alpha = .1) +  # Add points
  geom_abline(slope = 1, intercept = 0, col = "blue", size = 2) +
  labs(title = "Regression prediction accuracy",
       subtitle = "Blue line is perfect prediction!")
A. Open your R project. It should already have the folders 1_Data and 2_Code. Make sure that all of the datasets you need for this practical are in your 1_Data folder.
B. Open a new R script and save it as a new file called ml_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then load all of the packages you'll need for this practical. Here's how the top of your script should look:
## NAME
## DATE
## Machine Learning Practical
library(XX)
library(XX)
library(XX)
In this practical you will conduct machine learning analyses on several data sets. For each of the tasks, go through the following steps.
A. Load the training data XXX_train.csv as a new dataframe called XXX_train, and the test data XXX_test.csv as a new dataframe called XXX_test.
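For example, here is how you might load the house data with read_csv() from the tidyverse, assuming the files are in your 1_Data folder:
# Load the training and test data from the 1_Data folder
house_train <- read_csv("1_Data/house_train.csv")
house_test <- read_csv("1_Data/house_test.csv")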
B. Explore the XXX_train dataframe with a combination of skim(), names(), summary(), and other similar functions.
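For example, assuming you have loaded house_train:
# Explore the training data
skim(house_train)    # Summary statistics for every column
names(house_train)   # Column names
summary(house_train) # Basic numeric summaries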
C. Define control parameters with trainControl(). Use 10-fold cross validation with 2 repetitions.
D. If you are conducting a classification analysis, be sure to convert the criterion to a factor to tell the function you are doing classification instead of regression. Do this for both the training and test datasets. Here's how to do it for a dataframe called df:
# Convert a column called criterion to a factor
df <- df %>%
  mutate(criterion = factor(criterion))
E. Train one or more models on the training data. Start with one model, then gradually try more. For each model, assign the result to a new training object called XX_train (e.g., rf_train for random forests, glm_train for standard regression). You can choose from the methods listed below; a worked sketch follows the lists.
Regression Tasks
For regression tasks, your criterion should be numeric. Suitable methods include:
- glm
- rpart
- glmnet
- rf
Classification Tasks
For classification tasks, your criterion should be a factor. Suitable methods include:
- rpart
- rf
- knn
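For example, here is a minimal sketch of training a random forest on the heartdisease data, assuming you have already defined ctr with trainControl() and converted diagnosis to a factor:
# Train a random forest predicting diagnosis from all other columns
rf_train <- train(form = diagnosis ~ .,
                  data = heartdisease_train,
                  method = "rf",
                  trControl = ctr)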
F. Explore your training objects by printing them, looking at (and printing) their named elements with names(), and accessing a few of the named elements with XX_train$ to see what the outputs look like.
G. Look at the final model with XX_train$finalModel. Try applying generic functions like summary(), plot(), and broom::tidy() to the object. Do these help you to understand the final model?
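For example, assuming a regression training object called glm_train:
# Look at the final model directly
glm_train$finalModel

# Apply generic functions to the final model
summary(glm_train$finalModel)
broom::tidy(glm_train$finalModel)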
H. Predict the criterion values of the test data XXX_test using predict() and store the results in a vector called XXX_pred (e.g., rf_pred for predictions from random forests).
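For example, assuming the rf_train object sketched in step E and the heartdisease_test data:
# Predict the criterion values of the test data
rf_pred <- predict(rf_train,
                   newdata = heartdisease_test)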
I. Evaluate the model's prediction accuracy. For regression tasks, use postResample(), and for classification tasks, use confusionMatrix(). You can also evaluate performance by creating an appropriate plot.
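For example, here is a sketch of evaluating the hypothetical rf_pred predictions from step H against the true test values:
# Evaluate classification accuracy with a confusion matrix
confusionMatrix(data = rf_pred,
                reference = heartdisease_test$diagnosis)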
Create your best possible model for the house_train dataset predicting housing price. Which model does the best in predicting house_test$price?
Create your best possible model for the heartdisease_train dataset predicting diagnosis. Which model does the best in predicting heartdisease_test$diagnosis? Make sure your criterion values are factors!
Create your best possible model for the attrition_train dataset predicting Attrition. Which model does the best in predicting attrition_test$Attrition? Again, make sure your criterion values are factors!
Repeat one of your original analyses, but this time turn off cross validation by setting trainControl(method = "none"). Then predict the testing data again. How does the accuracy of your models compare to your original analysis now that you have not used any cross validation?
Repeat one of your original analyses, but instead of allowing the models to use all of the predictors in the training data, force them to use only three predictors. For example, in the heartdisease_train data, you could have the model(s) only use age, sex and cp as predictors by using the formula form = diagnosis ~ age + sex + cp. How do the models compare to each other when they each only get access to a few predictors? Are they all the same, or is one much better than the others?
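Here is a minimal sketch of the no-cross-validation version, assuming the house data and a glm model (with method = "none", train() simply fits one model to the training data without any resampling):
# Define control parameters with no resampling
ctr_none <- trainControl(method = "none")

# Train a single regression model without cross validation
glm_none <- train(form = price ~ .,
                  data = house_train,
                  method = "glm",
                  trControl = ctr_none)

# Predict the test data as before
glm_none_pred <- predict(glm_none, newdata = house_test)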
Select one dataset and, using createDataPartition(), create your own new training and test datasets based on the XX_train datasets (that is, pretend the XX_train datasets are all possible data available, and split them into new XX_train2 and XX_test2 datasets). Now you have a new world of training and test data! Repeat your analyses and see if you get similar models and prediction performance as before.
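For example, here is a sketch for the heartdisease data; the 50% split (p = .5) is just an assumption you can change:
# Split heartdisease_train into new training and test datasets
train_v2 <- createDataPartition(y = heartdisease_train$diagnosis,
                                times = 1,
                                p = .5)

heartdisease_train2 <- heartdisease_train %>%
  slice(train_v2$Resample1)

heartdisease_test2 <- heartdisease_train %>%
  slice(-train_v2$Resample1)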
If you want to plot the results of multiple models, you can try using the following code template:
# Some fake prediction data
# include your real model prediction data here
XX_pred <- rnorm(100, mean = 100, sd = 10)
YY_pred <- rnorm(100, mean = 100, sd = 10)
ZZ_pred <- rnorm(100, mean = 100, sd = 10)
# Some fake true test values
# Get these from your XX_test objects
truth <- rnorm(100, mean = 100, sd = 10)
# Put results together in a tibble
N <- length(truth)

model_results <- tibble(model = rep(c("XX", "YY", "ZZ"), each = N),
                        pred = c(XX_pred, YY_pred, ZZ_pred),
                        truth = rep(truth, times = 3))

# Add error and absolute error
model_results <- model_results %>%
  mutate(error = pred - truth,
         abserr = abs(error))
# Plot Distribution of errors for each model
ggplot(model_results,
       aes(x = model, y = error, col = model)) +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun = mean,
               fun.min = min,
               fun.max = max,
               colour = "black") +
  labs(title = "Model Prediction Errors",
       subtitle = "Dots represent means",
       caption = "Caret is awesome!") +
  theme(legend.position = "none")
# Plot relationship between truth and predictions
ggplot(model_results,
       aes(x = truth, y = pred, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "XX model predictions",
       subtitle = "Diagonal represents perfect performance",
       caption = "Caret is awesome!",
       x = "True Values",
       y = "Model Predictions")
Max Kuhn, the author of caret, has a fantastic overview of the package at http://topepo.github.io/caret/index.html. If you like the caret package as much as we do, be sure to go through this page in detail.
Max Kuhn is also the co-author of a fantastic book on machine learning called Applied Predictive Modeling: http://appliedpredictivemodeling.com/.