Overview

By the end of this practical you will know how to:

  1. Fit regression, decision trees and random forests to training data.
  2. Evaluate model fitting performance on training data and prediction performance on a test set.
  3. Compare the fitting and prediction performance of multiple models.
  4. Explore the effects of features on model predictions.

Tasks

A - Setup

  1. Open your RBootcamp_AMLD2020 R project. It should already have the folders 1_Data and 2_Code. Make sure that the data file(s) listed in the Datasets section are in your 1_Data folder.

  2. Open a new R script and save it as a new file called Prediction_College_practical.R in the 2_Code folder.

  3. Using library(), load the set of packages for this practical listed in the Packages section below.

# Load packages necessary for this script
library(tidyverse)
library(caret)
library(party)
library(partykit)

  4. Run the code below to load each of the datasets listed in the Datasets section as new objects.
# College data
college_train <- read_csv(file = "1_Data/college_train.csv")
college_test <- read_csv(file = "1_Data/college_test.csv")

  5. Take a look at the first few rows of each dataframe by printing them to the console.
# Print dataframes to the console
college_train
college_test

  6. Open each dataset in a new window using View(). Do they look ok?
# Open each dataset in a window.
View(XXX)
View(XXX)

  7. Again, we need to do a little bit of data cleaning. Specifically, we need to convert all character columns to factors. Do this by running the following code:
# Convert all character columns to factor
college_train <- college_train %>%
          mutate_if(is.character, factor)

college_test <- college_test %>%
          mutate_if(is.character, factor)
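
If you want to double-check that the conversion worked, one option is to inspect the column classes. A minimal, optional sketch using only the objects created above:

# Optional: check the class of every column after conversion
sapply(college_train, class)
sapply(college_test, class)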

B - Fitting

Your goal in this set of tasks is again to fit models predicting Grad.Rate, the percentage of attendees who graduate from each college.

  1. Using trainControl(), set your training control method to "none". Save your object as ctrl_none.
# Set training method to "none" for simple fitting
#  Note: This is for demonstration purposes, you would almost
#   never do this for a 'real' prediction task!
ctrl_none <- trainControl(method = "XXX")

Regression

  2. Using train(), fit a regression model called grad_glm predicting Grad.Rate as a function of all features. Specifically,…
  • for the form argument, use Grad.Rate ~ . (the dot means "all other features").
  • for the data argument, use college_train.
  • for the method argument, use method = "glm" for regression.
  • for the trControl argument, use your ctrl_none object you created before.
grad_glm <- train(form = XX ~ .,
                  data = XX,
                  method = "XXX",
                  trControl = ctrl_none)

  3. Explore your grad_glm object by looking at grad_glm$finalModel and using summary(). What do you find?
grad_glm$XXX
summary(XXX)

  4. Using predict(), save the fitted values of the grad_glm object as glm_fitted.
# Save fitted values of regression model
glm_fitted <- predict(XXX)

  5. Print your glm_fitted object, look at summary statistics with summary(glm_fitted), and create a histogram with hist(). Do they make sense?
# Explore regression model fits
XXX
summary(XXX)
hist(XXX)

Decision Trees

  6. Using train(), fit a decision tree model called grad_rpart. Specifically,…
  • for the form argument, use Grad.Rate ~ ..
  • for the data argument, use college_train.
  • for the method argument, use method = "rpart" to create decision trees.
  • for the trControl argument, use your ctrl_none object you created before.
  • for the tuneGrid argument, use cp = 0.01 to specify the value of the complexity parameter. This is a relatively low value, which means your trees will be relatively complex, i.e., deep.
grad_rpart <- train(form = XX ~ .,
                  data = XXX,
                  method = "XX",
                  trControl = XX,
                  tuneGrid = expand.grid(cp = XX))   # Set complexity parameter

  7. Explore your grad_rpart object by looking at grad_rpart$finalModel and plotting it with plot(as.party(grad_rpart$finalModel)). What do you find?

  8. Using predict(), save the fitted values of the grad_rpart object as rpart_fitted.

# Save fitted values of decision tree model
rpart_fitted <- predict(XXX)

  9. Print your rpart_fitted object, look at summary statistics with summary(rpart_fitted), and create a histogram with hist(). Do they make sense?
# Explore decision tree fits
XXX
summary(XXX)
hist(XXX)

Random Forests

  10. Using train(), fit a random forest model called grad_rf. Specifically,…
  • for the form argument, use Grad.Rate ~ ..
  • for the data argument, use college_train.
  • for the method argument, use method = "rf" to fit random forests.
  • for the trControl argument, use your ctrl_none object you created before.
  • for the mtry parameter, use mtry = 2. This is a relatively low value, so the forest will be very diverse.
grad_rf <- train(form = XX ~ .,   # Predict grad
                 data = XX,
                 method = "XX",
                 trControl = XX,
                 tuneGrid = expand.grid(mtry = XX))  # Set number of features randomly selected

  11. Using predict(), save the fitted values of the grad_rf object as rf_fitted.
# Save fitted values of random forest model
rf_fitted <- predict(XXX)

  12. Print your rf_fitted object, look at summary statistics with summary(rf_fitted), and create a histogram with hist(). Do they make sense?
# Explore random forest fits
XXX
summary(XXX)
hist(XXX)

Assess fitting accuracy

  13. Save the true training criterion values (college_train$Grad.Rate) as a vector called criterion_train.
# Save training criterion values
criterion_train <- XXX$XXX

  14. Using postResample(), determine the fitting performance of each of your models separately. Make sure to set your criterion_train values to the obs argument, and your model fits (XX_fitted) to the pred argument.
# Calculate fitting accuracies of each model
# pred = XX_fitted
# obs = criterion_train

# Regression
postResample(pred = XXX, obs = XXX)

# Decision Trees
postResample(pred = XXX, obs = XXX)

# Random Forests
postResample(pred = XXX, obs = XXX)

  15. Which one had the best fit? What was the fitting MAE of each model? Optional: If you’d like to, try visualizing the fitting results using the plotting code shown in the Examples section below.
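
Hint: postResample() returns a named vector containing RMSE, Rsquared, and MAE, so you can pull out just the MAE by indexing it by name. A minimal sketch, assuming you have already created glm_fitted and criterion_train:

# Extract only the MAE from the fitting accuracy output
postResample(pred = glm_fitted, obs = criterion_train)["MAE"]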

C - Prediction

  1. Save the criterion values from the test data set college_test$Grad.Rate as a new vector called criterion_test.
# Save criterion values
criterion_test <- XXX$XXX

  2. Using predict(), save the predicted values of each model for the test data college_test as glm_pred, rpart_pred and rf_pred.
# Save model predictions for test data
# newdata = college_test

# Regression
glm_pred <- predict(XXX, newdata = XXX)

# Decision Trees
rpart_pred <- predict(XXX, newdata = XXX)

# Random Forests
rf_pred <- predict(XXX, newdata = XXX)

  3. Using postResample(), determine the prediction performance of each of your models against the test criterion criterion_test.
# Calculate prediction accuracies of each model
# obs = criterion_test
# pred = XX_pred

# Regression
postResample(pred = XXX, obs = XXX)

# Decision Trees
postResample(pred = XXX, obs = XXX)

# Random Forests
postResample(pred = XXX, obs = XXX)

  4. How does each model’s prediction or test performance (on the XXX_test data) compare to its fitting or training performance (on the XXX_train data)? Is it worse? Better? The same? What does the change tell you about the models? (One way to put these numbers side by side is sketched after these questions.)

  5. Which of the three models has the best prediction performance?

  6. If you had to use one of these three models in the real world, which one would it be? Why?

  7. If someone came to you and asked “If I use your model in the future to predict the graduation rate of a new college, how accurate do you think it would be?”, what would you say?
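
One possible way to answer the comparison questions is to tabulate the training (fitting) and test (prediction) MAE of each model. This is only a sketch; it assumes the *_fitted and *_pred objects, as well as criterion_train and criterion_test, from the tasks above are already in your workspace:

# Compare fitting (training) and prediction (test) MAE for each model
fit_vs_pred <- tibble(
  model     = c("Regression", "Decision Tree", "Random Forest"),
  mae_train = c(postResample(pred = glm_fitted,   obs = criterion_train)["MAE"],
                postResample(pred = rpart_fitted, obs = criterion_train)["MAE"],
                postResample(pred = rf_fitted,    obs = criterion_train)["MAE"]),
  mae_test  = c(postResample(pred = glm_pred,   obs = criterion_test)["MAE"],
                postResample(pred = rpart_pred, obs = criterion_test)["MAE"],
                postResample(pred = rf_pred,    obs = criterion_test)["MAE"]))

fit_vs_pred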

D - A different dataset: House prices in King County, Washington

In this section, we will work with a different dataset. Specifically, we will predict the prices of houses in King County, Washington (home of Seattle, which you can thank for this dataset), using the house_train and house_test datasets.

Setup

  1. Run the code below to load each of the datasets listed in the Datasets section as new objects.
# house data
house_train <- read_csv(file = "1_Data/house_train.csv")
house_test <- read_csv(file = "1_Data/house_test.csv")

  2. Take a look at the first few rows of each dataframe by printing them to the console.

  3. Open each dataset in a new window using View(). Do they look ok?

# Open each dataset in a window.
View(XXX)
View(XXX)

  4. Again, we need to do a little bit of data cleaning. Convert all character columns to factor.
# Convert all character columns to factor
house_train <- house_train %>%
          mutate_if(is.character, factor)

house_test <- house_test %>%
          mutate_if(is.character, factor)

Fitting

Your goal in the following models is to predict price, the selling price of homes in King County WA.

  5. Using trainControl(), set your training control method to "none". Save your object as ctrl_none.
# Set training method to "none" for simple fitting
#  Note: This is for demonstration purposes, you would almost
#   never do this for a 'real' prediction task!
ctrl_none <- trainControl(method = "XXX")

  6. Using train(), fit a regression model called price_glm predicting price using all features in house_train. Specifically,…
  • for the form argument, use price ~ . (again, the dot means "all other features").
  • for the data argument, use house_train.
  • for the method argument, use method = "glm" for regression.
  • for the trControl argument, use your ctrl_none object you created before.
price_glm <- train(form = XX ~ .,
                  data = XX,
                  method = "XXX",
                  trControl = ctrl_none)

  7. Using predict(), save the fitted values of the price_glm object as glm_fitted.
# Save fitted values of regression model
glm_fitted <- predict(XXX)

  8. Using train(), fit a decision tree model called price_rpart predicting price using all features in house_train. Specifically,…
  • for the form argument, use price ~ ..
  • for the data argument, use house_train.
  • for the method argument, use method = "rpart" to create decision trees.
  • for the trControl argument, use your ctrl_none object you created before.
  • for the tuneGrid argument, use cp = 0.01 to specify the value of the complexity parameter. This is a pretty low value, which means your trees will be relatively complex.
price_rpart <- train(form = XX ~ .,
                  data = XXX,
                  method = "XX",
                  trControl = XX,
                  tuneGrid = expand.grid(cp = XX))   # Set complexity parameter

  9. Using predict(), save the fitted values of the price_rpart object as rpart_fitted.
# Save fitted values of decision tree model
rpart_fitted <- predict(XXX)

  10. Using train(), fit a random forest model called price_rf predicting price using all features in house_train. Specifically,…
  • for the form argument, use price ~ ..
  • for the data argument, use house_train.
  • for the method argument, use method = "rf" to fit random forests.
  • for the trControl argument, use your ctrl_none object you created before.
  • for the mtry parameter, use mtry = 2. This is a relatively low value, so the forest will be very diverse.
price_rf <- train(form = XX ~ .,
                 data = XX,
                 method = "XX",
                 trControl = XX,
                 tuneGrid = expand.grid(mtry = XX))  # Set number of features randomly selected

  11. Using predict(), save the fitted values of the price_rf object as rf_fitted.
# Save fitted values of random forest model
rf_fitted <- predict(XXX)

Assess fitting accuracy

  12. Save the true training criterion values (house_train$price) as a vector called criterion_train.
# Save training criterion values
criterion_train <- XXX$XXX

  13. Using postResample(), determine the fitting performance of each of your models separately. Make sure to set your criterion_train values to the obs argument, and your model fits (XX_fitted) to the pred argument.
# Calculate fitting accuracies of each model
# pred = XX_fitted
# obs = criterion_train

# Regression
postResample(pred = XXX, obs = XXX)

# Decision Trees
postResample(pred = XXX, obs = XXX)

# Random Forests
postResample(pred = XXX, obs = XXX)

  14. Which one had the best fit? What was the fitting MAE of each model? Optional: If you’d like to, try visualizing the fitting results using the plotting code shown in the Examples section below. Ask for help if you need it!

Assess prediction accuracy

  15. Save the criterion values from the test data set house_test$price as a new vector called criterion_test.
# Save criterion values
criterion_test <- XXX$XXX

  16. Using predict(), save the predicted values of each model for the test data house_test as glm_pred, rpart_pred and rf_pred.
# Save model predictions for test data
# object: price_XXX
# newdata: house_test

# Regression
glm_pred <- predict(XXX, newdata = XXX)

# Decision Trees
rpart_pred <- predict(XXX, newdata = XXX)

# Random Forests
rf_pred <- predict(XXX, newdata = XXX)

  17. Using postResample(), determine the prediction performance of each of your models against the test criterion criterion_test.
# Calculate prediction accuracies of each model
# obs = criterion_test
# pred = XX_pred

# Regression
postResample(pred = XXX, obs = XXX)

# Decision Trees
postResample(pred = XXX, obs = XXX)

# Random Forests
postResample(pred = XXX, obs = XXX)

  18. How does each model’s prediction or test performance (on the XXX_test data) compare to its fitting or training performance (on the XXX_train data)? Is it worse? Better? The same? What does the change tell you about the models?

  19. Which of the three models has the best prediction performance?

  20. If you had to use one of these three models in the real world, which one would it be? Why?

  21. If someone came to you and asked “If I use your model in the future to predict the price of a new house, how accurate do you think it would be?”, what would you say?

E - Exploring model tuning parameters

  1. In all of your decision tree models so far, you have been setting the complexity parameter to 0.01. Try setting it to a larger value of 0.2 and see how your decision trees change (by plotting them using plot(as.party(XXX_rpart$finalModel))). Do they get more or less complicated? How does increasing this value affect fitting and prediction performance? If you are interested in learning more about this parameter, look at the help menu with ?rpart.control. A starter sketch for this task and the next one is shown after the code example below.

  2. In each of your random forest models, you have been setting the mtry argument to 2. Try setting it to a larger value such as 5 and re-run your models. How does increasing this value affect fitting and prediction performance? If you are interested in learning more about this parameter, look at the help menu with ?randomForest.

  3. By default, the train() function uses 500 trees for method = "rf". How does the number of trees affect performance? To answer this, try setting the number of trees to 1,000 (see example below) and re-evaluating your model’s training and test performance. What do you find? What if you set the number of trees to just 10?

# Create random forest model with 1000 trees
mod <- train(form = price ~ .,
             data = house_train,
             method = "rf",
             trControl = ctrl_none,
             ntree = 1000,   # use 1000 trees! (Instead of the default value of 500)
             tuneGrid = expand.grid(mtry = 2))
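
For tasks 1 and 2 above, a starting point might look like the following. This is only a sketch for the college data (assuming ctrl_none and college_train from sections A and B are still in your workspace); the object names grad_rpart_02 and grad_rf_5 are just illustrative, and the same idea applies to the house data.

# Task 1: refit the decision tree with a larger complexity parameter
grad_rpart_02 <- train(form = Grad.Rate ~ .,
                       data = college_train,
                       method = "rpart",
                       trControl = ctrl_none,
                       tuneGrid = expand.grid(cp = 0.2))    # Larger cp = simpler tree

plot(as.party(grad_rpart_02$finalModel))   # Compare with your cp = 0.01 tree

# Task 2: refit the random forest with a larger mtry
grad_rf_5 <- train(form = Grad.Rate ~ .,
                   data = college_train,
                   method = "rf",
                   trControl = ctrl_none,
                   tuneGrid = expand.grid(mtry = 5))        # More features considered per split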

Examples

# Fitting and evaluating regression, decision trees, and random forests

# Step 0: Load packages-----------
library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(caret)        # For ML mastery 
library(partykit)     # For decision trees
library(party)        # For decision trees

# Step 1: Load, clean, and explore training data ----------------------

# training data
data_train <- read_csv("1_Data/mpg_train.csv")

# test data
data_test <- read_csv("1_Data/mpg_test.csv")

# Convert all characters to factor
#  Some ML models require factors
data_train <- data_train %>%
  mutate_if(is.character, factor)

data_test <- data_test %>%
  mutate_if(is.character, factor)

# Explore training data
data_train        # Print the dataset
View(data_train)  # Open in a new spreadsheet-like window 
dim(data_train)   # Print dimensions
names(data_train) # Print the names

# Define criterion_train
#   We'll use this later to evaluate model accuracy
criterion_train <- data_train$hwy

# Step 2: Define training control parameters -------------

# In this case, I will set method = "none" to fit to 
#  the entire dataset without any fancy methods
ctrl_none <- trainControl(method = "none") 

# Step 3: Train model: -----------------------------
#   Criterion: hwy
#   Features: year, cyl, displ

# Regression --------------------------
hwy_glm <- train(form = hwy ~ year + cyl + displ,
                 data = data_train,
                 method = "glm",
                 trControl = ctrl_none)

# Look at summary information
hwy_glm$finalModel
summary(hwy_glm)

# Save fitted values
glm_fitted <- predict(hwy_glm)

#  Calculate fitting accuracies
postResample(pred = glm_fitted, 
             obs = criterion_train)

# Decision Trees ----------------
hwy_rpart <- train(form = hwy ~ year + cyl + displ,
                data = data_train,
                method = "rpart",
                trControl = ctrl_none,
                tuneGrid = expand.grid(cp = .01))   # Set complexity parameter

# Look at summary information
hwy_rpart$finalModel
plot(as.party(hwy_rpart$finalModel))   # Visualise your trees

# Save fitted values
rpart_predfit <- predict(hwy_rpart)

# Calculate fitting accuracies
postResample(pred = rpart_predfit, obs = criterion_train)

# Random Forests -------------------------
hwy_rf <- train(form = hwy ~ year + cyl + displ,
                data = data_train,
                method = "rf",
                trControl = ctrl_none,
                tuneGrid = expand.grid(mtry = 2))   # Set number of features randomly selected

# Look at summary information
hwy_rf$finalModel

# Save fitted values
rf_fitted <- predict(hwy_rf)

# Calculate fitting accuracies
postResample(pred = rf_fitted, obs = criterion_train)

# Visualise Accuracy -------------------------

# Tidy competition results
accuracy <- tibble(criterion_train = criterion_train,
                   Regression = glm_fitted,
                   DecisionTrees = rpart_predfit,
                   RandomForest = rf_fitted) %>%
               gather(model, prediction, -criterion_train) %>%
               # Add error measures
               mutate(se = prediction - criterion_train,
                      ae = abs(prediction - criterion_train))

# Calculate summaries
accuracy_agg <- accuracy %>%
                  group_by(model) %>%
                  summarise(mae = mean(ae))   # Calculate MAE (mean absolute error)

# Plot A) Scatterplot of truth versus predictions
ggplot(data = accuracy,
       aes(x = criterion_train, y = prediction, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting mpg$hwy",
       subtitle = "Black line indicates perfect performance")

# Plot B) Violin plot of absolute errors
ggplot(data = accuracy, 
       aes(x = model, y = ae, fill = model)) + 
  geom_violin() + 
  geom_jitter(width = .05, alpha = .2) +
  labs(title = "Fitting Absolute Errors",
       subtitle = "Numbers indicate means",
       x = "Model",
       y = "Absolute Error") +
  guides(fill = FALSE) +
  annotate(geom = "label", 
           x = accuracy_agg$model, 
           y = accuracy_agg$mae, 
           label = round(accuracy_agg$mae, 2))

# Step 5: Assess prediction ------------------------------

# Define criterion_test
criterion_test <- data_test$hwy

# Save predicted values
glm_pred <- predict(hwy_glm, newdata = data_test)
rpart_pred <- predict(hwy_rpart, newdata = data_test)
rf_pred <- predict(hwy_rf, newdata = data_test)

#  Calculate prediction accuracies
postResample(pred = glm_pred, obs = criterion_test)
postResample(pred = rpart_pred, obs = criterion_test)
postResample(pred = rf_pred, obs = criterion_test)

# Visualise Accuracy -------------------------

# Tidy competition results
accuracy <- tibble(criterion_test = criterion_test,
                   Regression = glm_pred,
                   DecisionTrees = rpart_pred,
                   RandomForest = rf_pred) %>%
               gather(model, prediction, -criterion_test) %>%
               # Add error measures
               mutate(se = prediction - criterion_test,
                      ae = abs(prediction - criterion_test))

# Calculate summaries
accuracy_agg <- accuracy %>%
                  group_by(model) %>%
                  summarise(mae = mean(ae))   # Calculate MAE (mean absolute error)

# Plot A) Scatterplot of truth versus predictions
ggplot(data = accuracy,
       aes(x = criterion_test, y = prediction, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting mpg$hwy",
       subtitle = "Black line indicates perfect performance")

# Plot B) Violin plot of absolute errors
ggplot(data = accuracy, 
       aes(x = model, y = ae, fill = model)) + 
  geom_violin() + 
  geom_jitter(width = .05, alpha = .2) +
  labs(title = "Prediction Absolute Errors",
       subtitle = "Numbers indicate means",
       x = "Model",
       y = "Absolute Error") +
  guides(fill = FALSE) +
  annotate(geom = "label", 
           x = accuracy_agg$model, 
           y = accuracy_agg$mae, 
           label = round(accuracy_agg$mae, 2))

Datasets

File                 Rows   Columns
college_train.csv     500        18
college_test.csv      277        18
house_train.csv      5000        21
house_test.csv       1000        21
  • The college_train and college_test data are taken from the College dataset in the ISLR package. They contain statistics for a large number of US Colleges from the 1995 issue of US News and World Report.

  • The house_train and house_test data come from https://www.kaggle.com/harlfoxem/housesalesprediction

Variable description of college_train and college_test

Name          Description
Private       A factor with levels No and Yes indicating private or public university.
Apps          Number of applications received.
Accept        Number of applications accepted.
Enroll        Number of new students enrolled.
Top10perc     Pct. new students from top 10% of H.S. class.
Top25perc     Pct. new students from top 25% of H.S. class.
F.Undergrad   Number of fulltime undergraduates.
P.Undergrad   Number of parttime undergraduates.
Outstate      Out-of-state tuition.
Room.Board    Room and board costs.
Books         Estimated book costs.
Personal      Estimated personal spending.
PhD           Pct. of faculty with Ph.D.’s.
Terminal      Pct. of faculty with terminal degree.
S.F.Ratio     Student/faculty ratio.
perc.alumni   Pct. alumni who donate.
Expend        Instructional expenditure per student.
Grad.Rate     Graduation rate.

Variable description of house_train and house_test

Name            Description
price           Price of the house in $.
bedrooms        Number of bedrooms.
bathrooms       Number of bathrooms.
sqft_living     Square footage of the home.
sqft_lot        Square footage of the lot.
floors          Total floors (levels) in the house.
waterfront      Whether the house has a view of a waterfront.
view            Has been viewed.
condition       How good the condition is (overall).
grade           Overall grade given to the housing unit, based on the King County grading system.
sqft_above      Square footage of the house apart from the basement.
sqft_basement   Square footage of the basement.
yr_built        Year the house was built.
yr_renovated    Year when the house was renovated.
zipcode         Zip code.
lat             Latitude coordinate.
long            Longitude coordinate.
sqft_living15   Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area.
sqft_lot15      Lot size area in 2015 (implies some renovations).

Functions

Packages

Package Installation
tidyverse install.packages("tidyverse")
caret install.packages("caret")
partykit install.packages("partykit")
party install.packages("party")

Functions

Function                   Package   Description
trainControl()             caret     Define modelling control parameters
train()                    caret     Train a model
predict(object, newdata)   stats     Predict the criterion values of newdata based on object
postResample()             caret     Calculate aggregate model performance in regression tasks
confusionMatrix()          caret     Calculate aggregate model performance in classification tasks

Resources

Cheatsheet

Cheatsheet image from github.com/rstudio