Applied Machine Learning with R | The R Bootcamp @ AMLD
By the end of this practical you will know how to:
- Fit regression, decision tree, and random forest models to training data
- Evaluate each model's fitting and prediction performance on new test data
Open your RBootcamp_AMLD2020 R project. It should already have the folders 1_Data and 2_Code. Make sure that the data file(s) listed in the Datasets section are in your 1_Data folder.

Open a new R script and save it as a new file called Prediction_College_practical.R in the 2_Code folder.

Using library(), load the set of packages for this practical listed in the Packages section at the end of this page.
# Load packages necessary for this script
library(tidyverse)
library(caret)
library(party)
library(partykit)
Using read_csv(), load the datasets listed in the Datasets section as new objects.

# College data
college_train <- read_csv(file = "1_Data/college_train.csv")
college_test <- read_csv(file = "1_Data/college_test.csv")

Take a look at the first few rows of each dataframe by printing them to the console.

# Print dataframes to the console
college_train
college_test
Open each dataset in a new window using View(). Do they look ok?

# Open each dataset in a window.
View(XXX)
View(XXX)
# Convert all character columns to factor
college_train <- college_train %>%
mutate_if(is.character, factor)
college_test <- college_test %>%
mutate_if(is.character, factor)
Your goal in this set of tasks is again to fit models predicting Grad.Rate
, the percentage of attendees who graduate from each college.
Using trainControl(), set your training control method to "none". Save your object as ctrl_none.

# Set training method to "none" for simple fitting
# Note: This is for demonstration purposes, you would almost
# never do this for a 'real' prediction task!
ctrl_none <- trainControl(method = "XXX")
Using train(), fit a regression model called grad_glm predicting Grad.Rate as a function of all features. Specifically:
- For the form argument, use Grad.Rate ~ .
- For the data argument, use college_train.
- For the method argument, use method = "glm" for regression.
- For the trControl argument, use the ctrl_none object you created before.

grad_glm <- train(form = XX ~ .,
                  data = XX,
                  method = "XXX",
                  trControl = ctrl_none)
Explore your grad_glm object by looking at grad_glm$finalModel and using summary(). What do you find?

grad_glm$XXX
summary(XXX)
Using predict(), save the fitted values of your grad_glm object as glm_fitted.

# Save fitted values of regression model
glm_fitted <- predict(XXX)
Print your glm_fitted object, look at summary statistics with summary(glm_fitted), and create a histogram with hist(). Do they make sense?

# Explore regression model fits
XXX
summary(XXX)
hist(XXX)
Using train(), fit a decision tree model called grad_rpart. Specifically:
- For the form argument, use Grad.Rate ~ .
- For the data argument, use college_train.
- For the method argument, use method = "rpart" to create decision trees.
- For the trControl argument, use the ctrl_none object you created before.
- For the tuneGrid argument, use cp = 0.01 to specify the value of the complexity parameter. This is a relatively low value, which means your trees will be relatively complex, i.e., deep.

grad_rpart <- train(form = XX ~ .,
                    data = XXX,
                    method = "XX",
                    trControl = XX,
                    tuneGrid = expand.grid(cp = XX)) # Set complexity parameter
Explore your grad_rpart object by looking at grad_rpart$finalModel and plotting it with plot(as.party(grad_rpart$finalModel)). What do you find?
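If you want a starting point, a minimal sketch of those two calls (mirroring the example script at the end of this page) could look like this:

# Inspect the fitted tree and visualise it with partykit
grad_rpart$finalModel
plot(as.party(grad_rpart$finalModel))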
Using predict(), save the fitted values of your grad_rpart object as rpart_fitted.
# Save fitted values of decision tree model
rpart_fitted <- predict(XXX)
Print your rpart_fitted object, look at summary statistics with summary(rpart_fitted), and create a histogram with hist(). Do they make sense?

# Explore decision tree fits
XXX
summary(XXX)
hist(XXX)
Using train(), fit a random forest model called grad_rf. Specifically:
- For the form argument, use Grad.Rate ~ .
- For the data argument, use college_train.
- For the method argument, use method = "rf" to fit random forests.
- For the trControl argument, use the ctrl_none object you created before.
- For the mtry parameter, use mtry = 2. This is a relatively low value, so the forest will be very diverse.

grad_rf <- train(form = XX ~ ., # Predict Grad.Rate
                 data = XX,
                 method = "XX",
                 trControl = XX,
                 tuneGrid = expand.grid(mtry = XX)) # Set number of features randomly selected
Using predict(), save the fitted values of your grad_rf object as rf_fitted.

# Save fitted values of random forest model
rf_fitted <- predict(XXX)
Print your rf_fitted object, look at summary statistics with summary(rf_fitted), and create a histogram with hist(). Do they make sense?

# Explore random forest fits
XXX
summary(XXX)
hist(XXX)
Save the training criterion values (college_train$Grad.Rate) as a vector called criterion_train.

# Save training criterion values
criterion_train <- XXX$XXX
Using postResample(), determine the fitting performance of each of your models separately. Make sure to set your criterion_train values to the obs argument, and your model fits (XX_fitted) to the pred argument.

# Calculate fitting accuracies of each model
# pred = XX_fitted
# obs = criterion_train
# Regression
postResample(pred = XXX, obs = XXX)
# Decision Trees
postResample(pred = XXX, obs = XXX)
# Random Forests
postResample(pred = XXX, obs = XXX)
Save the criterion values from the test data (college_test$Grad.Rate) as a new vector called criterion_test.

# Save criterion values
criterion_test <- XXX$XXX
Using predict(), save the predicted values of each model for the test data college_test as glm_pred, rpart_pred, and rf_pred.

# Save model predictions for test data
# newdata = college_test
# Regression
glm_pred <- predict(XXX, newdata = XXX)
# Decision Trees
rpart_pred <- predict(XXX, newdata = XXX)
# Random Forests
rf_pred <- predict(XXX, newdata = XXX)
Using postResample(), determine the prediction performance of each of your models against the test criterion criterion_test.

# Calculate prediction accuracies of each model
# obs = criterion_test
# pred = XX_pred
# Regression
postResample(pred = XXX, obs = XXX)
# Decision Trees
postResample(pred = XXX, obs = XXX)
# Random Forests
postResample(pred = XXX, obs = XXX)
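To make these comparisons easier, here is a minimal, optional sketch that collects the fitting and prediction RMSE values side by side. It assumes the fitted and predicted value objects created above (glm_fitted, rpart_fitted, rf_fitted, glm_pred, rpart_pred, rf_pred) and the two criterion vectors; the name fit_vs_pred is just an illustrative choice.

# Optional: gather training (fitting) and test (prediction) RMSEs in one table
fit_vs_pred <- tibble(model = c("Regression", "Decision Trees", "Random Forests"),
                      rmse_train = c(postResample(pred = glm_fitted, obs = criterion_train)["RMSE"],
                                     postResample(pred = rpart_fitted, obs = criterion_train)["RMSE"],
                                     postResample(pred = rf_fitted, obs = criterion_train)["RMSE"]),
                      rmse_test = c(postResample(pred = glm_pred, obs = criterion_test)["RMSE"],
                                    postResample(pred = rpart_pred, obs = criterion_test)["RMSE"],
                                    postResample(pred = rf_pred, obs = criterion_test)["RMSE"]))
fit_vs_pred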
How does each model’s prediction or test performance (on the XXX_test
data) compare to its fitting or training performance (on the XXX_train
data)? Is it worse? Better? The same? What does the change tell you about the models?
Which of the three models has the best prediction performance?
If you had to use one of these three models in the real-world, which one would it be? Why?
If someone came to you and asked “If I use your model in the future to predict the graduation rate of a new college, how accurate do you think it would be?”, what would you say?
In this section, we will work with a different dataset. Specifically, we will predict the prices of houses in King County, Washington (home of Seattle) using the house_train and house_test datasets.
Using read_csv(), load the datasets listed in the Datasets section as new objects.

# House data
house_train <- read_csv(file = "1_Data/house_train.csv")
house_test <- read_csv(file = "1_Data/house_test.csv")
Take a look at the first few rows of each dataframe by printing them to the console.
Open each dataset in a new window using View()
. Do they look ok?
# Open each dataset in a window.
View(XXX)
View(XXX)
# Convert all character columns to factor
house_train <- house_train %>%
mutate_if(is.character, factor)
house_test <- house_test %>%
mutate_if(is.character, factor)
Your goal in the following models is to predict price
, the selling price of homes in King County WA.
Using trainControl(), set your training control method to "none". Save your object as ctrl_none.

# Set training method to "none" for simple fitting
# Note: This is for demonstration purposes, you would almost
# never do this for a 'real' prediction task!
ctrl_none <- trainControl(method = "XXX")
Using train(), fit a regression model called price_glm predicting price using all features in house_train. Specifically:
- For the form argument, use price ~ .
- For the data argument, use house_train.
- For the method argument, use method = "glm" for regression.
- For the trControl argument, use the ctrl_none object you created before.

price_glm <- train(form = XX ~ .,
                   data = XX,
                   method = "XXX",
                   trControl = ctrl_none)
Using predict(), save the fitted values of your price_glm object as glm_fitted.

# Save fitted values of regression model
glm_fitted <- predict(XXX)
Using train(), fit a decision tree model called price_rpart predicting price using all features in house_train. Specifically:
- For the form argument, use price ~ .
- For the data argument, use house_train.
- For the method argument, use method = "rpart" to create decision trees.
- For the trControl argument, use the ctrl_none object you created before.
- For the tuneGrid argument, use cp = 0.01 to specify the value of the complexity parameter. This is a pretty low value, which means your trees will be relatively complex.

price_rpart <- train(form = XX ~ .,
                     data = XXX,
                     method = "XX",
                     trControl = XX,
                     tuneGrid = expand.grid(cp = XX)) # Set complexity parameter
Using predict(), save the fitted values of your price_rpart object as rpart_fitted.

# Save fitted values of decision tree model
rpart_fitted <- predict(XXX)
Using train(), fit a random forest model called price_rf predicting price using all features in house_train. Specifically:
- For the form argument, use price ~ .
- For the data argument, use house_train.
- For the method argument, use method = "rf" to fit random forests.
- For the trControl argument, use the ctrl_none object you created before.
- For the mtry parameter, use mtry = 2. This is a relatively low value, so the forest will be very diverse.

price_rf <- train(form = XX ~ .,
                  data = XX,
                  method = "XX",
                  trControl = XX,
                  tuneGrid = expand.grid(mtry = XX)) # Set number of features randomly selected
Using predict(), save the fitted values of your price_rf object as rf_fitted.

# Save fitted values of random forest model
rf_fitted <- predict(XXX)
Save the training criterion values (house_train$price) as a vector called criterion_train.

# Save training criterion values
criterion_train <- XXX$XXX
Using postResample(), determine the fitting performance of each of your models separately. Make sure to set your criterion_train values to the obs argument, and your model fits (XX_fitted) to the pred argument.

# Calculate fitting accuracies of each model
# pred = XX_fitted
# obs = criterion_train
# Regression
postResample(pred = XXX, obs = XXX)
# Decision Trees
postResample(pred = XXX, obs = XXX)
# Random Forests
postResample(pred = XXX, obs = XXX)
Save the criterion values from the test data (house_test$price) as a new vector called criterion_test.

# Save criterion values
criterion_test <- XXX$XXX
Using predict(), save the predicted values of each model for the test data house_test as glm_pred, rpart_pred, and rf_pred.

# Save model predictions for test data
# object: price_XXX
# newdata: house_test
# Regression
glm_pred <- predict(XXX, newdata = XXX)
# Decision Trees
rpart_pred <- predict(XXX, newdata = XXX)
# Random Forests
rf_pred <- predict(XXX, newdata = XXX)
Using postResample(), determine the prediction performance of each of your models against the test criterion criterion_test.

# Calculate prediction accuracies of each model
# obs = criterion_test
# pred = XX_pred
# Regression
postResample(pred = XXX, obs = XXX)
# Decision Trees
postResample(pred = XXX, obs = XXX)
# Random Forests
postResample(pred = XXX, obs = XXX)
How does each model’s prediction or test performance (on the XXX_test
data) compare to its fitting or training performance (on the XXX_train
data)? Is it worse? Better? The same? What does the change tell you about the models?
Which of the three models has the best prediction performance?
If you had to use one of these three models in the real-world, which one would it be? Why?
If someone came to you and asked “If I use your model in the future to predict the price of a new house, how accurate do you think it would be?”, what would you say?
In all of your decision tree models so far, you have been setting the complexity parameter to 0.01. Try setting it to a larger value of 0.2 and see how your decision trees change (by plotting them using plot(as.party(XXX_rpart$finalModel))
). Do they get more or less complicated? How does increasing this value affect fitting and prediction performance? If you are interested in learning more about this parameter, look at the help menu with ?rpart.control
.
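For instance, a minimal sketch for the college decision tree (assuming college_train and ctrl_none from the tasks above are still in your workspace; the name grad_rpart_02 is just an illustrative choice):

# Refit the decision tree with cp = 0.2 instead of cp = 0.01
grad_rpart_02 <- train(form = Grad.Rate ~ .,
                       data = college_train,
                       method = "rpart",
                       trControl = ctrl_none,
                       tuneGrid = expand.grid(cp = 0.2)) # Larger complexity parameter

# Plot the new tree and compare it to your cp = 0.01 tree
plot(as.party(grad_rpart_02$finalModel))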
In each of your random forest models, you have been setting the mtry
argument to 2. Try setting it to a larger value such as 5 and re-run your models. How does increasing this value affect fitting and prediction performance? If you are interested in learning more about this parameter, look at the help menu with ?randomForest
.
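As a similar minimal sketch for the college random forest (again assuming college_train and ctrl_none exist; the name grad_rf_5 is just an illustrative choice):

# Refit the random forest with mtry = 5 instead of mtry = 2
grad_rf_5 <- train(form = Grad.Rate ~ .,
                   data = college_train,
                   method = "rf",
                   trControl = ctrl_none,
                   tuneGrid = expand.grid(mtry = 5)) # Consider 5 randomly selected features at each split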
By default, the train() function uses 500 trees for method = "rf". How does the number of trees affect performance? To answer this, try setting the number of trees to 1,000 (see example below) and re-evaluating your model's training and test performance. What do you find? What if you set the number of trees to just 10?
# Create random forest model with 1000 trees
mod <- train(form = price ~ .,
data = house_train,
method = "rf",
trControl = ctrl_none,
ntree = 1000, # use 1000 trees! (Instead of the default value of 500)
tuneGrid = expand.grid(mtry = 2))
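To re-evaluate this model, you could reuse the same pattern as in the earlier tasks (a sketch that assumes the house_train and house_test objects from above, and the mod object created in the example):

# Fitting performance of the 1000-tree forest
postResample(pred = predict(mod), obs = house_train$price)

# Prediction performance on the test data
postResample(pred = predict(mod, newdata = house_test), obs = house_test$price)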
The following example script shows the full workflow from this practical (loading data, fitting models, and evaluating fitting and prediction performance) applied to the mpg data, predicting hwy from year, cyl, and displ.

# Fitting and evaluating regression, decision trees, and random forests
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(caret) # For ML mastery
library(partykit) # For decision trees
library(party) # For decision trees
# Step 1: Load and Clean, and Explore Training data ----------------------
# training data
data_train <- read_csv("1_Data/mpg_train.csv")
# test data
data_test <- read_csv("1_Data/mpg_test.csv")
# Convert all characters to factor
# Some ML models require factors
data_train <- data_train %>%
mutate_if(is.character, factor)
data_test <- data_test %>%
mutate_if(is.character, factor)
# Explore training data
data_train # Print the dataset
View(data_train) # Open in a new spreadsheet-like window
dim(data_train) # Print dimensions
names(data_train) # Print the names
# Define criterion_train
# We'll use this later to evaluate model accuracy
criterion_train <- data_train$hwy
# Step 2: Define training control parameters -------------
# In this case, I will set method = "none" to fit to
# the entire dataset without any fancy methods
ctrl_none <- trainControl(method = "none")
# Step 3: Train model: -----------------------------
# Criterion: hwy
# Features: year, cyl, displ
# Regression --------------------------
hwy_glm <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "glm",
trControl = ctrl_none)
# Look at summary information
hwy_glm$finalModel
summary(hwy_glm)
# Save fitted values
glm_fitted <- predict(hwy_glm)
# Calculate fitting accuracies
postResample(pred = glm_fitted,
obs = criterion_train)
# Decision Trees ----------------
hwy_rpart <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "rpart",
trControl = ctrl_none,
tuneGrid = expand.grid(cp = .01)) # Set complexity parameter
# Look at summary information
hwy_rpart$finalModel
plot(as.party(hwy_rpart$finalModel)) # Visualise your trees
# Save fitted values
rpart_fitted <- predict(hwy_rpart)
# Calculate fitting accuracies
postResample(pred = rpart_fitted, obs = criterion_train)
# Random Forests -------------------------
hwy_rf <- train(form = hwy ~ year + cyl + displ,
data = data_train,
method = "rf",
trControl = ctrl_none,
tuneGrid = expand.grid(mtry = 2)) # Set number of features randomly selected
# Look at summary information
hwy_rf$finalModel
# Save fitted values
rf_fitted <- predict(hwy_rf)
# Calculate fitting accuracies
postResample(pred = rf_fitted, obs = criterion_train)
# Visualise Accuracy -------------------------
# Tidy competition results
accuracy <- tibble(criterion_train = criterion_train,
Regression = glm_fitted,
DecisionTrees = rpart_fitted,
RandomForest = rf_fitted) %>%
gather(model, prediction, -criterion_train) %>%
# Add error measures
mutate(se = prediction - criterion_train,
ae = abs(prediction - criterion_train))
# Calculate summaries
accuracy_agg <- accuracy %>%
group_by(model) %>%
summarise(mae = mean(ae)) # Calculate MAE (mean absolute error)
# Plot A) Scatterplot of truth versus predictions
ggplot(data = accuracy,
aes(x = criterion_train, y = prediction, col = model)) +
geom_point(alpha = .5) +
geom_abline(slope = 1, intercept = 0) +
labs(title = "Predicting mpg$hwy",
subtitle = "Black line indicates perfect performance")
# Plot B) Violin plot of absolute errors
ggplot(data = accuracy,
aes(x = model, y = ae, fill = model)) +
geom_violin() +
geom_jitter(width = .05, alpha = .2) +
labs(title = "Fitting Absolute Errors",
subtitle = "Numbers indicate means",
x = "Model",
y = "Absolute Error") +
guides(fill = FALSE) +
annotate(geom = "label",
x = accuracy_agg$model,
y = accuracy_agg$mae,
label = round(accuracy_agg$mae, 2))
# Step 5: Assess prediction ------------------------------
# Define criterion_test
criterion_test <- data_test$hwy
# Save predicted values
glm_pred <- predict(hwy_glm, newdata = data_test)
rpart_pred <- predict(hwy_rpart, newdata = data_test)
rf_pred <- predict(hwy_rf, newdata = data_test)
# Calculate fitting accuracies
postResample(pred = glm_pred, obs = criterion_test)
postResample(pred = rpart_pred, obs = criterion_test)
postResample(pred = rf_pred, obs = criterion_test)
# Visualise Accuracy -------------------------
# Tidy competition results
accuracy <- tibble(criterion_test = criterion_test,
Regression = glm_pred,
DecisionTrees = rpart_pred,
RandomForest = rf_pred) %>%
gather(model, prediction, -criterion_test) %>%
# Add error measures
mutate(se = prediction - criterion_test,
ae = abs(prediction - criterion_test))
# Calculate summaries
accuracy_agg <- accuracy %>%
group_by(model) %>%
summarise(mae = mean(ae)) # Calculate MAE (mean absolute error)
# Plot A) Scatterplot of truth versus predictions
ggplot(data = accuracy,
aes(x = criterion_test, y = prediction, col = model)) +
geom_point(alpha = .5) +
geom_abline(slope = 1, intercept = 0) +
labs(title = "Predicting mpg$hwy",
subtitle = "Black line indicates perfect performance")
# Plot B) Violin plot of absolute errors
ggplot(data = accuracy,
aes(x = model, y = ae, fill = model)) +
geom_violin() +
geom_jitter(width = .05, alpha = .2) +
labs(title = "Prediction Absolute Errors",
subtitle = "Numbers indicate means",
x = "Model",
y = "Absolute Error") +
guides(fill = FALSE) +
annotate(geom = "label",
x = accuracy_agg$model,
y = accuracy_agg$mae,
label = round(accuracy_agg$mae, 2))
Datasets

File | Rows | Columns
---|---|---
college_train.csv | 500 | 18
college_test.csv | 277 | 18
house_train.csv | 5000 | 21
house_test.csv | 1000 | 21
The college_train
and college_test
data are taken from the College
dataset in the ISLR
package. They contain statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
The house_train
and house_test
data come from https://www.kaggle.com/harlfoxem/housesalesprediction
college_train and college_test

Name | Description
---|---
Private | A factor with levels No and Yes indicating private or public university.
Apps | Number of applications received.
Accept | Number of applications accepted.
Enroll | Number of new students enrolled.
Top10perc | Pct. new students from top 10% of H.S. class.
Top25perc | Pct. new students from top 25% of H.S. class.
F.Undergrad | Number of fulltime undergraduates.
P.Undergrad | Number of parttime undergraduates.
Outstate | Out-of-state tuition.
Room.Board | Room and board costs.
Books | Estimated book costs.
Personal | Estimated personal spending.
PhD | Pct. of faculty with Ph.D.'s.
Terminal | Pct. of faculty with terminal degree.
S.F.Ratio | Student/faculty ratio.
perc.alumni | Pct. alumni who donate.
Expend | Instructional expenditure per student.
Grad.Rate | Graduation rate.
house_train and house_test

Name | Description
---|---
price | Price of the house in $.
bedrooms | Number of bedrooms.
bathrooms | Number of bathrooms.
sqft_living | Square footage of the home.
sqft_lot | Square footage of the lot.
floors | Total floors (levels) in the house.
waterfront | Whether the house has a view to a waterfront.
view | Has been viewed.
condition | How good the condition is (overall).
grade | Overall grade given to the housing unit, based on the King County grading system.
sqft_above | Square footage of the house apart from the basement.
sqft_basement | Square footage of the basement.
yr_built | Year built.
yr_renovated | Year when the house was renovated.
zipcode | Zip code.
lat | Latitude coordinate.
long | Longitude coordinate.
sqft_living15 | Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area.
sqft_lot15 | Lot size area in 2015 (implies some renovations).
Packages

Package | Installation
---|---
tidyverse | install.packages("tidyverse")
caret | install.packages("caret")
partykit | install.packages("partykit")
party | install.packages("party")
Functions

Function | Package | Description
---|---|---
trainControl() | caret | Define modelling control parameters
train() | caret | Train a model
predict(object, newdata) | stats | Predict the criterion values of newdata based on object
postResample() | caret | Calculate aggregate model performance in regression tasks
confusionMatrix() | caret | Calculate aggregate model performance in classification tasks