
## Overview

By the end of this practical you will know how to:

1. Use cross-validation to select optimal model tuning parameters for decision trees and random forests
2. Compare ‘standard’ regression with lasso and ridge penalised regression
3. Use cross-validation to estimate future test accuracy

### Baseball player salaries

In this practical, we will predict the salaries of baseball players from the hitters_train and hitters_test datasets.

Our goal will be to predict Salary.

### A - Setup

1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data file(s) listed in the Datasets section are in your 1_Data folder.

2. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called Optimization_practical.R in the 2_Code folder.
3. Using library() load the set of packages for this practical listed in the packages section above.

# Load packages necessary for this script
library(tidyverse)
library(caret)
library(party)
library(partykit)
1. Run the code below to load each of the datasets listed in the Datasets section as new objects
# hitters data
hitters_train <- read_csv(file = "1_Data/hitters_train.csv")
hitters_test <- read_csv(file = "1_Data/hitters_test.csv")
1. Take a look at the first few rows of each dataframe by printing them to the console.
# Print dataframes to the console
hitters_train
hitters_test
1. Print the numbers of rows and columns of each dataset using the dim() function
# Print numbers of rows and columns
dim(XXX)
dim(XXX)
dim(hitters_train)
[1] 50 20
dim(hitters_test)
[1] 213  20
1. Look at the names of the dataframes with the names() function.
# Print the names of each dataframe
names(XXX)
names(XXX)
names(hitters_train)
 [1] "Salary"    "AtBat"     "Hits"      "HmRun"     "Runs"
[6] "RBI"       "Walks"     "Years"     "CAtBat"    "CHits"
[11] "CHmRun"    "CRuns"     "CRBI"      "CWalks"    "League"
[16] "Division"  "PutOuts"   "Assists"   "Errors"    "NewLeague"
names(hitters_test)
 [1] "Salary"    "AtBat"     "Hits"      "HmRun"     "Runs"
[6] "RBI"       "Walks"     "Years"     "CAtBat"    "CHits"
[11] "CHmRun"    "CRuns"     "CRBI"      "CWalks"    "League"
[16] "Division"  "PutOuts"   "Assists"   "Errors"    "NewLeague"
1. Open each dataset in a new window using View(). Do they look ok?
# Open each dataset in a window.
View(XXX)
View(XXX)
1. We need to do a little bit of data cleaning before starting. Specifically, we need to convert all character columns to factors. Do this by running the following code:
# Convert all character columns to factor
hitters_train <- hitters_train %>%
  mutate_if(is.character, factor)

hitters_test <- hitters_test %>%
  mutate_if(is.character, factor)

### B - Setup trainControl

1. Set up your training by specifying ctrl_cv as 10-fold cross-validation
• Set method = “cv” to specify cross validation
• Set number = 10 to specify 10 folds
# Use 10-fold cross validation
ctrl_cv <- trainControl(method = "XX",
                        number = XX)
# Use 10-fold cross validation
ctrl_cv <- trainControl(method = "cv",
                        number = 10)

### C - Regression (standard)

1. Fit a (standard) regression model predicting Salary as a function of all features
• Set the formula to Salary ~ .
• Set the data to hitters_train
• Set the method to "glm" for regular regression
• Set the train control argument to ctrl_cv
# Normal Regression --------------------------
salary_glm <- train(form = XX ~ .,
                    data = XX,
                    method = "XX",
                    trControl = XX)
# Normal Regression --------------------------
salary_glm <- train(form = Salary ~ .,
                    data = hitters_train,
                    method = "glm",
                    trControl = ctrl_cv)
1. Print your salary_glm object. What do you see?
salary_glm
Generalized Linear Model

50 samples
19 predictors

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 45, 43, 46, 46, 44, 46, ...
Resampling results:

RMSE  Rsquared  MAE
523   0.273     441
1. Try plotting your salary_glm object. What happens? What does this error mean?
# I get the following error:

# Error in plot.train(salary_glm) :
#   There are no tuning parameters for this model.

# The problem is that method = "glm" has no tuning parameters so there is nothing to plot!
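
One optional aside (my suggestion, not part of the original practical): since there is nothing to tune for method = "glm", you can inspect the fitted regression directly instead.
# Aside (suggestion): inspect the fitted glm directly
summary(salary_glm$finalModel)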
1. Print your final model object with salary_glm$finalModel
# Print final regression model
salary_glm$finalModel

Call:  NULL

Coefficients:
(Intercept)        AtBat         Hits        HmRun         Runs
664.0169      -1.8086      12.7263      14.0206     -10.7594
RBI        Walks        Years       CAtBat        CHits
-5.1859      -8.6702     -62.3666       0.4883      -4.9553
CHmRun        CRuns         CRBI       CWalks      LeagueN
-8.3973       5.7164       4.3061       0.3754     144.0316
DivisionW      PutOuts      Assists       Errors   NewLeagueN
-75.0233      -0.0843       0.5934      -9.7698     223.6980

Degrees of Freedom: 49 Total (i.e. Null);  30 Residual
Null Deviance:      12600000
Residual Deviance: 5110000  AIC: 761
1. Print your final regression model coefficients with coef().
# Print glm coefficients
coef(salary_glm$finalModel)

(Intercept)       AtBat        Hits       HmRun        Runs         RBI
   664.0169     -1.8086     12.7263     14.0206    -10.7594     -5.1859
      Walks       Years      CAtBat       CHits      CHmRun       CRuns
    -8.6702    -62.3666      0.4883     -4.9553     -8.3973      5.7164
       CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists
     4.3061      0.3754    144.0316    -75.0233     -0.0843      0.5934
     Errors  NewLeagueN
    -9.7698    223.6980

### D - Ridge Regression

It’s time to fit an optimized regression model with a Ridge penalty!

1. Before we can fit a ridge regression model, we need to specify which values of the lambda penalty parameter we want to try. Using the code below, create a vector called lambda_vec which contains 100 values spanning a wide range, from very close to 0 to 10,000
# Vector of lambda values to try
lambda_vec <- 10 ^ (seq(-4, 4, length = 100))
1. Fit a ridge regression model predicting Salary as a function of all features
• Set the formula to Salary ~ .
• Set the data to hitters_train
• Set the method to "glmnet" for regularized regression
• Set the train control argument to ctrl_cv
• Set the preProcess argument to c("center", "scale") to make sure the variables are standardised when estimating the beta weights (this is good practice for ridge regression)
• Set the tuneGrid argument such that alpha is 0 (for ridge regression), and with all lambda values you specified in lambda_vec (we’ve done this for you below)
# Ridge Regression --------------------------
salary_ridge <- train(form = XX ~ .,
                      data = XX,
                      method = "XX",
                      trControl = XX,
                      preProcess = c("XX", "XX"),  # Standardise
                      tuneGrid = expand.grid(alpha = 0,  # Ridge penalty
                                             lambda = lambda_vec))
# Ridge Regression --------------------------
salary_ridge <- train(form = Salary ~ .,
                      data = hitters_train,
                      method = "glmnet",
                      trControl = ctrl_cv,
                      preProcess = c("center", "scale"),  # Standardise
                      tuneGrid = expand.grid(alpha = 0,  # Ridge penalty
                                             lambda = lambda_vec))
1. Print your salary_ridge object. What do you see?
salary_ridge
glmnet

50 samples
19 predictors

Pre-processing: centered (19), scaled (19)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 45, 45, 44, 46, 45, 45, ...
Resampling results across tuning parameters:

  lambda    RMSE  Rsquared  MAE
  1.00e-04  454   0.313     340
  1.20e-04  454   0.313     340
  1.45e-04  454   0.313     340
  ...       ...   ...       ...
  2.15e+01  454   0.313     340
  2.60e+01  454   0.314     340
  3.13e+01  451   0.317     338
  3.76e+01  447   0.321     334
  4.53e+01  442   0.326     331
  5.46e+01  437   0.331     328
  6.58e+01  432   0.337     324
  7.92e+01  427   0.343     322
  9.55e+01  423   0.350     319
  1.15e+02  418   0.356     316
  1.38e+02  414   0.363     314
  1.67e+02  411   0.370     312
  2.01e+02  407   0.376     310
  2.42e+02  404   0.383     308
  2.92e+02  401   0.389     306
  3.51e+02  399   0.395     304
  4.23e+02  397   0.402     302
  5.09e+02  395   0.408     301
  6.14e+02  394   0.414     299
  7.39e+02  393   0.421     299
  8.90e+02  393   0.428     298
  1.07e+03  394   0.436     298
  1.29e+03  395   0.444     299
  1.56e+03  396   0.452     300
  1.87e+03  399   0.461     302
  2.26e+03  401   0.470     304
  2.72e+03  405   0.479     307
  3.27e+03  408   0.487     309
  3.94e+03  412   0.495     313
  4.75e+03  417   0.501     317
  5.72e+03  421   0.507     321
  6.89e+03  426   0.512     325
  8.30e+03  431   0.516     329
  1.00e+04  435   0.520     333

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 890.
1. Plot your salary_ridge object. What do you see? Which value of the regularization parameter seems to be the best?
# Plot salary_ridge object
plot(XX)
plot(salary_ridge)
1. Print the best value of lambda by running the following code. Does this match what you saw in the plot above?
# Print best regularisation parameter
salary_ridge$bestTune$lambda
[1] 890
1. What were your final regression model coefficients for the best lambda value? Find them by running the following code:
# Get coefficients from best lambda value
coef(salary_ridge$finalModel,
salary_ridge$bestTune$lambda)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 608.79
AtBat        -6.30
Hits          6.78
HmRun        -7.65
Runs          8.16
RBI           6.96
Walks        11.70
Years        13.70
CAtBat       29.45
CHits        31.17
CHmRun       24.79
CRuns        37.78
CRBI         33.51
CWalks       39.65
LeagueN      39.21
DivisionW   -37.83
PutOuts      13.73
Assists      37.15
Errors      -10.95
NewLeagueN   26.18
1. How do these coefficients compare to what you found in regular regression? Are they similar? Different?
# Actually, they look quite different! The reason why is that we have changed the scale of the features using preProcess.

# If you want the coefficients on the original scale, you'd need to convert them, or run your training again without any pre-processing. However, this can lead to problems finding the optimal lambda value...
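
If you do want to convert them yourself, here is a minimal sketch (an addition, not part of the original practical, assuming salary_ridge was trained with preProcess = c("center", "scale") as above): dividing each standardised slope by the standard deviation of its predictor recovers the slope on the original scale.
# Sketch (assumption): rescale standardised ridge slopes to original units
X <- model.matrix(Salary ~ ., data = hitters_train)[, -1]  # numeric feature matrix
betas_std <- coef(salary_ridge$finalModel, salary_ridge$bestTune$lambda)
betas_std[-1, 1] / apply(X, 2, sd)  # slopes on the original scale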

### E - Lasso Regression

It’s time to fit an optimized regression model with a Lasso penalty!

1. Before we can fit a lasso regression model, we need to specify which values of the lambda penalty parameter we want to try. Using the code below, create a vector called lambda_vec which contains 100 values spanning a wide range, from very close to 0 to 10,000
# Determine possible values of lambda
lambda_vec <- 10 ^ seq(from = -4, to = 4, length = 100)
1. Fit a lasso regression model predicting Salary as a function of all features
• Set the formula to Salary ~ .
• Set the data to hitters_train
• Set the method to "glmnet" for regularized regression
• Set the train control argument to ctrl_cv
• Set the preProcess argument to c("center", "scale") to make sure the variables are standardised when estimating the beta weights (this is good practice for lasso regression)
• Set the tuneGrid argument such that alpha is 1 (for lasso regression), and with all lambda values you specified in lambda_vec (we’ve done this for you below)
# Lasso Regression --------------------------
salary_lasso <- train(form = XX ~ .,
                      data = XX,
                      method = "XX",
                      trControl = XX,
                      preProcess = c("XX", "XX"),  # Standardise
                      tuneGrid = expand.grid(alpha = 1,  # Lasso penalty
                                             lambda = lambda_vec))
# Fit a lasso regression
salary_lasso <- train(form = Salary ~ .,
                      data = hitters_train,
                      method = "glmnet",
                      trControl = ctrl_cv,
                      preProcess = c("center", "scale"),
                      tuneGrid = expand.grid(alpha = 1,  # Lasso penalty
                                             lambda = lambda_vec))
1. Print your salary_lasso object. What do you see?
salary_lasso
glmnet

50 samples
19 predictors

Pre-processing: centered (19), scaled (19)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 45, 44, 44, 45, 46, 44, ...
Resampling results across tuning parameters:

lambda    RMSE  Rsquared  MAE
1.00e-04  611   0.205     491
1.20e-04  611   0.205     491
1.45e-04  611   0.205     491
1.75e-04  611   0.205     491
2.10e-04  611   0.205     491
2.54e-04  611   0.205     491
3.05e-04  611   0.205     491
3.68e-04  611   0.205     491
4.43e-04  611   0.205     491
5.34e-04  611   0.205     491
6.43e-04  611   0.205     491
7.74e-04  611   0.205     491
9.33e-04  611   0.205     491
1.12e-03  611   0.205     491
1.35e-03  611   0.205     491
1.63e-03  611   0.205     491
1.96e-03  611   0.205     491
2.36e-03  611   0.205     491
2.85e-03  611   0.205     491
3.43e-03  611   0.205     491
4.13e-03  611   0.205     491
4.98e-03  611   0.205     491
5.99e-03  611   0.205     491
7.22e-03  611   0.205     491
8.70e-03  611   0.205     491
1.05e-02  611   0.205     491
1.26e-02  611   0.205     491
1.52e-02  611   0.205     491
1.83e-02  611   0.205     491
2.21e-02  611   0.205     491
2.66e-02  611   0.205     491
3.20e-02  611   0.205     491
3.85e-02  611   0.205     491
4.64e-02  611   0.205     490
5.59e-02  611   0.205     490
6.73e-02  610   0.205     490
8.11e-02  610   0.205     489
9.77e-02  610   0.205     489
1.18e-01  609   0.206     488
1.42e-01  609   0.206     487
1.71e-01  609   0.206     486
2.06e-01  608   0.205     484
2.48e-01  607   0.206     483
2.98e-01  606   0.206     480
3.59e-01  605   0.206     478
4.33e-01  603   0.205     474
5.21e-01  601   0.205     470
6.28e-01  599   0.204     465
7.56e-01  596   0.204     460
9.11e-01  590   0.204     453
1.10e+00  580   0.208     445
1.32e+00  571   0.216     437
1.59e+00  560   0.226     428
1.92e+00  550   0.240     420
2.31e+00  538   0.258     410
2.78e+00  523   0.283     399
3.35e+00  508   0.315     387
4.04e+00  494   0.337     377
4.86e+00  485   0.343     371
5.86e+00  477   0.344     367
7.05e+00  472   0.343     364
8.50e+00  469   0.340     362
1.02e+01  466   0.337     361
1.23e+01  464   0.335     360
1.48e+01  459   0.343     356
1.79e+01  452   0.359     351
2.15e+01  444   0.377     346
2.60e+01  439   0.395     343
3.13e+01  434   0.417     339
3.76e+01  429   0.444     334
4.53e+01  425   0.461     330
5.46e+01  424   0.467     328
6.58e+01  426   0.466     328
7.92e+01  430   0.456     328
9.55e+01  433   0.438     328
1.15e+02  433   0.420     327
1.38e+02  432   0.422     326
1.67e+02  434   0.433     329
2.01e+02  447   0.429     338
2.42e+02  466   0.376     353
2.92e+02  481   0.326     373
3.51e+02  480     NaN     375
4.23e+02  480     NaN     375
5.09e+02  480     NaN     375
6.14e+02  480     NaN     375
7.39e+02  480     NaN     375
8.90e+02  480     NaN     375
1.07e+03  480     NaN     375
1.29e+03  480     NaN     375
1.56e+03  480     NaN     375
1.87e+03  480     NaN     375
2.26e+03  480     NaN     375
2.72e+03  480     NaN     375
3.27e+03  480     NaN     375
3.94e+03  480     NaN     375
4.75e+03  480     NaN     375
5.72e+03  480     NaN     375
6.89e+03  480     NaN     375
8.30e+03  480     NaN     375
1.00e+04  480     NaN     375

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 54.6.
1. Plot your salary_lasso object. What do you see? Which value of the regularization parameter seems to be the best?
# Plot salary_lasso object
plot(XX)
plot(salary_lasso)

1. Print the best value of lambda by running the following code. Does this match what you saw in the plot above?
# Print best regularisation parameter
salary_lasso$bestTune$lambda
[1] 54.6
1. What were your final regression model coefficients for the best lambda value? Find them by running the following code:
# Get coefficients from best lambda value
coef(salary_lasso$finalModel, salary_lasso$bestTune$lambda)

20 x 1 sparse Matrix of class "dgCMatrix"
                 1
(Intercept) 608.8
AtBat         .
Hits          .
HmRun         .
Runs          .
RBI           .
Walks         .
Years         .
CAtBat        .
CHits         .
CHmRun        .
CRuns       128.1
CRBI          .
CWalks      105.6
LeagueN      80.2
DivisionW   -48.0
PutOuts       .
Assists      29.9
Errors        .
NewLeagueN    .
1. How do these coefficients compare to what you found in regular regression? Are they similar? Different? Do you notice that any coefficients are now set to exactly 0?
# Yep! I see that many features have been removed: the lasso has set their coefficients to exactly 0.

### F - Decision Tree

It’s time to fit an optimized decision tree model!

1. Before we can fit a decision tree, we need to specify which values of the complexity parameter cp we want to try. Using the code below, create a vector called cp_vec which contains 100 values between 0 and 1
# Determine possible values of the complexity parameter cp
cp_vec <- seq(from = 0, to = 1, length = 100)
1. Fit a decision tree model predicting Salary as a function of all features
• Set the formula to Salary ~ .
• Set the data to hitters_train
• Set the method to "rpart" for decision trees
• Set the train control argument to ctrl_cv
• Set the tuneGrid argument with all cp values you specified in cp_vec
# Decision Tree --------------------------
salary_rpart <- train(form = XX ~ .,
                      data = XX,
                      method = "XX",
                      trControl = XX,
                      tuneGrid = expand.grid(cp = cp_vec))
# Decision Tree --------------------------
salary_rpart <- train(form = Salary ~ .,
                      data = hitters_train,
                      method = "rpart",
                      trControl = ctrl_cv,
                      tuneGrid = expand.grid(cp = cp_vec))
1. Print your salary_rpart object. What do you see?
salary_rpart
CART

50 samples
19 predictors

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 45, 45, 45, 44, 45, 46, ...
Resampling results across tuning parameters:

  cp      RMSE  Rsquared  MAE
  0.0000  494   0.25380   394
  0.0101  494   0.25380   394
  0.0202  494   0.25380   394
  0.0303  494   0.25380   394
  0.0404  491   0.25264   395
  0.0505  491   0.25264   395
  0.0606  511   0.21445   416
  0.0707  511   0.19485   419
  0.0808  500   0.27169   397
  0.0909  499   0.25986   400
  0.1010  515   0.26111   409
  0.1111  512   0.26111   409
  0.1212  508   0.26001   409
  0.1313  499   0.28888   401
  0.1414  499   0.28888   401
  0.1515  500   0.28888   404
  ...     ...   ...       ...
  0.2424  500   0.28888   404
  0.2525  512   0.22942   423
  0.2626  530   0.11363   436
  0.2727  534   0.04042   442
  0.2828  529   0.04583   434
  ...     ...   ...       ...
  0.3030  529   0.04583   434
  0.3131  511   0.05993   413
  0.3232  491   0.00361   386
  ...     ...   ...       ...
  0.4444  491   0.00361   386
  0.4545  478   NaN       378
  ...     ...   ...       ...
  1.0000  478   NaN       378

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 1.
1. Plot your salary_rpart object. What do you see? Which value of the complexity parameter seems to be the best?
# Plot salary_rpart object
plot(XX)
plot(salary_rpart)
1. Print the best value of cp by running the following code. Does this match what you saw in the plot above?
# Print best complexity parameter
salary_rpart$bestTune$cp
[1] 1
1. Plot your final decision tree using the following code
# Visualise your trees
plot(as.party(salary_rpart$finalModel))

1. How do the nodes in the tree compare to the coefficients you found in your regression analyses? Do you see similarities or differences?
# Actually, the tree is just a single root node with no splits! This is because the optimal complexity parameter is so high that the tree was pruned away entirely.
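
If you want to see an actual tree, here is a sketch (my addition, not in the original practical): search a much finer grid of small cp values, which typically yields a tree with at least some splits, and plot the result.
# Sketch (assumption): refit with a finer grid of small cp values
cp_vec_fine <- seq(from = 0, to = 0.2, length = 100)
salary_rpart_fine <- train(form = Salary ~ .,
                           data = hitters_train,
                           method = "rpart",
                           trControl = ctrl_cv,
                           tuneGrid = expand.grid(cp = cp_vec_fine))
plot(as.party(salary_rpart_fine$finalModel))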

### G - Random Forests

It’s time to fit an optimized random forest model!

1. Before we can fit a random forest model, we need to specify which values of the diversity parameter mtry we want to try. Using the code below, create a vector called mtry_vec which is a sequence of numbers from 1 to 10
# Determine possible values of the random forest diversity parameter mtry
mtry_vec <- 1:10
1. Fit a random forest model predicting Salary as a function of all features
• Set the formula to Salary ~ .
• Set the data to hitters_train
• Set the method to "rf" for random forests
• Set the train control argument to ctrl_cv
• Set the tuneGrid argument such that mtry can take on the values you defined in mtry_vec
# Random Forest --------------------------
salary_rf <- train(form = XX ~ .,
                   data = XX,
                   method = "XX",
                   trControl = XX,
                   tuneGrid = expand.grid(mtry = mtry_vec))
# Random Forest --------------------------
salary_rf <- train(form = Salary ~ .,
                   data = hitters_train,
                   method = "rf",
                   trControl = ctrl_cv,
                   tuneGrid = expand.grid(mtry = mtry_vec))
1. Print your salary_rf object. What do you see?
salary_rf
Random Forest

50 samples
19 predictors

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 46, 46, 46, 45, 44, 45, ...
Resampling results across tuning parameters:

mtry  RMSE  Rsquared  MAE
   1  365   0.608     259
   2  346   0.628     245
   3  342   0.630     238
   4  331   0.635     231
   5  329   0.642     231
   6  322   0.647     227
   7  320   0.650     225
   8  318   0.668     225
   9  317   0.656     223
  10  320   0.662     227

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.
1. Plot your salary_rf object. What do you see? Which value of mtry seems to be the best?
# Plot salary_rf object
plot(XX)
plot(salary_rf)

1. Print the best value of mtry by running the following code. Does this match what you saw in the plot above?
# Print best mtry parameter
salary_rf$bestTune$mtry
[1] 9
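
As an optional extra (my addition, not part of the original practical), caret's varImp() function can show which features the random forest relies on most.
# Optional (suggestion): inspect random forest variable importance
varImp(salary_rf)
plot(varImp(salary_rf))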

### H - Estimate prediction accuracy from training folds

1. Using resamples(), calculate the estimated prediction accuracy for each of your models. To do this, put your model objects in a named list (e.g., glm = salary_glm, ...)
# Summarise accuracy statistics across folds
salary_resamples <- resamples(list(glm = XXX,
                                   ridge = XXX,
                                   lasso = XXX,
                                   dt = XXX,
                                   rf = XXX))
# Summarise accuracy statistics across folds
salary_resamples <- resamples(list(glm = salary_glm,
                                   ridge = salary_ridge,
                                   lasso = salary_lasso,
                                   dt = salary_rpart,
                                   rf = salary_rf))
1. Look at the summary of your salary_resamples object with summary(salary_resamples). What does this tell you? Which model do you expect to have the best prediction accuracy for the test data?
# Print summaries of cross-validation accuracy

# I see below that the random forest model has the lowest mean MAE, so I
#  would expect it to be the best model in the true test data
summary(salary_resamples)

Call:
summary.resamples(object = salary_resamples)

Models: glm, ridge, lasso, dt, rf
Number of resamples: 10

MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glm   249.1     292    378  441     611  710    0
ridge 132.8     220    267  298     342  578    0
lasso 148.2     263    347  328     370  524    0
dt    191.7     286    373  378     463  544    0
rf     87.6     169    217  223     264  346    0

RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glm    297     342    467  523     654  914    0
ridge  183     249    325  393     437  907    0
lasso  175     318    401  424     483  885    0
dt     251     342    423  478     665  726    0
rf     112     227    297  317     321  692    0

Rsquared
Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
glm   0.000151 0.00722 0.0733 0.273   0.522 0.771    0
ridge 0.002319 0.10065 0.5269 0.428   0.606 0.991    0
lasso 0.031517 0.27983 0.4492 0.467   0.663 0.906    0
dt          NA      NA     NA   NaN      NA    NA   10
rf    0.271936 0.47366 0.6919 0.656   0.843 0.999    0
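
As an optional extra (an addition, not in the original practical), caret also provides lattice plots of the resampling distributions, which can make these comparisons easier to see than the raw summary table.
# Optional (suggestion): plot resampling distributions across models
dotplot(salary_resamples, metric = "MAE")
bwplot(salary_resamples, metric = "RMSE")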

### I - Calculate prediction accuracy

1. Save the criterion value for the test data as a new vector called criterion_test
# Save salaries of players in test dataset as criterion_test
criterion_test <- XXX$XXX
criterion_test <- hitters_test$Salary
1. Using predict(), save the prediction of your regular regression model salary_glm for the hitters_test data as a new object called glm_pred:
• Set the first argument to salary_glm
• Set the newdata argument to hitters_test
# Save the glm predicted salaries of hitters_test
glm_pred <- predict(XXX, newdata = XXX)
# Save predictions for glm
glm_pred <- predict(salary_glm, newdata = hitters_test)
1. Now do the same with your ridge, lasso, decision tree, and random forest models to get the objects ridge_pred, lasso_pred, rpart_pred and rf_pred
ridge_pred <- predict(XXX, newdata = XXX)
lasso_pred <- predict(XXX, newdata = XXX)
rpart_pred <- predict(XXX, newdata = XXX)
rf_pred <- predict(XXX, newdata = XXX)
# Save predictions from other models
ridge_pred <- predict(salary_ridge, newdata = hitters_test)
lasso_pred <- predict(salary_lasso, newdata = hitters_test)
rpart_pred <- predict(salary_rpart, newdata = hitters_test)
rf_pred <- predict(salary_rf, newdata = hitters_test)
1. Using postResample(), calculate the aggregate prediction accuracy for each model using the template below
• Set the pred argument to your model predictions (e.g.; ridge_pred)
• Set the obs argument to the true criterion values (criterion_test)
# Calculate aggregate accuracy for a model
postResample(pred = XXX,
obs = XXX)
# Prediction error for normal regression
postResample(pred = glm_pred,
obs = hitters_test$Salary)  RMSE Rsquared MAE 552.1782 0.0907 421.8898  # Prediction error for ridge regression postResample(pred = ridge_pred, obs = hitters_test$Salary)
    RMSE Rsquared      MAE
364.885    0.338  275.275 
# Prediction error for lasso regression
postResample(pred = lasso_pred,
             obs = hitters_test$Salary)
    RMSE Rsquared      MAE
 386.992    0.257  286.544
# Prediction error for decision trees
postResample(pred = rpart_pred,
             obs = hitters_test$Salary)
    RMSE Rsquared      MAE
445       NA      358 
# Prediction error for random forests
postResample(pred = rf_pred,
             obs = hitters_test$Salary)
    RMSE Rsquared      MAE
 299.573    0.537  200.313
1. Which of your models had the best performance in the true test data?
# Random forests had the lowest test MAE:
postResample(pred = rf_pred, obs = hitters_test$Salary)[3]
MAE
200 
1. How close were your models’ true prediction errors to the values you estimated in the previous section when you ran resamples()?
# Depends on what you mean by 'close', but they are definitely higher (worse) in the test data
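
Here is a minimal sketch of such a comparison (an addition, not in the original practical; the column name below follows caret's model~metric naming convention for resamples objects):
# Sketch (assumption): compare cross-validated and test MAE for the random forest
cv_mae <- mean(salary_resamples$values$`rf~MAE`)
test_mae <- postResample(pred = rf_pred, obs = criterion_test)["MAE"]
c(cv = cv_mae, test = unname(test_mae))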

### Z - Challenges

1. In addition to ‘regular’ 10 fold cross-validation, you can also do repeated 10-fold cross-validation, where the cross validation procedure is repeated many times. Do you think this will improve your models’ performance? To test this, create a new training control object called ctrl_cv_rep as below. Then, train your models again using ctrl_cv_rep (instead of ctrl_cv), and evaluate their prediction performance. Do they improve? Do you get different optimal tuning values compared to your previous models?
# Repeated cross validation.
#  Folds = 10
#  Repeats = 5
ctrl_cv_rep <- trainControl(method = "repeatedcv",
                            number = 10,
                            repeats = 5)
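
For example, here is a sketch (assuming the objects defined earlier in this practical) of retraining the random forest with repeated cross-validation:
# Sketch: retrain the random forest with repeated cross-validation
salary_rf_rep <- train(form = Salary ~ .,
                       data = hitters_train,
                       method = "rf",
                       trControl = ctrl_cv_rep,
                       tuneGrid = expand.grid(mtry = mtry_vec))
# Compare the selected mtry values
salary_rf_rep$bestTune$mtry
salary_rf$bestTune$mtry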
1. Using the same procedure as above, compare models predicting the prices of houses in King County, Washington using the house_train and house_test datasets.
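
A possible starting point (a sketch; the file names follow the Datasets table below):
# Load house price data
house_train <- read_csv(file = "1_Data/house_train.csv")
house_test <- read_csv(file = "1_Data/house_test.csv")

# Convert all character columns to factor
house_train <- house_train %>%
  mutate_if(is.character, factor)

house_test <- house_test %>%
  mutate_if(is.character, factor)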

2. When using lasso regression, do you find that the lasso sets any beta weights to exactly 0? If so, which ones?

3. Which model does the best and how accurate was it? Was it the same model that performed the best predicting baseball player salaries?

4. Using the same procedure as above, compare models predicting the graduate rate of students from different colleges using the college_train and college_test datasets.

5. When using lasso regression, do you find that the lasso sets any beta weights to exactly 0? If so, which ones?

6. Which model does the best and how accurate was it? Was it the same model that performed the best predicting baseball player salaries?

## Examples

# Model optimization with Regression

library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(caret)        # For ML mastery
library(partykit)     # For decision trees
library(party)        # For decision trees

# Step 1: Load, clean, and explore data ----------------------

# training data

# test data

# Convert all characters to factor
#  Some ML models require factors
data_train <- data_train %>%
  mutate_if(is.character, factor)

data_test <- data_test %>%
  mutate_if(is.character, factor)

# Explore training data
data_train        # Print the dataset
View(data_train)  # Open in a new spreadsheet-like window
dim(data_train)   # Print dimensions
names(data_train) # Print the names

# Define criterion_train and criterion_test
#   We'll use these later to evaluate model accuracy
criterion_train <- data_train$price
criterion_test <- data_test$price

# Step 2: Define training control parameters -------------

# Use 10-fold cross validation
ctrl_cv <- trainControl(method = "cv",
                        number = 10)

# Step 3: Train models -----------------------------
#   Criterion: price
#   Features: carat, depth, table, x, y

# Normal Regression --------------------------
price_glm <- train(form = price ~ carat + depth + table + x + y,
                   data = data_train,
                   method = "glm",
                   trControl = ctrl_cv)

# Print key results
price_glm

# RMSE  Rsquared  MAE
# 1506  0.86      921

# Coefficients
coef(price_glm$finalModel)

# (Intercept)    carat    depth    table        x        y
#     21464.9  11040.4   -215.6    -94.2  -3681.6   2358.9

# Lasso --------------------------

# Vector of lambda values to try
lambda_vec <- 10 ^ seq(-3, 3, length = 100)

price_lasso <- train(form = price ~ carat + depth + table + x + y,
                     data = data_train,
                     method = "glmnet",
                     trControl = ctrl_cv,
                     preProcess = c("center", "scale"),  # Standardise
                     tuneGrid = expand.grid(alpha = 1,  # Lasso
                                            lambda = lambda_vec))

# Print key results
price_lasso

# glmnet
#
# 5000 samples
#    5 predictor
#
# Pre-processing: centered (5), scaled (5)
# Resampling: Cross-Validated (10 fold)
# Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
# Resampling results across tuning parameters:
#
#   lambda    RMSE  Rsquared  MAE
#   1.00e-03  1509  0.858     918
#   1.15e-03  1509  0.858     918
#   1.32e-03  1509  0.858     918
#   1.52e-03  1509  0.858     918

# Plot regularisation parameter versus error
plot(price_lasso)

# Print best regularisation parameter
price_lasso$bestTune$lambda

# Get coefficients from best lambda value
coef(price_lasso$finalModel,
price_lasso$bestTune$lambda)

# 6 x 1 sparse Matrix of class "dgCMatrix"
#                 1
# (Intercept)  4001
# carat        5179
# depth        -300
# table        -213
# x           -3222
# y            1755

# Ridge --------------------------

# Vector of lambda values to try
lambda_vec <- 10 ^ seq(-3, 3, length = 100)

price_ridge <- train(form = price ~ carat + depth + table + x + y,
                     data = data_train,
                     method = "glmnet",
                     trControl = ctrl_cv,
                     preProcess = c("center", "scale"),  # Standardise
                     tuneGrid = expand.grid(alpha = 0,  # Ridge penalty
                                            lambda = lambda_vec))

# Print key results
price_ridge

# glmnet
#
# 5000 samples
#    5 predictor
#
# Pre-processing: centered (5), scaled (5)
# Resampling: Cross-Validated (10 fold)
# Summary of sample sizes: 4500, 4500, 4500, 4500, 4500, 4500, ...
# Resampling results across tuning parameters:
#
#   lambda    RMSE  Rsquared  MAE
#   1.00e-03  1638  0.835     1137
#   1.15e-03  1638  0.835     1137
#   1.32e-03  1638  0.835     1137
#   1.52e-03  1638  0.835     1137
#   1.75e-03  1638  0.835     1137

# Plot regularisation parameter versus error
plot(price_ridge)

# Print best regularisation parameter
price_ridge$bestTune$lambda

# Get coefficients from best lambda value
coef(price_ridge$finalModel, price_ridge$bestTune$lambda)

# 6 x 1 sparse Matrix of class "dgCMatrix"
#                 1
# (Intercept)  4001
# carat        2059
# depth        -131
# table        -168
# x             716
# y             797

# Decision Trees --------------------------

# Vector of cp values to try
cp_vec <- seq(0, .1, length = 100)

price_rpart <- train(form = price ~ carat + depth + table + x + y,
                     data = data_train,
                     method = "rpart",
                     trControl = ctrl_cv,
                     tuneGrid = expand.grid(cp = cp_vec))

# Print key results
price_rpart

# Plot complexity parameter vs. error
plot(price_rpart)

# Print best complexity parameter
price_rpart$bestTune$cp

# [1] 0.00202

# Step 4: Estimate prediction accuracy from folds ----

# Get accuracy statistics across folds
resamples_price <- resamples(list(ridge = price_ridge,
                                  lasso = price_lasso,
                                  glm = price_glm))

# Print summary of accuracies
summary(resamples_price)

# MAE
#        Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# ridge  1094    1100   1117 1137    1170 1217    0
# lasso   869     887    929  918     944  960    0
# glm     856     882    921  920     949  986    0
#
# RMSE
#        Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# ridge  1545    1580   1609 1638    1703 1772    0
# lasso  1323    1479   1518 1509    1583 1593    0
# glm    1350    1429   1526 1509    1582 1702    0
#
# Rsquared
#         Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
# ridge  0.798   0.828  0.836 0.835   0.848 0.854    0
# lasso  0.827   0.846  0.858 0.858   0.868 0.902    0
# glm    0.819   0.849  0.863 0.860   0.870 0.888    0

# Step 5: Measure prediction accuracy -------------------

# GLM ================================

# Predictions
glm_pred <- predict(price_glm, newdata = data_test)

# Calculate aggregate accuracy
postResample(pred = glm_pred, obs = criterion_test)

#     RMSE Rsquared      MAE
# 1654.017    0.832  944.854

# Ridge ================================

# Predictions
ridge_pred <- predict(price_ridge, newdata = data_test)

# Calculate aggregate accuracy
postResample(pred = ridge_pred, obs = criterion_test)

#     RMSE Rsquared      MAE
# 1650.541    0.832 1133.063

# Lasso ================================

# Predictions
lasso_pred <- predict(price_lasso, newdata = data_test)

# Calculate aggregate accuracy
postResample(pred = lasso_pred, obs = criterion_test)

#     RMSE Rsquared      MAE
# 1653.675    0.832  942.870

# Visualise Accuracy -------------------------

# Tidy competition results
accuracy <- tibble(criterion_test = criterion_test,
                   Regression = glm_pred,
                   Ridge = ridge_pred,
                   Lasso = lasso_pred) %>%
  gather(model, prediction, -criterion_test) %>%
  # Add error measures
  mutate(se = prediction - criterion_test,
         ae = abs(prediction - criterion_test))

# Calculate summaries
accuracy_agg <- accuracy %>%
  group_by(model) %>%
  summarise(mae = mean(ae))   # Calculate MAE (mean absolute error)

# Plot A) Scatterplot of truth versus predictions
ggplot(data = accuracy,
       aes(x = criterion_test, y = prediction, col = model)) +
  geom_point(alpha = .5) +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "True Prices",
       y = "Predicted Prices",
       title = "Predicting Diamond Prices",
       subtitle = "Black line indicates perfect performance")

# Plot B) Violin plot of absolute errors
ggplot(data = accuracy,
       aes(x = model, y = ae, fill = model)) +
  geom_violin() +
  geom_jitter(width = .05, alpha = .2) +
  labs(title = "Fitting Absolute Errors",
       subtitle = "Numbers indicate means",
       x = "Model",
       y = "Absolute Error (Log Transformed)") +
  guides(fill = FALSE) +
  annotate(geom = "label",
           x = accuracy_agg$model,
           y = accuracy_agg$mae,
           label = round(accuracy_agg$mae, 2)) +
  scale_y_continuous(trans = 'log')

## Datasets

| File | Rows | Columns |
|------|------|---------|
| hitters_train.csv | 50 | 20 |
| hitters_test.csv | 213 | 20 |
| college_train.csv | 500 | 18 |
| college_test.csv | 277 | 18 |
| house_train.csv | 5000 | 21 |
| house_test.csv | 1000 | 21 |
• The hitters_train and hitters_test data are taken from the Hitters dataset in the ISLR package. They are data frames with observations of major league baseball players from the 1986 and 1987 seasons.

• The college_train and college_test data are taken from the College dataset in the ISLR package. They contain statistics for a large number of US Colleges from the 1995 issue of US News and World Report.

• The house_train and house_test data come from https://www.kaggle.com/harlfoxem/housesalesprediction

#### Variable description of hitters_train and hitters_test

| Name | Description |
|------|-------------|
| Salary | 1987 annual salary on opening day in thousands of dollars. |
| AtBat | Number of times at bat in 1986. |
| Hits | Number of hits in 1986. |
| HmRun | Number of home runs in 1986. |
| Runs | Number of runs in 1986. |
| RBI | Number of runs batted in in 1986. |
| Walks | Number of walks in 1986. |
| Years | Number of years in the major leagues. |
| CAtBat | Number of times at bat during his career. |
| CHits | Number of hits during his career. |
| CHmRun | Number of home runs during his career. |
| CRuns | Number of runs during his career. |
| CRBI | Number of runs batted in during his career. |
| CWalks | Number of walks during his career. |
| League | A factor with levels A and N indicating player’s league at the end of 1986. |
| Division | A factor with levels E and W indicating player’s division at the end of 1986. |
| PutOuts | Number of put outs in 1986. |
| Assists | Number of assists in 1986. |
| Errors | Number of errors in 1986. |
| NewLeague | A factor with levels A and N indicating player’s league at the beginning of 1987. |

#### Variable description of college_train and college_test

| Name | Description |
|------|-------------|
| Private | A factor with levels No and Yes indicating private or public university. |
| Apps | Number of applications received. |
| Accept | Number of applications accepted. |
| Enroll | Number of new students enrolled. |
| Top10perc | Pct. new students from top 10% of H.S. class. |
| Top25perc | Pct. new students from top 25% of H.S. class. |
| F.Undergrad | Number of fulltime undergraduates. |
| P.Undergrad | Number of parttime undergraduates. |
| Outstate | Out-of-state tuition. |
| Room.Board | Room and board costs. |
| Books | Estimated book costs. |
| Personal | Estimated personal spending. |
| PhD | Pct. of faculty with Ph.D.’s. |
| Terminal | Pct. of faculty with terminal degree. |
| S.F.Ratio | Student/faculty ratio. |
| perc.alumni | Pct. alumni who donate. |
| Expend | Instructional expenditure per student. |
| Grad.Rate | Graduation rate. |

#### Variable description of house_train and house_test

| Name | Description |
|------|-------------|
| price | Price of the house in $. |
| bedrooms | Number of bedrooms. |
| bathrooms | Number of bathrooms. |
| sqft_living | Square footage of the home. |
| sqft_lot | Square footage of the lot. |
| floors | Total floors (levels) in house. |
| waterfront | House which has a view to a waterfront. |
| view | Has been viewed. |
| condition | How good the condition is (overall). |
| grade | Overall grade given to the housing unit, based on King County grading system. |
| sqft_above | Square footage of house apart from basement. |
| sqft_basement | Square footage of the basement. |
| yr_built | Built year. |
| yr_renovated | Year when house was renovated. |
| zipcode | Zip code. |
| lat | Latitude coordinate. |
| long | Longitude coordinate. |
| sqft_living15 | Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area. |
| sqft_lot15 | Lot size area in 2015 (implies some renovations). |

## Functions

### Packages

| Package | Installation |
|---------|--------------|
| tidyverse | install.packages("tidyverse") |
| caret | install.packages("caret") |
| partykit | install.packages("partykit") |
| party | install.packages("party") |

### Functions

| Function | Package | Description |
|----------|---------|-------------|
| trainControl() | caret | Define modelling control parameters |
| train() | caret | Train a model |
| predict(object, newdata) | stats | Predict the criterion values of newdata based on object |
| postResample() | caret | Calculate aggregate model performance in regression tasks |
| confusionMatrix() | caret | Calculate aggregate model performance in classification tasks |