Machine Learning with R | Basel R Bootcamp
By the end of this practical you will know how to:
- identify and remove excessively correlated and low-variance features
- determine the importance of features for a fitted model
- reduce the feature space using principal component analysis (PCA)
- select features automatically using recursive feature elimination
Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data file(s) listed in the Datasets section are in your 1_Data folder.
Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called Features_practical.R in the 2_Code folder.
Using library(), load the set of packages for this practical listed in the packages section above.
# Load packages necessary for this script
library(tidyverse)
library(caret)
Using read_csv(), load the data files listed in the Datasets section as new objects.
# Pima Indians diabetes
pima_diabetes <- read_csv(file = "1_Data/pima_diabetes.csv")
# (Non-) violent crime statistics
violent_crime <- read_csv(file = "1_Data/violent_crime.csv")
nonviolent_crime <- read_csv(file = "1_Data/nonviolent_crime.csv")
# murders crime statistics
murders_crime <- read_csv(file = "1_Data/murders_crime.csv")
In this section, you will explore feature selection for the Pima Indians Diabetes data set. The Pima are a group of Native Americans living in Arizona. A genetic predisposition allowed this group to live normally on a diet poor in carbohydrates for years. In recent years, however, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, has led them to develop the highest prevalence of type 2 diabetes. For this reason they have been the subject of many studies.
Get to know the pima_diabetes data frame by printing it to the console.
pima_diabetes
# A tibble: 724 x 7
diabetes pregnant glucose pressure mass pedigree age
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 pos 6 148 72 33.6 0.627 50
2 neg 1 85 66 26.6 0.351 31
3 pos 8 183 64 23.3 0.672 32
4 neg 1 89 66 28.1 0.167 21
5 pos 0 137 40 43.1 2.29 33
6 neg 5 116 74 25.6 0.201 30
7 pos 3 78 50 31 0.248 26
8 pos 2 197 70 30.5 0.158 53
9 neg 4 110 92 37.6 0.191 30
10 pos 10 168 74 38 0.537 34
# … with 714 more rows
Print the dimensions of pima_diabetes using the dim() function.
dim(pima_diabetes)
[1] 724 7
Print the names of the features using the names() function.
names(pima_diabetes)
[1] "diabetes" "pregnant" "glucose" "pressure" "mass" "pedigree"
[7] "age"
Open the data set in a viewer using View(). Do the data look OK?
View(pima_diabetes)
Split the data into a training set called pima_train and a test set called pima_test using the createDataPartition() function. Set p = .15 to select (only) 15% of cases for the training set. See the code below. Also, store the variable diabetes from the test set as a factor, which will be the criterion.
# split index
train_index <- createDataPartition(XX$XX, p = .15, list = FALSE)
# train and test sets
pima_train <- XX %>% slice(train_index)
pima_test <- XX %>% slice(-train_index)
# test criterion
criterion <- as.factor(pima_test$XX)
# split index
train_index <- createDataPartition(pima_diabetes$diabetes, p = .15, list = FALSE)
# train and test sets
pima_train <- pima_diabetes %>% slice(train_index)
pima_test <- pima_diabetes %>% slice(-train_index)
# test criterion
criterion <- as.factor(pima_test$diabetes)
OK, with the training set in place, let's get to work and remove some features. First, split the training set into a predictor set and the criterion.
# Select predictors
pima_train_pred <- pima_train %>% select(-XX)
# Select criterion
pima_train_crit <- pima_train %>% pull(XX)
# Select predictors
pima_train_pred <- pima_train %>% select(-diabetes)
# Select criterion
pima_train_crit <- pima_train %>% pull(diabetes)
Check for excessively correlated features using cor() and findCorrelation(). Are there any?
# determine correlation matrix
corr_matrix <- cor(XX_pred)
# find excessively correlated variables
findCorrelation(corr_matrix)
# determine correlation matrix
corr_matrix <- cor(pima_train_pred)
# find excessively correlated variables
findCorrelation(corr_matrix)
integer(0)
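By default, findCorrelation() flags variables using a pairwise correlation cutoff of .90. If you want to be stricter, you can lower the cutoff. A minimal sketch (the .75 value is just an illustrative choice):
# find variables correlated beyond a stricter cutoff
findCorrelation(corr_matrix, cutoff = .75)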
Check for near-zero-variance features using the nearZeroVar() function. Any of those?
# find near zero variance predictors
nearZeroVar(XX_pred)
# find near zero variance predictors
nearZeroVar(pima_train_pred)
integer(0)
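If you would like to see the diagnostics behind this check, nearZeroVar() can also return its metrics for every predictor. A minimal sketch:
# show variance diagnostics for each predictor
nearZeroVar(pima_train_pred, saveMetrics = TRUE)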
As we identified no problems with the features in our data set, we have retained all of them. In this section, you will carry out feature selection on the grounds of feature importance. To do this, we first need to fit a model on the basis of which we can determine the importance of features. How about using a simple logistic regression, aka method = "glm"?
Fit a glm model to the training data.
# fit regression
pima_glm <- train(diabetes ~ .,
data = XX,
method = XX
)
# fit regression
pima_glm <- train(diabetes ~ .,
data = pima_train,
method = "glm")
Determine the importance of the model's features using varImp(). The function will show importance on a scale from 0 (least important feature) to 100 (most important feature). You can set scale = FALSE to see the unscaled importance measures instead (for a glm, the absolute values of the coefficient test statistics).
# determine variable importance
varimp_glm <- varImp(XX)
# print variable importance
varimp_glm
# print variable importance
plot(varimp_glm)
# determine variable importance
varimp_glm <- varImp(pima_glm)
# print variable importance
varimp_glm
glm variable importance
Overall
glucose 100.00
mass 76.21
age 52.67
pregnant 10.72
pedigree 2.02
pressure 0.00
# print variable importance
plot(varimp_glm)
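To inspect the unscaled importance values instead, you can pass scale = FALSE. A minimal sketch:
# determine unscaled variable importance
varImp(pima_glm, scale = FALSE)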
Now, let’s create a second model using the best features and compare performances.
# fit glm with best four features
pima_glm4 <- train(diabetes ~ XX + YY + ZZ + AA,
data = XX,
method = XX)
# fit glm with best four features
pima_glm4 <- train(diabetes ~ glucose + mass + pregnant + pedigree,
data = pima_train,
method = "glm")
Evaluate the predictions of the two models on the test set using confusionMatrix(). Which model is better?
# determine predictions for test data
pima_glm_pred <- predict(XX, newdata = XX)
pima_glm4_pred <- predict(XX, newdata = XX)
# evaluate the results
confusionMatrix(XX, reference = XX)
confusionMatrix(XX, reference = XX)
# determine predictions for test data
pima_glm_pred <- predict(pima_glm, newdata = pima_test)
pima_glm4_pred <- predict(pima_glm4, newdata = pima_test)
# evaluate the results
confusionMatrix(pima_glm_pred, criterion)
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 344 89
pos 59 122
Accuracy : 0.759
95% CI : (0.723, 0.792)
No Information Rate : 0.656
P-Value [Acc > NIR] : 2.34e-08
Kappa : 0.447
Mcnemar's Test P-Value : 0.0171
Sensitivity : 0.854
Specificity : 0.578
Pos Pred Value : 0.794
Neg Pred Value : 0.674
Prevalence : 0.656
Detection Rate : 0.560
Detection Prevalence : 0.705
Balanced Accuracy : 0.716
'Positive' Class : neg
confusionMatrix(pima_glm4_pred, criterion)
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 353 96
pos 50 115
Accuracy : 0.762
95% CI : (0.727, 0.795)
No Information Rate : 0.656
P-Value [Acc > NIR] : 8.3e-09
Kappa : 0.444
Mcnemar's Test P-Value : 0.000196
Sensitivity : 0.876
Specificity : 0.545
Pos Pred Value : 0.786
Neg Pred Value : 0.697
Prevalence : 0.656
Detection Rate : 0.575
Detection Prevalence : 0.731
Balanced Accuracy : 0.710
'Positive' Class : neg
You might have observed that the model with two fewer features is actually slightly better than the full model (if this is not the case, keep in mind that the partitioning of the data set was done randomly; if you do it a second time, your results may change slightly). Why do you think this is the case?
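If you want the split (and hence your results) to be reproducible, you can fix the random number generator before partitioning. A minimal sketch (the seed value is arbitrary):
# make the random split reproducible
set.seed(42)
train_index <- createDataPartition(pima_diabetes$diabetes, p = .15, list = FALSE)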
Play around: up the proportion dedicated to training or use a different model, e.g., a random forest, and see whether things change.
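For instance, a random forest fit could look like this. A minimal sketch, assuming the randomForest package is installed (caret uses it for method = "rf"):
# fit a random forest instead of a glm
pima_rf <- train(diabetes ~ .,
                 data = pima_train,
                 method = "rf")
# evaluate on the test set
pima_rf_pred <- predict(pima_rf, newdata = pima_test)
confusionMatrix(pima_rf_pred, reference = criterion)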
In this section, you will explore feature selection using a different data set. The data combine socio-economic data from the 1990 US Census, data from the Law Enforcement Management and Administrative Statistics survey, and crime data from the FBI. The goal of this section is to use various socio-demographic variables to predict whether murders have been committed (murders), the criterion of this exercise.
Get to know the murders_crime data frame by printing it to the console.
murders_crime
# A tibble: 1,823 x 102
murders population householdsize racepctblack racePctWhite racePctAsian
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 yes 27591 2.63 0.17 94.8 1.6
2 yes 36830 2.6 42.4 53.7 0.54
3 yes 23928 2.6 11.0 81.3 1.78
4 no 15675 2.59 4.08 84.1 0.54
5 yes 96086 3.19 8.54 59.8 17.2
6 no 48622 2.44 3.14 95.5 0.77
7 no 10444 3.02 0.41 98.9 0.39
8 yes 222103 2.49 11.4 82.4 3.77
9 no 15535 2.35 14.9 84.6 0.32
10 yes 24664 2.73 13.7 85.0 0.53
# … with 1,813 more rows, and 96 more variables: racePctHisp <dbl>,
# agePct12t21 <dbl>, agePct12t29 <dbl>, agePct16t24 <dbl>,
# agePct65up <dbl>, numbUrban <dbl>, pctUrban <dbl>, medIncome <dbl>,
# pctWWage <dbl>, pctWFarmSelf <dbl>, pctWInvInc <dbl>,
# pctWSocSec <dbl>, pctWPubAsst <dbl>, pctWRetire <dbl>,
# medFamInc <dbl>, perCapInc <dbl>, whitePerCap <dbl>,
# blackPerCap <dbl>, indianPerCap <dbl>, AsianPerCap <dbl>,
# HispPerCap <dbl>, NumUnderPov <dbl>, PctPopUnderPov <dbl>,
# PctLess9thGrade <dbl>, PctNotHSGrad <dbl>, PctBSorMore <dbl>,
# PctUnemployed <dbl>, PctEmploy <dbl>, PctEmplManu <dbl>,
# PctEmplProfServ <dbl>, PctOccupManu <dbl>, PctOccupMgmtProf <dbl>,
# MalePctDivorce <dbl>, MalePctNevMarr <dbl>, FemalePctDiv <dbl>,
# TotalPctDiv <dbl>, PersPerFam <dbl>, PctFam2Par <dbl>,
# PctKids2Par <dbl>, PctYoungKids2Par <dbl>, PctTeen2Par <dbl>,
# PctWorkMomYoungKids <dbl>, PctWorkMom <dbl>,
# NumKidsBornNeverMar <dbl>, PctKidsBornNeverMar <dbl>, NumImmig <dbl>,
# PctImmigRecent <dbl>, PctImmigRec5 <dbl>, PctImmigRec8 <dbl>,
# PctImmigRec10 <dbl>, PctRecentImmig <dbl>, PctRecImmig5 <dbl>,
# PctRecImmig8 <dbl>, PctRecImmig10 <dbl>, PctSpeakEnglOnly <dbl>,
# PctNotSpeakEnglWell <dbl>, PctLargHouseFam <dbl>,
# PctLargHouseOccup <dbl>, PersPerOccupHous <dbl>,
# PersPerOwnOccHous <dbl>, PersPerRentOccHous <dbl>,
# PctPersOwnOccup <dbl>, PctPersDenseHous <dbl>, PctHousLess3BR <dbl>,
# MedNumBR <dbl>, HousVacant <dbl>, PctHousOccup <dbl>,
# PctHousOwnOcc <dbl>, PctVacantBoarded <dbl>, PctVacMore6Mos <dbl>,
# MedYrHousBuilt <dbl>, PctHousNoPhone <dbl>, PctWOFullPlumb <dbl>,
# OwnOccLowQuart <dbl>, OwnOccMedVal <dbl>, OwnOccHiQuart <dbl>,
# OwnOccQrange <dbl>, RentLowQ <dbl>, RentMedian <dbl>, RentHighQ <dbl>,
# RentQrange <dbl>, MedRent <dbl>, MedRentPctHousInc <dbl>,
# MedOwnCostPctInc <dbl>, MedOwnCostPctIncNoMtg <dbl>,
# NumInShelters <dbl>, NumStreet <dbl>, PctForeignBorn <dbl>,
# PctBornSameState <dbl>, PctSameHouse85 <dbl>, PctSameCity85 <dbl>,
# PctSameState85 <dbl>, LandArea <dbl>, PopDens <dbl>,
# PctUsePubTrans <dbl>, LemasPctOfficDrugUn <dbl>
Print the dimensions of murders_crime using the dim() function.
dim(murders_crime)
[1] 1823 102
Print the names of the features using the names() function.
names(murders_crime)
[1] "murders" "population"
[3] "householdsize" "racepctblack"
[5] "racePctWhite" "racePctAsian"
[7] "racePctHisp" "agePct12t21"
[9] "agePct12t29" "agePct16t24"
[11] "agePct65up" "numbUrban"
[13] "pctUrban" "medIncome"
[15] "pctWWage" "pctWFarmSelf"
[17] "pctWInvInc" "pctWSocSec"
[19] "pctWPubAsst" "pctWRetire"
[21] "medFamInc" "perCapInc"
[23] "whitePerCap" "blackPerCap"
[25] "indianPerCap" "AsianPerCap"
[27] "HispPerCap" "NumUnderPov"
[29] "PctPopUnderPov" "PctLess9thGrade"
[31] "PctNotHSGrad" "PctBSorMore"
[33] "PctUnemployed" "PctEmploy"
[35] "PctEmplManu" "PctEmplProfServ"
[37] "PctOccupManu" "PctOccupMgmtProf"
[39] "MalePctDivorce" "MalePctNevMarr"
[41] "FemalePctDiv" "TotalPctDiv"
[43] "PersPerFam" "PctFam2Par"
[45] "PctKids2Par" "PctYoungKids2Par"
[47] "PctTeen2Par" "PctWorkMomYoungKids"
[49] "PctWorkMom" "NumKidsBornNeverMar"
[51] "PctKidsBornNeverMar" "NumImmig"
[53] "PctImmigRecent" "PctImmigRec5"
[55] "PctImmigRec8" "PctImmigRec10"
[57] "PctRecentImmig" "PctRecImmig5"
[59] "PctRecImmig8" "PctRecImmig10"
[61] "PctSpeakEnglOnly" "PctNotSpeakEnglWell"
[63] "PctLargHouseFam" "PctLargHouseOccup"
[65] "PersPerOccupHous" "PersPerOwnOccHous"
[67] "PersPerRentOccHous" "PctPersOwnOccup"
[69] "PctPersDenseHous" "PctHousLess3BR"
[71] "MedNumBR" "HousVacant"
[73] "PctHousOccup" "PctHousOwnOcc"
[75] "PctVacantBoarded" "PctVacMore6Mos"
[77] "MedYrHousBuilt" "PctHousNoPhone"
[79] "PctWOFullPlumb" "OwnOccLowQuart"
[81] "OwnOccMedVal" "OwnOccHiQuart"
[83] "OwnOccQrange" "RentLowQ"
[85] "RentMedian" "RentHighQ"
[87] "RentQrange" "MedRent"
[89] "MedRentPctHousInc" "MedOwnCostPctInc"
[91] "MedOwnCostPctIncNoMtg" "NumInShelters"
[93] "NumStreet" "PctForeignBorn"
[95] "PctBornSameState" "PctSameHouse85"
[97] "PctSameCity85" "PctSameState85"
[99] "LandArea" "PopDens"
[101] "PctUsePubTrans" "LemasPctOfficDrugUn"
Open the data set in a viewer using View(). Do the data look OK?
View(murders_crime)
Split the data into murders_train and murders_test using createDataPartition(), with (only) 25% of cases going into the training set. Also, store the variable murders from the test set, i.e., the criterion, and turn it into a factor using as.factor().
# split index
train_index <- createDataPartition(murders_crime$murders, p = .25, list = FALSE)
# train and test sets
murders_train <- murders_crime %>% slice(train_index)
murders_test <- murders_crime %>% slice(-train_index)
# test criterion
criterion <- as.factor(murders_test$murders)
Before we start modeling, let us get to work and remove some features from the training set. First, split the training set into a predictor set and the criterion.
# Select predictors
murders_train_pred <- murders_train %>% select(-murders)
# Select criterion
murders_train_crit <- murders_train %>% select(murders) %>% pull()
Check for excessively correlated features using cor() and findCorrelation(). Are there any this time?
# determine correlation matrix
corr_matrix <- cor(murders_train_pred)
# find excessively correlated variables
findCorrelation(corr_matrix)
[1] 11 17 27 30 41 44 51 53 54 55 57 58 59 61 64 71 84 85 87 7 8 13 20
[24] 21 31 38 43 1 62 67 79 80 81
This time several features are excessively correlated. Remove them from the predictor set using the template below.
# remove features
murders_train_pred <- murders_train_pred %>% select(- XX)
# remove features
murders_train_pred <- murders_train_pred %>%
select(-findCorrelation(corr_matrix))
Again, check for near-zero-variance features using nearZeroVar(). Any this time?
# find near zero variance predictors
nearZeroVar(murders_train_pred)
integer(0)
Now combine the cleaned predictor set with the criterion to create a clean training set. See the template below.
# clean training set
murders_train_clean <- murders_train_pred %>%
add_column(murders = XX)
# combine clean predictor set with criterion
murders_train_clean <- murders_train_pred %>%
add_column(murders = murders_train_crit)
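To double-check that the flagged features were indeed dropped, you can compare the dimensions of the original and the cleaned training sets:
# compare number of columns before and after cleaning
dim(murders_train)
dim(murders_train_clean)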
Fit a glm twice, once using the original training set and once using the clean training set, and store the fits in separate objects. See the template below.
# fit glm
murders_glm <- train(murders ~ .,
data = XX,
method = "glm")
# fit glm with clean data
murders_glm_clean <- train(murders ~ .,
data = XX,
method = "glm")
# fit glm
murders_glm <- train(murders ~ .,
data = murders_train,
method = "glm")
# fit glm with clean data
murders_glm_clean <- train(murders ~ .,
data = murders_train_clean,
method = "glm")
Evaluate the predictions of both models on the test set using confusionMatrix(). Which model performs better?
# determine predictions for test data
murders_pred <- predict(murders_glm, newdata = murders_test)
murders_clean_pred <- predict(murders_glm_clean, newdata = murders_test)
# evaluate the results
confusionMatrix(murders_pred, criterion)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 446 192
yes 184 545
Accuracy : 0.725
95% CI : (0.7, 0.748)
No Information Rate : 0.539
P-Value [Acc > NIR] : <2e-16
Kappa : 0.447
Mcnemar's Test P-Value : 0.718
Sensitivity : 0.708
Specificity : 0.739
Pos Pred Value : 0.699
Neg Pred Value : 0.748
Prevalence : 0.461
Detection Rate : 0.326
Detection Prevalence : 0.467
Balanced Accuracy : 0.724
'Positive' Class : no
confusionMatrix(murders_clean_pred, criterion)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 471 199
yes 159 538
Accuracy : 0.738
95% CI : (0.714, 0.761)
No Information Rate : 0.539
P-Value [Acc > NIR] : <2e-16
Kappa : 0.475
Mcnemar's Test P-Value : 0.0393
Sensitivity : 0.748
Specificity : 0.730
Pos Pred Value : 0.703
Neg Pred Value : 0.772
Prevalence : 0.461
Detection Rate : 0.345
Detection Prevalence : 0.490
Balanced Accuracy : 0.739
'Positive' Class : no
Another way to deal with correlated features is principal component analysis (PCA). Create a third fit object from the original, full training set, again using method = "glm", but this time add preProcess = c('pca') and trControl = trainControl(preProcOptions = list(thresh = 0.8)) to the training function. These additional settings instruct R to extract new features from the data using PCA and to retain only as many components as are needed to capture 80% of the original variance.
# fit glm with preprocessed features
murders_glm_pca = train(murders ~ .,
data = murders_train,
method = "glm",
preProcess = c("pca"),
trControl = trainControl(preProcOptions = list(thresh = 0.8)))
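If you are curious how many components the PCA retained to reach the 80% variance threshold, you can inspect the preprocessing object stored in the fit. A minimal sketch:
# inspect the PCA preprocessing step
murders_glm_pca$preProcess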
# determine predictions for test data
murders_pca <- predict(murders_glm_pca, newdata = murders_test)
# evaluate the results
confusionMatrix(murders_pca, criterion)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 490 196
yes 140 541
Accuracy : 0.754
95% CI : (0.73, 0.777)
No Information Rate : 0.539
P-Value [Acc > NIR] : <2e-16
Kappa : 0.509
Mcnemar's Test P-Value : 0.0027
Sensitivity : 0.778
Specificity : 0.734
Pos Pred Value : 0.714
Neg Pred Value : 0.794
Prevalence : 0.461
Detection Rate : 0.358
Detection Prevalence : 0.502
Balanced Accuracy : 0.756
'Positive' Class : no
Play around: change the variance threshold of the PCA (using thresh), increase the proportion dedicated to training, or use a different model, e.g., a random forest, and see whether things change.
Using varImp(), evaluate the feature importance for each of the three models used in the previous section.
# determine variable importance
varimp_glm <- varImp(murders_glm)
varimp_glm_clean <- varImp(murders_glm_clean)
varimp_glm_pca <- varImp(murders_glm_pca)
# print variable importance
varimp_glm
glm variable importance
only 20 most important variables shown (out of 99)
Overall
PctForeignBorn 100.0
PctNotSpeakEnglWell 98.2
pctWRetire 97.8
PersPerFam 85.7
PctKidsBornNeverMar 82.0
racepctblack 81.1
PctFam2Par 81.1
PctHousNoPhone 75.6
NumUnderPov 74.3
racePctAsian 73.6
NumStreet 71.7
agePct12t21 69.4
PctKids2Par 65.7
PctBornSameState 63.2
population 58.8
PctOccupManu 57.9
pctWSocSec 57.8
PctEmplManu 56.2
MedRentPctHousInc 52.7
blackPerCap 50.1
varimp_glm_clean
glm variable importance
only 20 most important variables shown (out of 68)
Overall
NumKidsBornNeverMar 100.0
PctKidsBornNeverMar 79.5
PctEmplManu 61.4
PctHousOccup 56.2
MedRentPctHousInc 54.1
PctOccupManu 48.8
pctWRetire 48.2
PctHousNoPhone 46.7
PctUsePubTrans 46.3
PctBornSameState 45.4
racepctblack 45.3
PersPerFam 42.5
PctForeignBorn 39.9
OwnOccQrange 39.1
pctWPubAsst 38.4
PctRecentImmig 35.4
MedOwnCostPctInc 34.8
PctOccupMgmtProf 34.7
FemalePctDiv 34.5
racePctAsian 34.1
varimp_glm_pca
glm variable importance
Overall
PC1 100.00
PC2 94.19
PC4 80.80
PC3 77.90
PC5 74.74
PC11 60.11
PC8 48.74
PC6 35.34
PC7 32.85
PC9 9.67
PC10 0.00
# print variable importance
plot(varimp_glm)
plot(varimp_glm_clean)
plot(varimp_glm_pca)
Analyze the violent and non-violent crime data sets, predicting either the number of violent crimes per 100k inhabitants (ViolentCrimesPerPop) or the number of non-violent crimes per 100k inhabitants (nonViolPerPop). Both criteria are numeric, implying that this is not a classification problem, but one of regression. Other than that, the features in the data sets are identical to the previous analyses. How well can you predict violent or non-violent crimes?
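A minimal sketch of how such a regression analysis could start, here for the violent crime data; postResample() quantifies test performance (RMSE, R squared, MAE), and nonviolent_crime with nonViolPerPop works the same way:
# split index
train_index <- createDataPartition(violent_crime$ViolentCrimesPerPop,
                                   p = .8,
                                   list = FALSE)
# train and test sets
violent_train <- violent_crime %>% slice(train_index)
violent_test <- violent_crime %>% slice(-train_index)
# fit regression
violent_glm <- train(ViolentCrimesPerPop ~ .,
                     data = violent_train,
                     method = "glm")
# evaluate predictions on the test set
violent_pred <- predict(violent_glm, newdata = violent_test)
postResample(pred = violent_pred, obs = violent_test$ViolentCrimesPerPop)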
Another approach to feature selection, beyond selection by hand or PCA, is to have the computer automatically search for the subset of features that leads to the best possible cross-validation performance. One such procedure is called recursive feature elimination. Try it out using caret's rfe() function. See the code below.
# split index
train_index <- createDataPartition(violent_crime$ViolentCrimesPerPop,
p = .8,
list = FALSE)
# train and test sets
violent_train <- violent_crime %>% slice(train_index)
violent_test <- violent_crime %>% slice(-train_index)
# remove extreme correlations (OPTIONAL)
predictors <- violent_train %>% select(-ViolentCrimesPerPop)
predictors <- predictors %>% select(-findCorrelation(cor(predictors)))
violent_train_clean <- predictors %>%
add_column(ViolentCrimesPerPop = violent_train$ViolentCrimesPerPop)
# Feature elimination settings
ctrl_rfe <- rfeControl(functions = lmFuncs, # linear model
method = "cv",
verbose = FALSE,
rerank = FALSE)
# Run feature elimination
profile <- rfe(x = violent_train %>% select(-ViolentCrimesPerPop),
y = violent_train$ViolentCrimesPerPop,
sizes = 1:(ncol(violent_train_clean) - 1), # feature set sizes to consider
rfeControl = ctrl_rfe)
# inspect cross-validation as a function of performance
plot(profile)
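After the run, you can also inspect the outcome of the elimination directly; predictors() returns the feature subset that achieved the best cross-validation performance. A minimal sketch:
# print the elimination results and the selected features
profile
predictors(profile)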
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(tibble) # For advanced tibble functions
library(caret) # For ML mastery
# Step 1: Load, prepare, and explore data ----------------------
# read data
data <- read_csv("1_Data/mpg_num.csv")
# Convert all characters to factors
data <- data %>%
mutate_if(is.character, factor)
# Explore training data
data # Print the dataset
dim(data) # Print dimensions
names(data) # Print the names
# Step 2: Create training and test sets -------------
# Create train index
train_index <- createDataPartition(data$hwy,
p = .8,
list = FALSE)
# Create training and test sets
data_train <- data %>% slice(train_index)
data_test <- data %>% slice(-train_index)
# split predictors and criterion
criterion_train <- data_train %>% select(hwy) %>% pull()
predictors_train <- data_train %>% select(-hwy)
criterion_test <- data_test %>% select(hwy) %>% pull()
predictors_test <- data_test %>% select(-hwy)
# Step 3: Clean data -------------
# Test for excessively correlated features
corr_matrix <- cor(predictors_train)
corr_features <- findCorrelation(corr_matrix)
# Remove excessively correlated features
predictors_train <- predictors_train %>% select(-corr_features)
# Test for near zero variance features
zerovar_features <- nearZeroVar(predictors_train)
# Remove near zero variance features
predictors_train <- predictors_train %>% select(-zerovar_features)
# recombine data
data_train <- predictors_train %>% add_column(hwy = criterion_train)
# Step 4: Define training control parameters -------------
# Train using cross-validation
ctrl_cv <- trainControl(method = "cv")
# Step 5: Fit models -------------
# Fit glm vanilla flavor
hwy_glm <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv)
# Fit with pca transformation
hwy_glm_pca <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv,
preProcess = c('pca'))
# Fit scaling and centering
hwy_glm_sca <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv,
preProcess = c('scale', 'center'))
# Get fits
glm_fit <- predict(hwy_glm)
glm_pca_fit <- predict(hwy_glm_pca)
glm_sca_fit <- predict(hwy_glm_sca)
# Step 6: Evaluate variable importance -------------
# Run varImp()
imp_glm <- varImp(hwy_glm)
imp_glm_pca <- varImp(hwy_glm_pca)
imp_glm_sca <- varImp(hwy_glm_sca)
# Plot variable importance
plot(imp_glm)
plot(imp_glm_pca)
plot(imp_glm_sca)
# Step 7: Select variables -------------
# Select by hand in formula
hwy_glm_sel <- train(form = hwy ~ cty,
data = data_train,
method = "glm",
trControl = ctrl_cv)
# Select by reducing pca criterion ---
# Reduce PCA threshold to 50% variance explained
ctrl_cv_pca <- trainControl(method = "cv",
preProcOptions = list(thresh = 0.50))
# Refit model with update
hwy_glm_sel <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv_pca,
preProcess = c('pca'))
# Step 8: Recursive feature elimination -------------
# Feature elimination settings
ctrl_rfe <- rfeControl(functions = lmFuncs, # linear model
method = "cv",
verbose = FALSE)
# Run feature elimination
profile <- rfe(x = predictors_train,
y = criterion_train,
sizes = c(1, 2, 3), # feature set sizes to consider
rfeControl = ctrl_rfe)
# plot result
trellis.par.set(caretTheme())
plot(profile, type = c("g", "o"))
# Step 9: Evaluate models -------------
# you know how...
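As a pointer, a minimal sketch of what that evaluation could look like for the vanilla glm (the formula interface picks the needed columns out of data_test):
# predict the criterion in the test set
glm_pred <- predict(hwy_glm, newdata = data_test)
# quantify prediction performance
postResample(pred = glm_pred, obs = criterion_test)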
File | Rows | Columns
---|---|---
pima_diabetes | 724 | 7
murders_crime | 1823 | 102
violent_crime | 1000 | 102
nonviolent_crime | 1000 | 102
The pima_diabetes data set is a subset of the PimaIndiansDiabetes2 data set in the mlbench package. It contains medical and demographic data of Pima Indians.
The murders_crime, violent_crime, and nonviolent_crime data are subsets of the Communities and Crime Unnormalized Data Set from the UCI Machine Learning Repository. For column descriptions, see the Communities and Crime Unnormalized Data Set page. Due to the large number of variables (102), we do not include the full variable table here.
pima_diabetes

Name | Description
---|---
pregnant | Number of times pregnant.
glucose | Plasma glucose concentration (glucose tolerance test).
pressure | Diastolic blood pressure (mm Hg).
triceps | Triceps skin fold thickness (mm).
insulin | 2-hour serum insulin (mu U/ml).
mass | Body mass index (weight in kg/(height in m)^2).
pedigree | Diabetes pedigree function.
age | Age (years).
diabetes | Class variable (test for diabetes).
Package | Installation
---|---
tidyverse | install.packages("tidyverse")
tibble | install.packages("tibble")
caret | install.packages("caret")
Function | Package | Description
---|---|---
trainControl() | caret | Define modelling control parameters
train() | caret | Train a model
predict(object, newdata) | stats | Predict the criterion values of newdata based on object
postResample() | caret | Calculate aggregate model performance in regression tasks
confusionMatrix() | caret | Calculate aggregate model performance in classification tasks
varImp() | caret | Determine the model-specific importance of features
findCorrelation(), nearZeroVar() | caret | Identify highly correlated and low-variance features
rfe(), rfeControl() | caret | Run and control recursive feature selection