Machine Learning with R | Basel R Bootcamp

[Cartoon from dilbert.com]

By the end of this practical you will:

- Understand the importance of the curse of dimensionality.
- Know how to eliminate unwanted features.
- Explore and use feature importance.
- Use dimensionality reduction.

- Open your `BaselRBootcamp` R project. It should already have the folders `1_Data` and `2_Code`. Make sure that the data file(s) listed in the `Datasets` section are in your `1_Data` folder.

- Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called `Features_practical.R` in the `2_Code` folder.

- Using `library()`, load the set of packages for this practical listed in the packages section above.

```
# Load packages necessary for this script
library(tidyverse)
library(caret)
```

- In the code below, we will load each of the data sets listed in the `Datasets` section as new objects.

```
# Pima Indians diabetes
pima_diabetes <- read_csv(file = "1_Data/pima_diabetes.csv")
# (Non-) violent crime statistics
violent_crime <- read_csv(file = "1_Data/violent_crime.csv")
nonviolent_crime <- read_csv(file = "1_Data/nonviolent_crime.csv")
# murders crime statistics
murders_crime <- read_csv(file = "1_Data/murders_crime.csv")
```

In this section, you will explore feature selection for the Pima Indians Diabetes data set. The Pima are a group of Native Americans living in Arizona. A genetic predisposition allowed this group to survive normally on a diet poor in carbohydrates for many years. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, has led them to develop the highest prevalence of type 2 diabetes. For this reason, they have been the subject of many studies.

- Take a look at the first few rows of the `pima_diabetes` data frame by printing them to the console.

`pima_diabetes`

```
# A tibble: 724 x 7
diabetes pregnant glucose pressure mass pedigree age
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 pos 6 148 72 33.6 0.627 50
2 neg 1 85 66 26.6 0.351 31
3 pos 8 183 64 23.3 0.672 32
4 neg 1 89 66 28.1 0.167 21
5 pos 0 137 40 43.1 2.29 33
6 neg 5 116 74 25.6 0.201 30
7 pos 3 78 50 31 0.248 26
8 pos 2 197 70 30.5 0.158 53
9 neg 4 110 92 37.6 0.191 30
10 pos 10 168 74 38 0.537 34
# … with 714 more rows
```

- Print the numbers of rows and columns of each data set using the `dim()` function.

`dim(pima_diabetes)`

`[1] 724 7`

- Look at the names of the data frame with the `names()` function.

`names(pima_diabetes)`

```
[1] "diabetes" "pregnant" "glucose" "pressure" "mass" "pedigree"
[7] "age"
```

- Open the data set in a new window using `View()`. Does it look OK?

`View(pima_diabetes)`

- As always, before you do anything you need to make sure that you separate a hold-out data set for later. Create `pima_train` and `pima_test` using `createDataPartition()` with as little as **15% of cases going into the training set**. Also store the variable `diabetes` from the test set as a factor, which will be the criterion.

```
# split index
train_index <- createDataPartition(XX$XX, p = .15, list = FALSE)
# train and test sets
pima_train <- XX %>% slice(train_index)
pima_test <- XX %>% slice(-train_index)
# test criterion
criterion <- as.factor(pima_test$XX)
```

```
# split index
train_index <- createDataPartition(pima_diabetes$diabetes, p = .15, list = FALSE)
# train and test sets
pima_train <- pima_diabetes %>% slice(train_index)
pima_test <- pima_diabetes %>% slice(-train_index)
# test criterion
criterion <- as.factor(pima_test$diabetes)
```
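If you want to convince yourself that the split worked, a quick optional check is to compare the class proportions in the training and test sets:

```
# Optional check: the proportion of diabetes cases should be similar in both sets
prop.table(table(pima_train$diabetes))
prop.table(table(pima_test$diabetes))
```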

OK, with the training set, let’s get to work and remove some features.

- First split the training data into a data frame holding the predictors and the criterion using the code below.

```
# Select predictors
pima_train_pred <- pima_train %>% select(-XX)
# Select criterion
pima_train_crit <- pima_train %>% select(XX) %>% pull()
```

```
# Select predictors
pima_train_pred <- pima_train %>% select(-diabetes)
# Select criterion
pima_train_crit <- pima_train %>% select(diabetes) %>% pull()
```

- Although this data set is rather small and rather well curated, test whether there are any excessively correlated features with `cor()` and `findCorrelation()` using the code below. Are there any?

```
# determine correlation matrix
corr_matrix <- cor(XX_pred)
# find excessively correlated variables
findCorrelation(corr_matrix)
```

```
# determine correlation matrix
corr_matrix <- cor(pima_train_pred)
# find excessively correlated variables
findCorrelation(corr_matrix)
```

`integer(0)`

- Now, test if there are any near-zero variance features. Any of those?

```
# find near zero variance predictors
nearZeroVar(XX_pred)
```

```
# find near zero variance predictors
nearZeroVar(pima_train_pred)
```

`integer(0)`

After having retained all features in the previous section, this section explores feature selection based on feature importance. To do this, we first need to fit a model. How about a simple logistic regression, a.k.a. `method = "glm"`?

- Fit the `glm` to the training data.

```
# fit regression
pima_glm <- train(diabetes ~ .,
                  data = XX,
                  method = XX)
```

```
# fit regression
pima_glm <- train(diabetes ~ .,
                  data = pima_train,
                  method = "glm")
```

- Evaluate feature importance using `varImp()`. The function will show importance on a scale from 0 (least important feature) to 100 (most important feature). You can set `scale = FALSE` to see the raw, unscaled importance measures (t-values).

```
# determine variable importance
varimp_glm <- varImp(XX)
# print variable importance
varimp_glm
# plot variable importance
plot(varimp_glm)
```

```
# determine variable importance
varimp_glm <- varImp(pima_glm)
# print variable importance
varimp_glm
```

```
glm variable importance
Overall
glucose 100.00
mass 35.04
pregnant 32.81
pedigree 9.12
pressure 1.81
age 0.00
```

```
# plot variable importance
plot(varimp_glm)
```
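If you would rather see the raw, unscaled importance values (for the `glm`, the absolute test statistics of the coefficients), pass `scale = FALSE`:

```
# determine unscaled variable importance
varImp(pima_glm, scale = FALSE)
```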

- Fit the glm a second time using only the four best features and store the result in a different fit object.

```
# fit glm with best four features
pima_glm4 <- train(diabetes ~ XX + YY + ZZ + AA,
                   data = XX,
                   method = XX)
```

```
# fit glm with best four features
pima_glm4 <- train(diabetes ~ glucose + mass + pregnant + pedigree,
                   data = pima_train,
                   method = "glm")
```

- Using both fits, all features versus the best four, predict the criterion and evaluate the predictions using `confusionMatrix()`. Which `glm` is better?

```
# determine predictions for test data
pima_glm_pred <- predict(XX, newdata = XX)
pima_glm4_pred <- predict(XX, newdata = XX)
# evaluate the results
confusionMatrix(XX, reference = XX)
confusionMatrix(XX, reference = XX)
```

```
# determine predictions for test data
pima_glm_pred <- predict(pima_glm, newdata = pima_test)
pima_glm4_pred <- predict(pima_glm4, newdata = pima_test)
# evaluate the results
confusionMatrix(pima_glm_pred, criterion)
```

```
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 352 93
pos 51 118
Accuracy : 0.765
95% CI : (0.73, 0.798)
No Information Rate : 0.656
P-Value [Acc > NIR] : 2.84e-09
Kappa : 0.454
Mcnemar's Test P-Value : 0.000634
Sensitivity : 0.873
Specificity : 0.559
Pos Pred Value : 0.791
Neg Pred Value : 0.698
Prevalence : 0.656
Detection Rate : 0.573
Detection Prevalence : 0.725
Balanced Accuracy : 0.716
'Positive' Class : neg
```

`confusionMatrix(pima_glm4_pred, criterion)`

```
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 356 94
pos 47 117
Accuracy : 0.77
95% CI : (0.735, 0.803)
No Information Rate : 0.656
P-Value [Acc > NIR] : 5.31e-10
Kappa : 0.462
Mcnemar's Test P-Value : 0.000107
Sensitivity : 0.883
Specificity : 0.555
Pos Pred Value : 0.791
Neg Pred Value : 0.713
Prevalence : 0.656
Detection Rate : 0.580
Detection Prevalence : 0.733
Balanced Accuracy : 0.719
'Positive' Class : neg
```
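If you only want the headline numbers side by side, one convenient (optional) way is to extract the accuracy from each confusion matrix:

```
# compare overall accuracy of the two fits
c(all_features = confusionMatrix(pima_glm_pred, reference = criterion)$overall["Accuracy"],
  best_four = confusionMatrix(pima_glm4_pred, reference = criterion)$overall["Accuracy"])
```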

You might have observed that the model with two fewer features is actually slightly better than the full model (if this is not the case, keep in mind that the partitioning of the data set was done randomly, i.e., if you run it a second time, your results may change slightly). Why do you think that is the case?

Play around: Increase the proportion dedicated to training or use a different model, e.g., a `random forest`, and see whether things change.
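For instance, here is a minimal sketch of the random forest variant; it assumes the `randomForest` package is installed, which caret's `method = "rf"` relies on:

```
# fit a random forest instead of a glm (may take a moment)
pima_rf <- train(diabetes ~ .,
                 data = pima_train,
                 method = "rf")
# evaluate on the hold-out set, as before
pima_rf_pred <- predict(pima_rf, newdata = pima_test)
confusionMatrix(pima_rf_pred, reference = criterion)
```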

In this section, you will explore feature selection using a data set with very different properties. The data combine socio-economic data from the 1990 US Census, data from the Law Enforcement Management and Administrative Statistics (LEMAS) survey, and crime data from the FBI, allowing you to analyse the relationship between various socio-demographic variables and whether murders have been committed (`murders`), the criterion of this exercise.

- Take a look at the first few rows of the `murders_crime` data frame by printing them to the console.

`murders_crime`

```
# A tibble: 1,823 x 102
murders population householdsize racepctblack racePctWhite racePctAsian
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 yes 27591 2.63 0.17 94.8 1.6
2 yes 36830 2.6 42.4 53.7 0.54
3 yes 23928 2.6 11.0 81.3 1.78
4 no 15675 2.59 4.08 84.1 0.54
5 yes 96086 3.19 8.54 59.8 17.2
6 no 48622 2.44 3.14 95.5 0.77
7 no 10444 3.02 0.41 98.9 0.39
8 yes 222103 2.49 11.4 82.4 3.77
9 no 15535 2.35 14.9 84.6 0.32
10 yes 24664 2.73 13.7 85.0 0.53
# … with 1,813 more rows, and 96 more variables: racePctHisp <dbl>,
# agePct12t21 <dbl>, agePct12t29 <dbl>, agePct16t24 <dbl>,
# agePct65up <dbl>, numbUrban <dbl>, pctUrban <dbl>, medIncome <dbl>,
# pctWWage <dbl>, pctWFarmSelf <dbl>, pctWInvInc <dbl>,
# pctWSocSec <dbl>, pctWPubAsst <dbl>, pctWRetire <dbl>,
# medFamInc <dbl>, perCapInc <dbl>, whitePerCap <dbl>,
# blackPerCap <dbl>, indianPerCap <dbl>, AsianPerCap <dbl>,
# HispPerCap <dbl>, NumUnderPov <dbl>, PctPopUnderPov <dbl>,
# PctLess9thGrade <dbl>, PctNotHSGrad <dbl>, PctBSorMore <dbl>,
# PctUnemployed <dbl>, PctEmploy <dbl>, PctEmplManu <dbl>,
# PctEmplProfServ <dbl>, PctOccupManu <dbl>, PctOccupMgmtProf <dbl>,
# MalePctDivorce <dbl>, MalePctNevMarr <dbl>, FemalePctDiv <dbl>,
# TotalPctDiv <dbl>, PersPerFam <dbl>, PctFam2Par <dbl>,
# PctKids2Par <dbl>, PctYoungKids2Par <dbl>, PctTeen2Par <dbl>,
# PctWorkMomYoungKids <dbl>, PctWorkMom <dbl>,
# NumKidsBornNeverMar <dbl>, PctKidsBornNeverMar <dbl>, NumImmig <dbl>,
# PctImmigRecent <dbl>, PctImmigRec5 <dbl>, PctImmigRec8 <dbl>,
# PctImmigRec10 <dbl>, PctRecentImmig <dbl>, PctRecImmig5 <dbl>,
# PctRecImmig8 <dbl>, PctRecImmig10 <dbl>, PctSpeakEnglOnly <dbl>,
# PctNotSpeakEnglWell <dbl>, PctLargHouseFam <dbl>,
# PctLargHouseOccup <dbl>, PersPerOccupHous <dbl>,
# PersPerOwnOccHous <dbl>, PersPerRentOccHous <dbl>,
# PctPersOwnOccup <dbl>, PctPersDenseHous <dbl>, PctHousLess3BR <dbl>,
# MedNumBR <dbl>, HousVacant <dbl>, PctHousOccup <dbl>,
# PctHousOwnOcc <dbl>, PctVacantBoarded <dbl>, PctVacMore6Mos <dbl>,
# MedYrHousBuilt <dbl>, PctHousNoPhone <dbl>, PctWOFullPlumb <dbl>,
# OwnOccLowQuart <dbl>, OwnOccMedVal <dbl>, OwnOccHiQuart <dbl>,
# OwnOccQrange <dbl>, RentLowQ <dbl>, RentMedian <dbl>, RentHighQ <dbl>,
# RentQrange <dbl>, MedRent <dbl>, MedRentPctHousInc <dbl>,
# MedOwnCostPctInc <dbl>, MedOwnCostPctIncNoMtg <dbl>,
# NumInShelters <dbl>, NumStreet <dbl>, PctForeignBorn <dbl>,
# PctBornSameState <dbl>, PctSameHouse85 <dbl>, PctSameCity85 <dbl>,
# PctSameState85 <dbl>, LandArea <dbl>, PopDens <dbl>,
# PctUsePubTrans <dbl>, LemasPctOfficDrugUn <dbl>
```

- Print the numbers of rows and columns of each data set using the `dim()` function.

`dim(murders_crime)`

`[1] 1823 102`

- Look at the names of the data frame with the `names()` function.

`names(murders_crime)`

```
[1] "murders" "population"
[3] "householdsize" "racepctblack"
[5] "racePctWhite" "racePctAsian"
[7] "racePctHisp" "agePct12t21"
[9] "agePct12t29" "agePct16t24"
[11] "agePct65up" "numbUrban"
[13] "pctUrban" "medIncome"
[15] "pctWWage" "pctWFarmSelf"
[17] "pctWInvInc" "pctWSocSec"
[19] "pctWPubAsst" "pctWRetire"
[21] "medFamInc" "perCapInc"
[23] "whitePerCap" "blackPerCap"
[25] "indianPerCap" "AsianPerCap"
[27] "HispPerCap" "NumUnderPov"
[29] "PctPopUnderPov" "PctLess9thGrade"
[31] "PctNotHSGrad" "PctBSorMore"
[33] "PctUnemployed" "PctEmploy"
[35] "PctEmplManu" "PctEmplProfServ"
[37] "PctOccupManu" "PctOccupMgmtProf"
[39] "MalePctDivorce" "MalePctNevMarr"
[41] "FemalePctDiv" "TotalPctDiv"
[43] "PersPerFam" "PctFam2Par"
[45] "PctKids2Par" "PctYoungKids2Par"
[47] "PctTeen2Par" "PctWorkMomYoungKids"
[49] "PctWorkMom" "NumKidsBornNeverMar"
[51] "PctKidsBornNeverMar" "NumImmig"
[53] "PctImmigRecent" "PctImmigRec5"
[55] "PctImmigRec8" "PctImmigRec10"
[57] "PctRecentImmig" "PctRecImmig5"
[59] "PctRecImmig8" "PctRecImmig10"
[61] "PctSpeakEnglOnly" "PctNotSpeakEnglWell"
[63] "PctLargHouseFam" "PctLargHouseOccup"
[65] "PersPerOccupHous" "PersPerOwnOccHous"
[67] "PersPerRentOccHous" "PctPersOwnOccup"
[69] "PctPersDenseHous" "PctHousLess3BR"
[71] "MedNumBR" "HousVacant"
[73] "PctHousOccup" "PctHousOwnOcc"
[75] "PctVacantBoarded" "PctVacMore6Mos"
[77] "MedYrHousBuilt" "PctHousNoPhone"
[79] "PctWOFullPlumb" "OwnOccLowQuart"
[81] "OwnOccMedVal" "OwnOccHiQuart"
[83] "OwnOccQrange" "RentLowQ"
[85] "RentMedian" "RentHighQ"
[87] "RentQrange" "MedRent"
[89] "MedRentPctHousInc" "MedOwnCostPctInc"
[91] "MedOwnCostPctIncNoMtg" "NumInShelters"
[93] "NumStreet" "PctForeignBorn"
[95] "PctBornSameState" "PctSameHouse85"
[97] "PctSameCity85" "PctSameState85"
[99] "LandArea" "PopDens"
[101] "PctUsePubTrans" "LemasPctOfficDrugUn"
```

- Open the data set in a new window using `View()`. Does it look OK?

`View(murders_crime)`

- Again, before you do anything you need to make sure that you separate a hold-out data set for later. Create `murders_train` and `murders_test` using `createDataPartition()` with as little as **25% of cases going into the training set**. Also store the variable `murders` from the test set as a factor, which will be the criterion.

```
# split index
train_index <- createDataPartition(murders_crime$murders, p = .25, list = FALSE)
# train and test sets
murders_train <- murders_crime %>% slice(train_index)
murders_test <- murders_crime %>% slice(-train_index)
# test criterion
criterion <- as.factor(murders_test$murders)
```

OK, with the training set, let’s get to work and remove some features.

- First split the training data into a data frame holding the predictors and the criterion using the code below.

```
# Select predictors
murders_train_pred <- murders_train %>% select(-murders)
# Select criterion
murders_train_crit <- murders_train %>% select(murders) %>% pull()
```

- Test whether there are any excessively correlated features with `cor()` and `findCorrelation()` using the code below. Are there any this time?

```
# determine correlation matrix
corr_matrix <- cor(murders_train_pred)
# find excessively correlated variables
findCorrelation(corr_matrix)
```

```
[1] 11 17 27 30 37 40 41 44 48 53 54 55 57 58 59 60 61 64 68 71 80 81 84
[24] 85 87 93 7 8 14 13 20 21 43 1 49 62 67 51 91
```
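The output of `findCorrelation()` is a vector of column indices. To see which feature names they refer to, you can index into the predictor set:

```
# map the indices returned by findCorrelation() to feature names
names(murders_train_pred)[findCorrelation(corr_matrix)]
```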

- Remove the excessively correlated features from the training predictor set.

```
# remove features
murders_train_pred <- murders_train_pred %>% select(- XX)
```

```
# remove features
murders_train_pred <- murders_train_pred %>%
select(-findCorrelation(corr_matrix))
```

- Test if there are any near-zero variance features. Any of those this time?

```
# find near zero variance predictors
nearZeroVar(murders_train_pred)
```

`integer(0)`

- There were plenty of excessively correlated features but no near-zero variance predictors. Bind the reduced predictor set together with the criterion into a new, clean version of the training set.

```
# clean training set
murders_train_clean <- murders_train_pred %>%
add_column(murders = XX)
```

```
# combine clean predictor set with criterion
murders_train_clean <- murders_train_pred %>%
add_column(murders = murders_train_crit)
```

- Fit a `glm` twice, once using the original training set and once using the clean training set, and store the fits in separate objects.

```
# fit glm
murders_glm <- train(murders ~ .,
                     data = XX,
                     method = "glm")
# fit glm with clean data
murders_glm_clean <- train(murders ~ .,
                           data = XX,
                           method = "glm")
```

```
# fit glm
murders_glm <- train(murders ~ .,
                     data = murders_train,
                     method = "glm")
# fit glm with clean data
murders_glm_clean <- train(murders ~ .,
                           data = murders_train_clean,
                           method = "glm")
```

- You have probably noticed warning messages. They arise because the features in both data sets, but especially in the non-clean set, are still too highly correlated. Go ahead and evaluate the fits on the hold-out set. Which set of features predicts better?

```
# determine predictions for test data
murders_pred <- predict(murders_glm, newdata = murders_test)
murders_clean_pred <- predict(murders_glm_clean, newdata = murders_test)
# evaluate the results
confusionMatrix(murders_pred, criterion)
```

```
Confusion Matrix and Statistics
Reference
Prediction no yes
no 443 202
yes 187 535
Accuracy : 0.715
95% CI : (0.691, 0.739)
No Information Rate : 0.539
P-Value [Acc > NIR] : <2e-16
Kappa : 0.428
Mcnemar's Test P-Value : 0.478
Sensitivity : 0.703
Specificity : 0.726
Pos Pred Value : 0.687
Neg Pred Value : 0.741
Prevalence : 0.461
Detection Rate : 0.324
Detection Prevalence : 0.472
Balanced Accuracy : 0.715
'Positive' Class : no
```

`confusionMatrix(murders_clean_pred, criterion)`

```
Confusion Matrix and Statistics
Reference
Prediction no yes
no 461 192
yes 169 545
Accuracy : 0.736
95% CI : (0.712, 0.759)
No Information Rate : 0.539
P-Value [Acc > NIR] : <2e-16
Kappa : 0.47
Mcnemar's Test P-Value : 0.247
Sensitivity : 0.732
Specificity : 0.739
Pos Pred Value : 0.706
Neg Pred Value : 0.763
Prevalence : 0.461
Detection Rate : 0.337
Detection Prevalence : 0.478
Balanced Accuracy : 0.736
'Positive' Class : no
```

- Given the high feature correlations, it is sensible to compress the data using `principal component analysis` (PCA). Create a third fit object using the original training set with `preProcess = c('pca')` and `trControl = trainControl(preProcOptions = list(thresh = 0.8))` in the training function. The first argument tells R to use PCA, the second specifies that enough components should be retained to capture 80% of the variance.

```
# fit glm with preprocessed features
murders_glm_pca <- train(murders ~ .,
                         data = murders_train,
                         method = "glm",
                         preProcess = c("pca"),
                         trControl = trainControl(preProcOptions = list(thresh = 0.8)))
```
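If you are curious how many principal components were needed to reach the 80% threshold, the pre-processing information is stored in the fit object (assuming caret's usual object structure):

```
# how many components did PCA retain?
murders_glm_pca$preProcess
```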

- Compare the prediction performance to that of the previous two models.

```
# determine predictions for test data
murders_pca <- predict(murders_glm_pca, newdata = murders_test)
# evaluate the results
confusionMatrix(murders_pca, criterion)
```

```
Confusion Matrix and Statistics
Reference
Prediction no yes
no 491 179
yes 139 558
Accuracy : 0.767
95% CI : (0.744, 0.79)
No Information Rate : 0.539
P-Value [Acc > NIR] : <2e-16
Kappa : 0.534
Mcnemar's Test P-Value : 0.0287
Sensitivity : 0.779
Specificity : 0.757
Pos Pred Value : 0.733
Neg Pred Value : 0.801
Prevalence : 0.461
Detection Rate : 0.359
Detection Prevalence : 0.490
Balanced Accuracy : 0.768
'Positive' Class : no
```

- Play around: Alter the amount of variance explained by the `PCA` (using `thresh`), increase the proportion dedicated to training, or use a different model, e.g., a `random forest`, and see whether things change.
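As a starting point, here is a sketch of the first suggestion, retaining 95% instead of 80% of the variance:

```
# fit glm with PCA retaining 95% of the variance
murders_glm_pca95 <- train(murders ~ .,
                           data = murders_train,
                           method = "glm",
                           preProcess = c("pca"),
                           trControl = trainControl(preProcOptions = list(thresh = 0.95)))
# evaluate on the hold-out set
confusionMatrix(predict(murders_glm_pca95, newdata = murders_test), criterion)
```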

- Evaluate the feature importance for each of the three models.

```
# determine variable importance
varimp_glm <- varImp(murders_glm)
varimp_glm_clean <- varImp(murders_glm_clean)
varimp_glm_pca <- varImp(murders_glm_pca)
# print variable importance
varimp_glm
```

```
glm variable importance
only 20 most important variables shown (out of 99)
Overall
MedRent 100.0
RentMedian 97.2
PctWorkMomYoungKids 96.9
PctWorkMom 90.8
racepctblack 89.1
PctKidsBornNeverMar 86.5
pctWRetire 84.1
PctVacMore6Mos 82.9
PctRecImmig5 77.3
PctRecImmig8 76.1
PctRecentImmig 75.8
PctRecImmig10 69.6
PctSpeakEnglOnly 64.6
racePctWhite 62.4
PctImmigRec10 61.9
PctLargHouseFam 58.6
PctTeen2Par 53.6
pctWSocSec 52.4
HousVacant 51.8
indianPerCap 51.1
```

`varimp_glm_clean`

```
glm variable importance
only 20 most important variables shown (out of 62)
Overall
LandArea 100.0
racepctblack 62.2
racePctAsian 55.7
pctWRetire 55.2
LemasPctOfficDrugUn 51.8
PopDens 51.4
racePctHisp 50.4
MedOwnCostPctIncNoMtg 46.1
PctEmploy 45.4
racePctWhite 45.3
PctWOFullPlumb 44.8
PersPerOwnOccHous 44.0
PctHousNoPhone 42.7
PctEmplManu 41.1
PctVacMore6Mos 40.8
NumStreet 40.1
PctVacantBoarded 37.3
PctOccupManu 34.3
PctSameCity85 33.2
pctWPubAsst 32.1
```

`varimp_glm_pca`

```
glm variable importance
Overall
PC1 100.000
PC8 80.750
PC2 72.828
PC3 61.968
PC4 51.265
PC5 49.183
PC6 48.974
PC7 3.677
PC10 2.826
PC11 0.504
PC9 0.000
```

```
# plot variable importance
plot(varimp_glm)
```

`plot(varimp_glm_clean)`

`plot(varimp_glm_pca)`

- Can you select a set of features based on feature importance that reliably beats predictions based on the PCA-compressed training set?
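One possible starting point (a sketch, not the only answer) is to take the most important features from the cleaned fit, here arbitrarily the top ten, and refit the model using only those:

```
# select the ten most important features of the cleaned glm
top_features <- varimp_glm_clean$importance %>%
  rownames_to_column("feature") %>%   # feature names are stored as row names
  arrange(desc(Overall)) %>%
  slice(1:10) %>%
  pull(feature)
# refit using only these features
murders_glm_top <- train(murders ~ .,
                         data = murders_train_clean %>% select(all_of(top_features), murders),
                         method = "glm")
# evaluate on the hold-out set
confusionMatrix(predict(murders_glm_top, newdata = murders_test), criterion)
```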

Run similar analyses for the violent and non-violent crime data sets, predicting either the number of violent crimes per 100k inhabitants (`ViolentCrimesPerPop`) or the number of non-violent crimes per 100k inhabitants (`nonViolPerPop`). This time the criterion is numeric, implying regression rather than classification. The features are identical to the murders analysis of the previous section.

Use the computer to find a good set of features using recursive feature elimination with `rfe()`.

```
# split index
train_index <- createDataPartition(violent_crime$ViolentCrimesPerPop,
                                   p = .8,
                                   list = FALSE)
# train and test sets
violent_train <- violent_crime %>% slice(train_index)
violent_test <- violent_crime %>% slice(-train_index)
# remove extreme correlations (OPTIONAL)
predictors <- violent_train %>% select(-ViolentCrimesPerPop)
predictors <- predictors %>% select(-findCorrelation(cor(predictors)))
violent_train_clean <- predictors %>%
  add_column(ViolentCrimesPerPop = violent_train$ViolentCrimesPerPop)
# Feature elimination settings
ctrl_rfe <- rfeControl(functions = lmFuncs, # linear model
                       method = "cv",
                       verbose = FALSE,
                       rerank = FALSE)
# Run feature elimination
profile <- rfe(x = violent_train %>% select(-ViolentCrimesPerPop),
               y = violent_train$ViolentCrimesPerPop,
               sizes = 1:(ncol(violent_train_clean) - 1), # feature set sizes to test
               rfeControl = ctrl_rfe)
# inspect cross-validated performance as a function of the number of features
plot(profile)
```
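Once the elimination has run, the returned object also tells you which subset performed best; for example:

```
# summary of the elimination, including the best subset size
profile
# names of the features in the best subset
predictors(profile)
```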

```
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(tibble) # For advanced tibble functions
library(caret) # For ML mastery
# Step 1: Load, prepare, and explore data ----------------------
# read data
data <- read_csv("1_Data/mpg_num.csv")
# Convert all characters to factors
data <- data %>%
mutate_if(is.character, factor)
# Explore training data
data # Print the dataset
dim(data) # Print dimensions
names(data) # Print the names
# Step 2: Create training and test sets -------------
# Create train index
train_index <- createDataPartition(data$hwy,
p = .8,
list = FALSE)
# Create training and test sets
data_train <- data %>% slice(train_index)
data_test <- data %>% slice(-train_index)
# split predictors and criterion
criterion_train <- data_train %>% select(hwy) %>% pull()
predictors_train <- data_train %>% select(-hwy)
criterion_test <- data_test %>% select(hwy) %>% pull()
predictors_test <- data_test %>% select(-hwy)
# Step 3: Clean data -------------
# Test for excessively correlated features
corr_matrix <- cor(predictors_train)
corr_features <- findCorrelation(corr_matrix)
# Remove excessively correlated features
predictors_train <- predictors_train %>% select(-corr_features)
# Test for near zero variance features
zerovar_features <- nearZeroVar(predictors_train)
# Remove near zero variance features
predictors_train <- predictors_train %>% select(-zerovar_features)
# recombine data
data_train <- predictors_train %>% add_column(hwy = criterion_train)
# Step 4: Define training control parameters -------------
# Train using cross-validation
ctrl_cv <- trainControl(method = "cv")
# Step 5: Fit models -------------
# Fit glm vanilla flavor
hwy_glm <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv)
# Fit with pca transformation
hwy_glm_pca <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv,
preProcess = c('pca'))
# Fit scaling and centering
hwy_glm_sca <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv,
preProcess = c('scale', 'center'))
# Get fits
glm_fit <- predict(hwy_glm)
glm_pca_fit <- predict(hwy_glm_pca)
glm_sca_fit <- predict(hwy_glm_sca)
# Step 6: Evaluate variable importance -------------
# Run varImp()
imp_glm <- varImp(hwy_glm)
imp_glm_pca <- varImp(hwy_glm_pca)
imp_glm_sca <- varImp(hwy_glm_sca)
# Plot variable importance
plot(imp_glm)
plot(imp_glm_pca)
plot(imp_glm_sca)
# Step 7: Select variables -------------
# Select by hand in formula
hwy_glm_sel <- train(form = hwy ~ cty,
data = data_train,
method = "glm",
trControl = ctrl_cv)
# Select by reducing pca criterion ---
# Retain only enough components to explain 50% of the variance
ctrl_cv_pca <- trainControl(method = "cv",
preProcOptions = list(thresh = 0.50))
# Refit model with update
hwy_glm_sel <- train(form = hwy ~ .,
data = data_train,
method = "glm",
trControl = ctrl_cv_pca,
preProcess = c('pca'))
# Step 8: Recursive feature elimination -------------
# Feature elimination settings
ctrl_rfe <- rfeControl(functions = lmFuncs, # linear model
method = "cv",
verbose = FALSE)
# Run feature elimination
profile <- rfe(x = predictors_train,
y = criterion_train,
sizes = c(1, 2, 3), # feature set sizes to consider
rfeControl = ctrl_rfe)
# plot result
trellis.par.set(caretTheme())
plot(profile, type = c("g", "o"))
# Step 9: Evaluate models -------------
# you know how...
```

File | Rows | Columns |
---|---|---|
pima_diabetes | 724 | 7 |
murders_crime | 1000 | 102 |
violent_crime | 1000 | 102 |
nonviolent_crime | 1000 | 102 |

The `pima_diabetes` data set is a subset of the `PimaIndiansDiabetes2` data set in the `mlbench` package. It contains medical and demographic data of Pima Indians.

The `murders_crime`, `violent_crime`, and `nonviolent_crime` data sets are subsets of the Communities and Crime Unnormalized Data Set from the UCI Machine Learning Repository. To see column descriptions, visit this site: Communities and Crime Unnormalized Data Set. Due to the large number of variables (102), we do not include the full tables below.

`pima_diabetes`

Name | Description |
---|---|
`pregnant` | Number of times pregnant. |
`glucose` | Plasma glucose concentration (glucose tolerance test). |
`pressure` | Diastolic blood pressure (mm Hg). |
`triceps` | Triceps skin fold thickness (mm). |
`insulin` | 2-hour serum insulin (mu U/ml). |
`mass` | Body mass index (weight in kg/(height in m)^2). |
`pedigree` | Diabetes pedigree function. |
`age` | Age (years). |
`diabetes` | Class variable (test for diabetes). |

Package | Installation |
---|---|
`tidyverse` | `install.packages("tidyverse")` |
`tibble` | `install.packages("tibble")` |
`caret` | `install.packages("caret")` |

Function | Package | Description |
---|---|---|
`trainControl()` | `caret` | Define modelling control parameters |
`train()` | `caret` | Train a model |
`predict(object, newdata)` | `stats` | Predict the criterion values of `newdata` based on `object` |
`postResample()` | `caret` | Calculate aggregate model performance in regression tasks |
`confusionMatrix()` | `caret` | Calculate aggregate model performance in classification tasks |
`varImp()` | `caret` | Determine the model-specific importance of features |
`findCorrelation()`, `nearZeroVar()` | `caret` | Identify highly correlated and low-variance features |
`rfe()`, `rfeControl()` | `caret` | Run and control recursive feature elimination |