
Overview

By the end of this practical you will:

  1. Understand the importance of the curse of dimensionality.
  2. Know how to eliminate unwanted features.
  3. Explore and use feature importance.
  4. Use dimensionality reduction.

Tasks

A - Setup

  1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data file(s) listed in the Datasets section are in your 1_Data folder.

  2. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called Features_practical.R in the 2_Code folder.

  3. Using library() load the set of packages for this practical listed in the packages section above.

# Load packages necessary for this script
library(tidyverse)
library(caret)
  4. In the code below, we will load each of the data sets listed in the Datasets section as new objects.
# Pima Indians diabetes
pima_diabetes <- read_csv(file = "1_Data/pima_diabetes.csv")

# (Non-) violent crime statistics
violent_crime    <- read_csv(file = "1_Data/violent_crime.csv")
nonviolent_crime <- read_csv(file = "1_Data/nonviolent_crime.csv")

# murders crime statistics
murders_crime <- read_csv(file = "1_Data/murders_crime.csv")

B - Pima Indians Diabetes

In this section, you will explore feature selection for the Pima Indians Diabetes data set. The Pima are a group of Native Americans living in Arizona. A genetic predisposition allowed this group to survive on a diet poor in carbohydrates for years. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, has given them the highest prevalence of type 2 diabetes. For this reason, they have been the subject of many studies.

  1. Take a look at the first few rows of the pima_diabetes data frame by printing them to the console.
pima_diabetes
# A tibble: 724 x 7
   diabetes pregnant glucose pressure  mass pedigree   age
   <chr>       <dbl>   <dbl>    <dbl> <dbl>    <dbl> <dbl>
 1 pos             6     148       72  33.6    0.627    50
 2 neg             1      85       66  26.6    0.351    31
 3 pos             8     183       64  23.3    0.672    32
 4 neg             1      89       66  28.1    0.167    21
 5 pos             0     137       40  43.1    2.29     33
 6 neg             5     116       74  25.6    0.201    30
 7 pos             3      78       50  31      0.248    26
 8 pos             2     197       70  30.5    0.158    53
 9 neg             4     110       92  37.6    0.191    30
10 pos            10     168       74  38      0.537    34
# … with 714 more rows
  2. Print the numbers of rows and columns of each data set using the dim() function.
dim(pima_diabetes)
[1] 724   7
  3. Look at the names of the data frame with the names() function.
names(pima_diabetes)
[1] "diabetes" "pregnant" "glucose"  "pressure" "mass"     "pedigree"
[7] "age"     
  4. Open the data set in a new window using View(). Does it look OK?
View(pima_diabetes)

Splitting

  1. As always, before you do anything you need to make sure that you separate a hold-out data set for later. Create pima_train and pima_test using createDataPartition() with as little as 15% of cases going into the training set. Also store the variable diabetes from the test set as a factor, which will be the criterion.
# split index
train_index <- createDataPartition(XX$XX, p = .15, list = FALSE)

# train and test sets
pima_train <- XX %>% slice(train_index)
pima_test  <- XX %>% slice(-train_index)

# test criterion
criterion <- as.factor(pima_test$XX)
# split index
train_index <- createDataPartition(pima_diabetes$diabetes, p = .15, list = FALSE)

# train and test sets
pima_train <- pima_diabetes %>% slice(train_index)
pima_test  <- pima_diabetes %>% slice(-train_index)

# test criterion
criterion <- as.factor(pima_test$diabetes)

Remove unwanted features

OK, with the training set, let’s get to work and remove some features.

  1. First split the training data into a data frame holding the predictors and a vector holding the criterion, using the code below.
# Select predictors
pima_train_pred <- pima_train %>% select(-XX)

# Select criterion
pima_train_crit <- pima_train %>% select(XX) %>% pull()
# Select predictors
pima_train_pred <- pima_train %>% select(-diabetes)

# Select criterion
pima_train_crit <- pima_train %>% select(diabetes) %>% pull()
  2. Although this data set is rather small and well curated, test whether there are any excessively correlated features using cor() and findCorrelation() with the code below. Are there any?
# determine correlation matrix
corr_matrix <- cor(XX_pred)

# find excessively correlated variables
findCorrelation(corr_matrix)
# determine correlation matrix
corr_matrix <- cor(pima_train_pred)

# find excessively correlated variables
findCorrelation(corr_matrix)
integer(0)
  3. Now test whether there are any near-zero variance features using nearZeroVar(). Any of those?
# find near zero variance predictors
nearZeroVar(XX_pred)
# find near zero variance predictors
nearZeroVar(pima_train_pred)
integer(0)

Feature importance

Having retained all features in the previous section, this section explores feature selection on the grounds of feature importance. To do this, we first need to fit our model. How about a simple logistic regression, a.k.a. method = "glm"?

  1. Fit the glm to the training data.
# fit regression
pima_glm <- train(diabetes ~ .,
                data = XX,
                method = XX
                )
# fit regression
pima_glm <- train(diabetes ~ .,
                data = pima_train,
                method = "glm")
  2. Evaluate feature importance using varImp(). The function shows importance on a scale from 0 (least important feature) to 100 (most important feature). You can set scale = FALSE to see the absolute importance measures, which for a glm are absolute t-values (see the sketch after the output below).
# determine variable importance
varimp_glm <- varImp(XX)

# print variable importance
varimp_glm

# plot variable importance
plot(varimp_glm)
# determine variable importance
varimp_glm <- varImp(pima_glm)

# print variable importance
varimp_glm
glm variable importance

         Overall
glucose   100.00
mass       35.04
pregnant   32.81
pedigree    9.12
pressure    1.81
age         0.00
# plot variable importance
plot(varimp_glm)
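
If you want the unscaled measures, a minimal sketch, reusing the pima_glm fit from above:

# show absolute importance measures (absolute t-values for a glm)
varImp(pima_glm, scale = FALSE)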

Model comparison

  1. Fit the glm a second time using only the four best features and store the result in a different fit object.
# fit glm with best four features
pima_glm4 <- train(diabetes ~ XX + YY + ZZ + AA,
                data = XX,
                method = XX)
# fit glm with best four features
pima_glm4 <- train(diabetes ~ glucose + mass + pregnant + pedigree,
                data = pima_train,
                method = "glm")
  2. Using both fits, all features versus the best four, predict the criterion and evaluate the predictions using confusionMatrix(). Which glm is better?
# determine predictions for test data
pima_glm_pred <- predict(XX, newdata = XX)
pima_glm4_pred <- predict(XX, newdata = XX)

# evaluate the results
confusionMatrix(XX, reference = XX)
confusionMatrix(XX, reference = XX)
# determine predictions for test data
pima_glm_pred <- predict(pima_glm, newdata = pima_test)
pima_glm4_pred <- predict(pima_glm4, newdata = pima_test)

# evaluate the results
confusionMatrix(pima_glm_pred, criterion)
Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 352  93
       pos  51 118
                                       
               Accuracy : 0.765        
                 95% CI : (0.73, 0.798)
    No Information Rate : 0.656        
    P-Value [Acc > NIR] : 2.84e-09     
                                       
                  Kappa : 0.454        
                                       
 Mcnemar's Test P-Value : 0.000634     
                                       
            Sensitivity : 0.873        
            Specificity : 0.559        
         Pos Pred Value : 0.791        
         Neg Pred Value : 0.698        
             Prevalence : 0.656        
         Detection Rate : 0.573        
   Detection Prevalence : 0.725        
      Balanced Accuracy : 0.716        
                                       
       'Positive' Class : neg          
                                       
confusionMatrix(pima_glm4_pred, criterion)
Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 356  94
       pos  47 117
                                        
               Accuracy : 0.77          
                 95% CI : (0.735, 0.803)
    No Information Rate : 0.656         
    P-Value [Acc > NIR] : 5.31e-10      
                                        
                  Kappa : 0.462         
                                        
 Mcnemar's Test P-Value : 0.000107      
                                        
            Sensitivity : 0.883         
            Specificity : 0.555         
         Pos Pred Value : 0.791         
         Neg Pred Value : 0.713         
             Prevalence : 0.656         
         Detection Rate : 0.580         
   Detection Prevalence : 0.733         
      Balanced Accuracy : 0.719         
                                        
       'Positive' Class : neg           
                                        
  3. You might have observed that the model with two fewer features is actually slightly better than the full model (if this is not the case, keep in mind that the partitioning of the data set was done randomly, i.e., if you partition again, your results may change slightly). Why do you think that is the case?

  4. Play around: Increase the proportion dedicated to training or use a different model, e.g., a random forest, and see whether things change. One possible variant is sketched below.
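
A minimal sketch of the random forest variant with a larger training proportion (assuming the randomForest package is installed; the object names pima_train_large, pima_test_large, and pima_rf are ours):

# split again, with 50% of cases going into training
train_index <- createDataPartition(pima_diabetes$diabetes, p = .5, list = FALSE)
pima_train_large <- pima_diabetes %>% slice(train_index)
pima_test_large  <- pima_diabetes %>% slice(-train_index)

# fit a random forest instead of a glm
pima_rf <- train(diabetes ~ .,
                 data = pima_train_large,
                 method = "rf")

# evaluate on the new hold-out set
confusionMatrix(predict(pima_rf, newdata = pima_test_large),
                as.factor(pima_test_large$diabetes))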

C - Murders

In this section, you will explore feature selection using a data set with very different properties. The data combine socio-economic data from the 1990 US Census, data from the Law Enforcement Management and Administrative Statistics survey, and crime data from the FBI, allowing you to analyse the relationship between various socio-demographic variables and whether murders have been committed (murders), the criterion of this exercise.

  1. Take a look at the first few rows of the murders_crime data frame by printing them to the console.
murders_crime
# A tibble: 1,823 x 102
   murders population householdsize racepctblack racePctWhite racePctAsian
   <chr>        <dbl>         <dbl>        <dbl>        <dbl>        <dbl>
 1 yes          27591          2.63         0.17         94.8         1.6 
 2 yes          36830          2.6         42.4          53.7         0.54
 3 yes          23928          2.6         11.0          81.3         1.78
 4 no           15675          2.59         4.08         84.1         0.54
 5 yes          96086          3.19         8.54         59.8        17.2 
 6 no           48622          2.44         3.14         95.5         0.77
 7 no           10444          3.02         0.41         98.9         0.39
 8 yes         222103          2.49        11.4          82.4         3.77
 9 no           15535          2.35        14.9          84.6         0.32
10 yes          24664          2.73        13.7          85.0         0.53
# … with 1,813 more rows, and 96 more variables: racePctHisp <dbl>,
#   agePct12t21 <dbl>, agePct12t29 <dbl>, agePct16t24 <dbl>,
#   agePct65up <dbl>, numbUrban <dbl>, pctUrban <dbl>, medIncome <dbl>,
#   pctWWage <dbl>, pctWFarmSelf <dbl>, pctWInvInc <dbl>,
#   pctWSocSec <dbl>, pctWPubAsst <dbl>, pctWRetire <dbl>,
#   medFamInc <dbl>, perCapInc <dbl>, whitePerCap <dbl>,
#   blackPerCap <dbl>, indianPerCap <dbl>, AsianPerCap <dbl>,
#   HispPerCap <dbl>, NumUnderPov <dbl>, PctPopUnderPov <dbl>,
#   PctLess9thGrade <dbl>, PctNotHSGrad <dbl>, PctBSorMore <dbl>,
#   PctUnemployed <dbl>, PctEmploy <dbl>, PctEmplManu <dbl>,
#   PctEmplProfServ <dbl>, PctOccupManu <dbl>, PctOccupMgmtProf <dbl>,
#   MalePctDivorce <dbl>, MalePctNevMarr <dbl>, FemalePctDiv <dbl>,
#   TotalPctDiv <dbl>, PersPerFam <dbl>, PctFam2Par <dbl>,
#   PctKids2Par <dbl>, PctYoungKids2Par <dbl>, PctTeen2Par <dbl>,
#   PctWorkMomYoungKids <dbl>, PctWorkMom <dbl>,
#   NumKidsBornNeverMar <dbl>, PctKidsBornNeverMar <dbl>, NumImmig <dbl>,
#   PctImmigRecent <dbl>, PctImmigRec5 <dbl>, PctImmigRec8 <dbl>,
#   PctImmigRec10 <dbl>, PctRecentImmig <dbl>, PctRecImmig5 <dbl>,
#   PctRecImmig8 <dbl>, PctRecImmig10 <dbl>, PctSpeakEnglOnly <dbl>,
#   PctNotSpeakEnglWell <dbl>, PctLargHouseFam <dbl>,
#   PctLargHouseOccup <dbl>, PersPerOccupHous <dbl>,
#   PersPerOwnOccHous <dbl>, PersPerRentOccHous <dbl>,
#   PctPersOwnOccup <dbl>, PctPersDenseHous <dbl>, PctHousLess3BR <dbl>,
#   MedNumBR <dbl>, HousVacant <dbl>, PctHousOccup <dbl>,
#   PctHousOwnOcc <dbl>, PctVacantBoarded <dbl>, PctVacMore6Mos <dbl>,
#   MedYrHousBuilt <dbl>, PctHousNoPhone <dbl>, PctWOFullPlumb <dbl>,
#   OwnOccLowQuart <dbl>, OwnOccMedVal <dbl>, OwnOccHiQuart <dbl>,
#   OwnOccQrange <dbl>, RentLowQ <dbl>, RentMedian <dbl>, RentHighQ <dbl>,
#   RentQrange <dbl>, MedRent <dbl>, MedRentPctHousInc <dbl>,
#   MedOwnCostPctInc <dbl>, MedOwnCostPctIncNoMtg <dbl>,
#   NumInShelters <dbl>, NumStreet <dbl>, PctForeignBorn <dbl>,
#   PctBornSameState <dbl>, PctSameHouse85 <dbl>, PctSameCity85 <dbl>,
#   PctSameState85 <dbl>, LandArea <dbl>, PopDens <dbl>,
#   PctUsePubTrans <dbl>, LemasPctOfficDrugUn <dbl>
  2. Print the numbers of rows and columns of each data set using the dim() function.
dim(murders_crime)
[1] 1823  102
  3. Look at the names of the data frame with the names() function.
names(murders_crime)
  [1] "murders"               "population"           
  [3] "householdsize"         "racepctblack"         
  [5] "racePctWhite"          "racePctAsian"         
  [7] "racePctHisp"           "agePct12t21"          
  [9] "agePct12t29"           "agePct16t24"          
 [11] "agePct65up"            "numbUrban"            
 [13] "pctUrban"              "medIncome"            
 [15] "pctWWage"              "pctWFarmSelf"         
 [17] "pctWInvInc"            "pctWSocSec"           
 [19] "pctWPubAsst"           "pctWRetire"           
 [21] "medFamInc"             "perCapInc"            
 [23] "whitePerCap"           "blackPerCap"          
 [25] "indianPerCap"          "AsianPerCap"          
 [27] "HispPerCap"            "NumUnderPov"          
 [29] "PctPopUnderPov"        "PctLess9thGrade"      
 [31] "PctNotHSGrad"          "PctBSorMore"          
 [33] "PctUnemployed"         "PctEmploy"            
 [35] "PctEmplManu"           "PctEmplProfServ"      
 [37] "PctOccupManu"          "PctOccupMgmtProf"     
 [39] "MalePctDivorce"        "MalePctNevMarr"       
 [41] "FemalePctDiv"          "TotalPctDiv"          
 [43] "PersPerFam"            "PctFam2Par"           
 [45] "PctKids2Par"           "PctYoungKids2Par"     
 [47] "PctTeen2Par"           "PctWorkMomYoungKids"  
 [49] "PctWorkMom"            "NumKidsBornNeverMar"  
 [51] "PctKidsBornNeverMar"   "NumImmig"             
 [53] "PctImmigRecent"        "PctImmigRec5"         
 [55] "PctImmigRec8"          "PctImmigRec10"        
 [57] "PctRecentImmig"        "PctRecImmig5"         
 [59] "PctRecImmig8"          "PctRecImmig10"        
 [61] "PctSpeakEnglOnly"      "PctNotSpeakEnglWell"  
 [63] "PctLargHouseFam"       "PctLargHouseOccup"    
 [65] "PersPerOccupHous"      "PersPerOwnOccHous"    
 [67] "PersPerRentOccHous"    "PctPersOwnOccup"      
 [69] "PctPersDenseHous"      "PctHousLess3BR"       
 [71] "MedNumBR"              "HousVacant"           
 [73] "PctHousOccup"          "PctHousOwnOcc"        
 [75] "PctVacantBoarded"      "PctVacMore6Mos"       
 [77] "MedYrHousBuilt"        "PctHousNoPhone"       
 [79] "PctWOFullPlumb"        "OwnOccLowQuart"       
 [81] "OwnOccMedVal"          "OwnOccHiQuart"        
 [83] "OwnOccQrange"          "RentLowQ"             
 [85] "RentMedian"            "RentHighQ"            
 [87] "RentQrange"            "MedRent"              
 [89] "MedRentPctHousInc"     "MedOwnCostPctInc"     
 [91] "MedOwnCostPctIncNoMtg" "NumInShelters"        
 [93] "NumStreet"             "PctForeignBorn"       
 [95] "PctBornSameState"      "PctSameHouse85"       
 [97] "PctSameCity85"         "PctSameState85"       
 [99] "LandArea"              "PopDens"              
[101] "PctUsePubTrans"        "LemasPctOfficDrugUn"  
  4. Open the data set in a new window using View(). Does it look OK?
View(murders_crime)

Splitting

  1. Again, before you do anything you need to make sure that you separate a hold-out data set for later. Create murders_train and murders_test using createDataPartition() with as little as 25% of cases going into the training set. Also store the variable murders from the test set as a factor, which will be the criterion.
# split index
train_index <- createDataPartition(murders_crime$murders, p = .25, list = FALSE)

# train and test sets
murders_train <- murders_crime %>% slice(train_index)
murders_test  <- murders_crime %>% slice(-train_index)

# test criterion
criterion <- as.factor(murders_test$murders)

Remove unwanted features

OK, with the training set, let’s get to work and remove some features.

  1. First split the training data into a data frame holding the predictors and a vector holding the criterion, using the code below.
# Select predictors
murders_train_pred <- murders_train %>% select(-murders)

# Select criterion
murders_train_crit <- murders_train %>% select(murders) %>% pull()
  2. Test whether there are any excessively correlated features using cor() and findCorrelation() with the code below. Are there any this time?
# determine correlation matrix
corr_matrix <- cor(murders_train_pred)

# find excessively correlated variables
findCorrelation(corr_matrix)
 [1] 11 17 27 30 37 40 41 44 48 53 54 55 57 58 59 60 61 64 68 71 80 81 84
[24] 85 87 93  7  8 14 13 20 21 43  1 49 62 67 51 91
  3. Remove the excessively correlated features from the training predictor set.
# remove features
murders_train_pred <- murders_train_pred %>% select(- XX)
# remove features
murders_train_pred <- murders_train_pred %>% 
  select(-findCorrelation(corr_matrix))
  4. Now test whether there are any near-zero variance features. Any of those this time?
# find near zero variance predictors
nearZeroVar(murders_train_pred)
integer(0)
  5. There were plenty of excessively correlated features but no near-zero variance predictors. Bind the reduced predictor set together with the criterion into a new, clean version of the training set.
# clean training set
murders_train_clean <- murders_train_pred %>% 
  add_column(murders = XX)
# combine clean predictor set with criterion
murders_train_clean <- murders_train_pred %>% 
  add_column(murders = murders_train_crit)

Model comparison

  1. Fit a glm twice, once using the original training set and once using the clean training set, and store the fits in separate objects.
# fit glm
murders_glm <- train(murders ~ .,
                     data = XX,
                     method = "glm")

# fit glm
murders_glm_clean <- train(murders ~ .,
                           data = XX,
                           method = "glm")
# fit glm
murders_glm <- train(murders ~ .,
                     data = murders_train,
                     method = "glm")

# fit glm
murders_glm_clean <- train(murders ~ .,
                           data = murders_train_clean,
                           method = "glm")
  2. You have probably noticed warning messages. They stem from the fact that the features in both data sets, but especially in the non-clean set, are still too highly correlated. Go ahead and evaluate the fits on the hold-out set. Which set of features predicts better?
# determine predictions for test data
murders_pred <- predict(murders_glm, newdata = murders_test)
murders_clean_pred <- predict(murders_glm_clean, newdata = murders_test)

# evaluate the results
confusionMatrix(murders_pred, criterion)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  443 202
       yes 187 535
                                        
               Accuracy : 0.715         
                 95% CI : (0.691, 0.739)
    No Information Rate : 0.539         
    P-Value [Acc > NIR] : <2e-16        
                                        
                  Kappa : 0.428         
                                        
 Mcnemar's Test P-Value : 0.478         
                                        
            Sensitivity : 0.703         
            Specificity : 0.726         
         Pos Pred Value : 0.687         
         Neg Pred Value : 0.741         
             Prevalence : 0.461         
         Detection Rate : 0.324         
   Detection Prevalence : 0.472         
      Balanced Accuracy : 0.715         
                                        
       'Positive' Class : no            
                                        
confusionMatrix(murders_clean_pred, criterion)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  461 192
       yes 169 545
                                        
               Accuracy : 0.736         
                 95% CI : (0.712, 0.759)
    No Information Rate : 0.539         
    P-Value [Acc > NIR] : <2e-16        
                                        
                  Kappa : 0.47          
                                        
 Mcnemar's Test P-Value : 0.247         
                                        
            Sensitivity : 0.732         
            Specificity : 0.739         
         Pos Pred Value : 0.706         
         Neg Pred Value : 0.763         
             Prevalence : 0.461         
         Detection Rate : 0.337         
   Detection Prevalence : 0.478         
      Balanced Accuracy : 0.736         
                                        
       'Positive' Class : no            
                                        

Data compression with PCA

  1. Given the high feature correlations, it is sensible to compress the data using principal component analysis (PCA). Create a third fit object using the original training set with preProcess = c('pca') and trControl = trainControl(preProcOptions = list(thresh = 0.8)) in the training function. The first argument tells R to use PCA; the second specifies that enough components should be retained to account for 80% of the variance.
# fit glm with preprocessed features
murders_glm_pca = train(murders ~ .,
                        data = murders_train,
                        method = "glm",
                        preProcess = c("pca"),
                        trControl = trainControl(preProcOptions = list(thresh = 0.8)))
  2. Compare the prediction performance to that of the previous two models.
# determine predictions for test data
murders_pca <- predict(murders_glm_pca, newdata = murders_test)

# evaluate the results
confusionMatrix(murders_pca, criterion)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  491 179
       yes 139 558
                                       
               Accuracy : 0.767        
                 95% CI : (0.744, 0.79)
    No Information Rate : 0.539        
    P-Value [Acc > NIR] : <2e-16       
                                       
                  Kappa : 0.534        
                                       
 Mcnemar's Test P-Value : 0.0287       
                                       
            Sensitivity : 0.779        
            Specificity : 0.757        
         Pos Pred Value : 0.733        
         Neg Pred Value : 0.801        
             Prevalence : 0.461        
         Detection Rate : 0.359        
   Detection Prevalence : 0.490        
      Balanced Accuracy : 0.768        
                                       
       'Positive' Class : no           
                                       
  3. Play around: Alter the amount of variance retained by the PCA (using thresh), increase the proportion dedicated to training, or use a different model, e.g., a random forest, and see whether things change. A sketch of the first variant follows below.
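
For instance, a minimal sketch that retains only 50% of the variance (the object name murders_glm_pca50 is ours):

# fit glm with a stricter PCA threshold: retain 50% of the variance
murders_glm_pca50 <- train(murders ~ .,
                           data = murders_train,
                           method = "glm",
                           preProcess = c("pca"),
                           trControl = trainControl(preProcOptions = list(thresh = 0.5)))

# compare against the 80% threshold fit
confusionMatrix(predict(murders_glm_pca50, newdata = murders_test), criterion)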

Feature importance

  1. Evaluate the feature importance for each of the three models.
# determine variable importance
varimp_glm <- varImp(murders_glm)
varimp_glm_clean <- varImp(murders_glm_clean)
varimp_glm_pca <- varImp(murders_glm_pca)

# print variable importance
varimp_glm
glm variable importance

  only 20 most important variables shown (out of 99)

                    Overall
MedRent               100.0
RentMedian             97.2
PctWorkMomYoungKids    96.9
PctWorkMom             90.8
racepctblack           89.1
PctKidsBornNeverMar    86.5
pctWRetire             84.1
PctVacMore6Mos         82.9
PctRecImmig5           77.3
PctRecImmig8           76.1
PctRecentImmig         75.8
PctRecImmig10          69.6
PctSpeakEnglOnly       64.6
racePctWhite           62.4
PctImmigRec10          61.9
PctLargHouseFam        58.6
PctTeen2Par            53.6
pctWSocSec             52.4
HousVacant             51.8
indianPerCap           51.1
varimp_glm_clean
glm variable importance

  only 20 most important variables shown (out of 62)

                      Overall
LandArea                100.0
racepctblack             62.2
racePctAsian             55.7
pctWRetire               55.2
LemasPctOfficDrugUn      51.8
PopDens                  51.4
racePctHisp              50.4
MedOwnCostPctIncNoMtg    46.1
PctEmploy                45.4
racePctWhite             45.3
PctWOFullPlumb           44.8
PersPerOwnOccHous        44.0
PctHousNoPhone           42.7
PctEmplManu              41.1
PctVacMore6Mos           40.8
NumStreet                40.1
PctVacantBoarded         37.3
PctOccupManu             34.3
PctSameCity85            33.2
pctWPubAsst              32.1
varimp_glm_pca
glm variable importance

     Overall
PC1  100.000
PC8   80.750
PC2   72.828
PC3   61.968
PC4   51.265
PC5   49.183
PC6   48.974
PC7    3.677
PC10   2.826
PC11   0.504
PC9    0.000
# print variable importance
plot(varimp_glm)

plot(varimp_glm_clean)

plot(varimp_glm_pca)

  2. Can you select a set of features based on feature importance that reliably beats predictions based on the PCA-compressed training set? One possible starting point is sketched below.
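
A minimal sketch using the top features from the cleaned fit's importance ranking (the particular selection and the object name murders_glm_sel are just an example):

# fit glm using a handful of the most important cleaned features
murders_glm_sel <- train(murders ~ LandArea + racepctblack + racePctAsian +
                           pctWRetire + LemasPctOfficDrugUn + PopDens,
                         data = murders_train_clean,
                         method = "glm")

# compare against the pca-based predictions
confusionMatrix(predict(murders_glm_sel, newdata = murders_test), criterion)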

Z - Violent & Non-violent Crime Data

  1. Run similar analyses for the violent and non-violent crime data sets, predicting either the number of violent crimes per 100k inhabitants (ViolentCrimesPerPop) or the number of non-violent crimes per 100k inhabitants (nonViolPerPop). This time the criterion is numeric, implying regression rather than classification. The features are identical to those in the murders analysis of the previous section.

  2. Use the computer to find a good set of features using recursive feature elimination with rfe(). A template follows; a sketch of extracting and evaluating the selected features comes after it.

# split index
train_index <- createDataPartition(violent_crime$ViolentCrimesPerPop, 
                                   p = .8, 
                                   list = FALSE)

# train and test sets
violent_train <- violent_crime %>% slice(train_index)
violent_test  <- violent_crime %>% slice(-train_index)

# remove extreme correlations (OPTIONAL)
predictors <- violent_train %>% select(-ViolentCrimesPerPop)
predictors <- predictors %>% select(-findCorrelation(cor(predictors)))
violent_train_clean <- predictors %>%
  add_column(ViolentCrimesPerPop = violent_train$ViolentCrimesPerPop)

# Feature elimination settings 
ctrl_rfe <- rfeControl(functions = lmFuncs,  # linear model
                       method = "cv",
                       verbose = FALSE,
                       rerank = FALSE)

# Run feature elimination
profile <- rfe(x = violent_train %>% select(-ViolentCrimesPerPop), 
               y = violent_train$ViolentCrimesPerPop,
               sizes = 1:(ncol(violent_train_clean)-1), # Feature set sizes to try
               rfeControl = ctrl_rfe)

# inspect cross-validated performance as a function of feature set size
plot(profile)
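
A sketch of what could come next: extract the selected feature set with predictors() and evaluate it on the hold-out set (the object name violent_glm_rfe is ours):

# extract the best-performing feature set found by rfe
predictors(profile)

# refit using only the selected features
violent_glm_rfe <- train(ViolentCrimesPerPop ~ .,
                         data = violent_train %>% 
                           select(one_of(predictors(profile)), ViolentCrimesPerPop),
                         method = "glm")

# evaluate regression performance on the hold-out set
postResample(predict(violent_glm_rfe, newdata = violent_test),
             violent_test$ViolentCrimesPerPop)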

Examples

# Step 0: Load packages-----------

library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(tibble)       # For advanced tibble functions
library(caret)        # For ML mastery 

# Step 1: Load, prepare, and explore data ----------------------

# read data
data <- read_csv("1_Data/mpg_num.csv")

# Convert all characters to factors
data <- data %>%
  mutate_if(is.character, factor)

# Explore training data
data        # Print the dataset
dim(data)   # Print dimensions
names(data) # Print the names

# Step 2: Create training and test sets -------------

# Create train index
train_index <- createDataPartition(data$hwy, 
                                   p = .8, 
                                   list = FALSE)

# Create training and test sets
data_train <- data %>% slice(train_index)
data_test <- data %>% slice(-train_index)

# split predictors and criterion
criterion_train <- data_train %>% select(hwy) %>% pull()
predictors_train <- data_train %>% select(-hwy)
criterion_test <- data_test %>% select(hwy) %>% pull()
predictors_test <- data_test %>% select(-hwy)

# Step 3: Clean data -------------

# Test for excessively correlated features
corr_matrix <- cor(predictors_train)
corr_features <- findCorrelation(corr_matrix)

# Remove excessively correlated features
predictors_train <- predictors_train %>% select(-corr_features)

# Test for near zero variance features
zerovar_features <- nearZeroVar(predictors_train)

# Remove near zero variance features
predictors_train <- predictors_train %>% select(-zerovar_features)

# recombine data
data_train <- predictors_train %>% add_column(hwy = criterion_train)

# Step 4: Define training control parameters -------------

# Train using cross-validation
ctrl_cv <- trainControl(method = "cv") 

# Step 5: Fit models -------------

# Fit glm vanilla flavor
hwy_glm <- train(form = hwy ~ .,
                 data = data_train,
                 method = "glm",
                 trControl = ctrl_cv)

# Fit with pca transformation
hwy_glm_pca <- train(form = hwy ~ .,
                     data = data_train,
                     method = "glm",
                     trControl = ctrl_cv,
                     preProcess = c('pca'))

# Fit scaling and centering
hwy_glm_sca <- train(form = hwy ~ .,
                     data = data_train,
                     method = "glm",
                     trControl = ctrl_cv,
                     preProcess = c('scale', 'center'))

# Get fits
glm_fit     <- predict(hwy_glm)
glm_pca_fit <- predict(hwy_glm_pca)
glm_sca_fit <- predict(hwy_glm_sca)

# Step 6: Evaluate variable importance -------------

# Run varImp()
imp_glm     <- varImp(hwy_glm)
imp_glm_pca <- varImp(hwy_glm_pca)
imp_glm_sca <- varImp(hwy_glm_sca)

# Plot variable importance
plot(imp_glm)
plot(imp_glm_pca)
plot(imp_glm_sca)

# Step 7: Select variables -------------

# Select by hand in formula
hwy_glm_sel <- train(form = hwy ~ cty,
                     data = data_train,
                     method = "glm",
                     trControl = ctrl_cv)

# Select by reducing pca criterion ---

# Reduce PCA threshold to 50% variance explained 
ctrl_cv_pca <- trainControl(method = "cv",
                            preProcOptions = list(thresh = 0.50)) 

# Refit model with update
hwy_glm_sel <- train(form = hwy ~ .,
                     data = data_train,
                     method = "glm",
                     trControl = ctrl_cv_pca,
                     preProcess = c('pca'))

# Step 8: Recursive feature elimination -------------

# Feature elimination settings 
ctrl_rfe <- rfeControl(functions = lmFuncs,  # linear model
                       method = "cv",
                       verbose = FALSE)

# Run feature elimination
profile <- rfe(x = predictors_train, 
               y = criterion_train,
               sizes = c(1, 2, 3),     # Feature set sizes to consider
               rfeControl = ctrl_rfe)

# plot result
trellis.par.set(caretTheme())
plot(profile, type = c("g", "o"))

# Step 9: Evaluate models -------------

# you know how...
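
A minimal sketch of how Step 9 might look, reusing predict() and postResample() on the hold-out data:

# Predict the criterion in the test set
glm_pred     <- predict(hwy_glm,     newdata = data_test)
glm_pca_pred <- predict(hwy_glm_pca, newdata = data_test)
glm_sca_pred <- predict(hwy_glm_sca, newdata = data_test)

# Assess aggregate prediction performance
postResample(glm_pred,     criterion_test)
postResample(glm_pca_pred, criterion_test)
postResample(glm_sca_pred, criterion_test)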

Datasets

File Rows Columns
pima_diabetes 724 7
murders_crime 1823 102
violent_crime 1000 102
nonviolent_crime 1000 102
  • The pima_diabetes data set is a subset of the PimaIndiansDiabetes2 data set in the mlbench package. It contains medical and demographic data of Pima Indians.

  • The murders_crime, violent_crime, and nonviolent_crime data sets are subsets of the Communities and Crime Unnormalized Data Set from the UCI Machine Learning Repository. To see column descriptions, visit this site: Communities and Crime Unnormalized Data Set. Due to the large number of variables (102), we do not include the full variable tables here.

Variable description of pima_diabetes

Name Description
pregnant Number of times pregnant.
glucose Plasma glucose concentration (glucose tolerance test).
pressure Diastolic blood pressure (mm Hg).
triceps Triceps skin fold thickness (mm).
insulin 2-Hour serum insulin (mu U/ml).
mass Body mass index (weight in kg/(height in m)^2).
pedigree Diabetes pedigree function.
age Age (years).
diabetes Class variable (test for diabetes).

Functions

Packages

Package Installation
tidyverse install.packages("tidyverse")
tibble install.packages("tibble")
caret install.packages("caret")

Functions

Function Package Description
trainControl() caret Define modelling control parameters
train() caret Train a model
predict(object, newdata) stats Predict the criterion values of newdata based on object
postResample() caret Calculate aggregate model performance in regression tasks
confusionMatrix() caret Calculate aggregate model performance in classification tasks
varImp() caret Determine the model-specific importance of features
findCorrelation(), nearZeroVar() caret Identify highly correlated and low variance features.
rfe(), rfeControl() caret Run and control recursive feature selection.

Resources

Cheatsheet

The caret cheatsheet, from github.com/rstudio.