Features

# Features
### Machine Learning with R <a href='https://therbootcamp.github.io'> The R Bootcamp @ DHLab </a> <a href='https://therbootcamp.github.io/ML-DHLab/'> </a>  <a href='https://therbootcamp.github.io'> </a>  <a href='mailto:therbootcamp@gmail.com'> </a>  <a href='https://www.linkedin.com/company/basel-r-bootcamp/'> </a>
### November 2020

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Machine Learning with R | November 2020
 
 </a>
 
 </div>

---

# Feature issues

<ul>
 <li class="m1">Too many features</li>
 <ul class="level">
 <li>Curse of <high>dimensionality</high></li>
 <li>Feature <high>importance</high></li>
 </ul> 
 <li class="m2">Wrong features
 <ul class="level">
 <li>Feature <high>scaling</high></li>
 <li>Feature <high>correlation</high></li>
 <li>Feature <high>quality</high></li>
 </ul>
 <li class="m3">Create new features
 <ul class="level">
 <li>Feature <high>engineering</high></li>
 </ul>
 </li>
</ul>

]

]

---

# Curse of dimensionality

<ul>
 <li class="m1">Density</li>
 <ul class="level">
 <li>The number of cases, that are necessary to <high>cover the data space</high> grows exponentially with the number of features.</li>
 </ul> 
 <li class="m2">Redundancy</li>
 <ul class="level">
 <li>Redundancy between the features grows with number, implying increased <high>uncertainty</high> in estimation.</li>
 </ul> 
 <li class="m3">Efficiency</li>
 <ul class="level">
 <li>The number of <high>parameters</high> grows with features requiring for <high>computational resources</high>.</li>
 </ul>
 </li>
</ul>

]

<img src="image/cod.png"> 
from <a href="https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335?gi=6e6735e00188">medium.freecodecamp.org</a>

]

---

# How to reduce dimensionality?

<ul>
 <li class="m1">Manual selection</li>
 <ul class="level">
 <li>Reduce features <high>manually</high> based on statistical or intuitive considerations..</li>
 </ul> 
 <li class="m2">Automatic selection</li>
 <ul class="level">
 <li>Reduce variables <high>automatically</high> using suitable ML algorithms, e.g., <mono>random forests</mono> or <mono>lasso</mono>, or feature selection algorithms, e.g., <mono>recursive feature selection</mono>.</li>
 </ul> 
 <li class="m3">Automatic reduction</li>
 <ul class="level">
 <li>Compress variables using <high>dimensionality reduction algorithms</high>, such as principal component analysis (PCA).</li>
 </ul>
 </li>
</ul>

]

<img src="image/highd.jpeg" height=350>
from <a href="">Interstellar</a>

]

---

# Feature importance

<ul>
 <li class="m1">Characterizes how much a <high>feature contributes</high> to the fitting/prediction performance. .</li> 
 <li class="m2">The metric is <high>model specific</high>, but typically <high>normalized</high> to <mono>[0, 100]</mono>.</li> 
 <li class="m3">Strategies</li>
 <ul class="level">
 <li>Single variable prediction (e.g., using LOESS, ROC) </li>
 <li>Accuracy loss from scrambling</li>
 <li>Random Forest importance</li>
 <li>etc.</li>
 </ul>
</ul>

]

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

]

---

# `varImp()`

<ul>
 <li class="m1">Automatically selects <high>appropriate measure</high> of variable importance for a given algorithm.</li>
</ul>

```r
varImp(income_lm)
```

```
lm variable importance

Overall
age       100.000
food       42.480
alcohol    23.682
happiness  13.909
tattoos     6.284
height      3.230
children    1.837
datause     1.486
```

]

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

]

---

# Recursive feature selection

<ul>
 <li class="m1"><mono>rfe()</mono> uses <high>cross-validation</high> to select the best n freatures.</li> 
 <li class="m2">Algorithm</li>
 <ol>
 <li><high>Candidates</high>, e.g., <mono>n = [2, 3, 5, 10]</mono>.</li>
 <li><high>Resample</high> and split data.</li>
 <li>Evaluate <high>performance</high> for the best <mono>n</mono> features.</li>
 <li>Select best <mono>n</mono> on the basis of <high>aggregate performance</high>.</li>
 </ol>
</ul>

]

```r
# Run feature elimination
rfe(x = ..., y = ..., 
    sizes = c(3,4,5,10), # feature set sizes
    rfeControl = rfeControl(functions = lmFuncs))
```

```

Recursive feature selection

Outer resampling method: Bootstrapped (25 reps)

Resampling performance over subset size:

Variables  RMSE Rsquared   MAE RMSESD RsquaredSD  MAESD Selected
         3 0.386    0.855 0.303 0.0143    0.01099 0.0127         
         4 0.382    0.858 0.299 0.0146    0.01063 0.0134         
         5 0.382    0.858 0.299 0.0134    0.00987 0.0124        *
        10 0.383    0.858 0.299 0.0127    0.00959 0.0115         
        14 0.382    0.858 0.299 0.0128    0.00970 0.0114

The top 5 variables (out of 5):
   age, food, alcohol, happiness, tattoos
```

]

---

# Principal component analysis

<ul>
 <li class="m1">The <high>go-to algorithm</high> for dimensionality reduction</li> 
 <li class="m2">Linear model (regression) represents features in a <high>new, smaller feature space</high>.</li> 
 <li class="m3">The new feature space explains <high>maximal variance</high> of the original features.</li>
</ul>

]

<img src="image/pca.png" height=350>
from <a href="https://blog.umetrics.com/what-is-principal-component-analysis-pca-and-how-it-is-used
">blog.umetrics.com</a>

]

---

# Using `PCA`

```r
# train model WITHOUT PCA preprocessing
model = train(income ~ ., method = 'lm', 
           data = bas_train)

plot(varImp(model))
```

]

```r
# train model WITH PCA preprocessing
model = train(income ~ ., method = 'lm', 
              data = bas_train,
              preProc = c('pca'))
plot(varImp(model))
```

]

---

# Other, easy feature problems

### Multi-collinearity

<ul>
 <li class="m1"><high>High feature correlations</high> mean that there is redundancy in the data, which can lead to less stable fits, uninterpretable variable importances, and worse predictions.</li>
</ul>

```r
# identify redundant variables
findCorrelation(cor(basel))
```

```
[1] 5
```

```r
# remove from data
remove <- findCorrelation(cor(basel))
basel <- basel %>%
 select(-remove)
```

]

### Low variance

<ul>
 <li class="m2">Low variance variables add parameters, but <high>can hardly contribute to prediction</high> and are, thus, also redundant.</li>
</ul>

```r
# identify low variance variables
nearZeroVar(basel)
```

```
integer(0)
```

<ul>
 <li class="m3">Unequal variance <high>breaks regularization</high> (L1, L2) and renders estimates difficult to interpret..</li>
</ul>

```r
# standardize and center variables
train(..., preProc("center", "scale"))
```

]

---

# Difficult feature problems

<ul>
 <li class="m1">Trivial Features</li>
 <ul>
 <li>Successful prediction not necessarily implies that a meaningful pattern has been detected.</li>
 </ul> 
 <li class="m2">Missing features</li>
 <ul>
 <li>Some problems are hard, requiring the engineering of new features.</li> 
 </ul>
</ul>

]

]

---

# Trivial features

<a href="https://www.gwern.net/Tanks">An urban myth?!</a>

"The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness." 
New York Times <a href="https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html" style="font-size:8px">[Full text]</a>

]

<img src="image/tank.jpg">
from <a href="https://en.wikipedia.org/wiki/British_heavy_tanks_of_World_War_I#/media/File:Mark_I_series_tank.jpg">wikipedia.org</a>

]

---

# (Always!) missing features

"…some machine learning projects succeed and some fail. What makes the difference? <high>Easily the most important factor is the features used</high>."

[Pedro Domingos](https://en.wikipedia.org/wiki/Pedro_Domingos)

"The algorithms we used are very standard for Kagglers. […] <high>We spent most of our efforts in feature engineering</high>. [...] We were also very careful to discard features likely to expose us to the risk of over-fitting our model."

[Xavier Conort]()

"Coming up with features is difficult, time-consuming, requires expert knowledge. <high>Applied machine learning is basically feature engineering</high>."

[Andrew Ng](https://en.wikipedia.org/wiki/Andrew_Ng)

]

---

# Feature engineering

“Feature engineering is the process of <high>transforming raw data</high> into features that <high>better represent the underlying problem</high> to the predictive models, resulting in improved model accuracy on unseen data.”

[Jason Brownlee]()

"...while avoiding the <high>curse of dimensionality</high>."

[duw]()

]

]

---

# <mono>createDataPartition()</mono>

<ul>
 <li class="m1">Use <mono>createDataPartition()</mono> to split the <high>data set</high> in training and test.</li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="30%">
 <col width="70%">
<tr>
 <td bgcolor="white">
 Argument
 </td>
 <td bgcolor="white">
 Beschreibung
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 <mono>y</mono>
 </td>
 <td bgcolor="white">
 The criterion. Important for a <high>balanced split</high> of the data. 
 </td> 
</tr>
<tr>
 <td bgcolor="white">
 <mono>p</mono>
 </td>
 <td bgcolor="white">
 The <high>proportion of data</high> of data assigned to training. Often <mono>.8</mono> or <mono>.5</mono>. 
 </td> 
</tr>
</table>

]

```r
# Important for reproducible results
set.seed(100)

# Index for training
index <- 
 createDataPartition(y = basel$income,
 p = .8,
 list = FALSE)

# Create training 
basel_train <- basel %>% 
 slice(index)

# Create test
basel_test <- basel %>% 
 slice(-index)
```

]

---

<h1><a href="https://therbootcamp.github.io/ML-DHLab/_sessions/Features/Features_practical.html">Practical</a></h1>