class: center, middle, inverse, title-slide

# Features

### Machine Learning with R<br>Basel R Bootcamp
### October 2019

---

layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
  <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
  <span style="padding-left:82px">
  <font color="#7E7E7E">
  www.therbootcamp.com
  </font>
  </span>
</a>
<a href="https://therbootcamp.github.io/">
  <font color="#7E7E7E">
  Machine Learning with R | October 2019
  </font>
</a>
</span>
</div>

---

.pull-left45[

# Feature issues

<br>

<b>Too many features</b>

- Curse of dimensionality
- Feature importance

<b>Wrong features</b>

- Feature scaling
- Feature correlation
- Feature quality

<b>Create new features</b>

- Feature engineering

]

.pull-right45[

<br><br>
<p align="center">
  <img src="image/dumbdata.png" height = 500px><br>
  <font style="font-size:10px">from <a href="https://xkcd.com/1838/">xkcd.com</a></font>
</p>

]

---

# Curse of dimensionality

.pull-left35[

As the number of features grows...

<high>Performance</high> - the amount of data needed to generalize accurately grows exponentially.

<high>Efficiency</high> - the number of computations grows (how much depends on the model).

<high>Redundancy</high> - the amount of redundancy among features grows (how much depends on the model).

→ <high>Small set of good predictors</high>

]

.pull-right6[

<br>
<p align="center">
  <img src="image/cod.png"><br>
  <font style="font-size:10px">from <a href="https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335?gi=6e6735e00188">medium.freecodecamp.org</a></font>
</p>

]

---

# How to reduce dimensionality?

.pull-left45[

<b>3 ways</b>

1 - Reduce variables <high>manually</high> based on statistical or intuitive considerations.

2 - Reduce variables <high>automatically</high> using suitable ML algorithms, e.g., `random forests` or `lasso regression`, or feature selection algorithms, e.g., `recursive feature selection`.

3 - Compress variables using <high>dimensionality reduction algorithms</high>, such as `principal component analysis` (PCA).

]

.pull-right5[

<p align = "center">
  <img src="image/highd.jpeg" height=350>
  <font style="font-size:10px">from <a href="">Interstellar</a></font>
</p>

]

---

# Feature importance

.pull-left4[

<high>Feature importance</high> characterizes how much a feature contributes to fitting/prediction performance. The metric is <high>model specific</high>, but typically <high>normalized</high> to `[0, 100]`.

<u>Strategies</u>

- Single-variable prediction (e.g., using `LOESS`, `ROC`)
- Accuracy loss from scrambling
- `random forests` importance
- etc.

]

.pull-right5[

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

<img src="Features_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

]

---

# `varImp()`

.pull-left45[

`varImp()` <high>automatically selects an appropriate measure</high> of variable importance for a given algorithm.

```r
varImp(income_lm)
```

```
lm variable importance

          Overall
age       100.000
food       41.134
alcohol    25.247
happiness  11.582
tattoos     5.888
children    2.745
height      2.337
weight      0.983
```

]

.pull-right5[

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

<img src="Features_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

]
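---

# From `train()` to `varImp()`

The `income_lm` object above is a fitted `caret` model. A minimal, hypothetical sketch of the full workflow; the exact `train()` call on `bas_train` is an assumption based on the surrounding slides, not code from the practical:

```r
library(caret)

# fit a regression model predicting income from all other features
income_lm <- train(income ~ .,
                   method = "lm",
                   data = bas_train)

# model-specific importance; for lm, caret rescales the absolute
# t-statistic of each coefficient to [0, 100]
varImp(income_lm)

# raw, unscaled importance values
varImp(income_lm, scale = FALSE)

# dotplot of variable importance
plot(varImp(income_lm))
```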
---

.pull-left35[

# Recursive feature selection

`rfe()` finds the <high>best number of predictors `n`</high>, with `n` drawn from a specified candidate set `N`, e.g., `N = [2,3,5,10]`.

<u>Algorithm</u>

1. <high>Resample</high> and split the data<br2>
2. Identify the <high>best `n` predictors</high> and their prediction performance<br2>
3. <high>Aggregate performance</high> and select the best `n` along with the correspondingly best predictors

]

.pull-right55[

<br><br>

```r
# Run feature elimination
rfe(x = ..., y = ...,
    sizes = c(3, 4, 5, 10),  # feature set sizes
    rfeControl = rfeControl(functions = lmFuncs))
```

```
Recursive feature selection

Outer resampling method: Bootstrapped (25 reps)

Resampling performance over subset size:

 Variables  RMSE Rsquared   MAE RMSESD RsquaredSD   MAESD Selected
         3 0.371    0.861 0.292 0.0133     0.0139 0.01078
         4 0.367    0.863 0.288 0.0127     0.0126 0.00992        *
         5 0.368    0.863 0.289 0.0122     0.0124 0.00951
        10 0.371    0.861 0.292 0.0123     0.0121 0.00922
        14 0.371    0.861 0.291 0.0122     0.0120 0.00918

The top 4 variables (out of 4):
   age, food, alcohol, happiness
```

]

---

# Dimensionality reduction using `PCA`

.pull-left45[

The go-to algorithm for dimensionality reduction is <high>principal component analysis</high> (PCA).

PCA is an <high>unsupervised</high>, regression-based algorithm that re-represents the data in a <high>new feature space</high>.

The new features, a.k.a. principal components, are extracted greedily: each captures as much of the remaining variance as possible. <high>Skimming off the best components</high> therefore yields a small number of features that <high>preserve the information in the original features</high> as well as possible.

]

.pull-right45[

<p align = "center">
  <img src="image/pca.png" height=350>
  <font style="font-size:10px">from <a href="https://blog.umetrics.com/what-is-principal-component-analysis-pca-and-how-it-is-used">blog.umetrics.com</a></font>
</p>

]

---

# Using `PCA`

.pull-left45[

```r
# train model WITHOUT PCA preprocessing
model = train(income ~ .,
              method = 'lm',
              data = bas_train)

plot(varImp(model))
```

<img src="Features_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

]

.pull-right45[

```r
# train model WITH PCA preprocessing
model = train(income ~ .,
              method = 'lm',
              data = bas_train,
              preProc = c('pca'))

plot(varImp(model))
```

<img src="Features_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

]

---

# Other, easy feature problems

.pull-left45[

### Multi-collinearity

Multi-collinearity, i.e., <high>high feature correlations</high>, means that there is redundancy in the data, which can lead to <high>less stable fits</high>, <high>uninterpretable variable importances</high>, and <high>worse predictions</high>.

```r
# identify redundant variables
findCorrelation(cor(baselers))
```

```
[1] 5
```

```r
# remove from data
remove <- findCorrelation(cor(baselers))
baselers <- baselers %>% select(-remove)
```

]

.pull-right45[

### Unequal & low variance

Unequal variance <high>breaks regularization</high> (L1, L2) and renders estimates difficult to interpret.

```r
# standardize and center variables
train(..., preProc = c("center", "scale"))
```

Low-variance variables add parameters but <high>can hardly contribute to prediction</high> and are thus also redundant.

```r
# identify low variance variables
nearZeroVar(baselers)
```

```
integer(0)
```

]
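---

# Combining the checks with `preProcess()`

A hedged sketch of how the above checks could be combined into a single `caret` preprocessing recipe; applying it to the numeric `baselers` columns and using the default cutoffs are assumptions, not code from the practical:

```r
library(caret)

# estimate a preprocessing recipe on the training data:
# center & scale all numeric predictors, and drop near-zero-variance
# ("nzv") and highly correlated ("corr") predictors
pp <- preProcess(baselers,
                 method = c("center", "scale", "nzv", "corr"))

# apply the recipe to the data (or later to new data)
baselers_clean <- predict(pp, newdata = baselers)

# the same filters can be requested inside train()
# via preProc = c("center", "scale", "nzv", "corr")
```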
---

# Difficult feature problems

<br>

.pull-left25[

1 - <b>Trivial features</b>

Successful prediction does not necessarily imply that a meaningful pattern has been detected.

<br>

2 - <b>Missing features</b>

Some problems are hard, requiring the engineering of new features.

]

.pull-right65[

<br>
<p align = "center">
  <img src="image/here_to_help.png"><br>
  <font style="font-size:10px">from <a href="https://xkcd.com/1831/">xkcd.com</a></font>
</p>

]

---

# Trivial features

.pull-left3[

<u><a href="https://www.gwern.net/Tanks">An urban myth?!</a></u>

"The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness."<br><br>

New York Times <a href="https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html" style="font-size:8px">[Full text]</a>

]

.pull-right6[

<p align = "center">
  <img src="image/tank.jpg">
  <font style="font-size:10px">from <a href="https://en.wikipedia.org/wiki/British_heavy_tanks_of_World_War_I#/media/File:Mark_I_series_tank.jpg">wikipedia.org</a></font>
</p>

]

---

# Trivial features

In 2012, Nate Silver was praised for correctly predicting the outcome of the presidential election in all 50 states, after correctly predicting 49 states in 2008. <high>But how much of a challenge was that?</high>

.pull-left5[

<p align = "center">
  <img src="image/elect2008.png" height = 360px><br>
  <font style="font-size:10px">from <a href="https://www.vox.com/policy-and-politics/2016/11/8/13563106/election-map-historical-vote">vox.com</a></font>
</p>

]

.pull-right5[

<p align = "center">
  <img src="image/elect2012.png" height = 360px><br>
  <font style="font-size:10px">from <a href="https://www.vox.com/policy-and-politics/2016/11/8/13563106/election-map-historical-vote">vox.com</a></font>
</p>

]

---

# (Always!) missing features

<i>"…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."</i>

Pedro Domingos

<i>"The algorithms we used are very standard for Kagglers. […] We spent most of our efforts in feature engineering. [...] We were also very careful to discard features likely to expose us to the risk of over-fitting our model."</i>

Xavier Conort

<i>"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."</i>

Andrew Ng

---

# Feature engineering

.pull-left45[

<i>“Feature engineering is the process of <high>transforming raw data</high> into features that <high>better represent the underlying problem</high> to the predictive models, resulting in improved model accuracy on unseen data.”</i>

Jason Brownlee

<i>"...while avoiding the <high>curse of dimensionality</high>."</i>

duw

<u>Feature engineering involves</u>

- <b>Transformations</b>
- <b>Interactions</b>
- <b>New features</b>

→ a short code sketch follows on the last slide

]

.pull-right45[

<p align = "center">
  <img src="image/albert.jpeg"><br>
  <font style="font-size:10px">from <a href="http://www.open.edu/openlearncreate/mod/oucontent/view.php?id=80245&section=1">open.edu</a></font>
</p>

]

---

class: middle, center

<h1><a href="https://therbootcamp.github.io/ML_2019Oct/_sessions/Features/Features_practical.html">Practical</a></h1>
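---

# Feature engineering: a sketch

An illustrative, hypothetical example of the three ingredients listed on the feature engineering slide; the derived features (and the assumption that `height` is recorded in cm) are illustrative, not features used in the practical:

```r
library(tidyverse)
library(caret)

# transformations and new features
baselers <- baselers %>%
  drop_na() %>%                                        # keep complete cases for the sketch
  mutate(log_alcohol = log(alcohol + 1),               # transformation
         bmi         = weight / (height / 100)^2)      # new feature from two raw features

# interactions can be specified directly in the model formula
income_lm <- train(income ~ age * children + bmi + log_alcohol,
                   method = "lm",
                   data = baselers)
```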