# Models
### Machine Learning with R <a href='https://therbootcamp.github.io'> Basel R Bootcamp </a>
### October 2019

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Machine Learning with R | October 2019
 
 </a>
 
 </div>

---

# There is no free lunch

Theorem

Given a finite set `$V$` and a finite set `$S$` of real numbers, <high>assume that `$f:V\to S$` is chosen at random</high> according to the uniform distribution on the set `$S^{V}$` of all possible functions from `$V$` to `$S$`. Then, for the problem of optimizing `$f$` over the set `$V$`, <high>no algorithm performs better than blind search.</high>
 
<a href="https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf">Wolpert & Macready, 1997, No Free Lunch Theorems for Optimization</a>

]

<img src="image/free_lunch.jpg" height=400px width=650px> 
 from <a href="http://christianfunnypictures.com/2016/02/theres-no-such-thing-as-a-free-lunch-or-is-there.html">christianfunnypictures.com</a>

]

---

# Know your problem

Bias-variance dilemma

`$$\Large Error = Bias^2 + Variance + Noise$$`

Simply put...

Bias arises from strong <high>model assumptions</high> not being met by the environment.

Variance arises from high <high>model flexibility</high> fitting the noise in the data (i.e., overfitting).

&#8594; <high>Make strong assumptions</high> (use simple models), if possible.

]

]
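
The dilemma can be illustrated with a small simulation. The following is a minimal sketch in base R, not part of the original slides: the true function `f`, the noise level, the sample size, and the polynomial degrees are all arbitrary assumptions chosen for illustration. A straight line (strong assumptions) and a degree-9 polynomial (weak assumptions) are fit repeatedly to fresh noisy samples, and squared bias and variance of their predictions are estimated across the repetitions.

```r
set.seed(42)

# Assumed "environment": a non-linear truth observed with noise
f <- function(x) sin(2 * pi * x)
x_train <- seq(0, 1, length.out = 30)
x_test  <- seq(0.05, 0.95, length.out = 50)

# Fit a polynomial of the given degree to n_sim fresh noisy samples and
# return a matrix of predictions (rows = test points, columns = samples)
simulate_fits <- function(degree, n_sim = 200) {
  replicate(n_sim, {
    d <- data.frame(x = x_train,
                    y = f(x_train) + rnorm(length(x_train), sd = 0.3))
    fit <- lm(y ~ poly(x, degree), data = d)
    predict(fit, newdata = data.frame(x = x_test))
  })
}

# Squared bias: how far the average prediction is from the truth
# Variance: how much predictions vary from sample to sample
decompose <- function(preds) {
  c(bias2    = mean((rowMeans(preds) - f(x_test))^2),
    variance = mean(apply(preds, 1, var)))
}

result <- rbind(linear   = decompose(simulate_fits(1)),
                flexible = decompose(simulate_fits(9)))
print(round(result, 3))
```

Under these assumptions the linear model should show the larger squared bias and the flexible model the larger variance; which of the two predicts better overall depends on the noise level and the amount of data.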

---

One important model assumption concerns linearity.

Linear models (`lm`, `glm`) make strong model assumptions. They are more often wrong, but also ceteris paribus <high>less prone to overfitting</high>.

Non-linear models (everything else) make weaker model assumptions, leaving the exact relationship (more) open. They are often closer to the truth, but also ceteris paribus <high>more prone to overfitting</high>.

]

.pull-right5[
 
 

 <img src="image/linearity.png" height=480px> 
 from <a href="https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html">scikit-learn.org</a>

]

---

# Kernel trick

<high>Transforms "input space" into new "feature space"</high> to allow for object separation.

Used in <high>Support Vector Machines</high> (e.g., `method = "svmRadial"`), often using a <high>radial basis function</high> (RBF).

Kernels <high>re-represent objects</high> in terms of other objects!

]

<img src="image/linearity.png" height=480px> 
 from <a href="https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html">scikit-learn.org</a>

]
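
To make "re-representing objects in terms of other objects" concrete, here is a minimal base R sketch of an RBF kernel. It is an illustration only, not how `svmRadial` is implemented: the function name `rbf_kernel`, the value `gamma = 0.5`, the two-ring toy data, and the reference point at the origin are all assumptions. Two concentric rings cannot be separated by a straight line in the input space, but each point's kernel similarity to a single reference object separates them with one threshold.

```r
set.seed(7)

# Toy data (assumed): an inner ring (radius 1) and an outer ring (radius 3)
n <- 100
angle  <- runif(n, 0, 2 * pi)
radius <- rep(c(1, 3), each = n / 2)
x      <- cbind(radius * cos(angle), radius * sin(angle))
group  <- rep(c("inner", "outer"), each = n / 2)

# RBF kernel: similarity of every row of a to every row of b
rbf_kernel <- function(a, b, gamma = 0.5) {
  d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  exp(-gamma * pmax(d2, 0))  # pmax guards against tiny negative distances
}

# Feature space: each object is described by its similarity to all objects
K <- rbf_kernel(x, x)
dim(K)

# Similarity to one hypothetical reference object at the origin
center        <- matrix(c(0, 0), nrow = 1)
sim_to_center <- drop(rbf_kernel(x, center))

# A single threshold on this new feature separates the two rings
table(group, similar = sim_to_center > 0.5)
```

A support vector machine works with the full kernel matrix `K` rather than a single reference point, but the idea is the same: closeness to other objects becomes the feature.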

---

# Automatic feature engineering

<high>Deep learning</high>, aka neural networks, and especially <high>convolutional neural networks</high> excel because they generate their own features.

Neural networks are not the focus of `caret` or of this course. Powerful implementations based on <high>Google's TensorFlow</high> library are provided by `tensorflow`.

<img src="image/tf.png"> 
 from <a href="https://de.wikipedia.org/wiki/TensorFlow">wikipedia.org</a>

]

<img src="image/power_of_deeplearning.png" height=265px> 
 from <a href="https://towardsdatascience.com/cnn-application-on-structured-data-automated-feature-extraction-8f2cd28d9a7e">towardsdatascience.com</a>

]

---

# Robustness

To produce <high>robust predictions</high> that <high>suffer less from variance</high>, ML models use a variety of <high>tricks</high>.

<img src="image/robustness_sel.png" width=350px> 
 from <a href="https://www.istockphoto.com/ch/grafiken/kraftathlet?sort=mostpopular&mediatype=illustration&assetfiletype=eps&phrase=kraftathlet">istockphoto.com</a>

]

.pull-right55[
<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="210">
 <col width="210">
 <col width="210">
<tr>
 <th>Approach</th>
 <th>Implementation</th>
 <th>Examples</th>
</tr>
<tr style="background-color:#ffffff">
 <td align="center">Tolerance</td>
 <td align="center">Decrease error tolerance</td>
 <td align="center"><mono>svmRadial</mono></td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center">Regularization</td>
 <td align="center">Penalize for complexity</td>
 <td align="center"><mono>lasso</mono>, <mono>ridge</mono>, <mono>elasticnet</mono></td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center">Ensemble</td>
 <td align="center">Bagging</td>
 <td align="center"><mono>treebag</mono>, <mono>randomGLM</mono>, <mono>randomForest</mono></td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center">Ensemble</td>
 <td align="center">Boosting</td>
 <td align="center"><mono>adaboost</mono>, <mono>xgbTree</mono></td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center">Feature selection</td>
 <td align="center">Regularization</td>
 <td align="center"><mono>lasso</mono></td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center">Feature selection</td>
 <td align="center">Importance</td>
 <td align="center"><mono>random forest</mono></td>
</tr>
</table>

]

---

# Regularization

Regularization is the process of adding terms to a model's loss function, usually <high>penalties for complexity</high>, in order to prevent overfitting (or to make a problem solvable in the first place).

<br2>
<high>Loss</high> = <high>Misfit</high> + <high>Penalty</high>

<table style="cellspacing:0; cellpadding:0; border:none;">
 <col width="160">
 <col width="160">
 <col width="160">
<tr>
 <th>Name</th>
 <th>Penalty</th>
 <th>`caret`</th>
</tr>
<tr style="background-color:#ffffff">
 <td align="center"><high>AIC/BIC</high></td>
 <td align="center"><img src="image/regularization/aicbic.png" height=24px></td>
 <td align="center">-</td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center"><high>Lasso</high></td>
 <td align="center"><img src="image/regularization/lasso.png" height=24px></td>
 <td align="center">`method = "glmnet"`</td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center"><high>Ridge</high></td>
 <td align="center"><img src="image/regularization/ridge.png" height=24px></td>
 <td align="center">`method = "glmnet"`</td>
</tr>
<tr style="background-color:#ffffff">
 <td align="center"><high>Elastic Net</high></td>
 <td align="center"><img src="image/regularization/ridge.png" height=24px></td>
 <td align="center">`method = "glmnet"`</td>
</tr>
</table>

]

]
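
The "Loss = Misfit + Penalty" idea can be sketched for the ridge case, where the penalized least-squares problem has a closed-form solution. This is a minimal base R illustration, not what `method = "glmnet"` does internally: the helper `ridge_fit`, the simulated data, and the grid of `lambda` values are assumptions made for the example.

```r
set.seed(11)

# Simulated data (assumed): only the first two of ten features matter
n <- 50
p <- 10
x <- scale(matrix(rnorm(n * p), n, p))
y <- drop(x %*% c(3, -2, rep(0, p - 2)) + rnorm(n))
y <- y - mean(y)  # centre y so that no intercept is needed

# Ridge: minimise sum((y - x %*% beta)^2) + lambda * sum(beta^2)
ridge_fit <- function(x, y, lambda) {
  drop(solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y)))
}

# Larger lambda = heavier penalty = coefficients shrunk towards zero
lambdas    <- c(0, 1, 10, 100)
coefs      <- sapply(lambdas, function(l) ridge_fit(x, y, l))
coef_norms <- sqrt(colSums(coefs^2))

print(round(rbind(lambda = lambdas, coef_norm = coef_norms), 3))
```

With `lambda = 0` this reduces to ordinary least squares. Ridge shrinks coefficients but rarely sets them exactly to zero; the lasso penalty, which uses absolute instead of squared coefficients, can.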

---

# Bagging

<high>Aggregate</high> predictions from multiple fits to <high>resampled</high> data.

Especially beneficial for models that produce relatively unstable solutions, e.g., regression trees. `rpart` &#8594; `treebag`.

Algorithm

1 - <high>Resample</high> data (with replacement).

2 - <high>Fit</high> model to resampled data.

3 - <high>Average</high> predictions.

]

.pull-right45[
 

 <img src="image/münchhausen.jpg" height=450px> 
 from <a href="https://en.wikipedia.org/wiki/M%C3%BCnchhausen_trilemma">wikipedia.org</a>

]
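
The three steps can be sketched by hand with `rpart` regression trees. This is a minimal illustration under assumptions, not the implementation behind `treebag`: the simulated sine data, the number of bags (`n_bags = 50`), and all variable names are chosen for the example only.

```r
library(rpart)
set.seed(5)

# Simulated data (assumed): a noisy sine curve
n       <- 200
train   <- data.frame(x = runif(n, 0, 10))
train$y <- sin(train$x) + rnorm(n, sd = 0.5)
test    <- data.frame(x = seq(0, 10, length.out = 100))
y_true  <- sin(test$x)

n_bags <- 50
bag_preds <- sapply(seq_len(n_bags), function(b) {
  idx <- sample(n, replace = TRUE)          # 1 - resample with replacement
  fit <- rpart(y ~ x, data = train[idx, ])  # 2 - fit model to resampled data
  predict(fit, newdata = test)
})
bagged <- rowMeans(bag_preds)               # 3 - average predictions

# Compare with a single tree fit to the original data
single <- predict(rpart(y ~ x, data = train), newdata = test)
mse <- c(single = mean((single - y_true)^2),
         bagged = mean((bagged - y_true)^2))
print(round(mse, 3))
```

Averaging smooths the step-like predictions of individual trees. Whether the bagged error is lower than the single-tree error in a given run depends on the data and the seed.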

---

# Boosting

Boosting <high>adaptively re-weights</high> samples based on performance.

`adaboost` and the newer `xgbTree` are among the <high>best ML models out there</high>.

Algorithm

1 - Assign <high>equal weight</high> to all cases.

2 - <high>Fit</high> simple model.

3 - <high>Increase weight of misfit cases</high> by model misfit for next iteration.

4 - <high>Repeat</high>.

5 - <high>Average</high> predictions, weighting each model by its performance (lower misfit, higher weight).

]

<img src="image/bagg_boost.png" height=410px> 
 from <a href="https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html">scikit-learn.org</a>

]
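
One way to make these steps concrete is discrete AdaBoost with decision stumps. The following base R sketch is an assumption-laden illustration, not the code behind `adaboost` or `xgbTree`: the helpers `fit_stump` and `predict_stump`, the toy data, the number of rounds, and the standard AdaBoost model weight `alpha = 0.5 * log((1 - err) / err)` are all choices made for this example.

```r
set.seed(1)

# Toy data (assumed): class +1 inside a circle, -1 outside
n <- 200
x <- matrix(runif(n * 2, -1, 1), ncol = 2)
y <- ifelse(x[, 1]^2 + x[, 2]^2 < 0.5, 1, -1)

# Simple model: one feature, one threshold, chosen to minimise weighted error
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (thr in unique(x[, j])) {
      for (sgn in c(1, -1)) {
        pred <- ifelse(x[, j] > thr, sgn, -sgn)
        err  <- sum(w[pred != y])
        if (err < best$err) best <- list(j = j, thr = thr, sgn = sgn, err = err)
      }
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x[, s$j] > s$thr, s$sgn, -s$sgn)

n_rounds <- 20
w        <- rep(1 / n, n)                 # 1 - equal weight for all cases
stumps   <- vector("list", n_rounds)
alpha    <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {            # 4 - repeat
  s    <- fit_stump(x, y, w)              # 2 - fit simple model
  err  <- max(s$err, 1e-10)               # avoid division by zero
  alpha[m] <- 0.5 * log((1 - err) / err)  # model weight: lower misfit, higher weight
  pred <- predict_stump(s, x)
  w    <- w * exp(-alpha[m] * y * pred)   # 3 - up-weight misclassified cases
  w    <- w / sum(w)
  stumps[[m]] <- s
}

# 5 - weighted vote across all stumps
votes <- sapply(seq_len(n_rounds),
                function(m) alpha[m] * predict_stump(stumps[[m]], x))
boost_acc <- mean(sign(rowSums(votes)) == y)
first_acc <- mean(predict_stump(stumps[[1]], x) == y)
print(c(first_stump = first_acc, boosted = boost_acc))
```

The printed values are training accuracies, so they say nothing about overfitting; in practice the number of rounds would be tuned with resampling.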

---

# Automatic feature selection

Many models reduce complexity by automatically relying on a subset of good features.

Two examples

LASSO

Regularization, in particular via `lasso`, frequently <high>estimates <mono>beta = 0</mono></high> and, thus, essentially deselects that feature.

Random forests

As random forests select at each node the best of `mtry`-many randomly selected features, <high>unpredictive features may never come into play</high>. This is especially true for large `mtry`.

]

<img src="image/self_tuning.png" height=420px> 
from <a href="https://medium.com/@dkwok94/machine-learning-for-my-grandma-ca242e97ef62">medium.com</a>

]
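
The lasso's tendency to estimate coefficients of exactly zero can be seen in a small coordinate-descent sketch. This is a minimal base R illustration, not the algorithm as implemented in `glmnet`: the helpers `soft_threshold` and `lasso_cd`, the simulated data, and the penalty value `lambda = 0.5` are assumptions made for the example.

```r
set.seed(3)

# Simulated data (assumed): only features 1 and 2 are predictive
n <- 100
p <- 5
x <- scale(matrix(rnorm(n * p), n, p))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n, sd = 0.5)
y <- y - mean(y)  # centre y so that no intercept is needed

# Soft thresholding: shrink towards zero, and set to zero if |z| <= lambda
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimise mean((y - x %*% beta)^2) / 2 + lambda * sum(abs(beta))
# by updating one coefficient at a time
lasso_cd <- function(x, y, lambda, n_iter = 100) {
  beta <- rep(0, ncol(x))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(x))) {
      # residual when feature j is left out of the model
      r_j <- drop(y - x[, -j, drop = FALSE] %*% beta[-j])
      z   <- mean(x[, j] * r_j)
      beta[j] <- soft_threshold(z, lambda) / mean(x[, j]^2)
    }
  }
  beta
}

beta_hat <- lasso_cd(x, y, lambda = 0.5)
print(round(beta_hat, 3))
```

Features whose association with the residual stays below `lambda` receive a coefficient of exactly zero, which is what deselects them. The surviving coefficients are shrunk towards zero as well.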

---

# Some help in choosing models

<img src="image/mlmap.png" height = 450px> 
from <a href="https://scikit-learn.org">scikit-learn.org</a>

---

# Remember

"…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."

Pedro Domingos

"The algorithms we used are very standard for Kagglers. […] We spent most of our efforts in feature engineering. [...] We were also very careful to discard features likely to expose us to the risk of over-fitting our model."

Xavier Conort

]

]

---

<h1><a href="https://therbootcamp.github.io/ML_2019Oct/_sessions/Models/Models_practical.html">Practical</a></h1>