class: center, middle, inverse, title-slide

# Prediction
### Applied Machine Learning with R
The R Bootcamp @ AMLD
### January 2020

---

layout: true

<div class="my-footer">
  <span style="text-align:center">
    <span>
      <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
    </span>
    <a href="https://therbootcamp.github.io/">
      <span style="padding-left:82px">
        <font color="#7E7E7E">www.therbootcamp.com</font>
      </span>
    </a>
    <a href="https://therbootcamp.github.io/">
      <font color="#7E7E7E">Applied Machine Learning with R @ AMLD | January 2020</font>
    </a>
  </span>
</div>

---

# Predict hold-out data

.pull-left45[
<ul>
  <li class="m1"><span>Model performance must be evaluated as true prediction on an <high>unseen data set</high>.</span></li>
  <li class="m2"><span>The unseen data set can be <high>naturally</high> occurring, e.g., using 2019 stock prices to evaluate a model fit using 2018 stock prices.</span></li>
  <li class="m3"><span>More commonly, unseen data is created by <high>splitting the available data</high> into a training set and a test set.</span></li>
</ul>
]

.pull-right45[
<p align = "center">
<img src="image/testdata.png" height=430px>
</p>
]

---

# Why do we separate training from testing?

<br><br>

.pull-left45[
<high>Training data</high>
<br>

| id|sex | age|fam_history |smoking | criterion|
|--:|:---|---:|:-----------|:-------|---------:|
|  1|f   |  45|No          |TRUE    |         0|
|  2|m   |  43|No          |FALSE   |         0|
|  3|f   |  40|Yes         |FALSE   |         1|
|  4|f   |  51|Yes         |TRUE    |         1|
|  5|m   |  44|Yes         |FALSE   |         0|
]

.pull-right45[
<high>Test data</high>
<br>

| id|sex | age|fam_history |smoking |criterion |
|--:|:---|---:|:-----------|:-------|:---------|
| 91|f   |  51|No          |FALSE   |?         |
| 92|m   |  47|No          |TRUE    |?         |
| 93|f   |  39|Yes         |TRUE    |?         |
| 94|f   |  51|Yes         |TRUE    |?         |
| 95|f   |  50|No          |TRUE    |?         |
]

---

.pull-left4[
<br><br>

# Overfitting

<ul>
  <li class="m1"><span>Occurs when a model <high>fits data too closely</high> and therefore <high>fails to reliably predict</high> future observations.</span></li><br><br>
  <li class="m2"><span>In other words, overfitting occurs when a model <high>'mistakes' random noise for a predictable signal</high>.</span></li><br><br>
  <li class="m3"><span>More <high>complex models</high> are more <high>prone to overfitting</high>.</span></li>
</ul>
]

.pull-right5[
<br><br><br>
<p align = "center" style="padding-top:0px">
<img src="image/overfitting.png">
</p>
]

---

# Overfitting

<img src="Prediction_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Training

<ul>
  <li class="m1"><span>Training a model means to <high>fit the model</high> to data by finding the parameter combination that <high>minimizes some error function</high>, e.g., mean squared error (MSE).</span></li><br><br>
</ul>

<p align = "center">
<img src="image/training_flow.png" height=350px>
</p>

---

# Test

<ul style="margin-bottom:-20px">
  <li class="m1"><span>To test a model means to <high>evaluate the prediction error</high> for a fitted model, i.e., for a <high>fixed parameter combination</high>.</span></li><br><br>
</ul>

<p align = "center">
<img src="image/testing_flow.png" height=350px>
</p>

---

class: center, middle

# Two new models enter the ring...
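---

# Overfitting in action

A minimal sketch (simulated data, not from the course materials) of the training/test logic from the previous slides: a very flexible model fits the training data better than a simple one, but typically predicts unseen test data worse.

```r
set.seed(1)

# Simulate data: a linear signal plus noise
dat <- data.frame(x = runif(100, 0, 10))
dat$y <- 2 * dat$x + rnorm(100, sd = 4)

# Split into 80% training and 20% test data
train_id  <- sample(nrow(dat), size = 80)
dat_train <- dat[train_id, ]
dat_test  <- dat[-train_id, ]

# Fit a simple and a (too) flexible model to the training data
mod_simple   <- lm(y ~ x, data = dat_train)
mod_flexible <- lm(y ~ poly(x, 15), data = dat_train)

# Mean squared error helper
mse <- function(mod, data) mean((data$y - predict(mod, newdata = data))^2)

mse(mod_flexible, dat_train) < mse(mod_simple, dat_train)  # TRUE: lower training error
mse(mod_flexible, dat_test) > mse(mod_simple, dat_test)    # typically TRUE: higher test error
```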
---

class: center, middle

<font color = "gray"><h1>Regression</h1></font>
<high><h1>Decision Trees</h1></high>
<font color = "gray"><h1>Random Forests</h1></font>

---

# CART

.pull-left45[
<ul>
  <li class="m1"><span>CART is short for <high>Classification and Regression Trees</high>, which are often simply called <high>decision trees</high>.</span></li><br>
  <li class="m2"><span>In <a href="https://en.wikipedia.org/wiki/Decision_tree">decision trees</a>, the criterion is modeled as a <high>sequence of logical TRUE or FALSE questions</high>.</span></li><br><br>
</ul>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/tree.png">
</p>
]

---

# Classification trees

.pull-left45[
<ul>
  <li class="m1"><span>Classification trees (and regression trees) are created using a relatively simple <high>three-step algorithm</high>.</span></li><br>
  <li class="m2"><span>Algorithm: <br><br>
    <ul class="level">
      <li><span>1 - <high>Split</high> nodes to maximize <b>purity gain</b> (e.g., Gini gain).</span></li><br>
      <li><span>2 - <high>Repeat</high> until a stopping criterion (e.g., <mono>minsplit</mono>) makes further splits impossible.</span></li><br>
      <li><span>3 - <high>Prune</high> tree to a reasonable size.</span></li>
    </ul>
  </span></li>
</ul>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/tree.png">
</p>
]

---

# Node splitting

.pull-left45[
<ul>
  <li class="m1"><span>Classification trees attempt to <high>minimize node impurity</high> using, e.g., the <high>Gini coefficient</high>.</span></li>
</ul>

`$$\large Gini(S) = 1 - \sum_{j=1}^{k}p_j^2$$`

<ul>
  <li class="m2"><span>Nodes are <high>split</high> using the variable and split value that <high>maximize Gini gain</high>.</span></li>
</ul>

`$$Gini \; gain = Gini(S) - Gini(A,S)$$`

<p style="padding:0;margin:0" align="center">with</p>

`$$Gini(A, S) = \sum_i \frac{n_i}{n}Gini(S_i)$$`
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/splitting.png">
</p>
]

---

# Pruning trees

.pull-left45[
<ul>
  <li class="m1"><span>Classification trees are <high>pruned</high> back such that every split has a purity gain of at least <high><mono>cp</mono></high>, with <mono>cp</mono> often set to <mono>.01</mono>.</span></li>
  <li class="m2"><span>Minimize:</span></li>
</ul>
<br>

$$
\large
`\begin{split}
Loss = & \; Impurity \, + \\
& \; cp*(n\:terminal\:nodes)\\
\end{split}`
$$
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/splitting.png">
</p>
]

---

# Regression trees

.pull-left45[
<ul>
  <li class="m1"><span>Trees can also be used to perform regression tasks.
Instead of impurity, regression trees attempt to <high>minimize within-node variance</high>.</span></li><br>
</ul>

`$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$`

<ul>
  <li class="m2"><span>Algorithm: <br><br>
    <ul class="level">
      <li><span>1 - <high>Split</high> nodes to maximize <b>homogeneity gain</b>.</span></li><br>
      <li><span>2 - <high>Repeat</high> until a stopping criterion (e.g., <mono>minsplit</mono>) makes further splits impossible.</span></li><br>
      <li><span>3 - <high>Prune</high> tree to a reasonable size.</span></li>
    </ul>
  </span></li>
</ul>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/splitting_regr.png">
</p>
]

---

# CART in <mono>caret</mono>

.pull-left4[
<ul>
  <li class="m1"><span>Fit <high>decision trees</high> in <mono>caret</mono> using <mono>method = "rpart"</mono>.</span></li>
  <li class="m2"><span><mono>caret</mono> will <high>automatically choose</high> between classification and regression trees depending on whether the criterion is a <mono>factor</mono> or not.</span></li>
</ul>
]

.pull-right45[

```r
# Fit a decision tree predicting default
train(form = default ~ .,
      data = Loans,
      method = "rpart",  # Decision tree
      trControl = ctrl)

# Fit a decision tree predicting income
train(form = income ~ .,
      data = baselers,
      method = "rpart",  # Decision tree
      trControl = ctrl)
```
]

---

class: center, middle

<font color = "gray"><h1>Regression</h1></font>
<font color = "gray"><h1>Decision Trees</h1></font>
<high><h1>Random Forests</h1></high>

---

.pull-left45[

# Random Forest

<p style="padding-top:1px"></p>

<ul>
  <li class="m1"><span>In a <a href="https://en.wikipedia.org/wiki/Random_forest">Random Forest</a>, the criterion is modeled as the <high>aggregate prediction of a large number of decision trees</high>, each based on different features.</span></li><br>
  <li class="m2"><span>Algorithm: <br><br>
    <ul class="level">
      <li><span>1 - <high>Repeat</high> <i>n</i> times:<br>
        &emsp;1 - <high>Resample</high> data.<br>
        &emsp;2 - <high>Grow</high> a non-pruned decision tree, <high>considering only <i>m</i> features</high> at each split.</span></li><br>
      <li><span>2 - <high>Average</high> fitted values.</span></li><br>
    </ul>
  </span></li>
</ul>
]

.pull-right45[
<br>
<p align = "center" style="padding-top:0px">
<img src="image/rf.png">
</p>
]

---

# Random Forest

.pull-left45[
<p style="padding-top:1px"></p>

<ul>
  <li class="m1"><span>Random forests make use of two important machine learning elements, <high>resampling</high> and <high>averaging</high>, which together are also referred to as <high>bagging</high>.</span></li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
  <td bgcolor="white"><b>Element</b></td>
  <td bgcolor="white"><b>Description</b></td>
</tr>
<tr>
  <td bgcolor="white"><i>Resampling</i></td>
  <td bgcolor="white">Creates new data sets that vary in their composition, thereby <high>deemphasizing idiosyncrasies</high> of the available data.</td>
</tr>
<tr>
  <td bgcolor="white"><i>Averaging</i></td>
  <td bgcolor="white">Combining predictions typically <high>evens out idiosyncrasies</high> of the models created from single data sets.
  </td>
</tr>
</table>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/tree_crowd.png">
</p>
]

---

# Random forests in <mono>caret</mono>

.pull-left4[
<ul>
  <li class="m1"><span>Fit <high>random forests</high> in <mono>caret</mono> using <mono>method = "rf"</mono>.</span></li><br>
  <li class="m2"><span>Just like CART, random forests can be used for <high>classification or regression</high>.</span></li><br>
  <li class="m3"><span><mono>caret</mono> will <high>automatically choose</high> between classification and regression depending on whether the criterion is a <mono>factor</mono> or not.</span></li>
</ul>
]

.pull-right45[

```r
# Fit a random forest predicting default
train(form = default ~ .,
      data = Loans,
      method = "rf",  # Random forest
      trControl = ctrl)

# Fit a random forest predicting income
train(form = income ~ .,
      data = baselers,
      method = "rf",  # Random forest
      trControl = ctrl)
```
]

---

class: center, middle

<br><br>

# Evaluating model predictions with <mono>caret</mono>

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/09/Caret-package-in-R.png" width="70%" style="display: block; margin: auto;" />

---

# <mono>createDataPartition()</mono>

.pull-left4[
<ul>
  <li class="m1"><span>Use <mono>createDataPartition()</mono> to <high>split a dataset</high> into separate training and test datasets.</span></li>
</ul>
<br>

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
  <td bgcolor="white"><b>Argument</b></td>
  <td bgcolor="white"><b>Description</b></td>
</tr>
<tr>
  <td bgcolor="white"><mono>y</mono></td>
  <td bgcolor="white">The criterion. Used to create a <high>balanced split</high>.</td>
</tr>
<tr>
  <td bgcolor="white"><mono>p</mono></td>
  <td bgcolor="white">The <high>proportion of data</high> going into the training set. Often <mono>.8</mono> or <mono>.5</mono>.</td>
</tr>
</table>
]

.pull-right5[

```r
# Set the randomisation seed to get the
#  same results each time
set.seed(100)

# Get indices for training
index <- createDataPartition(y = baselers$income,
                             p = .8,
                             list = FALSE)

# Create training data
baselers_train <- baselers %>%
  slice(index)

# Create test data
baselers_test <- baselers %>%
  slice(-index)
```
]

---

# <mono>predict(, newdata)</mono>

.pull-left4[
<ul>
  <li class="m1"><span>To <high>test model predictions</high> with <mono>caret</mono>, all you need to do is get a vector of predictions for a new dataframe <mono>newdata</mono> using the <mono>predict()</mono> function.</span></li>
</ul>
<br>

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
  <td bgcolor="white"><b>Argument</b></td>
  <td bgcolor="white"><b>Description</b></td>
</tr>
<tr>
  <td bgcolor="white"><mono>object</mono></td>
  <td bgcolor="white"><mono>caret</mono> fit object.</td>
</tr>
<tr>
  <td bgcolor="white"><mono>newdata</mono></td>
  <td bgcolor="white">Test data set. Must contain the same features as provided in <mono>object</mono>.</td>
</tr>
</table>
]

.pull-right5[

```r
# Fit model to training data
mod <- train(form = income ~ .,
             method = "glm",
             data = baselers_train)

# Get fitted values (for training data)
mod_fit <- predict(mod)

# Predictions for the NEW test data!
mod_pred <- predict(mod,
                    newdata = baselers_test)

# Evaluate prediction results
postResample(pred = mod_pred,
             obs = baselers_test$income)
```
]

---

class: middle, center

<h1><a href="https://therbootcamp.github.io/AML_2020AMLD/_sessions/Prediction/Prediction_practical.html">Practical</a></h1>
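---

# Bonus: The full prediction pipeline

A minimal end-to-end sketch combining the steps from this session, assuming the course's <mono>baselers</mono> data is available; the <mono>ctrl</mono> object here (10-fold cross-validation) is one reasonable choice, not necessarily the settings used in the practical.

```r
library(tidyverse)
library(caret)

set.seed(100)

# 1 - Split baselers into training and test sets
index <- createDataPartition(y = baselers$income, p = .8, list = FALSE)
baselers_train <- baselers %>% slice(index)
baselers_test  <- baselers %>% slice(-index)

# 2 - Fit a random forest to the training data
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
mod <- train(form = income ~ .,
             data = baselers_train,
             method = "rf",        # Random forest
             trControl = ctrl)

# 3 - Predict the unseen test data and evaluate
mod_pred <- predict(mod, newdata = baselers_test)
postResample(pred = mod_pred, obs = baselers_test$income)
```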