class: center, middle, inverse, title-slide

# Prediction
### Machine Learning with R
The R Bootcamp @ DHLab
### November 2020

---
layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
www.therbootcamp.com
</font>
</span>
</a>
<a href="https://therbootcamp.github.io/">
<font color="#7E7E7E">
Machine Learning with R | November 2020
</font>
</a>
</span>
</div>

---

# Prediction is...

.pull-left45[
<p>
<font style="font-size:32px"><i>Prediction is very difficult, especially if it's about the future.</i></font>
<br><br>
Niels Bohr, Nobel Laureate in Physics
<br><br>
<font style="font-size:32px"><i>An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today.</i></font>
<br><br>
Evan Esar, Humorist
</p>
]

.pull-right45[
<p align = "center">
<img src="image/bohr.jpg"><br>
<font style="font-size:10px">from <a href="https://futurism.com/know-your-scientist-niels-bohr-the-father-of-the-atom">futurism.com</a></font>
</p>
]

---

# Hold-out data

.pull-left45[
<ul>
<li class="m1"><span>Model performance must be evaluated as true prediction on an <high>unseen data set</high>.</span></li><br>
<li class="m2"><span>The unseen data set can be <high>naturally</high> occurring.</span></li>
<ul class="level">
<li><span>e.g.,
using 2019 stock prices to evaluate a model fit on 2018 stock prices</span></li>
</ul><br>
<li class="m3"><span>More commonly, unseen data is created by <high>splitting the available data</high> into a training set and a test set.</span></li>
</ul>
]

.pull-right45[
<p align = "center">
<img src="image/testdata.png" height=430px>
</p>
]

---

# Training

<p align = "center" style="padding-top:30px;padding-left:40px">
<img src="image/training_flow.png" height=400px>
</p>

---

# Test

<p align = "center" style="padding-top:30px;padding-left:40px">
<img src="image/testing_flow.png" height=400px>
</p>

---

.pull-left4[
<br><br>
# Overfitting
<ul>
<li class="m1"><span>Occurs when a model <high>fits the data too closely</high> and therefore fails to reliably predict future observations.</span></li><br>
<li class="m2"><span>Overfitting occurs when a model 'mistakes' random <high>noise</high> for a predictable <high>signal</high>.</span></li><br>
<li class="m3"><span>More <high>complex models</high> are more prone to overfitting.</span></li>
</ul>
]

.pull-right5[
<br><br><br>
<p align = "center" style="padding-top:0px">
<img src="image/overfitting.png">
</p>
]

---

# Overfitting

<img src="Prediction_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

class: center, middle

<font color = "gray"><h1>Regression</h1></font>
<high><h1>Decision Trees</h1></high>
<font color = "gray"><h1>Random Forests</h1></font>

---

# CART

.pull-left45[
<ul>
<li class="m1"><span>CART is short for <high>Classification and Regression Trees</high>, which are often simply called decision trees.</span></li><br>
<li class="m2"><span>The criterion is modeled as a sequence of <high>logical TRUE or FALSE questions</high>.</span></li>
</ul>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/tree.png">
</p>
]

---

# Classification trees

.pull-left45[
<ul>
<li class="m1"><span>Classification trees (and regression trees) are created using a relatively
simple <high>three-step algorithm</high>:</span></li><br>
<ul>
<li><span>1 - <high>Split</high> nodes to maximize <b>purity gain</b> (e.g., Gini gain).</span></li><br>
<li><span>2 - <high>Repeat</high> until splits are no longer possible.</span></li><br>
<li><span>3 - <high>Prune</high> tree to a reasonable size.</span></li>
</ul>
</ul>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/tree.png">
</p>
]

---

# Node splitting

.pull-left45[
<ul>
<li class="m1"><span>Classification trees attempt to <high>minimize node impurity</high> using, e.g., the <high>Gini coefficient</high>.</span></li>
</ul>

`$$\large Gini(S) = 1 - \sum_j^kp_j^2$$`

<ul>
<li class="m2"><span>Nodes are <high>split</high> using the variable and split value that <high>maximize Gini gain</high>.</span></li>
</ul>

`$$Gini \; gain = Gini(S) - Gini(A,S)$$`

with

`$$Gini(A, S) = \sum \frac{n_i}{n}Gini(S_i)$$`
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/splitting.png">
</p>
]

---

# Pruning trees

.pull-left45[
<ul>
<li class="m1"><span>Classification trees are <high>pruned</high> back such that every split has a purity gain of at least <high><mono>cp</mono></high>.</span></li>
</ul>
<br>

$$
\large
`\begin{split}
Loss = & Impurity\,+\\
&cp*(n\:terminal\:nodes)\\
\end{split}`
$$
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/splitting.png">
</p>
]

---

# Regression trees

.pull-left45[
<ul>
<li class="m1"><span>Trees can also be used to perform regression tasks.
Instead of impurity, regression trees attempt to <high>minimize within-node variance</high> (or maximize node homogeneity):</span></li>
</ul>

`$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$`

<ul>
<li class="m2"><span>Algorithm:</span></li>
<ul>
<li><span>1 - <high>Split</high> nodes to maximize <b>homogeneity gain</b>.</span></li><br>
<li><span>2 - <high>Repeat</high> until splits are no longer possible.</span></li><br>
<li><span>3 - <high>Prune</high> tree to a reasonable size.</span></li>
</ul>
</ul>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/splitting_regr.png">
</p>
]

---

# CART in <mono>caret</mono>

.pull-left4[
<ul>
<li class="m1"><span>Fit <high>decision trees</high> in <mono>caret</mono> using <mono>method = "rpart"</mono>.</span></li>
<li class="m2"><span><mono>caret</mono> will <high>automatically choose</high> whether to fit a classification or a regression tree depending on whether the criterion is a <mono>factor</mono> or not.</span></li>
</ul>
]

.pull-right45[
```r
# Fit a decision tree predicting default
train(form = default ~ .,  # factor
      data = Loans,
      method = "rpart",    # decision tree
      trControl = ctrl)

# Fit a decision tree predicting income
train(form = income ~ .,   # continuous
      data = basel,
      method = "rpart",    # decision tree
      trControl = ctrl)
```
]

---

class: center, middle

<font color = "gray"><h1>Regression</h1></font>
<font color = "gray"><h1>Decision Trees</h1></font>
<high><h1>Random Forests</h1></high>

---

.pull-left45[

# Random Forest

<ul>
<li class="m1"><span>In Random Forests the criterion is modeled as the <high>aggregate prediction</high> of <high>many decision trees</high>, each based on different features.</span></li><br>
<li class="m2"><span>Algorithm:</span></li>
<ul>
<li><span>1 - <high>Repeat</high> <mono>n</mono> times.</span></li><br>
<ul>
<li><span>1 - <high>Resample</high> data.</span></li><br>
<li><span>2 - At each split, <high>consider only <i>m</i>
features</high>.</span></li><br>
</ul>
<li><span>2 - <high>Average</high> predictions.</span></li><br>
</ul>
</ul>
]

.pull-right45[
<br>
<p align = "center" style="padding-top:0px">
<img src="image/rf.png">
</p>
]

---

# Random Forest

.pull-left45[
<p style="padding-top:1px"></p>
<ul>
<li class="m1"><span>Random forests make use of <high>bagging</high>, which consists of <high>resampling</high> and <high>averaging</high>.</span></li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white"> <b>Element</b> </td>
<td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
<td bgcolor="white"> <i>Resampling</i> </td>
<td bgcolor="white"> Creates new data sets that vary in their composition, thereby <high>de-emphasizing idiosyncrasies</high> of the available data. </td>
</tr>
<tr>
<td bgcolor="white"> <i>Averaging</i> </td>
<td bgcolor="white"> Combining predictions typically <high>evens out idiosyncrasies</high> of the models created from single data sets.
</td>
</tr>
</table>
]

.pull-right45[
<p align = "center" style="padding-top:0px">
<img src="image/tree_crowd.png">
</p>
]

---

# Random forests in <mono>caret</mono>

.pull-left4[
<ul>
<li class="m1"><span>Fit a Random Forest in <mono>caret</mono> with <highm>method = "rf"</highm>.</span></li>
<li class="m2"><span><mono>caret</mono> will <high>automatically choose</high> whether to grow classification or regression trees for the Random Forest depending on whether the criterion is a <mono>factor</mono> or not.</span></li>
</ul>
]

.pull-right45[
```r
# Fit a random forest predicting default
train(form = default ~ .,
      data = Loans,
      method = "rf",   # random forest
      trControl = ctrl)

# Fit a random forest predicting income
train(form = income ~ .,
      data = basel,
      method = "rf",   # random forest
      trControl = ctrl)
```
]

---

class: center, middle

<h1><a>Evaluating model predictions with <mono>caret</mono></a></h1>

<!---

# <mono>createDataPartition()</mono>

.pull-left4[

Use `createDataPartition()` to <high>split a dataset</high> into separate training and test datasets.

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white"> <b>Argument</b> </td>
<td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
<td bgcolor="white"> <mono>y</mono> </td>
<td bgcolor="white"> The criterion. Used to create a <high>balanced split</high>. </td>
</tr>
<tr>
<td bgcolor="white"> <mono>p</mono> </td>
<td bgcolor="white"> The <high>proportion of data</high> going into the training set. Often <mono>.8</mono> or <mono>.5</mono>.
</td>
</tr>
</table>
]

.pull-right5[
```r
# Set the randomisation seed to get the
# same results each time
set.seed(100)

# Get indices for training
index <- createDataPartition(y = basel$income,
                             p = .8,
                             list = FALSE)

# Create training data
basel_train <- basel %>%
  slice(index)

# Create test data
basel_test <- basel %>%
  slice(-index)
```
]

--->

---

# <mono>predict(, newdata)</mono>

.pull-left4[
<ul>
<li class="m1"><span>To <high>test model predictions</high>, you need to compute a vector of predictions for the test data (<mono>newdata</mono>) using the <mono>predict()</mono> function:</span></li>
</ul>

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white"> <b>Argument</b> </td>
<td bgcolor="white"> <b>Description</b> </td>
</tr>
<tr>
<td bgcolor="white"> <mono>object</mono> </td>
<td bgcolor="white"> <mono>caret</mono> fit object. </td>
</tr>
<tr>
<td bgcolor="white"> <mono>newdata</mono> </td>
<td bgcolor="white"> Test data set. Must contain the same features as provided in <mono>object</mono>. </td>
</tr>
</table>
]

.pull-right5[
```r
# Fit model to training data
mod <- train(form = income ~ .,
             method = "glm",
             data = basel_train)

# Get fitted values (for training data)
mod_fit <- predict(mod)

# Predictions for NEW basel_test data!
mod_pred <- predict(mod,
                    newdata = basel_test)

# Evaluate prediction results
postResample(pred = mod_pred,
             obs = basel_test$income)
```
]

---

class: middle, center

<h1><a href=https://therbootcamp.github.io/ML-DHLab/_sessions/Prediction/Prediction_practical.html>Practical</a></h1>
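
---

# Appendix: Gini gain by hand

The Gini impurity and Gini gain formulas from the Node splitting slide can be illustrated in a few lines of base R. This is a sketch for intuition only; `gini()` and `gini_gain()` are hypothetical helpers written for this slide, not functions from <mono>caret</mono> or <mono>rpart</mono>.

```r
# Gini impurity of a set of class labels:
# Gini(S) = 1 - sum_j p_j^2
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Gini gain of splitting S into child nodes S1 and S2:
# Gini(S) - sum_i (n_i / n) * Gini(S_i)
gini_gain <- function(S, S1, S2) {
  n <- length(S)
  gini(S) - (length(S1) / n * gini(S1) +
             length(S2) / n * gini(S2))
}

# Toy example: 10 loans, criterion default = yes/no
S  <- c("no", "no", "no", "no", "no", "no",
        "yes", "yes", "yes", "yes")
S1 <- S[1:6]   # left node after a candidate split
S2 <- S[7:10]  # right node

gini(S)              # 0.48 (6 no, 4 yes)
gini_gain(S, S1, S2) # 0.48: both children are pure
```

A split like this one, where both child nodes end up pure, achieves the maximum possible Gini gain, which is why the tree algorithm would prefer it over any impurity-preserving alternative.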