class: center, middle, inverse, title-slide

# Prediction
### Machine Learning with R
Basel R Bootcamp
### October 2019

---
layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
www.therbootcamp.com
</font>
</span>
</a>
<a href="https://therbootcamp.github.io/">
<font color="#7E7E7E">
Machine Learning with R | October 2019
</font>
</a>
</span>
</div>

---

# Prediction is...

.pull-left45[

<p>
<font style="font-size:32px"><i>Prediction is very difficult, especially if it's about the future.</i></font>
<br><br>
Niels Bohr, Nobel Laureate in Physics
<br><br>
<font style="font-size:32px"><i>An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today.</i></font>
<br><br>
Evan Esar, Humorist
</p>

]

.pull-right45[

<p align = "center">
<img src="image/bohr.jpg"><br>
<font style="font-size:10px">from <a href="https://futurism.com/know-your-scientist-niels-bohr-the-father-of-the-atom">futurism.com</a></font>
</p>

]

---

# Hold-out data

.pull-left45[

Model performance must be evaluated as true prediction on an <high>unseen data set</high>.

The unseen data set can be <high>naturally</high> occurring, e.g., using 2019 stock prices to evaluate a model fit using 2018 stock prices.

More commonly, unseen data is created by <high>splitting the available data</high> into a training set and a test set.

]

.pull-right45[

<p align = "center">
<img src="image/testdata.png" height=430px>
</p>

]

---

# Training

Training a model means <high>fitting the model</high> to data by finding the parameter combination that <high>minimizes some error function</high>, e.g., mean squared error (MSE).

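As a minimal sketch of what "finding the parameter combination" can look like, the following base R code minimizes the MSE of a straight-line model numerically with `optim()`. The simulated data, the `mse()` helper, and the starting values are assumptions made for illustration; they are not part of the course materials.

```r
# Simulate hypothetical data: y depends linearly on x plus noise
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)

# Error function: MSE for a given parameter combination
# par[1] = intercept, par[2] = slope
mse <- function(par, x, y) {
  pred <- par[1] + par[2] * x
  mean((y - pred)^2)
}

# "Training": search for the parameters that minimize the MSE
fit <- optim(par = c(0, 0), fn = mse, x = x, y = y, method = "BFGS")

fit$par    # estimated intercept and slope
fit$value  # MSE at the minimum

# For comparison: lm() solves the same least-squares problem analytically
coef(lm(y ~ x))
```

For a linear model, `lm()` finds the same minimum analytically, so the two sets of estimates should agree closely.
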
<p align = "center" style="padding-top:30px">
<img src="image/training_flow.png" height=350px>
</p>

---

# Test

To test a model means to <high>evaluate the prediction error</high> for a fitted model, i.e., for a <high>fixed parameter combination</high>.

<p align = "center" style="padding-top:30px">
<img src="image/testing_flow.png" height=350px>
</p>

---

# Why do we separate training from testing?

<br>
<p align = "center"><font size = 6><i>"Can you come up with a model that will perfectly fit the training criterion but is worthless in predicting test data?"</i></font><br><br></p>

.pull-left45[

<high>Training data</high>
<br>

| id|sex | age|fam_history |smoking | criterion|
|--:|:---|---:|:-----------|:-------|---------:|
|  1|f   |  45|No          |TRUE    |         0|
|  2|m   |  43|No          |FALSE   |         0|
|  3|f   |  40|Yes         |FALSE   |         1|
|  4|f   |  51|Yes         |TRUE    |         1|
|  5|m   |  44|Yes         |FALSE   |         0|

]

.pull-right45[

<high> Test data</high>
<br>

| id|sex | age|fam_history |smoking |criterion |
|--:|:---|---:|:-----------|:-------|:---------|
| 91|f   |  51|No          |FALSE   |?         |
| 92|m   |  47|No          |TRUE    |?         |
| 93|f   |  39|Yes         |TRUE    |?         |
| 94|f   |  51|Yes         |TRUE    |?         |
| 95|f   |  50|No          |TRUE    |?         |

]

---

.pull-left4[

<br><br>

# Overfitting

Occurs when a model <high>fits data too closely</high> and therefore <high>fails to reliably predict</high> future observations.

In other words, overfitting occurs when a model <high>'mistakes' random noise for a predictable signal</high>.

More <high>complex models</high> are more <high>prone to overfitting</high>.

]

.pull-right5[

<br><br><br>
<p align = "center" style="padding-top:0px">
<img src="image/overfitting.png">
</p>

]

---

# Overfitting

<img src="Prediction_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: center, middle

# Two new models enter the ring...

---
class: center, middle

<font color = "gray"><h1>Regression</h1></font>

<high><h1>Decision Trees</h1></high>

<font color = "gray"><h1>Random Forests</h1></font>

---

# CART

.pull-left45[

CART is short for <high>Classification and Regression Trees</high>, which are often just called <high>decision trees</high>.

In [decision trees](https://en.wikipedia.org/wiki/Decision_tree), the criterion is modeled as a <high>sequence of logical TRUE or FALSE questions</high>.

<br><br>

]

.pull-right45[

<p align = "center" style="padding-top:0px">
<img src="image/tree.png">
</p>

]

---

# Classification trees

.pull-left45[

Classification trees (and regression trees) are created using a relatively simple <high>three-step algorithm</high>.

<u>Algorithm</u>

1 - <high>Split</high> nodes to maximize <b>purity gain</b> (e.g., Gini gain).

2 - <high>Repeat</high> until no further splits are possible under a pre-defined threshold (e.g., `minsplit`).

3 - <high>Prune</high> tree to reasonable size.

]

.pull-right45[

<p align = "center" style="padding-top:0px">
<img src="image/tree.png">
</p>

]

---

# Node splitting

.pull-left45[

Classification trees attempt to <high>minimize node impurity</high> using, e.g., the <high>Gini impurity</high>.

`$$\large Gini(S) = 1 - \sum_{j=1}^{k} p_j^2$$`

Nodes are <high>split</high> using the variable and split value that <high>maximizes Gini gain</high>.

`$$Gini \; gain = Gini(S) - Gini(A,S)$$`

with

`$$Gini(A, S) = \sum_i \frac{n_i}{n}Gini(S_i)$$`

]

.pull-right45[

<p align = "center" style="padding-top:0px">
<img src="image/splitting.png">
</p>

]

---

# Pruning trees

.pull-left45[

Classification trees are <high>pruned</high> back such that every split has a purity gain of at least <high><mono>cp</mono></high>, with `cp` typically set to `.01`.

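To make the `cp` threshold concrete, here is a small sketch that computes the Gini gain of one candidate split with the formulas from the node-splitting slide and compares it to `cp = .01`. The class labels and the `gini()` helper are hypothetical, and the comparison follows this slide's simplified description; `rpart` itself applies `cp` relative to the lack of fit of the root node, so its numbers will differ.

```r
# Gini impurity of a node, given the class labels of the cases in it
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p_j
  1 - sum(p^2)
}

# Hypothetical parent node S: 10 cases, 5 per class
parent <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Hypothetical candidate split A into two child nodes
left  <- c(0, 0, 0, 0, 1)  # mostly class 0
right <- c(0, 1, 1, 1, 1)  # mostly class 1

# Weighted impurity after the split: Gini(A, S) = sum(n_i / n * Gini(S_i))
n <- length(parent)
gini_split <- length(left) / n * gini(left) +
  length(right) / n * gini(right)

# Gini gain = Gini(S) - Gini(A, S)
gini_gain <- gini(parent) - gini_split

# Simplified pruning check: does the split gain at least cp?
cp <- .01
gini_gain >= cp
```
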
Minimize:

<br>

$$
\large
`\begin{split}
Loss = & Impurity\,+\\
&cp*(n\:terminal\:nodes)\\
\end{split}`
$$

]

.pull-right45[

<p align = "center" style="padding-top:0px">
<img src="image/splitting.png">
</p>

]

---

# Regression trees

.pull-left45[

Trees can also be used to perform regression tasks.

Instead of impurity, regression trees attempt to <high>minimize within-node variance</high> (or maximize node homogeneity):

`$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$`

<u>Algorithm</u>

1 - <high>Split</high> nodes to maximize <b>homogeneity gain</b>.

2 - <high>Repeat</high> until no further splits are possible under a pre-defined threshold (e.g., `minsplit`).

3 - <high>Prune</high> tree to reasonable size.

]

.pull-right45[

<p align = "center" style="padding-top:0px">
<img src="image/splitting_regr.png">
</p>

]

---

# CART in <mono>caret</mono>

.pull-left4[

Fit <high>decision trees</high> in `caret` using `method = "rpart"`.

`caret` will <high>choose automatically</high> whether to use classification or regression trees depending on whether the criterion is a `factor` or not.

]

.pull-right45[

```r
# Assumes caret is loaded and ctrl was created with trainControl()

# Fit a decision tree predicting default
train(form = default ~ .,
      data = Loans,
      method = "rpart", # Decision Tree
      trControl = ctrl)

# Fit a decision tree predicting income
train(form = income ~ .,
      data = baselers,
      method = "rpart", # Decision Tree
      trControl = ctrl)
```

]

---
class: center, middle

<font color = "gray"><h1>Regression</h1></font>

<font color = "gray"><h1>Decision Trees</h1></font>

<high><h1>Random Forests</h1></high>

---

.pull-left45[

# Random Forest

<p style="padding-top:1px"></p>

In [Random Forest](https://en.wikipedia.org/wiki/Random_forest), the criterion is modeled as the <high>aggregate prediction of a large number of decision trees</high>, each based on different features.

<br>

<u>Algorithm</u>

1 - <high>Repeat</high> *n* times

  1 - <high>Resample</high> data

  2 - <high>Grow</high> non-pruned decision tree

   Each split <high>considers only <i>m</i> features</high>

2 - <high>Average</high> fitted values

]

.pull-right45[

<br>
<p align = "center" style="padding-top:0px">
<img src="image/rf.png">
</p>

]

---

# Random Forest

.pull-left45[

<p style="padding-top:1px"></p>

Random forests make use of two important machine learning elements, <high>resampling</high> and <high>averaging</high>, which together are also referred to as <high>bagging</high>.

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white">
<b>Element</b>
</td>
<td bgcolor="white">
<b>Description</b>
</td>
</tr>
<tr>
<td bgcolor="white">
<i>Resampling</i>
</td>
<td bgcolor="white">
Creates new data sets that vary in their composition, thereby <high>deemphasizing idiosyncrasies</high> of the available data.
</td>
</tr>
<tr>
<td bgcolor="white">
<i>Averaging</i>
</td>
<td bgcolor="white">
Combining predictions typically <high>evens out idiosyncrasies</high> of the models created from single data sets.
</td>
</tr>
</table>

]

.pull-right45[

<p align = "center" style="padding-top:0px">
<img src="image/tree_crowd.png">
</p>

]

---

# Random forests in <mono>caret</mono>

.pull-left4[

Fit <high>random forests</high> in `caret` using `method = "rf"`.

Just like CART, random forests can be used for <high>classification or regression</high>.

`caret` will <high>choose automatically</high> whether to use classification or regression trees depending on whether the criterion is a `factor` or not.

]

.pull-right45[

```r
# Fit a random forest predicting default
train(form = default ~ .,
      data = Loans,
      method = "rf", # Random Forest
      trControl = ctrl)

# Fit a random forest predicting income
train(form = income ~ .,
      data = baselers,
      method = "rf", # Random Forest
      trControl = ctrl)
```

]

---
class: center, middle

<br><br>

# Evaluating model predictions with caret

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/09/Caret-package-in-R.png" width="70%" style="display: block; margin: auto;" />

---

# <mono>createDataPartition()</mono>

.pull-left4[

Use `createDataPartition()` to <high>split a dataset</high> into separate training and test datasets.

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white">
<b>Argument</b>
</td>
<td bgcolor="white">
<b>Description</b>
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>y</mono>
</td>
<td bgcolor="white">
The criterion. Used to create a <high>balanced split</high>.
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>p</mono>
</td>
<td bgcolor="white">
The <high>proportion of data</high> going into the training set. Often <mono>.8</mono> or <mono>.5</mono>.
</td>
</tr>
</table>

]

.pull-right5[

```r
# Set the randomisation seed to get the
# same results each time
set.seed(100)

# Get indices for training
index <- createDataPartition(y = baselers$income,
                             p = .8,
                             list = FALSE)

# Create training data
baselers_train <- baselers %>% slice(index)

# Create test data
baselers_test <- baselers %>% slice(-index)
```

]

---

# <mono>predict(, newdata)</mono>

.pull-left4[

To <high>test model predictions</high> with `caret`, all you need to do is get a vector of predictions for a new data frame `newdata` using the `predict()` function:

<table style="cellspacing:0; cellpadding:0; border:none;">
<col width="30%">
<col width="70%">
<tr>
<td bgcolor="white">
<b>Argument</b>
</td>
<td bgcolor="white">
<b>Description</b>
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>object</mono>
</td>
<td bgcolor="white">
<mono>caret</mono> fit object.
</td>
</tr>
<tr>
<td bgcolor="white">
<mono>newdata</mono>
</td>
<td bgcolor="white">
Test data set. Must contain the same features as were used to fit <mono>object</mono>.
</td>
</tr>
</table>

]

.pull-right5[

```r
# Fit model to training data
mod <- train(form = income ~ .,
             method = "glm",
             data = baselers_train)

# Get fitted values (for training data)
mod_fit <- predict(mod)

# Predictions for NEW test data!
mod_pred <- predict(mod,
                    newdata = baselers_test)

# Evaluate prediction results
postResample(pred = mod_pred,
             obs = baselers_test$income)
```

]

---
class: middle, center

<h1><a href=https://therbootcamp.github.io/ML_2019Oct/_sessions/Prediction/Prediction_practical.html>Practical</a></h1>
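
---

# Training vs. test error by hand

A minimal base R sketch of the hold-out logic from the `createDataPartition()` and `predict()` slides, written without `caret`. The built-in `mtcars` data, the predictors `wt` and `hp`, and the `rmse()` helper are stand-ins chosen for illustration, not the course data.

```r
# Built-in data stands in for the course data (assumption)
set.seed(100)
n <- nrow(mtcars)

# Randomly assign 80% of the rows to the training set
train_idx <- sample(n, size = round(.8 * n))
cars_train <- mtcars[train_idx, ]
cars_test  <- mtcars[-train_idx, ]

# Fit the model on the training data only
mod <- lm(mpg ~ wt + hp, data = cars_train)

# Root mean squared error helper
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

# Fitting error (training data) vs. prediction error (test data)
rmse_train <- rmse(predict(mod), cars_train$mpg)
rmse_test  <- rmse(predict(mod, newdata = cars_test), cars_test$mpg)

c(train = rmse_train, test = rmse_test)
```

The test error is typically, though not always, larger than the training error; with only a handful of test cases, as here, the estimate is noisy.
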