class: center, middle, inverse, title-slide # Prediction ### Applied Machine Learning with R
The R Bootcamp @ AMLD
### November 2021 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> www.therbootcamp.com </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> Applied Machine Learning with R @ AMLD | November 2021 </font> </a> </span> </div> --- # Predict hold-out data .pull-left45[ <ul> <li class="m1"><span>Model performance must be evaluated as true prediction on an <high>unseen data set</high>.</span></li> <li class="m2"><span>The unseen data set can be <high>naturally</high> occurring, e.g., using 2019 stock prices to evaluate a model fit using 2018 stock prices.</span></li> <li class="m3"><span>More commonly, unseen data is created by <high>splitting the available data</high> into a training set and a test set.</span></li> </ul> ] .pull-right45[ <p align = "center"> <img src="image/testdata.png" height=430px> </p> ] --- .pull-left4[ <br><br> # Overfitting <ul> <li class="m1"><span>Occurs when a model <high>fits data too closely</high> and therefore <high>fails to reliably predict</high> future observations.</span></li><br><br> <li class="m2"><span>In other words, overfitting occurs when a model <high>'mistakes' random noise for a predictable signal</high>.</span></li><br><br> <li class="m3"><span>More <high>complex models</high> are more <high>prone to overfitting</high>.</span></li> </ul> ] .pull-right5[ <br><br><br> <p align = "center" style="padding-top:0px"> <img src="image/overfitting.png"> </p> ] --- # Overfitting <img src="Prediction_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # Training <ul> <li class="m1"><span>Training a model means to <high>fit the model</high> to data by finding the parameter
combination that <high>minimizes some error function</high>, e.g., mean squared error (MSE).</span></li><br><br> </ul> <p align = "center"> <img src="image/training_flow.png" height=350px> </p> --- # Test <ul style="margin-bottom:-20px"> <li class="m1"><span>To test a model means to <high>evaluate the prediction error</high> for a fitted model, i.e., for a <high>fixed parameter combination</high>.</span></li><br><br> </ul> <p align = "center"> <img src="image/testing_flow.png" height=350px> </p> --- class: center, middle <font color = "gray"><h1>Regression</h1></font> <high><h1>Decision Trees</h1></high> <font color = "gray"><h1>Random Forests</h1></font> --- # CART .pull-left45[ <ul> <li class="m1"><span>CART is short for <high>Classification and Regression Trees</high>, which are often just called <high>decision trees</high>.</span></li><br> <li class="m2"><span>In <a href="https://en.wikipedia.org/wiki/Decision_tree">decision trees</a>, the criterion is modeled as a <high>sequence of logical TRUE or FALSE questions</high>.</span></li><br><br> </ul> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/tree.png"> </p> ] --- # Classification trees .pull-left45[ <ul> <li class="m1"><span>Classification trees (and regression trees) are created using a relatively simple <high>three-step algorithm</high>.</span></li><br> <li class="m2"><span>Algorithm: <br><br> <ul class="level"> <li><span>1 - <high>Split</high> nodes to maximize <b>purity gain</b> (e.g., Gini gain).</span></li><br> <li><span>2 - <high>Repeat</high> until further splits are no longer possible, given a pre-defined threshold (e.g., <mono>minsplit</mono>).</span></li><br> <li><span>3 - <high>Prune</high> tree to reasonable size.</span></li> </ul> </span></li> </ul> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/tree.png"> </p> ] --- # Node splitting .pull-left45[ <ul> <li class="m1"><span>Classification trees attempt to <high>minimize node impurity</high> using, e.g.,
the <high>Gini coefficient</high>.</span></li> </ul> `$$\large Gini(S) = 1 - \sum_j^kp_j^2$$` <ul> <li class="m2"><span>Nodes are <high>split</high> using the variable and split value that <high>maximizes Gini gain</high>.</span></li> </ul> `$$Gini \; gain = Gini(S) - Gini(A,S)$$` <p style="padding:0;margin:0" align="center">with</p> `$$Gini(A, S) = \sum \frac{n_i}{n}Gini(S_i)$$` ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/splitting.png"> </p> ] --- # Pruning trees .pull-left45[ <ul> <li class="m1"><span>Classification trees are <high>pruned</high> back such that every split has a purity gain of at least <high><mono>cp</mono></high>, with <mono>cp</mono> often set to <mono>.01</mono>.</span></li> <li class="m2"><span>Minimize:</span></li> </ul> <br> $$ \large `\begin{split} Loss = & Impurity\,+\\ &cp*(n\:terminal\:nodes)\\ \end{split}` $$ ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/splitting.png"> </p> ] --- # Regression trees .pull-left45[ <ul> <li class="m1"><span>Trees can also be used to perform regression tasks. 
Instead of impurity, regression trees attempt to <high>minimize within-node variance</high>.</span></li><br> </ul> `$$\large SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2+\sum_{i \in S_2}(y_i - \bar{y}_2)^2$$` <ul> <li class="m2"><span>Algorithm: <br><br> <ul class="level"> <li><span>1 - <high>Split</high> nodes to maximize <b>homogeneity gain</b>.</span></li><br> <li><span>2 - <high>Repeat</high> until further splits are no longer possible, given a pre-defined threshold (e.g., <mono>minsplit</mono>).</span></li><br> <li><span>3 - <high>Prune</high> tree to reasonable size.</span></li> </ul> </span></li> </ul> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/splitting_regr.png"> </p> ] --- class: center, middle <font color = "gray"><h1>Regression</h1></font> <font color = "gray"><h1>Decision Trees</h1></font> <high><h1>Random Forests</h1></high> --- .pull-left45[ # Random Forest <p style="padding-top:1px"></p> <ul> <li class="m1"><span>In <a href="https://en.wikipedia.org/wiki/Random_forest">Random Forest</a>, the criterion is modeled as the <high>aggregate prediction of a large number of decision trees</high>, each based on different features.</span></li><br> <li class="m2"><span>Algorithm: <br><br> <ul class="level"> <li><span>1 - <high>Repeat</high> <i>n</i> times</span></li> 1 - <high>Resample</high> data<br> 2 - <high>Grow</high> non-pruned decision tree<br> At each split, <high>consider only <i>m</i><br> features</high> <li><span>2 - <high>Average</high> fitted values.</span></li><br> </ul> </span></li> </ul> ] .pull-right45[ <br> <p align = "center" style="padding-top:0px"> <img src="image/rf.png"> </p> ] --- # Random Forest .pull-left45[ <p style="padding-top:1px"></p> <ul> <li class="m1"><span>Random forests make use of two important machine learning elements, <high>resampling</high> and <high>averaging</high>, that together are also referred to as <high>bagging</high>.</span></li> </ul> <table style="cellspacing:0; cellpadding:0; border:none;"> <col
width="30%"> <col width="70%"> <tr> <td bgcolor="white"> <b>Element</b> </td> <td bgcolor="white"> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <i>Resampling</i> </td> <td bgcolor="white"> Creates new data sets that vary in their composition, thereby <high>deemphasizing idiosyncrasies</high> of the available data. </td> </tr> <tr> <td bgcolor="white"> <i>Averaging</i> </td> <td bgcolor="white"> Combining predictions typically <high>evens out idiosyncrasies</high> of the models created from single data sets. </td> </tr> </table> ] .pull-right45[ <p align = "center" style="padding-top:0px"> <img src="image/tree_crowd.png"> </p> ] --- class: center, middle <p align = "center"> <img src="https://www.tidymodels.org/images/tidymodels.png" width=240px><br> <font style="font-size:10px">from <a href="https://www.tidymodels.org/packages/">tidymodels.org</a></font> </p> --- .pull-left4[ # Fitting <mono>tidymodels</mono> <br> <ul> <li class="m1"><span>Split the data.</span></li><br> <li class="m2"><span>Fit the model.</span></li><br> <li class="m3"><span>Predict.</span></li><br> </ul> ] .pull-right5[ <p align = "center"> <br> <img src="image/tidymodels_pred.png" height=560px><br> </p> ] --- # Split data into training and test .pull-left4[ <ul> <li class="m1"><span>Use <mono>initial_split()</mono> to <high>split a dataset</high> into separate training and test datasets.</span></li> </ul> <br> <table style="cellspacing:0; cellpadding:0; border:none;"> <col width="30%"> <col width="70%"> <tr> <td bgcolor="white"> <mono>data</mono> </td> <td bgcolor="white"> The dataset. </td> </tr> <tr> <td bgcolor="white"> <b>Argument</b> </td> <td bgcolor="white"> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>prop</mono> </td> <td bgcolor="white"> The <high>proportion of data</high> going into the training set. Often <mono>.8</mono>, <mono>.75</mono> (default) or <mono>.5</mono>.
</td> </tr> <tr> <td bgcolor="white"> <mono>strata</mono> </td> <td bgcolor="white"> The criterion or another stratification variable. Used to create a <high>balanced split</high>. </td> </tr> </table> ] .pull-right5[ ```r # Get initial split baselers_split <- initial_split(baselers, prop = .8, strata = income) # Create training data baselers_train <- training(baselers_split) # Create test data baselers_test <- testing(baselers_split) ``` ] --- # Fit to training .pull-left4[ <ul> <li class="m2"><span>To fit the model to the training data rather than the full data, simply supply <mono>baselers_train</mono> as <highm>data</highm> in the <mono>recipe</mono> and in <mono>fit()</mono>.</span></li> </ul> ] .pull-right5[ ```r # fit regression model to training recipe <- recipe(income ~ ., data = baselers_train) %>% step_dummy(all_nominal_predictors()) lm_model <- linear_reg() %>% set_engine("lm") %>% set_mode("regression") lm_workflow <- workflow() %>% add_recipe(recipe) %>% add_model(lm_model) income_lm <- fit(lm_workflow, data = baselers_train) ``` ] --- # Fit Decision tree .pull-left4[ <ul> <li class="m1"><span>Fit <high>decision trees</high> in <mono>tidymodels</mono> using <mono>decision_tree()</mono> in the model definition.</span></li> <li class="m2"><span>Set the engine to <mono>rpart</mono>.</span></li> </ul> ] .pull-right5[ ```r # fit decision tree to training recipe <- recipe(income ~ ., data = baselers_train) %>% step_other(all_nominal_predictors(), threshold = 0.005) dt_model <- decision_tree() %>% set_engine("rpart") %>% set_mode("regression") # or classification dt_workflow <- workflow() %>% add_recipe(recipe) %>% add_model(dt_model) income_dt <- fit(dt_workflow, data = baselers_train) ``` ] --- # Fit Random Forest .pull-left4[ <ul> <li class="m1"><span>Fit <high>random forests</high> in <mono>tidymodels</mono> using <mono>rand_forest()</mono> in the model definition.</span></li> <li class="m2"><span>Set the engine to, e.g., <mono>ranger</mono>.</span></li> </ul> ] .pull-right5[ ```r # fit random forest to training recipe <- recipe(income ~ .,
data = baselers_train) %>% step_other(all_nominal_predictors(), threshold = 0.005) rf_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("regression") # or classification rf_workflow <- workflow() %>% add_recipe(recipe) %>% add_model(rf_model) income_rf <- fit(rf_workflow, data = baselers_train) ``` ] --- # Random forest engines .pull-left4[ <ul> <li class="m1"><span>Show all available engines with <mono>show_engines("rand_forest")</mono>.</span></li> </ul> ] .pull-right5[ ```r # show possible engines for random forest show_engines("rand_forest") ``` ``` # A tibble: 6 x 2 engine mode <chr> <chr> 1 ranger classification 2 ranger regression 3 randomForest classification 4 randomForest regression 5 spark classification 6 spark regression ``` ] --- # <mono>predict(, new_data)</mono> .pull-left4[ <ul> <li class="m1"><span>To <high>test model predictions</high> with <mono>tidymodels</mono>, supply a new data frame via <mono>new_data</mono> to the <mono>predict()</mono> function.</span></li> </ul> <br> <table style="cellspacing:0; cellpadding:0; border:none;"> <col width="30%"> <col width="70%"> <tr> <td bgcolor="white"> <b>Argument</b> </td> <td bgcolor="white"> <b>Description</b> </td> </tr> <tr> <td bgcolor="white"> <mono>object</mono> </td> <td bgcolor="white"> <mono>tidymodels</mono> fit object. </td> </tr> <tr> <td bgcolor="white"> <mono>new_data</mono> </td> <td bgcolor="white"> Test data set. Must contain same features as provided in <mono>object</mono>. </td> </tr> </table> ] .pull-right5[ ```r # generate out-of-sample predictions lm_pred <- income_lm %>% predict(baselers_test) %>% bind_cols(baselers_test %>% select(income)) metrics(lm_pred, truth = income, estimate = .pred) ``` ``` # A tibble: 3 x 3 .metric .estimator .estimate <chr> <chr> <dbl> 1 rmse standard 1024. 2 rsq standard 0.846 3 mae standard 821.
``` ] --- class: middle, center <h1><a href=https://therbootcamp.github.io/AML_2021AMLD/_sessions/Prediction/Prediction_practical.html>Practical</a></h1>
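---

# Split, fit, predict: a complete sketch

.pull-left4[
<ul>
<li class="m1"><span>The steps from the previous slides can be combined into a single script. This is a <high>sketch only</high>: it assumes <mono>tidymodels</mono> is loaded and that the <mono>baselers</mono> data set from the bootcamp materials is available in the session.</span></li>
</ul>
]

.pull-right5[
```r
library(tidymodels)

# split data into training and test
baselers_split <- initial_split(baselers,
                                prop = .8,
                                strata = income)
baselers_train <- training(baselers_split)
baselers_test  <- testing(baselers_split)

# define recipe and model, bundle in a workflow
recipe <- recipe(income ~ ., data = baselers_train) %>%
  step_dummy(all_nominal_predictors())

lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# fit the workflow to the training data only
income_lm <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(lm_model) %>%
  fit(data = baselers_train)

# evaluate out-of-sample predictions on the test data
income_lm %>%
  predict(baselers_test) %>%
  bind_cols(baselers_test %>% select(income)) %>%
  metrics(truth = income, estimate = .pred)
```
]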