class: center, middle, inverse, title-slide

# Recap
### Machine Learning with R
Basel R Bootcamp
### October 2019

---
layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
www.therbootcamp.com
</font>
</span>
</a>
<a href="https://therbootcamp.github.io/">
<font color="#7E7E7E">
Machine Learning with R | October 2019
</font>
</a>
</span>
</div>

---

# What is machine learning?

.pull-left45[

<b>Machine learning is</b>...

<p style="padding-left:20px">
...a <high>field of artificial intelligence</high>...<br><br>
...that uses <high>statistical techniques</high>...<br><br>
...to allow computer systems to <high>"learn"</high>,...<br><br>
...i.e., to progressively <high>improve performance</high> on a specific task...<br><br>
...from small or large amounts of <high>data</high>,...<br><br>
...<high>without being explicitly programmed</high>...<br><br>
...with the goal of <high>discovering structure</high> or <high>improving decision making and predictions</high>.
</p>

]

.pull-right45[

<p align = "center">
<img src="image/ml_robot.jpg" height=380px><br>
<font style="font-size:10px">from <a href="https://medium.com/@dkwok94/machine-learning-for-my-grandma-ca242e97ef62">medium.com</a></font>
</p>

]

---

.pull-left3[

# Types of machine learning tasks

There are many types of machine learning tasks, each of which calls for different models.

<high>We will focus on supervised machine learning</high>.

]

.pull-right65[

<br><br>

<p align = "center">
<img src="image/mltypes.png" height=500px><br>
<font style="font-size:10px">from <a href="image/mltypes.png">amazonaws.com</a></font>
</p>

]

---

# Loss function

.pull-left45[

Possibly <high>the most important concept</high> in statistics and machine learning.

The loss function defines some <high>summary of the errors committed by the model</high>.

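
A minimal sketch of what such a summary can look like, assuming a handful of made-up observed and predicted values. The vectors `obs` and `pred` and the helper names `mse` and `mae` are hypothetical and only serve as an illustration; they are not part of the course material.

```r
# Hypothetical observed values and model predictions (made up)
obs  <- c(3, 5, 2, 8)
pred <- c(2, 5, 4, 7)

# Errors committed by the model
err <- obs - pred

# Two common loss functions, each a summary of the errors
mse <- function(error) mean(error^2)     # mean squared error
mae <- function(error) mean(abs(error))  # mean absolute error

mse(err)  # 1.5
mae(err)  # 1
```

Squaring penalizes large errors more heavily than taking absolute values, so the two summaries can rank models differently.
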
<p style="padding-top:7px">

`$$\Large Loss = f(Error)$$`

<p style="padding-top:7px">

<u>Two purposes</u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr>
<td><b>Purpose</b></td>
<td><b>Description</b></td>
</tr>
<tr>
<td bgcolor="white">Fitting</td>
<td bgcolor="white">Find parameters that minimize the loss function.</td>
</tr>
<tr>
<td>Evaluation</td>
<td>Calculate the loss function for the fitted model.</td>
</tr>
</table>

]

.pull-right45[

<img src="Recap_files/figure-html/unnamed-chunk-2-1.png" width="90%" />

]

---

# 2 types of supervised problems

.pull-left5[

There are two types of supervised learning problems that can often be approached using the same model.

<font style="font-size:24px"><b>Regression</b></font>

Regression problems involve the <high>prediction of a quantitative feature</high>.

E.g., predicting the cholesterol level as a function of age.

<font style="font-size:24px"><b>Classification</b></font>

Classification problems involve the <high>prediction of a categorical feature</high>.

E.g., predicting the type of chest pain as a function of age.

]

.pull-right4[

<p align = "center">
<img src="image/twotypes.png" height=440px><br>
</p>

]

---

# 3 key (supervised) models

<p align = "center" style="padding-top:20px">
<img src="image/models.png"><br>
</p>

---

# Hold-out data

.pull-left45[

Model performance must be evaluated as true prediction on an <high>unseen data set</high>.

The unseen data set can be <high>naturally</high> occurring, e.g., using 2019 stock prices to evaluate a model fit using 2018 stock prices.

More commonly, unseen data is created by splitting the available data into a training set and a test set.

]

.pull-right45[

<p align = "center">
<img src="image/testdata.png" height=430px>
</p>

]

---

.pull-left4[

<br><br>

# Overfitting

Occurs when a model <high>fits data too closely</high> and therefore <high>fails to reliably predict</high> future observations.

In other words, overfitting occurs when a model <high>'mistakes' random noise for a predictable signal</high>.

More <high>complex models</high> are more <high>prone to overfitting</high>.

]

.pull-right5[

<br><br><br>

<p align = "center" style="padding-top:0px">
<img src="image/overfitting.png">
</p>

]

---

.pull-left5[

# 7 steps with <mono>caret</mono>

Step 0: Load data

```r
library(tidyverse)
library(caret)

data <- read_csv("1_Data/data.csv")
```

Step 1: Split into training and test data

```r
# Create index
ind <- createDataPartition(y = data$Y,
                           p = .8,
                           list = FALSE)

# Create training and test data
data_train <- data %>% slice(ind)
data_test <- data %>% slice(-ind)
```

Step 2: Define control parameters

```r
# Use method = "none" for now
ctrl <- trainControl(method = "none")
```

]

.pull-right45[

Step 3: Train model

```r
# Replace the placeholder with a caret method,
# e.g., "glm", "rpart", or "rf"
mod <- train(form = Y ~ .,
             data = data_train,
             method = "My Favorite Model",
             trControl = ctrl)
```

Step 4: Explore

```r
mod             # Print object
mod$finalModel  # Final model
```

Step 5: Predict

```r
# Predict the criterion for the test data
mod_pred <- predict(object = mod,
                    newdata = data_test)
```

Step 6: Evaluate prediction accuracy

```r
# Evaluate prediction performance
postResample(pred = mod_pred,
             obs = data_test$Y)
```

]

---
class: middle, center

<h1><a href="https://therbootcamp.github.io/ML_2019Oct/index.html">Schedule</a></h1>