class: center, middle, inverse, title-slide # Machine Learning ### Basel R Bootcamp
www.therbootcamp.com
@therbootcamp
### July 2018 --- layout: true <div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">BaselRBootcamp, July 2018</font></a>                                           <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a> </span></div> --- # What is machine learning? .pull-left55[ ### Algorithms autonomously learning from data. Given data, an algorithm tunes its <high>parameters</high> to match the data, understand how it works, and make predictions for what will occur in the future. <br><br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/mldiagram_A.png"> </p> ] .pull-right4[ <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/machinelearningcartoon.png"> </p> ] --- # Everyone uses machine learning .pull-left4[ > ### "Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more." > ### Jeff Bezos, founder of Amazon ] .pull-right55[ <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/mlexamples.png"> </p> ] --- # What is the basic machine learning process? <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/MLdiagram.png"> </p> --- .pull-left45[ # What is a model? A model is a <high>formal</high> (mathematical) procedure describing the relationships between variables. Most data have one main <high>criterion</high> or variable of interest, and several <high>features</high>. 
<br>

.pull-left7[

| id|sex | age|fam_history |smoking | disease|
|--:|:---|---:|:-----------|:-------|-------:|
|  1|m   |  45|No          |FALSE   |       0|
|  2|m   |  43|Yes         |FALSE   |       1|
|  3|f   |  40|Yes         |FALSE   |       1|
|  4|m   |  51|Yes         |FALSE   |       1|
|  5|m   |  44|No          |TRUE    |       0|

]
]

.pull-right5[

<br><br>

### Decision Tree

<p align="center">
<img src="https://github.com/therbootcamp/therbootcamp.github.io/blob/master/_sessions/_image/decision_tree_example.png?raw=true" height="280px">
</p>

### Weighted Additive (Regression)

`$$\large{Risk = age \times 0.01 + smoking \times 0.20 + fam\_history \times 0.20}$$`

]

---

.pull-left45[

# What is model training?

Model <high>training</high> (aka fitting) is the process of matching a model's <high>parameters</high> to a specific dataset.

Q: What are the parameters in the two models on the right?

<br>

.pull-left7[

| id|sex | age|fam_history |smoking | disease|
|--:|:---|---:|:-----------|:-------|-------:|
|  1|m   |  45|No          |FALSE   |       0|
|  2|m   |  43|Yes         |FALSE   |       1|
|  3|f   |  40|Yes         |FALSE   |       1|
|  4|m   |  51|Yes         |FALSE   |       1|
|  5|m   |  44|No          |TRUE    |       0|

]
]

.pull-right5[

<br><br>

### Decision Tree

<p align="center">
<img src="https://github.com/therbootcamp/therbootcamp.github.io/blob/master/_sessions/_image/decision_tree_example.png?raw=true" height="280px">
</p>

### Weighted Additive (Regression)

`$$\large{Risk = age \times 0.01 + smoking \times 0.20 + fam\_history \times 0.20}$$`

]

---

# Fit your own linear model!

<br>

<img src="MachineLearning_files/figure-html/unnamed-chunk-6-1.png" width="85%" style="display: block; margin: auto;" />

---

# Fit your own linear model!

<br>

<img src="MachineLearning_files/figure-html/unnamed-chunk-7-1.png" width="85%" style="display: block; margin: auto;" />

---

# Fit your own linear model!

<br>

<img src="MachineLearning_files/figure-html/unnamed-chunk-8-1.png" width="85%" style="display: block; margin: auto;" />

---

# Why do we separate training from prediction?
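
One way to see why: a model flexible enough to match past data perfectly can still fail on new data. A minimal sketch with simulated data (all names and numbers below are invented for illustration):

```r
set.seed(1)

# Simulated 'past' (training) and 'future' (test) data from the same process
train <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
test  <- data.frame(x = 1:10, y = (1:10) + rnorm(10))

# A simple model versus a very flexible one
simple  <- lm(y ~ x, data = train)           # 2 parameters
complex <- lm(y ~ poly(x, 9), data = train)  # 10 parameters: matches training data exactly

# Mean absolute error of a model on any dataset
mae <- function(model, data) mean(abs(predict(model, newdata = data) - data$y))

mae(complex, train)  # essentially 0: a 'perfect' fit to past data
mae(complex, test)   # much larger: the perfect fit does not predict
mae(simple, test)    # the simpler model typically predicts better here
```

The flexible polynomial reproduces every training point, which is exactly the kind of model that fits the past perfectly but is worthless for prediction.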
.pull-left35[

<br>

Just because a model can <high>fit past data well</high> does *not* necessarily mean that it will <high>predict new data well</high>.

Anyone can come up with a model of past data (e.g., stock performance, lottery winnings). <high>Predicting what you can't see in the future is much more difficult.</high>

]

.pull-right6[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/prediction_collage.png">
</p>

]

---

<br><br>

<font size = 6>"Can you come up with a model that will perfectly match past data but is worthless in predicting future data?"</font><br><br>

.pull-left45[

<br>

<font size=5><hfont>Past <high>Training</high> Data</hfont></font>

<br>

| id|sex | age|fam_history |smoking | disease|
|--:|:---|---:|:-----------|:-------|-------:|
|  1|m   |  45|No          |FALSE   |       0|
|  2|m   |  43|Yes         |FALSE   |       1|
|  3|f   |  40|Yes         |FALSE   |       1|
|  4|m   |  51|Yes         |FALSE   |       1|
|  5|m   |  44|No          |TRUE    |       0|

]

.pull-right45[

<br>

<font size=5><hfont>Future <high>Test</high> Data</hfont></font>

<br>

| id|sex | age|fam_history |smoking |disease |
|--:|:---|---:|:-----------|:-------|:-------|
| 91|m   |  51|Yes         |TRUE    |?       |
| 92|f   |  47|No          |TRUE    |?       |
| 93|m   |  39|No          |TRUE    |?       |
| 94|f   |  51|Yes         |TRUE    |?       |
| 95|f   |  50|Yes         |FALSE   |?       |

]

---

# Two types of prediction tasks

.pull-left45[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/classification_task.png" height="450px">
</p>

]

.pull-right45[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/regression_task.png" height="450px">
</p>

]

---

# What machine learning algorithms are there?

.pull-left55[

There are thousands of machine learning algorithms from many different fields.
[Wikipedia](https://en.wikipedia.org/wiki/Category:Machine_learning) lists 57 categories of machine learning algorithms:

<br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/wikipediaml.png" height="250px">
</p>

]

.pull-right4[

### Algorithms we focus on

<br>

We will focus on 3 algorithms that apply to most ML tasks:

.pull-left6[

| Algorithm | Complexity|
|:--------------------------------------|:-------------------|
| [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree) | Low |
| [Regression](https://en.wikipedia.org/wiki/Regression_analysis) | Low / Medium |
| [Random Forests](https://en.wikipedia.org/wiki/Random_forest) | High |

]
]

---

# How do you fit and evaluate ML models in R?

.pull-left45[

<high>You fit ML models the same way you fit standard statistical models.</high> Install the package, load it, and find the main fitting functions.

```r
# Install the glmnet package
install.packages("glmnet")

# Load glmnet
library(glmnet)

# Look at help menu
?glmnet
```

Note: Some functions use the standard `FUN(formula, data)` arguments, but others (like `glmnet()`) require different arguments, such as `x, y` (numeric matrices).

]

.pull-right5[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/glmnet_help.jpg" height="400px">
</p>

]

---

# Regression

.pull-left45[

In regression, the criterion is modeled as the <high>sum of predictors times weights `\(\beta_{1}\)`, `\(\beta_{2}\)`</high>.

<u>Loan example</u><br>

For instance, one could model the risk of defaulting on a loan as:

`$$Risk = Age \times \beta_{Age} + Income \times \beta_{Income} + ...$$`

Training a model means finding values of `\(\beta_{Age}\)` and `\(\beta_{Income}\)` that 'best' match the training data.
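
As a concrete (simulated) illustration of training: if we generate data with known weights, `glm()` recovers estimates close to those weights. All names and numbers below are invented:

```r
set.seed(2)

# Simulated loan data with known 'true' weights 0.01 and 0.0001
loans <- data.frame(Age    = sample(20:70, 100, replace = TRUE),
                    Income = rnorm(100, mean = 5000, sd = 1000))
loans$Risk <- 0.01 * loans$Age + 0.0001 * loans$Income + rnorm(100, sd = 0.1)

# Training = estimating the beta weights from the data
loan_glm <- glm(Risk ~ Age + Income, data = loans)
coef(loan_glm)  # estimates land near the true weights
```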
<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/regression.png" height="180px">
</p>

]

.pull-right5[

### Regression with glm()

The `glm()` function in the base stats package performs standard regression:

```r
# Standard linear regression
glm_mod <- glm(formula = happiness ~ .,
               data = baselers)

# Logistic regression with family = 'binomial'
glm_mod <- glm(formula = sex ~ .,
               data = baselers,
               family = "binomial")
```

]

---

# Decision Trees

.pull-left45[

In decision trees, the criterion is modeled as a <high>sequence of logical YES or NO questions</high>.

<br><br>

<u>Loan example</u><br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/defaulttree.png" height="250px">
</p>

]

.pull-right5[

### Decision trees with rpart

This code trains a decision tree with functions from the `rpart` package:

```r
install.packages("rpart")
library(rpart)

# Train rpart model
loan_rpart_mod <- rpart(formula = y ~ .,            # Formula
                        data = data_train,          # Training data
                        method = "class",           # Classification
                        control = rpart.control())  # Tuning parameters
```

]

---

# Random Forest

.pull-left45[

In [Random Forest](https://en.wikipedia.org/wiki/Random_forest), the criterion is modeled as the <high>aggregate prediction of a large number of decision trees</high>, each based on different features.
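
The aggregation idea can be sketched with `rpart` alone: train many trees on bootstrap samples of the data and average their votes. (Simulated data; a real random forest additionally samples a random subset of features at each split.)

```r
library(rpart)
set.seed(3)

# Simulated two-feature classification data
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(200) > 0, "yes", "no"))

# Train many trees, each on a different bootstrap sample
trees <- lapply(1:25, function(i) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  rpart(y ~ ., data = boot, method = "class")
})

# Aggregate: average the trees' predicted probabilities, then vote
probs <- sapply(trees, function(tr) predict(tr, newdata = dat)[, "yes"])
forest_pred <- ifelse(rowMeans(probs) > .5, "yes", "no")
mean(forest_pred == dat$y)  # accuracy of the aggregated prediction
```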
<br>

<u>Loan example</u><br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/randomforest_diagram.png" height="285px"><br>
<a href="https://medium.com/@williamkoehrsen">Source</a>
</p>

]

.pull-right5[

### Random Forests with `randomForest`

```r
install.packages("randomForest")
library(randomForest)

# Create a randomForest model
randomForest(formula = y ~ .,    # Formula
             data = data_train,  # Training data
             ntree = 100,        # Tuning parameter
             mtry = 2)           # Tuning parameter
```

<br>

Tuning parameters

|Parameter | Description|
|:-------|:-------|
|`ntree`|Number of trees in forest|
|`mtry`|Number of variables randomly selected at splits|

]

---

.pull-left35[

# Exploring ML objects

Just like objects from statistical functions, objects from machine learning functions are <high>lists</high> that you can explore using <high>generic functions</high>:

|Function|Description|
|:------|:----|
|`summary()`|Overview of the most important information|
|`names()`|See all named elements you can access with `$`|
|`plot()`|Visualise the object (sometimes)|
|`predict()`|Predict new data based on the ML model|

]

.pull-right6[

```r
# Create a regression object
baselers_glm <- glm(income ~ age + height + children,
                    data = baselers)

# Look at summary results
summary(baselers_glm)
# [...]
```

```r
# Look at all named outputs
names(baselers_glm)
```

```
##  [1] "coefficients"      "residuals"         "fitted.values"     "effects"           "R"
##  [6] "rank"              "qr"                "family"            "linear.predictors" "deviance"
## [11] "aic"               "null.deviance"     "iter"              "weights"           "prior.weights"
## [16] "df.residual"       "df.null"           "y"                 "converged"         "boundary"
## [21] "model"             "na.action"         "call"              "formula"           "terms"
## [26] "data"              "offset"            "control"           "method"            "contrasts"
## [31] "xlevels"
```

```r
# Access specific outputs
baselers_glm$coefficients
```

```
## (Intercept)         age      height    children
##     574.740     149.302       1.720       7.727
```

]

---

# Predict new data with predict()

.pull-left4[

All machine learning objects will allow you to <high>predict the criterion of new data</high> using `predict()`.

Compare the predicted values to the true criterion values of `newdata` to see how well your model did.

<br>

|argument|description|
|:----|:-----|
|object|A machine learning / statistical object created from `glm()`, `randomForest()`, ...|
|newdata|A data frame of new data|

]

.pull-right55[

Predict values from the `zurichers` data frame:

| id| age| children| height| income|
|--:|---:|--------:|------:|------:|
|  1|  65|        0|   1.66|   7500|
|  2|  75|        3|   1.96|   5400|
|  3|  35|        1|   1.76|   8400|
|  4|  54|        0|   1.73|   9500|
|  5|  65|        2|   1.59|   3700|

```r
# Produce vector of new predictions
predict(object = baselers_glm,  # ML object
        newdata = zurichers)    # DF of new data
```

```
##     1     2     3     4     5
## 10282 11799  5811  8640 10298
```

]

---

# Practical

<p>
<font size=6>
<a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/MachineLearning/MachineLearning_practical.html"><b>Link to practical</b></a>
</font>
</p>
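
---

# How well did the model predict?

To quantify "how well your model did", compare the output of `predict()` with the true criterion values of the new data. A minimal sketch with simulated data (all names and numbers invented):

```r
set.seed(4)

# Simulated training data and new data from the same process
train_dat <- data.frame(age = sample(20:70, 100, replace = TRUE))
train_dat$income <- 3000 + 100 * train_dat$age + rnorm(100, sd = 500)

new_dat <- data.frame(age = sample(20:70, 50, replace = TRUE))
new_dat$income <- 3000 + 100 * new_dat$age + rnorm(50, sd = 500)

# Train, predict, then compare predictions to the truth
mod  <- glm(income ~ age, data = train_dat)
pred <- predict(mod, newdata = new_dat)

mean(abs(pred - new_dat$income))  # mean absolute prediction error
```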