class: center, middle, inverse, title-slide # Machine Learning ### Basel R Bootcamp
www.therbootcamp.com
@therbootcamp
### July 2018 --- layout: true <div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">BaselRBootcamp, July 2018</font></a>                                           <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a> </span></div> --- # What is machine learning? .pull-left55[ ### Algorithms autonomously learning from data. Given data, an algorithm tunes its <high>parameters</high> to match the data, understand how it works, and make predictions for what will occur in the future. <br><br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/mldiagram_A.png"> </p> ] .pull-right4[ <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/machinelearningcartoon.png"> </p> ] --- # Everyone uses machine learning .pull-left4[ > ### "Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more." > ### Jeff Bezos, founder of Amazon ] .pull-right55[ <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/mlexamples.png"> </p> ] --- # What is the basic machine learning process? <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/MLdiagram.png"> </p> --- .pull-left45[ # What is a model? A model is a <high>formal</high> (mathematical) procedure describing the relationships between variables. Most data have one main <high>criterion</high> or variable of interest, and several <high>features</high>. 
<br>

.pull-left7[

| id|sex | age|fam_history |smoking | disease|
|--:|:---|---:|:-----------|:-------|-------:|
|  1|m   |  45|No          |FALSE   |       0|
|  2|m   |  43|Yes         |FALSE   |       1|
|  3|f   |  40|Yes         |FALSE   |       1|
|  4|m   |  51|Yes         |FALSE   |       1|
|  5|m   |  44|No          |TRUE    |       0|

]
]

.pull-right5[

<br><br>

### Decision Tree

<p align="center">
<img src="https://github.com/therbootcamp/therbootcamp.github.io/blob/master/_sessions/_image/decision_tree_example.png?raw=true" height="280px">
</p>

### Weighted Additive (Regression)

`$$\large{Risk = age \times 0.01 + smoking \times 0.20 + fam\_history \times 0.20}$$`

]

---

.pull-left45[

# What is model training?

Model <high>training</high> (aka fitting) is the process of matching a model's <high>parameters</high> to a specific dataset.

Q: What are the parameters in the two models on the right?

<br>

.pull-left7[

| id|sex | age|fam_history |smoking | disease|
|--:|:---|---:|:-----------|:-------|-------:|
|  1|m   |  45|No          |FALSE   |       0|
|  2|m   |  43|Yes         |FALSE   |       1|
|  3|f   |  40|Yes         |FALSE   |       1|
|  4|m   |  51|Yes         |FALSE   |       1|
|  5|m   |  44|No          |TRUE    |       0|

]
]

.pull-right5[

<br><br>

### Decision Tree

<p align="center">
<img src="https://github.com/therbootcamp/therbootcamp.github.io/blob/master/_sessions/_image/decision_tree_example.png?raw=true" height="280px">
</p>

### Weighted Additive (Regression)

`$$\large{Risk = age \times 0.01 + smoking \times 0.20 + fam\_history \times 0.20}$$`

]

---

# Fit your own linear model!

<br>

<img src="MachineLearning_files/figure-html/unnamed-chunk-6-1.png" width="85%" style="display: block; margin: auto;" />

---

# Fit your own linear model!

<br>

<img src="MachineLearning_files/figure-html/unnamed-chunk-7-1.png" width="85%" style="display: block; margin: auto;" />

---

# Fit your own linear model!

<br>

<img src="MachineLearning_files/figure-html/unnamed-chunk-8-1.png" width="85%" style="display: block; margin: auto;" />

---

# Why do we separate training from prediction?
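
One way to see why: a model flexible enough to match past data perfectly can still fail on new data. A minimal sketch with simulated data (all names and numbers below are invented for illustration):

```r
set.seed(1)

# Simulated 'past' (training) and 'future' (test) data from the same process
train <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
test  <- data.frame(x = 1:10, y = (1:10) + rnorm(10))

# A simple model versus a very flexible one
simple  <- lm(y ~ x, data = train)           # 2 parameters
complex <- lm(y ~ poly(x, 9), data = train)  # 10 parameters: matches training data exactly

# Mean absolute error of a model on any dataset
mae <- function(model, data) mean(abs(predict(model, newdata = data) - data$y))

mae(complex, train)  # essentially 0: a 'perfect' fit to past data
mae(complex, test)   # much larger: the perfect fit does not predict
mae(simple, test)    # the simpler model typically predicts better here
```

The flexible polynomial reproduces every training point, which is exactly the kind of model that fits the past perfectly but is worthless for prediction.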
.pull-left35[

<br>

Just because a model can <high>fit past data well</high> does *not* necessarily mean that it will <high>predict new data well</high>.

Anyone can come up with a model of past data (e.g., stock performance, lottery winnings). <high>Predicting what you can't see in the future is much more difficult.</high>

]

.pull-right6[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/prediction_collage.png">
</p>

]

---

<br><br>

<font size = 6>"Can you come up with a model that will perfectly match past data but is worthless in predicting future data?"</font><br><br>

.pull-left45[

<br>

<font size=5><hfont>Past <high>Training</high> Data</hfont></font>

<br>

| id|sex | age|fam_history |smoking | disease|
|--:|:---|---:|:-----------|:-------|-------:|
|  1|m   |  45|No          |FALSE   |       0|
|  2|m   |  43|Yes         |FALSE   |       1|
|  3|f   |  40|Yes         |FALSE   |       1|
|  4|m   |  51|Yes         |FALSE   |       1|
|  5|m   |  44|No          |TRUE    |       0|

]

.pull-right45[

<br>

<font size=5><hfont>Future <high>Test</high> Data</hfont></font>

<br>

| id|sex | age|fam_history |smoking |disease |
|--:|:---|---:|:-----------|:-------|:-------|
| 91|m   |  51|Yes         |TRUE    |?       |
| 92|f   |  47|No          |TRUE    |?       |
| 93|m   |  39|No          |TRUE    |?       |
| 94|f   |  51|Yes         |TRUE    |?       |
| 95|f   |  50|Yes         |FALSE   |?       |

]

---

# Two types of prediction tasks

.pull-left45[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/classification_task.png" height="450px">
</p>

]

.pull-right45[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/regression_task.png" height="450px">
</p>

]

---

# What machine learning algorithms are there?

.pull-left55[

There are thousands of machine learning algorithms from many different fields.
[Wikipedia](https://en.wikipedia.org/wiki/Category:Machine_learning) lists 57 categories of machine learning algorithms:

<br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/wikipediaml.png" height="250px">
</p>

]

.pull-right4[

### Algorithms we focus on

<br>

We will focus on 3 algorithms that apply to most ML tasks:

.pull-left6[

| Algorithm | Complexity|
|:--------------------------------------|:-------------------|
| [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree) | Low |
| [Regression](https://en.wikipedia.org/wiki/Regression_analysis) | Low / Medium |
| [Random Forests](https://en.wikipedia.org/wiki/Random_forest) | High |

]
]

---

# How do you fit and evaluate ML models in R?

.pull-left45[

<high>You fit ML models the same way you fit standard statistical models.</high> Install the package, load it, and find the main fitting functions.

```r
# Install the glmnet package
install.packages("glmnet")

# Load glmnet
library(glmnet)

# Look at help menu
?glmnet
```

Note: Some functions use the standard `FUN(formula, data)` arguments, but others (like `glmnet()`) require different arguments, such as `x, y` (numeric matrices).

]

.pull-right5[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/glmnet_help.jpg" height="400px">
</p>

]

---

# Regression

.pull-left45[

In regression, the criterion is modeled as the <high>sum of predictors times weights `\(\beta_{1}\)`, `\(\beta_{2}\)`</high>.

<u>Loan example</u><br>

For instance, one could model the risk of defaulting on a loan as:

`$$Risk = Age \times \beta_{Age} + Income \times \beta_{Income} + ...$$`

Training a model means finding values of `\(\beta_{Age}\)` and `\(\beta_{Income}\)` that 'best' match the training data.
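
As a concrete (simulated) illustration of training: if we generate data with known weights, `glm()` recovers estimates close to those weights. All names and numbers below are invented:

```r
set.seed(2)

# Simulated loan data with known 'true' weights 0.01 and 0.0001
loans <- data.frame(Age    = sample(20:70, 100, replace = TRUE),
                    Income = rnorm(100, mean = 5000, sd = 1000))
loans$Risk <- 0.01 * loans$Age + 0.0001 * loans$Income + rnorm(100, sd = 0.1)

# Training = estimating the beta weights from the data
loan_glm <- glm(Risk ~ Age + Income, data = loans)
coef(loan_glm)  # estimates land near the true weights
```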
<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/regression.png" height="180px">
</p>

]

.pull-right5[

### Regression with glm()

The `glm()` function in the base stats package performs standard regression:

```r
# Standard linear regression
glm_mod <- glm(formula = happiness ~ .,
               data = baselers)

# Logistic regression with family = 'binomial'
glm_mod <- glm(formula = sex ~ .,
               data = baselers,
               family = "binomial")
```

]

---

# Decision Trees

.pull-left45[

In decision trees, the criterion is modeled as a <high>sequence of logical YES or NO questions</high>.

<br><br>

<u>Loan example</u><br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/defaulttree.png" height="250px">
</p>

]

.pull-right5[

### Decision trees with rpart

This code trains a decision tree with functions from the `rpart` package:

```r
install.packages("rpart")
library(rpart)

# Train rpart model
loan_rpart_mod <- rpart(formula = y ~ .,            # Formula
                        data = data_train,          # Training data
                        method = "class",           # Classification
                        control = rpart.control())  # Tuning parameters
```

]

---

# Random Forest

.pull-left45[

In [Random Forest](https://en.wikipedia.org/wiki/Random_forest), the criterion is modeled as the <high>aggregate prediction of a large number of decision trees</high>, each based on different features.
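
The aggregation idea can be sketched with `rpart` alone: train many trees on bootstrap samples of the data and average their votes. (Simulated data; a real random forest additionally samples a random subset of features at each split.)

```r
library(rpart)
set.seed(3)

# Simulated two-feature classification data
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(200) > 0, "yes", "no"))

# Train many trees, each on a different bootstrap sample
trees <- lapply(1:25, function(i) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  rpart(y ~ ., data = boot, method = "class")
})

# Aggregate: average the trees' predicted probabilities, then vote
probs <- sapply(trees, function(tr) predict(tr, newdata = dat)[, "yes"])
forest_pred <- ifelse(rowMeans(probs) > .5, "yes", "no")
mean(forest_pred == dat$y)  # accuracy of the aggregated prediction
```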
<br>

<u>Loan example</u><br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/randomforest_diagram.png" height="285px"><br>
<a href="https://medium.com/@williamkoehrsen">Source</a>
</p>

]

.pull-right5[

### Random Forests with `randomForest`

```r
install.packages("randomForest")
library(randomForest)

# Create a randomForest model
randomForest(formula = y ~ .,    # Formula
             data = data_train,  # Training data
             ntree = 100,        # Tuning parameter
             mtry = 2)           # Tuning parameter
```

<br>

Tuning parameters

|Parameter | Description|
|:-------|:-------|
|`ntree`|Number of trees in forest|
|`mtry`|Number of variables randomly selected at splits|

]

---

.pull-left35[

# Exploring ML objects

Just like objects from statistical functions, objects from machine learning functions are <high>lists</high> that you can explore using <high>generic functions</high>:

|Function|Description|
|:------|:----|
|`summary()`|Overview of the most important information|
|`names()`|See all named elements you can access with `$`|
|`plot()`|Visualise the object (sometimes)|
|`predict()`|Predict new data based on the ML model|

]

.pull-right6[

```r
# Create a regression object
baselers_glm <- glm(income ~ age + height + children,
                    data = baselers)

# Look at summary results
summary(baselers_glm)
# [...]
```

```r
# Look at all named outputs
names(baselers_glm)
```

```
##  [1] "coefficients"      "residuals"         "fitted.values"     "effects"           "R"
##  [6] "rank"              "qr"                "family"            "linear.predictors" "deviance"
## [11] "aic"               "null.deviance"     "iter"              "weights"           "prior.weights"
## [16] "df.residual"       "df.null"           "y"                 "converged"         "boundary"
## [21] "model"             "na.action"         "call"              "formula"           "terms"
## [26] "data"              "offset"            "control"           "method"            "contrasts"
## [31] "xlevels"
```

```r
# Access specific outputs
baselers_glm$coefficients
```

```
## (Intercept)         age      height    children
##     574.740     149.302       1.720       7.727
```

]

---

# Predict new data with predict()

.pull-left4[

All machine learning objects will allow you to <high>predict the criterion of new data</high> using `predict()`.

Compare the predicted values to the true criterion values of `newdata` to see how well your model did.

<br>

|argument|description|
|:----|:-----|
|object|A machine learning / statistical object created from `glm()`, `randomForest()`, ...|
|newdata|A data frame of new data|

]

.pull-right55[

Predict values from the `zurichers` data frame:

| id| age| children| height| income|
|--:|---:|--------:|------:|------:|
|  1|  65|        0|   1.66|   7500|
|  2|  75|        3|   1.96|   5400|
|  3|  35|        1|   1.76|   8400|
|  4|  54|        0|   1.73|   9500|
|  5|  65|        2|   1.59|   3700|

```r
# Produce vector of new predictions
predict(object = baselers_glm,  # ML object
        newdata = zurichers)    # DF of new data
```

```
##     1     2     3     4     5
## 10282 11799  5811  8640 10298
```

]

---

# Practical

<p>
<font size=6>
<a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/MachineLearning/MachineLearning_practical.html"><b>Link to practical</b></a>
</font>
</p>
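
---

# How well did the model predict?

To quantify "how well your model did", compare the output of `predict()` with the true criterion values of the new data. A minimal sketch with simulated data (all names and numbers invented):

```r
set.seed(4)

# Simulated training data and new data from the same process
train_dat <- data.frame(age = sample(20:70, 100, replace = TRUE))
train_dat$income <- 3000 + 100 * train_dat$age + rnorm(100, sd = 500)

new_dat <- data.frame(age = sample(20:70, 50, replace = TRUE))
new_dat$income <- 3000 + 100 * new_dat$age + rnorm(50, sd = 500)

# Train, predict, then compare predictions to the truth
mod  <- glm(income ~ age, data = train_dat)
pred <- predict(mod, newdata = new_dat)

mean(abs(pred - new_dat$income))  # mean absolute prediction error
```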