class: center, middle, inverse, title-slide

# Fitting
### Machine Learning with R
Basel R Bootcamp
### October 2019

---

layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
www.therbootcamp.com
</font>
</span>
</a>
<a href="https://therbootcamp.github.io/">
<font color="#7E7E7E">
Machine Learning with R | October 2019
</font>
</a>
</span>
</div>

---

.pull-left45[

# Fitting

<p style="padding-top:1px"></p>

Models are actually <high>families of models</high>, with every parameter combination specifying a different model.

To fit a model means to <high>identify</high>, from the family of models, <high>the specific model that fits the data best</high>.

]

.pull-right45[

<br><br>
<p align = "center">
<img src="image/curvefits.png" height=480px><br>
<font style="font-size:10px">adapted from <a href="https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting">explainxkcd.com</a></font>
</p>

]

---

# Which of these models is better? Why?

<img src="Fitting_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" />

---

# Which of these models is better? Why?

<img src="Fitting_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

---

# Loss function

.pull-left45[

Possibly <high>the most important concept</high> in statistics and machine learning.

The loss function defines a <high>summary of the errors committed by the model</high>.

<p style="padding-top:7px">

`$$\Large Loss = f(Error)$$`

<p style="padding-top:7px">

<u>Two purposes</u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> <b>Purpose</b> </td><td> <b>Description</b> </td></tr>
<tr><td bgcolor="white"> Fitting </td><td bgcolor="white"> Find the parameters that minimize the loss function. </td></tr>
<tr><td> Evaluation </td><td> Calculate the loss function for the fitted model. </td></tr>
</table>

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />

]
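---

# Loss in R: a toy example

.pull-left45[

To make `\(Loss = f(Error)\)` concrete, here is a minimal sketch in base R, using a handful of <high>made-up</high> true values `y` and model predictions `y_hat` (both hypothetical).

The errors are the same in both cases; choosing how to <high>summarize</high> them (squared vs. absolute) defines two different loss functions.

]

.pull-right45[

```r
# Hypothetical true values and predictions
y     <- c(10, 12,  9, 15, 11)
y_hat <- c(11, 10, 10, 16,  9)

# Errors committed by the model
error <- y - y_hat

# Loss = f(Error): two possible summaries
mean(error^2)     # squared errors: 2.2
mean(abs(error))  # absolute errors: 1.4
```

]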
---

class: center, middle

<high><h1>Regression</h1></high>
<font color = "gray"><h1>Decision Trees</h1></font>
<font color = "gray"><h1>Random Forests</h1></font>

---

# Regression

.pull-left45[

In [regression](https://en.wikipedia.org/wiki/Regression_analysis), the criterion `\(Y\)` is modeled as the <high>sum</high> of <high>features</high> `\(X_1, X_2, ...\)` <high>times weights</high> `\(\beta_1, \beta_2, ...\)` plus the so-called intercept `\(\beta_0\)`.

<p style="padding-top:10px"></p>

`$$\large \hat{Y} = \beta_{0} + \beta_{1} \times X_{1} + \beta_{2} \times X_{2} + ...$$`

<p style="padding-top:10px"></p>

The weight `\(\beta_{i}\)` indicates the <high>amount of change</high> in `\(\hat{Y}\)` for a change of 1 in `\(X_{i}\)`.

Ceteris paribus, the <high>more extreme</high> `\(\beta_{i}\)`, the <high>more important</high> `\(X_{i}\)` is for the prediction of `\(Y\)` <font style="font-size:12px">(Note: the scale of `\(X_{i}\)` matters too!)</font>. If `\(\beta_{i} = 0\)`, then `\(X_{i}\)` <high>does not help</high> predict `\(Y\)`.

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# Regression loss

.pull-left45[

<p>
<font style="font-size:24px"><b>Mean Squared Error (MSE)</b></font><br><br>
<high>Average squared distance</high> between predictions and true values.<br>

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_{i} - \hat{Y}_{i})^{2}$$

<br>
<font style="font-size:24px"><b>Mean Absolute Error (MAE)</b></font><br><br>
<high>Average absolute distance</high> between predictions and true values.<br>

$$ MAE = \frac{1}{n}\sum_{i=1}^{n} \lvert Y_{i} - \hat{Y}_{i} \rvert$$
</p>

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# Fitting

.pull-left45[

There are two fundamentally different ways to find the set of parameters that minimizes loss.

<font style="font-size:24px"><b>Analytically</b></font>

In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the <i>normal equation</i>:

`$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$`

<font style="font-size:24px"><b>Numerically</b></font>

In most cases, parameters need to be found using <high>directed trial and error</high>, e.g., <i>gradient descent</i>:

`$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}-\gamma \nabla F(\boldsymbol \beta_{n})$$`

]

.pull-right45[

<p align = "center">
<img src="image/gradient.png" height=420px><br>
<font style="font-size:10px">adapted from <a href="https://me.me/i/machine-learning-gradient-descent-machine-learning-machine-learning-behind-the-ea8fe9fc64054eda89232d7ffc9ba60e">me.me</a></font>
</p>

]

---

.pull-left45[

# Fitting

<p style="padding-top:1px"></p>

There are two fundamentally different ways to find the set of parameters that minimizes loss.

<font style="font-size:24px"><b>Analytically</b></font>

In rare cases, the parameters can be <high>directly calculated</high>, e.g., using the <i>normal equation</i>:

`$$\boldsymbol \beta = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$`

<font style="font-size:24px"><b>Numerically</b></font>

In most cases, parameters need to be found using <high>directed trial and error</high>, e.g., <i>gradient descent</i>:

`$$\boldsymbol \beta_{n+1} = \boldsymbol \beta_{n}-\gamma \nabla F(\boldsymbol \beta_{n})$$`

]

.pull-right45[

<br><br>
<p align = "center">
<img src="image/gradient1.gif" height=250px><br>
<font style="font-size:10px">adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/">dunglai.github.io</a></font><br>
<img src="image/gradient2.gif" height=250px><br>
<font style="font-size:10px">adapted from <a href="https://dunglai.github.io/2017/12/21/gradient-descent/">dunglai.github.io</a></font>
</p>

]
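---

# Gradient descent in R

.pull-left45[

A minimal sketch of gradient descent in base R, assuming a single feature and MSE as the loss. The simulated data, starting values, learning rate `gamma`, and number of steps are all made-up choices for illustration.

Each iteration computes the gradient of the MSE and takes a small step <high>against</high> it, gradually reducing the loss.

]

.pull-right5[

```r
# Simulated data: one feature x, criterion y
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

b0 <- 0; b1 <- 0  # arbitrary start values
gamma <- .1       # learning rate

for (step in 1:100) {
  y_hat <- b0 + b1 * x  # current predictions
  # Gradient of MSE w.r.t. b0 and b1
  g0 <- -2 * mean(y - y_hat)
  g1 <- -2 * mean((y - y_hat) * x)
  # Step against the gradient
  b0 <- b0 - gamma * g0
  b1 <- b1 - gamma * g1
}

c(b0, b1)  # close to coef(lm(y ~ x))
```

]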
---

# 2 types of supervised problems

.pull-left5[

There are two types of supervised learning problems that can often be approached using the same model.

<font style="font-size:24px"><b>Regression</b></font>

Regression problems involve the <high>prediction of a quantitative criterion</high>. E.g., predicting the cholesterol level as a function of age.

<font style="font-size:24px"><b>Classification</b></font>

Classification problems involve the <high>prediction of a categorical criterion</high>. E.g., predicting the type of chest pain as a function of age.

]

.pull-right4[

<p align = "center">
<img src="image/twotypes.png" height=440px><br>
</p>

]

---

# Logistic regression

.pull-left45[

In [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), the class criterion `\(Y \in \{0,1\}\)` is also modeled as the <high>sum of features times weights</high>, but with the prediction being transformed using a <high>logistic link function</high>:

<p style="padding-top:10px"></p>

`$$\large \hat{Y} = Logistic(\beta_{0} + \beta_{1} \times X_1 + ...)$$`

<p style="padding-top:10px"></p>

The logistic function <high>maps predictions to the range of 0 and 1</high>, the two class values.

<p style="padding-top:10px"></p>

$$ Logistic(x) = \frac{1}{1+exp(-x)}$$

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# Logistic regression

.pull-left45[

In [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), the class criterion `\(Y \in \{0,1\}\)` is also modeled as the <high>sum of features times weights</high>, but with the prediction being transformed using a <high>logistic link function</high>:

<p style="padding-top:10px"></p>

`$$\large \hat{Y} = Logistic(\beta_{0} + \beta_{1} \times X_1 + ...)$$`

<p style="padding-top:10px"></p>

The logistic function <high>maps predictions to the range of 0 and 1</high>, the two class values.

<p style="padding-top:10px"></p>

$$ Logistic(x) = \frac{1}{1+exp(-x)}$$

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# Classification loss - two ways

.pull-left45[

<font style="font-size:24px"><b>Distance</b></font>

Logloss is <high>used to fit the parameters</high>; alternative distance measures are MSE and MAE.

`$$\small LogLoss = -\frac{1}{n}\sum_{i=1}^{n}(y_{i}\log(\hat{y}_{i})+(1-y_{i})\log(1-\hat{y}_{i}))$$`

`$$\small MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2, \: MAE = \frac{1}{n}\sum_{i=1}^{n} \lvert y_{i}-\hat{y}_{i} \rvert$$`

<font style="font-size:24px"><b>Overlap</b></font>

Does the <high>predicted class match the actual class</high>? Often preferred for <high>ease of interpretation</high>.

`$$\small Loss_{01}=\frac{1}{n}\sum_{i=1}^{n} I(y_{i} \neq \lfloor \hat{y}_{i} \rceil)$$`

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-9-1.png" width="90%" style="display: block; margin: auto;" />

]
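---

# Classification loss in R

.pull-left45[

A minimal sketch of both kinds of classification loss in base R, using a handful of <high>made-up</high> class labels `y` and predicted probabilities `y_hat` (both hypothetical).

`round()` maps the predicted probabilities to predicted classes, so the last line implements the 0-1 loss.

]

.pull-right5[

```r
# The logistic (link) function
logistic <- function(x) 1 / (1 + exp(-x))
logistic(0)  # maps to (0, 1): here .5

# Hypothetical labels and predicted
# class probabilities
y     <- c(1, 0, 1, 1, 0)
y_hat <- c(.9, .2, .6, .4, .1)

# Distance: log loss (approx. .37)
-mean(y * log(y_hat) +
      (1 - y) * log(1 - y_hat))

# Overlap: 0-1 loss (here .2)
mean(y != round(y_hat))
```

]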
---

# Confusion matrix

.pull-left45[

The confusion matrix <high>tabulates prediction matches and mismatches</high> as a function of the true class.

The confusion matrix permits specification of a number of <high>helpful performance metrics</high>.

<br>

<u> Confusion matrix </u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> </td><td> <eq><b>ŷ = 1</b></eq> </td><td> <eq><b>ŷ = 0</b></eq> </td></tr>
<tr><td bgcolor="white"> <eq><b>y = 1</b></eq> </td><td bgcolor="white"> <font color="#6ABA9A"> True positive (TP)</font> </td><td bgcolor="white"> <font color="#EA4B68"> False negative (FN)</font> </td></tr>
<tr><td> <eq><b>y = 0</b></eq> </td><td> <font color="#EA4B68"> False positive (FP)</font> </td><td> <font color="#6ABA9A"> True negative (TN)</font> </td></tr>
</table>

]

.pull-right45[

<b>Accuracy</b>: Of all cases, what percent of predictions are correct?

`$$\small Acc. = \frac{TP + TN}{TP + TN + FN + FP} = 1-Loss_{01}$$`

<p style="padding-top:10px"></p>

<b>Sensitivity</b>: Of the truly positive cases, what percent of predictions are correct?

`$$\small Sensitivity = \frac{TP}{TP + FN}$$`

<b>Specificity</b>: Of the truly negative cases, what percent of predictions are correct?

<p style="padding-top:10px"></p>

`$$\small Specificity = \frac{TN}{TN + FP}$$`

]

---

# Confusion matrix

.pull-left45[

The confusion matrix <high>tabulates prediction matches and mismatches</high> as a function of the true class.

The confusion matrix permits specification of a number of <high>helpful performance metrics</high>.

<br>

<u> Confusion matrix </u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> </td><td> <eq><b>"Default"</b></eq> </td><td> <eq><b>"Repay"</b></eq> </td></tr>
<tr><td bgcolor="white"> <eq><b>Default</b></eq> </td><td bgcolor="white"> <font color="#6ABA9A"> TP = 3</font> </td><td bgcolor="white"> <font color="#EA4B68"> FN = 1</font> </td></tr>
<tr><td> <eq><b>Repay</b></eq> </td><td> <font color="#EA4B68"> FP = 1</font> </td><td> <font color="#6ABA9A"> TN = 2</font> </td></tr>
</table>

]

.pull-right45[

<b>Accuracy</b>: Of all cases, what percent of predictions are correct?

`$$\small Acc. = \frac{TP + TN}{TP + TN + FN + FP} = 1-Loss_{01}$$`

<p style="padding-top:10px"></p>

<b>Sensitivity</b>: Of the truly positive cases, what percent of predictions are correct?

`$$\small Sensitivity = \frac{TP}{TP + FN}$$`

<b>Specificity</b>: Of the truly negative cases, what percent of predictions are correct?

<p style="padding-top:10px"></p>

`$$\small Specificity = \frac{TN}{TN + FP}$$`

]
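---

# Confusion matrix metrics in R

.pull-left45[

Using the counts from the Default/Repay example on the previous slide, the three performance metrics can be computed directly. A minimal sketch in base R:

]

.pull-right5[

```r
# Counts from the Default/Repay example
TP <- 3; FN <- 1; FP <- 1; TN <- 2

# Accuracy: correct predictions overall
(TP + TN) / (TP + TN + FN + FP)  # = 0.714

# Sensitivity: correct among true positives
TP / (TP + FN)  # = 0.75

# Specificity: correct among true negatives
TN / (TN + FP)  # = 0.667
```

]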
---

class: center, middle

<br><br>

# Let's fit regression models with `caret`!

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/09/Caret-package-in-R.png" width="70%" style="display: block; margin: auto;" />

---

# `caret`

.pull-left45[

<u><mono>caret</mono>'s key fitting functions</u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> <b>Function</b> </td><td> <b>Description</b> </td></tr>
<tr><td bgcolor="white"> <mono>trainControl()</mono> </td><td bgcolor="white"> Choose settings for how fitting should be carried out. </td></tr>
<tr><td> <mono>train()</mono> </td><td> Specify the model and find *best* parameters. </td></tr>
<tr><td bgcolor="white"> <mono>postResample()</mono> </td><td bgcolor="white"> Evaluate model performance (fitting or prediction) for regression. </td></tr>
<tr><td> <mono>confusionMatrix()</mono> </td><td> Evaluate model performance (fitting or prediction) for classification. </td></tr>
</table>

]

.pull-right45[

```r
# Step 1: Define control parameters
# trainControl()

ctrl <- trainControl(...)

# Step 2: Train and explore model
# train()

mod <- train(...)
summary(mod)
mod$finalModel  # see final model

# Step 3: Assess fit
# predict(), postResample(), confusionMatrix()

fit <- predict(mod)
postResample(fit, truth)
confusionMatrix(fit, truth)
```

<!-- Caret documentation: [http://topepo.github.io/caret/](http://topepo.github.io/caret/) -->

<!-- <iframe src="http://topepo.github.io/caret/" height="480px" width = "500px"></iframe> -->

]

---

# `trainControl()`

.pull-left45[

`trainControl()` controls how `caret` fits an ML model.

For now, set `method = "none"` to keep things simple. More on this in the session on <b>optimization</b>.

```r
# Fit the model without any
# advanced parameter tuning methods
ctrl <- trainControl(method = "none")
```

]

.pull-right45[

```r
?trainControl
```

<img src="image/traincontrol_help.jpg" width="100%" style="display: block; margin: auto;" />

]

---

# `train()`

.pull-left4[

`train()` is the fitting <high>workhorse</high> of `caret`, offering you <high>200+ models</high> just by changing the <high>method</high> argument!

<u><mono>train()</mono>'s key arguments</u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> <b>Argument</b> </td><td> <b>Description</b> </td></tr>
<tr><td bgcolor="white"> <mono>form</mono> </td><td bgcolor="white"> Formula specifying features and criterion. </td></tr>
<tr><td> <mono>data</mono> </td><td> Training data. </td></tr>
<tr><td bgcolor="white"> <mono>method</mono> </td><td bgcolor="white"> The model (algorithm). </td></tr>
<tr><td> <mono>trControl</mono> </td><td> Control parameters for fitting. </td></tr>
<tr><td bgcolor="white"> <mono>tuneGrid</mono>, <mono>preProcess</mono> </td><td bgcolor="white"> Cool stuff for later. </td></tr>
</table>

]

.pull-right5[

```r
# Fit a regression model predicting income

income_mod <- train(form = income ~ .,  # Formula
                    data = baselers,    # Training data
                    method = "glm",     # Regression
                    trControl = ctrl)   # Control parameters

income_mod
```

```
Generalized Linear Model 

1000 samples
  19 predictor

No pre-processing
Resampling: None 
```

]

---

# `train()`

.pull-left4[

`train()` is the fitting <high>workhorse</high> of `caret`, offering you <high>200+ models</high> just by changing the <high>method</high> argument!

<u><mono>train()</mono>'s key arguments</u>

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> <b>Argument</b> </td><td> <b>Description</b> </td></tr>
<tr><td bgcolor="white"> <mono>form</mono> </td><td bgcolor="white"> Formula specifying features and criterion. </td></tr>
<tr><td> <mono>data</mono> </td><td> Training data. </td></tr>
<tr><td bgcolor="white"> <mono>method</mono> </td><td bgcolor="white"> The model (algorithm). </td></tr>
<tr><td> <mono>trControl</mono> </td><td> Control parameters for fitting. </td></tr>
<tr><td bgcolor="white"> <mono>tuneGrid</mono>, <mono>preProcess</mono> </td><td bgcolor="white"> Cool stuff for later. </td></tr>
</table>

]

.pull-right5[

```r
# Fit a random forest predicting income

income_mod <- train(form = income ~ .,  # Formula
                    data = baselers,    # Training data
                    method = "rf",      # Random Forest
                    trControl = ctrl)   # Control parameters

income_mod
```

```
Random Forest 

1000 samples
  19 predictor

No pre-processing
Resampling: None 
```

]

---

.pull-left4[

# `train()`

<p style="padding-top:1px"></p>

`train()` is the fitting <high>workhorse</high> of `caret`, offering you <high>200+ models</high> just by changing the <high>method</high> argument!

Find all 200+ models [here](http://topepo.github.io/caret/available-models.html).

]

.pull-right5[

<br><br>

<iframe width="600" height="480" src="https://topepo.github.io/caret/available-models.html" frameborder="0"></iframe>

]

---

# `train()`

.pull-left4[

The criterion must have the right type:

<b><mono>numeric</mono> criterion = Regression</b><br>
<b><mono>factor</mono> criterion = Classification</b>!
```
# A tibble: 5 x 5
  Default   Age Gender Cards Education
    <dbl> <dbl> <chr>  <dbl>     <dbl>
1       0    45 M          3        11
2       1    36 F          2        14
3       0    76 F          5        12
4       1    25 M          2        17
5       1    36 F          3        12
```

]

.pull-right5[

```r
# Will be a regression task
loan_mod <- train(form = Default ~ .,
                  data = Loans,
                  method = "glm",
                  trControl = ctrl)

# Will be a classification task
loan_mod <- train(form = factor(Default) ~ .,
                  data = Loans,
                  method = "glm",
                  trControl = ctrl)
```

]

---

# <mono>.$finalModel</mono>

.pull-left4[

The `train()` function returns a `list` with a key object called `finalModel`: this is your <high>final machine learning model</high>!

Access the model with `mod$finalModel` and <high>explore</high> the object with generic functions:

<table style="cellspacing:0; cellpadding:0; border:none;">
<tr><td> <b>Function</b> </td><td> <b>Description</b> </td></tr>
<tr><td bgcolor="white"> <mono>summary()</mono> </td><td bgcolor="white"> Overview of the most important results. </td></tr>
<tr><td bgcolor="white"> <mono>names()</mono> </td><td bgcolor="white"> See all named elements you can access with $. </td></tr>
</table>

]

.pull-right5[

```r
# Create a regression object
income_mod <- train(form = income ~ age + height,
                    data = baselers)  # Training data

# Look at all named outputs
names(income_mod$finalModel)
```

```
 [1] "coefficients"  "residuals"     "fitted.values"
 [4] "effects"       "R"             "rank"         
 [ reached getOption("max.print") -- omitted 28 entries ]
```

```r
# Access specific outputs
income_mod$finalModel$coefficients
```

```
(Intercept)         age      height 
    177.084     151.786       3.466 
```

]

---

# `predict()`

.pull-left5[

The `predict()` function <high>produces predictions</high> from a model. Simply put the model object as the first argument.

```r
# Get fitted values
glm_fits <- predict(object = income_mod)

glm_fits[1:8]
```

```
    1     2     3     4     5     6     7     8 
 5508  6960  6982  8645  5325 10648  8663  4592 
```

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-26-1.png" width="90%" style="display: block; margin: auto;" />

]

---

# `postResample()`

.pull-left45[

The `postResample()` function <high>gives a simple summary</high> of a model's performance in a <high>regression task</high>. Simply put the predicted values and the true values inside the function.

```r
# evaluate
postResample(glm_fits, baselers$income)
```

```
    RMSE Rsquared      MAE 
1173.079    0.821  937.113 
```

]

.pull-right45[

<img src="Fitting_files/figure-html/unnamed-chunk-28-1.png" width="90%" style="display: block; margin: auto;" />

]

---

.pull-left5[

<p style="padding-top:10px"></p>

## `confusionMatrix()`

<p style="padding-top:10px"></p>

The `confusionMatrix()` function does the same for a model's performance in a <high>classification task</high>. Simply put the predicted values and the true values inside the function.
```r
# Convert eyecor to factor
baselers$eyecor <- factor(baselers$eyecor)

# Fit a glm model for classification
eyecor_mod <- train(form = eyecor ~ age + height,
                    data = baselers,
                    method = "glm",
                    trControl = ctrl)

# evaluate
confusionMatrix(predict(eyecor_mod),
                baselers$eyecor)
```

]

.pull-right4[

<br>

```
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no    0   0
       yes 353 647
                                        
               Accuracy : 0.647         
                 95% CI : (0.616, 0.677)
    No Information Rate : 0.647         
    P-Value [Acc > NIR] : 0.514         
                                        
                  Kappa : 0             
 Mcnemar's Test P-Value : <2e-16        
                                        
            Sensitivity : 0.000         
            Specificity : 1.000         
         Pos Pred Value :   NaN         
         Neg Pred Value : 0.647         
             Prevalence : 0.353         
         Detection Rate : 0.000         
   Detection Prevalence : 0.000         
      Balanced Accuracy : 0.500         
                                        
       'Positive' Class : no            
```

]

---

class: middle, center

<h1><a href=https://therbootcamp.github.io/ML_2019Oct/_sessions/Fitting/Fitting_practical.html>Practical</a></h1>