Unsupervised learning

class: center, middle, inverse, title-slide

# Unsupervised learning
### Machine Learning with R <a href='https://therbootcamp.github.io'> The R Bootcamp @ DHLab </a>
### November 2020

---

layout: true

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Machine Learning with R | November 2020
 
 </a>
 
 </div>

---

# Unsupervised learning

.pull-left45[

<ul>
 <li class="m1">The goal of unsupervised learning is the identification of <high>hidden structure</high> in the <high>similarities</high> of cases or features.</li> 
 <li class="m2">There are two domains: <high>cluster analysis</high> and <high>dimensionality reduction</high>.</li> 
</ul>

]

.pull-right45[

]

---

.pull-left35[

# Two domains of UL

<ul>
 <li class="m1">Dimensionality reduction</li>
 <ul>
 <li>Focuses on the similarities of <high>features</high>.</li>
 <li>Aims to <high>identify relevant dimensions</high> in the data and is the basis of deep learning.</li>
 </ul> 
 <li class="m2">Cluster analysis</li>
 <ul>
 <li>Focuses on the similarities of <high>cases</high>.</li>
 <li>Aims to <high>identify groups and outliers</high> in the data.</li>
 </ul> 
</ul>

]

.pull-right55[
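As a minimal sketch of dimensionality reduction, the chunk below runs a principal component analysis with base R's `prcomp()`. The `features` data frame is simulated for illustration (an assumption, not workshop data): three features that all track one latent dimension.

```r
# simulated stand-in data: three features driven by a single latent dimension
set.seed(1)
latent <- rnorm(200)
features <- data.frame(
  x1 =  latent + rnorm(200, sd = 0.1),
  x2 =  latent + rnorm(200, sd = 0.1),
  x3 = -latent + rnorm(200, sd = 0.1)
)

# principal component analysis on standardized features
pca <- prcomp(features, scale. = TRUE)

# share of variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

Because the features are highly similar, the first component should capture almost all of the variance.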

]

---

.pull-left4[

# Gapminder

<ul>
 <li class="m1">Hans Rosling's Gapminder project maps the <high>health and economic development of countries</high> in the world.</li>
 <li class="m2">Are there <high>clearly separable classes of countries</high>, e.g., <high>developed versus developing</high> countries?</li>
</ul>

<img src="image/rosling.png" height=200px> 
Hans Rosling, 1948-2017, adapted from <a href="https://www.volkskrant.nl/cultuur-media/hans-rosling-68-volksverheffer-met-rocksterstatus~b3746a58/?referer=https%3A%2F%2Fwww.google.com%2F">volkskrant.nl</a>

]

.pull-right5[
 

<img src="image/gapminder.png" height=520px> 
adapted from <a href="https://www.ullstein-buchverlage.de/nc/buch/details/factfulness-9783548060415.html">Factfulness, Hans Rosling, Ullstein</a>

]

---

.pull-left4[

# Gapminder

]

.pull-right5[
 

<img src="image/data.gif"> 
data from <a href="https://cran.r-project.org/web/packages/gapminder/README.html">gapminder</a>

]

---

.pull-left4[

# Gap in 1952

<ul>
 <li class="m1">In 1952, are there <high>classes of countries</high> concerning life expectancy and GDP per capita?</li> 
 <li class="m2">Algorithms:</li> 
 <ul>
 <li><high>k-means</high></li> 
 <li><high>DBSCAN</high></li> 
 <li><high>Gaussian mixtures</high></li>
 </ul>
</ul>

]

.pull-right5[
 

<img src="image/gap1952.png">

]

---

.pull-left4[

# k-means

<ul>
 <li class="m1">Perhaps the most frequently employed algorithm for cluster analysis.</li> 
 <li class="m2">Algorithm</li> 
 <ol class="">
 <li>Choose random <high>start points</high> for the cluster centroids.</li> 
 <li>Assign cases to the <high>closest centroid</high>.</li> 
 <li>Calculate <high>new cluster centroids</high> as the average of the assigned cases.</li>
 <li>Repeat steps 2 and 3 until the assignments stop changing.</li>
 </ol>
</ul>

]

.pull-right5[
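The steps on the left can be sketched in a few lines of base R. `simple_kmeans()` and the `toy` data are hypothetical names used for illustration only; in practice `kmeans()` from the stats package does this work.

```r
# minimal k-means sketch; simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. random start points: k distinct cases serve as initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # 2. assign each case to its closest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # 3. recompute each centroid as the mean of its assigned cases
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# simulated stand-in data: two well-separated groups of 50 cases
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```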

]

---

.pull-left4[

# k-means

<ul>
 <li class="m1">Perhaps the most <high>frequently employed</high> algorithm for cluster analysis.</li> 
 <li class="m2">Algorithm</li> 
 <ol class="">
 <li>Choose random <high>start points</high> for the cluster centroids.</li> 
 <li>Assign cases to the <high>closest centroid</high>.</li> 
 <li>Calculate <high>new cluster centroids</high> as the average of the assigned cases.</li>
 <li>Repeat steps 2 and 3 until the assignments stop changing.</li>
 </ol>
</ul>

]

.pull-right5[

```r
# calculate k-means
gap_kmeans <- kmeans(gap1952, 
 centers = 3)

# show content
names(gap_kmeans)
```

```
## [1] "cluster"      "centers"      "totss"       
## [4] "withinss"     "tot.withinss" "betweenss"   
## [7] "size"         "iter"         "ifault"
```

```r
# show clusters
gap_kmeans$cluster
```

```
##   [1] 3 3 3 2 2 1 2 1 3 1 3 2 3 3 3 3 3 3 3 3 1 3
##  [23] 3 2 3 3 3 3 3 2 3 2 2 1 1 2 3 2 3 2 3 3 3 2
##  [45] 1 2 3 1 3 2 3 3 3 3 3 2 2 1 3 3 2 2 2 2 2 2
##  [67] 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 3
##  [89] 3 1 1 2 3 3 1 3 3 3 3 2 3 2 2 2 2 2 3 3 2 3
## [111] 2 3 3 2 2 3 2 2 3 3 3 1 1 3 3 3 3 3 2 3 3 3
## [133] 1 1 2 1 3 3 3 3 3
```
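Because k-means starts from random centroids and relies on distances, results can depend on the seed and on the scale of the features. The sketch below shows one common setup: a fixed seed, standardized features, and several random restarts via `nstart`. The `toy` data frame is simulated stand-in data (an assumption), not `gap1952`.

```r
# fix the seed so the random start points are reproducible
set.seed(42)

# simulated stand-in data with features on very different scales
toy <- data.frame(
  life_exp   = rnorm(100, mean = 55, sd = 10),
  gdp_percap = rlnorm(100, meanlog = 8, sdlog = 1)
)

# standardize features so that no single feature dominates the distances
toy_scaled <- scale(toy)

# run k-means with 25 random restarts and keep the best solution
toy_kmeans <- kmeans(toy_scaled, centers = 3, nstart = 25)

# show cluster sizes
toy_kmeans$size
```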

]

---

.pull-left4[

# k-selection

<ul>
 <li class="m1">There is <high>no correct k</high>!</li> 
 <li class="m2">A <high>reasonable k</high> can be found by balancing two aspects:</li>
 <ul>
 <li>k should be <high>as small as possible</high>.</li>
 <li>The k clusters should describe the data <high>as accurately as possible</high>.</li>
 </ul> 
 <li class="m3">Approaches</li>
 <ul>
 <li>Elbow of the <high>within variance</high>.</li>
 <li>Gap-statistic.</li>
 <li>Slope-statistic.</li>
 <li>Cluster-instability.</li>
 <li>etc.</li>
 </ul>
</ul>

]

.pull-right5[
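A minimal sketch of the elbow approach: fit k-means for a range of k and track the total within-cluster sum of squares (`tot.withinss`). The `toy` data are simulated with three groups for illustration (an assumption, not `gap1952`).

```r
# simulated stand-in data: three well-separated groups of 50 cases
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 6),  ncol = 2),
             matrix(rnorm(100, mean = 12), ncol = 2))

# total within-cluster sum of squares for k = 1, ..., 8
ks <- 1:8
within_ss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# show within variance by k
data.frame(k = ks, within_ss = round(within_ss, 1))
```

Plotting `within_ss` against `ks` shows the elbow: the k after which further clusters reduce the within variance only slightly.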

]

---

.pull-left4[

# k-selection

<ul>
 <li class="m1">There is <high>no correct k</high>!</li> 
 <li class="m2">A <high>reasonable k</high> can be found by balancing two aspects:</li>
 <ul>
 <li>k should be <high>as small as possible</high>.</li>
 <li>The k clusters should describe the data <high>as accurately as possible</high>.</li>
 </ul> 
 <li class="m3">Approaches</li>
 <ul>
 <li>Elbow of the <high>within variance</high>.</li>
 <li>Gap-statistic.</li>
 <li>Slope-statistic.</li>
 <li>Cluster-instability.</li>
 <li>etc.</li>
 </ul>
</ul>

]

.pull-right5[

```r
# load cstab
library(cstab)

# calculate k-sel
gap_ksel <- cDistance(as.matrix(gap1952), 
 kseq = 2:10,
 method = "kmeans")

# show estimated cluster number
gap_ksel$k_Gap
```

```
## [1] 1
```

```r
gap_ksel$k_Slope
```

```
## [1] 2
```
]

---

.pull-left4[

# k-selection

]

.pull-right5[

```r
# load cstab
library(cstab)

# calculate k-sel
gap_ksel <- cStability(as.matrix(gap1952), 
 kseq = 2:10,
 method = "kmeans")
```

```r
# show estimated cluster number
gap_ksel$k_instab
```

```
## [1] 2
```

]

---

.pull-left4[

# DBSCAN

<ul>
 <li class="m1">DBSCAN = Density-Based Spatial Clustering of Applications with Noise.</li> 
 <li class="m2">Algorithm</li> 
 <ol>
 <li>For every point, count the number N of other <high>points within distance &epsilon;</high>: a. N = 0 &rarr; <high>outlier</high>; b. N &ge; minPts &rarr; <high>core point</high>; c. minPts &gt; N &gt; 0 &rarr; <high>undetermined</high>.</li> 
 <li>Join <high>core points</high> with distance &lt; &epsilon; into clusters.</li> 
 <li>Add <high>undetermined points</high> with distance &lt; &epsilon; to a core point to that point's cluster; otherwise treat them as outliers.</li>
 </ol>
</ul>

]

.pull-right5[

&epsilon; = .2
<img src="image/dbscan_1.gif">
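The three steps on the left can be sketched in base R. `simple_dbscan()` and the `toy` data are hypothetical and for illustration only. The sketch follows the slide's definition, in which N counts the other points; packages may define `minPts` differently.

```r
# minimal DBSCAN sketch; simple_dbscan is a hypothetical helper
simple_dbscan <- function(x, eps, min_pts = 5) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # 1. count the other points within eps (subtract 1 for the point itself)
  n_neighbors <- rowSums(d < eps) - 1
  is_core <- n_neighbors >= min_pts
  cluster <- integer(n)  # 0 = outlier
  current <- 0L
  # 2. join core points that lie within eps of each other into clusters
  for (i in which(is_core)) {
    if (cluster[i] != 0L) next
    current <- current + 1L
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      if (cluster[p] != 0L) next
      cluster[p] <- current
      queue <- c(queue, which(d[p, ] < eps & is_core & cluster == 0L))
    }
  }
  # 3. attach undetermined points to the cluster of the nearest core point
  #    within eps; all remaining points stay outliers (0)
  for (i in which(!is_core)) {
    cores <- which(is_core & d[i, ] < eps)
    if (length(cores) > 0) {
      cluster[i] <- cluster[cores[which.min(d[i, cores])]]
    }
  }
  cluster
}

# simulated stand-in data: two dense groups of 40 cases plus one distant point
set.seed(1)
toy <- rbind(matrix(rnorm(80, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(80, mean = 5, sd = 0.3), ncol = 2),
             c(20, 20))
toy_clusters <- simple_dbscan(toy, eps = 1, min_pts = 5)
table(toy_clusters)
```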

]

---

.pull-left4[

# DBSCAN

]

.pull-right5[

&epsilon; = .3
<img src="image/dbscan_2.gif">

]

---

.pull-left4[

# DBSCAN

]

.pull-right5[

```r
# load dbscan
library(dbscan)

# calculate dbscan
gap_dbscan <- dbscan(scale(gap1952), 
 eps = .2)

# show results
gap_dbscan
```

```
## DBSCAN clustering for 141 objects.
## Parameters: eps = 0.2, minPts = 5
## The clustering contains 2 cluster(s) and 57 noise points.
## 
##  0  1  2 
## 57 72 12 
## 
## Available fields: cluster, eps, minPts
```

]

---

.pull-left4[

# Properties of DBSCAN

<ul>
 <li class="m1">Advantages</li>
 <ul>
 <li><high>Identifies k</high> automatically.</li>
 <li>Can identify <high>"complex" clusters</high>.</li>
 <li>Can identify <high>outliers</high>.</li>
 </ul>
<li class="m2">Disadvantages</li>
 <ul>
 <li><high>Very sensitive</high> to parameter settings.</li>
 <li>Struggles with <high>clusters of different density</high>.</li>
 </ul>
</ul>

]

.pull-right5[

]

---

.pull-left4[

# Properties of DBSCAN

]

.pull-right5[

]

---

.pull-left4[

# Gaussian mixtures

<ul>
 <li class="m1">A probabilistic <high>extension of the k-means algorithm</high> on the basis of multivariate normal distributions.</li>
 <li class="m2">Works well with clusters of different <high>orientation and density</high>.</li>
 <li class="m3">Straightforward estimation of k via <high>model comparison</high>.</li>
 <li class="m4">Difficulties with <high>complex topologies</high>.</li>
</ul>

]

.pull-right5[
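To illustrate the probabilistic idea, here is a minimal one-dimensional sketch of a Gaussian mixture fitted with the EM algorithm. It replaces k-means' hard assignments with soft responsibilities and compares models via BIC. `em_gmm()` and the simulated `x` are hypothetical. BIC is computed here so that lower is better; sign conventions differ between packages.

```r
# minimal 1-D Gaussian mixture via EM; em_gmm is a hypothetical helper
em_gmm <- function(x, k, n_iter = 200) {
  n <- length(x)
  # weighted component densities as an n x k matrix
  comp_dens <- function(w, mu, sigma) {
    matrix(sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sigma[j])),
           nrow = n)
  }
  # start values: means at spread-out quantiles, common sd, equal weights
  mu <- as.numeric(quantile(x, probs = (seq_len(k) - 0.5) / k))
  sigma <- rep(sd(x), k)
  w <- rep(1 / k, k)
  for (i in seq_len(n_iter)) {
    # E-step: soft assignment of each case to each component
    dens <- comp_dens(w, mu, sigma)
    resp <- dens / rowSums(dens)
    # M-step: update weights, means, and standard deviations
    nk <- colSums(resp)
    w <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    sigma <- pmax(sigma, 1e-3)  # guard against collapsing components
  }
  dens <- comp_dens(w, mu, sigma)
  loglik <- sum(log(rowSums(dens)))
  n_par <- 3 * k - 1  # k means, k sds, k - 1 free weights
  list(mu = mu, sigma = sigma, weights = w,
       resp = dens / rowSums(dens),
       bic = -2 * loglik + n_par * log(n))
}

# simulated stand-in data: two groups with means 0 and 5
set.seed(1)
x <- c(rnorm(150, mean = 0), rnorm(150, mean = 5))

# model comparison: fit k = 1, 2, 3 and compare BIC
fits <- lapply(1:3, function(k) em_gmm(x, k))
sapply(fits, function(f) round(f$bic, 1))
```

Under this convention, the k with the lowest BIC would be preferred.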

]

---

.pull-left4[

# Gaussian mixtures

]

.pull-right5[

]

---

.pull-left4[

# Gaussian mixtures

]

.pull-right5[

```r
# load gaussian mixture package
library(mclust)

# calculate gaussian mixture
gap_mclust <- Mclust(gap1952)

# show results
gap_mclust
```

```
## 'Mclust' model object: (VVE,3) 
## 
## Available components: 
##  [1] "call"           "data"          
##  [3] "modelName"      "n"             
##  [5] "d"              "G"             
##  [7] "BIC"            "loglik"        
##  [9] "df"             "bic"           
## [11] "icl"            "hypvol"        
## [13] "parameters"     "z"             
## [15] "classification" "uncertainty"
```

]

---

.pull-left4[

# Gaussian mixtures

]

.pull-right5[

```r
# load gaussian mixture package
library(mclust)

# calculate gaussian mixture
gap_mclust <- Mclust(gap1952)

# show cluster assignments
gap_mclust$classification
```

```
##   [1] 1 2 2 2 3 3 3 2 1 3 1 2 2 1 2 2 1 1 1 1 3 1
##  [23] 1 2 1 2 1 1 2 2 1 2 2 3 3 2 1 2 1 2 1 1 1 3
##  [45] 3 2 1 3 1 3 2 1 1 1 2 2 3 3 1 1 2 2 3 3 3 2
##  [67] 3 1 1 2 1 2 1 1 2 1 1 2 1 1 2 2 1 2 1 1 1 2
##  [89] 1 3 3 2 1 1 3 1 1 2 3 2 1 2 2 3 2 2 1 1 2 1
## [111] 2 1 2 3 3 1 2 3 2 1 1 3 3 2 2 1 2 1 2 1 2 1
## [133] 3 3 3 2 1 1 1 1 1
```

]

---
class: middle, center

<h1><a href=https://therbootcamp.github.io/ML-DHLab/_sessions/Unsupervised/Unsupervised_practical.html>Practical</a></h1>