class: center, middle, inverse, title-slide

# Unsupervised learning
### Machine Learning with R
The R Bootcamp @ DHLab
### November 2020

---

layout: true

<div class="my-footer">
  <span style="text-align:center">
    <span>
      <img src="" height=14 style="vertical-align: middle"/>
    </span>
    <a href="">
      <span style="padding-left:82px">
        <font color="#7E7E7E">
        </font>
      </span>
    </a>
    <a href="">
      <font color="#7E7E7E">
        Machine Learning with R | November 2020
      </font>
    </a>
  </span>
</div>

---

# Unsupervised learning

.pull-left45[

<ul>
  <li class="m1"><span>The goal of unsupervised learning is the identification of <high>hidden structure</high> in the <high>similarities</high> of cases or features.</span></li><br>
  <li class="m2"><span>There are two domains: <high>cluster analysis</high> and <high>dimensionality reduction</high>.</span></li><br>
</ul>

]

.pull-right45[

<p align = "center">
<img src="image/unsupervised.jpg"><br>
<font style="font-size:10px">from <a href=""></a></font>
</p>

]

---

.pull-left35[

# Two domains of UL

<ul>
  <li class="m1"><span><b>Dimensionality reduction</b></span></li>
  <ul>
    <li><span>Focuses on the similarities of <high>features</high>.</span></li>
    <li><span>Aims to <high>identify relevant dimensions</high> in the data and is a key idea behind deep learning.</span></li>
  </ul><br>
  <li class="m2"><span><b>Cluster analysis</b></span></li>
  <ul>
    <li><span>Focuses on the similarities of <high>cases</high>.</span></li>
    <li><span>Aims to <high>identify groups and outliers</high> in the data.</span></li>
  </ul>
</ul>

]

.pull-right55[

<br>
<p align = "center">
<img src="image/types.png" height=520px><br>
</p>

]

---

.pull-left4[

# Gapminder

<ul>
  <li class="m1"><span>Hans Rosling's Gapminder project maps the <high>health and economic development of countries</high> in the world.</span></li>
  <li class="m2"><span>Are there <high>clearly separable classes of countries</high>, e.g. <high>developed versus developing</high> countries?</span></li>
</ul>

<p align = "center" style="margin-top:55px">
<img src="image/rosling.png" height=200px><br>
<font style="font-size:10px">Hans Rosling, 1948-2017, adapted from <a href=""></a></font>
</p>

]

.pull-right5[

<br>
<p align = "center">
<img src="image/gapminder.png" height=520px><br>
<font style="font-size:10px">adapted from <a href="">Factfulness, Hans Rosling, Ullstein</a></font>
</p>

]

---

.pull-left4[

# Gapminder

<ul>
  <li class="m1"><span>Hans Rosling's Gapminder project maps the <high>health and economic development of countries</high> in the world.</span></li>
  <li class="m2"><span>Are there <high>clearly separable classes of countries</high>, e.g. <high>developed versus developing</high> countries?</span></li>
</ul>

<p align = "center" style="margin-top:55px">
<img src="image/rosling.png" height=200px><br>
<font style="font-size:10px">Hans Rosling, 1948-2017, adapted from <a href=""></a></font>
</p>

]

.pull-right5[

<br>
<p align = "center">
<img src="image/data.gif"><br>
<font style="font-size:10px">data from <a href="">gapminder</a></font>
</p>

]

---

.pull-left4[

# Gap in 1952

<ul>
  <li class="m1"><span>In 1952, were there <high>classes of countries</high> with respect to life expectancy and GDP per capita?</span></li><br>
  <li class="m2"><span>Algorithms:</span></li><br>
  <ul>
    <li><span><high><i>k</i>-means</high></span></li><br>
    <li><span><high>DBSCAN</high></span></li><br>
    <li><span><high>Gaussian mixtures</high></span></li>
  </ul>
</ul>

]

.pull-right5[

<br>
<p align = "center">
<img src="image/gap1952.png"><br>
</p>

]

---

.pull-left4[

# <i>k</i>-means

<ul>
  <li class="m1"><span>Probably the most frequently used algorithm for cluster analysis.</span></li><br>
  <li class="m2"><span><b>Algorithm</b></span></li><br>
  <ol class="">
    <li><span>Random <high>start points</high> for cluster centroids.</span></li><br>
    <li><span>Assign cases to <high>closest centroids</high>.</span></li><br>
    <li><span>Calculate <high>new
cluster centroids</high> as the average of all points assigned to each cluster.</span></li><br>
    <li><span>Repeat steps 2 and 3 until the assignments <high>no longer change</high>.</span></li>
  </ol>
</ul>

]

.pull-right5[

<br>
<p align = "center">
<img src="image/kmeans.gif"><br>
</p>

]

---

.pull-left4[

# <i>k</i>-means

<ul>
  <li class="m1"><span>Probably the most <high>frequently used</high> algorithm for cluster analysis.</span></li><br>
  <li class="m2"><span><b>Algorithm</b></span></li><br>
  <ol class="">
    <li><span>Random <high>start points</high> for cluster centroids.</span></li><br>
    <li><span>Assign cases to <high>closest centroids</high>.</span></li><br>
    <li><span>Calculate <high>new cluster centroids</high> as the average of all points assigned to each cluster.</span></li><br>
    <li><span>Repeat steps 2 and 3 until the assignments <high>no longer change</high>.</span></li>
  </ol>
</ul>

]

.pull-right5[

<br>

```r
# run k-means with k = 3 clusters
gap_kmeans <- kmeans(gap1952, centers = 3)

# show the elements of the result
names(gap_kmeans)
```

```
## [1] "cluster"      "centers"      "totss"
## [4] "withinss"     "tot.withinss" "betweenss"
## [7] "size"         "iter"         "ifault"
```

```r
# show the cluster assignment of each case
gap_kmeans$cluster
```

```
##   [1] 3 3 3 2 2 1 2 1 3 1 3 2 3 3 3 3 3 3 3 3 1 3
##  [23] 3 2 3 3 3 3 3 2 3 2 2 1 1 2 3 2 3 2 3 3 3 2
##  [45] 1 2 3 1 3 2 3 3 3 3 3 2 2 1 3 3 2 2 2 2 2 2
##  [67] 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 3
##  [89] 3 1 1 2 3 3 1 3 3 3 3 2 3 2 2 2 2 2 3 3 2 3
## [111] 2 3 3 2 2 3 2 2 3 3 3 1 1 3 3 3 3 3 2 3 3 3
## [133] 1 1 2 1 3 3 3 3 3
```

]

---

.pull-left4[

# <i>k</i>-selection

<ul>
  <li class="m1"><span>There is <high>no correct <i>k</i></high>!</span></li><br>
  <li class="m2"><span>A <high>reasonable <i>k</i></high> can be found by balancing two aims:</span></li>
  <ul>
    <li><span><i>k</i> should be <high>as small as possible</high>.</span></li>
    <li><span>The <i>k</i> clusters should describe the data <high>as accurately as possible</high>.</span></li>
  </ul><br>
  <li class="m3"><span><b>Approaches</b></span></li>
  <ul>
    <li><span>Elbow of the <high>within-cluster variance</high>.</span></li>
    <li><span>Gap-statistic.</span></li>
    <li><span>Slope-statistic.</span></li>
    <li><span>Cluster-instability.</span></li>
    <li><span>etc.</span></li>
  </ul>
</ul>

]
.pull-right5[

<br>
<p align = "center">
<img src="image/k-selection1.gif"><br>
<img src="image/k-selection2.gif"><br>
</p>

]

---

.pull-left4[

# <i>k</i>-selection

<ul>
  <li class="m1"><span>There is <high>no correct <i>k</i></high>!</span></li><br>
  <li class="m2"><span>A <high>reasonable <i>k</i></high> can be found by balancing two aims:</span></li>
  <ul>
    <li><span><i>k</i> should be <high>as small as possible</high>.</span></li>
    <li><span>The <i>k</i> clusters should describe the data <high>as accurately as possible</high>.</span></li>
  </ul><br>
  <li class="m3"><span><b>Approaches</b></span></li>
  <ul>
    <li><span>Elbow of the <high>within-cluster variance</high>.</span></li>
    <li><span>Gap-statistic.</span></li>
    <li><span>Slope-statistic.</span></li>
    <li><span>Cluster-instability.</span></li>
    <li><span>etc.</span></li>
  </ul>
</ul>

]

.pull-right5[

<br><br><br><br>

```r
# load cstab
library(cstab)

# distance-based k-selection for k = 2, ..., 10
gap_ksel <- cDistance(as.matrix(gap1952),
                      kseq = 2:10,
                      method = "kmeans")

# show the cluster number estimated by the gap statistic
gap_ksel$k_Gap
```

```
## [1] 1
```

```r
# show the cluster number estimated by the slope statistic
gap_ksel$k_Slope
```

```
## [1] 2
```

]

---

.pull-left4[

# <i>k</i>-selection

<ul>
  <li class="m1"><span>There is <high>no correct <i>k</i></high>!</span></li><br>
  <li class="m2"><span>A <high>reasonable <i>k</i></high> can be found by balancing two aims:</span></li>
  <ul>
    <li><span><i>k</i> should be <high>as small as possible</high>.</span></li>
    <li><span>The <i>k</i> clusters should describe the data <high>as accurately as possible</high>.</span></li>
  </ul><br>
  <li class="m3"><span><b>Approaches</b></span></li>
  <ul>
    <li><span>Elbow of the <high>within-cluster variance</high>.</span></li>
    <li><span>Gap-statistic.</span></li>
    <li><span>Slope-statistic.</span></li>
    <li><span>Cluster-instability.</span></li>
    <li><span>etc.</span></li>
  </ul>
</ul>

]

.pull-right5[

<br><br><br><br>

```r
# load cstab
library(cstab)

# instability-based k-selection for k = 2, ..., 10
gap_ksel <- cStability(as.matrix(gap1952),
                       kseq = 2:10,
                       method = "kmeans")
```

```r
# show estimated cluster number
gap_ksel$k_instab
```

```
## [1] 2
```

]

---

.pull-left4[

# DBSCAN

<ul>
  <li class="m1"><span>DBSCAN = Density-Based Spatial Clustering of Applications with Noise.</span></li><br>
  <li class="m2"><span><b>Algorithm</b></span></li><br>
  <ol>
    <li><span>For every point, count the other <high>points within distance <b>ε</b></high>:<br>a. N = 0 <b>→</b> <high>outlier</high><br>b. N ≥ <i>minPts</i> <b>→</b> <high>core point</high><br>c. <i>minPts</i> > N > 0 <b>→</b> <high>undetermined</high></span></li><br>
    <li><span>Join <high>core points</high> with distance < <b>ε</b> into clusters.</span></li><br>
    <li><span>Add <high>undetermined points</high> with distance < <b>ε</b> to a core point to that point's cluster; otherwise treat them as outliers.</span></li>
  </ol>
</ul>

]

.pull-right5[

<br>
<p align = "center">
<font style="font-size:20px;font-weight:900">ε = .2</font>
<img src="image/dbscan_1.gif"><br>
</p>

]

---

.pull-left4[

# DBSCAN

<ul>
  <li class="m1"><span>DBSCAN = Density-Based Spatial Clustering of Applications with Noise.</span></li><br>
  <li class="m2"><span><b>Algorithm</b></span></li><br>
  <ol>
    <li><span>For every point, count the other <high>points within distance <b>ε</b></high>:<br>a. N = 0 <b>→</b> <high>outlier</high><br>b. N ≥ <i>minPts</i> <b>→</b> <high>core point</high><br>c. <i>minPts</i> > N > 0 <b>→</b> <high>undetermined</high></span></li><br>
    <li><span>Join <high>core points</high> with distance < <b>ε</b> into clusters.</span></li><br>
    <li><span>Add <high>undetermined points</high> with distance < <b>ε</b> to a core point to that point's cluster; otherwise treat them as outliers.</span></li>
  </ol>
</ul>

]

.pull-right5[

<br>
<p align = "center">
<font style="font-size:20px;font-weight:900">ε = .3</font>
<img src="image/dbscan_2.gif"><br>
</p>

]

---

.pull-left4[

# DBSCAN

<ul>
  <li class="m1"><span>DBSCAN = Density-Based Spatial Clustering of Applications with Noise.</span></li><br>
  <li class="m2"><span><b>Algorithm</b></span></li><br>
  <ol>
    <li><span>For every point, count the other <high>points within distance <b>ε</b></high>:<br>a. N = 0 <b>→</b> <high>outlier</high><br>b. N ≥ <i>minPts</i> <b>→</b> <high>core point</high><br>c. <i>minPts</i> > N > 0 <b>→</b> <high>undetermined</high></span></li><br>
    <li><span>Join <high>core points</high> with distance < <b>ε</b> into clusters.</span></li><br>
    <li><span>Add <high>undetermined points</high> with distance < <b>ε</b> to a core point to that point's cluster; otherwise treat them as outliers.</span></li>
  </ol>
</ul>

]

.pull-right5[

<br><br><br><br>

```r
# load dbscan
library(dbscan)

# run DBSCAN on the standardized data
gap_dbscan <- dbscan(scale(gap1952), eps = .2)

# show results
gap_dbscan
```

```
## DBSCAN clustering for 141 objects.
## Parameters: eps = 0.2, minPts = 5
## The clustering contains 2 cluster(s) and 57 noise points.
##
##  0  1  2
## 57 72 12
##
## Available fields: cluster, eps, minPts
```

]

---

.pull-left4[

# Properties of DBSCAN

<ul>
  <li class="m1"><span><b>Advantages</b></span></li>
  <ul>
    <li><span><high>Identifies <i>k</i></high> automatically.</span></li>
    <li><span>Can identify <high>"complex" clusters</high>.</span></li>
    <li><span>Can identify <high>outliers</high>.</span></li>
  </ul>
  <li class="m2"><span><b>Disadvantages</b></span></li>
  <ul>
    <li><span><high>Very sensitive</high> to parameter settings.</span></li>
    <li><span>Struggles with <high>clusters of different density</high>.</span></li>
  </ul>
</ul>

]

.pull-right5[

<p align = "center">
<img src="image/dbscan_adv1.gif"><br>
<img src="image/dbscan_adv2.gif"><br>
</p>

]

---

.pull-left4[

# Properties of DBSCAN

<ul>
  <li class="m1"><span><b>Advantages</b></span></li>
  <ul>
    <li><span><high>Identifies <i>k</i></high> automatically.</span></li>
    <li><span>Can identify <high>"complex" clusters</high>.</span></li>
    <li><span>Can identify <high>outliers</high>.</span></li>
  </ul>
  <li class="m2"><span><b>Disadvantages</b></span></li>
  <ul>
    <li><span><high>Very sensitive</high> to parameter settings.</span></li>
    <li><span>Struggles with <high>clusters of different density</high>.</span></li>
  </ul>
</ul>

]

.pull-right5[

<p align = "center">
<img src="image/dbscan_adv3.gif"><br>
<img src="image/dbscan_adv4.gif"><br>
</p>

]

---

.pull-left4[

# Properties of DBSCAN

<ul>
  <li class="m1"><span><b>Advantages</b></span></li>
  <ul>
    <li><span><high>Identifies <i>k</i></high> automatically.</span></li>
    <li><span>Can identify <high>"complex" clusters</high>.</span></li>
    <li><span>Can identify <high>outliers</high>.</span></li>
  </ul>
  <li class="m2"><span><b>Disadvantages</b></span></li>
  <ul>
    <li><span><high>Very sensitive</high> to parameter settings.</span></li>
    <li><span>Struggles with <high>clusters of different density</high>.</span></li>
  </ul>
</ul>

]

.pull-right5[

<p align = "center">
<img src="image/dbscan_adv5.gif"><br>
<img
src="image/dbscan_adv6.gif"><br>
</p>

]

---

.pull-left4[

# Gaussian mixtures

<ul>
  <li class="m1"><span>A probabilistic <high>extension of the <i>k</i>-means algorithm</high> based on multivariate normal distributions.</span></li>
  <li class="m2"><span>Works well with clusters of different <high>orientation and density</high>.</span></li>
  <li class="m3"><span>Straightforward estimation of <i>k</i> via <high>model comparison</high>.</span></li>
  <li class="m4"><span>Difficulties with <high>complex topologies</high>.</span></li>
</ul>

]

.pull-right5[

<br>
<p align = "center">
<img src="image/mclust1.png"><br>
</p>

]

---

.pull-left4[

# Gaussian mixtures

<ul>
  <li class="m1"><span>A probabilistic <high>extension of the <i>k</i>-means algorithm</high> based on multivariate normal distributions.</span></li>
  <li class="m2"><span>Works well with clusters of different <high>orientation and density</high>.</span></li>
  <li class="m3"><span>Straightforward estimation of <i>k</i> via <high>model comparison</high>.</span></li>
  <li class="m4"><span>Difficulties with <high>complex topologies</high>.</span></li>
</ul>

]

.pull-right5[

<br>
<p align = "center">
<img src="image/mclust2.png"><br>
</p>

]

---

.pull-left4[

# Gaussian mixtures

<ul>
  <li class="m1"><span>A probabilistic <high>extension of the <i>k</i>-means algorithm</high> based on multivariate normal distributions.</span></li>
  <li class="m2"><span>Works well with clusters of different <high>orientation and density</high>.</span></li>
  <li class="m3"><span>Straightforward estimation of <i>k</i> via <high>model comparison</high>.</span></li>
  <li class="m4"><span>Difficulties with <high>complex topologies</high>.</span></li>
</ul>

]

.pull-right5[

<br><br><br><br>

```r
# load the Gaussian mixture package
library(mclust)

# fit Gaussian mixture models
gap_mclust <- Mclust(gap1952)

# show results
gap_mclust
```

```
## 'Mclust' model object: (VVE,3)
##
## Available components:
##  [1] "call"           "data"
##  [3] "modelName"      "n"
##  [5] "d"              "G"
##  [7] "BIC"            "loglik"
##  [9] "df"             "bic"
## [11] "icl"            "hypvol"
## [13] "parameters"     "z"
## [15] "classification" "uncertainty"
```

]

---

.pull-left4[

# Gaussian mixtures

<ul>
  <li class="m1"><span>A probabilistic <high>extension of the <i>k</i>-means algorithm</high> based on multivariate normal distributions.</span></li>
  <li class="m2"><span>Works well with clusters of different <high>orientation and density</high>.</span></li>
  <li class="m3"><span>Straightforward estimation of <i>k</i> via <high>model comparison</high>.</span></li>
  <li class="m4"><span>Difficulties with <high>complex topologies</high>.</span></li>
</ul>

]

.pull-right5[

<br><br><br><br>

```r
# load the Gaussian mixture package
library(mclust)

# fit Gaussian mixture models
gap_mclust <- Mclust(gap1952)

# show the cluster assignment of each case
gap_mclust$classification
```

```
##   [1] 1 2 2 2 3 3 3 2 1 3 1 2 2 1 2 2 1 1 1 1 3 1
##  [23] 1 2 1 2 1 1 2 2 1 2 2 3 3 2 1 2 1 2 1 1 1 3
##  [45] 3 2 1 3 1 3 2 1 1 1 2 2 3 3 1 1 2 2 3 3 3 2
##  [67] 3 1 1 2 1 2 1 1 2 1 1 2 1 1 2 2 1 2 1 1 1 2
##  [89] 1 3 3 2 1 1 3 1 1 2 3 2 1 2 2 3 2 2 1 1 2 1
## [111] 2 1 2 3 3 1 2 3 2 1 1 3 3 2 2 1 2 1 2 1 2 1
## [133] 3 3 3 2 1 1 1 1 1
```

]

---

class: middle, center

<h1><a href=>Practical</a></h1>
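
---

# Sketch: elbow of the within-cluster variance

The elbow approach listed under <i>k</i>-selection can be sketched in base R. This is a minimal illustration under stated assumptions, not the course's own code: `toy` is a made-up synthetic data set with three groups standing in for `gap1952`, and `nstart = 20` is an assumed setting that reduces the influence of the random start points.

```r
# Synthetic stand-in for the course data: three well-separated 2-D groups
# (hypothetical data; the gap1952 object from the slides is not reproduced here)
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 8
ks <- 1:8
within_ss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# Plot the curve; the "elbow" is where adding clusters stops paying off
plot(ks, within_ss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With data like these, the curve should drop steeply up to <i>k</i> = 3 and flatten afterwards. The elbow is read off by eye, so it remains a heuristic rather than a test.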