Machine learning with R | The R Bootcamp
[Image from gapminder.org]
In this practical, you will learn how to apply cluster analysis to two data sets.
In the end, you will know how to identify clusters using k-means, DBSCAN, and Gaussian mixtures, and how to select the number of clusters k.
Open your TheRBootcamp R project.
Open a new R script. Write your name, the date and “Unsupervised learning Practical” as comments at the top.
## NAME
## DATE
## Unsupervised learning practical
Save the script as Unsupervised_practical.R in the 2_Code folder.
Load the packages tidyverse, cstab, dbscan, and mclust.
The gap data set

Using read_csv(), read in gap.csv and save it as gap.

# Read gap.csv
gap <- read_csv('1_Data/gap.csv')
Print the data set and inspect its contents. Use summary() to get additional insight into the data.
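One way to carry out this step (assuming gap has been read in as above):

```r
# print the tibble and get per-column summaries
gap
summary(gap)
```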
Use the code below to create a new data set containing only the data from year 2007 and the features Lebenserwartung and BIP pro Kopf.
# gap in 2007
gap2007 <- gap %>%
filter(Jahr == 2007) %>%
select(`BIP pro Kopf`, Lebenserwartung)
Using kmeans(), identify three clusters (centers) in gap2007.

# kmeans of gap in 2007
gap2007_km <- kmeans(x = XX, centers = XX)
# kmeans of gap in 2007
gap2007_km <- kmeans(x = gap2007, centers = 3)
Print gap2007_km and study the output.

# kmeans of gap in 2007
gap2007_km
K-means clustering with 3 clusters of sizes 27, 80, 34
Cluster means:
BIP pro Kopf Lebenserwartung
1 34099 79.9
2 2809 60.3
3 13701 72.1
Clustering vector:
[1] 2 2 2 2 3 1 1 1 2 1 2 2 2 3 3 3 2 2 2 2 1 2 2 3 2 2 2 2 2 3 2 3 3 3 1 2 2
[38] 2 2 2 3 2 2 1 1 3 2 1 2 1 2 2 2 2 2 1 3 1 2 2 3 2 1 1 1 2 1 2 2 2 3 3 2 2
[75] 3 2 2 3 2 2 3 3 2 3 2 2 2 2 2 1 1 2 2 2 1 3 2 3 2 2 2 3 3 3 2 3 2 2 3 2 3
[112] 2 1 3 1 2 3 1 2 2 2 1 1 2 1 2 2 2 3 2 3 2 1 1 3 3 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 9.86e+08 3.76e+08 6.82e+08
(between_SS / total_SS = 90.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
The first row of the output and the table Cluster means show how many objects were assigned to each of the three clusters and where the centroids (cluster means) are located.
At the bottom of the output is a list of names of elements contained in the clustering object. Use gap2007_km$XX to select the element cluster, as well as the element centers, and store them as clusters and centers, respectively.
# gap2007_km
clusters <- gap2007_km$XX
centers <- gap2007_km$XX
# gap2007_km
clusters <- gap2007_km$cluster
centers <- gap2007_km$centers
# kmeans of gap in 2007
plot(gap2007, col = clusters)
# kmeans of gap in 2007
plot(gap2007, col = clusters)
points(centers, pch = 16, col = 1:3, cex = 2)
Something’s off, right? Some points of the middle cluster actually seem to lie closer to the bottom-left cluster. This shouldn’t be the case. Any ideas how this has come about?
The problem is that the features have different scales. The values of BIP pro Kopf are far larger than those of Lebenserwartung and, thus, much further away from each other. For that reason, BIP pro Kopf plays a much larger role in cluster assignments than Lebenserwartung. To fix this problem, run the code below, which scales the features in gap2007.
# scale gap in 2007
gap2007_scaled <- gap2007 %>%
scale() %>%
as_tibble()
Rerun kmeans() for gap2007_scaled and plot the data and cluster assignments. All good now?

# kmeans plot for gap in 2007
gap2007_scaled_km <- kmeans(x = gap2007_scaled, centers = 3)
# extract elements
clusters <- gap2007_scaled_km$cluster
centers <- gap2007_scaled_km$centers
# plot
plot(gap2007_scaled, col = clusters)
points(centers, pch = 16, col = 1:3, cex = 2)
Use the code below to compute kmeans solutions associated with 2 to 20 clusters. The code uses the gap2007_scaled data.

# within-cluster variance development
km_development <- purrr::map(2:20, kmeans, x = gap2007_scaled)
withinvar <- purrr::map_dbl(km_development,
`[[`, i = 'tot.withinss')
Using plot(), create a plot of the development of withinvar.

# kmeans within-variance development
plot(withinvar)
What does the plot tell you? Is there an elbow that would suggest a particular value of k?
Several values of k seem plausible: 1, 3, or 7. Use cDistance() from the cstab package to estimate k with values from 2 to 20 (2:20) as candidates.
# estimate k with cstab
k_est <- cDistance(data = as.matrix(XX),
kseq = XX:XX)
# estimate k with cstab
k_est <- cDistance(data = as.matrix(gap2007_scaled),
kseq = 2:20)
Print k_est$k_Gap and k_est$k_Slope. Do the numbers seem reasonable?

# estimate k with cstab
k_est$k_Gap
[1] 14
k_est$k_Slope
[1] 3
Repeat the estimation using cStability() and extract k_instab. Reasonable?

# estimate k with cstab
k_est <- cStability(data = as.matrix(gap2007_scaled),
kseq = 2:20)
k_est$k_instab
[1] 3
Remember: There is no true k.
Use dbscan() from the dbscan package to cluster the data. Again, it is essential to work with gap2007_scaled, as otherwise eps would describe an ellipse rather than a circle. Set eps to .5.

# cluster using DBSCAN
gap2007_scaled_dbscan <- dbscan(x = XX,
eps = XX)
# cluster using DBSCAN
gap2007_scaled_dbscan <- dbscan(x = gap2007_scaled,
eps = .5)
Print gap2007_scaled_dbscan. What does the output tell you? Remember: 0 means outlier.
A single cluster and 5 outliers were identified. Visualize the solution using the same approach as above. The + 1 is necessary, as 0 implies no color.
# extract clusters
clusters <- gap2007_scaled_dbscan$XX
# plot
plot(XX, col = XX + 1)
# extract clusters
clusters <- gap2007_scaled_dbscan$cluster
# plot
plot(gap2007_scaled, col = clusters + 1)
Run dbscan() again using different values for eps. Try eps = .3 and eps = .1. Both times, plot the results the same way as before. Any changes? Are any of the solutions reasonable?

# cluster using DBSCAN
gap2007_scaled_dbscan.3 <- dbscan(x = gap2007_scaled, eps = .3)
gap2007_scaled_dbscan.1 <- dbscan(x = gap2007_scaled, eps = .1)
# plot
par(mfrow = c(1, 3))
plot(gap2007_scaled, col = gap2007_scaled_dbscan$cluster + 1)
plot(gap2007_scaled, col = gap2007_scaled_dbscan.3$cluster + 1)
plot(gap2007_scaled, col = gap2007_scaled_dbscan.1$cluster + 1)
dbscan() has additional parameters. minPts determines the number of points within eps necessary for a point to count as a core point. If you like, try different values to explore the effect of minPts.
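A quick way to explore minPts is to loop over a few values and compare the resulting cluster counts. The sketch below uses small synthetic data so it runs on its own; in the practical, you would pass gap2007_scaled instead of pts:

```r
library(dbscan)

# toy data: two well-separated 2-d blobs
set.seed(1)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = .2), ncol = 2),
             matrix(rnorm(100, mean = 2, sd = .2), ncol = 2))

# larger minPts -> stricter core-point criterion -> typically more outliers
for (mp in c(3, 5, 10)) {
  fit <- dbscan(x = pts, eps = .5, minPts = mp)
  cat('minPts =', mp,
      '| clusters:', max(fit$cluster),
      '| outliers:', sum(fit$cluster == 0), '\n')
}
```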
Use Mclust() from the mclust package to cluster the data using Gaussian mixtures. This time, you can work with the original data set gap2007, as Gaussian mixtures are able to account for the differences in scale.

# Gaussian mixtures
gap2007_gm <- Mclust(XX)
# Gaussian mixtures
gap2007_gm <- Mclust(gap2007)
Print the object gap2007_gm
. What do you see?
The output of gap2007_gm reveals very little. It only shows which elements the object contains. Use table(gap2007_gm$classification) to gain an overview of the cluster assignments. How many clusters were identified and how many objects does each contain?
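A minimal solution (assuming gap2007_gm was fit above):

```r
# count how many objects fall into each Gaussian-mixture cluster
table(gap2007_gm$classification)
```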
Use the classification element to create a visualization of the cluster assignments.
# plot
plot(gap2007_scaled, col = gap2007_gm$classification)
Use plot(gap2007_gm, what = 'classification') to create mclust's own visualization.

# plot using mclust
plot(gap2007_gm, what = 'classification')
Try to understand what the ellipses in the mclust plot reveal about the clusters. Remember, they represent normal distributions, with variances and covariances reflecting the correlations between the features.
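If you want the numbers behind the ellipses, the fitted means and covariance matrices are stored in the model object (a sketch, assuming gap2007_gm from above; sigma is a d x d x k array with one covariance matrix per cluster):

```r
# cluster means, one column per cluster
gap2007_gm$parameters$mean

# covariance matrices, one d x d slice per cluster
gap2007_gm$parameters$variance$sigma
```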
One desirable property of Gaussian mixtures is that they produce estimates of the uncertainty associated with assigning a data point to a particular cluster. Use plot(gap2007_gm, what = 'uncertainty')
to visualize this. The size of the points reflects the degree of uncertainty associated with a point being assigned to one particular cluster and not the others.
# plot
plot(gap2007_gm, what = 'uncertainty')
Mclust() already considers a variety of implementations in addition to different values of k. You can gain an overview of the performance of these using plot(gap2007_gm, what = 'BIC').

# plot
plot(gap2007_gm, what = 'BIC')
BIC is the Bayesian Information Criterion; it characterizes model fit in relation to model complexity. In this case, higher BIC values indicate better performance. In the visualization, you see the performance of 14 different implementations as a function of different values of k (x-axis).
Use plot(gap2007_gm, what = 'BIC', ylim = c(-4200, -3900)) to get a better view of the critical region of the plot. Now you should be able to see that EVV for k = 4 shows the best performance overall.

# plot
plot(gap2007_gm, what = 'BIC', ylim = c(-4200, -3900))
Run ?mclustModelNames to see descriptions of each model. This reveals that EVV only makes the simplifying assumption of equal volume across the different ellipsoids. This is also what you see in plot(gap2007_gm, what = 'classification').

plot(gap2007_gm, what = 'classification')
Re-run Mclust() using modelNames = 'XX', with one of the three-letter codes denoting the different models. Then visualize the result. Begin with 'EEI'.

# Try different implementation
gap2007_gm <- Mclust(gap2007, modelNames = 'XX')
plot(gap2007_gm, what = 'classification')
# Try different implementation
gap2007_gm <- Mclust(gap2007, modelNames = 'EEI')
plot(gap2007_gm, what = 'classification')
Using read_csv(), read in credit.csv as the object credit.

# Read credit.csv
credit <- read_csv('1_Data/credit.csv')
Print the data set and familiarize yourself with its contents.
Use one or more of the clustering methods used above to identify clusters among credit card users. Have fun!
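A possible starting point, not the only solution (it assumes credit was read in as above; as with gap, the features have very different scales, so scaling first is advisable):

```r
# scale the credit features and run k-means with, e.g., 3 clusters
credit_scaled <- credit %>%
  scale() %>%
  as_tibble()

credit_km <- kmeans(x = credit_scaled, centers = 3)

# inspect cluster sizes and centroids
credit_km$size
credit_km$centers
```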
library(tidyverse)
library(cstab)
library(dbscan)
library(mclust)
# Example data
data(mpg)
# Process data
mpg <- mpg %>% select_if(is.numeric)
mpg_stand <- mpg %>%
scale() %>% # standardize/scale data
as_tibble()
# k-means -----
# Identify clusters
mpg_km <- kmeans(mpg_stand,
centers = 3)
# Show centroids
mpg_km$centers
# k-selection -----
# Show within-variance
km_development <- purrr::map(2:20, kmeans, x = mpg_stand)
withinvar <- purrr::map_dbl(km_development,
`[[`, i = 'tot.withinss')
# Visualize withinvar
plot(withinvar)
# Gap & Slope statistics
k_est <- cDistance(as.matrix(mpg_stand),
kseq = 2:20)
k_est$k_Gap
k_est$k_Slope
# Cluster stability
k_est <- cStability(as.matrix(mpg_stand),
kseq = 2:20)
k_est$k_instab
# DBSCAN -----
# Identify clusters
mpg_dbscan <- dbscan(mpg_stand, eps = 1)
# Show centroids
mpg %>%
mutate(cl = mpg_dbscan$cluster) %>%
group_by(cl) %>%
summarize_all(mean)
# Gaussian Mixtures -----
# Identify clusters
mpg_gm <- Mclust(mpg)
# Show centroids
mpg %>%
mutate(cl = mpg_gm$classification) %>%
group_by(cl) %>%
summarize_all(mean)
# Visualize clusters
plot(mpg_gm, what = 'classification')
# Compare clusters -----
table(mpg_km$cluster, mpg_dbscan$cluster)
table(mpg_km$cluster, mpg_gm$classification)
table(mpg_dbscan$cluster, mpg_gm$classification)
File | Rows | Columns |
---|---|---|
gap.csv | 1692 | 6 |
credit.csv | 8636 | 8 |
The gap
data set is based on the Gapminder project and has been extracted from the R package gapminder.
Feature | Description |
---|---|
Land | Name of country |
Kontinent | Name of continent |
Jahr | Year |
Lebenserwartung | Life expectancy in years |
Population | Population of country |
BIP pro Kopf | GDP per capita |
The credit data set is an excerpt of the publicly available Credit Card Dataset. The data set contains 8 features that describe the behavior of 8636 credit card customers.
Variable | Description |
---|---|
BALANCE | Available credit |
BALANCE_FREQUENCY | Frequency of balance changes (1 = frequent, 0 = infrequent) |
PURCHASES | Sum of purchases |
CREDITLIMIT | Limit of credit card |
ONEOFFPURCHASES | Value of largest one-off purchase |
MINIMUM_PAYMENTS | Minimal credit card payment |
PRCFULLPAYMENT | Percent in-full credit card payments |
TENURE | Duration of credit card account |
Package | Installation |
---|---|
tidyverse |
install.packages("tidyverse") |
cstab |
install.packages("cstab") |
dbscan |
install.packages("dbscan") |
mclust |
install.packages("mclust") |
Clustering
Function | Package | Description |
---|---|---|
kmeans() |
stats |
Cluster data with k-means |
dbscan() |
dbscan |
Cluster data with DBSCAN |
Mclust() |
mclust |
Cluster data with Gaussian mixtures |
k-selection
Function | Package | Description |
---|---|---|
cDistance() |
cstab |
Identify k with distance-based methods, e.g., gap statistic. |
cStability() |
cstab |
Identify k with instability-based methods. |