Recap last session

  • Assessing model predictions
  • Cross-validation (library(caret); confusionMatrix(); createFolds())
  • Linear discriminant analysis (library(MASS); lda())

Today’s session

Cluster analysis identifies subgroups (clusters) such that data points in the same cluster are very similar, while data points in different clusters are very different.

In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure.

There are two principal types of cluster algorithms:

  • Hierarchical algorithms and
  • Partitional algorithms (e.g. k-means).

Hierarchical clustering

Hierarchical classifications may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions made at each stage of the analysis. The figure shows an example of a dendrogram from Finding Groups in Data: An Introduction to Cluster Analysis by Kaufman and Rousseeuw.

data(iris)
iris_dat <- iris[,-5] # remove species column
head(iris_dat)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

Dissimilarity measures

The first step is to compute the distance (dissimilarity) matrix. With dist() you can choose between euclidean, maximum, manhattan, canberra, binary or minkowski distance.

iris_dist <- dist(iris_dat, method = "euclidean")
class(iris_dist)
## [1] "dist"
dim(as.matrix(iris_dist))
## [1] 150 150
as.matrix(iris_dist)[1:5,1:8]
##           1         2        3         4         5         6         7
## 1 0.0000000 0.5385165 0.509902 0.6480741 0.1414214 0.6164414 0.5196152
## 2 0.5385165 0.0000000 0.300000 0.3316625 0.6082763 1.0908712 0.5099020
## 3 0.5099020 0.3000000 0.000000 0.2449490 0.5099020 1.0862780 0.2645751
## 4 0.6480741 0.3316625 0.244949 0.0000000 0.6480741 1.1661904 0.3316625
## 5 0.1414214 0.6082763 0.509902 0.6480741 0.0000000 0.6164414 0.4582576
##           8
## 1 0.1732051
## 2 0.4242641
## 3 0.4123106
## 4 0.5000000
## 5 0.2236068
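
As a quick check, we can reproduce a single entry of the distance matrix by hand. The Euclidean distance between observations 1 and 2 matches the entry above:

sqrt(sum((iris_dat[1, ] - iris_dat[2, ])^2))
## [1] 0.5385165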

Agglomeration methods

In the second step, we run an agglomeration cluster algorithm on the dissimilarity matrix. hclust() allows you to choose between several agglomeration methods: “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”, or “centroid”.

iris_ward <- hclust(iris_dist, method="ward.D")
iris_ward
## 
## Call:
## hclust(d = iris_dist, method = "ward.D")
## 
## Cluster method   : ward.D 
## Distance         : euclidean 
## Number of objects: 150

From R’s help: A number of different clustering methods are provided. Ward’s minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods.
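
For comparison, fitting an alternative linkage only requires changing the method argument; for example (object names here are chosen for illustration):

iris_single   <- hclust(iris_dist, method = "single")
iris_complete <- hclust(iris_dist, method = "complete")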

Plot dendrogram

iris_dendr <- as.dendrogram(iris_ward)
plot(iris_dendr, horiz = FALSE)

Visualize dendrogram with ggplot

library(ggdendro)
library(ggplot2)

dendro <- as.dendrogram(iris_ward)

dendro <- dendro_data(dendro, type = "rectangle")

ggplot(segment(dendro)) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) + 
  geom_text(data = label(dendro), 
            aes(x = x, y = y - 2, label = label), size = 3)

Selection of clusters

cutree() cuts a tree, e.g., as resulting from hclust(), into several groups either by specifying the desired number(s) of groups (k) or the cut height(s) (h). Let’s split the tree into three groups.

iris_dat$cluster <- as.factor(cutree(iris_ward, k = 3))
head(iris_dat$cluster)
## [1] 1 1 1 1 1 1
## Levels: 1 2 3
table(iris_dat$cluster)
## 
##  1  2  3 
## 50 64 36
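
Instead of specifying k, we could also cut the dendrogram at a given height; the value h = 10 below is arbitrary and only illustrates the second interface of cutree():

table(cutree(iris_ward, h = 10))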

Visualize clusters

To visualize the three clusters, we create a scatterplot matrix with pairs(), using the vector of cluster IDs to specify the symbol color.

pairs(iris_dat[,1:4], col=iris_dat$cluster, pch=19, cex=0.4)

Visualize clusters with ggpairs()

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
iris_dat$cluster <- as.factor(cutree(iris_ward, k = 3))
ggpairs(iris_dat, columns = 1:4, mapping = ggplot2::aes(color = cluster))

k-means cluster analysis

A k-means cluster analysis can be performed using the kmeans() function. Notice that we also scale the data to zero mean and unit variance.
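
As a quick sanity check on the scaling step, each scaled column should end up with a mean of (approximately) zero and a standard deviation of one:

colMeans(scale(iris[, 1:4]))     # approximately 0 for every column
apply(scale(iris[, 1:4]), 2, sd) # exactly 1 for every column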

data(iris)

iris_kmeans <- kmeans(x = scale(iris[, c(1:4)]), centers = 3, algorithm="Lloyd", iter.max=20)
iris_kmeans
## K-means clustering with 3 clusters of sizes 34, 16, 100
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -1.1924784   0.4015443    -1.309094  -1.2608902
## 2   -0.6259564   1.8042613    -1.282644  -1.2290567
## 3    0.5055957  -0.4252069     0.650315   0.6253518
## 
## Clustering vector:
##   [1] 1 1 1 1 2 2 1 1 1 1 2 1 1 1 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2
##  [38] 2 1 1 1 1 1 1 2 1 2 1 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  15.97485   6.45758 173.52867
##  (between_SS / total_SS =  67.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We can plot the cluster solution using ggplot:

iris$cluster3 <- iris_kmeans$cluster

ggplot(iris, aes(Sepal.Length, y = Sepal.Width, col = factor(cluster3))) +
  geom_point()

And compare it to the actual species classification:

table(iris$cluster3, iris$Species)
##    
##     setosa versicolor virginica
##   1     34          0         0
##   2     16          0         0
##   3      0         50        50

Previously, we selected \(k=3\) based on a priori knowledge of the number of species. We found that the species setosa was well separated from the other species, but the species versicolor and virginica showed strongly overlapping characteristics. The question is whether there are subgroups that might be better represented by additional clusters.

But how many clusters do we need?

iris$cluster2 <- kmeans(x = iris[, c(1:4)], centers = 2)$cluster
iris$cluster4 <- kmeans(x = iris[, c(1:4)], centers = 4)$cluster
iris$cluster5 <- kmeans(x = iris[, c(1:4)], centers = 5)$cluster
iris$cluster6 <- kmeans(x = iris[, c(1:4)], centers = 6)$cluster

library(dplyr) # provides the pipe operator %>%
library(tidyr) # provides gather()

iris %>%
  gather(key = k, value = cluster, -(Sepal.Length:Species)) %>%
  ggplot(., aes(Sepal.Length, y = Sepal.Width, col = factor(cluster))) +
  geom_point() +
  facet_wrap(~k, ncol = 6)

To answer the last question we run kmeans() for a range of \(k\) and calculate the within-group sum of squares for each \(k\) (here ranging from 1 to 10 clusters).

wss <- c()           # total within-cluster sum of squares for each k
var_explained <- c() # share of variance explained (between_SS / total_SS) for each k

for(k in 1:10){
  set.seed(42)
  fit <- kmeans(scale(iris[, 1:4]), centers = k)
  wss <- c(wss, fit$tot.withinss)
  var_explained <- c(var_explained, fit$betweenss/fit$totss)
}

head(wss)
## [1] 596.00000 220.87929 138.88836 113.64981 104.92773  95.48123
head(var_explained)
## [1] 1.907497e-16 6.293972e-01 7.669658e-01 8.093124e-01 8.239468e-01
## [6] 8.397966e-01

Now, when we plot \(k\) against the within-group sum of squares (\(SSE\)), you will see that the error decreases as \(k\) gets larger; this is because as the number of clusters increases the clusters become smaller, so the distortion within them is also smaller. The idea of the elbow method is to choose the \(k\) beyond which the \(SSE\) no longer decreases substantially, which produces an “elbow” in the graph.

ggplot(data.frame(k = 1:10, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
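Since the loop also recorded the share of variance explained (between_SS / total_SS, i.e. 1 - SSE / total_SS), we can plot it the same way; the curve is the complement of the SSE curve and flattens out after the elbow:

ggplot(data.frame(k = 1:10, var_explained = var_explained), aes(x = k, y = var_explained)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)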

set.seed(42)
fit3 <- kmeans(scale(iris[, 1:4]), centers = 3)
table(iris$Species, factor(fit3$cluster))
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 39 11
##   virginica   0 14 36
set.seed(42)
fit4 <- kmeans(scale(iris[, 1:4]), centers = 4)
table(iris$Species, factor(fit4$cluster))
##             
##               1  2  3  4
##   setosa     49  1  0  0
##   versicolor  0 19 29  2
##   virginica   0  2 21 27

k-means on raster data

Let’s load the bioclim data previously used for PCA …

library(raster)
clim <- stack("data/bioclim_subset.tif")
names(clim) <- c("Tmean", "Tmax", "Tmin", "Precip")

kmeans() works on matrices and data frames, not on rasters, so we first need to convert the raster into a data frame.

bioclim_values <- as.data.frame(clim)
head(bioclim_values)
##      Tmean     Tmax      Tmin   Precip
## 1 8.256389 21.67000 -7.142000 585.8333
## 2 8.249315 21.82311 -7.425778 587.8889
## 3 8.227834 21.74600 -7.437000 596.5000
## 4 8.143750 21.43000 -7.293000 606.2500
## 5 8.134527 21.51867 -7.390000 605.1667
## 6 8.163889 21.63067 -7.493333 595.8333

Create a logical vector indicating if a row contains valid observations (not NA) in all bioclim variables (TRUE) or if a row contains any missing values (FALSE).

valid <- apply(bioclim_values, MARGIN = 1, FUN = function(x) !any(is.na(x)))
head(valid)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
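
Base R offers complete.cases(), which returns the same logical vector and is the more common idiom:

valid <- complete.cases(bioclim_values)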

To run k-means only on valid observations (i.e. no missing observations), we use our logical vector valid to index the rows of the data.frame bioclim_values:

clim_kmeans <- kmeans(scale(bioclim_values[valid,]), centers = 5)
str(clim_kmeans)
## List of 9
##  $ cluster     : Named int [1:22500] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:22500] "1" "2" "3" "4" ...
##  $ centers     : num [1:5, 1:4] 0.00162 1.05575 -0.81785 -0.79067 -2.07217 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:4] "Tmean" "Tmax" "Tmin" "Precip"
##  $ totss       : num 89996
##  $ withinss    : num [1:5] 4138 7358 2290 3321 3161
##  $ tot.withinss: num 20268
##  $ betweenss   : num 69728
##  $ size        : int [1:5] 5925 8096 2257 4824 1398
##  $ iter        : int 5
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

Create a new column cluster in our bioclim data.frame and fill it with the k-means cluster IDs. However, we only fill the rows for which we have valid cluster results, i.e. rows with only valid bioclim observations (no missing data).

bioclim_values[valid, "cluster"] <- clim_kmeans$cluster
head(bioclim_values)
##      Tmean     Tmax      Tmin   Precip cluster
## 1 8.256389 21.67000 -7.142000 585.8333       1
## 2 8.249315 21.82311 -7.425778 587.8889       1
## 3 8.227834 21.74600 -7.437000 596.5000       1
## 4 8.143750 21.43000 -7.293000 606.2500       1
## 5 8.134527 21.51867 -7.390000 605.1667       1
## 6 8.163889 21.63067 -7.493333 595.8333       1

To convert our data.frame back to a raster, we first create an empty raster template from the bioclim raster clim. Although clim is a raster stack with multiple bioclim layers, the resulting kmean_raster has only a single layer (recall that raster() returns a single RasterLayer). Then we fill the raster values with the cluster IDs from the data.frame using the values() function.

kmean_raster <- raster(clim)
kmean_raster
## class      : RasterLayer 
## dimensions : 150, 150, 22500  (nrow, ncol, ncell)
## resolution : 5000, 5000  (x, y)
## extent     : 4896250, 5646250, 2405250, 3155250  (xmin, xmax, ymin, ymax)
## crs        : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs
values(kmean_raster) <- bioclim_values$cluster
plot(kmean_raster)
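
As a final check, freq() tabulates the raster’s cell values, so the counts should match clim_kmeans$size (here every cell holds a valid cluster id):

freq(kmean_raster)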


Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.