Today’s session

• Principal component analysis

Raster data

For today’s exercise we are going to use spatial data stored in a raster stack, that is a stack of gridded layers with matching spatial resolution and extent. Raster data can be of a wide variety of formats (e.g., geoTiff, ENVI, grid, binary, etc.) and with or without spatial reference. The raster package provides functions to read, write, and manipulate raster data. For reading a single raster layer, you can use the raster() function. Reading a raster stack with multiple layers use the stack() function,

Bioclimatic variables are derived from monthly temperature and rainfall values to generate more biologically meaningful variables. These are often used in species distribution modeling and related ecological modeling techniques. The original bioclim dataset contains 19 layers derived from the global worldclim dataset. For the exercise, we use a spatial subset centered on the Carpathian mountains in Europe, and only 4 bioclim layers.

library(raster)

clim <- stack("Data/bioclim_subset.tif")

Printing the raster object reveals useful information about the data set. Note, that the raster is not loaded into your computers memory at this point. R merely established a connection to the file.

clim
## class      : RasterStack
## dimensions : 150, 150, 22500, 4  (nrow, ncol, ncell, nlayers)
## resolution : 5000, 5000  (x, y)
## extent     : 4896250, 5646250, 2405250, 3155250  (xmin, xmax, ymin, ymax)
## crs        : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs
## names      : Annual.Mean.Temperature, Max.Temperature.of.Warmest.Month, Min.Temperature.of.Coldest.Month, Annual.Precipitation

We can rename the layers (like columns of a data.frame):

names(clim) <- c("Tmean", "Tmax", "Tmin", "Precip")

Visualizing the raster is easily achieved with the plot() command. You can select a layer for plotting as below. Otherwise, all layers are plotted in a multi-plot window:

plot(clim, y = 1)

The rasterVis package provide more flexible visualization routines. You may try them out as well.

library(rasterVis)
levelplot(clim, layers=1)

Raster can have millions of pixels, which can make computations very slow. To deal with this issue, we are going to sample 10,000 pixels from the raster:

set.seed(42)
clim_samp <- sampleRandom(clim, size = 10000)

The function sampleRandom returns an object of type matrix. It is convenient to convert it into a data.frame.

clim_samp <- as.data.frame(clim_samp)

head(clim_samp)
##       Tmean     Tmax      Tmin   Precip
## 1 11.311277 25.50667 -5.476000 686.1667
## 2 11.494459 27.12400 -6.679000 606.0000
## 3  8.451709 23.62800 -9.362000 583.7500
## 4  7.576667 20.51600 -6.679000 787.0000
## 5 10.612583 25.34467 -6.862667 521.1667
## 6  6.324722 20.51933 -8.358666 567.3333

Let’s explore the correlation matrix with a graph:

library(GGally)

clim_samp %>%
sample_n(., size = 100) %>%
ggpairs(.)

Principal Component Analysis

Fit principal components

There are two functions in R for performing a PCA: princomp() and prcomp(). While the former function calculates the eigenvalue on the correlation/co-variance matrix using the function eigen(), the latter function uses singular value decomposition, which numerically more stable. We here are going to use prcomp().

pca <- prcomp(clim_samp, scale. = TRUE)

You should set the scale. argument to TRUE, to scale the data matrix to unit variances. The default is scale. = FALSE. You may also use scale() to standardize the original data matrix and then enter the standardized data matrix to prcomp.

Rotation coefficients

The prcomp object stores the rotation coefficients (eigenvectors):

pca$rotation ## PC1 PC2 PC3 PC4 ## Tmean -0.5684067 -0.13134083 0.2931387 -0.75745173 ## Tmax -0.5578910 0.08057592 0.5489958 0.61714561 ## Tmin -0.4580686 -0.60702873 -0.6140113 0.21137504 ## Precip 0.3947716 -0.77959812 0.4854517 0.02680957 Recall, the coefficients describe how each PC is a linear combination of the input variables: $$PC1 = -0.568 * Tmean - 0.557 * Tmax -0.458 * Tmin + 0.394*Precip$$ Also, the coefficients reflect the contribution of each variable to each principal component. For example, the absolute magnitude of the $$PC1$$ coefficients for $$Tmean$$, $$Tmax$$, $$Tmin$$, and $$Precip$$ is relatively similar (0.4 to 0.57), which indicates that all four variables contribute (roughly) similarly to PC1. The sign of the coefficients tells us something about the direction, i.e., $$PC1$$ increases with increasing $$Precip$$, whereas $$PC1$$ decreases with increasing $$Tmean$$, $$Tmax$$, and $$Tmin$$. Note, however, the direction of the eigenvector is arbitrary. The equation below flips the signs of the coefficients (direction of the eigenvector). The resulting $$PC1_i$$ or $$-PC1$$ is equally valid, so do not get attached to the direction of the PC1. But what the signs tell us is: $$Tmean$$, $$Tmax$$, and $$Tmin$$ have an opposite sign to $$Precip$$. We can also say: $$Tmean$$, $$Tmax$$, and $$Tmin$$ form a contrast to $$Precip$$ (while one has a positive effect the other has a negative effect or vice versa). Now, to interpret what that means requires scientific knowledge of the underlying ecological, climatological, biological, or socio-ecological gradients you might be observing here, e.g. elevation gradients, continentality, etc. $$PC1_i = 0.568 * Tmean + 0.557 * Tmax + 0.458 * Tmin - 0.394*Precip$$ $$PC1_i = -PC1$$ Eigenvalues To explore how the total variance of our four climate variables is now re-partitioned in the derived principal components, we need to take a look a the eigenvalues. Recall, the eigenvalues are the variances explained by the principal components (one for each component). The square roots of the eigenvalues are stored in the prcomp object: pca$sdev
## [1] 1.72763113 0.89364037 0.45821787 0.08206068

From this, we can easily calculate the variance explained by each component (the component variance divided by the total variance).

var_exp <- data.frame(pc = 1:4,
var_exp = pca$sdev^2 / sum(pca$sdev^2))

var_exp$var_exp_cumsum <- cumsum(var_exp$var_exp)
var_exp
##   pc     var_exp var_exp_cumsum
## 1  1 0.746177329      0.7461773
## 2  2 0.199648278      0.9458256
## 3  3 0.052490904      0.9983165
## 4  4 0.001683489      1.0000000

The so called scree plot plots the component variances versus the component number. The idea is to look for an “elbow” which corresponds to the point after which the eigenvalues decrease more slowly.

ggplot(var_exp) +
geom_bar(aes(x = pc, y = var_exp), stat = "identity") +
geom_line(aes(x = pc, y = var_exp_cumsum))

You are probably not surprised there is an easier way:

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7276 0.8936 0.45822 0.08206
## Proportion of Variance 0.7462 0.1996 0.05249 0.00168
## Cumulative Proportion  0.7462 0.9458 0.99832 1.00000

The eigenvalues tell us: 74.62% of the variance is explained by PC1 alone, 19.96% by PC2, 5.249% by PC3, and only 0.1% by PC4. The cumulative proportion makes clear that 99.832% of the variance is captured by the first three components. Hence, if we throw away PC4, we may not loose much information.

Scores

The prcomp object also stores the principal component values (called scores) for our 10k samples:

Apply PCA

We can now apply the PCA transformation to the whole raster stack:

clim_pca <- raster::predict(clim, pca, index = 1:4)
plot(subset(clim_pca, 1))

plot(subset(clim_pca, 2))

plot(subset(clim_pca, 3))

plot(subset(clim_pca, 4))

prcomp vs princomp

Description prcomp() princomp()
Decomposition method svd() eigen()
Standard deviations of the principal components sdev sdev