For today’s exercise we are going to use spatial data stored in a raster stack, that is a stack of gridded layers with matching spatial resolution and extent. Raster data can be of a wide variety of formats (e.g., geoTiff, ENVI, grid, binary, etc.) and with or without spatial reference. The
raster package provides functions to read, write, and manipulate raster data. For reading a single raster layer, you can use the
raster() function. Reading a raster stack with multiple layers use the
Bioclimatic variables are derived from monthly temperature and rainfall values to generate more biologically meaningful variables. These are often used in species distribution modeling and related ecological modeling techniques. The original bioclim dataset contains 19 layers derived from the global worldclim dataset. For the exercise, we use a spatial subset centered on the Carpathian mountains in Europe, and only 4 bioclim layers.
library(raster) clim <- stack("Data/bioclim_subset.tif")
Printing the raster object reveals useful information about the data set. Note, that the raster is not loaded into your computers memory at this point. R merely established a connection to the file.
## class : RasterStack ## dimensions : 150, 150, 22500, 4 (nrow, ncol, ncell, nlayers) ## resolution : 5000, 5000 (x, y) ## extent : 4896250, 5646250, 2405250, 3155250 (xmin, xmax, ymin, ymax) ## crs : +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs ## names : Annual.Mean.Temperature, Max.Temperature.of.Warmest.Month, Min.Temperature.of.Coldest.Month, Annual.Precipitation
We can rename the layers (like columns of a data.frame):
names(clim) <- c("Tmean", "Tmax", "Tmin", "Precip")
Visualizing the raster is easily achieved with the
plot() command. You can select a layer for plotting as below. Otherwise, all layers are plotted in a multi-plot window:
plot(clim, y = 1)
library(rasterVis) levelplot(clim, layers=1)
Raster can have millions of pixels, which can make computations very slow. To deal with this issue, we are going to sample 10,000 pixels from the raster:
set.seed(42) clim_samp <- sampleRandom(clim, size = 10000)
sampleRandom returns an object of type matrix. It is convenient to convert it into a data.frame.
clim_samp <- as.data.frame(clim_samp) head(clim_samp)
## Tmean Tmax Tmin Precip ## 1 11.311277 25.50667 -5.476000 686.1667 ## 2 11.494459 27.12400 -6.679000 606.0000 ## 3 8.451709 23.62800 -9.362000 583.7500 ## 4 7.576667 20.51600 -6.679000 787.0000 ## 5 10.612583 25.34467 -6.862667 521.1667 ## 6 6.324722 20.51933 -8.358666 567.3333
Let’s explore the correlation matrix with a graph:
library(GGally) clim_samp %>% sample_n(., size = 100) %>% ggpairs(.)
There are two functions in R for performing a PCA:
prcomp(). While the former function calculates the eigenvalue on the correlation/co-variance matrix using the function
eigen(), the latter function uses singular value decomposition, which numerically more stable. We here are going to use
pca <- prcomp(clim_samp, scale. = TRUE)
You should set the
scale. argument to
TRUE, to scale the data matrix to unit variances. The default is
scale. = FALSE. You may also use
scale() to standardize the original data matrix and then enter the standardized data matrix to
prcomp object stores the rotation coefficients (eigenvectors):
## PC1 PC2 PC3 PC4 ## Tmean -0.5684067 -0.13134083 0.2931387 -0.75745173 ## Tmax -0.5578910 0.08057592 0.5489958 0.61714561 ## Tmin -0.4580686 -0.60702873 -0.6140113 0.21137504 ## Precip 0.3947716 -0.77959812 0.4854517 0.02680957
Recall, the coefficients describe how each PC is a linear combination of the input variables:
\(PC1 = -0.568 * Tmean - 0.557 * Tmax -0.458 * Tmin + 0.394*Precip\)
Also, the coefficients reflect the contribution of each variable to each principal component. For example, the absolute magnitude of the \(PC1\) coefficients for \(Tmean\), \(Tmax\), \(Tmin\), and \(Precip\) is relatively similar (0.4 to 0.57), which indicates that all four variables contribute (roughly) similarly to PC1.
The sign of the coefficients tells us something about the direction, i.e., \(PC1\) increases with increasing \(Precip\), whereas \(PC1\) decreases with increasing \(Tmean\), \(Tmax\), and \(Tmin\). Note, however, the direction of the eigenvector is arbitrary. The equation below flips the signs of the coefficients (direction of the eigenvector). The resulting \(PC1_i\) or \(-PC1\) is equally valid, so do not get attached to the direction of the PC1. But what the signs tell us is: \(Tmean\), \(Tmax\), and \(Tmin\) have an opposite sign to \(Precip\). We can also say: \(Tmean\), \(Tmax\), and \(Tmin\) form a contrast to \(Precip\) (while one has a positive effect the other has a negative effect or vice versa). Now, to interpret what that means requires scientific knowledge of the underlying ecological, climatological, biological, or socio-ecological gradients you might be observing here, e.g. elevation gradients, continentality, etc.
\(PC1_i = 0.568 * Tmean + 0.557 * Tmax + 0.458 * Tmin - 0.394*Precip\)
\(PC1_i = -PC1\)
To explore how the total variance of our four climate variables is now re-partitioned in the derived principal components, we need to take a look a the eigenvalues. Recall, the eigenvalues are the variances explained by the principal components (one for each component). The square roots of the eigenvalues are stored in the
##  1.72763113 0.89364037 0.45821787 0.08206068
From this, we can easily calculate the variance explained by each component (the component variance divided by the total variance).
var_exp <- data.frame(pc = 1:4, var_exp = pca$sdev^2 / sum(pca$sdev^2)) # add the variances cumulatively var_exp$var_exp_cumsum <- cumsum(var_exp$var_exp) var_exp
## pc var_exp var_exp_cumsum ## 1 1 0.746177329 0.7461773 ## 2 2 0.199648278 0.9458256 ## 3 3 0.052490904 0.9983165 ## 4 4 0.001683489 1.0000000
The so called scree plot plots the component variances versus the component number. The idea is to look for an “elbow” which corresponds to the point after which the eigenvalues decrease more slowly.
ggplot(var_exp) + geom_bar(aes(x = pc, y = var_exp), stat = "identity") + geom_line(aes(x = pc, y = var_exp_cumsum))
You are probably not surprised there is an easier way:
## Importance of components: ## PC1 PC2 PC3 PC4 ## Standard deviation 1.7276 0.8936 0.45822 0.08206 ## Proportion of Variance 0.7462 0.1996 0.05249 0.00168 ## Cumulative Proportion 0.7462 0.9458 0.99832 1.00000
The eigenvalues tell us: 74.62% of the variance is explained by PC1 alone, 19.96% by PC2, 5.249% by PC3, and only 0.1% by PC4. The cumulative proportion makes clear that 99.832% of the variance is captured by the first three components. Hence, if we throw away PC4, we may not loose much information.
prcomp object also stores the principal component values (called scores) for our 10k samples:
pca_scores <- data.frame(pca$x) head(pca_scores)
## PC1 PC2 PC3 PC4 ## 1 -2.05143179 -1.2699438 -0.08480865 -0.02211953 ## 2 -2.28840811 -0.2840354 0.50894396 0.09552631 ## 3 0.08364308 1.0530308 0.40503017 0.02037252 ## 4 0.70719058 -1.2363492 -0.79940910 0.05183253 ## 5 -1.84760490 0.2666353 -0.20228144 -0.03302148 ## 6 0.96988434 0.7671065 -1.02732669 0.23251589
Interpretation of the principal components by exploring the correlation between the PC scores and the original variables:
df <- cbind(clim_samp, pca_scores) ggpairs(df[,1:6])
ggplot(pca_scores, aes(x = PC1, y = PC2)) + geom_point()
The biplot shows the bioclim variables (in form of the rotation coefficients / loadings) and the observations (the PCA scores of our 10k samples) in one graphic. The variables are visualized as arrows from the plot origin, reflecting their direction and magnitude. Note, there are different versions of biplots out there also in different packages (e.g. PCAtools, ggbiplot).
biplot(pca, choices=c(1,3), xlabs=rep('.', nrow(pca$x)))
We can now apply the PCA transformation to the whole raster stack:
clim_pca <- raster::predict(clim, pca, index = 1:4)
|Standard deviations of the principal components||sdev||sdev|
|Matrix of variable loadings (columns are eigenvectors)||rotation||loadings|
|Variable means (means that were substracted)||center||center|
|Variable standard deviations (the scalings applied to each variable )||scale||scale|
|Coordinates of observations on the principal components||x||scores|
Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.