# Recap last session

• GIS vector data in R
• Creating spatial points (SpatialPoints())
• Interactive map visualizations (mapview())
• Overlaying vector and raster data
• Handling projections (projection(); spTransform())
• Extracting raster values at point locations (raster::extract())
• Descriptive point pattern analysis
• Creating a density raster (rasterize(point.data, raster, fun = "count")
• Kernel density estimates (library(spatialEco); sp.kde())
• Distance (distance())

# Overview

• Inverse Distance Weighted interpolation
• Semivariogram
• Kriging

## R Packages

We will mostly deal with package gstat, because it offers the widest functionality in the geostatistics curriculum for R: it covers variogram cloud diagnostics, variogram modeling, everything from global simple kriging to local universal cokriging, multivariate geostatistics, block kriging, indicator and Gaussian conditional simulation, and many combinations.

library("gstat")   # geostatistics
library("mapview") # map plot
library("sp")      # spatial data handling
library("ggplot2") # plotting
library("rgdal")   # reading and writing spatial data

# Meuse data

For this tutorial, we use a dataset provided by the sp package. The meuse data set includes four heavy metals measured in the top soil in a flood plain along the river Meuse, along with a handful of covariates. To see a detailed description of those covariates type in ?meuse at the command line. It seems, that polluted sediments carried by the river are mostly deposited close to the river bank and in areas with low elevation.

data(meuse) # load built-in dataset
head(meuse)
##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil lime landuse dist.m
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1    1      Ah     50
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1    1      Ah     30
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1    1      Ah    150
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2    0      Ga    270
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2    0      Ah    380
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2    0      Ga    470

The built-in dataset is a table (data.frame). To conduct geostatistical analysis, we need to convert this into a spatial data format, i.e. a SpatialPointDataFrame. You have seen how to convert a table with x-y-coordinates in the previous session. To save some time, we prepared a shapefile meuse.shp that you can load with rgdal::readOGR().

meuse <- readOGR('data', 'meuse')
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/dirk/Repository/teaching/gcg_quantitative-methods/data", layer: "meuse"
## with 155 features
## It has 12 fields
meuse
## class       : SpatialPointsDataFrame
## features    : 155
## extent      : 178605, 181390, 329714, 333611  (xmin, xmax, ymin, ymax)
## crs         : +proj=sterea +lat_0=52.1561605555556 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## variables   : 12
## names       : cadmium, copper, lead, zinc,  elev,     dist, om, ffreq, soil, lime, landuse, dist_m
## min values  :     0.2,     14,   37,  113,  5.18,        0,  1,     1,    1,    0,      Aa,     10
## max values  :    18.1,    128,  654, 1839, 10.52, 0.880389, 17,     3,    3,    1,       W,   1000

Spatial exploratory data analysis starts with the plotting of maps with a measured variable. Zinc concentration seems to be larger close to the river Meuse banks.

library(mapview)
mapview(meuse, zcol='zinc')

# IDW interpolation

Usually, interpolation is done on a regular grid. For the meuse data, covariates are provided for x-y locations forming a regular grid in the built-in meuse.grid dataset (see ?meuse.grid).

data(meuse.grid)
head(meuse.grid)
##        x      y part.a part.b      dist soil ffreq
## 1 181180 333740      1      0 0.0000000    1     1
## 2 181140 333700      1      0 0.0000000    1     1
## 3 181180 333700      1      0 0.0122243    1     1
## 4 181220 333700      1      0 0.0434678    1     1
## 5 181100 333660      1      0 0.0000000    1     1
## 6 181140 333660      1      0 0.0122243    1     1

Notice, meuse.grid is a data.frame containing x-y coordinates. If you plot the coordinates of the data.frame, you see that they form a regular grid:

plot(meuse.grid$x, meuse.grid$y, cex=0.1, pch=19) However, a data.frame is not spatial, so we need to convert it into a spatial grid, i.e., an object of type SpatialPixel or SpatialPixelsDataFrame. The SpatialPixelsDataFrame from the sp package provides R with a raster type analogous to the raster class of the raster package.

meuse.pixels <- SpatialPixelsDataFrame(points=meuse.grid[,c("x", "y")],
data=meuse.grid[,-(1:2)],
proj4string = CRS(proj4string(meuse)))
mapview(meuse.pixels, zcol="dist")

Inverse distance-based weighted interpolation (IDW) computes a weighted average,

$\begin{equation} \hat{Z}(s_0) = \frac{\sum_{i=1}^{n}w(s_i)Z(s_i)}{\sum_{i=1}^{n}w(s_i)} \end{equation}$

where weights for observations are computed according to their distance to the interpolation location,

$w(s_i) = ||s_i-s_0||^{-p}$

with $$|| ? ||$$ indicating Euclidean distance and $$p$$ an inverse distance weighting power, defaulting to 2. If $$s_0$$ coincides with an observation location, the observed value is returned to avoid infinite weights.

The inverse distance power determines the degree to which the nearer point(s) are preferred over more distant points; for large values IDW converges to the one-nearest-neighbor interpolation.

library(gstat)
zn.idw <- idw(zinc ~ 1, meuse, meuse.pixels, idp = 2)
## [inverse distance weighted interpolation]
as.data.frame(zn.idw)[1:5, ]
##        x      y var1.pred var1.var
## 1 181180 333740  633.6864       NA
## 2 181140 333700  712.5450       NA
## 3 181180 333700  654.1617       NA
## 4 181220 333700  604.4422       NA
## 5 181100 333660  857.2558       NA

The output variable is called var1.pred, and the var1.var values are NA because inverse distance does not provide prediction error variances.

mapview(zn.idw, zcol='var1.pred', layer.name = "Zinc ppm")

IDW considers only distances to the prediction location and thus ignores the spatial configuration of observations. This may lead to undesired effects if the observation locations are strongly clustered.

Notably, IDW weights are guaranteed to be between 0 and 1, resulting in interpolated values never outside the range of observed values.

# Stationarity

The spatial correlation between two observations of a variable $$z(s)$$ at locations $$s_1$$ and $$s_2$$ cannot be estimated, as only a single pair is available. To obtain the necessary replication, we need to replicate the experiment with multiple samples from multiple locations. But to do that, we need to assume that the spatial process is the same at all those locations. In other words, we need to make stationarity assumptions about our spatial process.

Stationarity means that characteristics of a random function (our spatial process that generates a variable) stay the same when shifting a given set of $$n$$ points from one part of a region to another. The spatial process is called strictly stationary if for any set of $$n$$ points $$s_1,..., s_n$$ the multivariate distribution does not change. In other words, everything looks the same everywhere and anytime. Well, this may not seem like a reasonable assumption for geographic processes.

A less restrictive condition is given by weak stationarity, which is also called second-order stationarity because it only assumes stationarity of the first two moments of the distribution. That is, the spatial process has a constant mean $$E[Z(x)]$$ and variance $$Var[Z(x)]$$. Also, the covariance between two observations separated by a distance $$h$$: $$cov(Z(x+h), Z(x))$$ only relies on the distance $$h$$ between the observations and not on the spatial location $$x$$ inside the region.

A specific type of second-order stationarity, and most important for the variogram analysis, is called intrinsic stationarity. Here, we assume second-order stationarity of the differences between pairs of values at two locations: $$Z(x+h)−Z(x)$$. We are not interested in $$Z(x)$$ but in the differences. Again, this implies the assumption that the variance of these differences does not depend on location but only depends on separation distance $$h$$:

$Var(Z(x+h)−Z(x))=2\gamma(h)$

# Lagged scatter plots

A simple way to acknowledge that spatial correlation is present or not is to make scatter plots of pairs $$Z(s_i)$$ and $$Z(s_j)$$, grouped according to their separation distance $$h_{ij} = || {s_{i} - s_{j}} ||$$.

library(gstat)
hscat(log(zinc) ~ 1, meuse, (0:9) * 100) # Variogram cloud

We can plot the difference between pairs of sample values as a function of separation distance $$h$$. The so-called variogram cloud plots all possible squared differences of observation pairs $$(Z(x) - Z(x+h))^2$$ against their separation distance $$h$$. The difference is divided by 2 to account for the fact that two points share this value. Hence, $$\gamma$$ is often called semi-variance (half the variance).

$\gamma_h=\frac{1}{2}(Z(x) - Z(x+h))^2$

meuse.vc <- variogram(log(zinc) ~ 1, meuse, cloud = TRUE)
plot(meuse.vc, ylab=bquote(gamma), xlab=c("h (separation distance in m)")) Since $$\gamma$$ is a difference it is a measure of dissimilarity, i.e., the smaller the dissimilarity the more similar are observations. Thus, the variogram cloud allows you to observe if sample pairs closer to each other are more similar than pairs further apart, which is usually the case. The variogram cloud also allows you to observe the distribution of dissimilarity at particular distances. For example, the variogram cloud above shows that $$\gamma$$ has a skewed distribution at any distance. That is, the majority of plots are similar even at higher distances. However, dissimilarity increases with distances up to 500 m or so.

# Sample variogram

In geostatistics the spatial correlation is often modeled from the sample variogram. The sample variogram plots averages of the variogram cloud values over distance intervals $$h$$. If we further assume isotropy, which is direction independence of semivariance, we can replace the vector $$h$$ with its length, $$||h||$$.

Under this assumption, the variogram can be estimated from $$N_h$$ sample data pairs $$z(x)$$, $$z(x + h)$$ for a number of distances (or distance intervals) $$\tilde{h}_j$$ by:

$\hat{\gamma}(\tilde{h}_j) = \frac{1}{2N_h}\sum_{i=1}^{N_h}(Z(x)-Z(x+h))^2$

You can construct a sample variogram() in R using the variogram function of the gstat package. The ~ 1 defines a single constant predictor leading to a spatially constant mean coefficient.

meuse.v <- gstat::variogram(log(zinc) ~ 1, meuse)
plot(meuse.v, ylab=bquote(gamma), xlab=c("h (separation distance in m)")) ## Direction dependence

The previously used variogram function makes a number of decisions by default. It decides that direction is ignored: point pairs are merged on the basis of distance, not direction. An alternative is, for example to look in four different angles by specifying the alpha keyword.

plot(variogram(log(zinc) ~ 1, meuse, alpha = c(0, 45, + 90, 135))) ## Cutoff and lag width

A similar issue is the cutoff distance, which is the maximum distance up to which point pairs are considered and the width of distance interval over which point pairs are averaged in bins. The default value gstat uses for the cutoff value is one third of the largest diagonal of the bounding box (or cube) of the data. Good reasons to decrease the cutoff may be when a local prediction method is foreseen, and only semivariances up to a rather small distance are required. For the interval width, gstat uses a default of the cutoff value divided by 15.

plot(variogram(log(zinc) ~ 1, meuse, cutoff = 1000, width = 50)) # Theoretical variogram

Using the sample variogram (also called experimental variogram), we can fit a theoretical variogram model to estimate a number of parameters that describe the range of autocorrelation and variance: • $$R$$ is the range of spatial auto-correlation, i.e., sample locations separated by distances closer than R are spatially auto-correlated, whereas locations farther apart than R are not.
• $$C_0$$ is the nugget effect, which can be attributed to measurement errors or spatial sources of variation at distances smaller than the sampling interval.
• $$C_1$$ is the partial sill (sill minus nugget).
• $$C_0+C_1$$ is the sill, which is the modeled semi-variance at range $$R$$.

A wide range of variogram models are available for estimating these parameters, e.g., linear, spherical, exponential, Gaussian. Note, the exponential, Gaussian, and Bessel model reach their sill asymptotically (as $$h \to \infty$$). For an overview of all available variogram models in R, type in the command line vgm() without any model arguments. Pebesma, E. (2014): gstat user’s manual

Use the vgm() function to construct a variogram model by specifying the nugget $$C_0$$, partial sill $$C_1$$ (psill), range $$R$$, and the model form.

myVariogramModel <- vgm(psill=1, "Sph", range=100, nugget=0.5)

plot(myVariogramModel, cutoff=150) A visual overview of the basic variogram models available in gstat can be obtained as follows:

show.vgms() # Fit a variogram

For weighted least squares fitting of a variogram model to the sample variogram, we need to take several steps: 1) calculate the sample variogram, 2) choose a suitable model, 3) choose initial parameters for our model, 4) and fit the model.

1. Calculate the sample variogram
meuse.v <- variogram(log(zinc) ~ 1, meuse)
1. Choose a suitable model (e.g. exponential, spherical), with or without nugget

2. Choose suitable initial values for partial sill(s), range(s), and possibly nugget

Let’s try the the spherical model with and some initial guesses for the parameters. Looking at the sample variogram, good initial values for the nugget may be 0.1, for the partial sill (sill minus nugget) = 0.6, and for the range perhaps 800m. We can plot both, the sample variogram and model on top of each other to evaluate our initial model.

meuse.sph <- vgm(psill=0.6, "Sph", range=800, nugget=0.1)
plot(meuse.v, meuse.sph, cutoff=1000) 1. Fit this model using one of the fitting criteria
meuse.vfit <- fit.variogram(meuse.v, meuse.sph)
meuse.vfit
##   model      psill    range
## 1   Nug 0.05065661   0.0000
## 2   Sph 0.59060200 896.9784

Let’s plot the fitted variogram over the sample variogram:

plot(meuse.v, meuse.vfit) Model fitting may fail if we choose initial values too far off from reasonable values.

fit.variogram(meuse.v, vgm(1, "Sph", 10, 1))
## Warning in fit.variogram(meuse.v, vgm(1, "Sph", 10, 1)): singular model in variogram fit
##   model psill range
## 1   Nug     1     0
## 2   Sph     1    10

## Partial variogram model fitting

Also partial fitting of variogram coefficients can be done with gstat. Suppose we know for some reason that the partial sill for the nugget model (i.e. the nugget variance) is 0.06, and we want to fit the remaining parameters. Alternatively, the range parameter(s) can be fixed using argument fit.ranges.

fit.variogram(meuse.v, vgm(0.6, "Sph", 800, 0.06), fit.sills = c(FALSE, TRUE))
##   model     psill    range
## 1   Nug 0.0600000   0.0000
## 2   Sph 0.5845857 923.0126

## Visual variogram model fitting

An alternative approach to fitting variograms is by visual fitting, the so-called eyeball fit. Package geoR provides a graphical user interface for inter-actively adjusting the parameters:

library(geoR)
v.eye <- eyefit(variog(as.geodata(meuse["zinc"]), max.dist = 1500))
ve.fit <- as.vgm.variomodel(v.eye[])

Typically, visual fitting will minimize $$|\gamma(h) - \hat{\gamma}(h)|$$ with emphasis on short distance/small $$\gamma(h)$$ values, as opposed to a weighted squared difference, used by most numerical fitting. An argument to prefer visual fitting over numerical fitting may be that the person who fits has knowledge that goes beyond the information in the data.

# Ordinary kriging

Kriging utilizes the theoretical variogram to interpolate values at any location based on distant-variance relationship. We’ll perform Ordinary Kriging at the meuse.pixels locations. Recall, ordinary kriging has a constant intercept, denoted in the formula with ~ 1:

library(gstat)

# create sample variogram
meuse.v <- variogram(log(zinc) ~ 1, meuse)

# fit variogram model
meuse.vfit <- fit.variogram(meuse.v, vgm(1, "Sph", 800, 1))

# ordinary kriging
lz.ok <- krige(log(zinc) ~ 1, meuse, meuse.pixels, meuse.vfit)
## [using ordinary kriging]
mapview(lz.ok, zcol='var1.pred') # + mapview(meuse, zcol='zinc')

Kriged maps using a variogram with no or a small nugget are typically very similar to inverse distance interpolation results.

The lz.ok object stores not just the interpolated values, but the variance values as well. These can be passed to the raster object for mapping as follows:

mapview(lz.ok, zcol="var1.var") # + mapview(meuse, zcol='zinc')

# Universal kriging

Universal kriging include other covariates to account for potential trends, e.g. here with distance to the river meuse.

# create sample variogram
meuse.rv <- variogram(log(zinc) ~ sqrt(dist), meuse)

# fit variogram model
meuse.rvfit <- fit.variogram(meuse.rv, vgm(1, "Sph", 300, 1))

# universal kriging
lz.uk <- krige(log(zinc) ~ sqrt(dist), meuse, meuse.pixels, meuse.rvfit)
## [using universal kriging]
# plot
mapview(lz.ok, zcol='var1.pred')

# Create your own meuse grid

xnew <- seq(meuse@bbox[1, 1], meuse@bbox[1, 2], length.out = 50)
ynew <- seq(meuse@bbox[2, 1], meuse@bbox[2, 2], length.out = 50)

gridnew <- expand.grid(xnew, ynew)

gridnew <- SpatialPixels(points=SpatialPoints(gridnew),
proj4string = CRS("+init=epsg:28992")) 