- GIS vector data in R
- Creating spatial points (
`SpatialPoints()`

) - Interactive map visualizations (
`mapview()`

)

- Creating spatial points (
- Overlaying vector and raster data
- Handling projections (
`projection(); spTransform()`

) - Extracting raster values at point locations (
`raster::extract()`

)

- Handling projections (
- Descriptive point pattern analysis
- Creating a density raster (
`rasterize(point.data, raster, fun = "count"`

) - Kernel density estimates (
`library(spatialEco); sp.kde()`

) - Distance (
`distance()`

)

- Creating a density raster (

- Inverse Distance Weighted interpolation
- Semivariogram
- Kriging

We will mostly deal with package `gstat`

, because it offers the widest functionality in the geostatistics curriculum for R: it covers variogram cloud diagnostics, variogram modeling, everything from global simple kriging to local universal cokriging, multivariate geostatistics, block kriging, indicator and Gaussian conditional simulation, and many combinations.

```
library("gstat") # geostatistics
library("mapview") # map plot
library("sp") # spatial data handling
library("ggplot2") # plotting
library("rgdal") # reading and writing spatial data
```

For this tutorial, we use a dataset provided by the `sp`

package. The `meuse`

data set includes four heavy metals measured in the top soil in a flood plain along the river Meuse, along with a handful of covariates. To see a detailed description of those covariates type in `?meuse`

at the command line. It seems, that polluted sediments carried by the river are mostly deposited close to the river bank and in areas with low elevation.

```
data(meuse) # load built-in dataset
head(meuse)
```

```
## x y cadmium copper lead zinc elev dist om ffreq soil lime landuse dist.m
## 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6 1 1 1 Ah 50
## 2 181025 333558 8.6 81 277 1141 6.983 0.01222430 14.0 1 1 1 Ah 30
## 3 181165 333537 6.5 68 199 640 7.800 0.10302900 13.0 1 1 1 Ah 150
## 4 181298 333484 2.6 81 116 257 7.655 0.19009400 8.0 1 2 0 Ga 270
## 5 181307 333330 2.8 48 117 269 7.480 0.27709000 8.7 1 2 0 Ah 380
## 6 181390 333260 3.0 61 137 281 7.791 0.36406700 7.8 1 2 0 Ga 470
```

The built-in dataset is a table (data.frame). To conduct geostatistical analysis, we need to convert this into a spatial data format, i.e. a SpatialPointDataFrame. You have seen how to convert a table with x-y-coordinates in the previous session. To save some time, we prepared a shapefile `meuse.shp`

that you can load with `rgdal::readOGR()`

.

`meuse <- readOGR('data', 'meuse')`

```
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/dirk/Repository/teaching/gcg_quantitative-methods/data", layer: "meuse"
## with 155 features
## It has 12 fields
```

`meuse`

```
## class : SpatialPointsDataFrame
## features : 155
## extent : 178605, 181390, 329714, 333611 (xmin, xmax, ymin, ymax)
## crs : +proj=sterea +lat_0=52.1561605555556 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## variables : 12
## names : cadmium, copper, lead, zinc, elev, dist, om, ffreq, soil, lime, landuse, dist_m
## min values : 0.2, 14, 37, 113, 5.18, 0, 1, 1, 1, 0, Aa, 10
## max values : 18.1, 128, 654, 1839, 10.52, 0.880389, 17, 3, 3, 1, W, 1000
```

Spatial exploratory data analysis starts with the plotting of maps with a measured variable. Zinc concentration seems to be larger close to the river Meuse banks.

```
library(mapview)
mapview(meuse, zcol='zinc')
```

Usually, interpolation is done on a regular grid. For the meuse data, covariates are provided for x-y locations forming a regular grid in the built-in `meuse.grid`

dataset (see `?meuse.grid`

).

```
data(meuse.grid)
head(meuse.grid)
```

```
## x y part.a part.b dist soil ffreq
## 1 181180 333740 1 0 0.0000000 1 1
## 2 181140 333700 1 0 0.0000000 1 1
## 3 181180 333700 1 0 0.0122243 1 1
## 4 181220 333700 1 0 0.0434678 1 1
## 5 181100 333660 1 0 0.0000000 1 1
## 6 181140 333660 1 0 0.0122243 1 1
```

Notice, meuse.grid is a data.frame containing x-y coordinates. If you plot the coordinates of the data.frame, you see that they form a regular grid:

`plot(meuse.grid$x, meuse.grid$y, cex=0.1, pch=19)`

However, a data.frame is not spatial, so we need to convert it into a spatial grid, i.e., an object of type `SpatialPixel`

or `SpatialPixelsDataFrame`

. The `SpatialPixelsDataFrame`

from the `sp`

package provides R with a raster type analogous to the `raster`

class of the `raster`

package.

```
meuse.pixels <- SpatialPixelsDataFrame(points=meuse.grid[,c("x", "y")],
data=meuse.grid[,-(1:2)],
proj4string = CRS(proj4string(meuse)))
```

`mapview(meuse.pixels, zcol="dist")`

Inverse distance-based weighted interpolation (IDW) computes a weighted average,

\[\begin{equation} \hat{Z}(s_0) = \frac{\sum_{i=1}^{n}w(s_i)Z(s_i)}{\sum_{i=1}^{n}w(s_i)} \end{equation}\]

where weights for observations are computed according to their distance to the interpolation location,

\[w(s_i) = ||s_i-s_0||^{-p}\]

with \(|| ? ||\) indicating Euclidean distance and \(p\) an inverse distance weighting power, defaulting to 2. If \(s_0\) coincides with an observation location, the observed value is returned to avoid infinite weights.

The inverse distance power determines the degree to which the nearer point(s) are preferred over more distant points; for large values IDW converges to the one-nearest-neighbor interpolation.

```
library(gstat)
zn.idw <- idw(zinc ~ 1, meuse, meuse.pixels, idp = 2)
```

`## [inverse distance weighted interpolation]`

`as.data.frame(zn.idw)[1:5, ]`

```
## x y var1.pred var1.var
## 1 181180 333740 633.6864 NA
## 2 181140 333700 712.5450 NA
## 3 181180 333700 654.1617 NA
## 4 181220 333700 604.4422 NA
## 5 181100 333660 857.2558 NA
```

The output variable is called var1.pred, and the var1.var values are NA because inverse distance does not provide prediction error variances.

`mapview(zn.idw, zcol='var1.pred', layer.name = "Zinc ppm")`

IDW considers only distances to the prediction location and thus ignores the spatial configuration of observations. This may lead to undesired effects if the observation locations are strongly clustered.

Notably, IDW weights are guaranteed to be between 0 and 1, resulting in interpolated values never outside the range of observed values.

The spatial correlation between two observations of a variable \(z(s)\) at locations \(s_1\) and \(s_2\) cannot be estimated, as only a single pair is available. To obtain the necessary replication, we need to replicate the experiment with multiple samples from multiple locations. But to do that, we need to assume that the spatial process is the same at all those locations. In other words, we need to make **stationarity** assumptions about our spatial process.

Stationarity means that characteristics of a random function (our spatial process that generates a variable) stay the same when shifting a given set of \(n\) points from one part of a region to another. The spatial process is called **strictly stationary** if for any set of \(n\) points \(s_1,..., s_n\) the multivariate distribution does not change. In other words, everything looks the same everywhere and anytime. Well, this may not seem like a reasonable assumption for geographic processes.

A less restrictive condition is given by **weak stationarity**, which is also called **second-order stationarity** because it only assumes stationarity of the first two moments of the distribution. That is, the spatial process has a constant mean \(E[Z(x)]\) and variance \(Var[Z(x)]\). Also, the covariance between two observations separated by a distance \(h\): \(cov(Z(x+h), Z(x))\) only relies on the distance \(h\) between the observations and not on the spatial location \(x\) inside the region.

A specific type of second-order stationarity, and most important for the variogram analysis, is called **intrinsic stationarity**. Here, we assume second-order stationarity of the **differences** between pairs of values at two locations: \(Z(x+h)−Z(x)\). We are not interested in \(Z(x)\) but in the differences. Again, this implies the assumption that the variance of these differences does not depend on location but only depends on separation distance \(h\):

\[Var(Z(x+h)−Z(x))=2\gamma(h)\]

A simple way to acknowledge that spatial correlation is present or not is to make scatter plots of pairs \(Z(s_i)\) and \(Z(s_j)\), grouped according to their separation distance \(h_{ij} = || {s_{i} - s_{j}} ||\).

```
library(gstat)
hscat(log(zinc) ~ 1, meuse, (0:9) * 100)
```

We can plot the difference between pairs of sample values as a function of separation distance \(h\). The so-called variogram cloud plots all possible squared differences of observation pairs \((Z(x) - Z(x+h))^2\) against their separation distance \(h\). The difference is divided by 2 to account for the fact that two points share this value. Hence, \(\gamma\) is often called semi-variance (half the variance).

\[\gamma_h=\frac{1}{2}(Z(x) - Z(x+h))^2\]

```
meuse.vc <- variogram(log(zinc) ~ 1, meuse, cloud = TRUE)
plot(meuse.vc, ylab=bquote(gamma), xlab=c("h (separation distance in m)"))
```

Since \(\gamma\) is a difference it is a measure of dissimilarity, i.e., the smaller the dissimilarity the more similar are observations. Thus, the variogram cloud allows you to observe if sample pairs closer to each other are more similar than pairs further apart, which is usually the case. The variogram cloud also allows you to observe the distribution of dissimilarity at particular distances. For example, the variogram cloud above shows that \(\gamma\) has a skewed distribution at any distance. That is, the majority of plots are similar even at higher distances. However, dissimilarity increases with distances up to 500 m or so.

In geostatistics the spatial correlation is often modeled from the sample variogram. The sample variogram plots averages of the variogram cloud values over distance intervals \(h\). If we further assume **isotropy**, which is direction independence of semivariance, we can replace the vector \(h\) with its length, \(||h||\).

Under this assumption, the variogram can be estimated from \(N_h\) sample data pairs \(z(x)\), \(z(x + h)\) for a number of distances (or distance intervals) \(\tilde{h}_j\) by:

\[\hat{\gamma}(\tilde{h}_j) = \frac{1}{2N_h}\sum_{i=1}^{N_h}(Z(x)-Z(x+h))^2\]

You can construct a sample `variogram()`

in R using the variogram function of the gstat package. The ~ 1 defines a single constant predictor leading to a spatially constant mean coefficient.

```
meuse.v <- gstat::variogram(log(zinc) ~ 1, meuse)
plot(meuse.v, ylab=bquote(gamma), xlab=c("h (separation distance in m)"))
```

The previously used variogram function makes a number of decisions by default. It decides that direction is ignored: point pairs are merged on the basis of distance, not direction. An alternative is, for example to look in four different angles by specifying the `alpha`

keyword.

`plot(variogram(log(zinc) ~ 1, meuse, alpha = c(0, 45, + 90, 135)))`

A similar issue is the cutoff distance, which is the maximum distance up to which point pairs are considered and the width of distance interval over which point pairs are averaged in bins. The default value `gstat`

uses for the cutoff value is one third of the largest diagonal of the bounding box (or cube) of the data. Good reasons to decrease the cutoff may be when a local prediction method is foreseen, and only semivariances up to a rather small distance are required. For the interval width, `gstat`

uses a default of the cutoff value divided by 15.

`plot(variogram(log(zinc) ~ 1, meuse, cutoff = 1000, width = 50))`

Using the sample variogram (also called experimental variogram), we can fit a theoretical variogram model to estimate a number of parameters that describe the range of autocorrelation and variance:

- \(R\) is the
*range*of spatial auto-correlation, i.e., sample locations separated by distances closer than R are spatially auto-correlated, whereas locations farther apart than R are not. - \(C_0\) is the
*nugget*effect, which can be attributed to measurement errors or spatial sources of variation at distances smaller than the sampling interval. - \(C_1\) is the
*partial sill*(sill minus nugget). - \(C_0+C_1\) is the
*sill*, which is the modeled semi-variance at range \(R\).

A wide range of variogram models are available for estimating these parameters, e.g., linear, spherical, exponential, Gaussian. Note, the exponential, Gaussian, and Bessel model reach their sill asymptotically (as \(h \to \infty\)). For an overview of all available variogram models in R, type in the command line `vgm()`

without any model arguments.

Use the `vgm()`

function to construct a variogram model by specifying the nugget \(C_0\), partial sill \(C_1\) (`psill`

), `range`

\(R\), and the model form.

```
myVariogramModel <- vgm(psill=1, "Sph", range=100, nugget=0.5)
plot(myVariogramModel, cutoff=150)
```

A visual overview of the basic variogram models available in `gstat`

can be obtained as follows:

`show.vgms()`

For weighted least squares fitting of a variogram model to the sample variogram, we need to take several steps: 1) calculate the sample variogram, 2) choose a suitable model, 3) choose initial parameters for our model, 4) and fit the model.

- Calculate the sample variogram

`meuse.v <- variogram(log(zinc) ~ 1, meuse)`

Choose a suitable model (e.g. exponential, spherical), with or without nugget

Choose suitable initial values for partial sill(s), range(s), and possibly nugget

Let’s try the the spherical model with and some initial guesses for the parameters. Looking at the sample variogram, good initial values for the nugget may be 0.1, for the partial sill (sill minus nugget) = 0.6, and for the range perhaps 800m. We can plot both, the sample variogram and model on top of each other to evaluate our initial model.

```
meuse.sph <- vgm(psill=0.6, "Sph", range=800, nugget=0.1)
plot(meuse.v, meuse.sph, cutoff=1000)
```

- Fit this model using one of the fitting criteria

```
meuse.vfit <- fit.variogram(meuse.v, meuse.sph)
meuse.vfit
```

```
## model psill range
## 1 Nug 0.05065661 0.0000
## 2 Sph 0.59060200 896.9784
```

Let’s plot the fitted variogram over the sample variogram:

`plot(meuse.v, meuse.vfit)`

Model fitting may fail if we choose initial values too far off from reasonable values.

`fit.variogram(meuse.v, vgm(1, "Sph", 10, 1))`

`## Warning in fit.variogram(meuse.v, vgm(1, "Sph", 10, 1)): singular model in variogram fit`

```
## model psill range
## 1 Nug 1 0
## 2 Sph 1 10
```

Also partial fitting of variogram coefficients can be done with `gstat`

. Suppose we know for some reason that the partial sill for the nugget model (i.e. the nugget variance) is 0.06, and we want to fit the remaining parameters. Alternatively, the range parameter(s) can be fixed using argument `fit.ranges`

.

`fit.variogram(meuse.v, vgm(0.6, "Sph", 800, 0.06), fit.sills = c(FALSE, TRUE))`

```
## model psill range
## 1 Nug 0.0600000 0.0000
## 2 Sph 0.5845857 923.0126
```

An alternative approach to fitting variograms is by visual fitting, the so-called eyeball fit. Package `geoR`

provides a graphical user interface for inter-actively adjusting the parameters:

```
library(geoR)
v.eye <- eyefit(variog(as.geodata(meuse["zinc"]), max.dist = 1500))
ve.fit <- as.vgm.variomodel(v.eye[[1]])
```

Typically, visual fitting will minimize \(|\gamma(h) - \hat{\gamma}(h)|\) with emphasis on short distance/small \(\gamma(h)\) values, as opposed to a weighted squared difference, used by most numerical fitting. An argument to prefer visual fitting over numerical fitting may be that the person who fits has knowledge that goes beyond the information in the data.

Kriging utilizes the theoretical variogram to interpolate values at any location based on distant-variance relationship. We’ll perform Ordinary Kriging at the meuse.pixels locations. Recall, ordinary kriging has a constant intercept, denoted in the formula with `~ 1`

:

```
library(gstat)
# create sample variogram
meuse.v <- variogram(log(zinc) ~ 1, meuse)
# fit variogram model
meuse.vfit <- fit.variogram(meuse.v, vgm(1, "Sph", 800, 1))
# ordinary kriging
lz.ok <- krige(log(zinc) ~ 1, meuse, meuse.pixels, meuse.vfit)
```

`## [using ordinary kriging]`

`mapview(lz.ok, zcol='var1.pred') # + mapview(meuse, zcol='zinc')`

Kriged maps using a variogram with no or a small nugget are typically very similar to inverse distance interpolation results.

The lz.ok object stores not just the interpolated values, but the variance values as well. These can be passed to the raster object for mapping as follows:

`mapview(lz.ok, zcol="var1.var") # + mapview(meuse, zcol='zinc')`

Universal kriging include other covariates to account for potential trends, e.g. here with distance to the river meuse.

```
# create sample variogram
meuse.rv <- variogram(log(zinc) ~ sqrt(dist), meuse)
# fit variogram model
meuse.rvfit <- fit.variogram(meuse.rv, vgm(1, "Sph", 300, 1))
# universal kriging
lz.uk <- krige(log(zinc) ~ sqrt(dist), meuse, meuse.pixels, meuse.rvfit)
```

`## [using universal kriging]`

```
# plot
mapview(lz.ok, zcol='var1.pred')
```

```
xnew <- seq(meuse@bbox[1, 1], meuse@bbox[1, 2], length.out = 50)
ynew <- seq(meuse@bbox[2, 1], meuse@bbox[2, 2], length.out = 50)
gridnew <- expand.grid(xnew, ynew)
gridnew <- SpatialPixels(points=SpatialPoints(gridnew),
proj4string = CRS("+init=epsg:28992"))
```

Pebesma, E. (2014): gstat user’s manual. http://www.gstat.org/gstat.pdf

Bivand, R. S., Pebesma, E., Gómez-Rubio, V. (2013): Applied Spatial Data Analysis with R. Second Edition. Springer

Gimond, M. (2020): Intro to GIS and Spatial Analysis

Pebesma, E. (2020): The meuse data set: a brief tutorial for the gstat R package. May 18, 2020

Wackernagel, H. (1998): Multivariate Geostatistics (2nd edition). Springer

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.