# Training data

## Learning goals

• Understand how forest types differ spectrally
• Gather training data for a forest type classification

## Background

Collecting training information is an essential step on your way to a classified map. The training pixels should be representative of the classes you want to map, since classification algorithms assign class labels to unknown pixels based on their similarity to the training dataset.

Collecting training data is time consuming, regardless of whether you collect it in the field or digitally. Even small conceptual mistakes may require a revision of your training dataset. Consequently, your training data collection approach should be thoroughly planned before you start. Consider the following points:

• A precise and robust definition of your target classes, based on the study region characteristics and research questions, is key. Targeting high thematic detail is beneficial, but spectral(-temporal) similarities between classes, such as tree species or crop types, might limit how robustly they can be distinguished. In such cases, it is advisable to consider a hierarchical structure that aggregates similar classes into more general ones, such as tree species into forest types, or crop types into annual / perennial croplands.

• Gather as much reference information as possible. Can we find additional datasets to guide our interpretation? Is any very high resolution (VHR) imagery available? Google Earth is a valuable source of VHR imagery, but it is critical to account for the exact acquisition date, which you can identify with the “historical imagery” tool. When using other datasets, it is also important to understand their relative geometric accuracies.

• Good knowledge of the target classes and their spectral and temporal characteristics in the study region is beneficial. We should consider spectrally similar classes and identify potential ways to prevent confusion, e.g., by aggregating classes or identifying spectral features which help to separate them better.

• Random point sampling is unlikely to be the best option for training data collection (unlike for independent validation data), as we might want to train small classes that are unlikely to be adequately captured by a random sample. Manual selection of training points is therefore often advised.

• Cover your study area well. The image below shows the spatial distribution of six training datasets collected during an earlier iteration of the course. Notice that some training points cluster in a subset of the study region. Ideally, however, training data should be well distributed across the study region to cover regional biophysical variability, such as different soil types, weather patterns, or topography.

• The classification algorithm of your choice might have specific requirements regarding the training data, e.g., concerning the number of samples, their distribution in the spectral feature space, or their purity (pure vs. mixed pixels). We will discuss these aspects later in the course.

• In practice, it's important to know your training data well. Are the classes separable with the data at hand? Are essential class characteristics well represented? Are there any outliers? To learn more, it is always wise to explore the spectral characteristics of your training data points. We can do this by investigating the spectral reflectances at our training data locations (e.g., through histograms or boxplots) and comparing them between classes. That's what we want to do today.

## Assignment

This assignment has two larger aims. First, you will learn to collect training data for a broad forest type classification. We provide forestry data to find representative sample pixels in QGIS. We will use the data you generate in this assignment for classification in the next session. Second, you will learn how broad forest types appear spectrally in images acquired during different parts of the growing season.

We provide the following datasets in our repository:

…sr_data/: Four cloud-masked surface reflectance image chips from Landsat 8:

• LC081890252014031001T1-SC20170927101754 (10 March 2014)
• LC081890252014071601T1-SC20171024094741 (16 July 2014)
• LC081890252015082001T1-SC20170927120710 (20 August 2015)
• LC081890252014110501T1-SC20170927102137 (05 November 2014)

…vector/: A shapefile and a *.kmz file for Google Earth, which will help you to accurately delineate the Landsat pixel locations and extents for training data collection.

…BDL/: Forestry data collected in 2015 which is publicly available here. For this session, we have prepared a shapefile containing the following attributes:

| Attribute field | Definition | Values |
|---|---|---|
| species_en | Dominant genus in each stand | Ash, Beech, Fir, Spruce… |
| part_cd | Share of this genus within the stand | 0–100 (in %) |
| spec_age | Average age of the trees in this stand | Age in years |
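The forestry attributes can also help you pre-filter the polygons for pure stands before digitizing training points. A minimal sketch in R, assuming the shapefile is named `bdl_forestry.shp` in the `BDL/` folder (both names are assumptions; adjust to your copy of the data):

```r
library(rgdal)

# Read the forestry polygons (hypothetical file name - adjust to your data)
bdl <- readOGR(dsn = 'BDL', layer = 'bdl_forestry')

# Keep only stands where the dominant genus covers at least 80% of the
# stand - these are safer locations for collecting pure training pixels
pure <- bdl[bdl$part_cd >= 80, ]

# How many pure stands per genus?
table(pure$species_en)
```

The 80% threshold is only an illustration; a stricter cutoff yields purer but fewer candidate stands.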

### 1) Prepare the training data collection

Visualize and arrange all aforementioned datasets in QGIS. Consider the following steps:

1. Find a good false-color representation of the Landsat 8 bands to highlight differences in vegetation. If needed, recap and explore common settings here and here.

2. Visualize the forestry data by choosing distinct colors for the different tree genera (species_en).

Which genera are dominant in the study area?

Generate a new point shapefile for storing the training data you will collect in the next task. It should contain the attribute field ‘classID’ (of type integer).

Make sure the shapefile has the same spatial reference system as the Landsat data.
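You can also verify in R that the point shapefile and the Landsat data share the same spatial reference system. A quick sketch (file and layer names are assumptions; adjust to your course folder):

```r
library(raster)
library(rgdal)

# Hypothetical paths - adjust to your course folder
img   <- stack('sr_data/LC081890252014031001T1-SC20170927101754.tif')
train <- readOGR(dsn = 'vector', layer = 'training_points')

# compareCRS() returns TRUE if both layers use the same projection
if (!compareCRS(img, train)) {
  # Reproject the points to the image CRS if they do not match
  train <- spTransform(train, crs(img))
}
```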

### 2) Collect training data

Switch to editing mode to add training points and assign the corresponding class number. Please collect at least 20 pixels per class.

| Class name | classID |
|---|---|
| Deciduous forest | 1 |
| Mixed forest | 2 |
| Coniferous forest | 3 |
| Non-forest | 4 |

Use the multi-temporal Landsat imagery, the forestry polygons, and the very high resolution imagery in Google Earth to identify training points. The historical imagery tool in Google Earth can be extremely useful for distinguishing deciduous from evergreen trees, as it contains imagery from the leaf-off phenological phase. The Landsat grid shapefile and .kmz will help you identify and label the precise training locations for the four classes. You can also install the Send2GE plugin, which allows you to click into the QGIS map canvas and fly directly to the same location in Google Earth.

Regularly save the collected points and store the final shapefile with 80+ points in your course folder.
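Before moving on, it is worth checking that each class really has at least 20 points. A small sketch, assuming your shapefile is stored as `training_points.shp` in the `vector/` folder (adjust both names):

```r
library(rgdal)

# Hypothetical layer name - adjust to your shapefile
train <- readOGR(dsn = 'vector', layer = 'training_points')

# Count points per classID; each of the four classes should have 20 or more
counts <- table(train$classID)
print(counts)
stopifnot(all(counts >= 20))
```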

### 3) Explore your training data

Load your shapefile in R using readOGR(). Extract the spectral values at your point locations from the March image using the extract() function. Specify sp = TRUE to append the spectral values to the point shapefile.

Make sure the result of this task is an object of type data.frame (named train.df in the code below). Your sample points should be represented as rows and the measured variables as columns (i.e., classID and the six spectral bands).

Create visualizations of the surface reflectance values in your training data, grouped or colored according to the land cover class following these examples:

```r
####################################################################
library(raster)
library(rgdal)
library(hexbin)
library(ggplot2)
library(reshape2)

####################################################################

# Read March image as stack and rename spectral bands
# (hypothetical file name - adjust the path to your course folder)
img <- stack('sr_data/LC081890252014031001T1-SC20170927101754.tif')
names(img) <- c("blue", "green", "red", "nir", "swir1", "swir2")

# Read training points; the following code assumes that the shapefile contains
# only the class attribute. In readOGR(), dsn specifies the path to the folder
# containing the file (must not end with /), layer specifies the name of the
# shapefile without the .shp extension (adjust both to your data)
train <- readOGR(dsn='vector', layer='training_points')

# Extract image values at training point locations
train.sr <- extract(img, train, sp=TRUE)

# Convert to data.frame and convert classID into factor
train.df <- as.data.frame(train.sr)
train.df$classID <- as.factor(train.df$classID)

####################################################################
### Create boxplots of reflectance grouped by land cover class

# Melt data.frame containing point id, classID, and 6 spectral bands
spectra.df <- melt(train.df, id.vars='classID',
                   measure.vars=c('blue', 'green', 'red', 'nir', 'swir1', 'swir2'))

# Create boxplots of spectral bands per class
ggplot(spectra.df, aes(x=variable, y=value, color=classID)) +
  geom_boxplot() +
  theme_bw()

####################################################################
### Create 2D scatterplot of image data and locations of training points

# Convert image to data.frame and remove missing values
sr.march.val <- data.frame(getValues(img))
sr.march.val <- na.omit(sr.march.val)

# Randomly sub-sample at most 100,000 pixels to speed up visualisation
n <- min(100000, nrow(sr.march.val))
sr.march.val <- sr.march.val[sample(nrow(sr.march.val), n),]

# Specify which bands to use for the x and y axis of the plot
xband <- "red"
yband <- "nir"

# Create plot of band value density and training data
ggplot() +
  geom_hex(data = sr.march.val, aes(x = get(xband), y = get(yband)), bins = 100) +
  geom_point(data = train.df, aes(x = get(xband), y = get(yband), color=classID, shape=classID),
             size = 2, inherit.aes = FALSE, alpha = 1) +
  scale_x_continuous(xband, limits=c(-10, quantile(sr.march.val[[xband]], 0.98, na.rm=TRUE))) +
  scale_y_continuous(yband, limits=c(-10, quantile(sr.march.val[[yband]], 0.98, na.rm=TRUE))) +
  scale_color_manual(values=c("red", "blue", "green", "purple")) +
  theme_bw()
```

Make sure you understand what the melt() function is doing. Feel free to adjust the plot layout.

As in the previous task, extract the values at your point locations from the Tasseled Cap stack. Create similar plots of the three Tasseled Cap components, grouped by classID. Inspect the boxplots to investigate the differences between your target classes. Try to answer the following questions:

• Do the Tasseled Cap components allow for discriminating your target classes?
• Which classes are likely difficult to separate?
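The Tasseled Cap step can follow the same pattern as the March extraction. A sketch, assuming the Tasseled Cap stack from the previous session is stored as `tc_march.tif` and your points as `vector/training_points.shp` (both names are assumptions):

```r
library(raster)
library(rgdal)
library(ggplot2)
library(reshape2)

# Hypothetical file names - adjust to your data
tc <- stack('tc_march.tif')
names(tc) <- c('brightness', 'greenness', 'wetness')
train <- readOGR(dsn = 'vector', layer = 'training_points')

# Extract Tasseled Cap values at the training points
tc.df <- as.data.frame(extract(tc, train, sp = TRUE))
tc.df$classID <- as.factor(tc.df$classID)

# Boxplots of the three components, grouped by class
tc.melt <- melt(tc.df, id.vars = 'classID',
                measure.vars = c('brightness', 'greenness', 'wetness'))
ggplot(tc.melt, aes(x = variable, y = value, color = classID)) +
  geom_boxplot() +
  theme_bw()
```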

Voluntary assignment: If you're keen on exploring spectral changes over time, repeat the above procedures for the remaining images (July, August, November), or think about alternative ways of visualizing the data.
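One way to approach the voluntary task is to loop over all four acquisition dates, combine the extracted spectra into one data frame, and plot them in facets. A sketch, assuming the four chips have been saved as single-file stacks with the hypothetical names below (adjust to your data):

```r
library(raster)
library(rgdal)
library(ggplot2)
library(reshape2)

train <- readOGR(dsn = 'vector', layer = 'training_points')

# Hypothetical file names - adjust to your stacked image chips
files <- c(march = 'sr_march.tif', july = 'sr_july.tif',
           august = 'sr_august.tif', november = 'sr_november.tif')

# Extract and melt the spectra for each date, then stack them row-wise
spectra <- do.call(rbind, lapply(names(files), function(d) {
  img <- stack(files[d])
  names(img) <- c('blue', 'green', 'red', 'nir', 'swir1', 'swir2')
  df <- as.data.frame(extract(img, train, sp = TRUE))
  df$classID <- as.factor(df$classID)
  df$date <- d
  melt(df, id.vars = c('classID', 'date'),
       measure.vars = c('blue', 'green', 'red', 'nir', 'swir1', 'swir2'))
}))

# Boxplots per band and class, one panel per acquisition date
ggplot(spectra, aes(x = variable, y = value, color = classID)) +
  geom_boxplot() +
  facet_wrap(~date) +
  theme_bw()
```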