Machine learning classification


Learning goals

  • Understand the fundamentals of the Random Forest classification algorithm
  • Conduct sensitivity analyses to parametrize the classification model
  • Produce a forest type map for the Carpathian study region

Background

Machine learning classification is a form of supervised machine learning in which labeled training data (e.g. image spectra and associated land cover classes) are used by a learning algorithm to define model parameters, or rules, which separate the data into their respective classes. Once this is done, we may use the trained model to estimate the appropriate class label for each pixel in the entire (unlabeled) image and thus, for example, derive a map of land cover classes.

This session focuses specifically on the Random Forest (RF) algorithm (Breiman 2001) and its implementation in R. The basic principles of the RF algorithm can be understood by disentangling its name:

“Forest”: RF is an ensemble of self-learning decision trees. Each decision tree consists of a large number of splits, each of which represents a simple binary (yes / no) decision, e.g. “is reflectance in band 4 > 0.65?”. The sequential binary branching creates a tree-like shape. The idea behind the ensemble component of RF is that many weak learners can come to one strong decision. In this way, RF generates a number of independent decision trees to achieve a given task. These decision trees, metaphorically speaking, form a forest.

Exemplary illustration of a two-dimensional feature space, divided by binary thresholds (left), as well as the corresponding decision tree (right).

“Random”: RF has two layers of randomness. First, it uses a bootstrapped random sample (with replacement) of the training dataset for growing each individual decision tree. The second random component is the selection of the features considered at each binary split. These two components work together to define each decision tree: at each decision node, a random subset of candidate features (e.g. spectral bands) is drawn, and the bootstrapped sample is used to determine the rule that provides the “best split” of the data.


How a self-learning tree grows

Self-learning decision trees automatically find the “best split” in the feature space to separate the training data into two groups. Each resulting subgroup is split further until a clear decision can be made for each sample. The algorithm attempts to create splits that divide the data into relatively homogeneous subgroups.

Finding a good split requires information on which feature (e.g. spectral band) and which value (e.g. reflectance) should be used to separate the data. The best split is identified based on a measure of dataset heterogeneity, the Gini impurity index. RF selects the split (feature and value) which minimizes the heterogeneity of the resulting datasets:

\(Gini_{D} = 1 - \sum_{j=1}^n p_j^2\)

where \(D\) is the dataset at hand and \(p_j\) is the fraction of class \(j\) in the dataset. A high proportion of a single class thus produces a low Gini index value. The Gini index for a split combines the Gini indices of the two datasets \(D_1\) and \(D_2\) resulting from the split. To account for the different sizes of the two datasets, the Gini of each sub-dataset is weighted by the number of observations \(N_1\) and \(N_2\) it contains:

\(Gini_{split} = \frac{N_1}{N} Gini_{D_1} + \frac{N_2}{N} Gini_{D_2}\)

with \(N = N_1 + N_2\) the total number of observations before the split.

If we considered all available features at each split, the resulting decision trees would likely be very similar. This is why the algorithm draws a random subset of features, among which it identifies the split (feature and value) that produces the most homogeneous sub-groups of training data. The user controls the number of features considered (or tried) at each split via the mtry parameter (in the R implementation). Decreasing the number of variables tried at each split de-correlates the structure of the trees.

The user defines the number of decision trees in the forest (via the ntree parameter in the R implementation). Once ntree decision trees are built, we have a readily applicable rule-set to classify each pixel in our image. Every tree casts one “vote” for a class, and the majority of votes determines the final class label for each pixel.
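To make this concrete, here is a minimal sketch of training an RF classifier with the randomForest package. The data.frame train.df, its factor column classID, and the band columns are assumptions for illustration.

library(randomForest)

# Hypothetical training data: a data.frame train.df with a factor
# column classID and one numeric column per spectral band
rf_model <- randomForest(classID ~ ., data = train.df,
                         ntree = 500,  # number of trees in the forest
                         mtry = 3)     # features tried at each split

print(rf_model)  # model summary incl. OOB error and confusion matrix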


RF Probabilities

The RF classifier predicts the class label for a given data point through the majority vote of all individual trees. Consider a model with ntree = 500: one data point may receive different labels from individual trees. It may, for example, be labeled as class 1 by 280 trees, class 2 by 130 trees, class 3 by 75 trees, and class 4 by 15 trees. The data point will then be predicted as class 1.

It can be useful to analyze this information in more detail. For instance, users can access the class-wise number of votes and standardize them into probabilities. For the above example:

Class   Votes   Probability
1       280     0.56
2       130     0.26
3        75     0.15
4        15     0.03
Total   500     1.00

Extracting and visualizing RF probabilities spatially gives a continuous indication of where the classifier is more or less “certain” about its predicted label. This can be interpreted as the agreement among the randomly grown trees in the specific model setup and may guide, for instance, the collection of additional training data to further improve the model.
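As a minimal sketch (assuming the hypothetical rf_model and train.df from above), the votes can be accessed as follows; new.df stands in for any unlabeled data:

# Class-wise (OOB) vote fractions for the training samples
head(rf_model$votes)

# Vote-based class probabilities for new, unlabeled data
probs <- predict(rf_model, newdata = new.df, type = "prob")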


The out-of-bag error

RF has another great feature: the out-of-bag (OOB) error. After growing each tree, the left-out samples (i.e. those not used for training that particular tree) can be classified in a test run to assess model performance, which allows us to calculate a classification error. For illustration, consider 8 left-out samples, of which 2 are wrongly and 6 are correctly classified: the OOB error is 2/8 = 0.25. This simple measure of model performance is useful for parameterizing the classification model; we can, for instance, test various combinations of input features or training datasets.
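A short sketch of accessing the OOB error, again assuming the hypothetical rf_model from above:

# err.rate holds the OOB error (column "OOB") and the per-class
# errors, one row per number of trees grown
oob <- rf_model$err.rate[, "OOB"]
tail(oob, 1)  # OOB error of the full ensemble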

Grid representation of the out-of-bag classification error for various combinations of two input composites with different target days of the year.


Tuning hyperparameters

Hyperparameters, such as mtry and ntree, differ from model parameters in that they are not learned from the data but must be specified prior to learning. Some of them can be tuned automatically to optimize the learning process. Algorithms differ in their “tunability”, and the Random Forest algorithm is known to work very well without much tuning. Here's how to make informed decisions on ntree and mtry.

ntree

Because of the random nature of the model and the law of large numbers, the error of the model generally decreases as ntree increases. It will reach a point where the OOB error rate stabilizes (converges) and adding more trees only decreases computing performance.

Determining the value of ntree at the absolute minimum error rate may seem intuitive at first. However, if we choose the optimal ntree by minimizing err.rate, we may not be looking at the true point of convergence: in the graph below, for instance, we might choose an ntree of 130, although much of the remaining variation is noise and the error only stabilizes after around 300 trees are grown. This number can vary depending on the input features, training data, and target classes (i.e. the complexity of the problem). As a rule of thumb, it is therefore best to use as many trees as computationally feasible, but not more than needed.

Furthermore, due to the randomness of RF, additional variability may occur between models trained with the same training data and input features. We therefore recommend verifying your choice of ntree by comparing multiple model runs, as shown below.

Bottom line: tuning the number of trees is unnecessary. Instead, set the number of trees to a large, computationally feasible number by comparing the point of convergence across multiple model runs. Read here to find out more about why we shouldn't tune ntree.
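The convergence check could look like the following sketch, which overlays the OOB error curves of several independent runs (train.df is again an assumption):

library(randomForest)

# Overlay the OOB error curves of five independent model runs
plot(0, type = "n", xlim = c(1, 1000), ylim = c(0, 0.5),
     xlab = "ntree", ylab = "OOB error")  # axis limits are illustrative
for (i in 1:5) {
  rf_i <- randomForest(classID ~ ., data = train.df, ntree = 1000)
  lines(rf_i$err.rate[, "OOB"], col = i)
}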

mtry

We can optimize mtry, the number of variables considered at each split. To minimize the variance between multiple model runs, the prediction for a given hyperparameter configuration is repeated k times on k folds of the training data: the initial training dataset is split into k partitions (folds), of which all but one are used to train the model, while the withheld fold is used to validate the prediction.

We can thereby identify the optimal value of mtry using an automated grid search, which will test a user-defined set of values for mtry and perform k iterations of model training, prediction, and validation. Please take a look at the advanced assignments if you are interested in exploring this further.


Variable importances & partial dependence plots

RF also allows us to investigate which features (e.g. spectral bands) were most useful for our classification problem. Variable importance can be expressed for each individual feature either as the mean decrease in Gini impurity (how much “better” we can split the data based on this feature) or the mean decrease in accuracy (how much the out-of-bag error increases when the feature's values are permuted). It is important to note that these variable importances should be interpreted cautiously, as they always depend strongly on the available input features and their collinearity structure, as well as on the training data used. Still, the variable importance of a hypothetical model might reveal, for instance, that the swIR 1 reflectance at DOY 196 is an important feature in this model.

Mean decrease in accuracy (increase in OOB error) caused by the features included in the classification model.
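A brief sketch of how the importance measures can be retrieved; the permutation-based accuracy measure requires importance = TRUE at training time (rf_model and train.df are the assumptions from above):

# Train with permutation importance enabled, then plot both measures
rf_model <- randomForest(classID ~ ., data = train.df, importance = TRUE)
varImpPlot(rf_model)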

As a next step, we can use partial dependence plots to further investigate the relationship between individual features and the probability of class occurrence. We can try to interpret this graph and assess relationships between individual features and our classes of interest. In the example below, the probability of occurrence of the deciduous forest class is high in the upper range of swIR 1 reflectance (above 13%), while that of the coniferous forest class is high in the lower range of swIR 1 reflectance (10% and less).

Change in class probability across the value range of the swIR 1 reflectance at DOY 196.

While generating meaningful insights is somewhat tricky with reflectance values, we could also imagine including, e.g., a digital elevation model into the RF model to learn about the probability of class occurrence across elevation gradients. An additional read on PDPs with illustrative examples can be found here.
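A partial dependence plot can be sketched as follows; the predictor name "SW1" and the class label "deciduous" are assumptions for illustration:

# Partial dependence of the deciduous class on one predictor
partialPlot(rf_model, pred.data = train.df, x.var = "SW1",
            which.class = "deciduous")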


Summing up, there are some key advantages of the RF algorithm that make it popular in data science and remote sensing: RF is easy to understand, computationally efficient, straightforward to parameterize, and it often produces good results. It provides quick insights into model performance via the out-of-bag error and allows for investigating variable importance measures and partial dependence plots.


Assignment

In this assignment, we will deal with image classification using the Random Forest algorithm, as implemented in the randomForest package. Specifically, we will use your training data and your pixel-based composites from the last assignments to map forest types in the Western Beskids. We will assess the performance of multiple classification models through the out-of-bag error and investigate variable importances. In case you missed the last sessions, we provide exemplary composites and training data here.


1) Training a Random Forest model

  1. Load the vector file containing your training data points using readOGR(). Next, create a stack() / brick() of your favourite pixel-based composite (last week's result). A hedged code sketch covering steps 1-4 follows after this list.

  2. Use extract() to create a data.frame with training points as rows, and class labels (classID) as well as the spectral bands of your composites as columns. Remove the day-of-year and year flags (bands 7 and 8) for the next steps.

  3. As we want to train a classification (and not a regression) model, the randomForest() function expects the dependent variable to be of type factor. Use as.factor() to convert the classID column. The RF algorithm cannot deal with NoData (NA) values, so remove NAs from the data.frame.

  4. Train a randomForest() classification model with the data.frame created in the prior step. Make sure to include only useful predictors.

  5. Repeat the RF training procedure and produce additional model objects. Use i) the other pixel-based composite, and ii) a stack of both composites as input features.
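For orientation, here is a hedged sketch of steps 1-4; all file names, the classID attribute, and the band positions are assumptions:

library(rgdal)
library(raster)
library(randomForest)

# 1. Load training points and the pixel-based composite
train.points <- readOGR("training_points.shp")
composite <- brick("composite.tif")

# 2. Extract spectral values at the training point locations
train.df <- as.data.frame(extract(composite, train.points))
train.df$classID <- as.factor(train.points$classID)  # 3. factor conversion

# Drop the DOY and year flag bands (bands 7 and 8) and remove NAs
train.df <- na.omit(train.df[, -c(7, 8)])

# 4. Train the RF model on the remaining predictors
rf_model <- randomForest(classID ~ ., data = train.df, ntree = 500)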


2) Investigating model performance

The RF model objects contain a wealth of information on the model parameters and performance. Assess the out-of-bag (OOB) error estimates of the trained models by inspecting the err.rate attribute of your model objects. Answer the following questions:

  1. Which model has the lowest OOB error?

  2. How does the OOB error behave when increasing the number of trees in your model (ntree)? You can access the OOB error per number of trees via err.rate. Use this information to determine a suitable value for ntree; see the sketch after this list. In terms of a model's predictive performance, ntree cannot be too high; only the computational performance will decrease. The aim is to find the spot where the error rate stabilises (converges) and adding more trees no longer improves accuracy.

  3. In the model with the lowest OOB error, which of the four classes has the highest OOB error?

  4. In case you are not satisfied with your model performance, consider joining forces with another team and merge your training datasets. Feel free to experiment and document your findings.
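The following sketch illustrates questions 1-3, assuming hypothetical model objects rf_model1, rf_model2 and rf_model3 from part 1:

# Final OOB error and per-class errors of one model
tail(rf_model1$err.rate, 1)

# OOB error as a function of the number of trees
plot(rf_model1$err.rate[, "OOB"], type = "l",
     xlab = "ntree", ylab = "OOB error")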


3) Final model parametrization and variable importances

  1. Train a final model with the best combination of input images and ntree.

  2. Investigate the variable importances using varImpPlot(). Use partialPlot() to produce partial dependence plots for your most important predictor and all four classes; see the sketch after this list. Can you explain the differences between classes?
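A sketch of a PDP for each class might look as follows; the final model object rf_final and the predictor name "B5" are assumptions:

# Partial dependence of each class on the most important predictor
for (cl in levels(train.df$classID)) {
  partialPlot(rf_final, pred.data = train.df, x.var = "B5",
              which.class = cl, main = paste("Class", cl))
}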


4) Classification

Perform a classification of the image stack using the predict() function. Write the resulting map to disk in GTiff format. When doing so, consider choosing the appropriate datatype argument.

### Run predict() to store RF predictions
map <- predict(composites, model)

### Write classification to disk
writeRaster(map, filename, datatype="INT1S")

Inspect your result in QGIS and visually identify well classified regions and classes, as well as misclassifications.

Next, calculate class probabilities for each pixel and use them to further guide your assessment of map quality.

### Run predict() to store RF probabilities for class 1-4
rfp <- predict(composites, model, type = "prob", index=c(1:4))

### Scale probabilities to integer values 0-100 and write to disk
writeRaster(rfp*100, filename, datatype="INT1S", overwrite=TRUE)

Advanced assignment: Automated hyperparameter optimization

The e1071 package offers automated parameter optimization for several classification algorithms. The tune.randomForest() function, for instance, allows you to optimize the Random Forest hyperparameter mtry. Run the code below to check whether tuning mtry improves your error rate substantially.

library(e1071)

# Define accuracy from 5-fold cross-validation as optimization measure
cv <- tune.control(cross = 5) 

# Use tune.randomForest() to assess the optimal mtry (ntree held fixed)
rf.tune <- tune.randomForest(classID~., data = train.df, ntree=750, mtry=c(2:10), tunecontrol = cv)

# Store the best model in a new object for further use
rf.best <- rf.tune$best.model

# Is the parametrization and/or performance different from your previous model?
print(rf.best)

Advanced assignment: Support-Vector Machines (SVMs)

Support-Vector Machines (SVMs) are another form of supervised learning algorithm often used with remote sensing datasets. If you are curious how the SVM approach would deal with the same classification problem, feel free to run a test. SVMs are sensitive to their parametrization, which is not as intuitive as for the RF classifier. Luckily, we can also optimize the parameter settings with a grid search using tune.svm(). Read the SVM vignette for further information.

SVMs are highly flexible algorithms due to their use of sub-functions known as kernels. Several kernels are available in the e1071 SVM implementation (Radial Basis Function (RBF), sigmoid, polynomial, linear). Let's stick to the RBF kernel (the default setting), as it has shown good performance in a variety of settings. Feel free to read the scikit-learn tutorial on the role of the parameters \(C\) and \(\gamma\) in the RBF SVM.

The RBF kernel has the advantage of requiring the specification of only one kernel parameter, \(\gamma\) (gamma). In addition, we have to define the cost parameter \(C\), which controls the trade-off between training error and model complexity. The authors of the SVM implementation suggest trying a range from small to large values for \(C\). Below, you find a code example following the suggestions in the scikit-learn tutorial.

library(e1071)

# Define accuracy from 5-fold cross-validation as optimization measure
cv <- tune.control(cross = 5) 

# Use tune.svm() for a grid search of the gamma and cost parameters
svm.tune <- tune.svm(classID~., data = train.df, kernel = 'radial', gamma = 10^(-3:3), cost = 10^(-3:3), tunecontrol = cv)

# Store the best model in a new object
svm.best <- svm.tune$best.model

# Which parameters performed best?
print(svm.best$gamma)
print(svm.best$cost)

Now use the predict() function to produce a classified map as above, using svm.best as the model object. Write the result to disk and compare the RF and SVM classifications in QGIS.
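A minimal sketch, assuming the composites stack and svm.best from above (the output file name is a placeholder):

### Classify the composite stack with the tuned SVM and write to disk
svm.map <- predict(composites, svm.best)
writeRaster(svm.map, "svm_classification.tif", datatype="INT1S")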


Reading materials

In the next session, we would like to discuss the following paper:

FAO (2016): Map Accuracy Assessment and Area Estimation - A Practical Guide

The paper outlines best-practice guidelines for accuracy assessment and area estimation in the context of land change maps. Please read the paper thoroughly, make sure you understand the basic ideas behind it, and write down questions for the next session. Also, answer the following broad questions:

  • Why is a stratified sampling advised?
  • How can we determine how many samples are needed for accuracy assessment?
  • What is the difference between populating an error matrix with \(p_{ij}\) rather than \(n_{ij}\)?
  • How can we derive unbiased area estimates?

Please take a look at the Excel sheet provided here. Use this table to improve your understanding of the area adjusted accuracy assessment. You can manipulate values in the confusion matrix / class proportions and trace the formulas which link these values with the resulting class-specific and overall accuracies as well as the area estimates.

We also highly recommend reading Olofsson et al. (2014): Good practices for estimating area and assessing accuracy of land change, which documents the state of the art for accuracy assessment and area estimation in the context of land change maps.



Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.