The Random Forest (RF) algorithm (Breiman 2001) is a supervised classification algorithm. It builds upon the concept of decision trees (DT) presented in the last session. The RF relies on many self-learning decision trees (hence “Forest”). The idea behind using many decision trees (i.e. an ensemble) is that many base learners together reach a stronger and more robust decision than a single DT. In contrast to the manual (expert-based) decision rules we defined last week, the RF uses self-learning decision trees, which automatically define a rule at each node based on a training dataset. At each node, the tree seeks the rule that minimizes the heterogeneity of the two resulting subsets of the data. Heterogeneity is here expressed as the Gini impurity index, and the rule creating the least heterogeneous subsets is used for the respective node. If you want to know how it is calculated, you may take a look at the advanced materials.
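As a minimal sketch of how such a node rule could be found, the following Python example computes the Gini impurity, 1 - Σ p_k², where p_k is the share of class k in a subset, and picks the threshold on a single feature that yields the least heterogeneous pair of subsets. The reflectance values, class names and function names are hypothetical and chosen only for illustration; actual RF implementations search across several features and are considerably more optimized.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two subsets created by the rule value <= threshold."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Test the midpoints between consecutive distinct values; return (threshold, impurity).

    Assumes the feature has at least two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, split_impurity(values, labels, t)) for t in candidates]
    return min(scored, key=lambda pair: pair[1])


# Hypothetical near-infrared reflectance values of six training pixels
nir = [0.05, 0.08, 0.10, 0.35, 0.40, 0.45]
classes = ["water", "water", "water", "forest", "forest", "forest"]

threshold, impurity = best_threshold(nir, classes)
print(f"best rule: NIR <= {threshold:.3f} (weighted Gini impurity {impurity:.2f})")
```

In this toy example the selected threshold lies between the two groups of values, so both subsets are pure and the weighted impurity is zero. With overlapping classes, no threshold reaches zero, and the tree keeps splitting the subsets further.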
Furthermore, each decision tree is randomized (thus “Random”). The RF has two layers of randomness. First, it uses a random sample of the training dataset (with replacement, i.e. a bootstrapped sample) for growing each individual decision tree. The second random component is a random selection of the features (e.g. spectral bands) considered at each node to determine the best rule for splitting the data and ultimately determining a class label.
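The two layers of randomness can be sketched in a few lines of Python. The training set size, the band names and the fixed seed are hypothetical. The number of features drawn per node is set to the square root of the number of bands, which is a common default but an assumption here, not something prescribed by this session.

```python
import math
import random

rng = random.Random(42)  # fixed seed only to make the sketch reproducible

# Hypothetical training set: indices of 10 training pixels and 9 spectral band names
sample_ids = list(range(10))
bands = [f"band_{i}" for i in range(1, 10)]

# Layer 1: bootstrapped sample for one tree, drawn with replacement
# and of the same size as the training set
bootstrap = rng.choices(sample_ids, k=len(sample_ids))

# Layer 2: random subset of features considered at one node
# (square-root rule assumed here)
n_candidates = round(math.sqrt(len(bands)))
node_bands = rng.sample(bands, k=n_candidates)

print("bootstrapped sample:", sorted(bootstrap))
print("bands considered at this node:", node_bands)
```

Because the sample is drawn with replacement, some training pixels typically appear several times in a tree's sample while others are not drawn at all, which is one reason why the individual trees differ from each other.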
Many trees may produce different class labels for the same data point. The final class assignment of each image pixel is thus based on the majority vote among all trees in the RF.
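The majority vote itself is simple to illustrate. The following sketch assumes that each tree has already predicted a class ID for one pixel; the votes and the function name are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class label among the per-tree predictions for one pixel."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical class IDs predicted by seven trees for a single pixel
votes = [1, 1, 5, 1, 2, 5, 1]
print(majority_vote(votes))  # prints 1
```

In case of a tie, this sketch returns the label that was encountered first; actual implementations may resolve ties differently.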
In remote sensing, thematic land cover and land use classes consist of multiple spectral classes (clusters of surfaces which have similar spectral properties). A broad thematic land cover class like “vegetation” should consider the spectral properties of various forests, grasslands, or crops. Similarly, a class like “forest” should consider the spectral properties of different forest types in the landscape, and an even more detailed class like “coniferous forest” should consider the spectral properties of potentially different species compositions and stand structures within this class. For training data generation, it is therefore essential to consider the different spectral properties of these surfaces, i.e. all spectral classes of a thematic class.
Defining the spectral classes well requires good knowledge of the class characteristics in the region of interest. Depending on its definition, a thematic class can have higher or lower within-class variability (or number of spectral classes). Now, if all spectral classes of a thematic class are nicely separable from the spectral classes of the other thematic classes, the separability of the thematic classes is high and we can likely achieve good classification results. This is for instance the case in the figure below.
Sometimes, however, spectral classes of different thematic classes overlap, i.e. the thematic classes look very similar spectrally. In this case, the thematic classes have a high between-class similarity: their separability decreases, or they may not be separable at all. This is the case if we add a class “built-up” to the above class catalog.
Supervised classification approaches like the RF classifier rely on training data to automatically assign a single class label to each image pixel. Collecting training data is time-consuming, regardless of whether you collect it in the field or on-screen. Considering the following aspects can help to make the process easier and prevent mistakes.
Define your thematic classes well for the study region. This works well if you first think about the different spectral classes that may be contained in a thematic class, such as tree species within forest types, or different types of herbaceous vegetation.
Training data should be well distributed across the study region to cover the regional biophysical variability, such as different soil types, or topography.
Gather as much reference information as possible. Sometimes, you can find additional datasets that guide your interpretation. Very high resolution (VHR) imagery is available through Google Earth for many world regions. It may be critical to account for the exact acquisition date of the VHR data, which you can identify with the “historical imagery” tool.
Please use the Sentinel-2 summer image (acquisition date 26.07.2019, 20 m, 9 spectral bands) you prepared in session 06 and used in session 07. If you do not have it available anymore, you can find it in the materials for session 07.
The goal of this exercise is to collect training data and to perform an RF-based land cover classification for Berlin. Training data collection is based on manual digitization in QGIS; the RF classification is based on the Classification Workflow application provided in the EnMAP-Box.
Visualize the Sentinel-2 summer image in QGIS.
Generate a new point shapefile with the filename ‘training_data.shp’. Set the Geometry Type to Point and use the CRS EPSG:32633. Next, add an attribute column “lc_id” of type integer.
| Class ID | Class description |
|---|---|
| 1 | Urban (built-up and non built-up) |
| 2 | Grass & Crops |
| 5 | Soil (incl. harvested cropland) |
For each point, note the ‘Class ID’ from the table above in the attribute field ‘lc_id’.
Make sure that your training points are spatially evenly distributed across the entire region. Each training point can belong to one class only. Try to cover all spectral classes within each thematic class.
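A quick count of points per class can help to spot classes that are missing or underrepresented before running the classification. The sketch below works on a plain list of `lc_id` values; the values are hypothetical, and reading them from the attribute table of ‘training_data.shp’ is left out here. The list of expected IDs contains only the class IDs that appear in the table above and should be adapted to the full class catalog.

```python
from collections import Counter

# Hypothetical lc_id values, e.g. copied from the attribute table of training_data.shp
lc_ids = [1, 1, 2, 5, 2, 1, 5, 5, 2, 1]

# Class IDs expected in the class catalog (assumption: adapt to your catalog)
expected_ids = [1, 2, 5]

counts = Counter(lc_ids)
for class_id in expected_ids:
    print(f"class {class_id}: {counts.get(class_id, 0)} points")

# Classes without any training point, and IDs that are not part of the catalog (e.g. typos)
missing = [c for c in expected_ids if counts.get(c, 0) == 0]
unexpected = sorted(set(lc_ids) - set(expected_ids))
print("classes without points:", missing)
print("unexpected class IDs:", unexpected)
```

Note that such a count says nothing about the spatial distribution of the points or about the coverage of spectral classes; both still need to be checked visually in the map.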
More tips for training data collection:
Open the EnMAP-Box and visualize the Sentinel-2 image and the “training_data.shp” shapefile in a new Map window.
Open the ‘Classification Workflow’ via ‘Applications’ to run an RF classification.
Visualize the classification result and establish a link with the Sentinel-2 image.
Assess the quality of the land cover classification:
Expand your training dataset by collecting further training points for the problematic classes. Consider including mixed pixels that are dominated by impervious surfaces (> 50%) in the “urban” class.
Repeat the RF classification with the expanded training points and critically evaluate the revised classification result.
Repeat the revision procedure until you reach a satisfying final classification result. Note that a certain degree of confusion between thematic classes likely remains due to the overlap of spectral classes of different thematic classes.
Please upload the final classification result (map incl. legend) and the discussion as a PDF in Moodle.
General submission notes: The submission deadline for the weekly assignment is always the following Monday at 10 am. Please use the naming convention indicating the session number and the family names of all students in the respective team, e.g. ‘s01_surname1_surname2_surname3_surname4.pdf’. Each team member has to upload the assignment individually. Provide single-file submissions; in case you have to submit multiple files, create a *.zip archive.
Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.