Image classification#

Machine learning is often categorized as a subfield of artificial intelligence that “uses statistical techniques to give computers the ability to learn without being explicitly programmed” (Arthur 1959, Wikipedia). It evolved from pattern recognition and originally referred to the construction of algorithms that can learn from and make predictions on data (Kohavi & Provost 1998).

Categories:

1.) Supervised learning: modelling the relationship between measured features and some associated label. The fitted model is often applied afterwards to new, unknown data. Subcategories: classification (discrete labels) and regression (continuous labels)

2.) Unsupervised learning: modelling features of a dataset without reference to any label. Main applications: clustering (i.e., identification of distinct groups) and dimensionality reduction (i.e., search for more distinct representations of data). A minimal code sketch of both categories follows below.
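
As a quick, illustrative sketch of both categories, here is a minimal scikit-learn example using its built-in iris dataset; the dataset and the chosen algorithms (SVC, KMeans) are only for illustration and are not part of this exercise:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X_iris, y_iris = load_iris(return_X_y=True)   # feature matrix and known labels

# Supervised: learn the mapping from features to labels
clf = SVC().fit(X_iris, y_iris)
clf.predict(X_iris[:3])                       # predicted class labels for three samples

# Unsupervised: group the same features without using the labels
km = KMeans(n_clusters=3, n_init=10).fit(X_iris)
km.labels_[:3]                                # cluster assignments for three samples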

The Python Data Science Handbook has an excellent chapter (5) on machine learning with Python [VanderPlas, 2016]. Please carefully read this chapter and/or follow the accompanying notebooks.

In this course, we will use the scikit-learn library. Python’s scikit-learn contains a large collection of algorithms for classification, regression, clustering, transformations, feature extraction, model selection, and more. Check out their webpage for more information.

Imports#

%matplotlib inline
import os
import pandas as pd
import numpy as np
from osgeo import gdal
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # z-transformation
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC # Support vector classifier
from sklearn import metrics # Accuracy assessment
import geopy_ml_fun as geopy # Visualize and save map

gdal.UseExceptions()

inpath = "../../data/landsat_stm_berlin"
outpath = "../../outputs/machine_learning"
refpath = "../../data/ml_2022"

Example#

In this example, we train and evaluate a support vector machine classification to produce a land cover map, using a sample of Landsat pixels with known reference labels (land cover) and extracted statistical features as predictors. Before starting, we need to standardize the feature matrix. Then, we split the reference dataset 50:50 into model training and testing data. Based on the training data, we apply a grid search to tune the hyperparameters. Using the test data, we then estimate the overall accuracy and the class-wise accuracies. Finally, we apply the model to the Landsat feature image to create a land cover map.

Reference data#

berlin_landsat_sample.csv: The reference dataset contains a sample of 332 Landsat pixels. Each sample pixel has a reference class label (“class”) based on the LUCAS survey and 36 satellite data features. The reference labels comprise 6 land cover classes (Tab. 1). The satellite features are spectral-temporal metrics of the spectral bands blue (BLU), green (GRN), red (RED), near-IR (NIR), and two short-wave IR bands (SW1 and SW2). The column names show the band name abbreviation followed by the statistical metric: standard deviation (STD) and the 5th, 25th, 50th, 75th, and 95th percentile of the temporal distribution between March and October of 2018.

Tab. 1: Land cover classes

| Id | Class name         |
|----|--------------------|
| 1  | Artificial land    |
| 2  | Cropland           |
| 3  | Grassland          |
| 4  | Forest broadleaved |
| 5  | Forest coniferous  |
| 6  | Water              |

Since the reference data is in CSV format, we can easily read it into a pandas DataFrame. As you can see, the dataset has 332 rows (observations) and 37 columns.

lucas = pd.read_csv(os.path.join(refpath, 'berlin_landsat_sample.csv'))
lucas.shape
(332, 37)
lucas.head()
class BLU_Q05 BLU_Q25 BLU_Q50 BLU_Q75 BLU_Q95 BLU_STD GRN_Q05 GRN_Q25 GRN_Q50 ... SW1_Q50 SW1_Q75 SW1_Q95 SW1_STD SW2_Q05 SW2_Q25 SW2_Q50 SW2_Q75 SW2_Q95 SW2_STD
0 2 464 656 797 992 1084 228 811 953 1046 ... 2731 3208 3307 768 653 1105 2024 2373 2583 714
1 2 416 640 871 992 1105 218 646 950 1140 ... 2693 3008 3394 617 816 1350 1925 2629 2838 748
2 3 376 435 475 537 568 62 738 740 772 ... 2384 2467 2740 218 1113 1298 1395 1597 1638 192
3 2 404 827 947 996 1003 236 675 1159 1312 ... 3365 3425 3823 796 732 1414 2545 2851 3114 861
4 2 469 615 758 1327 1358 343 982 1051 1120 ... 2126 3423 3682 921 598 908 1293 2806 2993 954

5 rows × 37 columns

Landsat data#

X0069_Y0043_landsat_stm_2018.tif: The Landsat image contains 36 statistical metrics derived from Landsat observations acquired over the year 2018 (March to October). They are the same statistical metrics as in the reference dataset.

lnd_file = os.path.join(inpath, "X0069_Y0043_landsat_stm_2018.tif")
ds_src = gdal.Open(lnd_file)
lnd_img = ds_src.ReadAsArray()
lnd_img.shape
(36, 3000, 3000)

To access the name of a band, we need to create a band object and call GetDescription() on it. If we want to retrieve all band names, we need to loop through all raster band objects. Below, we write that loop inside a list comprehension. Note that we must request the bands with x+1, since the GDAL band index starts at 1 while the Python index starts at 0.

bnames = [ds_src.GetRasterBand(x+1).GetDescription() for x in range(ds_src.RasterCount)]
bnames[:7]
['BLU_Q05', 'BLU_Q25', 'BLU_Q50', 'BLU_Q75', 'BLU_Q95', 'BLU_STD', 'GRN_Q05']

It is a good idea to visualize the Landsat image to check that we imported everything correctly. Let’s visualize the median image with the band combination NIR_Q50, RED_Q50, and GRN_Q50. To do that, we need to find the positions of those bands in the raster image. Again, we can write a list comprehension that loops over the band names, but this only gives us the position of a single band. If you are unsure what enumerate does, the short illustration below shows how it pairs each element with its index.
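
A quick illustration of enumerate (using the first three band names from above):

list(enumerate(['BLU_Q05', 'BLU_Q25', 'BLU_Q50']))
[(0, 'BLU_Q05'), (1, 'BLU_Q25'), (2, 'BLU_Q50')]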

[i for i, x in enumerate(bnames) if x == "NIR_Q50"]
[14]

To get the positions of all three bands, we can write a nested list comprehension that combines looping over the band names of the raster and the band names of our desired band combination.

band_combination = ["NIR_Q50", "RED_Q50", "GRN_Q50"]
[[i for i, x in enumerate(bnames) if x == k] for k in band_combination]
[[14], [20], [8]]

This is to show you the power of list comprehensions in Python. Don’t be afraid to use them, unless you are working with very large datasets. We can also take advantage of the search functionality of pandas Series and DataFrames.

bc = pd.Series(bnames)
bc[bc.isin(["GRN_Q50", "RED_Q50", "NIR_Q50"])]
8     GRN_Q50
14    NIR_Q50
20    RED_Q50
dtype: object
geopy.plot_boa(lnd_img, bands=[14, 20, 8])
[Figure: Landsat median composite (NIR_Q50, RED_Q50, GRN_Q50) of the Berlin tile]

Data format#

In scikit-learn examples, it is customary to explicitly create the variables y (for the target vector or response) and X (for the feature matrix or predictors). You do not have to follow that convention, but it makes your code easier to read. Scikit-learn accepts numpy arrays as well as pandas DataFrames and Series. Here, we extract the target vector and feature matrix from the pandas DataFrame.

y = lucas['class']
y.shape
(332,)
X = lucas.drop(['class'], axis=1) # or X = lucas.iloc[:, 1:]
X.shape
(332, 36)

Feature scaling#

Feature scaling standardizes the predictors in your feature matrix. Feature scaling is important when predictors vary strongly in magnitude, e.g., when different units are involved. If such data is not scaled, machine learning algorithms that work with variances and/or distances will underperform because they will mainly utilize the features with the highest variance. To mitigate the problem, we rescale all features to the properties of a standard normal distribution with mean zero (μ = 0) and standard deviation one (σ = 1) using scikit-learn. The StandardScaler() works in two steps: 1) fit() estimates the distribution parameters mean and standard deviation, and 2) transform() applies the normalization. If you are not interested in keeping the distribution parameters, you can do the transformation in one step using fit_transform().

fittedScaler = StandardScaler().fit(X.astype('float32'))
X_scaled = fittedScaler.transform(X.astype('float32'))
X_scaled.shape
(332, 36)
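
As mentioned above, the two steps can be combined when the fitted scaler is not needed later; a minimal equivalent one-liner:

X_scaled = StandardScaler().fit_transform(X.astype('float32'))  # fit and transform in one step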

Splitting data#

Now, let’s split the dataset into a 50% training and a 50% test set.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.5)
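
Note that train_test_split draws a random split, so the accuracies further below may differ slightly between runs. If you want a reproducible and class-balanced split, you can optionally pass random_state and stratify (not used in this exercise; the seed value is arbitrary):

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.5,
    random_state=42,   # arbitrary seed for reproducibility
    stratify=y)        # preserve the class proportions in both splits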

Instantiate classifier#

First, we create an instance of the classifier. You can check the default hyperparameters with model.get_params().

svc_model = SVC()
svc_model.get_params()
{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

Tuning#

Tuning is the process of maximizing an algorithm’s performance without overfitting or underfitting. It is accomplished by selecting optimal hyperparameters for a model. A hyperparameter is a configuration that is external to the model and whose value cannot be estimated from the data. Scikit-learn offers two functions for hyperparameter tuning: GridSearchCV and RandomizedSearchCV. GridSearchCV performs an exhaustive search over specified parameter values for an estimator (or machine learning algorithm) and returns the best-performing hyperparameter combination. RandomizedSearchCV uses randomized selections of hyperparameters and therefore tends to be less computationally expensive. The SVM with an RBF kernel has two main hyperparameters: C and gamma. Below, we conduct a grid search using three-fold cross-validation to evaluate the performance of each hyperparameter combination.

param_grid = {'C': [1, 5, 10, 40, 50, 60],
              'gamma': [0.0005, 0.001, 0.005, 0.006, 0.007]}
grid = GridSearchCV(svc_model, param_grid, cv=3)

grid.fit(X_train, y_train)
grid.best_params_
{'C': 5, 'gamma': 0.006}

From the fitted grid search object, we extract the best parameters (see above) and the best model (see below).

lc_model = grid.best_estimator_
lc_model.get_params()
{'C': 5,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 0.006,
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
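
For comparison, the RandomizedSearchCV mentioned above is used in the same way but evaluates only a random subset of the parameter combinations. A minimal sketch reusing the same param_grid (n_iter and random_state are arbitrary choices, and this search is not run as part of the exercise):

from sklearn.model_selection import RandomizedSearchCV

# Sample 10 of the 30 possible C/gamma combinations at random
rand_search = RandomizedSearchCV(svc_model, param_grid, n_iter=10, cv=3, random_state=42)
rand_search.fit(X_train, y_train)
rand_search.best_params_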

Model accuracy#

Model accuracy can be described class-wise and overall. Precision (User’s accuracy) and Recall (Producer’s accuracy) describe the accuracy at the class level. Precision describes the proportion of positive identifications that were actually correct:

\(Precision = \frac{TP}{TP+FP}\)

Recall describes the proportion of actual positives that were identified correctly:

\(Recall = \frac{TP}{TP+FN}\)

where TP = true positives, FP = false positives, and FN = false negatives. To fully evaluate the effectiveness of a model, you must examine both precision and recall.

Apply model to the test data:

y_pred = lc_model.predict(X_test)
y_pred
array([5, 6, 6, 5, 4, 3, 5, 6, 6, 6, 6, 5, 3, 2, 5, 6, 5, 5, 6, 2, 3, 3,
       5, 5, 3, 4, 3, 5, 3, 2, 3, 2, 3, 3, 5, 5, 1, 5, 2, 6, 6, 4, 3, 5,
       6, 3, 1, 2, 5, 2, 5, 5, 3, 3, 2, 2, 4, 4, 5, 3, 1, 4, 5, 3, 1, 2,
       3, 3, 6, 6, 1, 2, 6, 2, 5, 6, 6, 3, 5, 6, 1, 5, 3, 3, 4, 2, 3, 5,
       5, 3, 2, 5, 3, 2, 5, 5, 5, 5, 5, 6, 3, 6, 3, 1, 6, 6, 6, 5, 1, 5,
       5, 5, 5, 6, 2, 3, 3, 6, 5, 3, 5, 3, 5, 1, 2, 5, 6, 6, 3, 5, 5, 1,
       6, 6, 5, 4, 1, 2, 3, 3, 3, 2, 6, 3, 4, 1, 4, 5, 6, 3, 6, 5, 6, 5,
       5, 3, 1, 5, 3, 3, 5, 1, 4, 4, 6, 6])

What is the probability that a pixel is correctly classified?

print('Accuracy: %.4f' % metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.8554

Confusion matrix#

metrics.confusion_matrix(y_test, y_pred)
array([[11,  0,  4,  1,  3,  0],
       [ 0, 18,  4,  0,  0,  0],
       [ 0,  1, 28,  0,  1,  0],
       [ 3,  0,  3, 11,  4,  0],
       [ 0,  0,  0,  0, 40,  0],
       [ 0,  0,  0,  0,  0, 34]])
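
In the confusion matrix above, rows correspond to the reference labels and columns to the predicted labels (both in class order 1–6). As a quick check, we can derive the class-wise scores of class 1 (artificial land) by hand: its column contains 11 true positives and 3 false positives, and its row contains 8 false negatives, so

\(Precision_1 = \frac{11}{11+3} \approx 0.79\) and \(Recall_1 = \frac{11}{11+8} \approx 0.58\),

which matches the values reported for class 1 below.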

Accuracy statistics#

aa_scores = metrics.precision_recall_fscore_support(y_test, y_pred)
aa_scores
(array([0.78571429, 0.94736842, 0.71794872, 0.91666667, 0.83333333,
        1.        ]),
 array([0.57894737, 0.81818182, 0.93333333, 0.52380952, 1.        ,
        1.        ]),
 array([0.66666667, 0.87804878, 0.8115942 , 0.66666667, 0.90909091,
        1.        ]),
 array([19, 22, 30, 21, 40, 34]))

Text report#

print(metrics.classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           1       0.79      0.58      0.67        19
           2       0.95      0.82      0.88        22
           3       0.72      0.93      0.81        30
           4       0.92      0.52      0.67        21
           5       0.83      1.00      0.91        40
           6       1.00      1.00      1.00        34

    accuracy                           0.86       166
   macro avg       0.87      0.81      0.82       166
weighted avg       0.87      0.86      0.85       166

Map predictions#

Before we apply the model to the image, we need to transform the 3-D image of shape (n_bands, n_rows, n_cols) into a 2-D array of shape (n_rows * n_cols, n_bands). To achieve that, we first move the n_bands dimension to the last axis with transpose. Then we collapse the row and column dimensions with reshape. Finally, we reshape the predictions back into the two-dimensional map shape (n_rows, n_cols).

# Get size of dimensions
n_bands, n_rows, n_cols = lnd_img.shape

# Transpose the array from (n_bands, n_rows, n_cols) to (n_rows, n_cols, n_bands)
flat_img = np.transpose(lnd_img, (1, 2, 0))

# Collapse the row and column dimension to get a 2-d array
flat_img = flat_img.reshape((n_rows * n_cols, n_bands))
print('Reshaped from {o} to {n}'.format(o=lnd_img.shape, n=flat_img.shape))
Reshaped from (36, 3000, 3000) to (9000000, 36)

Rescale image to mean 0 and unit variance.

fittedScaler = StandardScaler().fit(flat_img)
scaled_img = fittedScaler.transform(flat_img)

Make a prediction for each pixel using the flat image array:

lc_flat = lc_model.predict(scaled_img)

Reshape our classified flat array back into the image shape:

lc_map = lc_flat.reshape(lnd_img.shape[1:])
lc_map.shape
(3000, 3000)

Map visualization#

The course module geopy has a small plot_map() function that also takes a list of labels and a dictionary of color codes as input. The color codes are of the format [red, green, blue, alpha], and the values range from 0 to 1.

lcColors = {1: [1.0, 0.0, 0.0, 1.0],  # Artificial
            2: [1.0, 1.0, 0.0, 1.0],  # Cropland
            3: [0.78, 0.78, 0.39, 1.0],  # Grassland
            4: [0.0, 0.78, 0.0, 1.0],  # Forest, broadleaf
            5: [0.0, 0.39, 0.39, 1.0],  # Forest, conifer
            6: [0.0, 0.0, 1.0, 1.0]}  # Water

labels = ['artificial land', 'cropland', 'grassland', 'forest broadleaved', 'forest coniferous', 'water']
geopy.plot_map(lc_map, lcColors, labels)
[Figure: classified land cover map of the Berlin Landsat tile]

Map export#

We could just save the map as a standard 8-bit (np.ubyte) image. The function below also embeds a predefined color table into the GeoTIFF file. The color values range from 0 to 255.

colors = dict((
    (1, (255, 0, 0)),  # Artificial
    (2, (255, 255, 0)),  # Cropland
    (3, (200, 200, 100)),  # Grassland
    (4, (0, 200, 0)),  # Forest broadleaf
    (5, (0, 100, 100)),  # Forest conifer
    (6, (0, 0, 255))  # Water
))

out_file = os.path.join(outpath, "landsat_landcover_2018.tif")
geopy.writeArray(lc_map.astype(np.ubyte), out_file, lnd_file, colors=colors)