Image classification#
Machine learning is often categorized as a subfield of artificial intelligence that “uses statistical techniques to give computers the ability to learn without being explicitly programmed” (Arthur 1959, Wikipedia). It evolved from pattern recognition and originally referred to the construction of algorithms that can learn from and make predictions on data (Kohavi & Provost 1998).
Categories:
1.) Supervised learning: Modelling the relationship between measured features and some associated label; the fitted model is then often applied to new, unseen data. Subcategories: classification (discrete labels) and regression (continuous labels).
2.) Unsupervised learning: Modelling features of a dataset without reference to any label. Main applications: clustering (i.e., identification of distinct groups) and dimensionality reduction (i.e., search for more succinct representations of the data). A short code sketch contrasting the two categories follows below.
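As a minimal illustration of the two categories, here is a short sketch using scikit-learn's built-in iris toy data; the particular estimators (SVC and KMeans) are only examples:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cluster import KMeans
X, y = load_iris(return_X_y=True)
# Supervised: learn the mapping from features X to the known labels y
clf = SVC().fit(X, y)
labels_pred = clf.predict(X)
# Unsupervised: group the same features without using the labels at all
clusters = KMeans(n_clusters=3, n_init=10).fit(X).labels_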
The Python Data Science Handbook has an excellent chapter (5) on machine learning with Python [VanderPlas, 2016]. Please carefully read this chapter and/or follow the accompanying notebooks.
In this course, we will use the scikit-learn
library. It contains a large collection of algorithms for classification, regression, clustering, transformations, feature extraction, model selection, and more. Check out the scikit-learn webpage for more information.
Imports#
%matplotlib inline
import os
import pandas as pd
import numpy as np
from osgeo import gdal
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # z-transformation
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC # Support vector classifier
from sklearn import metrics # Accuracy assessment
import geopy_ml_fun as geopy # Visualize and save map
gdal.UseExceptions()
inpath = "../../data/landsat_stm_berlin"
outpath = "../../outputs/machine_learning"
refpath = "../../data/ml_2022"
Example#
In this example, we train and evaluate a support vector machine classifier to produce a land cover map, using a sample of Landsat pixels with known reference labels (land cover) and extracted statistical features as predictors. Before starting, we need to standardize the feature matrix. Then, we split the reference dataset 50:50 into model training and testing data. Based on the training data, we apply a grid search to tune the hyperparameters. Using the test data, we estimate the overall accuracy and the class-wise accuracies. Finally, we apply the model to the Landsat feature image to create a land cover map.
Reference data#
berlin_landsat_sample.csv
: The reference dataset contains a sample of 332 Landsat pixels. Each sample pixel has a reference class label (“class”) based on the LUCAS survey and 36 satellite data features. The reference labels comprise 6 land cover classes (Tab. 1). The satellite features are spectral-temporal metrics of the spectral bands blue (BLU), green (GRN), red (RED), near-IR (NIR), and two short-wave IR bands (SW1 and SW2). The column names show the band name abbreviation followed by the statistical metric: standard deviation (STD) and the 5th, 25th, 50th, 75th, and 95th percentile of the temporal distribution of observations between March and October 2018.
| Id | Class name |
|---|---|
| 1 | Artificial land |
| 2 | Cropland |
| 3 | Grassland |
| 4 | Forest broadleaved |
| 5 | Forest coniferous |
| 6 | Water |
Since the reference is in CSV format, we can easily read it as a pandas DataFrame. As you can see, the dataset has 332 rows (observations) and 37 columns.
lucas = pd.read_csv(os.path.join(refpath, 'berlin_landsat_sample.csv'))
lucas.shape
(332, 37)
lucas.head()
| | class | BLU_Q05 | BLU_Q25 | BLU_Q50 | BLU_Q75 | BLU_Q95 | BLU_STD | GRN_Q05 | GRN_Q25 | GRN_Q50 | ... | SW1_Q50 | SW1_Q75 | SW1_Q95 | SW1_STD | SW2_Q05 | SW2_Q25 | SW2_Q50 | SW2_Q75 | SW2_Q95 | SW2_STD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 464 | 656 | 797 | 992 | 1084 | 228 | 811 | 953 | 1046 | ... | 2731 | 3208 | 3307 | 768 | 653 | 1105 | 2024 | 2373 | 2583 | 714 |
| 1 | 2 | 416 | 640 | 871 | 992 | 1105 | 218 | 646 | 950 | 1140 | ... | 2693 | 3008 | 3394 | 617 | 816 | 1350 | 1925 | 2629 | 2838 | 748 |
| 2 | 3 | 376 | 435 | 475 | 537 | 568 | 62 | 738 | 740 | 772 | ... | 2384 | 2467 | 2740 | 218 | 1113 | 1298 | 1395 | 1597 | 1638 | 192 |
| 3 | 2 | 404 | 827 | 947 | 996 | 1003 | 236 | 675 | 1159 | 1312 | ... | 3365 | 3425 | 3823 | 796 | 732 | 1414 | 2545 | 2851 | 3114 | 861 |
| 4 | 2 | 469 | 615 | 758 | 1327 | 1358 | 343 | 982 | 1051 | 1120 | ... | 2126 | 3423 | 3682 | 921 | 598 | 908 | 1293 | 2806 | 2993 | 954 |
5 rows × 37 columns
Landsat data#
X0069_Y0043_landsat_stm_2018.tif
: The Landsat image contains 36 statistical metrics derived from Landsat observations acquired over the year 2018 (March to October). The statistical metrics are the same as those in the reference dataset.
lnd_file = os.path.join(inpath, "X0069_Y0043_landsat_stm_2018.tif")
ds_src = gdal.Open(lnd_file)
lnd_img = ds_src.ReadAsArray()
lnd_img.shape
(36, 3000, 3000)
To access the name of a band, we need to create a band object and call GetDescription() on it. If we want to retrieve all band names, we need to loop through all raster band objects. Below, we use a list comprehension to achieve that. Note that we must request bands with x+1, since the GDAL band index starts at 1 while the Python index starts at 0.
bnames = [ds_src.GetRasterBand(x+1).GetDescription() for x in range(ds_src.RasterCount)]
bnames[:7]
['BLU_Q05', 'BLU_Q25', 'BLU_Q50', 'BLU_Q75', 'BLU_Q95', 'BLU_STD', 'GRN_Q05']
It is a good idea to visualize the Landsat image to see if we imported everything correctly. Let’s visualize the median image with the band combination NIR_Q50, RED_Q50, and GRN_Q50. To do that, we need to find the positions of those bands in the raster image. Again, we can write a list comprehension to loop over the band names, but that gives us only a single position. To understand what enumerate does, click here.
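In short, enumerate pairs each element of a sequence with its position, e.g. for the first three band names:
list(enumerate(['BLU_Q05', 'BLU_Q25', 'BLU_Q50']))  # [(0, 'BLU_Q05'), (1, 'BLU_Q25'), (2, 'BLU_Q50')]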
[i for i, x in enumerate(bnames) if x == "NIR_Q50"]
[14]
To get the positions of all three bands, we can write a nested list comprehension that combines looping over the band names of the raster and the band names of our desired band combination.
band_combination = ["NIR_Q50", "RED_Q50", "GRN_Q50"]
[[i for i, x in enumerate(bnames) if x == k] for k in band_combination]
[[14], [20], [8]]
This is to show you the power of list comprehensions in Python. Don’t be afraid to use them, unless you are working with very large datasets, where vectorized operations are usually faster. We can also take advantage of the search functionality of pandas Series and DataFrames.
bc = pd.Series(bnames)
bc[bc.isin(["GRN_Q50", "RED_Q50", "NIR_Q50"])]
8 GRN_Q50
14 NIR_Q50
20 RED_Q50
dtype: object
geopy.plot_boa(lnd_img, bands=[14, 20, 8])
Data format#
In scikit-learn examples, it is customary to explicitly create the variables y
(for the target vector or response) and X
(for the feature matrix or predictors). You do not have to follow that convention, but it makes your code easier to read. Scikit-learn accepts numpy arrays as well as pandas DataFrames and Series. Here, we extract the target vector and feature matrix from the pandas DataFrame.
y = lucas['class']
y.shape
(332,)
X = lucas.drop(['class'], axis=1) # or X = lucas.iloc[:, 1:]
X.shape
(332, 36)
Feature scaling#
Feature scaling standardizes the predictors in your feature matrix. Feature scaling is important when predictors vary greatly in magnitude, e.g., when different units are involved. If such data are not scaled, machine learning algorithms that work with variances and/or distances will underperform because they effectively rely only on the features with the highest variance. To mitigate the problem, we rescale all features to the properties of a standard normal distribution with mean zero (μ = 0) and standard deviation one (σ = 1) using scikit-learn. The StandardScaler() works in two steps: 1) fit() estimates the distribution parameters mean and standard deviation, and 2) transform() applies the standardization. If you are not interested in keeping the distribution parameters, you can do the transformation in one step using fit_transform().
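In other words, each feature value \(x\) is transformed to \(z = \frac{x - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the feature's mean and standard deviation estimated by fit().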
fittedScaler = StandardScaler().fit(X.astype('float32'))
X_scaled = fittedScaler.transform(X.astype('float32'))
X_scaled.shape
(332, 36)
Splitting data#
Now, let’s split the dataset into a 50% training and a 50% test set.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.5)
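Note that train_test_split draws the split randomly, so your exact numbers further below may differ between runs. If you want a reproducible split that preserves the class proportions in both halves, you could additionally pass random_state and stratify (the seed value here is arbitrary):
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.5, random_state=42, stratify=y)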
Instantiate classifier#
First, we create an instance of the classifier. You can check the default hyperparameters with model.get_params()
.
svc_model = SVC()
svc_model.get_params()
{'C': 1.0,
'break_ties': False,
'cache_size': 200,
'class_weight': None,
'coef0': 0.0,
'decision_function_shape': 'ovr',
'degree': 3,
'gamma': 'scale',
'kernel': 'rbf',
'max_iter': -1,
'probability': False,
'random_state': None,
'shrinking': True,
'tol': 0.001,
'verbose': False}
Tuning#
Tuning is the process of maximizing an algorithm’s performance without overfitting or underfitting. Tuning is accomplished by selecting optimal hyperparameters for a model. A hyperparameter is a configuration setting that is external to the model and whose value cannot be estimated from the data. Scikit-learn offers two functions for hyperparameter tuning: GridSearchCV and RandomizedSearchCV. GridSearchCV performs an exhaustive search over specified parameter values for an estimator (or machine learning algorithm) and returns the best performing hyperparameter combination. RandomizedSearchCV evaluates randomly selected hyperparameter combinations and therefore tends to be less computationally expensive. The SVM with an RBF kernel has two main hyperparameters: C and gamma. Below, we conduct a grid search using three-fold cross-validation to evaluate the performance of each hyperparameter combination.
param_grid = {'C': [1, 5, 10, 40, 50, 60],
'gamma': [0.0005, 0.001, 0.005, 0.006, 0.007]}
grid = GridSearchCV(svc_model, param_grid, cv=3)
grid.fit(X_train, y_train)
grid.best_params_
{'C': 5, 'gamma': 0.006}
From the fitted grid search object, we extract the best parameters (see above) and the best model (see below).
lc_model = grid.best_estimator_
lc_model.get_params()
{'C': 5,
'break_ties': False,
'cache_size': 200,
'class_weight': None,
'coef0': 0.0,
'decision_function_shape': 'ovr',
'degree': 3,
'gamma': 0.006,
'kernel': 'rbf',
'max_iter': -1,
'probability': False,
'random_state': None,
'shrinking': True,
'tol': 0.001,
'verbose': False}
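As a side note, the RandomizedSearchCV alternative mentioned above is used in the same way. Here is a minimal sketch that samples C and gamma from log-uniform distributions; the ranges and n_iter are only illustrative:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
# Randomly sample 20 hyperparameter combinations instead of testing the full grid
param_dist = {'C': loguniform(1, 100), 'gamma': loguniform(1e-4, 1e-2)}
rnd_search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=3, random_state=42)
rnd_search.fit(X_train, y_train)
rnd_search.best_params_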
Model Accuracy#
Model accuracy can be described class-wise and overall. Precision (user’s accuracy) and recall (producer’s accuracy) describe the accuracy at the class level. Precision describes the proportion of positive identifications that was actually correct:
\(Precision = \frac{TP}{TP+FP}\)
Recall describes the proportion of actual positives that were identified correctly:
\(Recall = \frac{TP}{TP+FN}\)
where TP: true positives, FP: false positives, and FN: false negatives. To fully evaluate the effectiveness of a model, you must examine both precision and recall.
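The classification report shown further below also includes the F1-score, the harmonic mean of precision and recall:
\(F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\)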
Apply model to the test data:
y_pred = lc_model.predict(X_test)
y_pred
array([5, 6, 6, 5, 4, 3, 5, 6, 6, 6, 6, 5, 3, 2, 5, 6, 5, 5, 6, 2, 3, 3,
5, 5, 3, 4, 3, 5, 3, 2, 3, 2, 3, 3, 5, 5, 1, 5, 2, 6, 6, 4, 3, 5,
6, 3, 1, 2, 5, 2, 5, 5, 3, 3, 2, 2, 4, 4, 5, 3, 1, 4, 5, 3, 1, 2,
3, 3, 6, 6, 1, 2, 6, 2, 5, 6, 6, 3, 5, 6, 1, 5, 3, 3, 4, 2, 3, 5,
5, 3, 2, 5, 3, 2, 5, 5, 5, 5, 5, 6, 3, 6, 3, 1, 6, 6, 6, 5, 1, 5,
5, 5, 5, 6, 2, 3, 3, 6, 5, 3, 5, 3, 5, 1, 2, 5, 6, 6, 3, 5, 5, 1,
6, 6, 5, 4, 1, 2, 3, 3, 3, 2, 6, 3, 4, 1, 4, 5, 6, 3, 6, 5, 6, 5,
5, 3, 1, 5, 3, 3, 5, 1, 4, 4, 6, 6])
What is the probability that a pixel is correctly classified?
print('Accuracy: %.4f' % metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.8554
Confusion matrix#
metrics.confusion_matrix(y_test, y_pred)
array([[11, 0, 4, 1, 3, 0],
[ 0, 18, 4, 0, 0, 0],
[ 0, 1, 28, 0, 1, 0],
[ 3, 0, 3, 11, 4, 0],
[ 0, 0, 0, 0, 40, 0],
[ 0, 0, 0, 0, 0, 34]])
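The overall accuracy from above is simply the sum of the confusion matrix diagonal (correctly classified test pixels) divided by the total number of test pixels:
cm = metrics.confusion_matrix(y_test, y_pred)
cm.trace() / cm.sum()  # 142 / 166 ≈ 0.8554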
Accuracy statistics#
aa_scores = metrics.precision_recall_fscore_support(y_test, y_pred)
aa_scores
(array([0.78571429, 0.94736842, 0.71794872, 0.91666667, 0.83333333,
1. ]),
array([0.57894737, 0.81818182, 0.93333333, 0.52380952, 1. ,
1. ]),
array([0.66666667, 0.87804878, 0.8115942 , 0.66666667, 0.90909091,
1. ]),
array([19, 22, 30, 21, 40, 34]))
Text report#
print(metrics.classification_report(y_test, y_pred))
precision recall f1-score support
1 0.79 0.58 0.67 19
2 0.95 0.82 0.88 22
3 0.72 0.93 0.81 30
4 0.92 0.52 0.67 21
5 0.83 1.00 0.91 40
6 1.00 1.00 1.00 34
accuracy 0.86 166
macro avg 0.87 0.81 0.82 166
weighted avg 0.87 0.86 0.85 166
Map predictions#
Before we apply the model to the image, we need to transform the 3-D image with shape (n_bands, n_row, n_col) into a 2-D array of shape (n_row * n_col, n_bands). To achieve that, we first move the n_bands dimension to the last axis with transpose. Then we collapse the row and column dimensions with reshape. After prediction, we reshape the flat prediction vector back into the 2-D map shape (n_row, n_col).
# Get size of dimensions
n_bands, n_rows, n_cols = lnd_img.shape
# Transpose the array from (n_bands, n_rows, n_cols) to (n_rows, n_cols, n_bands)
flat_img = np.transpose(lnd_img, (1, 2, 0))
# Collapse the row and column dimension to get a 2-d array
flat_img = flat_img.reshape((n_rows * n_cols, n_bands))
print('Reshaped from {o} to {n}'.format(o=lnd_img.shape, n=flat_img.shape))
Reshaped from (36, 3000, 3000) to (9000000, 36)
Rescale image to mean 0 and unit variance.
fittedScaler = StandardScaler().fit(flat_img)
scaled_img = fittedScaler.transform(flat_img)
Make a prediction for each pixel using the flat image array:
lc_flat = lc_model.predict(scaled_img)
Reshape our classified flat array back into the image shape:
lc_map = lc_flat.reshape(lnd_img.shape[1:])
lc_map.shape
(3000, 3000)
Map visualization#
The course module geopy has a small plot_map()
function that also takes a list of labels and a dictionary of color codes as input. The color codes are of the format [red, green, blue, alpha], and the values range from 0 to 1.
lcColors = {1: [1.0, 0.0, 0.0, 1.0], # Artificial
2: [1.0, 1.0, 0.0, 1.0], # Cropland
3: [0.78, 0.78, 0.39, 1.0], # Grassland
4: [0.0, 0.78, 0.0, 1.0], # Forest, broadleaf
5: [0.0, 0.39, 0.39, 1.0], # Forest, conifer
6: [0.0, 0.0, 1.0, 1.0]} # Water
labels = ['artificial land', 'cropland', 'grassland', 'forest broadleaved', 'forest coniferous', 'water']
geopy.plot_map(lc_map, lcColors, labels)
Map export#
We could just save the map as a standard 8-bit (np.ubyte) image. The function below also embeds a predefined color ramp into the GeoTIFF file. The color values range from 0 to 255.
colors = dict((
(1, (255, 0, 0)), # Artificial
(2, (255, 255, 0)), # Cropland
(3, (200, 200, 100)), # Grassland
(4, (0, 200, 0)), # Forest broadleaf
(5, (0, 100, 100)), # Forest conifer
(6, (0, 0, 255)) # Water
))
out_file = os.path.join(outpath, "landsat_landcover_2018.tif")
geopy.writeArray(lc_map.astype(np.ubyte), out_file, lnd_file, colors=colors)