Learning goals

In the last session you were introduced to raster data, heard about some basic properties and data formats, and learned how objects appear in true- and false-colour representations. Our overarching task remains. Yet before we can jump straight into mapping and analyzing the urban greenness of Berlin, we must first …

  • … deepen our understanding of properties and metadata of remotely sensed imagery
  • … learn how to handle spatial raster data in R
Why code?
Why work with geospatial data, or any data really, in a programming environment and not only stick with Graphical User Interfaces (GUIs) like QGIS? For us it is particularily the flexibility of processing data and not being restricted to predetermined tools and functions. Also we often find ourselves having to apply a certain chain of processes repeatedly over different data. It is particularily the reusability of code that outcompetes working in a GUI in terms of efficiency. Last, getting acquainted with any programming language is generally a great and much in demand skill.
Help with coding
Once you enter the world of programming and you will come across syntax errors or tasks in which you need help with writing code. You are learning new a language after all. Good news is there are many excellent tutorials and community feedback out there which will help you tremendously in achieving your tasks. One great resource is the Stack Overflow community where people can ask code related questions and seek help from fellow (advanced) programmers. Make sure to Google your problem and chances are very high you come across a forum entry precisely matching your needs. Then, of course, there is ChatGPT. Last, do not unterestimate your own problem solving skills which will grow quickly in terms of coding with time spent doing so.

Raster data in R

Last week you learned that raster data stored in image formats are essentially a collection of pixels stored along rows and columns in matrices or tables. But how do we deal with such data in R? The terra package offers useful classes and functions which facilitate raster data handling in R. It allows us to access file characteristics before loading data into memory, manipulate coordinate reference systems and spatial extents. We can use it to perform raster algebra, to combine raster and vector datasets (e.g. ESRI shapefiles), or to convert raster files into matrices, which then become compatible with base-R functions to manipulate data, do statistics, develop models, and much more. The terra package supersedes the older raster package developed by the same team of authors. Because raster has been around for a longer time, dependencies with some packages may still exist. But development of raster will eventually be stopped at some point. terra’s classes and functions are well described in the package documentation and on the package’s webpage.

The most important classes within the package are:

  • SpatRaster to store raster data
  • SpatVector to store spatial vector data, i.e., points, lines, polygons.
  • SpatExtent to store information on spatial extents

There are a couple of things that the terra package does not provide or that are simply better done in a GUI like QGIS. For example, collecting training and validation data are preferably done in a GIS environment and processing large data volumes is preferably achieved in other languages than R.

Welcome to the terra world

With your R console opened in RStudio you can install and then load the terra package like any other package in R as follows:

# install the terra package
install.packages('terra')

# load the package to get access to the package's routines/functions in your environment
library(terra)

Once the library is loaded we can start reading geospatial data into R. Let’s start with last week’s Landsat-8 image covering Berlin. We need to provide the filepath to the image and then use terra’s rast()-function to read the file into a SpatRaster object.

# create a new string variable containing the path to your image file
path_landsat <- "data/Landsat-8/LC08_L1TP_193023_20190726_20200827_02_T1_int16.tif"
# read the file into raster object
img_landsat <- rast(path_landsat)
# print metadata to console
print(img_landsat)
## class       : SpatRaster 
## dimensions  : 1840, 2171, 6  (nrow, ncol, nlyr)
## resolution  : 30.00639, 30.00728  (x, y)
## extent      : 360793.1, 425937, 5794470, 5849683  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 33N (EPSG:32633) 
## source      : LC08_L1TP_193023_20190726_20200827_02_T1_int16.tif 
## names       :  BLU,  GRN,  RED,  NIR,   SW1,  SW2 
## min values  :  639,  371,  227,   60,    23,   16 
## max values  : 5603, 5984, 6728, 7454, 11152, 8946

You can see that the print(img_landsat) command prints a bunch of image properties to the console, including min-max statistics for each band.

Metadata
Metadata is data about data. It includes the already familiar image properties as well as conditions recorded during data acquisition (e.g. sun angle, time of day), or information added during pre-processing to reconstructr how the data came to be. Metadata are either stored as part of an integrated data format, i.e. stored in the same file, or as an additional file accompanying the image data. Example GeoTIFF: metadata stored within .tif image. Displaying metadata possible with specific metadata reader (e.g. QGIS Layer properties, R functions).

Next, we can visualize our image in the plot window of RStudio using the plot() function:

plot(img_landsat)

By default, this creates a single plot for each band without contrast enhancement and using a single-band pseudocolour representation. If we would like to create a multi-band RGB composite, we can use plotRGB to achieve this.

Exercise 1

Create a false-colour composite using linear contrast stretching with the following band arrangement: R: SW2, G: NIR, B: RED. Write help(plotRGB) or ?plotRGB into the console to learn how to use this function, i.e. the arguments of the function. Hint: you need 5 arguments of the function and the index, i.e. the position of each band along the z-dimension of the image. Below you find the desired visual result.

plotRGB(img_landsat, r=6, g=4, b=3, stretch = "lin", axes=TRUE)


Sentinel-2 data

Recall our overarching goal to map the land cover of Berlin to assess the urban greenness of the city. By considering different data sources, a colleague of yours came to the conclusion to use Sentinel-2 data for the recent land cover map. The reasoning is that Sentinel-2 compared to Landsat-8 offers up to 10 meter spatial resolution (compared to the 30 meter Landsat data), allowing for capturing a more heterogenous mosaic of objects such as in an urban environment.

Exercise 2

The Sentinel-2 data is already available to you FE1_data_S02-03 > Sentinel-2. Let’s start with a visual assessment in QGIS:

  • First, open the Landsat-8 image from last week. Second, open one of the Sentinel-2 .tif files of your choice (a window is likely to pop-up, asking which coordinate transformation to apply. Leave at default and click OK)
  • Choose an appropriate visualization-setting for both images to enhance contrast
  • Compare the two images: What differences can you identify between the two datasets? (Make use of the metadata if needed)
  • Download and open the binary mask delineating Berlin as our study area

In the following we will learn about 1) creating a multiband raster, 2) write/save data to disc, 3) choosing datatypes, 4) manipulating pixel values (raster algebra) and 5) reprojecting.


Raster stacking

The Sentinel-2 data is currently stored in single band images for each band. Our goal is now to create a multiband image - like the Landsat image - that contains all the bands in a single file. This process of merging bands into a multiband image is frequently called raster stacking.

There are a few prerequesites to successfully merge single bands into one image. First, the data must share the same spatial resolution because pixels must match each other. Second, the spatial extent of the images must be identical because also the overall dimensions must also match each other.

Let us take a look how to create a raster stack in R. We will see - as often with programming tasks - there are multiple ways of achieving the same result. Analogous to the code from above we could create a SpatRaster object by providing the filepath of each band to the rast() function:

# Create a vector using c() containing the filepaths;
# note that we explicitly consider the order, i.e. we start with blue and
# end with shortwave-infrared 2 (index = shorter to longer wavelengths)
l_bands <- c(
  'data/Sentinel-2/SEN2A_BOA_20190726_BLU_flt32.tif', 'data/Sentinel-2/SEN2A_BOA_20190726_GRN_flt32.tif',
  'data/Sentinel-2/SEN2A_BOA_20190726_RED_flt32.tif', 'data/Sentinel-2/SEN2A_BOA_20190726_NIR_flt32.tif',
  'data/Sentinel-2/SEN2A_BOA_20190726_SW1_flt32.tif', 'data/Sentinel-2/SEN2A_BOA_20190726_SW2_flt32.tif'
)
# provide vector of filenames to rast() function; check "?rast" for details
img_s2 <- rast(l_bands)
# print result to console
print(img_s2)
## class       : SpatRaster 
## dimensions  : 4977, 6279, 6  (nrow, ncol, nlyr)
## resolution  : 10, 10  (x, y)
## extent      : 4524946, 4587736, 3248660, 3298430  (xmin, xmax, ymin, ymax)
## coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
## sources     : SEN2A_BOA_20190726_BLU_flt32.tif  
##               SEN2A_BOA_20190726_GRN_flt32.tif  
##               SEN2A_BOA_20190726_RED_flt32.tif  
##               ... and 3 more source(s)
## names       : SEN2A~flt32, SEN2A~flt32, SEN2A~flt32, SEN2A~flt32, SEN2A~flt32, SEN2A~flt32 
## min values  :          ? ,          ? ,          ? ,      0.0065,      0.0056,          ?  
## max values  :          ? ,          ? ,          ? ,      0.9975,      0.9874,          ?
We first define a vector named l_bands containing the relative paths from the current working directory to the individual files. Here we make sure the order of the bands matches the order we would like them to be in the final image, i.e., starting with shorter wavelengths to longer wavelengths or from blue to short-wave infrared. Once the vector is defined we can immediately read the single-band images into a multi-band SpatRaster by providing the vector l_bands to the rast() function. Using help(rast) specifies for its argument x: filename (character), missing, SpatRaster, SpatRasterDataset, SpatExtent, SpatVector, matrix, array, list of SpatRaster objects. For other types it will be attempted to create a SpatRaster via (‘as(x, “SpatRaster”)’


Raster writing

Now that we created a raster stack we would like to save it to disc, since it is currently only stored in memory. We can do this using the writeRaster function.

Exercise 3

Take a look at the writeRaster help page. There are two mandatory arguments which the function expects. Try writing your raster stack into your Sentinel-2 folder and choose a meaningful filename.

# provide the necessary arguments to save your image to disc
writeRaster()
Once your image is stored on disc, open it in QGIS. Select the same visualization settings as for the Landsat image and compare pixel values using the “Identify features” button at the top panel and clicking somewhere into the map view. If you right-click, select “Identify All” to inspect the values of both images.

According to the function’s help page we need at least two arguments: x and filename, i.e. the image we would like to save (x) and the associated filename(+path). We can use the same basename as the individual files but swap the band naming to STACK. In case the file already exists, and we would like to overwrite it, we can specify overwrite = TRUE.

# provide the necessary arguments to save your image to disc
writeRaster(x = img_s2, filename = "data/Sentinel-2/SEN2A_BOA_20190726_STACK_flt32.tif",
            overwrite = TRUE)

We can also access the image matrix or single values directly in R. Recall the image properties from last session:

Below you find some terra functions to access some of these properties and particularily how to access individual bands or pixels inside an image:

As you can see from the above figure, you can imagine the image as a three-dimensional array in which each pixel has a xyz-coordinate or index which we can access using square brackets [] on our raster object. For example, to retrieve the values from the red band at row = 1000 and column = 500 we would write:

# raster[[zdim / band]][ydim / row, xdim / column]
img_s2[[3]][1000, 500]
##   SEN2A_BOA_20190726_RED_flt32
## 1                       0.0369
img_landsat[[3]][1000, 500]
##   RED
## 1 541
Exercise 4
Why does indexing the Landsat and Sentinel image at indices [[3]][1000, 500] not point to the same place on earth?
The index points to the position of pixels/bands in xyz-direction. As such it depends on the number of pixels in x (columns) and y (rows) direction which again depends on the extent and resolution of the original image. The Landsat data comes with 30m spatial resolution, the Sentinel-2 data with 10m spatial resolution. Accordingly, even if the extent was identical, the number of pixels is not, thus accessing “the values from the red band at row = 1000 and column = 500” points to different points in space.

Note how the Landsat data is stored as whole numbers and Sentinel as floating point number. It is time to revisit the data types from last week.


Data types

Recall from just now that we quantized the radiometric resolution of a sensor in bits:

If the radiometric resolution is high, smaller and more subtle differences in reflected electromagnetic radiation can be differentiated by the sensor. The measured energy is then converted and stored as digital numbers, the pixel values in the imagery. The range of unique values that can be stored in each pixel is determined by the data type, thus the radiometric resolution greatly governs the needed data type to store the data. The unit in which the radiometric resolution and data type are measured is bits, i.e. short for binary digit.

The binary system
The binary system represents numbers through sequences of bits (binary digits). Each bit has a position and a state. The positions are numbered from 0 to the total number of bits, usually read from right to left. The state of an individual bit can either be 0 (FALSE) or 1 (TRUE). From an electric circuits perspective, electricity is either off (0) or on (1). Consider the meaning of 0 as “FALSE” or “NO”, 1 equals “TRUE” or “YES”.

This is the very fundamental logic on which all software is currently built upon. This is because we can represent all sort of concepts in binary, including - of course - numbers and letters. A number in binary is a number expressed in the base-2 numeral system (= 0 and 1, instead of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 in decimal). Just like in the decimal system, however, if we want to represent quantities beyond the amount of unique digits available (0-1 for binary, 0-9 for decimal), we ‘flip over’ to an additional position, re-starting the count on the previous position (in decimal this would be 1, 2, ..., 9, -> flip -> 10, 11, ..., 99 -> flip -> 100, 101, ..., 999 -> flip -> 1000, ...). Hence if we run out of symbols, we add a new position.

Let us image we start counting from 0 to 10 in binary:

  • 0 -> 0
  • 1 -> 1 - now we already run out of symbols, so for the next count we add a position
  • 2 -> 10
  • 3 -> 11 - again, we add a position for the next count
  • 4 -> 100
  • 5 -> 101
  • 6 -> 110
  • 7 -> 111 - again, we add a position for the next count
  • 8 -> 1000

Since a binary system operates in base 2, each bit position is assigned a unique value defined by \(2^{n}\), where n is the bit position starting at 0. So for bit position 0, we have a value of \(2^{0} = 1\), for bit position 1 we have a value of \(2^{1} = 2\), for bit position 2 we have a value of \(2^{2} = 4\), and so on. When extracting values from the entire bit sequence, we can simply add together all bit positions which have a state of 1 (TRUE). The visualization of the described looks as follows:

The binary number system. The bit position n (purple) and resulting value from \(2^{n}\) (blue) and bit state (orange) Source: Wikipedia

In the example above, five bit positions are given, hence the quantization is 5-bit. We can see that different combinations of states and positions enable us to represent any integer number, limited only by the number of bits in the sequence.

The more bits in a sequence we have, the higher the numbers we can generate. With 5 bits, we can represent the numbers 0–31 (i.e., 32 unique values), with 16 bits, we can represent the numbers 0 - 65,535 (i.e., 65,536 unique values).

Common data types of digital remote sensing imagery

Data type (name) Typical bit length Value range R datatype
byte / int8 1 byte = 8 bit -128 to 127 (signed); 0 to 255 (unsigned) 'INT1S'; 'INT1U'
int16 2 byte = 16 bit -32768 to 32767 (signed); 0 to 65535 (unsigned) 'INT2S'; 'INT2U'
float 4 byte = 32 bit -3,4e^38 to 3,4e^38, ‘single precision’; floating point number according to IEEE 754 'FLT4S'
double 8 byte = 64 bit -Inf to +Inf, ‘double precision’ 'FLT8S'

The data types byte and int16 store whole numbers while float and double store floating-point values, i.e., values that have potential decimal places. For instance, if you wanted to store the number 3.14 in integer format, it would be in the potential range of values for both types, but the decimal positions would be truncated and therefore only a value of 3 would be stored.

Now you may ask why we would bother about choosing the datatype at all, if there are obvious choices such as float / double that can capture a large range of unique values and also allow for decimal storage? Well, a greater bit length (8bit vs. 32bit) also means we need to store more information on our data storage device, possibly without needing to, because the information could be stored more efficiently with a smaller bit length.


The file size or memory needed to store an image can be calculated as follows:
file size = number-of-bands \(\times\) number-of-columns \(\times\) numbers-of-rows \(\times\) bit-length

Consider what the above means for the Sentinel-2 data. Our image has the following dimensions:

dim(img_s2)
## [1] 4977 6279    6

The pixel values are currently stored as 32bit floating point numbers. The uncompressed filesize in megabyte is thus approx. equal to:

# n_pixels * bit width
n_pixels <- prod(dim(img_s2))
bytes <- 4  # 32 bit integer
print( paste0( "Size (Mb) 32bit: ", round( n_pixels * bytes / 1024**2, digits=2 ) ) )
## [1] "Size (Mb) 32bit: 715.27"

If we can somehow change the data as 16bit integer, the file size reduces to:

# n_pixels * bit width
n_pixels <- prod(dim(img_s2))
bytes <- 2  # 16 bit integer
print( paste0( "Size (Mb) 16bit: ", round( n_pixels * bytes / 1024**2, digits=2 ) ) )
## [1] "Size (Mb) 16bit: 357.63"

With growing amount of images, it is inevitably useful to think about datatypes and use tricks to store data more efficiently. Often we convert data from float (% reflectance between 0-1) to integer by simply multiplying by a constant value of e.g. 10,000, where a value of, for instance, 0.353 (35.5%) becomes 3530, allowing for an int16 data type. This way we can represent the decimal reflectance numbers as integer and save a lot of memory on our drives and improving computing performance.


Raster algebra

The task to save disc space gives us a good reason to learn how to change pixel values of raster objects. Recall from the R Basics section that vector/array arithmetic works element-wise:

a_vector <- c(1, 2, 3, 4, 5)
print(a_vector * 10)
## [1] 10 20 30 40 50

You can add, subtract, multiply or divide a constant from every element in the array.

An entire array can be scaled by multiplying it with a constant

We can also use these operations on multiple arrays of the same size. For example multiplying two arrays together, the first element of the first array is multiplied by the first element of the second array, and so on.

You can multiply two images with each other in a pixel-wise manner

When we apply an operation on our SpatRaster, the terra package performs the matrix operations for us in the back, meaning operations on SpatRaster objects work element-wise, too.

Exercise 5
Change the pixel values of your Sentinel-2 raster stack to an integer compatible value range. Take a look at the column “R datatype” in the table above showing “common data types of digital remote sensing imagery”, and select the appropriate datatype that is needed to save the data. Either index or plot your result to check if it worked. Then, check ?writeRaster and and write your raster to disc with the appropriate datatype. Last, compare the two raster stacks in your folder in terms of their filesize.

We know from the website and the “R Basics” section that similar to vectors, the matrices containing the data in our SpatRaster can be modified using simple arithmetic. We need an integer compatible value range, thus move from our reflectance values stored between 0 and 1 to whole numbers. From the table above we can infer that we have two potential options here: 8bit integer (INT1S/INT1U) and 16bit integer (INT2S/INT2U). If we used INT1U we could transfer our values to a range from 0 to 100. However, this would disregard any sensitivity (radiometrically) beyond the second decimal place. Since this is unreasonable for data measured originally in 12bit (radiometric resolution of Sentinel-2), we opt for INT2U instead, thus transfer our values to a range from 0 to 10000. Whilst we are at it, we can also change the bandnames of the raster using names().

# we can define a new variable of overwrite the existing
img_s2 <- img_s2 * 10000

# rename bands (not in task description)
names(img_s2) <- c("BLU", "GRN", "RED", "NIR", "SW1", "SW2")

# plot bands to check value range
plot(img_s2)

From help(writeRaster) we can infer that there is a datatype argument allowing us to explicitly define the datatype when writing to disc.

writeRaster(
  x = img_s2,
  filename = "data/Sentinel-2/SEN2A_BOA_20190726_stack_int16.tif",
  datatype = "INT2U",
  overwrite = TRUE
)

Reprojecting and spatial subsetting

Our final toolset for today will deal with reprojections and spatial subsetting. Above you already had a look at berlin_10m_epsg32633.tif delineating Berlin as our study area.

When you compare the Coordinate Reference System (CRS) of the mask and Sentinel image, you notice they are different. We thus need to reproject the Sentinel-2 image to match the projection and grid of berlin_10m_epsg32633.tif. The following figure depicts a zoom-in comparing the 10x10m target grid of berlin_10m_epsg32633.tif shown in red, to the 10x10m grid of your Sentinel-2 image.

Mismatch in raster grids between two CRSs

We would like to transform the Sentinel-2 onto the red grid. But since the grids obviously do not match perfectly, how do we compute the values in the new grid? In technical terms this process is called resampling. We must choose a resampling method which defines the way the values in the output raster will be computed from the input raster. The three most commonly used techniques are Nearest Neighbour, Bilinear Interpolation and Cubic Convolution:

Three common resampling methods - Source: “Principles of Remote Sensing” © 2009 by ITC, Enschede

  • Nearest Neighbour: The value of each cell in an output raster is calculated using the value of the nearest cell in an input raster.
  • Bilinear Interpolation: The value of each cell in an output raster is calculated using the weighted average value of the nearest four cells in an input raster.
  • Cubic Convolution: The value of each cell in an output raster is calculated using the weighted average value of the nearest 16 cells in an input raster.

Assignment

In today’s homework assignment you will…

  • reproject the Sentinel-2 data to match the study area
  • inspect the atmospheric influence by visually comparing top-of-atmosphere (TOA) vs. bottom-of-atmosphere (BOA) reflectance
  • inspect the difference in the measured signal by comparing Landsat-8 with Sentinel-2 imagery from the same acquisition date

Reprojections

Reproject image
Reproject your 16bit integer Sentinel-2 image using berlin_10m_epsg32633.tif as reference image for projection, extent and resolution. To do this, make use of terra’s project() function. Seek the help page (?project) and identify the correct values for the arguments x and y. Create two separate results, once using 'near' and once using 'cubic' for the method argument. Print your result to console and check if the extent and dimensions match as expected. Last, write both results to disc.

Assess results in QGIS
Compare the results for both resampling methods in QGIS. Zoom in and out to search for differences/patterns. Considering the above theory on resampling methods, your visual assessment, and if necessary additional resources online: Which method is more appropriate? To exemplify your words, include at least one screenshot in your .Rmd file using the following syntax:

![Figure desription](relative/path/to/figure.png){width=100%}

The signal altered by the atmosphere: TOA vs. BOA

In the lecture you have learned about the influence of the atmosphere on the measured signal. You heard that we try to eliminate the atmosphere and calculate the surface reflectance as if there was no atmosphere. Data that is not atmospherically corrected is refered to as Top-Of-Atmosphere (TOA) reflectance. Atmospherically corrected data represents surface conditions and is thus called Bottom-Of-Atmsophere (BOA).

In our case, we have worked with two different products: The Landsat-8 data is TOA reflectance, and the Sentinel-2 data is BOA reflectance.

With both datasets opened in QGIS, use the the “Identify features” at the top-panel to right-click selected pixels and select “Identify All”. Change the view from “Table” to “Graph” to see the spectral profiles for both images on top of each other.

Now select different pixels and compare the spectral profiles of the two images to each other. Identify an exemplary pixel for 1) woody vegetation (e.g. forests), 2) built-up and 3) water, and provide a screenshot for each of the profiles. For each profile then describe the general pattern of reflectance in the different wavelengths and identify the most notable differences between TOA and BOA. Based on what you learned in the lecture, explain the most prominent difference between TOA and BOA.

Submission

  • Document your answers to the questions above with bullet points and the help of your screenshots in your .Rmd document knitted to PDF.

  • Upload your documentation as a PDF file on Moodle.

  • General submission notes: Submission deadline for the weekly assignment is always the following Monday at 10am. Please use the naming convention indicating session number and family name of all students in the respective team, e.g. ‘s01_surname1_surname2_surname3_surname4.pdf’. Each team member has to upload the assignment individually. Provide single file submissions, in case you have to submit multiple files, create a *.zip archive.


Copyright © 2023 Humboldt-Universität zu Berlin. Department of Geography.