Course preparation: R basics

Here, we provide a brief overview of the basics of R, which is meant to refresh your knowledge and help you identify topics worth revisiting. You are encouraged to revisit the material from the statistics module and to learn from the online resources we list at the bottom of this page.

Why R?

R is a programming language and open source software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The name R originates from the first names of the two authors and alludes to the programming language S. The project was conceived in 1992, with an initial version released in 1995 and a stable version (1.0.0) in 2000.

Learning R has tons of advantages. It is a great starting point for those eager to learn programming. R offers increasingly specialized tools for data wrangling, statistical analyses, and visualization. The CRAN package repository currently features >13,000 packages serving a variety of purposes, e.g. data manipulation (tidyr, dplyr, caret), visualization (ggplot2, ggmap, rasterVis), and geodata handling (raster, rgdal, sp, sf). You will notice that a huge share of figures in scientific publications was produced using R. The R community is huge, and offers great support. R is extremely popular in science & industry, so a proficiency in R opens a wide array of job opportunities. Everything is free and open source.

Operators

R is an interpreted programming language, which means users can execute commands directly in the console without having to compile source code. You can use R as a simple calculator: arithmetic operators and mathematical functions are evaluated directly. For example:

# A simple calculation
2*(1+4)

# abs() returns absolute values
abs(-2)

# sqrt() returns the square root
sqrt(pi)  # Note: 'pi' is a predefined constant
## [1] 10
## [1] 2
## [1] 1.772454

To list the defined operators used by R, simply type ?Syntax into the console. This includes the arithmetic operators

• addition + and subtraction -
• multiplication * and division /
• exponentiation ^ or **

as well as logical operators

• less than < / greater than >
• less than or equal to <= / greater than or equal to >=
• equal to ==
• not equal to !=
• or |
• and &

The concept of logical operators is to allow a program to make a decision based on (multiple) conditions. A condition can be evaluated to be either TRUE or FALSE (see ?Logic). For example:

2 > 5
## [1] FALSE
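
Several conditions can be combined with & (and), | (or) and negated with !. The following minimal sketch uses two illustrative variables, cloud_cover and sun_elevation, which are assumptions made for this example and not part of the course data:

```r
# Hypothetical example values (not taken from the course data)
cloud_cover <- 12      # percent
sun_elevation <- 35    # degrees

# & : both conditions must be TRUE
usable <- cloud_cover < 20 & sun_elevation > 30

# | : at least one condition must be TRUE
flagged <- cloud_cover > 50 | sun_elevation < 10

# ! negates a logical value
not_flagged <- !flagged

print(usable)
print(flagged)
print(not_flagged)
## [1] TRUE
## [1] FALSE
## [1] TRUE
```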

They are often used in combination with control statements such as the conditional if...else statement, which is introduced below. But first, back to basic functions: R also comes with a whole set of mathematical functions, including

• trigonometric functions sin(), cos() and tan()
• exponential function exp()
• statistical metrics like min(), max(), median()

to name just a very few of them. Functions in R provide information on how to use them. If you need help with a function, type in help(name_of_function) or simply ?name_of_function to display the correct usage and syntax of a function.
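
As a brief illustration, a few of the functions listed above can be called as follows. The vector v_example and its values are made up for this sketch:

```r
# Trigonometric functions expect angles in radians
sin(pi / 2)

# exp() computes e to the power of x
exp(0)

# Summary statistics of a small numeric vector (illustrative values)
v_example <- c(3, 1, 10)
min(v_example)
max(v_example)
median(v_example)
## [1] 1
## [1] 1
## [1] 1
## [1] 10
## [1] 3
```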

Objects

Everything we assign to a variable in R is an object. We can assign objects in R with the <- operator (i.e. an arrow), e.g. x <- 2. Many other languages use = for assignment, which also works in R; however, it is good practice to use the <- operator in R.

# Assign object 'x' a value of 2
x <- 2
# <- assigns the value, but does not print the result to console

# Assign  another variable 'y' ...
y <- 5

# Store the product of x and y in a new object 'z'
z <- x*y

# Print z
print(z)
## [1] 10

Assigned variables are stored in the environment or workspace of our session and can be listed by calling the ls() function.
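
A small sketch of how the workspace can be inspected and cleaned up. It assumes a fresh session in which only the objects defined here matter; rm() is the base R function for removing objects:

```r
# Define two objects (same names as in the example above)
x <- 2
y <- 5

# ls() returns the names of all objects in the current environment
"x" %in% ls()

# rm() removes an object from the workspace again
rm(y)
"y" %in% ls()
## [1] TRUE
## [1] FALSE
```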

Data Types and Data Structures

The basic data types in R comprise:

• Character (or String): sensor <- 'Landsat-8'
• Integer: n_bands <- 11L (the L tells R it is an integer, otherwise it is interpreted as numeric)
• Numeric (or Float): lambda_red <- 0.662
• Logical (or Boolean): sun_synchronous <- TRUE or sun_synchronous <- T
• Factors: agency <- factor('NASA',levels=c('NASA','ESA','INPE'))

Understanding data types is important for a variety of reasons. For instance, later in the course we will manipulate large rasters of remotely sensed reflectance data which may be stored as floating point numbers between 0 and 1. However, storing raster matrices as floats on disk requires more storage space than using an integer format. It can therefore be handy to convert data to integer before saving it to disk.
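
The following sketch shows how to check the type of a value with class() and how floating point reflectance values could be converted to integers. The reflectance values and the scaling factor of 10000 are assumptions chosen for illustration only:

```r
# Inspect the class of a few values
class('Landsat-8')
class(11L)
class(0.662)
class(TRUE)

# Hypothetical reflectance values stored as floating point numbers (0-1)
reflectance <- c(0.0312, 0.1250, 0.4578)

# Scale and convert to integer (the factor 10000 is an assumption)
reflectance_int <- as.integer(round(reflectance * 10000))
print(reflectance_int)
typeof(reflectance_int)
## [1] "character"
## [1] "integer"
## [1] "numeric"
## [1] "logical"
## [1]  312 1250 4578
## [1] "integer"
```

Rounding before calling as.integer() matters because as.integer() simply truncates the decimal part.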

For now, let us just keep in mind that there are different data types. If we want to store not just a single value in a variable but a whole set of data, we need to understand the different data structures available in R:

As you can see in the figure above, a variety of data structures exist in R, which either allow for storing different data types simultaneously (data frames, lists), or only one data type (scalars, vectors, matrices). Note that the colors representing the data types are only chosen for visualization purposes: they do not mean that matrices or vectors can only store values of type logical or numeric, but that they can only store data of the same type.

Vectors

A vector in R can be created using a variety of functions. The most basic function is the c()-function, where ‘c’ stands for concatenate.

# Create a vector of numbers
a <- c(2, 4, 8, 16, 32)

# Print vector a in console
print(a)

# cat-function: Similar to print, outputs the objects, concatenating the representations
cat('vector a:', a)  
## [1]  2  4  8 16 32
## vector a: 2 4 8 16 32

Let's see what happens when we mix numeric values with strings and combine them into a vector.

b <- c(2, 4, 8, 16, 32, 'Landsat', 'Sentinel-2')
print(typeof(b))
## [1] "character"

Our vector was automatically cast to character. This is because a vector can only hold one data type, so R coerces all elements to the most flexible type present in order to avoid data loss (logical < integer < numeric < character; here from numeric to character). The strings (Landsat and Sentinel-2) in vector b cannot be cast to numeric, whereas a numeric 2 can very well be transformed into a string '2'. Be aware of this behaviour and use other data structures if needed.
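
The coercion behaviour can be verified with typeof(). This is a minimal sketch with made-up values; suppressWarnings() is only used here to hide the warning R issues when a string cannot be converted:

```r
# Logical and integer -> integer
typeof(c(TRUE, 2L))

# Integer and numeric -> double
typeof(c(2L, 0.5))

# Numeric and character -> character
typeof(c(0.5, 'Landsat'))

# Casting back only works where the string represents a number;
# everything else becomes NA (R normally issues a warning)
suppressWarnings(as.numeric(c('2', 'Landsat')))
## [1] "integer"
## [1] "double"
## [1] "character"
## [1]  2 NA
```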

Back to vector creation. Instead of having to type in sequences or replicates by hand, R provides the functions seq() and rep().

# Create a numeric sequence and store in object s
s <- seq(from=1, to=365, by=1)  # or simply s <- seq(1, 365, 1)
cat('Length of s:', length(s), '\n')  # '\n' ends the line; cat() does not add one

# Alternatively use (always with by=1):
s <- 1:365

# We may want to repeat a value n times...
r <- rep(1, 5)
print(r)

# ...or repeat a sequence
r <- rep(seq(2, 8, 2), 3)
print(r)

# ...or repeat each entry in a sequence
r <- rep(seq(2, 8, 2), each=3)
print(r)
## Length of s: 365
## [1] 1 1 1 1 1
##  [1] 2 4 6 8 2 4 6 8 2 4 6 8
##  [1] 2 2 2 4 4 4 6 6 6 8 8 8
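
Two further variants that are often useful, shown here as a short sketch: seq() with the length.out argument, and the helper functions seq_len() and seq_along(). The vector sensors is made up for this example:

```r
# Five evenly spaced values between 0 and 1
seq(0, 1, length.out = 5)

# seq_len(n) creates the integer sequence 1, 2, ..., n
seq_len(5)

# seq_along(x) creates one index per element of x
sensors <- c('Landsat-8', 'Sentinel-2')  # illustrative vector
seq_along(sensors)
## [1] 0.00 0.25 0.50 0.75 1.00
## [1] 1 2 3 4 5
## [1] 1 2
```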

Matrices and Arrays

Creating matrices and arrays is just as easy as creating vectors. Matrices are just 2-dimensional arrays with a specified number of rows and columns. All columns in a matrix must have the same data type (e.g., numeric, character) and must be of equal length.

# Creates empty matrix of size 5x2
m1 <- matrix(data=NA, nrow=5, ncol=2)
print(m1)

# Creates matrix of size 5x2 containing a numeric vector
m2 <- matrix(data=c(1:10), nrow=5, ncol=2)
print(m2)

# Can you describe what byrow=T does?
m3 <- matrix(data=c(1:10), nrow=5, ncol=2, byrow = T)
print(m3)
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
## [3,]   NA   NA
## [4,]   NA   NA
## [5,]   NA   NA
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10

Arrays are similar to matrices, but can have more than two dimensions, as specified in the dim argument.

# Create array of two 4x3 matrices
a <- array(data=c(1:12), dim=c(4,3,2))
print(a)
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

In R, basic mathematical operations may be used to manipulate arrays. This way, a constant can be added to or subtracted from every element in the array, and every element can be multiplied or divided by a constant.

# Create an array of one 4x3 matrix (the 9 values are recycled to fill all 12 cells)
a <- array(data=c(1:9), dim=c(4,3,1))

a2 <- a * 10

We can also use these operations on multiple arrays of the same size. This array arithmetic works element-wise: when multiplying two arrays, the first element of the first array is multiplied by the first element of the second array, and so on. Note: while not needed in this course, R is also capable of performing matrix algebra, for which dedicated operators such as %*% (matrix multiplication) exist.

# Create two arrays, each holding one 4x3 matrix
a1 <- array(data=c(1:9), dim=c(4,3,1))
a2 <- array(data=c(1:9), dim=c(4,3,1))

a3 <- a1 * a2
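
To make the difference between element-wise arithmetic and matrix algebra concrete, here is a minimal sketch with two small matrices. The names m_a and m_b and their values are chosen for illustration only:

```r
# Two small 2x2 matrices (values chosen for illustration)
m_a <- matrix(c(1, 2, 3, 4), nrow = 2)
m_b <- matrix(c(10, 20, 30, 40), nrow = 2)

# Element-wise multiplication: each cell times the matching cell
m_a * m_b

# Matrix product (linear algebra) uses the %*% operator instead
m_a %*% m_b
##      [,1] [,2]
## [1,]   10   90
## [2,]   40  160
##      [,1] [,2]
## [1,]   70  150
## [2,]  100  220
```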

Matrices are very important structures when working with remote sensing data, as this is how images are represented numerically. We will extensively make use of the raster package throughout the course where a lot of the handling is done for us, however sometimes we need to manipulate image data with more advanced techniques and it is therefore crucial to understand the underlying data structure.

Data Frames and Lists

Generally speaking, a data.frame is a list of vectors of equal length, which can have varying data types.

# Create data.frame with four columns of different data types
df <- data.frame(sensor = c('Landsat-5 TM', 'Landsat-7 ETM+', 'Landsat-8 OLI-TIRS'),
                 n_bands = c(7, 8, 11),
                 active = c(F, T, T),
                 launched = c(1984, 1999, 2013))
print(df)
##               sensor n_bands active launched
## 1       Landsat-5 TM       7  FALSE     1984
## 2     Landsat-7 ETM+       8   TRUE     1999
## 3 Landsat-8 OLI-TIRS      11   TRUE     2013

The top line of the data.frame is the header, describing the column names. Each entry is called a cell and may be indexed (accessed) individually. We will see how this works in a moment.

Lastly, let us have a look at lists. Lists are objects that can contain elements of different types: vectors, strings, matrices, functions or yet another list. Hence, a list is a generic vector containing other objects. The dimensions of the elements to be included need not be identical, which gives us additional flexibility to integrate heterogeneous datasets. A list can be created using the list() function:

# Store the matrix, array and data frame in a list
l <- list(m3, a, df)
print(l)
## [[1]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10
##
## [[2]]
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6    1
## [3,]    3    7    2
## [4,]    4    8    3
##
##
## [[3]]
##               sensor n_bands active launched
## 1       Landsat-5 TM       7  FALSE     1984
## 2     Landsat-7 ETM+       8   TRUE     1999
## 3 Landsat-8 OLI-TIRS      11   TRUE     2013

As the printed result suggests, our three objects are now stored at locations [[1]], [[2]] and [[3]]. We may want to access these list elements and the cells therein individually. This brings us to the next topic: indexing.

Indexing

So how do we access values in vectors, data.frames or the elements in the list we just created? Each element has an index, i.e. a positive integer describing its position (counting starts at 1 in R). We can retrieve elements by specifying one or several index locations in square brackets.

b <- c(2, 4, 8, 16, 32, 64, 128, 256)

# Retrieve the 4th value of b
b[4]

# Retrieve value 1, 2, and 3
b[1:3]  
## [1] 16
## [1] 2 4 8

Use a vector to select certain positions:

# Retrieve value 1, 3, 5, and 7 of b
b[c(1, 3, 5, 7)] 
## [1]   2   8  32 128

Manipulate entries:

# Set the first value in b to 0
b[1] <- 0
print(b)
## [1]   0   4   8  16  32  64 128 256
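
Besides positions, logical conditions and negative indices can be used inside the square brackets. A short sketch, recreating the vector b in the state it has after the assignment above:

```r
# Same vector as above, after setting the first value to 0
b <- c(0, 4, 8, 16, 32, 64, 128, 256)

# A logical condition returns one TRUE/FALSE per element ...
b > 16

# ... which can be used to select only the matching values
b[b > 16]

# Negative indices drop the given positions
b[-1]
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [1]  32  64 128 256
## [1]   4   8  16  32  64 128 256
```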

For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.

# [[]] returns element 1 in its original data type
typeof(l[[1]])

# [] returns a list containing element 1
typeof(l[1])

# Create a new object containing element 1
object_l <- l[[1]]

# Create a new object containing cell [4,2] of element 1
cell_l <- l[[1]][4,2] 
## [1] "integer"
## [1] "list"
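
Two-dimensional objects such as matrices and data.frames are indexed with [row, column]. The sketch below recreates the data.frame from above so that it runs on its own; stringsAsFactors = FALSE is set explicitly so that the sensor names are treated as character strings regardless of the R version:

```r
# Recreate the data.frame from above
df <- data.frame(sensor = c('Landsat-5 TM', 'Landsat-7 ETM+', 'Landsat-8 OLI-TIRS'),
                 n_bands = c(7, 8, 11),
                 active = c(FALSE, TRUE, TRUE),
                 launched = c(1984, 1999, 2013),
                 stringsAsFactors = FALSE)

# Single cell: row 2, column 'sensor'
df[2, 'sensor']

# Entire row 3 (returns a data.frame with one row)
df[3, ]

# Entire column 'launched' (returns a vector)
df[, 'launched']

# Rows that fulfil a condition
df[df$launched > 1990, 'sensor']
## [1] "Landsat-7 ETM+"
##               sensor n_bands active launched
## 3 Landsat-8 OLI-TIRS      11   TRUE     2013
## [1] 1984 1999 2013
## [1] "Landsat-7 ETM+"     "Landsat-8 OLI-TIRS"
```

Leaving the row or column position empty selects all rows or all columns, respectively.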

We may also use the $ operator to access specific elements (in lists) or columns (in data frames), given that names were assigned. For example:

names(l)  # Currently, our list does not have names...
names(l) <- c('matrix', 'array', 'df')  # ... names() can be used to assign element names
print(l$df)

print(df$n_bands)
## NULL
##               sensor n_bands active launched
## 1       Landsat-5 TM       7  FALSE     1984
## 2     Landsat-7 ETM+       8   TRUE     1999
## 3 Landsat-8 OLI-TIRS      11   TRUE     2013
## [1]  7  8 11

If you are uncertain about the characteristics of an object, you may use additional functions to investigate its attributes. Good examples are:

• str(): structure of an object
• length(): number of elements in one-dimensional objects or lists
• nrow(), ncol(): number of rows or columns in 2-dimensional objects (df or matrix)
• class(), typeof(): class or type of an object
• names(): retrieve column names of df, or element names of list

Missing Values

Often our data has missing entries. For illustration, think of a vector of temperature measurements where, due to measurement errors, an entry is missing now and then. To keep the record complete, such entries are usually not simply dropped but represented by a certain value, e.g. -9999. In R, missing values can be set to the logical constant NA (Not Available). Let us consider a vector of hourly temperature measurements where -9999 was used to represent erroneous measurements:

t_hourly <- c(-1.2, -2.3, -2.4, -2.6, -1.8, -9999, 0.2, 1.4, 2.5, 4.7, -9999, 9.9,
              12.1, 13.1, 13.0, 11.8, 9.8, 8.4, 7.5, 6.2, 5.0, 4.1, 4.2, 3.8)

# Calculate daily mean temperature
mean(t_hourly)
## [1] -828.775

These measurements contain two invalid temperature values of -9999. Accordingly, the calculated mean is of no use to us. We need to flag these entries as invalid or, more generally, as not available (NA):

# Use indexing with logical condition
t_hourly[t_hourly == -9999] <- NA

# mean() will yield NA as soon as one NA value appears unless...
mean(t_hourly)

# ... we specify na.rm = TRUE to ignore missing values:
mean(t_hourly, na.rm = TRUE)
## [1] NA
## [1] 4.881818

There are often various ways to achieve what we want; the best way largely depends on how we want to proceed with our data.
Here are two more ways to obtain the same result:

t_hourly <- c(-1.2, -2.3, -2.4, -2.6, -1.8, -9999, 0.2, 1.4, 2.5, 4.7, -9999, 9.9,
              12.1, 13.1, 13.0, 11.8, 9.8, 8.4, 7.5, 6.2, 5.0, 4.1, 4.2, 3.8)

# Use only the entries that are not -9999...
mean(t_hourly[t_hourly != -9999])

# ...or use non-NAs after assigning NA to -9999; !is.na() flags all non-NA entries
t_hourly[t_hourly == -9999] <- NA
mean(t_hourly[!is.na(t_hourly)])
## [1] 4.881818
## [1] 4.881818

Flow Control

Often we want code to be executed only when certain conditions are met (if/conditional statements) or to repeat a certain chain of commands (loops/repeating operations).

Loops

for loops can be used to iterate over items and repeatedly execute code blocks. For example, we can loop over each item in our hourly temperature vector and print the recorded value:

# Iterate over each entry in t_hourly and print it
for (item in t_hourly){
  print(item)
}
## [1] -1.2
## [1] -2.3
## [1] -2.4
## [1] -2.6
## [1] -1.8
## [1] NA
## [1] 0.2
## [1] 1.4
## [1] 2.5
## [1] 4.7
## [1] NA
## [1] 9.9
## [1] 12.1
## [1] 13.1
## [1] 13
## [1] 11.8
## [1] 9.8
## [1] 8.4
## [1] 7.5
## [1] 6.2
## [1] 5
## [1] 4.1
## [1] 4.2
## [1] 3.8

A while loop repeatedly evaluates a condition and executes the commands in its body as long as the condition is TRUE:

# Assign a start value to v
v <- -3

# Add 1 to v and print it, as long as v <= 2
while (v <= 2) {
  v <- v + 1
  print(v)
}
## [1] -2
## [1] -1
## [1] 0
## [1] 1
## [1] 2
## [1] 3

Note that loops should be avoided where possible in higher-level languages such as R or Python, because they tend to run slower than the built-in vectorized functions.
For example, the sum over a vector may be calculated as follows:

s <- seq(1, 10, 0.5)
s_sum <- 0
for (i in s){
  s_sum <- s_sum + i
}
print(s_sum)
## [1] 104.5

Much more elegant and faster (think of manipulating large datasets) is to use

s_sum <- sum(s)
print(s_sum)
## [1] 104.5

if-statements

For conditional statements, the most commonly used approach is the if-statement, which is based on boolean logic. A condition is evaluated to be either TRUE or FALSE, and the following code block is executed (TRUE) or skipped (FALSE) accordingly. For example, let us check whether there are temperatures in our hourly temperature vector which are below zero degrees Celsius and record the evaluation (TRUE/FALSE) in a new vector:

# Create a new empty vector which we can 'fill' depending on whether the temperatures were below 0
freezing <- numeric(0)

# Now we loop over the items and check the recorded temperatures
# We need to be aware of nodata values (NAs) and check for them as well
for (t in t_hourly) {
  if (is.na(t)) {
    freezing <- c(freezing, NA)  # append NA
  } else if (t < 0){
    freezing <- c(freezing, 1)   # append 1 (= it is below 0)
  } else {
    freezing <- c(freezing, 0)   # append 0 (= it is not below 0)
  }
}

There is also a vectorized version, ifelse(), which allows us to do the above in one line of code:

freezing <- ifelse(t_hourly < 0 & !is.na(t_hourly), 1, 0)  # ifelse(condition, value if TRUE, value if FALSE)
print(freezing)
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Note that, unlike the loop, this call sets the two NA entries to 0, because the combined condition evaluates to FALSE for missing values. Using ifelse(t_hourly < 0, 1, 0) instead would keep them as NA.

Functions

R allows for user-defined functions, which come in handy whenever a certain computation needs to be repeated or generalized. Generally, a function is a set of statements organized together to perform a specific task. Using functions greatly enhances code structure as it improves the clarity and readability of your script. Function arguments are provided in round brackets (). The function body, which contains the collection of statements, stands between curly brackets {}. return() is used to define the output of a function.
A function may be as simple as a print statement without any objects to be returned. A function object is defined with the function() call:

# Basic function
printer <- function(){
  print('Hello World!')
}

printer()
## [1] "Hello World!"

Now we need to convert our degree Celsius data to degrees Fahrenheit for our non-metric American colleagues. We cannot remember the conversion factors and simply want a little function into which we can plug our records and which returns the converted values:

# Function to convert degree C to degree F
convert_temp <- function(x, C_to_F = TRUE) {
  if (C_to_F) {
    x_converted <- x * (9/5) + 32
  } else {
    x_converted <- (x - 32) * (5/9)
  }
  return(x_converted)
}

# 'convert_temp' is now an R object (in this case a function) which we can use like any other function:
t_hourly_F <- convert_temp(t_hourly)
print(t_hourly_F)
##  [1] 29.84 27.86 27.68 27.32 28.76    NA 32.36 34.52 36.50 40.46    NA 49.82
## [13] 53.78 55.58 55.40 53.24 49.64 47.12 45.50 43.16 41.00 39.38 39.56 38.84

Here, x is a mandatory argument that the user has to specify, while C_to_F is optional: it has a default value (TRUE) which may be changed if needed (e.g. set to FALSE to convert from F to C instead).

Tutorials

We covered only very basic aspects of R here. At this stage, we did not cover advanced data wrangling using packages such as tidyr or dplyr, or look into advanced plotting with ggplot2. Please follow one or several tutorials to revisit and extend your knowledge:

• General tutorials:
• Graphics & visualization:
• Geodata processing:

Coding style

A few basic rules apply to coding in R. Here is a short summary of Hadley Wickham's style guide:

• Regularly save your progress.
• Script names should be meaningful and end in ‘.R’.
• Comment (#) your code & separate it into readable chunks.
• Try to limit your code to 80 characters per line.
• Variable and function names should be lowercase.
• Variable names should be nouns and function names verbs.
• Place spaces around operators (=, +, -, <-, etc.) and after commas.
• Use <-, not =, for assignment.

An example:

######################################################
# Creating random data and a correlated response
# Author, 2020

# Load all required packages
library(ggplot2)

# Create random data
x <- runif(50, 0, 2)

# Build function to simulate response
create.response <- function(x){x + rnorm(50, 0, 0.2)}

# Apply function to random data
y <- create.response(x)

# Make a dataframe
data <- data.frame('x' = x, 'y' = y)

# Plot the simulated dataset
ggplot(data, aes(x = x, y = y)) +
  geom_point()

# Investigate correlation in the data
cor(data$x, data$y)
## [1] 0.9465169

Help!

If you get stuck while programming, there are plenty of things you can do:

• Consult the function's help page (i.e. highlight the function and hit F1)
• Search your problem or error message
• Use the moodle course forum
• Check forums (e.g., StackOverflow)

Setting up a project in RStudio

We recommend creating an RStudio project for all the work you do for this module. This helps you keep your data and scripts (.R- or .Rmd-files) organised.

In order to set up a new project in RStudio:

• Navigate to the tab “File” and select “New Project”
• Either select “New Directory” if you do not have a folder for this module yet, or select “Existing Directory” to point to an existing folder. In the first case: “New Directory” > “New Project”, select a name and a location to store the folder in, and finish by “Create Project”.
• Navigate to the path of the project folder on your machine
• Inside the folder, create the following sub-folders: data, scripts, models, results, and figures
• All scripts for each session are ideally saved into separate .R- or .Rmd-files and saved in your “scripts” folder
• You can use this blueprint .Rmd for your work

If you close and reopen RStudio, the last project should be reloaded automatically. If this is not the case, just navigate to the tab “File” and select “Open Project”.
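
The sub-folders listed above can also be created from within R instead of by hand. The following is a minimal sketch using base R; project_dir is a placeholder that points to a temporary directory here and should be replaced with the path of your own project folder:

```r
# Placeholder project location: replace with the path of your own project folder
project_dir <- file.path(tempdir(), 'my_module_project')

# Sub-folders suggested in the list above
sub_folders <- c('data', 'scripts', 'models', 'results', 'figures')

# Create each sub-folder; recursive = TRUE also creates project_dir if missing,
# showWarnings = FALSE keeps the call quiet if a folder already exists
for (folder in sub_folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Check the result
list.files(project_dir)
```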

Test for Session 3: terra-package

With your R console opened in RStudio, you can install the terra package like any other package in R as follows:

# install the terra package
install.packages('terra')

library(terra)

Once the package is loaded into your current environment (library(terra)), navigate to the Landsat-8 image you worked with before, copy the absolute path to the .tif-file and try to load the image into R. This should print some image properties to your console, as shown below.

# create variable which contains the filepath
file_landsat8 <- "your/path/to/the/landsat8/file.tif"

# use the filepath to read in the file into a "rast()"-object
landsat_8 <- rast(file_landsat8)
print(landsat_8)
## class       : SpatRaster
## dimensions  : 1840, 2171, 6  (nrow, ncol, nlyr)
## resolution  : 30.00639, 30.00728  (x, y)
## extent      : 360793.1, 425937, 5794470, 5849683  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 33N (EPSG:32633)
## source      : LC08_L1TP_193023_20190726_20200827_02_T1_int16.tif
## names       :  BLU,  GRN,  RED,  NIR,   SW1,  SW2
## min values  :  639,  371,  227,   60,    23,   16
## max values  : 5603, 5984, 6728, 7454, 11152, 8946
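
If reading the file fails, a wrong path is a likely cause. As a small sketch, base R's file.exists() can be used to check the path before reading the file; the path below is the same placeholder as above and needs to be replaced with your own:

```r
# Placeholder path: replace with the location of your own Landsat-8 file.
# On Windows, use forward slashes (/) or doubled backslashes (\\) in the path.
file_landsat8 <- 'your/path/to/the/landsat8/file.tif'

# Check that the file can be found before trying to read it
file_found <- file.exists(file_landsat8)

if (file_found) {
  message('File found: ', file_landsat8)
} else {
  message('File not found, please check the path: ', file_landsat8)
}
```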