- R as a calculator
- Variables (`->`)
- Comments (`#`)
- Math operators (`sqrt(), exp(), log(), sin(), cos(), ...`)
- Data types (`character, integer, double`)
- Logical operators (`==, <, >, !=, &, |`)
- Missing values (`NA, NaN, is.na()`)
- Getting help (`?`)
- Vectors as a first R object type
- Vector arithmetic
- Numeric vector functions (`c(), length(), max(), min(), sum(), ...`)
- Logical vectors
- Sequence vectors (`n:m, seq(), rep()`)
- Accessing and manipulating vector values

- More R data structures
- Data import and export
- Simple data manipulation

Multiple data values can be stored in various data structures:

- Homogeneous (all values of the same type): vector, matrix
- Heterogeneous (values of mixed types): data frame, list
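To make the difference concrete, here is a minimal sketch with made-up values: a vector coerces mixed input to a single type, while a list keeps each element's own type.

```r
# Illustrative values only (not from the course data)
v <- c(1, "a", TRUE)       # vector: everything is coerced to character
typeof(v)

l <- list(1, "a", TRUE)    # list: each element keeps its own type
sapply(l, typeof)
```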

Factors are a special type of vector that is used for coding categorical variables in statistical models. Recall statistical data types: continuous vs categorical (nominal vs ordinal). Factors contain a fixed number of values, also called **levels** (categories).

Typical examples of categorical variables stored as factors are gender or tree species:

```
# character vector
treespecies_char <- c("SP", "PI", "FI", "FI", "PI")
treespecies_char
```

`## [1] "SP" "PI" "FI" "FI" "PI"`

```
# convert character vector to factor
treespecies <- factor(treespecies_char)
treespecies
```

```
## [1] SP PI FI FI PI
## Levels: FI PI SP
```

Levels are the category (class) names or labels, and they are of data type `character`.

`levels(treespecies)`

`## [1] "FI" "PI" "SP"`

`typeof(levels(treespecies))`

`## [1] "character"`

You can change the names of the levels (categories) as follows:

```
levels(treespecies) <- c("Fir", "Pine", "Spruce")
treespecies
```

```
## [1] Spruce Pine Fir Fir Pine
## Levels: Fir Pine Spruce
```

Check out the help for the `factor()` function (`?factor`). The function also takes a `levels` argument and a `labels` argument. This means you can also change the names of the levels (categories) inside `factor()`, i.e. when creating the factor variable.

```
treespecies2 <- factor(treespecies_char, levels=c("FI", "PI", "SP"), labels=c("Fir", "Pine", "Spruce"))
treespecies2
```

```
## [1] Spruce Pine Fir Fir Pine
## Levels: Fir Pine Spruce
```

Note that you cannot simply assign a value to a factor if that value is not among its `levels`:

`treespecies[1] <- "Oak"`

```
## Warning in `[<-.factor`(`*tmp*`, 1, value = "Oak"): invalid factor level, NA
## generated
```

`treespecies`

```
## [1] <NA> Pine Fir Fir Pine
## Levels: Fir Pine Spruce
```

Instead, you first need to add a level.

```
levels(treespecies2) <- c(levels(treespecies2), "Oak")
treespecies2
```

```
## [1] Spruce Pine Fir Fir Pine
## Levels: Fir Pine Spruce Oak
```

```
# now you can change the first tree species element to Oak
treespecies2[1] <- "Oak"
treespecies2
```

```
## [1] Oak Pine Fir Fir Pine
## Levels: Fir Pine Spruce Oak
```
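The factors above are nominal: their levels have no inherent order. For ordinal variables, `factor()` also accepts `ordered = TRUE`. The sketch below uses hypothetical size classes (not part of the course data) to show that the order given in `levels` then becomes meaningful for comparisons.

```r
# Hypothetical size classes, for illustration only
size_char <- c("small", "large", "medium", "small")

# The order of 'levels' defines the ranking of the categories
size <- factor(size_char,
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size

size < "large"     # comparisons respect the level order
table(size)        # counts are reported in level order
as.integer(size)   # internal integer codes follow the level order
```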

A matrix is a two-dimensional **array**. A matrix therefore has an additional **attribute**, `dim`, that stores its two dimensions: the number of rows (**nrow**) and the number of columns (**ncol**).

```
m <- matrix(1:9, nrow = 3, ncol = 3)
m
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
```

See: `dim(m)`, `nrow(m)`, `ncol(m)`

Note that by default the columns of the matrix will be filled first.

If you want to fill the matrix by row, you can specify this with the `byrow` argument:

```
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
```

You can apply arithmetic operators to matrices in the same way as to vectors (attention: recycling rule!):

`m * 2`

```
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
```

`m * n`

```
##      [,1] [,2] [,3]
## [1,]    1    8   21
## [2,]    8   25   48
## [3,]   21   48   81
```
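Note that `*` multiplies element by element; it is not the matrix product from linear algebra, for which R provides the `%*%` operator. A short sketch contrasting the two, and showing how a shorter vector is recycled down the columns of a matrix:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

elementwise <- m * n     # element [i, j] is m[i, j] * n[i, j]
matprod     <- m %*% n   # matrix product: rows of m times columns of n

# A vector of length 3 is recycled column by column,
# so row i of m is multiplied by the i-th vector element
recycled <- m * c(1, 10, 100)

elementwise
matprod
recycled
```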

As with vectors, you can access elements of a matrix with indices, except that we now deal with two dimensions, `[i, j]` or `[row, column]`:

`m`

```
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
```

`m[1, ]`

`## [1] 1 4 7`

`m[ , 1]`

`## [1] 1 2 3`

`m[1, 1]`

`## [1] 1`

`m[1:2, 3]`

`## [1] 7 8`

`m[1:2, c(1,3)]`

```
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
```

When you extract elements from a matrix, the result can belong to a different class!

`class(m)`

`## [1] "matrix" "array"`

`class(m[ , 3])`

`## [1] "integer"`
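If you need the result to stay a matrix, the indexing operator has a `drop` argument. A minimal sketch:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

col3_vec <- m[, 3]                # dimensions are dropped: plain integer vector
col3_mat <- m[, 3, drop = FALSE]  # stays a matrix with 3 rows and 1 column

class(col3_vec)
dim(col3_mat)
```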

The functions `cbind()` and `rbind()` glue (bind) vectors and matrices together:

```
# bind together by column
cbind(m,n)
```

```
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7    1    2    3
## [2,]    2    5    8    4    5    6
## [3,]    3    6    9    7    8    9
```

`rbind()` binds vectors and matrices together by row:

```
# bind together by row
rbind(m,n)
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]    1    2    3
## [5,]    4    5    6
## [6,]    7    8    9
```

Lists are objects that can store values of different data types (and different kinds of objects):

`l <- list(c(1, 2, 3), m, "a")`

To access the elements of a list by index, you need to use double brackets `[[]]`:

`l[[1]]`

`## [1] 1 2 3`

`l[[3]]`

`## [1] "a"`

**Why bother with lists?**

Many functions in R (e.g., linear regression, generalised linear modelling, t-test, etc.) produce output that is stored in a list. The good news is that R contains various functions to extract the required information (e.g., estimated values, p-values, etc.) and present it in nice tables. However, sometimes it can be useful to extract information from the results directly, or you may want to use a list to return the output from your own functions.
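As a sketch of extracting information directly, the example below runs a one-sample t-test on made-up tree heights (the values and the reference mean of 25 are hypothetical) and pulls single results out of the returned list by name. Named elements can be accessed with `$` or with `[[]]`. The small helper `height_stats()` is likewise only an illustration of returning several results from your own function.

```r
# Hypothetical tree heights, for illustration only
heights <- c(34, 21, 26, 30, 28)

res <- t.test(heights, mu = 25)   # the result is a list of class "htest"
is.list(res)
names(res)                        # which elements are available?

res$p.value                       # extract by name with $
res[["estimate"]]                 # ... or with double brackets

# Returning several results from your own function via a named list
height_stats <- function(x) {
  list(mean = mean(x), range = range(x))
}
stats <- height_stats(heights)
stats$range
```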

A data frame is what you may call a data table. It is similar to a two-dimensional matrix but the columns can contain different data types.

```
df <- data.frame(TREEID = 1001:1003,
                 SPECIES = factor(c("Spruce", "Fir", "Pine")),
                 LIFE = c(TRUE, FALSE, TRUE),
                 HEIGHT = c(34, 21, 26)
)
df
```

```
##   TREEID SPECIES  LIFE HEIGHT
## 1   1001  Spruce  TRUE     34
## 2   1002     Fir FALSE     21
## 3   1003    Pine  TRUE     26
```

The `summary()` function gives a quick overview. It is helpful for spotting data entry errors and `NA`s:

`summary(df)`

```
##      TREEID       SPECIES     LIFE             HEIGHT
##  Min.   :1001   Fir   :1   Mode :logical   Min.   :21.0
##  1st Qu.:1002   Pine  :1   FALSE:1         1st Qu.:23.5
##  Median :1002   Spruce:1   TRUE :2         Median :26.0
##  Mean   :1002                              Mean   :27.0
##  3rd Qu.:1002                              3rd Qu.:30.0
##  Max.   :1003                              Max.   :34.0
```

You can index (access) columns in three main ways:

`df$TREEID`

`## [1] 1001 1002 1003`

`df[ , 1]`

`## [1] 1001 1002 1003`

`df[ , "TREEID"]`

`## [1] 1001 1002 1003`

Rows are indexed by row number:

`df[3, ]`

```
##   TREEID SPECIES LIFE HEIGHT
## 3   1003    Pine TRUE     26
```

`df[1:2, "TREEID"]`

`## [1] 1001 1002`

`df[1, c("TREEID", "HEIGHT")]`

```
##   TREEID HEIGHT
## 1   1001     34
```

IMPORTANT: Extracting a row does not change the class, but extracting a single column does!

`class(df)`

`## [1] "data.frame"`

`class(df[ 1, ])`

`## [1] "data.frame"`

`class(df[ , "TREEID"])`

`## [1] "integer"`
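To keep a single column as a data frame, you can either pass `drop = FALSE` or index with one argument only. A minimal sketch using a reduced version of the data frame above:

```r
df <- data.frame(TREEID = 1001:1003, HEIGHT = c(34, 21, 26))

class(df[, "TREEID"])                # column is simplified to a vector
class(df[, "TREEID", drop = FALSE])  # stays a data frame
class(df["TREEID"])                  # list-style indexing also keeps a data frame
```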

R can read data from a variety of formats and sources:

- Text files (e.g. CSV, TXT)
- Spreadsheets and statistical programs (e.g. Excel, SPSS)
- DBF files (e.g. from ArcGIS)
- Databases (e.g. PostgreSQL)
- Files on the local file system or on a remote server (e.g. ftp, http)

For details, see the R Data Import/Export manual: https://cran.r-project.org/doc/manuals/r-release/R-data.html

Text (ASCII) files are a popular data storage and exchange format. They can be read on any operating system and without special software (a plain text editor such as Notepad is enough).

Comma-separated (csv) or tab-separated files are common formats to store table data (in systems with English locale).

BE AWARE: On German (and other) systems, the comma is used as the decimal separator, so the semicolon or tab is often preferred as the column separator there. On English systems, the dot (`.`) is the decimal separator; on other systems it may instead be used to group digits for readability, e.g. `1.000.000`.

The `read.table()` function can be used to read table data from text files. The function has several arguments to accommodate the different data formats. See `?read.table`.

The following example specifies these format options:

- The columns are separated by `,` (`sep = ","`)
- The decimal sign is `.` (`dec = "."`)
- The first row contains the column names (`header = TRUE`)

`tab <- read.table("data/airquality.txt", sep = ",", dec = ".", header = TRUE)`

`read.table()` returns a data frame.

`class(tab)`

`## [1] "data.frame"`

`names(tab)`

`## [1] "ID" "Ozone" "Solar" "Wind" "Temp" "Month" "Day"`

`head(tab, 3)`

```
##   ID Ozone Solar Wind Temp Month Day
## 1  1    41   190  7.4   67     5   1
## 2  2    36   118  8.0   72     5   2
## 3  3    12   149 12.6   74     5   3
```

`read.csv()` is a shortcut for `read.table()` with defaults suited to text files with comma-separated columns (`sep = ","`, `header = TRUE`).
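For files written on systems that use the decimal comma, the same arguments can be adapted. The sketch below reads a small, made-up semicolon-separated table from a string via the `text` argument (so no file is needed); `read.csv2()` is the corresponding shortcut with `sep = ";"` and `dec = ","` as defaults.

```r
# Made-up data in "German" CSV style: semicolon separator, decimal comma
txt <- "ID;Wind;Temp
1;7,4;67
2;8,0;72"

tab_de  <- read.table(text = txt, sep = ";", dec = ",", header = TRUE)
tab_de2 <- read.csv2(text = txt)   # same settings as defaults

str(tab_de)                        # Wind is read as numeric, not character
all.equal(tab_de, tab_de2)
```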

Export a data frame to a semicolon-delimited file:

`write.table(tab, "data/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)`

You can also use `write.csv()` to export data frames to comma-delimited text files:

`write.csv(tab, "data/airquality_output.csv")`

For reference, the usage of `write.table()` as listed in `?write.table`:

```
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
```
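A quick way to check an export is to read the file back in. The sketch below uses a small made-up data frame and temporary files created with `tempfile()`, so it does not depend on the `data/` folder. It also passes `row.names = FALSE` to `write.csv()`, which otherwise writes the row names as an extra first column.

```r
# Small illustrative data frame (not the course file)
tab <- data.frame(ID = 1:3, Wind = c(7.4, 8.0, 12.6))

out_file <- tempfile(fileext = ".txt")
write.table(tab, out_file, sep = ";", dec = ".", row.names = FALSE)

readLines(out_file)   # inspect the raw text that was written

# Read the file back with matching settings and compare
tab_back <- read.table(out_file, sep = ";", dec = ".", header = TRUE)
all.equal(tab, tab_back)

csv_file <- tempfile(fileext = ".csv")
write.csv(tab, csv_file, row.names = FALSE)
names(read.csv(csv_file))
```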

Recall from last session that `NA` is used for missing values in R:

```
x <- c(1, 5, 3, 6, NA, 9, 21, 4)
x
```

`## [1] 1 5 3 6 NA 9 21 4`

... and that you must use `is.na()` to test which elements are missing:

`is.na(x)`

`## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE`

It is not uncommon to have missing values in datasets, e.g. in data frames and matrices:

```
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))
df
```

```
##   var1 var2
## 1    1    3
## 2    3    4
## 3   12    1
## 4   NA    8
## 5    5   11
```

Use `na.omit()` to drop the rows of a data frame that contain `NA`s:

`na.omit(df)`

```
##   var1 var2
## 1    1    3
## 2    3    4
## 3   12    1
## 5    5   11
```

Also recall that arithmetic functions and operations applied to `NA`s return `NA`:

`NA * 3`

`## [1] NA`

Many arithmetic functions allow you to specify whether to ignore or include `NA`s:

`sum(df$var1)`

`## [1] NA`

`sum(df$var1, na.rm=TRUE)`

`## [1] 21`
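Before dropping or ignoring missing values, it is often useful to count them first. A minimal sketch using the same small data frame: `is.na()` also works on whole data frames, and `complete.cases()` returns a logical vector marking the rows without any `NA`.

```r
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))

colSums(is.na(df))            # number of NAs per column
mean(df$var1, na.rm = TRUE)   # mean of the four non-missing values

keep <- complete.cases(df)    # TRUE for rows without any NA
df[keep, ]                    # same rows as na.omit(df)
```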

Combine two or more character variables with `paste()`:

`paste("Hello", "World", sep = "_")`

`## [1] "Hello_World"`

Extract a portion of a character variable with `substring()`:

`substring("Hello World", first = 3, last = 8)`

`## [1] "llo Wo"`
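Both functions are vectorised, which makes them handy for building labels. A short sketch with made-up values: `paste0()` is `paste()` with `sep = ""`, and `collapse` joins the elements of a vector into a single string.

```r
species <- c("Fir", "Pine", "Spruce")

paste("tree", 1:3, sep = "_")     # one label per element
paste0(species, 1:3)              # no separator
paste(species, collapse = ", ")   # one single string

# First two letters of each name, in upper case
codes <- toupper(substring(species, first = 1, last = 2))
codes
```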

It is best to extract columns from a data frame using the column names:

`names(airquality)`

`## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"`

```
airquality_ozone <- airquality$Ozone
airquality_ozone <- airquality[, "Ozone"]
airquality_ozone_temp <- airquality[, c("Ozone", "Temp")]
```

Subset rows using logical conditions on variables (columns):

```
airquality_temp_gr_70 <- airquality[airquality$Temp > 70, ]
nrow(airquality_temp_gr_70)
```

`## [1] 120`

Subset rows based on row indices:

```
airquality_rows_1_100 <- airquality[1:100, ]
nrow(airquality_rows_1_100)
```

`## [1] 100`

You can combine logical operators to make more complex subsets of rows and columns:

`airquality_juni <- airquality[airquality$Month == 6, ]`

All measurements from 15 June:

`airquality_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, ]`

Or return only the Ozone values (column) from 15 June:

`ozone_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, "Ozone"]`
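One caveat: if the column used in the condition contains missing values, as `Ozone` does in the built-in `airquality` dataset, the comparison yields `NA` for those rows and the subset then contains rows filled with `NA`. A sketch of two common ways around this, `which()` and `subset()`; the threshold of 100 is arbitrary and only serves as an illustration.

```r
# The threshold 100 is arbitrary and only serves as an illustration
with_na <- airquality[airquality$Ozone > 100, ]          # includes all-NA rows
no_na   <- airquality[which(airquality$Ozone > 100), ]   # which() drops NAs
via_sub <- subset(airquality, Ozone > 100)               # subset() drops them too

nrow(with_na)
nrow(no_na)
all.equal(no_na, via_sub)
```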

In the logical operations covered so far, vectors or matrices (or single values) were compared to a single value, e.g. `c(1,2,3) < 2`.

The `%in%` operator can be applied to two vectors: `x %in% y`. For each element in vector `x`, `%in%` evaluates whether the element is contained in vector `y`. The operator returns a logical vector of the same length as vector `x`.

The following example returns all rows for the months June and July.

`airquality_jun_jul <- airquality[airquality$Month %in% c(6, 7), ]`
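To see what `%in%` returns, and why `==` is not a substitute when comparing against several values, consider this small sketch (the vectors `x` and `y` are made up). It uses the built-in `airquality` dataset for the second part.

```r
x <- c(5, 6, 7, 8)
y <- c(6, 7)

x %in% y   # membership test for each element of x
x == y     # pairwise comparison with y recycled to 6, 7, 6, 7

# For the airquality example, %in% is equivalent to combining == with |
a <- airquality[airquality$Month %in% c(6, 7), ]
b <- airquality[airquality$Month == 6 | airquality$Month == 7, ]
identical(a, b)
nrow(a)
```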

Recall that the `$` operator is used to access columns in a data frame by name. You can also use it to create a new column:

The following example creates a new column with the name “NewVariable” and fills it with Ozone values multiplied by 100.

`airquality$NewVariable <- airquality$Ozone * 100`

Or we create a log-transformed variable.

`airquality$logOzone <- log(airquality$Ozone)`

Sometimes, you may want to create classes from a continuous variable. For example, below we convert temperature data into a factor (categorical) variable with two levels (warm and cold).

- Manual recoding

First, add an empty character vector:

`airquality$TempClass <- rep("", nrow(airquality))`

Then, recode the character vector based on logical operations on `$Temp`:

```
airquality[airquality$Temp >= 70 , "TempClass"] <- "warm"
airquality[airquality$Temp < 70 , "TempClass"] <- "cold"
airquality$TempClass <- factor(airquality$TempClass)
```

- Automatic recoding using cut()

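The function `cut()` divides a numeric variable into intervals and codes them as a factor (categorical data). By default the intervals are closed on the right, so the breaks below define the classes (0, 69] and (69, 100]: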

`airquality$TempClass2 <- cut(airquality$Temp, c(0, 69, 100), labels=c('cold', 'warm'))`

Note that there is more than one way to do the same task!
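For example, `ifelse()` recodes in a single step. The sketch below works on a copy of the temperature column of the built-in `airquality` dataset and cross-tabulates the result against `cut()` to check that both approaches agree. The variable names `by_ifelse` and `by_cut` are chosen here for illustration.

```r
temp <- airquality$Temp

# One-step recoding with ifelse(), then conversion to a factor
by_ifelse <- factor(ifelse(temp >= 70, "warm", "cold"))

# Same classes with cut(): (0, 69] is cold, (69, 100] is warm
by_cut <- cut(temp, breaks = c(0, 69, 100), labels = c("cold", "warm"))

table(by_ifelse, by_cut)   # all counts should lie on the diagonal
all(as.character(by_ifelse) == as.character(by_cut))
```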

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.