Data structures

Multiple data values can be stored in various data structures:

Homogeneous (of the same type):

  • vector

  • matrix

Heterogeneous (of mixed types):

  • data frame

  • list

Matrices

A matrix is a two-dimensional array. In addition to its values, it stores a dim attribute with its two dimensions: the number of rows (nrow) and the number of columns (ncol).

m <- matrix(1:9, nrow = 3, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

See: dim(m), nrow(m), ncol(m)
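
As a minimal sketch, the functions listed above can be tried on the example matrix (m is re-created here so that the snippet runs on its own):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

dim(m)        # both dimensions: rows first, then columns
## [1] 3 3
nrow(m)       # number of rows
## [1] 3
ncol(m)       # number of columns
## [1] 3
attributes(m) # the dimensions are stored in the dim attribute
## $dim
## [1] 3 3
```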

Note that by default the matrix is filled column by column.

If you want to fill the matrix by row, you can specify this with the byrow argument:

n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

You can apply mathematical operators to matrices the same way as to vectors. The operators work element by element (Attention: Recycling rule!):

m * 2
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
m * n
##      [,1] [,2] [,3]
## [1,]    1    8   21
## [2,]    8   25   48
## [3,]   21   48   81
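
The recycling rule and the difference between the element-wise product and the matrix product can be illustrated with a short sketch. The vector c(1, 10, 100) is an arbitrary choice for this illustration; m and n are re-created so that the snippet is self-contained.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# A shorter vector is recycled down the columns:
# row 1 is multiplied by 1, row 2 by 10, row 3 by 100
m * c(1, 10, 100)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]   20   50   80
## [3,]  300  600  900

# %*% is the matrix product from linear algebra, not the element-wise product
m %*% n
##      [,1] [,2] [,3]
## [1,]   66   78   90
## [2,]   78   93  108
## [3,]   90  108  126
```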

As with vectors, you can access elements of a matrix with indices, except that we now deal with two dimensions [i, j] or [row, column]:

m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m[1, ]
## [1] 1 4 7
m[ , 1]
## [1] 1 2 3
m[1, 1]
## [1] 1
m[1:2, 3]
## [1] 7 8
m[1:2, c(1,3)]
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8

When you extract elements from a matrix, the result can belong to a different class!

class(m)
## [1] "matrix" "array"
class(m[ , 3])
## [1] "integer"
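
If the result should stay a matrix, the drop argument of the bracket operator can be set to FALSE. A minimal sketch with the same matrix m:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# Default: a single column is simplified to a plain vector without dimensions
dim(m[ , 3])
## NULL

# drop = FALSE keeps the result as a 3 x 1 matrix
m[ , 3, drop = FALSE]
##      [,1]
## [1,]    7
## [2,]    8
## [3,]    9
```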

The functions cbind() and rbind() glue (bind) vectors and matrices together:

# bind together by column
cbind(m,n)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7    1    2    3
## [2,]    2    5    8    4    5    6
## [3,]    3    6    9    7    8    9

# bind together by row
rbind(m,n)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]    1    2    3
## [5,]    4    5    6
## [6,]    7    8    9
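
Both functions also accept plain vectors. The sketch below appends the vector 10:12 (chosen arbitrarily for this illustration) to m, once as a column and once as a row. Its length matches the number of rows and columns of m.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# Append the vector as a fourth column
cbind(m, 10:12)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

# Append the same vector as a fourth row
dim(rbind(m, 10:12))
## [1] 4 3
```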

Lists

Lists are objects that can store values of different data types and objects of different kinds. Lists can also contain other lists or lists of lists of lists… You get the point. The list l below is created with three slots filled with a numeric vector, the matrix m, and a character string.

l <- list(myValues=c(1, 2, 3), m, "Landsat")
l
## $myValues
## [1] 1 2 3
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## [[3]]
## [1] "Landsat"

Slots can be accessed by name (if they have one) and by position (number). The example above creates a name only for the first slot (myValues). The second and third slots are not named. To access the elements of a list you need to use double brackets [[]]:

l[["myValues"]] # access first slot by name
## [1] 1 2 3
l[[1]] # access first slot by position
## [1] 1 2 3
l[[3]] # access third slot by position
## [1] "Landsat"

Named slots can also be accessed using the $ operator.

l$myValues
## [1] 1 2 3
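
Single brackets also work on lists, but they return a shorter list rather than the stored object. A minimal sketch, using a reduced version of the list from above:

```r
l <- list(myValues = c(1, 2, 3), "Landsat")

class(l[1])    # single brackets return a list with one slot
## [1] "list"
class(l[[1]])  # double brackets return the stored object itself
## [1] "numeric"

sum(l[[1]])    # works on the numeric vector
## [1] 6
```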

You can also add new slots to an existing or empty list. The example below creates an empty list and adds a matrix to it.

newList <- list()
newList[["nMatrix"]] <- n
newList
## $nMatrix
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9


Why bother with lists?

Many functions in R (e.g., linear regression, generalised linear modelling, t-test, etc.) return their output as a list. The good news is that R contains various functions that extract the required information (e.g., estimated values, p-values, etc.) and present it in nice tables. However, sometimes it is useful to extract information from the results directly, or you may want to use a list to return the output of your own functions.
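
Both uses can be sketched in a few lines. The data x and y are invented for this illustration, and describe() is a hypothetical helper, not a function that ships with R.

```r
# Toy data, invented for this illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)      # fit a simple linear regression
is.list(fit)          # the model object is a list with a class attribute
## [1] TRUE
names(fit)            # names of the slots stored in the model object
fit$coefficients      # slots are accessed as in any other list
coef(fit)             # extractor function that returns the same values

# A hypothetical helper that returns several results in one list
describe <- function(v) {
  list(mean = mean(v), range = range(v), n = length(v))
}
res <- describe(y)
res$n
## [1] 5
```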

Data frames

A data frame is a table. It is similar to a two-dimensional matrix, but the columns can contain different data types.

trees <- data.frame(treeid = 1001:1004, 
                    species = c("Beech", "Fir", "Pine", "Fir"), 
                    life = c(TRUE, FALSE, TRUE, TRUE),
                    height = c(34, 38, 16, 40),
                    diameter = c(40, 50, NA, 45),
                    age = c(110, NA, 40, 120)
                   )
trees
##   treeid species  life height diameter age
## 1   1001   Beech  TRUE     34       40 110
## 2   1002     Fir FALSE     38       50  NA
## 3   1003    Pine  TRUE     16       NA  40
## 4   1004     Fir  TRUE     40       45 120

To retrieve the dimensions of the data.frame you can use the functions dim(), nrow(), and ncol().

dim(trees)
## [1] 4 6
nrow(trees)
## [1] 4
ncol(trees)
## [1] 6

You can access the column names as follows:

names(trees)
## [1] "treeid"   "species"  "life"     "height"   "diameter" "age"

You can change all or individual column names:

names(trees)[3] <- "alive"
names(trees)
## [1] "treeid"   "species"  "alive"    "height"   "diameter" "age"

The summary() function prints a quick overview, which is helpful for spotting data entry errors and NAs:

summary(trees)
##      treeid       species            alive             height    
##  Min.   :1001   Length:4           Mode :logical   Min.   :16.0  
##  1st Qu.:1002   Class :character   FALSE:1         1st Qu.:29.5  
##  Median :1002   Mode  :character   TRUE :3         Median :36.0  
##  Mean   :1002                                      Mean   :32.0  
##  3rd Qu.:1003                                      3rd Qu.:38.5  
##  Max.   :1004                                      Max.   :40.0  
##                                                                  
##     diameter         age     
##  Min.   :40.0   Min.   : 40  
##  1st Qu.:42.5   1st Qu.: 75  
##  Median :45.0   Median :110  
##  Mean   :45.0   Mean   : 90  
##  3rd Qu.:47.5   3rd Qu.:115  
##  Max.   :50.0   Max.   :120  
##  NA's   :1      NA's   :1
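
To check the data type of each column rather than its value distribution, str() and sapply() are two common options. A minimal sketch on a reduced copy of trees; the output is omitted because the class reported for species depends on the R version (character since R 4.0, factor by default before).

```r
trees <- data.frame(treeid = 1001:1004,
                    species = c("Beech", "Fir", "Pine", "Fir"),
                    height = c(34, 38, 16, 40))

str(trees)            # compact overview: dimensions, column types, first values
sapply(trees, class)  # class of each column as a named vector
```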

Indexing

You can index (access) columns in three ways: with the $ operator, by position, or by name. It is usually best to extract columns from a data frame using the column names, because you do not necessarily know the position of a column and the position may change.

trees$treeid
## [1] 1001 1002 1003 1004
trees[ , 1]
## [1] 1001 1002 1003 1004
trees[ , "treeid"] # or trees[ , c("treeid", "alive")]
## [1] 1001 1002 1003 1004

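When you extract a single column in this way, the result is a vector and not a data.frame. Selecting several columns, e.g. trees[ , c("treeid", "alive")], returns a data.frame.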
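
If a single column should stay a data.frame, two options are list-style indexing with one index and the drop argument. A minimal sketch on a reduced copy of trees:

```r
trees <- data.frame(treeid = 1001:1004,
                    height = c(34, 38, 16, 40))

class(trees[ , "height"])                # simplified to a vector
## [1] "numeric"
class(trees["height"])                   # list-style indexing keeps the data frame
## [1] "data.frame"
class(trees[ , "height", drop = FALSE])  # so does drop = FALSE
## [1] "data.frame"
```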

Row indexing

You can also index (access) rows using numerical indices. Remember, the indexing format is data.frame[row_index, column_index]. For example, to access the 2nd and 3rd row you may type this:

trees[2:3, ]
##   treeid species alive height diameter age
## 2   1002     Fir FALSE     38       50  NA
## 3   1003    Pine  TRUE     16       NA  40

However, indexing rows using numerical indices is not very practical. Usually, we want to select rows (observations) that meet certain criteria defined in one or multiple columns. We can define these criteria using logical operations and indexing. For example, let’s select all trees taller than 30 m. Tree height is stored in the height column. We can extract height from trees as a vector using trees$height. Consequently, the logical operation trees$height > 30 returns a logical vector, where the first element corresponds to the first tree of the data frame, the second element to the second tree, and so forth. You can see that only the 1st, 2nd, and 4th tree meet our height criterion.

trees$height > 30
## [1]  TRUE  TRUE FALSE  TRUE

To select all rows that meet this criterion, we use the logical vector (or operation) as an index vector.

trees[trees$height > 30, ]
##   treeid species alive height diameter age
## 1   1001   Beech  TRUE     34       40 110
## 2   1002     Fir FALSE     38       50  NA
## 4   1004     Fir  TRUE     40       45 120

You can combine multiple logical operations to make more complex queries. Let’s add the criterion that the trees also need to be alive.

trees[trees$height > 30 & trees$alive == TRUE, ]
##   treeid species alive height diameter age
## 1   1001   Beech  TRUE     34       40 110
## 4   1004     Fir  TRUE     40       45 120
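
The same query can be written more compactly. The sketch below uses a reduced copy of trees. It relies on alive already being a logical column and shows the base R function subset() as an alternative to bracket indexing.

```r
trees <- data.frame(treeid = 1001:1004,
                    alive = c(TRUE, FALSE, TRUE, TRUE),
                    height = c(34, 38, 16, 40))

# alive is already logical, so the comparison with TRUE can be dropped
trees[trees$height > 30 & trees$alive, ]
##   treeid alive height
## 1   1001  TRUE     34
## 4   1004  TRUE     40

# subset() looks up the column names inside the data frame
subset(trees, height > 30 & alive)
```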

Recall that the %in% operator evaluates whether each element of one vector is contained in another vector. This allows us to check each element against multiple choices. For example, we want to select all trees that match a certain list (vector) of tree species.

conifer <- trees[trees$species %in% c("Spruce", "Fir", "Pine"), ]
conifer
##   treeid species alive height diameter age
## 2   1002     Fir FALSE     38       50  NA
## 3   1003    Pine  TRUE     16       NA  40
## 4   1004     Fir  TRUE     40       45 120
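
It can help to look at the logical vector that %in% produces before using it as an index. A minimal sketch with the species values from the example; negating the result selects everything that is not in the list.

```r
species <- c("Beech", "Fir", "Pine", "Fir")
conifers <- c("Spruce", "Fir", "Pine")

species %in% conifers      # one TRUE/FALSE per element of species
## [1] FALSE  TRUE  TRUE  TRUE
!(species %in% conifers)   # negate to select everything else
## [1]  TRUE FALSE FALSE FALSE
species[!(species %in% conifers)]
## [1] "Beech"
```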

Missing values

Recall from last session that math operations and logical operations with NA generally return NA. This is an issue when we want to select rows based on a column that contains NA values.

trees$diameter > 40
## [1] FALSE  TRUE    NA  TRUE

Subsetting (extracting rows from) a data.frame with a logical vector that contains NA gives a result that is easy to misread: every NA in the index produces a row filled with NA. For example, column diameter contains a missing value. Subsetting the data.frame trees to all trees with a diameter greater than 40 returns the two matching trees plus an extra row filled with NA.

trees[trees$diameter > 40, ]
##    treeid species alive height diameter age
## 2    1002     Fir FALSE     38       50  NA
## NA     NA    <NA>    NA     NA       NA  NA
## 4    1004     Fir  TRUE     40       45 120
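
Two common ways to avoid the extra NA row are sketched below on a reduced copy of trees: excluding missing values explicitly with is.na(), or wrapping the condition in which().

```r
trees <- data.frame(treeid = 1001:1004,
                    diameter = c(40, 50, NA, 45))

# Option 1: require explicitly that the value is not missing
trees[!is.na(trees$diameter) & trees$diameter > 40, ]
##   treeid diameter
## 2   1002       50
## 4   1004       45

# Option 2: which() returns only the positions that are TRUE and skips NA
which(trees$diameter > 40)
## [1] 2 4
trees[which(trees$diameter > 40), ]
```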

Understandably, the analyst must consider what to do with missing values prior to the analysis. Depending on the use case, different courses of action can be appropriate, e.g., filling in missing values or removing observations. A very crude but quick method is to remove all observations with missing values. The function na.omit() returns only observations with no missing values in any (!) of the columns. More advanced and meaningful methods that interpolate missing values are available in the zoo package.

trees_clean <- na.omit(trees)
trees_clean
##   treeid species alive height diameter age
## 1   1001   Beech  TRUE     34       40 110
## 4   1004     Fir  TRUE     40       45 120
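
A less drastic option is to drop only those rows that are incomplete in the columns an analysis actually uses. The sketch below uses a reduced copy of trees and assumes that only height and diameter are needed.

```r
trees <- data.frame(treeid = 1001:1004,
                    height = c(34, 38, 16, 40),
                    diameter = c(40, 50, NA, 45),
                    age = c(110, NA, 40, 120))

# Keep every row that is complete in the columns needed for the analysis
keep <- complete.cases(trees[ , c("height", "diameter")])
keep
## [1]  TRUE  TRUE FALSE  TRUE
trees[keep, ]   # tree 1002 stays, although its age is missing

# Many summary functions can skip missing values themselves
mean(trees$diameter, na.rm = TRUE)
## [1] 45
```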

Create new column

Recall that the $ operator is used to access columns in a data frame by name. You can also use it to create a new column. The following example creates a new column diameter_m that converts the diameter from centimeters to meters.

trees$diameter_m <- trees$diameter / 100

We can also store a log-transformed version of a variable in a new column.

trees$logHeight <- log(trees$height)
trees
##   treeid species alive height diameter age diameter_m logHeight
## 1   1001   Beech  TRUE     34       40 110       0.40  3.526361
## 2   1002     Fir FALSE     38       50  NA       0.50  3.637586
## 3   1003    Pine  TRUE     16       NA  40         NA  2.772589
## 4   1004     Fir  TRUE     40       45 120       0.45  3.688879
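
A new column can also build on a column that was just created. The sketch below, on a reduced copy of trees, derives the stem cross-section (basal area) from diameter_m. The column name basal_area and the assumption that diameter is recorded in cm are illustrative choices.

```r
trees <- data.frame(treeid = 1001:1004,
                    diameter = c(40, 50, NA, 45))

# Assumption for this sketch: diameter is recorded in cm
trees$diameter_m <- trees$diameter / 100
# Basal area in square meters: area of a circle with that diameter
trees$basal_area <- pi * (trees$diameter_m / 2)^2

round(trees$basal_area, 3)  # the missing diameter propagates as NA
## [1] 0.126 0.196    NA 0.159
```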

Factors

Factors are a special type of vector that is used for coding categorical variables in statistical models. Recall statistical data types: continuous vs categorical (nominal vs ordinal). In our trees dataset, the column species is a candidate for a factor variable. Currently, the column is of type character.

trees$species
## [1] "Beech" "Fir"   "Pine"  "Fir"

The function factor() converts a character vector into a factor vector. Let’s store this factor vector in a new column called class.

trees$class <- factor(trees$species)
trees$class
## [1] Beech Fir   Pine  Fir  
## Levels: Beech Fir Pine

Factors contain a fixed number of categories, in R called levels. Our tree dataset has three levels: Beech, Fir, Pine. You can access the levels of a factor variable as follows:

levels(trees$class)
## [1] "Beech" "Fir"   "Pine"
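
Internally, a factor stores each value as an integer index into its levels. A minimal sketch, using a stand-alone factor with the same species values:

```r
species <- factor(c("Beech", "Fir", "Pine", "Fir"))

as.integer(species)  # each value is stored as an index into the levels
## [1] 1 2 3 2
nlevels(species)     # number of levels
## [1] 3
table(species)       # number of observations per level
```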

Accessing the levels gives you an overview of the species, but you can also use the function to overwrite (change) the names of the levels:

levels(trees$class) <- c("Fagus", "Abies", "Pinus")
trees
##   treeid species alive height diameter age diameter_m logHeight class
## 1   1001   Beech  TRUE     34       40 110       0.40  3.526361 Fagus
## 2   1002     Fir FALSE     38       50  NA       0.50  3.637586 Abies
## 3   1003    Pine  TRUE     16       NA  40         NA  2.772589 Pinus
## 4   1004     Fir  TRUE     40       45 120       0.45  3.688879 Abies

Check out the help for the factor() function. The function also takes an argument levels and an argument labels. This means you can set the names of the levels (categories) in a single call to factor().

trees$class <- factor(trees$species, levels=c("Beech", "Fir", "Pine"),
                              labels=c("Fagus", "Abies", "Pinus"))
trees
##   treeid species alive height diameter age diameter_m logHeight class
## 1   1001   Beech  TRUE     34       40 110       0.40  3.526361 Fagus
## 2   1002     Fir FALSE     38       50  NA       0.50  3.637586 Abies
## 3   1003    Pine  TRUE     16       NA  40         NA  2.772589 Pinus
## 4   1004     Fir  TRUE     40       45 120       0.45  3.688879 Abies
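
The levels argument also controls the order of the levels, which by default is alphabetical. A minimal sketch; the order chosen here (conifers first) is arbitrary.

```r
species <- c("Beech", "Fir", "Pine", "Fir")

levels(factor(species))  # default: sorted alphabetically
## [1] "Beech" "Fir"   "Pine"

# Put the conifers first by listing the levels explicitly
f <- factor(species, levels = c("Fir", "Pine", "Beech"))
levels(f)
## [1] "Fir"   "Pine"  "Beech"
as.integer(f)            # the integer codes follow the new order
## [1] 3 1 2 1
```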

Note, you cannot simply add values to a factor that are not specified in levels. Below, I try to change the level of the first tree. Since the level does not exist in the factor variable, the entry gets deleted and replaced with NA. This is bad.

trees$class[1] <- "Oak"
## Warning in `[<-.factor`(`*tmp*`, 1, value = structure(c(NA, 2L, 3L, 2L), levels
## = c("Fagus", : invalid factor level, NA generated
trees
##   treeid species alive height diameter age diameter_m logHeight class
## 1   1001   Beech  TRUE     34       40 110       0.40  3.526361  <NA>
## 2   1002     Fir FALSE     38       50  NA       0.50  3.637586 Abies
## 3   1003    Pine  TRUE     16       NA  40         NA  2.772589 Pinus
## 4   1004     Fir  TRUE     40       45 120       0.45  3.688879 Abies

Instead, you first need to add a level.

levels(trees$class) <- c(levels(trees$class), "Oak")
trees$class
## [1] <NA>  Abies Pinus Abies
## Levels: Fagus Abies Pinus Oak

Then you can change the species class of the first tree to Oak.

trees$class[1] <- "Oak"
trees
##   treeid species alive height diameter age diameter_m logHeight class
## 1   1001   Beech  TRUE     34       40 110       0.40  3.526361   Oak
## 2   1002     Fir FALSE     38       50  NA       0.50  3.637586 Abies
## 3   1003    Pine  TRUE     16       NA  40         NA  2.772589 Pinus
## 4   1004     Fir  TRUE     40       45 120       0.45  3.688879 Abies
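
After this change, Fagus is still a level although no tree belongs to it any more. If unused levels get in the way, droplevels() removes them. A minimal sketch on a stand-alone factor that mirrors the class column:

```r
class_f <- factor(c("Fagus", "Abies", "Pinus", "Abies"),
                  levels = c("Fagus", "Abies", "Pinus", "Oak"))
class_f[1] <- "Oak"    # allowed, because "Oak" is a defined level

levels(class_f)        # "Fagus" is still a level, although no tree uses it
## [1] "Fagus" "Abies" "Pinus" "Oak"
table(class_f)         # unused levels show up with a count of 0

class_f <- droplevels(class_f)  # remove levels without observations
levels(class_f)
## [1] "Abies" "Pinus" "Oak"
```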

Factor from continuous variable

Sometimes, you may want to create classes from a continuous variable. For example, below we convert the tree heights into a factor (categorical) variable with two levels: small and tall. The function cut() divides a numeric variable into intervals and codes them into factors:

trees$heightClass <- cut(trees$height, c(0, 30, 100), labels=c('small', 'tall'))
trees
##   treeid species alive height diameter age diameter_m logHeight class
## 1   1001   Beech  TRUE     34       40 110       0.40  3.526361   Oak
## 2   1002     Fir FALSE     38       50  NA       0.50  3.637586 Abies
## 3   1003    Pine  TRUE     16       NA  40         NA  2.772589 Pinus
## 4   1004     Fir  TRUE     40       45 120       0.45  3.688879 Abies
##   heightClass
## 1        tall
## 2        tall
## 3       small
## 4        tall
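
It matters on which side the intervals are closed when a value falls exactly on a break. By default, cut() includes the upper bound of each interval, and right = FALSE includes the lower bound instead. A minimal sketch with three heights chosen for this illustration:

```r
h <- c(16, 30, 34)

# Default: intervals are closed on the right, (0,30] and (30,100]
as.character(cut(h, c(0, 30, 100)))
## [1] "(0,30]"   "(0,30]"   "(30,100]"

# right = FALSE closes them on the left instead, [0,30) and [30,100)
as.character(cut(h, c(0, 30, 100), right = FALSE, labels = c("small", "tall")))
## [1] "small" "tall"  "tall"
```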

Import/export

R can read data from a variety of formats and sources such as

  • Text files (e.g. CSV, TXT)
  • Spreadsheet and statistical software files (e.g. Excel, SPSS)
  • DBF files (e.g. from ArcGIS)
  • Databases (e.g. PostgreSQL)
  • Files on the local file system or on a remote server (e.g. via ftp, http)

Text (ASCII) files are a popular data storage and exchange format. They can be read on any OS platform without special software. The most common text files separate data columns by comma (CSV), semi-colon, or tab. On German (and other) systems, the comma is already reserved as the decimal separator, so semi-colon or tab separation is often preferred there. On English systems, the dot (.) is the decimal separator and the comma may be used to group digits for readability, e.g. 1,000,000. On German systems the roles are reversed, e.g. 1.000.000.

The read.table() function is the most generic function for reading tabular data from text files. Its arguments accommodate different data formats; see ?read.table. Important formatting options are:

  • sep=",": Columns are separated by ,
  • dec=".": Decimal sign is .
  • header=TRUE: The first row contains the column names

tab <- read.table("data/airquality.txt", sep = ",", dec = ".", header = TRUE)

read.table() returns a data frame.

class(tab)
## [1] "data.frame"
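
The effect of sep and dec can be tried without a file on disk, because read.table() also accepts the file content as a string through its text argument. The sketch below uses a small made-up table in the German convention described above (semi-colon as separator, comma as decimal sign). read.csv2() uses these two settings as its defaults.

```r
# Stand-in for a file exported on a German system (values made up)
txt <- "ID;Wind;Temp
1;7,4;67
2;8,0;72"

tab_de <- read.table(text = txt, sep = ";", dec = ",", header = TRUE)
tab_de$Wind            # parsed as numbers, not as text
## [1] 7.4 8.0
sapply(tab_de, class)  # class of each imported column
```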

Use head() to print the first rows of the data.frame (six by default) or tail() to print the last rows.

head(tab)
##   ID Ozone Solar Wind Temp Month Day
## 1  1    41   190  7.4   67     5   1
## 2  2    36   118  8.0   72     5   2
## 3  3    12   149 12.6   74     5   3
## 4  4    18   313 11.5   62     5   4
## 5  5    NA    NA 14.3   56     5   5
## 6  6    28    NA 14.9   66     5   6

Exporting data.frames to text files is similarly easy with write.table(). Its arguments and their default values are:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")

This example writes the data frame tab to a semi-colon delimited file.

write.table(tab, "data/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)

You can also use write.csv() or write.csv2() to export data frames to comma-delimited or semi-colon-delimited text files, respectively. write.csv2() also writes a comma as the decimal sign.

write.csv(tab, "data/airquality_output.csv")
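
A quick way to check an export is to read the file back in. The sketch below uses a small stand-in for tab and a temporary file created with tempfile(), so it does not depend on the data/ folder. row.names = FALSE omits the row-name column that write.csv() writes by default.

```r
# Small stand-in for the tab data frame (values taken from the head() output)
tab <- data.frame(ID = 1:3, Ozone = c(41, 36, NA), Wind = c(7.4, 8.0, 12.6))

out <- tempfile(fileext = ".csv")        # temporary path, chosen for this sketch
write.csv(tab, out, row.names = FALSE)   # omit the row-name column

tab2 <- read.csv(out)                    # read the file back in
names(tab2)
## [1] "ID"    "Ozone" "Wind"
is.na(tab2$Ozone)                        # the missing value survives the round trip
## [1] FALSE FALSE  TRUE
```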

Copyright © 2024 Humboldt-Universität zu Berlin. Department of Geography.