Content

  • More R data structures
  • Data import and export
  • Simple data manipulation

Data structures

Multiple data values can be stored in various data structures:

Homogeneous (of the same type):

  • vector

  • matrix

Heterogeneous (of mixed types):

  • data frame

  • list
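
A minimal sketch of what "homogeneous" means in practice: when values of different types are combined in a vector, R coerces them to one common type, whereas a list keeps each element as it is. The object names are made up for illustration.

```r
# a vector can hold only one type, so the mixed input is coerced to character
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)
## [1] "character"

# a list keeps the original type of each element
mixed_list <- list(1, "a", TRUE)
sapply(mixed_list, class)
## [1] "numeric"   "character" "logical"
```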


Factors (vectors)

Factors are a special type of vector used to code categorical variables in statistical models. Recall the statistical data types: continuous vs. categorical (nominal vs. ordinal). A factor can only take a fixed set of values, called levels (categories).

Examples of categorical variables are gender or tree species:

# character vector
treespecies_char <- c("SP", "PI", "FI", "FI", "PI")
treespecies_char
## [1] "SP" "PI" "FI" "FI" "PI"
# convert character vector to factor
treespecies <- factor(treespecies_char)
treespecies
## [1] SP PI FI FI PI
## Levels: FI PI SP

Levels are the category/class names/labels, and they are of data type character.

levels(treespecies)
## [1] "FI" "PI" "SP"
typeof(levels(treespecies))
## [1] "character"

You can change the names of the levels (categories) as follows:

levels(treespecies) <- c("Fir", "Pine", "Spruce")
treespecies
## [1] Spruce Pine   Fir    Fir    Pine  
## Levels: Fir Pine Spruce

Check out the help for the factor() function. The function also takes an argument levels and an argument labels. That means, you can also change the names of the levels (categories) inside the factor() function, i.e. when creating the factor variable.

treespecies2 <- factor(treespecies_char, levels=c("FI", "PI", "SP"), labels=c("Fir", "Pine", "Spruce"))
treespecies2
## [1] Spruce Pine   Fir    Fir    Pine  
## Levels: Fir Pine Spruce

Note, you cannot simply add values to a factor that are not specified in levels:

treespecies[1] <- "Oak"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "Oak"): invalid factor level, NA
## generated
treespecies
## [1] <NA> Pine Fir  Fir  Pine
## Levels: Fir Pine Spruce

Instead, you first need to add a level.

levels(treespecies2) <- c(levels(treespecies2), "Oak")
treespecies2
## [1] Spruce Pine   Fir    Fir    Pine  
## Levels: Fir Pine Spruce Oak
# now you can change the first tree species element to Oak
treespecies2[1] <- "Oak"
treespecies2
## [1] Oak  Pine Fir  Fir  Pine
## Levels: Fir Pine Spruce Oak
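
The factors above are nominal. For ordinal variables, factor() can also create an ordered factor. The following is a small sketch with made-up size classes (the data and object names are hypothetical):

```r
# hypothetical example data: tree size classes
size_char <- c("small", "large", "medium", "small")

# levels gives the order from low to high; ordered = TRUE makes the factor ordinal
size <- factor(size_char, levels = c("small", "medium", "large"), ordered = TRUE)
size
## [1] small  large  medium small 
## Levels: small < medium < large

# comparisons are allowed for ordered factors
size < "large"
## [1]  TRUE FALSE  TRUE  TRUE

# count the observations per level
table(size)
```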

Matrices

A matrix is a two-dimensional array. It therefore has an additional attribute, dim, that stores the two dimensions: the number of rows (nrow) and the number of columns (ncol).

m <- matrix(1:9, nrow = 3, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

See: dim(m), nrow(m), ncol(m)
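
As a quick illustration, these functions return the following for m and for a non-square matrix (m2 is a made-up object for this sketch):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
dim(m)     # number of rows, number of columns
## [1] 3 3
nrow(m)
## [1] 3
ncol(m)
## [1] 3

# if only nrow is given, ncol is derived from the length of the data
m2 <- matrix(1:6, nrow = 2)
dim(m2)
## [1] 2 3
```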

Note that by default the columns of the matrix will be filled first.

If you want to fill the matrix by row, you can specify this with the byrow argument:

n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

You can apply mathematical operators to matrices in the same way as to vectors, i.e. element by element (Attention: recycling rule!):

m * 2
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
m * n
##      [,1] [,2] [,3]
## [1,]    1    8   21
## [2,]    8   25   48
## [3,]   21   48   81
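
A short sketch of the recycling rule with a matrix and a shorter vector. It also shows that the element-wise product m * n is not the matrix product, for which R provides the %*% operator:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# recycling: the vector is reused down each column of the matrix
m + c(10, 20, 30)
##      [,1] [,2] [,3]
## [1,]   11   14   17
## [2,]   22   25   28
## [3,]   33   36   39

# element-wise product vs. matrix product (first element only)
(m * n)[1, 1]
## [1] 1
(m %*% n)[1, 1]
## [1] 66
```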

As with vectors, you can access elements of a matrix with indices, except that we now deal with two dimensions [i, j], i.e. [row, column]:

m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m[1, ]
## [1] 1 4 7
m[ , 1]
## [1] 1 2 3
m[1, 1]
## [1] 1
m[1:2, 3]
## [1] 7 8
m[1:2, c(1,3)]
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8

When you extract elements from a matrix, the result can belong to a different class!

class(m)
## [1] "matrix" "array"
class(m[ , 3])
## [1] "integer"
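
If the result should stay a matrix, the drop argument of the index operator can be set to FALSE. A minimal sketch (col3 is a made-up object name):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# drop = FALSE keeps the two dimensions, even for a single column
col3 <- m[ , 3, drop = FALSE]
col3
##      [,1]
## [1,]    7
## [2,]    8
## [3,]    9
dim(col3)
## [1] 3 1
is.matrix(col3)
## [1] TRUE
```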

The functions cbind() and rbind() glue (bind) vectors and matrices together:

# bind together by column
cbind(m,n)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7    1    2    3
## [2,]    2    5    8    4    5    6
## [3,]    3    6    9    7    8    9

# bind together by row
rbind(m,n)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]    1    2    3
## [5,]    4    5    6
## [6,]    7    8    9

Lists

Lists are objects that can store values of different data types (and different kinds of objects):

l <- list(c(1, 2, 3), m, "a")

To access the elements of a list with indices you need to use double brackets [[]]:

l[[1]]
## [1] 1 2 3
l[[3]]
## [1] "a"
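
The difference between single and double brackets can be sketched as follows. List elements can also be given names and then be accessed with $ (the names below are made up for illustration):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
l <- list(c(1, 2, 3), m, "a")

class(l[1])      # single brackets return a list with one element
## [1] "list"
class(l[[1]])    # double brackets return the element itself
## [1] "numeric"

# named list elements can be accessed with $
l_named <- list(numbers = c(1, 2, 3), letter = "a")
l_named$numbers
## [1] 1 2 3
```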


Why bother with lists?

Many functions in R (e.g., linear regression, generalised linear modelling, t-test) return their output as a list. The good news is that R provides various functions that extract the required information (e.g., estimated values, p-values) and present it in nice tables. However, it is sometimes useful to extract information from the results directly, or you may want to use a list to return the output of your own functions.
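
As a sketch of this idea, the following fits a simple linear regression with lm() on the built-in airquality dataset and looks at the returned object. The choice of model is arbitrary and only serves as an illustration:

```r
# fit a simple linear regression (Ozone explained by Temp)
fit <- lm(Ozone ~ Temp, data = airquality)

is.list(fit)        # the model object is stored as a list
## [1] TRUE
names(fit)          # names of the list elements

fit$coefficients    # extract an element directly by name
coef(fit)           # extractor function that returns the same values
```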


Data frames

A data frame is what you may call a data table. It is similar to a two-dimensional matrix but the columns can contain different data types.

df <- data.frame(TREEID = 1001:1003, 
                 SPECIES = factor(c("Spruce", "Fir", "Pine")), 
                 LIFE = c(TRUE, FALSE, TRUE),
                 HEIGHT = c(34, 21, 26)
                 )
df
##   TREEID SPECIES  LIFE HEIGHT
## 1   1001  Spruce  TRUE     34
## 2   1002     Fir FALSE     21
## 3   1003    Pine  TRUE     26

The summary() function gives a quick overview. It is helpful for spotting data entry errors and NAs:

summary(df)
##      TREEID       SPECIES     LIFE             HEIGHT    
##  Min.   :1001   Fir   :1   Mode :logical   Min.   :21.0  
##  1st Qu.:1002   Pine  :1   FALSE:1         1st Qu.:23.5  
##  Median :1002   Spruce:1   TRUE :2         Median :26.0  
##  Mean   :1002                              Mean   :27.0  
##  3rd Qu.:1002                              3rd Qu.:30.0  
##  Max.   :1003                              Max.   :34.0

You can index (access) columns in three main ways:

df$TREEID
## [1] 1001 1002 1003
df[ , 1]
## [1] 1001 1002 1003
df[ , "TREEID"]
## [1] 1001 1002 1003

Rows are indexed by row number:

df[3, ]
##   TREEID SPECIES LIFE HEIGHT
## 3   1003    Pine TRUE     26
df[1:2, "TREEID"]
## [1] 1001 1002
df[1, c("TREEID", "HEIGHT")]
##   TREEID HEIGHT
## 1   1001     34

IMPORTANT: Extracting a row does not change the class, but extracting a single column does!

class(df)
## [1] "data.frame"
class(df[ 1, ])
## [1] "data.frame"
class(df[ , "TREEID"])
## [1] "integer"
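
If a single column should remain a data frame, this can be requested explicitly. A minimal sketch with a reduced version of the data frame from above:

```r
df <- data.frame(TREEID = 1001:1003, HEIGHT = c(34, 21, 26))

class(df[ , "TREEID", drop = FALSE])   # drop = FALSE keeps the data frame
## [1] "data.frame"
class(df["TREEID"])                    # single brackets without comma also keep it
## [1] "data.frame"
class(df[["TREEID"]])                  # double brackets return the column vector
## [1] "integer"
```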

Data import

R can read a variety of dataset formats. This section covers text files.


Text files

Text (ASCII) files are a popular data storage and exchange format. They can be read on any operating system and without specialised software (a simple text editor such as Notepad is enough).

Comma-separated (csv) or tab-separated files are common formats to store table data (in systems with English locale).

BE AWARE: On German (and other) systems, the comma is used as the decimal separator, so the semicolon or the tab is often preferred as the column separator there. On English systems, the dot (.) is the decimal separator; on other systems it may instead be used to group digits for readability, e.g. 1.000.000.

The read.table() function can be used to read table data from text files. The function has several arguments to accommodate the different data formats. See ?read.table

The following example specifies the following format options:

  • The columns are separated by a comma: sep = ","
  • The decimal sign is a dot: dec = "."
  • The first row contains the column names: header = TRUE

tab <- read.table("data/airquality.txt", sep = ",", dec = ".", header = TRUE)

read.table() returns a data frame.

class(tab)
## [1] "data.frame"
names(tab)
## [1] "ID"    "Ozone" "Solar" "Wind"  "Temp"  "Month" "Day"
head(tab, 3)
##   ID Ozone Solar Wind Temp Month Day
## 1  1    41   190  7.4   67     5   1
## 2  2    36   118  8.0   72     5   2
## 3  3    12   149 12.6   74     5   3

read.csv() is a shortcut for read.table() for text files with comma-separated columns.
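
For files in the German-style format (semicolon as column separator, comma as decimal sign), the same options can be set accordingly. The sketch below writes a small made-up file to a temporary location so that it runs on its own; the file content and object names are assumptions for illustration:

```r
# write a small semicolon-separated file with decimal commas to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID;Wind;Temp", "1;7,4;67", "2;8,0;72"), tmp)

# set the format options explicitly in read.table()
tab_de <- read.table(tmp, sep = ";", dec = ",", header = TRUE)
tab_de$Wind
## [1] 7.4 8.0

# read.csv2() uses sep = ";" and dec = "," by default
tab_de2 <- read.csv2(tmp)
tab_de2$Wind
## [1] 7.4 8.0
```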


Data export

Export a data frame to a semicolon-delimited text file:

write.table(tab, "data/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)

You can also use write.csv() to export data frames to comma-delimited text files:

write.csv(tab, "data/airquality_output.csv")

For reference, the full set of arguments of write.table(), with their default values, is:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
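
Note that row names are written by default (row.names = TRUE). The following sketch, using a small made-up data frame and temporary files, shows the effect when the file is read back in:

```r
small <- data.frame(var1 = c(1, 3), var2 = c(3, 4))

# default: the row names are written as an extra, unnamed first column
f1 <- tempfile(fileext = ".csv")
write.csv(small, f1)
names(read.csv(f1))
## [1] "X"    "var1" "var2"

# row.names = FALSE writes only the data columns
f2 <- tempfile(fileext = ".csv")
write.csv(small, f2, row.names = FALSE)
names(read.csv(f2))
## [1] "var1" "var2"
```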

Missing values

Recall from last session that NA is used for missing values in R:

x <- c(1, 5, 3, 6, NA, 9, 21, 4)
x
## [1]  1  5  3  6 NA  9 21  4

...and that you must use is.na() to determine whether an element is missing:

is.na(x)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
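
A brief sketch of why is.na() is needed, and of how its result can be used to count and locate missing values:

```r
x <- c(1, 5, 3, 6, NA, 9, 21, 4)

# a comparison with NA does not work: the result is NA for every element
x == NA
## [1] NA NA NA NA NA NA NA NA

sum(is.na(x))      # number of missing values
## [1] 1
which(is.na(x))    # position of the missing values
## [1] 5
anyNA(x)           # is there at least one missing value?
## [1] TRUE
```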

It is not uncommon to have missing values in datasets, e.g. in data frames and matrices:

df <- data.frame(var1 = c(1, 3, 12, NA, 5), 
                 var2 = c(3, 4, 1, 8, 11))
df
##   var1 var2
## 1    1    3
## 2    3    4
## 3   12    1
## 4   NA    8
## 5    5   11

Use na.omit() to remove all rows of a data frame that contain NAs:

na.omit(df)
##   var1 var2
## 1    1    3
## 2    3    4
## 3   12    1
## 5    5   11
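
A related function is complete.cases(), which returns a logical vector that marks the rows without NAs. A minimal sketch with the same data frame:

```r
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))

# TRUE for rows without any missing value
complete.cases(df)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE

# using it as a row index keeps the same rows as na.omit()
nrow(df[complete.cases(df), ])
## [1] 4
nrow(na.omit(df))
## [1] 4
```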

Also recall that arithmetic functions and operations applied to NAs return NA:

NA * 3
## [1] NA

Many arithmetic functions allow you to specify whether to ignore or include NAs:

sum(df$var1)
## [1] NA
sum(df$var1, na.rm=TRUE)
## [1] 21

Text manipulation

Combine two or more character variables with paste():

paste("Hello", "World", sep = "_")
## [1] "Hello_World"
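
paste() also works element by element on vectors. A short sketch with made-up labels, including the related paste0() function and the collapse argument:

```r
# vectorised: "Tree" is recycled for each number
paste("Tree", 1:3, sep = "_")
## [1] "Tree_1" "Tree_2" "Tree_3"

# paste0() is paste() without a separator
paste0("Tree", 1:3)
## [1] "Tree1" "Tree2" "Tree3"

# collapse joins the elements of one vector into a single string
paste(c("SP", "PI", "FI"), collapse = ", ")
## [1] "SP, PI, FI"
```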

Extract a portion of a character variable with substring():

substring("Hello World", first = 3, last = 8)
## [1] "llo Wo"
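
substring() is vectorised as well. The sketch below derives two-letter codes from species names (the vector is made up for illustration):

```r
species <- c("Spruce", "Pine", "Fir")

# first two characters of each element
substring(species, first = 1, last = 2)
## [1] "Sp" "Pi" "Fi"

# combined with toupper() to get upper-case codes
toupper(substring(species, 1, 2))
## [1] "SP" "PI" "FI"

# number of characters per element
nchar(species)
## [1] 6 4 3
```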

Subset columns

It is best to extract columns from a data frame using the column names:

names(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
airquality_ozone <- airquality$Ozone
airquality_ozone <- airquality[, "Ozone"]
airquality_ozone_temp <- airquality[, c("Ozone", "Temp")]

Subset rows

Subset rows using logical operations on variables (columns):

airquality_temp_gr_70 <- airquality[airquality$Temp > 70, ]
nrow(airquality_temp_gr_70)
## [1] 120
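
One pitfall of logical subsetting: if the column used in the condition contains NAs, the condition is NA for those rows, and R returns a row filled with NAs for each of them. A sketch with the small data frame from the missing-values section; wrapping the condition in which() keeps only the rows where it is TRUE:

```r
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))

# the NA in var1 produces an additional row of NAs
nrow(df[df$var1 > 2, ])
## [1] 4

# which() returns only the indices where the condition is TRUE
nrow(df[which(df$var1 > 2), ])
## [1] 3
```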

Subset rows based on row indices:

airquality_zeile_1_100 <- airquality[1:100, ]
nrow(airquality_zeile_1_100)
## [1] 100

You can combine logical operators to build more complex subsets of rows and columns. As a starting point, a single condition selects all measurements from June:

airquality_juni <- airquality[airquality$Month == 6, ]

Two conditions combined with & (and) select all measurements from 15 June:

airquality_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, ]

Or return only the Ozone values (column) from 15 June:

ozone_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, "Ozone"]

In the logical operations covered previously, a vector or matrix (or a single value) was compared to a single value, e.g. c(1,2,3) < 2.

The %in% operator can be applied to two vectors: x %in% y. For each element in vector x, %in% evaluates if the element is contained in vector y. The operator returns a logical vector of the same length as vector x.

The following example returns all rows for the months June and July.

airquality_jun_jul <- airquality[airquality$Month %in% c(6, 7), ]
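
To make the behaviour of %in% concrete, here is a small sketch. It also shows the equivalent selection with two comparisons combined by | (or); the object name jun_jul_or is made up:

```r
# one logical value for each element of the left-hand vector
c(5, 6, 7, 8) %in% c(6, 7)
## [1] FALSE  TRUE  TRUE FALSE

# the same rows as with %in%, written with two comparisons and | (or)
jun_jul_or <- airquality[airquality$Month == 6 | airquality$Month == 7, ]
nrow(jun_jul_or)
## [1] 61
```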

Create a new variable

Recall that the $ operator is used to access columns in a data frame by name. You can also use it to create a new column.

The following example creates a new column with the name “NewVariable” and fills it with Ozone values multiplied by 100.

airquality$NewVariable <- airquality$Ozone * 100

Or we create a log-transformed variable.

airquality$logOzone <- log(airquality$Ozone)
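
Another sketch in the same spirit: converting the temperature from degrees Fahrenheit (the unit of Temp in the built-in airquality dataset) to degrees Celsius. The copy aq and the column name TempC are made up for this example. Assigning NULL removes a column again:

```r
aq <- airquality    # work on a copy, so the original stays unchanged

# new column: temperature in degrees Celsius
aq$TempC <- (aq$Temp - 32) * 5 / 9
round(head(aq$TempC, 3), 1)
## [1] 19.4 22.2 23.3

# assigning NULL deletes the column
aq$TempC <- NULL
"TempC" %in% names(aq)
## [1] FALSE
```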

Continuous variable to factor

Sometimes, you may want to create classes from a continuous variable. For example, below we convert temperature data into a factor (categorical) variable with two levels (warm and cold).

  1. Manual recoding

First, add an empty character vector:

airquality$TempClass <- rep("", nrow(airquality))

Then, recode the character vector based on logical operations on $Temp:

airquality[airquality$Temp >= 70 , "TempClass"] <- "warm"
airquality[airquality$Temp < 70 , "TempClass"] <- "cold"
airquality$TempClass <- factor(airquality$TempClass)

  2. Automatic recoding using cut()

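The function cut() divides a numeric variable into intervals and codes them as factor levels (categorical data). By default the intervals are closed on the right, so the breaks c(0, 69, 100) create the classes (0, 69] and (69, 100]: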

airquality$TempClass2 <- cut(airquality$Temp, c(0, 69, 100), labels=c('cold', 'warm'))
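
One of these other ways is ifelse(). The following sketch recodes the temperature with ifelse() and checks that the result agrees with the cut() version. It works on a copy of the built-in dataset; the names aq and TempClass3 are made up:

```r
aq <- airquality    # work on a copy

# recoding with cut(), as above
aq$TempClass2 <- cut(aq$Temp, c(0, 69, 100), labels = c("cold", "warm"))

# recoding with ifelse(): "warm" where the condition is TRUE, "cold" otherwise
aq$TempClass3 <- factor(ifelse(aq$Temp >= 70, "warm", "cold"))

# cross-tabulate the two versions to compare them
table(aq$TempClass2, aq$TempClass3)

# both approaches assign the same class to every row
all(as.character(aq$TempClass2) == as.character(aq$TempClass3))
## [1] TRUE
```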

Note that there are more ways to do the same task!


Copyright © 2024 Humboldt-Universität zu Berlin. Department of Geography.