Multiple data values can be stored in various data structures:
Homogeneous (of the same type):
vector
matrix
Heterogeneous (of mixed types):
data frame
list
A matrix is a two-dimensional array. A matrix therefore carries an additional dim attribute holding its two dimensions, the number of rows and the number of columns, which you set with the arguments nrow and ncol.
m <- matrix(1:9, nrow = 3, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
See: dim(m), nrow(m), ncol(m)
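As a quick check, the three functions can be called on the matrix m from above. The matrix is recreated here only so that the snippet runs on its own:

```r
# Recreate the 3 x 3 matrix from the text
m <- matrix(1:9, nrow = 3, ncol = 3)

dim(m)   # both dimensions as an integer vector: rows, then columns
nrow(m)  # number of rows
ncol(m)  # number of columns

# The dimensions are stored in the "dim" attribute of the object
attributes(m)
```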
Note that by default the columns of the matrix will be filled first.
If you want to fill the matrix by row, you can specify this with the
byrow argument:
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
You can apply mathematical operators to matrices the same way as vectors (Attention: Recycling rule!):
m * 2
## [,1] [,2] [,3]
## [1,] 2 8 14
## [2,] 4 10 16
## [3,] 6 12 18
m * n
## [,1] [,2] [,3]
## [1,] 1 8 21
## [2,] 8 25 48
## [3,] 21 48 81
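Both operations above work element by element. The recycling rule becomes visible when a matrix is combined with a shorter vector. The following sketch uses a made-up vector of three factors to illustrate this:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# A matrix is stored column by column, so a vector of length 3 is
# recycled down each column: row 1 is multiplied by 10, row 2 by 100,
# row 3 by 1000.
scaled <- m * c(10, 100, 1000)
scaled
```

If the length of the vector does not match the number of rows, the values are still recycled in column order, which rarely gives what you intended.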
Like with vectors, you can access elements of a matrix with indices,
except that we now deal with two dimensions, [i, j] or [row, column]:
m
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
m[1, ]
## [1] 1 4 7
m[ , 1]
## [1] 1 2 3
m[1, 1]
## [1] 1
m[1:2, 3]
## [1] 7 8
m[1:2, c(1,3)]
## [,1] [,2]
## [1,] 1 7
## [2,] 2 8
When you extract elements from a matrix, the result can belong to a different class!
class(m)
## [1] "matrix" "array"
class(m[ , 3])
## [1] "integer"
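If you need the result to stay a matrix, the argument drop = FALSE of the bracket operator can be used. A minimal sketch (the names col3_vec and col3_mat are chosen for illustration only):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

col3_vec <- m[ , 3]                # dimensions are dropped: plain integer vector
col3_mat <- m[ , 3, drop = FALSE]  # dimensions are kept: 3 x 1 matrix

is.matrix(col3_vec)
is.matrix(col3_mat)
dim(col3_mat)
```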
The functions cbind() and rbind() glue (bind) vectors and matrices together:
# bind together by column
cbind(m,n)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 4 7 1 2 3
## [2,] 2 5 8 4 5 6
## [3,] 3 6 9 7 8 9
rbind() works the same way but binds by row:
# bind together by row
rbind(m,n)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [4,] 1 2 3
## [5,] 4 5 6
## [6,] 7 8 9
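The same functions also work on plain vectors. The sketch below binds two short example vectors (a and b are made up for this illustration):

```r
a <- 1:3
b <- 4:6

by_col <- cbind(a, b)  # 3 x 2 matrix, column names taken from the variable names
by_row <- rbind(a, b)  # 2 x 3 matrix, row names taken from the variable names

dim(by_col)
dim(by_row)
colnames(by_col)
```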
Lists are objects that can store values of different data types and
different objects. Lists can also contain other lists or lists of lists
of lists… You get the point. The list l below is created with three
slots filled with a numeric vector, the matrix m, and a character string.
l <- list(myValues=c(1, 2, 3), m, "Landsat")
l
## $myValues
## [1] 1 2 3
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## [[3]]
## [1] "Landsat"
Slots can be accessed by name (if there is one) and by position
(number). The example above creates only a name for the first slot
(myValues). The second and third slots are not named. To access the
elements of a list you need to use double brackets [[]]:
l[["myValues"]] # access first slot by name
## [1] 1 2 3
l[[1]] # access first slot by position
## [1] 1 2 3
l[[3]] # access third slot by position
## [1] "Landsat"
Named slots can also be accessed using the $ operator.
l$myValues
## [1] 1 2 3
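Why double brackets? Single brackets also work on lists, but they return a (shorter) list rather than the element itself. A small sketch using a reduced version of the list l:

```r
l <- list(myValues = c(1, 2, 3), "Landsat")

sub_list <- l["myValues"]    # single brackets: a list of length 1
values   <- l[["myValues"]]  # double brackets: the numeric vector itself

class(sub_list)
class(values)
sum(values)  # works on the vector; sum(sub_list) would raise an error
```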
You can also add new slots to an existing or empty list. The example below creates an empty list and adds a matrix to it.
newList <- list()
newList[["nMatrix"]] <- n
newList
## $nMatrix
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Why bother with lists?
Nearly all functions (e.g., linear regression, generalised linear modelling, t-test, etc.) in R produce output that is stored in a list. The good news is that R contains various functions to extract required information (e.g., estimated values, p-values, etc.) and presents it in nice tables. However, sometimes it can be useful to extract information from the results directly or you may want to use a list to return the output from your own functions.
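To make this concrete, here is a minimal sketch with a linear regression on made-up toy data. The values of x and y and the helper height_stats() are hypothetical and only serve as illustration:

```r
# Hypothetical toy data, made up for this illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

is.list(fit)       # the fitted model is a list under the hood
names(fit)         # names of its slots
fit$coefficients   # direct access to one slot
coef(fit)          # extractor function returning the same values

# Returning several results from your own function via a list
# (height_stats is a hypothetical helper name)
height_stats <- function(h) {
  list(mean = mean(h), range = range(h))
}
res <- height_stats(c(34, 38, 16, 40))
res$mean
```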
A data frame is a table. It is similar to a two-dimensional matrix, but the columns can contain different data types.
trees <- data.frame(treeid = 1001:1004,
species = c("Beech", "Fir", "Pine", "Fir"),
life = c(TRUE, FALSE, TRUE, TRUE),
height = c(34, 38, 16, 40),
diameter = c(40, 50, NA, 45),
age = c(110, NA, 40, 120)
)
trees
## treeid species life height diameter age
## 1 1001 Beech TRUE 34 40 110
## 2 1002 Fir FALSE 38 50 NA
## 3 1003 Pine TRUE 16 NA 40
## 4 1004 Fir TRUE 40 45 120
To retrieve the dimensions of the data.frame you can use the
functions dim(), nrow(), and ncol().
dim(trees)
## [1] 4 6
nrow(trees)
## [1] 4
ncol(trees)
## [1] 6
You can access the column names as follows:
names(trees)
## [1] "treeid" "species" "life" "height" "diameter" "age"
You can change all or individual column names:
names(trees)[3] <- "alive"
names(trees)
## [1] "treeid" "species" "alive" "height" "diameter" "age"
The summary() function prints a quick overview, which is
helpful for spotting data entry errors and NAs:
summary(trees)
## treeid species alive height
## Min. :1001 Length:4 Mode :logical Min. :16.0
## 1st Qu.:1002 Class :character FALSE:1 1st Qu.:29.5
## Median :1002 Mode :character TRUE :3 Median :36.0
## Mean :1002 Mean :32.0
## 3rd Qu.:1003 3rd Qu.:38.5
## Max. :1004 Max. :40.0
##
## diameter age
## Min. :40.0 Min. : 40
## 1st Qu.:42.5 1st Qu.: 75
## Median :45.0 Median :110
## Mean :45.0 Mean : 90
## 3rd Qu.:47.5 3rd Qu.:115
## Max. :50.0 Max. :120
## NA's :1 NA's :1
You can index (access) columns using three methods. However, it is usually best to extract columns from a data frame using the column names. You do not necessarily know the position of the column or the position may change.
trees$treeid
## [1] 1001 1002 1003 1004
trees[ , 1]
## [1] 1001 1002 1003 1004
trees[ , "treeid"] # or trees[ , c("treeid", "alive")]
## [1] 1001 1002 1003 1004
When you extract a single column this way, the result is a vector and not a data.frame.
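If a one-column data frame is needed instead, there are two common options. The sketch below uses a shortened copy of the trees data frame so that it runs on its own:

```r
# Shortened version of the trees data frame from the text
trees <- data.frame(treeid = 1001:1004,
                    height = c(34, 38, 16, 40))

v  <- trees[ , "treeid"]                # vector
d1 <- trees[ , "treeid", drop = FALSE]  # data frame with one column
d2 <- trees["treeid"]                   # list-style indexing: also a data frame

class(v)
class(d1)
class(d2)
```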
You can also index (access) rows using numerical indices. Remember,
the indexing format is data.frame[row_index, column_index].
For example, to access the 2nd and 3rd row you may type this:
trees[2:3, ]
## treeid species alive height diameter age
## 2 1002 Fir FALSE 38 50 NA
## 3 1003 Pine TRUE 16 NA 40
However, indexing rows using numerical indices is not very practical.
Usually, we want to select rows (observations) that meet certain
criteria defined in one or multiple columns. We can define these
criteria using logical operations and indexing. For example, let’s
select all trees taller than 30 m. Tree height is stored in the height
column. We can extract height from trees as a vector using
trees$height. Consequently, the logical operation trees$height > 30
returns a logical vector, where the first element corresponds to the
first tree of the data frame, the second element to the second tree,
the third element to the third tree and so forth. You can see that only
the 1st, 2nd, and 4th tree meet our height criterion.
trees$height > 30
## [1] TRUE TRUE FALSE TRUE
To select all rows based on this criterion, we use the logical vector (or operation) as index vector.
trees[trees$height > 30, ]
## treeid species alive height diameter age
## 1 1001 Beech TRUE 34 40 110
## 2 1002 Fir FALSE 38 50 NA
## 4 1004 Fir TRUE 40 45 120
You can combine multiple logical operations to make more complex queries. Let’s add the criterion that the trees also need to be alive.
trees[trees$height > 30 & trees$alive == TRUE, ]
## treeid species alive height diameter age
## 1 1001 Beech TRUE 34 40 110
## 4 1004 Fir TRUE 40 45 120
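A few variations on this query, sketched on a shortened copy of the trees data frame (with the life column already renamed to alive):

```r
trees <- data.frame(treeid = 1001:1004,
                    alive  = c(TRUE, FALSE, TRUE, TRUE),
                    height = c(34, 38, 16, 40))

# alive is already logical, so "== TRUE" can be left out
tall_alive <- trees[trees$height > 30 & trees$alive, ]

# | means "or", ! negates: trees that are dead or at most 30 m tall
dead_or_small <- trees[!trees$alive | trees$height <= 30, ]

tall_alive$treeid
dead_or_small$treeid
```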
Recall that the %in% operator evaluates whether an element is
contained in another vector. This allows us to check each element
against multiple choices. For example, we want to select all trees that
match a certain list (vector) of tree species.
conifer <- trees[trees$species %in% c("Spruce", "Fir", "Pine"), ]
conifer
## treeid species alive height diameter age
## 2 1002 Fir FALSE 38 50 NA
## 3 1003 Pine TRUE 16 NA 40
## 4 1004 Fir TRUE 40 45 120
Recall from last session that math operations and logical operations
with NA return NA. This is an issue when we want to select rows based
on columns that contain NA values.
trees$diameter > 40
## [1] FALSE TRUE NA TRUE
Subsetting (extracting rows from) a data.frame with missing values
using logical operations can give unexpected results. For example,
column diameter contains a missing value. Subsetting the data.frame
trees to all trees with a diameter greater than 40 returns the two
matching trees plus an extra row filled with NA.
trees[trees$diameter > 40, ]
## treeid species alive height diameter age
## 2 1002 Fir FALSE 38 50 NA
## NA NA <NA> NA NA NA NA
## 4 1004 Fir TRUE 40 45 120
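Two common ways to avoid the extra NA row are sketched below on a shortened copy of the data. Which one is appropriate depends on how you want to treat missing values:

```r
trees <- data.frame(treeid   = 1001:1004,
                    diameter = c(40, 50, NA, 45))

# which() returns the positions of TRUE values and silently drops NA
big1 <- trees[which(trees$diameter > 40), ]

# Alternatively, exclude missing values explicitly
big2 <- trees[!is.na(trees$diameter) & trees$diameter > 40, ]

big1
```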
Understandably, the analyst must consider what to do with missing
values prior to the analysis. Depending on the use case, different
courses of action can be appropriate, e.g., filling in missing values or
removing observations. A very crude but quick method is to remove all
observations with missing values. The function na.omit() returns only
observations with no missing values in any (!) of the columns. More
advanced and meaningful methods that interpolate missing values are
available in the zoo package.
trees_clean <- na.omit(trees)
trees_clean
## treeid species alive height diameter age
## 1 1001 Beech TRUE 34 40 110
## 4 1004 Fir TRUE 40 45 120
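Because na.omit() checks every column, it can remove more rows than necessary. If only some columns matter for an analysis, complete.cases() can be restricted to those columns. A sketch on a shortened copy of the data (the name trees_hd is chosen for illustration):

```r
trees <- data.frame(treeid   = 1001:1004,
                    height   = c(34, 38, 16, 40),
                    diameter = c(40, 50, NA, 45),
                    age      = c(110, NA, 40, 120))

# na.omit() looks at every column: trees 1002 and 1003 are dropped
nrow(na.omit(trees))

# complete.cases() restricted to the columns needed for the analysis:
# only tree 1003 (missing diameter) is dropped
keep <- complete.cases(trees[ , c("height", "diameter")])
trees_hd <- trees[keep, ]
trees_hd$treeid
```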
Recall, the $ operator is used to access columns in a data frame by
name. You can also use it to create a new column. The following example
creates a new column diameter_m that converts the diameter to meters
(the division by 100 assumes the diameter was recorded in centimeters).
trees$diameter_m <- trees$diameter / 100
We can also store a log-transformed version of a variable in a new column.
trees$logHeight <- log(trees$height)
trees
## treeid species alive height diameter age diameter_m logHeight
## 1 1001 Beech TRUE 34 40 110 0.40 3.526361
## 2 1002 Fir FALSE 38 50 NA 0.50 3.637586
## 3 1003 Pine TRUE 16 NA 40 NA 2.772589
## 4 1004 Fir TRUE 40 45 120 0.45 3.688879
Factors are a special type of vector that is used for coding
categorical variables in statistical models. Recall statistical data
types: continuous vs categorical (nominal vs ordinal). In our trees
dataset, the column species is a candidate for a factor variable.
Currently, the column is of type character.
trees$species
## [1] "Beech" "Fir" "Pine" "Fir"
The function factor() converts a character vector into a factor
vector. Let’s store this factor vector in a new column called class.
trees$class <- factor(trees$species)
trees$class
## [1] Beech Fir Pine Fir
## Levels: Beech Fir Pine
Factors contain a fixed number of categories, in R called levels. Our tree dataset has three levels: Beech, Fir, Pine. You can access the levels of a factor variable as follows:
levels(trees$class)
## [1] "Beech" "Fir" "Pine"
Accessing the levels gives you an overview of the species, but you can also use the function to overwrite (change) the names of the levels:
levels(trees$class) <- c("Fagus", "Abies", "Pinus")
trees
## treeid species alive height diameter age diameter_m logHeight class
## 1 1001 Beech TRUE 34 40 110 0.40 3.526361 Fagus
## 2 1002 Fir FALSE 38 50 NA 0.50 3.637586 Abies
## 3 1003 Pine TRUE 16 NA 40 NA 2.772589 Pinus
## 4 1004 Fir TRUE 40 45 120 0.45 3.688879 Abies
Check out the help for the factor() function. The function also takes
an argument levels and an argument labels. That means you can change
the names of the levels (categories) with one call of the factor()
function.
trees$class <- factor(trees$species, levels=c("Beech", "Fir", "Pine"),
labels=c("Fagus", "Abies", "Pinus"))
trees
## treeid species alive height diameter age diameter_m logHeight class
## 1 1001 Beech TRUE 34 40 110 0.40 3.526361 Fagus
## 2 1002 Fir FALSE 38 50 NA 0.50 3.637586 Abies
## 3 1003 Pine TRUE 16 NA 40 NA 2.772589 Pinus
## 4 1004 Fir TRUE 40 45 120 0.45 3.688879 Abies
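The levels argument also controls the order of the levels, which by default is alphabetical. The following sketch shows the effect on the integer codes that R stores internally (f_default and f_custom are names chosen for illustration):

```r
species <- c("Beech", "Fir", "Pine", "Fir")

f_default <- factor(species)                                      # levels sorted alphabetically
f_custom  <- factor(species, levels = c("Pine", "Fir", "Beech"))  # levels in the given order

levels(f_default)
levels(f_custom)

# Internally each value is stored as the integer position of its level
as.integer(f_default)
as.integer(f_custom)

table(f_custom)  # counts per level, in level order
```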
Note, you cannot simply add values to a factor that are not specified
in levels. Below, I try to change the level of the first tree. Since the
level does not exist in the factor variable, the entry gets deleted and
replaced with NA. This is bad.
trees$class[1] <- "Oak"
## Warning in `[<-.factor`(`*tmp*`, 1, value = structure(c(NA, 2L, 3L, 2L), levels
## = c("Fagus", : invalid factor level, NA generated
trees
## treeid species alive height diameter age diameter_m logHeight class
## 1 1001 Beech TRUE 34 40 110 0.40 3.526361 <NA>
## 2 1002 Fir FALSE 38 50 NA 0.50 3.637586 Abies
## 3 1003 Pine TRUE 16 NA 40 NA 2.772589 Pinus
## 4 1004 Fir TRUE 40 45 120 0.45 3.688879 Abies
Instead, you first need to add a level.
levels(trees$class) <- c(levels(trees$class), "Oak")
trees$class
## [1] <NA> Abies Pinus Abies
## Levels: Fagus Abies Pinus Oak
Then you can change the species class of the first tree to Oak.
trees$class[1] <- "Oak"
trees
## treeid species alive height diameter age diameter_m logHeight class
## 1 1001 Beech TRUE 34 40 110 0.40 3.526361 Oak
## 2 1002 Fir FALSE 38 50 NA 0.50 3.637586 Abies
## 3 1003 Pine TRUE 16 NA 40 NA 2.772589 Pinus
## 4 1004 Fir TRUE 40 45 120 0.45 3.688879 Abies
Sometimes, you may want to create classes from a continuous variable.
For example, below we convert the tree heights into a factor
(categorical) variable with two levels: small and tall. The function
cut() divides a numeric variable into intervals and codes them into
factors:
trees$heightClass <- cut(trees$height, c(0, 30, 100), labels=c('small', 'tall'))
trees
## treeid species alive height diameter age diameter_m logHeight class
## 1 1001 Beech TRUE 34 40 110 0.40 3.526361 Oak
## 2 1002 Fir FALSE 38 50 NA 0.50 3.637586 Abies
## 3 1003 Pine TRUE 16 NA 40 NA 2.772589 Pinus
## 4 1004 Fir TRUE 40 45 120 0.45 3.688879 Abies
## heightClass
## 1 tall
## 2 tall
## 3 small
## 4 tall
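It is worth knowing how cut() treats values that fall exactly on a break. By default the intervals are closed on the right, so a height of exactly 30 counts as small. The sketch below adds a hypothetical tree of 30 m to show this:

```r
height <- c(34, 38, 16, 40, 30)  # the last value (30) is added to show the boundary case

# Without labels, the levels are named after the intervals
default_classes <- cut(height, breaks = c(0, 30, 100))
levels(default_classes)

# right = FALSE closes the intervals on the left instead: [0,30) and [30,100)
left_closed <- cut(height, breaks = c(0, 30, 100), right = FALSE,
                   labels = c("small", "tall"))
as.character(left_closed)
```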
R can read a variety of dataset formats such as
Text (ASCII) files are a popular data storage and exchange format.
They can be read on any OS platform without special software. The most
common text files separate data columns by comma (csv), semi-colon, or
tab. On German (and other) systems, the comma is already reserved as
the decimal sign, so the semi-colon or tab separation is sometimes
preferred there. On English systems, the dot (.) is used as the decimal
sign; on other systems it may be used to group digits for readability,
e.g. 1.000.000.
The read.table() function is the most generic function to read table
data from various text files. The function accepts several arguments to
accommodate different data formats. See ?read.table. Important
formatting options are:
sep=",": Columns are separated by ,
dec=".": Decimal sign is .
header=TRUE: The first row contains the column names
tab <- read.table("data/airquality.txt", sep = ",", dec = ".", header = TRUE)
read.table() returns a data frame.
class(tab)
## [1] "data.frame"
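To see the effect of sep and dec without needing a file, read.table() can also parse a character string via its text argument. The content below is made up for this illustration and follows the German convention:

```r
# Made-up file content in the German convention:
# semi-colon as column separator, comma as decimal sign
txt <- "ID;Wind;Temp
1;7,4;67
2;8,0;72
3;12,6;74"

# text = lets read.table() parse a character string instead of a file
tab_de <- read.table(text = txt, sep = ";", dec = ",", header = TRUE)

str(tab_de)   # Wind is numeric because dec = "," was specified
tab_de$Wind
```

With the wrong dec setting, the Wind column would typically be read as text rather than numbers, so it is worth checking the column types with str() after import.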
Use head() to print the first couple of rows of the data.frame or
tail() to print the last rows.
head(tab)
## ID Ozone Solar Wind Temp Month Day
## 1 1 41 190 7.4 67 5 1
## 2 2 36 118 8.0 72 5 2
## 3 3 12 149 12.6 74 5 3
## 4 4 18 313 11.5 62 5 4
## 5 5 NA NA 14.3 56 5 5
## 6 6 28 NA 14.9 66 5 6
Exporting data.frames to text files is similarly easy with write.table():
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
This example writes the data frame tab to a semi-colon delimited file.
write.table(tab, "data/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)
You can also use write.csv() or write.csv2() to export data frames to
comma-delimited or semi-colon delimited text files, respectively
(write.csv2() additionally writes a comma as the decimal sign).
write.csv(tab, "data/airquality_output.csv")
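One detail to keep in mind: write.csv() writes the row names as an extra first column unless you switch this off. The sketch below uses a small made-up data frame and a temporary file, so it does not depend on the data/ folder:

```r
df <- data.frame(ID = 1:3, Wind = c(7.4, 8.0, 12.6))

f <- tempfile(fileext = ".csv")  # temporary file, so nothing in data/ is touched

# By default write.csv() also writes the row names as a first column
write.csv(df, f)
with_rn <- read.csv(f)
names(with_rn)

# row.names = FALSE writes only the data columns
write.csv(df, f, row.names = FALSE)
back <- read.csv(f)
names(back)
```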
Copyright © 2024 Humboldt-Universität zu Berlin. Department of Geography.