Content
- More R data structures
- Data import and export
- Simple data manipulation
Multiple data values can be stored in various data structures:
- Homogeneous (of the same type): vector, matrix
- Heterogeneous (of mixed types): data frame, list
Factors are a special type of vector used for coding categorical variables in statistical models. Recall the statistical data types: continuous vs. categorical (nominal vs. ordinal). A factor can only take values from a fixed set, called levels (categories).
Examples of categorical variables that are stored as factors are gender or tree species:
# character vector
treespecies_char <- c("SP", "PI", "FI", "FI", "PI")
treespecies_char
## [1] "SP" "PI" "FI" "FI" "PI"
# convert character vector to factor
treespecies <- factor(treespecies_char)
treespecies
## [1] SP PI FI FI PI
## Levels: FI PI SP
Levels are the category (class) names or labels, and they are of data type character.
levels(treespecies)
## [1] "FI" "PI" "SP"
typeof(levels(treespecies))
## [1] "character"
You can change the names of the levels (categories) as follows:
levels(treespecies) <- c("Fir", "Pine", "Spruce")
treespecies
## [1] Spruce Pine Fir Fir Pine
## Levels: Fir Pine Spruce
Check out the help for the factor() function (?factor). The function also takes the arguments levels and labels. That means you can also set the names of the levels (categories) inside the factor() function, i.e. when creating the factor variable.
treespecies2 <- factor(treespecies_char, levels=c("FI", "PI", "SP"), labels=c("Fir", "Pine", "Spruce"))
treespecies2
## [1] Spruce Pine Fir Fir Pine
## Levels: Fir Pine Spruce
Note that you cannot simply assign a value to a factor if that value is not one of its levels:
treespecies[1] <- "Oak"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "Oak"): invalid factor level, NA
## generated
treespecies
## [1] <NA> Pine Fir Fir Pine
## Levels: Fir Pine Spruce
Instead, you first need to add a level.
levels(treespecies2) <- c(levels(treespecies2), "Oak")
treespecies2
## [1] Spruce Pine Fir Fir Pine
## Levels: Fir Pine Spruce Oak
# now you can change the first tree species element to Oak
treespecies2[1] <- "Oak"
treespecies2
## [1] Oak Pine Fir Fir Pine
## Levels: Fir Pine Spruce Oak
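The factor behaviour shown above can be explored a little further. The following is a minimal sketch, not part of the original material: the object names codes, counts and size are illustrative, and the size classes are made-up values. It shows that a factor stores integer codes that point to its levels, and how an ordinal variable could be coded as an ordered factor.

```r
# A factor stores integer codes that point to its levels
treespecies_char <- c("SP", "PI", "FI", "FI", "PI")
treespecies <- factor(treespecies_char)
codes <- as.integer(treespecies)   # positions within levels(treespecies)
codes

# Count the observations per level
counts <- table(treespecies)
counts

# Ordinal data: an ordered factor keeps the order of its levels
# (made-up size classes, for illustration only)
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size < "large"   # comparisons follow the level order
```

Because the codes refer to the alphabetically sorted levels by default, converting a factor with as.integer() returns the codes and not the original labels.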
A matrix is a two-dimensional array. A matrix therefore has an additional attribute (dim) that specifies its two dimensions: the number of rows (nrow) and the number of columns (ncol).
m <- matrix(1:9, nrow = 3, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
See: dim(m), nrow(m), ncol(m)
Note that by default the matrix is filled column by column. If you want to fill the matrix by row, you can specify this with the byrow argument:
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
n
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
You can apply arithmetic operators to matrices the same way as to vectors. The operators work element by element (attention: recycling rule!):
m * 2
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
m * n
##      [,1] [,2] [,3]
## [1,]    1    8   21
## [2,]    8   25   48
## [3,]   21   48   81
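To make the element-wise behaviour and the recycling rule more concrete, here is a short sketch that is not part of the original material. It contrasts the * operator with the matrix product operator %*% and adds a vector of length three to a matrix; the object names elementwise, matprod and recycled are illustrative.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
n <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Element-wise product: each element of m times the matching element of n
elementwise <- m * n

# Matrix product (rows of m times columns of n) uses a different operator
matprod <- m %*% n
matprod

# Recycling: the vector is reused down the columns,
# so row 1 gets +10, row 2 gets +20 and row 3 gets +30
recycled <- m + c(10, 20, 30)
recycled
```

If you expect linear algebra results, check that you used %*% and not *.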
Like with vectors, you can access elements of a matrix with indices, except that we now deal with two dimensions, [i, j] or [row, column]:
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m[1, ]
## [1] 1 4 7
m[ , 1]
## [1] 1 2 3
m[1, 1]
## [1] 1
m[1:2, 3]
## [1] 7 8
m[1:2, c(1,3)]
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
When you extract elements from a matrix, the result can belong to a different class!
class(m)
## [1] "matrix" "array"
class(m[ , 3])
## [1] "integer"
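If the result should stay a matrix, the drop argument of the [ operator can be set to FALSE. A minimal sketch (the names col3_vector and col3_matrix are illustrative):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

col3_vector <- m[, 3]                # dimensions are dropped: plain vector
col3_matrix <- m[, 3, drop = FALSE]  # dimensions are kept: 3 x 1 matrix

class(col3_vector)
dim(col3_matrix)
```

This can matter when later code relies on nrow() or ncol(), which return NULL for a plain vector.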
The functions cbind() and rbind() glue (bind) vectors and matrices together:
# bind together by column
cbind(m,n)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7    1    2    3
## [2,]    2    5    8    4    5    6
## [3,]    3    6    9    7    8    9
# bind together by row
rbind(m,n)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]    1    2    3
## [5,]    4    5    6
## [6,]    7    8    9
Lists are objects that can store values of different data types (and different kinds of objects):
l <- list(c(1, 2, 3), m, "a")
To access the elements of a list with indices you need to use double brackets [[]]:
l[[1]]
## [1] 1 2 3
l[[3]]
## [1] "a"
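The difference between single and double brackets is easy to miss, so here is a small sketch. It assumes a list with named elements; the names numbers, mat and letter are chosen only for illustration.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
l_named <- list(numbers = c(1, 2, 3), mat = m, letter = "a")

# Double brackets return the element itself
first_element <- l_named[[1]]
class(first_element)

# Single brackets return a (shorter) list
first_sublist <- l_named[1]
class(first_sublist)

# Named elements can also be accessed by name
l_named[["letter"]]
l_named$numbers
```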
Why bother with lists?
Many functions in R (e.g., linear regression, generalised linear modelling, t-test, etc.) produce output that is stored in a list. The good news is that R contains various functions that extract the required information (e.g., estimated values, p-values, etc.) and present it in nice tables. However, sometimes it can be useful to extract information from the results directly, or you may want to use a list to return the output from your own functions.
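As an illustration of the point above, the following sketch runs a one-sample t-test on made-up tree heights and pulls individual results out of the returned list. The data and the object names heights and res are assumptions for this example only.

```r
# Made-up tree heights (illustration only)
heights <- c(34, 21, 26, 30, 28)

# One-sample t-test against a hypothetical mean of 25
res <- t.test(heights, mu = 25)

is.list(res)    # the result is stored as a list
names(res)      # available elements

# Extract single elements by name
res$p.value
res$estimate
```

The same pattern (names() followed by $ or [[ ]]) works for the output of many other modelling functions.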
A data frame is what you may call a data table. It is similar to a two-dimensional matrix but the columns can contain different data types.
df <- data.frame(TREEID = 1001:1003,
                 SPECIES = factor(c("Spruce", "Fir", "Pine")),
                 LIFE = c(TRUE, FALSE, TRUE),
                 HEIGHT = c(34, 21, 26)
)
df
##   TREEID SPECIES  LIFE HEIGHT
## 1   1001  Spruce  TRUE     34
## 2   1002     Fir FALSE     21
## 3   1003    Pine  TRUE     26
The summary() function gives a quick overview. It is helpful for spotting data entry errors and NAs:
summary(df)
##      TREEID       SPECIES     LIFE             HEIGHT
##  Min.   :1001   Fir   :1   Mode :logical   Min.   :21.0
##  1st Qu.:1002   Pine  :1   FALSE:1         1st Qu.:23.5
##  Median :1002   Spruce:1   TRUE :2         Median :26.0
##  Mean   :1002                              Mean   :27.0
##  3rd Qu.:1002                              3rd Qu.:30.0
##  Max.   :1003                              Max.   :34.0
You can access (index) columns in three main ways:
df$TREEID
## [1] 1001 1002 1003
df[ , 1]
## [1] 1001 1002 1003
df[ , "TREEID"]
## [1] 1001 1002 1003
Rows are indexed by row number:
df[3, ]
##   TREEID SPECIES LIFE HEIGHT
## 3   1003    Pine TRUE     26
df[1:2, "TREEID"]
## [1] 1001 1002
df[1, c("TREEID", "HEIGHT")]
##   TREEID HEIGHT
## 1   1001     34
IMPORTANT: Extracting a row does not change the class, but extracting a single column does!
class(df)
## [1] "data.frame"
class(df[ 1, ])
## [1] "data.frame"
class(df[ , "TREEID"])
## [1] "integer"
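If a single column should remain a data frame, there are two common options, sketched below with a small made-up data frame (df_demo and the other object names are illustrative, not from the course data).

```r
df_demo <- data.frame(TREEID = 1001:1003, HEIGHT = c(34, 21, 26))

col_vec <- df_demo[, "TREEID"]                # returns an integer vector
col_df1 <- df_demo[, "TREEID", drop = FALSE]  # stays a data frame
col_df2 <- df_demo["TREEID"]                  # list-style indexing, also a data frame

class(col_vec)
class(col_df1)
class(col_df2)
```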
R can read a variety of dataset formats.
Text (ASCII) files are a popular data storage and exchange format. They can be read on any operating system and without special software (a plain text editor such as Notepad is enough).
Comma-separated (csv) or tab-separated files are common formats for storing tabular data (on systems with an English locale).
BE AWARE: On German (and other) systems, the comma is already reserved as the decimal mark, so here semicolon or tab separation is often preferred. On English systems, the dot (.) is used as the decimal mark; on other systems it may be used to group digits for readability, e.g. 1.000.000.
The read.table() function can be used to read table data from text files. The function has several arguments to accommodate the different data formats. See ?read.table
The following example specifies these format options:
- columns are separated by commas (sep = ",")
- the decimal mark is a dot (dec = ".")
- the first line contains the column names (header = TRUE)
tab <- read.table("data/airquality.txt", sep = ",", dec = ".", header = TRUE)
read.table() returns a data frame.
class(tab)
## [1] "data.frame"
names(tab)
## [1] "ID"    "Ozone" "Solar" "Wind"  "Temp"  "Month" "Day"
head(tab, 3)
##   ID Ozone Solar Wind Temp Month Day
## 1  1    41   190  7.4   67     5   1
## 2  2    36   118  8.0   72     5   2
## 3  3    12   149 12.6   74     5   3
read.csv() is a shortcut to read.table() for text files with comma-separated columns.
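For files written on systems that use the comma as decimal mark, base R also provides read.csv2(), which assumes semicolon-separated columns and decimal commas. The following sketch writes a tiny file with made-up values to a temporary location and reads it back in two equivalent ways; the file content and object names are assumptions for illustration.

```r
# Write a small semicolon-separated file with decimal commas
# (made-up values, stored in a temporary file)
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID;Wind;Temp", "1;7,4;67", "2;8,0;72"), tmp)

# read.csv2() uses sep = ";" and dec = "," by default
tab_csv2 <- read.csv2(tmp)

# The same settings spelled out with read.table()
tab_table <- read.table(tmp, sep = ";", dec = ",", header = TRUE)

tab_csv2$Wind     # read as numbers, not as text
str(tab_table)
```

If the decimal mark is not declared correctly, such a column is typically read as character instead of numeric, so checking str() after the import is a good habit.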
Export data frames to a semicolon-delimited file:
write.table(tab, "data/airquality_output.txt", sep = ";", dec = ".", row.names = FALSE)
You can also use write.csv() to export data frames to comma-delimited text files:
write.csv(tab, "data/airquality_output.csv")
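For reference, these are the arguments of write.table() and their default values (see ?write.table):
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")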
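One consequence of the default row.names = TRUE is that the row names are written as an extra first column. The sketch below, using a made-up data frame and temporary files (all names are illustrative), shows the effect when the file is read back.

```r
df_out <- data.frame(ID = 1:2, Wind = c(7.4, 8.0))

file_with_rownames <- tempfile(fileext = ".csv")
file_without_rownames <- tempfile(fileext = ".csv")

write.csv(df_out, file_with_rownames)                      # default: row names are written
write.csv(df_out, file_without_rownames, row.names = FALSE)

# Reading the files back shows the extra, unnamed first column
names(read.csv(file_with_rownames))
names(read.csv(file_without_rownames))
```

Setting row.names = FALSE is usually the safer choice when the row names carry no information.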
Recall from last session that NA is used for missing values in R:
x <- c(1, 5, 3, 6, NA, 9, 21, 4)
x
## [1] 1 5 3 6 NA 9 21 4
...and that you must use is.na() to determine which elements are missing values.
is.na(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
It is not uncommon to have missing values in datasets, e.g. in data frames and matrices:
df <- data.frame(var1 = c(1, 3, 12, NA, 5),
                 var2 = c(3, 4, 1, 8, 11))
df
##   var1 var2
## 1    1    3
## 2    3    4
## 3   12    1
## 4   NA    8
## 5    5   11
Use na.omit() to remove all rows that contain NAs from a data frame:
na.omit(df)
##   var1 var2
## 1    1    3
## 2    3    4
## 3   12    1
## 5    5   11
Also recall that arithmetic functions and operations applied to NAs return NA:
NA * 3
## [1] NA
Many arithmetic functions allow you to specify whether to ignore or include NAs, usually with the na.rm argument:
sum(df$var1)
## [1] NA
sum(df$var1, na.rm=TRUE)
## [1] 21
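Before removing or ignoring missing values it is often useful to count them first. A minimal sketch, using a copy of the small data frame from above under the illustrative name df_na:

```r
df_na <- data.frame(var1 = c(1, 3, 12, NA, 5),
                    var2 = c(3, 4, 1, 8, 11))

# Number of missing values per column
n_missing <- colSums(is.na(df_na))
n_missing

# Logical vector marking rows without any NA
complete.cases(df_na)

# Keep only the complete rows (same rows as na.omit(df_na))
complete_rows <- df_na[complete.cases(df_na), ]
nrow(complete_rows)

# na.rm works the same way in other summary functions
mean_var1 <- mean(df_na$var1, na.rm = TRUE)
mean_var1
```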
Combine two or more character variables with paste():
paste("Hello", "World", sep = "_")
## [1] "Hello_World"
Extract a portion of a character variable with substring():
substring("Hello World", first = 3, last = 8)
## [1] "llo Wo"
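Both functions are vectorised, which is handy for building labels or identifiers. The following sketch uses made-up inputs and illustrative object names:

```r
# paste0() is paste() with sep = ""; vectors are combined element by element
tree_ids <- paste0("tree_", 1001:1003)
tree_ids

# collapse joins all elements into a single character string
species_list <- paste(c("FI", "PI", "SP"), collapse = ", ")
species_list

# substr() extracts by start and stop position; nchar() counts characters
first_word <- substr("Hello World", 1, 5)
first_word
nchar("Hello World")
```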
It is best to extract columns from a data frame using the column names:
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
airquality_ozone <- airquality$Ozone
airquality_ozone <- airquality[, "Ozone"]
airquality_ozone_temp <- airquality[, c("Ozone", "Temp")]
Subset rows using logical conditions on variables (columns):
airquality_temp_gr_70 <- airquality[airquality$Temp > 70, ]
nrow(airquality_temp_gr_70)
## [1] 120
Subset rows based on row indices:
airquality_zeile_10_100 <- airquality[1:100, ]
nrow(airquality_zeile_10_100)
## [1] 100
You can combine logical operators to make more complex subsets of rows and columns:
airquality_juni <- airquality[airquality$Month == 6, ]
All measurements from 15 June:
airquality_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, ]
Or return only the Ozone values (column) from 15 June:
ozone_juni_15 <- airquality[airquality$Month == 6 & airquality$Day == 15, "Ozone"]
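One caveat with logical subsetting: if the column used in the condition contains NAs, the condition evaluates to NA for those rows, and they are returned as rows filled with NA. The sketch below uses the Ozone column of the built-in airquality data, which contains missing values, and shows two common ways around this. The threshold of 50 and the object names are arbitrary choices for illustration.

```r
# The condition is NA wherever Ozone is NA,
# so the result contains extra rows that consist of NAs only
high_ozone_raw <- airquality[airquality$Ozone > 50, ]

# which() keeps only the positions where the condition is TRUE
high_ozone <- airquality[which(airquality$Ozone > 50), ]

# subset() also drops rows where the condition is NA
high_ozone_subset <- subset(airquality, Ozone > 50)

nrow(high_ozone_raw)
nrow(high_ozone)
nrow(high_ozone_subset)
```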
In the logical operations covered previously, single values were compared to vectors or matrices (or to single values), e.g. c(1,2,3) < 2.
The %in% operator can be applied to two vectors: x %in% y. For each element in vector x, %in% evaluates whether the element is contained in vector y. The operator returns a logical vector of the same length as vector x.
The following example returns all rows for the months June and July.
airquality_jun_jul <- airquality[airquality$Month %in% c(6, 7), ]
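A small sketch with made-up vectors may help to see what %in% returns, and how it can be negated to select everything except certain values. The object names are illustrative.

```r
x_demo <- c(5, 6, 9)
y_demo <- c(6, 7)

is_in <- x_demo %in% y_demo      # one logical value per element of x_demo
is_in

not_in <- !(x_demo %in% y_demo)  # negation: elements of x_demo not found in y_demo
not_in

# Applied to the built-in airquality data: all rows except June and July
airquality_not_jun_jul <- airquality[!(airquality$Month %in% c(6, 7)), ]
unique(airquality_not_jun_jul$Month)
```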
Recall that the $ operator is used to access columns in a data frame by name. You can also use it to create a new column.
The following example creates a new column with the name “NewVariable” and fills it with the Ozone values multiplied by 100.
airquality$NewVariable <- airquality$Ozone * 100
Or we create a log-transformed variable.
airquality$logOzone <- log(airquality$Ozone)
Sometimes, you may want to create classes from a continuous variable. For example, below we convert temperature data into a factor (categorical) variable with two levels (warm and cold).
First, add an empty character vector:
airquality$TempClass <- rep("", nrow(airquality))
Then, recode the character vector based on logical operations on $Temp:
airquality[airquality$Temp >= 70 , "TempClass"] <- "warm"
airquality[airquality$Temp < 70 , "TempClass"] <- "cold"
airquality$TempClass <- factor(airquality$TempClass)
The function cut() divides a numeric variable into intervals and codes them into factors (categorical data):
airquality$TempClass2 <- cut(airquality$Temp, c(0, 69, 100), labels=c('cold', 'warm'))
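By default, cut() builds intervals that are open on the left and closed on the right, so the breaks c(0, 69, 100) define the classes (0, 69] and (69, 100]. The following sketch checks this on a few made-up temperatures and compares the result with an ifelse() based recoding, which is one of several alternative ways to do the same task. The values and object names are assumptions for illustration.

```r
# Made-up temperatures around the class boundary
temp_demo <- c(56, 69, 70, 97)

# cut(): 69 falls into (0, 69] = cold, 70 falls into (69, 100] = warm
class_cut <- cut(temp_demo, breaks = c(0, 69, 100), labels = c("cold", "warm"))
class_cut

# Alternative: recode with ifelse() and convert to a factor
class_ifelse <- factor(ifelse(temp_demo >= 70, "warm", "cold"))
class_ifelse

# Both approaches assign the same classes
all(as.character(class_cut) == as.character(class_ifelse))
```

Values outside the outermost breaks become NA, so the breaks should cover the full range of the data.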
Note that there is usually more than one way to do the same task!
Copyright © 2024 Humboldt-Universität zu Berlin. Department of Geography.