- Importing raster files (
`library(raster); stack()`

) - Principal Component Analysis (
`prcomp(); princomp()`

)

- Assessing model predictions
- Cross-validation
- Linear discriminant analysis (Categorical predictions)
- Cross-validation example for regression

It is important to note that there are in fact two separate goals that we might have in mind when evaluating models:

Evaluating and selecting models based on the goodness of fit.

Evaluating models based on their prediction performance on new data.

```
x <- 5:14; y <- c(4,7,6,6,8,7,12,13,12,11); xn = seq(5, 15, by=0.1)
lmd1 <- lm(y ~ x)
plot(y~x, xlim=c(4,16), ylim=c(4,16), main='1-st order polynomial')
lines(xn, predict(lmd1, data.frame(x=xn)), col=1)
```

The **generalization** performance of a model relates to its **prediction** capability on **independent test data**. A predictive model is considered good when it is capable of predicting previously unseen samples with high accuracy.

It is important to distinguish between different prediction error concepts:

Training error is the average loss over the training sample.

Test error, also referred to as

**generalization error**, is the prediction error over an independent test sample.

Imagine you can repeat the entire model building process multiple times using different training data.

**Variance** is the variability of a model prediction for a given data point. The variance is how much the predictions for a given point vary between different realizations of the model. Models with high variance pay a lot of attention to training data and do not generalize on the data which they haven’t seen before (overfitting).

**Bias** is the difference between the expected (or average) prediction of our model and the correct value. Models with high bias pay little attention to the training data and oversimplify the model (underfitting).

To minimize the expected test error, we need to select a statistical model that simultaneously achieves low variance and low bias. Training error is not a good estimate of the test error, as seen in previous slide. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfitted to the training data and will typically generalize poorly.

Sometimes, there are not enough data to split into training and test sample. Probably the simplest and most widely used method for estimating prediction error is cross-validation.

K-fold cross-validation uses part of the available data to fit the model, and a different part to test it. We split the data into K roughly equal-sized parts. Typical choices of K are between 5 and 10. When K = 5, the scenario looks like this:

The case K = N is known as leave-one-out cross-validation. In this case, for the *i*’th observation the fit is computed using all the data except the *i*’th.

Regression losses (quantitative response)

- Mean squared error (MSE)
- Mean absolute error (MAE)
- Root Mean Squared Error (RMSE)

Classification losses (qualitative response)

- Confusion matrix
- Overall accuracy
- Precision, Recall, etc.

Compare hypothesis testing:

Type I error: incorrectly reject a true null hypothesis (detecting an effect that is not present)

Type II error: fail to reject a false null hypothesis (failing to detect an effect that is present)

Overall accuracy = \(\frac{\sum{true\ positives} + \sum{true\ negatives}}{\sum{total\ samples}}\)

Omission error (of positives) = FNR; Commission error (of positives) = FDR

**Discriminative models:**

Estimate conditional models Pr[Y | X ]

Logistic regression

**Generative models:**

Estimate joint probability Pr[Y, X] = Pr[Y | X] Pr[X]

Estimates not only probability of labels but also the features

Once model is fit, can be used to generate data

LDA, QDA, Naive Bayes

**Positive:**

Can be used to generate data (Pr[X])

Offers more insights into data

**Negative:**

- Often works worse, particularly when assumptions are violated

For this exercise, we use the `taxon`

dataset from the R Book.

```
taxa <- read.table("data/taxonomy.txt", header=T, stringsAsFactors = T)
head(taxa)
```

```
## Taxon Petals Internode Sepal Bract Petiole Leaf Fruit
## 1 I 5.621498 29.48060 2.462107 18.20341 11.27910 1.128033 7.876151
## 2 I 4.994617 28.36025 2.429321 17.65205 11.04084 1.197617 7.025416
## 3 I 4.767505 27.25432 2.570497 19.40838 10.49072 1.003808 7.817479
## 4 I 6.299446 25.92424 2.066051 18.37915 11.80182 1.614052 7.672492
## 5 I 6.489375 25.21131 2.901583 17.31305 10.12159 1.813333 7.758443
## 6 I 5.785868 25.52433 2.655643 17.07216 10.55816 1.955524 7.880880
```

The dataset contains seven morphological variables (Petals, Internode, Sepal, Bract, Petiole, Leaf, Fruit) that might be useful for separating our categorical response variable (Taxon). There are four plant taxa (I, II, III, IV) with 30 individuals, respectively.

`table(taxa$Taxon)`

```
##
## I II III IV
## 30 30 30 30
```

The `MASS`

package provides functions to conduct discriminant function analysis. We will use `lda()`

to carry out a linear discriminant analysis. By now, you are familiar with the formula style: `response ~ explanatory variables`

(`.`

denotes all variables in the taxa dataset except `Taxon`

):

```
library(MASS)
lda_model <- lda(Taxon ~ ., data=taxa)
```

The `lda`

object (in this example `lda_model`

) contains the following components (and more):

**prior:** the prior probabilities used.

**means:** the group means.

**scaling:** a matrix of coefficients that transforms observations to discriminant functions.

**svd:** the singular values, which give the ratio of the between- and within-group standard deviations on the linear discriminant variables.

**N:** The number of observations used.

**counts:** The number of observations per class.

By default, equal prior probabilities are prescribed to each class:

`lda_model$prior`

```
## I II III IV
## 0.25 0.25 0.25 0.25
```

`lda_model$svd`

`## [1] 20.589120 9.098563 8.750050`

The singular values are analogous to the eigenvalues of the Principal Component Analysis, except that LDA does not maximize the variance of a component, instead it maximizes the separability (defined by the between and within-group standard deviation). Thus, the “proportion of trace” is the proportion of between-class variance that is explained by successive discriminant functions.

`lda_model$svd^2 / sum(lda_model$svd^2)`

`## [1] 0.7267985 0.1419332 0.1312682`

Hence, 72.7% of the between-class variance is explained by the first linear discriminant function (LD1).

As always, you get a printed overview of all essential model results by printing the model object:

`lda_model`

```
## Call:
## lda(Taxon ~ ., data = taxa)
##
## Prior probabilities of groups:
## I II III IV
## 0.25 0.25 0.25 0.25
##
## Group means:
## Petals Internode Sepal Bract Petiole Leaf Fruit
## I 5.476128 27.91886 2.537955 18.60268 10.864184 1.508029 7.574642
## II 7.035078 27.69834 2.490336 18.47557 8.541085 1.450260 7.418702
## III 6.849666 27.99308 2.446003 18.26330 9.866983 2.588555 7.482349
## IV 6.768464 27.78503 4.532560 18.42953 10.128838 1.645945 7.467917
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Petals -0.01891137 0.034749952 0.559080267
## Internode 0.03374178 0.009670875 0.008808043
## Sepal 3.45605170 -0.500418135 0.401274694
## Bract 0.07557480 0.068774714 -0.024930728
## Petiole 0.25041949 -0.343892260 -1.249519047
## Leaf -1.13036429 -3.008335468 0.647932763
## Fruit 0.18285691 -0.208370808 -0.269924935
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7268 0.1419 0.1313
```

The number of discriminant functions that can be extracted depends on the number of groups and the number of variables – it is the lesser of the degrees of freedom for groups (number of groups minus one) and the number of variables (Tabachnick & Fidell 1996). The scaling coefficients show, LD1 is largely explained by the variable `Sepal`

. Plotting Sepal versus Leaf length confirms this:

```
plot(taxa$Sepal, taxa$Leaf, col=taxa$Taxon, type="n")
text(taxa$Sepal, taxa$Leaf, labels=as.character(taxa$Taxon), col=as.numeric(taxa$Taxon))
```

Applying the `predict()`

function to the LDA model yields the following results (see `?predict.lda`

):

`lda_model$class`

: The predicted class based on the maximum posterior probability (MAP)

`lda_model$posterior`

: posterior probabilities for the classes

`lda_model$x`

: the scores of test cases for all discriminant functions

```
lda_prediction <- predict(lda_model)
str(lda_prediction)
```

```
## List of 3
## $ class : Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ posterior: num [1:120, 1:4] 1 1 1 0.999 0.883 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:120] "1" "2" "3" "4" ...
## .. ..$ : chr [1:4] "I" "II" "III" "IV"
## $ x : num [1:120, 1:3] -0.624 -1.099 -0.285 -2.568 -0.419 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:120] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3] "LD1" "LD2" "LD3"
```

We can glue together the posterior-probabilities and the predicted class vector (with `cbind()`

to show for each row, the predicted class represents the class with the maximum posterior probability

```
i <- sample(120) # random row
head(cbind(round(lda_prediction$posterior[i,], 4), class=lda_prediction$class[i]))
```

```
## I II III IV class
## 7 0.9452 0.0538 0.0010 0 1
## 43 0.0004 0.9993 0.0002 0 2
## 34 0.0002 0.9998 0.0000 0 2
## 73 0.0000 0.0000 1.0000 0 3
## 78 0.0000 0.0000 1.0000 0 3
## 79 0.0000 0.0008 0.9992 0 3
```

The LDA scores (the transformed values in LDA space) for all observations can be accessed via `$x`

:

`head(lda_prediction$x)`

```
## LD1 LD2 LD3
## 1 -0.6240211 1.6810963 -3.0303236
## 2 -1.0988293 1.6768377 -2.8176493
## 3 -0.2850537 2.3155720 -2.5934923
## 4 -2.5680633 0.2811099 -3.1292050
## 5 -0.4189485 -0.2502168 -0.4620391
## 6 -1.2922825 -0.7685348 -1.4317043
```

A nice way of displaying the results of a linear discriminant analysis (LDA) is to plot the LDA scores as histograms or scatterplots. A stacked histogram shows the scores of the discriminant functions separately for each group.

`ldahist(data = lda_prediction$x[,], g=taxa$Taxon)`

We can also produce a scatterplot showing how the groups separate along the LDA axes (below). LD1 separates taxum `IV`

from the other taxa. LD2 separates well `III`

from `I`

and `II`

. LD3 seems to separate taxum `I`

from the other groups.

`plot(lda_model, col=as.numeric(taxa$Taxon)) # assign color code based on factor code`

To check how well the model predicts our training data, we can construct a confusion matrix of the fitted vs observed Taxa using the `table`

function.

```
conf <- table(list(predicted=lda_prediction$class, observed=taxa$Taxon))
conf
```

```
## observed
## predicted I II III IV
## I 29 0 0 0
## II 0 30 0 0
## III 1 0 30 0
## IV 0 0 0 30
```

We can calculate the categorical accuracy measures using simple matrix arithmetic…

```
# Precision (Positive predicted value)
diag(conf) / rowSums(conf)
```

```
## I II III IV
## 1.0000000 1.0000000 0.9677419 1.0000000
```

```
# Sensitivity
diag(conf) / colSums(conf)
```

```
## I II III IV
## 0.9666667 1.0000000 1.0000000 1.0000000
```

Alternatively, the caret package includes functionality that makes this easier:

```
library(caret)
confusionMatrix(conf)
```

```
## Confusion Matrix and Statistics
##
## observed
## predicted I II III IV
## I 29 0 0 0
## II 0 30 0 0
## III 1 0 30 0
## IV 0 0 0 30
##
## Overall Statistics
##
## Accuracy : 0.9917
## 95% CI : (0.9544, 0.9998)
## No Information Rate : 0.25
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9889
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: I Class: II Class: III Class: IV
## Sensitivity 0.9667 1.00 1.0000 1.00
## Specificity 1.0000 1.00 0.9889 1.00
## Pos Pred Value 1.0000 1.00 0.9677 1.00
## Neg Pred Value 0.9890 1.00 1.0000 1.00
## Prevalence 0.2500 0.25 0.2500 0.25
## Detection Rate 0.2417 0.25 0.2500 0.25
## Detection Prevalence 0.2417 0.25 0.2583 0.25
## Balanced Accuracy 0.9833 1.00 0.9944 1.00
```

The previous error estimates show the `training error`

. The `lda()`

function also includes an easy way to estimate the `test error`

using leave-one-out cross-validation by specifying `CV=T`

.

```
lda_model2 <- lda(Taxon ~ ., data=taxa, CV=T)
conf2 <- table(list(predicted=lda_model2$class, observed=taxa$Taxon))
confusionMatrix(conf2)
```

```
## Confusion Matrix and Statistics
##
## observed
## predicted I II III IV
## I 29 1 0 0
## II 0 28 0 0
## III 1 1 30 0
## IV 0 0 0 30
##
## Overall Statistics
##
## Accuracy : 0.975
## 95% CI : (0.9287, 0.9948)
## No Information Rate : 0.25
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: I Class: II Class: III Class: IV
## Sensitivity 0.9667 0.9333 1.0000 1.00
## Specificity 0.9889 1.0000 0.9778 1.00
## Pos Pred Value 0.9667 1.0000 0.9375 1.00
## Neg Pred Value 0.9889 0.9783 1.0000 1.00
## Prevalence 0.2500 0.2500 0.2500 0.25
## Detection Rate 0.2417 0.2333 0.2500 0.25
## Detection Prevalence 0.2500 0.2333 0.2667 0.25
## Balanced Accuracy 0.9778 0.9667 0.9889 1.00
```

- Mean Squared Error (MSE)

\(MSE = \frac{1}{n} \sum_{i=1}^n(y_i - \hat{y_i})^2 = \frac{1}{n} SSE\)

- Root Mean Squared Error (RMSE)

\(RMSE = \sqrt(MSE)\)

- Mean Absolute Error (MAE)

\(\frac{1}{n} \sum_{i=1}^n|y_i - \hat{y_i}|\)

Recall the airquality dataset that we used in previous sessions. Our goal is now to assess, how well the following linear model predicts ozone concentrations based on air temperature and solar radiation.

```
library(ggplot2)
fit <- lm(log(Ozone) ~ I(Temp^2) + Solar.R, data = airquality)
summary(fit)
```

```
##
## Call:
## lm(formula = log(Ozone) ~ I(Temp^2) + Solar.R, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9178 -0.3366 0.0389 0.3031 1.4087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.150e-01 2.247e-01 2.292 0.0238 *
## I(Temp^2) 3.972e-04 3.644e-05 10.902 < 2e-16 ***
## Solar.R 2.496e-03 5.860e-04 4.259 4.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5366 on 108 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6229, Adjusted R-squared: 0.616
## F-statistic: 89.22 on 2 and 108 DF, p-value: < 2.2e-16
```

We’ll use the caret package to slice the dataset into 10 folds (sub-samples):

```
library(caret)
# remove NA's
airquality_no_na <- na.omit(airquality)
folds <- createFolds(airquality_no_na$Ozone, k = 10)
```

You’ll notice that we provide `createFolds()`

only with a vector and not the entire dataset. The key here is, the function returns a list of row indices instead of the actual data.

`class(folds)`

`## [1] "list"`

`folds[1:4]`

```
## $Fold01
## [1] 5 10 17 19 26 36 66 68 84 92
##
## $Fold02
## [1] 8 11 34 54 57 81 86 89 97 98
##
## $Fold03
## [1] 4 6 16 23 40 44 46 76 80 87 102
##
## $Fold04
## [1] 13 21 42 45 48 49 56 59 67 100 110
```

We can write a function that takes two attributes as input: 1) the entire data.frame (`data`

) and 2) the indices of a fold (`i`

). We call our function `crossValidateLM`

.

```
crossValidateLM <- function(data, i){
# all rows that are not in the fold are copied into a training dataset
train <- data[-i, ]
# all rows that are in the fold are used as test dataset
test <- data[i, ]
# fit a model using the training dataset
fit <- lm(log(Ozone) ~ I(Temp^2) + Solar.R, data = train)
# apply the model to the test dataset
pred_log <- predict(fit, newdata = test)
# back-transform log Ozone to Ozone
pred <- exp(pred_log)
# extract the observed Ozone concentration (reference)
ref <- test$Ozone
# return a data.frame with two columns containing the cross-validated
# predictions for the fold and the corrsponding reference observations
return(data.frame(Prediction = pred,
Reference = ref))
}
```

Let’s test our function on the first fold:

```
cv_first_fold <- crossValidateLM(airquality_no_na, folds[[1]])
head(cv_first_fold)
```

```
## Prediction Reference
## 7 19.672997 23
## 14 21.613168 14
## 21 7.967410 1
## 23 9.034628 4
## 40 81.396572 71
## 64 40.845456 32
```

```
ggplot(cv_first_fold, aes(x = Reference, y = Prediction)) +
geom_point() +
xlim(0, 175) +
ylim(0, 175) +
geom_abline(intercept = 0, slope = 1)
```

Looks good but we also need the predictions from the other folds. We can all gather them together by writing a loop from fold 1 to fold 10:

```
# create empty data.frame
cv_all_folds <- data.frame(Prediction = numeric(0), Reference = numeric(0))
for(i in folds){
# add the rows of the fold
cv_all_folds <- rbind(cv_all_folds, crossValidateLM(airquality_no_na, i))
}
ggplot(cv_all_folds, aes(x = Reference, y = Prediction)) +
geom_point() +
xlim(0, 175) +
ylim(0, 175) +
geom_abline(intercept = 0, slope = 1)
```

Now, as we have our cross-validated predictions and reference observations in one data.frame, we can calculate the root-mean-squared error (RMSE), mean squared error (MSE), and mean absolute error (MAE).

```
mse <- function(x, y){mean((x - y)^2)}
mse(x = cv_all_folds$Prediction, y = cv_all_folds$Reference)
```

`## [1] 513.3705`

```
rmse <- function(x, y){sqrt(mean((x - y)^2))}
rmse(x = cv_all_folds$Prediction, y = cv_all_folds$Reference)
```

`## [1] 22.65768`

```
mea <- function(x, y){mean(abs(x - y))}
mea(x = cv_all_folds$Prediction, y = cv_all_folds$Reference)
```

`## [1] 14.25212`

You probably realize, writing you own cross-validation function is not hard once you understand functions and loops in R. The `caret`

package (and numerous other packages) provide you with cross-validation functions for which you don’t have to write that many lines of code. This is convenient, as long as those functions work with the models that you are interested in. Here is an example using a linear regression model.

```
library(caret)
# define training control
train.control <- trainControl(method="LOOCV")
# train mean height model
loocv <- train(log(Ozone) ~ I(Temp^2) + Solar.R, data = airquality_no_na, method = "lm", trControl = train.control)
loocv
```

```
## Linear Regression
##
## 111 samples
## 2 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 110, 110, 110, 110, 110, 110, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.5461937 0.5987285 0.4187436
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

A note about how \(R^2\) is calculated by caret: it takes the approach of computing the correlation between the observed and predicted values and squaring the value. So, it is **NOT** the coefficient of determination of the model. Especially when the model is poor, this can lead to differences between this estimator and the more widely known estimate derived form linear regression models.

```
# define training control
lmd <- lm(log(Ozone) ~ I(Temp^2) + Solar.R, data = airquality_no_na)
# R2 estimated from OLS
summary(lmd)[["r.squared"]]
```

`## [1] 0.6229448`

```
# adjusted R2 estimated from OLS
summary(lmd)[["adj.r.squared"]]
```

`## [1] 0.6159623`

```
# R2 from caret
loocv$results$Rsquared
```

`## [1] 0.5987285`

Also, the RMSE is on the log-scale since the model predicts log(Ozone). We cannot simply square the RMSE for back-transformation. You need to calculate the RMSE on the back-transformed predictions yourself. See below, how to access the predictions.

```
pred <- loocv$pred
head(pred)
```

```
## pred obs rowIndex intercept
## 1 2.751203 3.713572 1 TRUE
## 2 2.857273 3.583519 2 TRUE
## 3 3.068884 2.484907 3 TRUE
## 4 2.818412 2.890372 4 TRUE
## 7 2.929038 3.135494 5 TRUE
## 8 2.110938 2.944439 6 TRUE
```

```
ggplot(pred, aes(x = exp(obs), y = exp(pred))) +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```

Hastie, T., Tibshirani, R., Friedman, J. (2009): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. (https://web.stanford.edu/~hastie/ElemStatLearn/)

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.