- Installing and loading packages (`install.packages()`, `library()`)
- Plotting with R base graphics (`hist()`, `plot()`, `boxplot()`)
- Plotting with ggplot2 (`ggplot()`)
- ggplot2 Geoms (`geom_histogram()`, `geom_point()`, `geom_boxplot()`)
- ggplot2 Aesthetics (`aes(x, y, col, shape, linetype, fill, size)`)
- ggplot2 Trellis graphs (`facet_wrap()`)
- ggplot2 Plot title and axis labels (`labs(x, y, title)`)
- Combining multiple plots (gridExtra package, `grid.arrange()`)
- Saving plots to graphic files (`ggsave()` or `png()`, `pdf()`, `tiff()`, etc., plus `dev.off()`)
- Tidying data (tidyr package, `pivot_wider()`, `pivot_longer()`, `separate()`)
- Writing complex statements using pipes (dplyr package, `%>%`)
- Data wrangling with dplyr (`mutate()`, `summarize()`, `group_by()`, `filter()`)

- The lm() function and its extractor functions
- Plotting regression lines
- Prediction versus confidence bands
- Model diagnostics with R

```
library(ggplot2) # load ggplot2 package for visualization
library(tidyr) # data wrangling package: change from wide to long format
library(dplyr) # data manipulation: mutate
```

The anscombe dataset ships with base R and consists of four datasets. To make life easier later in the session, I convert the wide format (datasets organized by columns) into a long format (datasets organized by rows) using the `tidyr` package.

```
data(anscombe) # load anscombe (installed with R)
head(anscombe, 2) # print first two rows
```

```
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
```

```
# create id
anscombe$id <- 1:11
# turn into long format
temp1 <- pivot_longer(anscombe, cols= -id, names_to="key", values_to="value")
# extract dataset id from variable names
temp2 <- extract(temp1, key, c("var", "dataset"), "(.)(.)")
# spread variables to separate x and y columns
anscombe.long <- pivot_wider(temp2, names_from=var, values_from=value)
head(anscombe.long, 2) # print first two rows
```

```
## # A tibble: 2 x 4
## id dataset x y
## <int> <chr> <dbl> <dbl>
## 1 1 1 10 8.04
## 2 1 2 10 9.14
```

For this exercise, we will use the first anscombe dataset.

```
# extract first anscombe dataset
ans1 <- anscombe.long[anscombe.long$dataset==1,]
ggplot(ans1, aes(x = x, y = y)) +
  geom_point()
```

The linear regression model is given by:

\(y_i = \beta_{0} + \beta_{1}x_i + \epsilon_i\)

The parameters \(\beta_{0}\) and \(\beta_{1}\) can be estimated using the method of ordinary least squares.

OLS works with the residuals \(y_{i}-\hat{y}_i\) (differences between observed and fitted values, red lines). Specifically, OLS minimizes the sum of squared residuals: \(SSE = \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^2\). Thus, SSE measures the variability left unexplained by the regression. The diagonal grey line is the fitted line: \(\hat{y}_i\).

The sum of squares total (SSY) is based on the difference (blue line) between the observed variable \(y\) (dots) and its mean \(\bar{y}\) (black line): \(SSY = \sum_{i=1}^{n}(y_{i}-\bar{y})^2\). SSY is a measure of the total variability (variance) of the response variable \(y\). You can also think of it as the variance explained by the null model (single mean). The null model has intercept \(\beta_0 = \bar{y}\) and slope \(\beta_1 = 0\).

The sum of squares due to regression is the sum of squared differences between the fitted values (grey line) and the mean response (black line): \(SSR = \sum_{i=1}^{n}(\hat{y}_{i} - \bar{y})^2\). Thus, it is a measure of how well your line fits the data. If SSE is zero, then SSR = SSY, which means your regression model captures all the observed variability and fits perfectly. Hence \(SSR = SSY-SSE\).

The **F-test of overall significance** is a specific form of the F-test. It compares a model with no predictors to the model that you specify. A regression model with no predictors is also known as the null model or intercept-only model. Everything you need for the F-test is in the analysis of variance (ANOVA) table:

\(H_0\): The fit of the intercept-only model and your model are equal.

\(H_A\): Your model fits the data significantly better than the intercept-only model.

Source | Sum of Squares (SS) | Degrees of freedom (df) | Mean Squares (MS)
---|---|---|---
Regression | \(SSR = SSY-SSE\) | 1 | \(\frac{SSR}{1}\)
Error | \(SSE\) | n-2 | \(\frac{SSE}{n-2}\)
Total | \(SSY\) | n-1 |

\(F_s = \frac{MS_{regression}}{MS_{error}}\)

\(p = Pr(F \ge F_s) = 1-F(F_s, 1, n-2)\), where \(F(\cdot, 1, n-2)\) denotes the cumulative F-distribution with 1 and \(n-2\) degrees of freedom.
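The quantities in the table above can be computed by hand from the first anscombe dataset (a sketch using base R only; `x1`/`y1` are the column names of the built-in wide-format data):

```r
# fit the simple linear regression on the first anscombe dataset
fit <- lm(y1 ~ x1, data = anscombe)

n   <- nrow(anscombe)                           # 11 observations
SSY <- sum((anscombe$y1 - mean(anscombe$y1))^2) # total sum of squares
SSE <- sum(residuals(fit)^2)                    # sum of squared residuals
SSR <- SSY - SSE                                # regression sum of squares

Fs <- (SSR / 1) / (SSE / (n - 2))               # F = MS_regression / MS_error
p  <- 1 - pf(Fs, 1, n - 2)                      # p-value from the F-distribution

round(c(SSR = SSR, SSE = SSE, Fs = Fs, p = p), 4)
```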

The `lm()` function implements simple linear regression in R. The argument to `lm()` is a model formula, in which the tilde symbol (~) should be read as "described by".

```
lm.anscombe1 <- lm(y ~ x, data = ans1) # fits the model
lm.anscombe1 # print the lm object lm.anscombe1
```

```
##
## Call:
## lm(formula = y ~ x, data = ans1)
##
## Coefficients:
## (Intercept) x
## 3.0001 0.5001
```

The result of `lm()` is a model object. This is a distinctive concept of R. Whereas other statistical systems focus on generating printed output that can be controlled by setting options, you get instead the result of a model fit encapsulated in an object from which the desired quantities can be obtained using **extractor functions**. Printing the created model object gives you the model formula and the coefficients (intercept and slope). More information about the model can be accessed with extractor functions. We'll do that later.

We can plot the regression line from the estimated slope (\(\beta_{1}\)) and intercept (\(\beta_{0}\)) using `geom_abline()`:

```
ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = 3.0001, slope = 0.5001, col = "red")
```

Calling `anova()` on an object returned by `lm()` computes the ANOVA table. The SSE is 13.763, the mean squared error is 1.529 (with 9 degrees of freedom), and the SSR is 27.51. The F-value (\(F_s\)) is 17.99.

`anova(lm.anscombe1)`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Note, we can also do the F-test manually by looking up the \(F_s\) value in the F-distribution. Recall, the F-distribution has two parameters: the degrees of freedom of the regression (\(df_1 = 1\)), and the degrees of freedom of the error (\(df_2 = n-2\)).

```
# p-value of the F-test
1 - pf(17.99, 1, 11-2)
```

`## [1] 0.002169607`

As the p-value is very small, we safely reject the null hypothesis \(H_0 : \beta_1 = 0\), indicating a significant effect of \(x\) on \(y\). However, be aware that the p-value does not indicate the effect size! The effect size (change in \(y\) with one unit change in \(x\)) is given by the regression slope (\(\beta_1\)).

Using the `summary()` function we can get a more comprehensive summary of the model statistics, including the F-test, t-tests for the individual parameters, and the coefficient of determination (\(R^2\)).

`summary(lm.anscombe1)`

```
##
## Call:
## lm(formula = y ~ x, data = ans1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
```

The function `fitted()` returns the fitted values: the y-values that you would expect for the given x-values according to the best-fitting straight line. The residuals returned by `residuals()` are the differences between these and the observed y values.

`fitted(lm.anscombe1)`

```
## 1 2 3 4 5 6 7 8
## 8.001000 7.000818 9.501273 7.500909 8.501091 10.001364 6.000636 5.000455
## 9 10 11
## 9.001182 6.500727 5.500545
```

`residuals(lm.anscombe1)`

```
## 1 2 3 4 5 6
## 0.03900000 -0.05081818 -1.92127273 1.30909091 -0.17109091 -0.04136364
## 7 8 9 10 11
## 1.23936364 -0.74045455 1.83881818 -1.68072727 0.17945455
```
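As a quick sanity check (a sketch, refitting the same model via the wide-format column names of the built-in data), the residuals are exactly the observed minus the fitted values:

```r
fit <- lm(y1 ~ x1, data = anscombe)  # same model as lm.anscombe1, wide format

# residuals == observed - fitted (up to numerical tolerance)
check <- all.equal(unname(residuals(fit)),
                   anscombe$y1 - unname(fitted(fit)))
check
```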

The function `coefficients()` returns the estimated model parameters, in this case intercept and slope.

`coefficients(lm.anscombe1)`

```
## (Intercept) x
## 3.0000909 0.5000909
```

You can extract both parameters using simple vector indexing:

```
intercept <- coefficients(lm.anscombe1)[1]
slope <- coefficients(lm.anscombe1)[2]
```
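With the two extracted parameters you can reproduce any fitted value by hand (a sketch; `fit` refits the same model on the built-in wide-format data):

```r
fit <- lm(y1 ~ x1, data = anscombe)
intercept <- coefficients(fit)[1]
slope     <- coefficients(fit)[2]

# fitted value for the first observation: beta0 + beta1 * x
manual <- unname(intercept + slope * anscombe$x1[1])
isTRUE(all.equal(manual, unname(fitted(fit)[1])))
```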

Predicted values can be extracted with the function `predict()`. With no arguments, it just gives the fitted values:

`predict(lm.anscombe1) # Similar to fitted(lm.anscombe1)`

```
## 1 2 3 4 5 6 7 8
## 8.001000 7.000818 9.501273 7.500909 8.501091 10.001364 6.000636 5.000455
## 9 10 11
## 9.001182 6.500727 5.500545
```

You can also predict outcomes for new values:

```
new.ans1 <- data.frame(x = c(1, 4, 6, 19))
predict(lm.anscombe1, newdata = new.ans1)
```

```
## 1 2 3 4
## 3.500182 5.000455 6.000636 12.501818
```

Please note that the data frame must have the same column names as used in the formula within `lm()`!

Setting the argument `interval = "confidence"`, you get the vector of predicted values with a lower and upper limit given **uncertainty in the parameters**.

`predict(lm.anscombe1, interval = 'confidence')`

```
## fit lwr upr
## 1 8.001000 7.116387 8.885613
## 2 7.000818 6.116205 7.885431
## 3 9.501273 8.141258 10.861287
## 4 7.500909 6.657464 8.344354
## 5 8.501091 7.503113 9.499069
## 6 10.001364 8.423422 11.579305
## 7 6.000636 4.838027 7.163245
## 8 5.000455 3.422513 6.578396
## 9 9.001182 7.838573 10.163791
## 10 6.500727 5.502750 7.498705
## 11 5.500545 4.140531 6.860560
```

Note that R by default returns the 95% confidence interval. You can change it by setting the argument `level` to another confidence level (e.g., `level = 0.99`).
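A 99% interval is necessarily wider than the default 95% interval; a quick check (a sketch, refitting the model from the built-in wide-format data):

```r
fit  <- lm(y1 ~ x1, data = anscombe)
ci95 <- predict(fit, interval = "confidence")               # default level = 0.95
ci99 <- predict(fit, interval = "confidence", level = 0.99)

# interval widths (upr - lwr) at each observation
width95 <- ci95[, "upr"] - ci95[, "lwr"]
width99 <- ci99[, "upr"] - ci99[, "lwr"]
all(width99 > width95)
```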

```
prediction.anscombe1 <- lm.anscombe1 %>%
  predict(., interval = 'confidence') %>%
  as.data.frame() %>%
  mutate(x = ans1$x)
p <- ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = prediction.anscombe1, aes(x = x, y = fit)) +
  geom_line(data = prediction.anscombe1, aes(x = x, y = upr), linetype = "dashed") +
  geom_line(data = prediction.anscombe1, aes(x = x, y = lwr), linetype = "dashed")
```

Interpret the confidence band as follows: when repeating the experiment, we would expect the estimated regression lines to lie within the confidence band 95% of the time.

If you add `interval = "prediction"`, then you get the vector of predicted values with a lower and upper limit given **uncertainty in the parameters** plus **residual variance**.

```
prediction.anscombe1 <- lm.anscombe1 %>%
  predict(., interval = 'prediction') %>%
  as.data.frame() %>%
  mutate(x = ans1$x)
```

`## Warning in predict.lm(., interval = "prediction"): predictions on current data refer to _future_ responses`

```
p <- ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = prediction.anscombe1, aes(x = x, y = fit)) +
  geom_line(data = prediction.anscombe1, aes(x = x, y = upr), linetype = "dashed") +
  geom_line(data = prediction.anscombe1, aes(x = x, y = lwr), linetype = "dashed")
```

Interpret the prediction band as follows: when predicting new values, we would expect 95% of them to lie within the prediction band.
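Because the prediction interval adds the residual variance, it always encloses the confidence interval at the same x. A quick numerical check (a sketch using the built-in wide-format data):

```r
fit  <- lm(y1 ~ x1, data = anscombe)   # first anscombe dataset, wide format
conf <- predict(fit, interval = "confidence")
pred <- suppressWarnings(predict(fit, interval = "prediction"))

# at every x, the prediction band lies outside the confidence band
all(pred[, "lwr"] < conf[, "lwr"] & pred[, "upr"] > conf[, "upr"])
```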

The validity of a model depends on several assumptions regarding the distribution of residuals and the functional relationship between \(y\) and \(x\). Visual/numerical model diagnostics should cover the following aspects:

- Analysis of the distribution of the predictor variable
- Analysis of leverage points (points with high influence)
- Analysis of the dispersion
- Analysis of the residuals (independence, normal distribution, and homoscedasticity)
- Analysis of the functional relationship between response and predictor

```
ggplot(anscombe.long, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, col = 'red') +
  facet_wrap(~ dataset)
```

```
## `geom_smooth()` using formula 'y ~ x'
```

- Extreme \(x\) values can dominate a regression.
- Ideally, \(x\) would be uniformly distributed. We make no explicit assumptions about the distribution of \(x\), but it can matter.
- This should be considered when designing experiments and sampling surveys.
- Log- and square-root transformations can dampen the effect of extreme values (but be careful, as the interpretation changes!).

```
ggplot(anscombe.long, aes(x = dataset, y = x)) +
  geom_boxplot()
```

There is no justification for eliminating 'outliers' simply because they are extreme values. Removing extreme values has to be well justified, e.g., by measurement error. Although you should not simply delete outliers, you can run your analysis twice (with and without them) to test the robustness of your results.

Cook's distance is a common measure to identify influential points. Values greater than 1, or greater than 4/n (n = number of samples), flag candidate influential points. You can extract Cook's distance using the `cooks.distance()` function, though we suggest relying on your intuition rather than on strict thresholds.

Dispersion quantifies how much higher (or lower) the observed variance is relative to the variance expected under the model (overdispersion or underdispersion).

This is not relevant in the normal-distribution case (OLS assumes a normally distributed error term), since the standard deviation (the measure of variability) is independent of the mean and hence estimated separately.

But dispersion will play a role for other distributions modeled with Generalized Linear Models.

…to be continued.

The residuals should be independent of the fitted values. We can check this by plotting the residuals against the fitted values:

```
ans1 <- anscombe.long[anscombe.long$dataset==1,]
lm.ans1 <- lm(y ~ x, data=ans1)
ans1$residuals <- residuals(lm.ans1)
ans1$fitted <- fitted(lm.ans1)
p1 <- ggplot(data = ans1, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 1")
ans2 <- anscombe.long[anscombe.long$dataset==2,]
lm.ans2 <- lm(y ~ x, data=ans2)
ans2$residuals <- residuals(lm.ans2)
ans2$fitted <- fitted(lm.ans2)
p2 <- ggplot(data = ans2, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 2")
```

There should be no serial correlation between residuals. We can check this by plotting the residuals over observations:

```
p1 <- ggplot(data = ans1, aes(x = 1:11, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 1")
p2 <- ggplot(data = ans2, aes(x = 1:11, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 2")
```

Please note that regression models might have temporally or spatially correlated residuals. In fact, both are very common with ecological data. You will learn how to deal with spatial correlation structure in the residuals later in the course.
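A simple numerical companion to the visual check is the lag-1 autocorrelation of the residuals (a rough sketch only; dedicated tests such as the Durbin-Watson test are available in add-on packages):

```r
fit <- lm(y1 ~ x1, data = anscombe)   # first anscombe dataset, wide format
r <- residuals(fit)                   # residuals in observation order

# correlation between residuals and their lag-1 neighbours;
# values near 0 suggest no serial correlation
lag1 <- cor(r[-length(r)], r[-1])
lag1
```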

The residuals should be normally distributed, which we can check by plotting a histogram of the residuals:

```
p1 <- ggplot(data = ans1, aes(x = residuals)) +
  geom_histogram() +
  labs(title = "Anscombe 1")
p2 <- ggplot(data = ans2, aes(x = residuals)) +
  geom_histogram() +
  labs(title = "Anscombe 2")
```

With few data points, however, histograms are difficult to assess! We hence often prefer QQ-plots.

The quantile-quantile plot (QQ-plot) plots the quantiles of the standardized residuals against the quantiles of a standard normal distribution:

```
ans1$resid_standard <- rstandard(lm.ans1) # standardized residuals
p1 <- ggplot(ans1, aes(sample = resid_standard)) +
  stat_qq() +
  geom_abline(intercept = 0, slope = 1)
ans2$resid_standard <- rstandard(lm.ans2) # standardized residuals
p2 <- ggplot(ans2, aes(sample = resid_standard)) +
  stat_qq() +
  geom_abline(intercept = 0, slope = 1)
```

If the residuals are normally distributed, the points should lie roughly on a line through the origin with slope 1. Interpretation takes some experience and is not always easy, especially with small sample sizes. Yet, it is one of the best methods! Try https://xiongge.shinyapps.io/QQplots/ to see the effect of deviations from normality on the QQ-plot.
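As a numerical complement to the visual check, you can run a Shapiro-Wilk test on the residuals (a sketch; note that with n = 11 such tests have very little power, so the QQ-plot remains the primary tool):

```r
lm.ans1.wide <- lm(y1 ~ x1, data = anscombe)  # same model, wide-format data
sw <- shapiro.test(residuals(lm.ans1.wide))
sw$p.value  # a small p-value would indicate evidence against normality
```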

When fitting a linear regression model, we assume a linear relationship between \(y\) and \(x\). But this assumption may be incorrect. To test whether an assumed relationship matches the data or not, you may need to compare different models (including data transformations). You should also apply your mechanistic understanding of the system you are trying to analyse.

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.