Recap last session

  • Installing and loading packages (install.packages(), library())
  • Plotting with R base graphics (hist(), plot(), boxplot())
  • Plotting with ggplot2 (ggplot())
  • ggplot2 Geoms (geom_histogram(), geom_point(), geom_boxplot())
  • ggplot2 Aesthetics (aes(x, y, col, shape, linetype, fill, size))
  • ggplot2 Trellis graphs (facet_wrap())
  • ggplot2 Plot title and axis labels (labs(x, y, title))
  • Combining multiple plots (gridExtra package, grid.arrange())
  • Saving plots to graphic files (ggsave() or png(), pdf(), tiff(), etc., dev.off())
  • Tidying data (tidyr package, pivot_wider(), pivot_longer(), separate())
  • Writing complex statements using pipes (dplyr package, %>%)
  • Data wrangling with dplyr (mutate(), summarize(), group_by(), filter())

Today’s session

  • The lm() function and its extractor functions
  • Plotting regression lines
  • Prediction versus confidence bands
  • Model diagnostics with R

Packages

library(ggplot2) # load ggplot2 package for visualization
library(tidyr) # data wrangling package: change from wide to long format
library(dplyr) # data manipulation: mutate

Anscombe dataset

The anscombe dataset ships with base R and contains four small datasets (four x-y pairs stored as columns of a single data frame). To make life easier later in the session, I convert this wide format (datasets are organized by columns) into a long format (datasets are organized by rows) using the tidyr package.

data(anscombe) # load anscombe (installed with R)
head(anscombe, 2) # print first two rows
##   x1 x2 x3 x4   y1   y2   y3   y4
## 1 10 10 10  8 8.04 9.14 7.46 6.58
## 2  8  8  8  8 6.95 8.14 6.77 5.76
# create id
anscombe$id <- 1:11  

# turn into long format
temp1 <- pivot_longer(anscombe, cols= -id, names_to="key", values_to="value") 

# split the variable names (x1, y1, ...) into variable (x or y) and dataset id
temp2 <- extract(temp1, key, c("var", "dataset"), "(.)(.)") 

# spread variables to separate x and y columns
anscombe.long <- pivot_wider(temp2, names_from=var, values_from=value)

head(anscombe.long, 2) # print first two rows
## # A tibble: 2 x 4
##      id dataset     x     y
##   <int> <chr>   <dbl> <dbl>
## 1     1 1          10  8.04
## 2     1 2          10  9.14

For this exercise, we will use the first anscombe dataset.

# extract first anscombe dataset
ans1 <- anscombe.long[anscombe.long$dataset==1,] 
ggplot(ans1, aes(x = x, y = y)) +
       geom_point()

Ordinary Least Squares

The linear regression model is given by:

\(y_i = \beta_{0} + \beta_{1}x_i + \epsilon_i\)

The parameters \(\beta_{0}\) and \(\beta_{1}\) can be estimated using the method of ordinary least squares.

Residual sum of squares (SSE)

OLS is based on the residuals \(y_{i}-\hat{y}_i\) (the differences between observed and fitted values, red lines). Specifically, OLS minimizes the sum of squared residuals: \(SSE = \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^2\). Thus, SSE measures the variability left unexplained by the regression. The diagonal grey line is the fitted line \(\hat{y}_i\).

Total sum of squares (SSY)

The total sum of squares (SSY) is based on the differences (blue lines) between the observed response \(y\) (dots) and its mean \(\bar{y}\) (black line): \(SSY = \sum_{i=1}^{n}(y_{i}-\bar{y})^2\). SSY is a measure of the total variability of the response variable \(y\). You can also think of it as the residual sum of squares of the null model (a single mean), which has intercept \(\beta_0 = \bar{y}\) and slope \(\beta_1 = 0\).

Regression sum of squares (SSR)

The sum of squares due to regression is the sum of squared differences between the fitted values (grey line) and the mean response (black line): \(SSR = \sum_{i=1}^{n}(\hat{y}_{i} - \bar{y})^2\). Thus, it is a measure of how well your line fits the data. In general, \(SSR = SSY - SSE\); if SSE is zero, then SSR = SSY, which means your regression model captures all of the observed variability.
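
To make the decomposition concrete, here is a minimal sketch for the first anscombe dataset (it uses lm(), which is only introduced further below, purely to obtain the fitted values \(\hat{y}_i\)):

# sketch: sums of squares for the first anscombe dataset
fit <- lm(y ~ x, data = ans1)           # fitted values (lm() is introduced below)
SSE <- sum((ans1$y - fitted(fit))^2)    # unexplained variability
SSY <- sum((ans1$y - mean(ans1$y))^2)   # total variability
SSR <- SSY - SSE                        # variability explained by the regression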

F-test

The F-test of overall significance is a specific form of the F-test. It compares a model with no predictors to the model that you specify. A regression model with no predictors is also known as the null model or intercept-only model. Everything you need for the F-test is in the analysis of variance (ANOVA) table:

\(H_0\): The fit of the intercept-only model and your model are equal.

\(H_A\): Your model fits the data significantly better than the intercept-only model.

Source     | Sum of Squares (SS) | Degrees of freedom (df) | Mean Squares (MS)
Regression | \(SSR = SSY - SSE\) | \(1\)                   | \(\frac{SSR}{1}\)
Error      | \(SSE\)             | \(n-2\)                 | \(\frac{SSE}{n-2}\)
Total      | \(SSY\)             | \(n-1\)                 |

\(F_s = \frac{MS_{regression}}{MS_{error}}\)

\(p = Pr(F_{1,\,n-2} \ge F_s) = 1 - F(F_s;\, 1, n-2)\), where \(F(\cdot;\, 1, n-2)\) is the cumulative distribution function of the F-distribution with 1 and \(n-2\) degrees of freedom.

Model fitting with lm()

The lm() function implements simple linear regression in R. The argument to lm() is a model formula in which the tilde symbol (~) should be read as “described by”.

lm.anscombe1 <- lm(y ~ x, data = ans1) # fits the model
lm.anscombe1 # print the lm object lm.anscombe1
## 
## Call:
## lm(formula = y ~ x, data = ans1)
## 
## Coefficients:
## (Intercept)            x  
##      3.0001       0.5001

The result of lm() is a model object. This is a distinctive concept of R. Whereas other statistical systems focus on generating printed output that can be controlled by setting options, you get instead the result of a model fit encapsulated in an object from which the desired quantities can be obtained using extractor functions. Printing the created model object gives you the model formula and the coefficients (intercept and slope). More information about the model can be accessed with extractor functions. We’ll do that later.
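
To get a feeling for what such a model object contains, you can inspect its components directly (a quick sketch; the extractor functions discussed below are the recommended interface):

class(lm.anscombe1) # the object is of class "lm"
names(lm.anscombe1) # components such as coefficients, residuals, fitted.values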

Visualization

We can plot the regression line from the estimated slope (\(\beta_{1}\)) and intercept (\(\beta_{0}\)) using geom_abline():

ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = 3.0001, slope = 0.5001, col = "red")

ANOVA table in R

Calling anova() on an object returned by lm() computes the ANOVA table. The SSE is 13.763, the mean squared error is 1.529 (with 9 degrees of freedom), and the SSR is 27.51, so SSY = SSR + SSE = 41.27.

The F-value (\(F_s\)) is 17.99.

anova(lm.anscombe1)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x          1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note, we can also do the F-test manually by looking up the \(F_s\) value in the F-distribution. Recall, the F-distribution has two parameters: the degrees of freedom of the regression (\(df_1 = 1\)), and the degrees of freedom of the error (\(df_2 = n-2\)).

# p-value of the F-test
1 - pf(17.99, 1, 11-2)
## [1] 0.002169607

As the p-value is very small, we safely reject the null hypothesis \(H_0 : \beta_1 = 0\), indicating a significant effect of \(x\) on \(y\). However, be aware that the p-value does not indicate the effect size! The effect size (change in \(y\) with one unit change in \(x\)) is given by the regression slope (\(\beta_1\)).
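
Tying this back to the ANOVA table, the F-statistic can also be recomputed by hand from the sums of squares (a sketch using lm.anscombe1 and ans1 from above):

SSE <- sum(residuals(lm.anscombe1)^2)                # 13.763 (see ANOVA table)
SSR <- sum((fitted(lm.anscombe1) - mean(ans1$y))^2)  # 27.510
F_s <- (SSR / 1) / (SSE / (11 - 2))                  # MS_regression / MS_error = 17.99
1 - pf(F_s, 1, 11 - 2)                               # same p-value as above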

lm() extractor functions

Using the summary() function we can get a more comprehensive summary of the model statistics, including the F-test, t-tests for the individual parameters, and the coefficient of determination (\(R^2\)).

summary(lm.anscombe1)
## 
## Call:
## lm(formula = y ~ x, data = ans1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x             0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

The function fitted() returns the fitted values: the y-values that you would expect for the given x-values according to the best-fitting straight line. The residuals, returned by residuals(), are the differences between the observed y-values and these fitted values.

fitted(lm.anscombe1)
##         1         2         3         4         5         6         7         8 
##  8.001000  7.000818  9.501273  7.500909  8.501091 10.001364  6.000636  5.000455 
##         9        10        11 
##  9.001182  6.500727  5.500545
residuals(lm.anscombe1)
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455

The function coefficients() returns the estimated model parameters; in this case intercept and slope.

coefficients(lm.anscombe1)
## (Intercept)           x 
##   3.0000909   0.5000909

You can extract both parameters using simple vector indexing:

intercept <- coefficients(lm.anscombe1)[1]
slope <- coefficients(lm.anscombe1)[2]
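
These objects can then be passed to geom_abline(), so the plot from above no longer needs hard-coded estimates (a small sketch):

ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = intercept, slope = slope, col = "red")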

Predicted values can be extracted with the function predict(). With no arguments, it just gives the fitted values:

predict(lm.anscombe1) # Similar to fitted(lm.anscombe1)
##         1         2         3         4         5         6         7         8 
##  8.001000  7.000818  9.501273  7.500909  8.501091 10.001364  6.000636  5.000455 
##         9        10        11 
##  9.001182  6.500727  5.500545

You can also predict outcomes for new values:

new.ans1 <- data.frame(x = c(1, 4, 6, 19))
predict(lm.anscombe1, newdata = new.ans1)
##         1         2         3         4 
##  3.500182  5.000455  6.000636 12.501818

Please note that the data frame must have the same column names as used in the formula within lm()!

By setting the argument interval = "confidence", you get the predicted values together with lower and upper limits that account for the uncertainty in the estimated parameters.

predict(lm.anscombe1, interval = 'confidence')
##          fit      lwr       upr
## 1   8.001000 7.116387  8.885613
## 2   7.000818 6.116205  7.885431
## 3   9.501273 8.141258 10.861287
## 4   7.500909 6.657464  8.344354
## 5   8.501091 7.503113  9.499069
## 6  10.001364 8.423422 11.579305
## 7   6.000636 4.838027  7.163245
## 8   5.000455 3.422513  6.578396
## 9   9.001182 7.838573 10.163791
## 10  6.500727 5.502750  7.498705
## 11  5.500545 4.140531  6.860560

Note that R returns the 95% confidence interval by default. You can change it by setting the argument level to another confidence level (e.g., level = 0.99).
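
For example (a minimal sketch), 99% confidence intervals for the first observations:

head(predict(lm.anscombe1, interval = "confidence", level = 0.99)) # wider than the 95% intervals above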

Confidence bands

prediction.anscombe1 <- lm.anscombe1 %>%
  predict(., interval = 'confidence') %>%
  as.data.frame() %>%
  mutate(x = ans1$x)

p <- ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = prediction.anscombe1, aes(x = x, y = fit)) +
  geom_line(data = prediction.anscombe1, aes(x = x, y = upr), linetype = "dashed" ) +
  geom_line(data = prediction.anscombe1, aes(x = x, y = lwr), linetype = "dashed")
p # display the plot

Interpret the confidence band as: when repeating the experiment, we would expect the estimated regression lines to lie within the confidence band 95% of the time.
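
As a shortcut (a sketch, not used in the remainder of this session), geom_smooth() with method = "lm" draws the fitted line together with a shaded 95% confidence band in a single step:

ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, level = 0.95)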

Prediction bands

If you add interval = "prediction", you get the predicted values with lower and upper limits that account for the uncertainty in the parameters plus the residual variance.

prediction.anscombe1 <- lm.anscombe1 %>%
  predict(., interval = 'prediction') %>%
  as.data.frame() %>%
  mutate(x = ans1$x)
## Warning in predict.lm(., interval = "prediction"): predictions on current data refer to _future_ responses
p <- ggplot(ans1, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = prediction.anscombe1, aes(x = x, y = fit)) +
  geom_line(data = prediction.anscombe1, aes(x = x, y = upr), linetype = "dashed" ) +
  geom_line(data = prediction.anscombe1, aes(x = x, y = lwr), linetype = "dashed")
p # display the plot

Interpret the prediction band as: When predicting new values, we would expect that 95% of them lie within the prediction band.

Model diagnostics

Five-step plan

The validity of a model depends on several assumptions regarding the distribution of residuals and the functional relationship between \(y\) and \(x\). Visual/numerical model diagnostics should cover the following aspects:

  1. Analysis of the distribution of the predictor variable

  2. Analysis of leverage points (points with high influence)

  3. Analysis of the dispersion

  4. Analysis of the residuals (independence, normal distribution, and homoscedasticity)

  5. Analysis of the functional relationship between response and predictor
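
As a quick complement (a sketch; the sections below build each diagnostic plot explicitly with ggplot2), base R produces a standard set of diagnostic plots when you call plot() on an lm object:

par(mfrow = c(2, 2)) # 2 x 2 panel layout
plot(lm.anscombe1)   # residuals vs fitted, QQ-plot, scale-location, residuals vs leverage
par(mfrow = c(1, 1)) # reset layout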

Anscombe data

ggplot(anscombe.long, aes(x = x, y = y)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE, col = 'red') +      
    facet_wrap(~ dataset)
## `geom_smooth()` using formula 'y ~ x'

Distribution of predictor

  • Extreme \(x\) values can dominate a regression.

  • Ideally, \(x\) would be uniformly distributed. We make no explicit assumptions on the distribution of \(x\), but it can matter.

  • This should be considered when designing experiments and sampling surveys.

  • Log- and square-root transformations can dampen the effect of extreme values (but be careful, as the interpretation changes!).

ggplot(anscombe.long, aes(x = dataset, y = x)) +
            geom_boxplot()

Leverage points

There is no practical justification to eliminate ‘outliers’ just because they are extreme values. Removing extreme values has to be well justified, e.g. measurement error.

Although you should not simply delete outliers, you can run your analysis twice (with and without them) to test the robustness of your results.

Cook’s distance is a common measure to identify influential points. Values greater than 1 or greater than 4/n (n = number of samples) are candidate influential points. You can extract Cook’s distance using cooks.distance(), though we suggest relying on your intuition rather than on strict thresholds.
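
A minimal sketch for the model fitted above (lm.anscombe1):

cooks.d <- cooks.distance(lm.anscombe1) # Cook's distance per observation
cooks.d[cooks.d > 4 / nrow(ans1)]       # candidates according to the 4/n rule of thumb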

Dispersion

  • Dispersion quantifies how much higher (or lower) the observed variance is relative to the variance expected under the model (overdispersion or underdispersion).

  • This is not relevant for the normal-distribution case (OLS assumes a normally distributed error term), since the standard deviation (the measure of variability) is independent of the mean and is estimated separately.

  • But dispersion will play a role for other distributions modeled with Generalized Linear Models.

  • …to be continued.

Response residuals

The residuals should be independent of the fitted values. We can check this by plotting the residuals against the fitted values:

ans1 <- anscombe.long[anscombe.long$dataset==1,]
lm.ans1 <- lm(y ~ x, data=ans1)
ans1$residuals <- residuals(lm.ans1)
ans1$fitted <- fitted(lm.ans1)

p1 <- ggplot(data = ans1, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 1")

ans2 <- anscombe.long[anscombe.long$dataset==2,]
lm.ans2 <- lm(y ~ x, data=ans2)
ans2$residuals <- residuals(lm.ans2)
ans2$fitted <- fitted(lm.ans2)


p2 <- ggplot(data = ans2, aes(x = fitted, y = residuals)) +
      geom_point() +
      geom_hline(yintercept = 0, col = "red") +
      labs(title = "Anscombe 2")

There should be no serial correlation between residuals. We can check this by plotting the residuals against the observation order:

p1 <- ggplot(data = ans1, aes(x=1:11, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 1")


p2 <- ggplot(data = ans2, aes(x = 1:11, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(title = "Anscombe 2")

Please note that regression models might have temporally or spatially correlated residuals. In fact, both are very common with ecological data. You will learn how to deal with spatial correlation structure in the residuals later in the course.

The residuals should be normally distributed, which we can check by plotting a histogram of the residuals:

p1 <- ggplot(data = ans1, aes(x = residuals)) +
  geom_histogram() +
  labs(title = "Anscombe 1")

p2 <- ggplot(data = ans2, aes(x = residuals)) +
  geom_histogram() +
  labs(title = "Anscombe 2")

With few data points, however, histograms are difficult to assess! We therefore often prefer QQ-Plots.

The Quantile-Quantile Plot (QQ-Plot) plots the quantiles of the standardized residuals against the quantiles of a standard normal distribution:

ans1$resid_standard <- rstandard(lm.ans1) # standardized residuals
p1 <- ggplot(ans1, aes(sample = resid_standard)) +
      stat_qq() +
      geom_abline(intercept = 0, slope = 1)

ans2$resid_standard <- rstandard(lm.ans2) # standardized residuals
p2 <- ggplot(ans2, aes(sample = resid_standard)) +
      stat_qq() +
      geom_abline(intercept = 0, slope = 1)

If the two sets of quantiles come from the same distribution (i.e., the residuals are normally distributed), the points should lie roughly on a line through the origin with slope 1. Interpretation takes some experience and is not always easy, especially with small sample sizes. Yet, it is one of the best methods! Try https://xiongge.shinyapps.io/QQplots/ to see the effect of deviations from normality on the QQ-plot.

Functional relationship

When fitting a linear regression model, we assume a linear relationship between \(y\) and \(x\). But this assumption may be incorrect. To test whether an assumed relationship matches the data or not, you may need to compare different models (including data transformations). You should also apply your mechanistic understanding of the system you are trying to analyse.
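
For example (a sketch), the second anscombe dataset is clearly curved; one way to check the assumed functional relationship is to compare the linear fit with a model that adds a quadratic term:

lm.ans2.lin  <- lm(y ~ x, data = ans2)          # linear model
lm.ans2.quad <- lm(y ~ x + I(x^2), data = ans2) # model with an added quadratic term
anova(lm.ans2.lin, lm.ans2.quad)                # F-test: does the quadratic term improve the fit?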


Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.