Generalized linear models (GLMs) extend linear models by letting us choose freely

- how the data is linked to the model (link function) and
- which data-generating distribution is assumed (data distribution).

We covered two models:

- Log link and Gaussian distribution (ozone~temp)
- Logit link and binomial distribution (mortality~concentration)

Generalized linear models can be fitted in `R` using the `glm()` function, which is similar to the `lm()` function for fitting linear models. The arguments to a `glm()` call are as follows:

`glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = glm.control(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)`

The `family` argument takes (the name of) a family function, which specifies

- the link function
- the variance function

The exponential family functions available in R are:

```
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
```
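To see what such a family function returns, we can inspect the components of a family object; a minimal sketch:

```
# A family object bundles the link function, its inverse, and the variance function
fam <- binomial(link = "logit")
fam$linkfun(0.5)   # logit(0.5) = 0
fam$linkinv(0)     # inverse logit: back to 0.5
fam$variance(0.5)  # binomial variance function mu * (1 - mu) = 0.25
```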

We will use GLMs to analyze:

- Binary response data with logit link and binomial distribution (known as logistic regression)
- Count data with log link and Poisson distribution

Today we use a dataset where the response can only take two values. The data indicates whether trees were infested by a bark beetle or not, depending on the temperature anomaly (z-transformed temperatures, with values > 0 indicating hotter than average):

```
dat <- read.csv("Data/infestation.csv")
head(dat)
```

```
## infested temp_anomaly
## 1 1 -1.0785729
## 2 1 1.2728951
## 3 1 1.8699789
## 4 1 0.6327722
## 5 0 0.3342014
## 6 1 0.8407184
```

```
library(ggplot2)

ggplot(dat, aes(x = temp_anomaly, y = infested)) +
  geom_point()
```

Similar problems include: dead/alive, presence/absence, conversion/conservation, etc.

We want to model the probability of infestation given a certain temperature anomaly. The logit link seems well suited, as its inverse is bounded between 0 and 1 and allows for a quick transition between both extremes.

The regression can thus be written as:

\(Pr(infestation = 1) = logit^{-1}(\beta_0 + \beta_1 * temp) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * temp)}}\)
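As a quick sanity check of this formula, the inverse logit can be written out by hand or taken from R's built-in `plogis()` (the logistic distribution function); both agree:

```
# Inverse logit written out vs. R's built-in plogis()
inv_logit <- function(x) 1 / (1 + exp(-x))
x <- c(-2, 0, 2)
inv_logit(x)  # 0.119  0.500  0.881
plogis(x)     # identical values
```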

In our case the data is binomially distributed. Hence, we can relate the probability of infestation modelled through the logit link to our data using a binomial distribution with one trial, which translates into the Bernoulli distribution (a special case of the binomial distribution with only two outcomes):

\(infestation \sim Bernoulli(logit^{-1}(\beta_0 + \beta_1 * temp))\)

This model is called binomial logistic regression; there are also multinomial logistic regression (three or more outcomes) and ordinal logistic regression (ordinal outcomes).

Fitting a logistic regression model to our data:

`infested.bin <- glm(infested ~ temp_anomaly, data = dat, family = binomial(link = "logit"))`

And plotting the regression line:

```
prediction <- data.frame(temp_anomaly = seq(-2, 2, length.out = 100))
prediction$fitted <- predict(object = infested.bin, newdata = prediction, type = "response")

ggplot() +
  geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
  geom_line(data = prediction, aes(x = temp_anomaly, y = fitted))
```

To calculate confidence intervals, we first have to calculate them on the link scale and then backtransform them (`plogis()`) onto the response scale:

```
# Predictions and standard errors on the link (logit) scale
pred <- predict(object = infested.bin, newdata = prediction,
                type = "link", se.fit = TRUE)
prediction$fitted_link <- pred$fit
prediction$lower_link <- prediction$fitted_link - 2 * pred$se.fit
prediction$upper_link <- prediction$fitted_link + 2 * pred$se.fit

# Backtransform to the response (probability) scale
prediction$fitted <- plogis(prediction$fitted_link)
prediction$lower <- plogis(prediction$lower_link)
prediction$upper <- plogis(prediction$upper_link)
```

And add them to the plot:

```
ggplot() +
  geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
  geom_ribbon(data = prediction, aes(x = temp_anomaly, ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(data = prediction, aes(x = temp_anomaly, y = fitted))
```

Alternatively, we can use the `effect()` function from the `effects` package, which does the transformation automatically:

```
library(effects)

efct <- effect("temp_anomaly",
               mod = infested.bin,
               xlevels = list(temp_anomaly = seq(-2, 2, length.out = 100)))
efct <- as.data.frame(efct)

ggplot() +
  geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
  geom_ribbon(data = efct, aes(x = temp_anomaly, ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(data = efct, aes(x = temp_anomaly, y = fit))
```

We can estimate the temperature anomaly value at which the infestation probability is 50% using the `dose.p()` function from the `MASS` package.

```
library(MASS)
temp_anomaly_p50 <- dose.p(infested.bin, p=0.5)
temp_anomaly_p50
```

```
## Dose SE
## p = 0.5: -0.3744093 0.2372779
```
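As a cross-check: with the logit link, the predicted probability is 0.5 exactly where the linear predictor is zero, so the 50% point can also be computed directly from the coefficients:

```
# p = 0.5 where beta0 + beta1 * x = 0, i.e. at x = -beta0 / beta1
b <- coef(infested.bin)
-b[1] / b[2]  # ~ -0.374, matching dose.p()
```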

It is not straightforward to interpret coefficients in logistic regression, because the change in `y` given a one unit change in `x` is non-linear (though linear on the link scale!). However, besides plotting, there are some ways of making the regression coefficients interpretable.

The **divide by 4 rule** by Gelman and Hill (2007) exploits the fact that the logistic curve is steepest at its center (which is at zero given scaled predictors).

As a rough estimate, we can take logistic regression coefficients and divide them by 4 to get an upper bound of the predictive difference corresponding to a unit difference in x. This upper bound is a good approximation near the midpoint of the logistic curve, where probabilities are close to 0.5.

Gelman and Hill (2007): Data Analysis Using Regression and Multilevel/Hierarchical Models (page 82).

In our example, the regression coefficient was 2.18. Dividing 2.18 by 4 yields 0.54. We can thus say that with an increase in temperature anomaly by one standard deviation, there is an expected increase in infestation probability of no more than 54 percentage points.
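In R this is a one-liner:

```
# Divide-by-4 rule: rough upper bound on the change in probability per unit x
coef(infested.bin)["temp_anomaly"] / 4  # ~0.54
```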

`summary(infested.bin)`

```
##
## Call:
## glm(formula = infested ~ temp_anomaly, family = binomial(link = "logit"),
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8638 -0.6022 0.1966 0.4441 1.8595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8154 0.5766 1.414 0.15734
## temp_anomaly 2.1778 0.7809 2.789 0.00529 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40.381 on 29 degrees of freedom
## Residual deviance: 23.488 on 28 degrees of freedom
## AIC: 27.488
##
## Number of Fisher Scoring iterations: 5
```

Another way to interpret logistic regression coefficients is in terms of **odds ratios**.

If a `success` has the probability \(p\) and a `failure` the probability \(1 - p\), then \(p/(1 - p)\) is called the odds. Thus, the odds of `success` are defined as the ratio of the probability of success over the probability of failure.

An odds of 1 is equivalent to a probability of 0.5, that is, equally likely outcomes (1:1). An odds of 0.5 is equivalent to a probability of 0.33, indicating a 1:2 chance of `success`. An odds of 2 is equivalent to a probability of 0.66, indicating a 2:1 chance of `success`.

The ratio of two odds is called an odds ratio: \(\frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}\).

The odds ratio indicates the preference of outcome A over outcome B.

**Example:**

We observed a plant species in 6 out of 10 plots in landscape A and in 4 out of 10 plots in landscape B.

This gives us probabilities of 0.6 and 0.4. The odds ratio of observing the plant in landscape A over observing it in landscape B is (0.6/0.4) / (0.4/0.6) = 2.25.

Hence, the odds of observing the plant in landscape A are 2.25 times higher than in landscape B.
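These conversions are easy to verify in R; a small sketch (the `odds()` helper is our own, not a base R function):

```
# Odds from probability: p / (1 - p)
odds <- function(p) p / (1 - p)
odds(c(0.33, 0.5, 0.66))  # ~0.5, 1, ~2
odds(0.6) / odds(0.4)     # odds ratio of the plant example: 2.25
```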

The logistic regression coefficients give the change in log odds (recall the link function \(ln(p/(1-p))\)).

When exponentiated, logistic regression coefficients can be interpreted as odds ratios:

\(OR = \frac{odds(x + 1)}{odds(x)} = \frac{e^{\beta_0 + \beta_1 * (x + 1)}}{e^{\beta_0 + \beta_1 * x}} = e^{\beta_1}\)

For continuous predictors, the coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

For categorical predictors, the coefficient estimate describes the change in the log odds for each level compared to the base level.

Infestation example:

`exp(coef(infested.bin)[2])`

```
## temp_anomaly
## 8.826474
```

The odds ratio indicates that with a one SD change in temperature anomaly, the odds of infestation (over no infestation) are 8.83 times higher.
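Confidence intervals for the odds ratio are obtained by exponentiating the confidence limits computed on the link scale, e.g. via `confint()`:

```
# Odds ratios with profile-likelihood confidence intervals
exp(cbind(OR = coef(infested.bin), confint(infested.bin)))
```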

The following data set shows bird counts over a topographic relief gradient:

```
dat <- read.table("Data/bird_counts.txt", sep = ";", header = TRUE)
head(dat)
```

```
## relief Y
## 1 0.576 129
## 2 0.576 166
## 3 0.576 137
## 4 0.576 105
## 5 0.576 155
## 6 0.576 178
```

```
ggplot(dat, aes(x = relief, y = Y)) +
  geom_point()
```

```
ggplot(dat, aes(x = relief, y = Y)) +
  scale_y_continuous(trans = 'log') +
  geom_point()
```

The data is bounded below by zero; a log link is thus recommended:

\(counts_{expected} = log^{-1}(\beta_0 + \beta_1 * relief)\)

As the data can only take natural numbers (0, 1, 2, …), we need a discrete data distribution. For count models, the Poisson distribution is a good choice, yielding the final model of:

\(counts \sim Poisson(log^{-1}(\beta_0 + \beta_1 * relief))\)

`bird.poisson <- glm(Y ~ relief, data = dat, family = poisson(link = "log"))`

`summary(bird.poisson)`

```
##
## Call:
## glm(formula = Y ~ relief, family = poisson(link = "log"), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -11.1980 -2.1407 -0.0021 2.1986 10.7204
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.665888 0.021577 216.24 <2e-16 ***
## relief 0.339559 0.008382 40.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2921.1 on 60 degrees of freedom
## Residual deviance: 1299.6 on 59 degrees of freedom
## AIC: 1739.7
##
## Number of Fisher Scoring iterations: 4
```

Dispersion quantifies how much higher (or lower) the observed variance is relative to the variance expected under the model.

For the Poisson distribution, the expected value is equal to the variance, i.e. the variance increases 1:1 with the mean. Hence, **dispersion = 1**. In other words, when we fit a Poisson model, we expect the variance to increase with the mean by a factor of 1.

If the variance increases at a lower rate than the mean, the data are **underdispersed**. If the variance increases at a higher rate than the mean, the data are **overdispersed**.
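This property is easy to see in simulated Poisson data; a minimal sketch:

```
# For Poisson data the variance equals the mean (dispersion = 1)
set.seed(42)
x <- rpois(100000, lambda = 4)
mean(x)  # ~4
var(x)   # ~4
```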

With a correct mean model we still obtain consistent estimates of the parameters, but overdispersion leads to:

- underestimation of the standard errors of the parameter estimates
- selection of overly complex models

To get a rough estimate of overdispersion, divide the *residual deviance* by the *residual degrees of freedom*. The ratio should be approximately 1. Values much larger (> 2) indicate overdispersion (values below 0.6 indicate underdispersion).

```
## Residual deviance: 1299.6 on 59 degrees of freedom
summary(bird.poisson)$deviance / summary(bird.poisson)$df.residual
```

`## [1] 22.02709`
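A slightly more formal estimate uses the sum of squared Pearson residuals divided by the residual degrees of freedom; this is the estimate the quasi-Poisson fit below reports as its dispersion parameter:

```
# Dispersion estimate based on Pearson residuals
sum(residuals(bird.poisson, type = "pearson")^2) / df.residual(bird.poisson)
## ~21.3 (matches the quasi-Poisson dispersion parameter below)
```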

Our model appears to be overdispersed, indicating:

- bad model fit (omitted terms/variables, incorrect relationship/link, outliers)
- the variation of Y is greater than predicted by the model

Remedies:

- change the distribution/link function
- change the optimization (quasi-likelihood)

For quasi-Poisson regression, the variance is assumed to be a linear function of the mean; for negative binomial regression, a quadratic function.

```
bird.quasip <- glm(Y ~ relief, data = dat, family = quasipoisson)
summary(bird.quasip)
```

```
##
## Call:
## glm(formula = Y ~ relief, family = quasipoisson, data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -11.1980 -2.1407 -0.0021 2.1986 10.7204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.66589 0.09953 46.878 < 2e-16 ***
## relief 0.33956 0.03866 8.782 2.64e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 21.27853)
##
## Null deviance: 2921.1 on 60 degrees of freedom
## Residual deviance: 1299.6 on 59 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
```
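If a full likelihood model is preferred (e.g. to keep the AIC available for model selection), negative binomial regression is a common alternative; a minimal sketch using `glm.nb()` from the `MASS` package loaded earlier:

```
# Negative binomial GLM: variance = mu + mu^2 / theta (quadratic in the mean)
library(MASS)
bird.nb <- glm.nb(Y ~ relief, data = dat)
summary(bird.nb)  # theta estimates the amount of overdispersion
```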

The coefficients of the Poisson model can be interpreted easily, as they are multiplicative when backtransformed to the response scale. That is, with a one unit increase in x, the expected count is multiplied by the exponent of the coefficient. For our example, we expect a 1.4 times (or 40%) increase in mean bird counts with a one unit increase in relief. You can use `confint()` to compute confidence intervals for the parameters.

```
# Poisson
coeffs <- exp(coef(bird.poisson))
coeffs_ci <- exp(confint(bird.poisson))
cbind(coeffs, coeffs_ci)
```

```
## coeffs 2.5 % 97.5 %
## (Intercept) 106.259866 101.846790 110.836066
## relief 1.404328 1.381446 1.427589
```

```
# quasi-Poisson
coeffs <- exp(coef(bird.quasip))
coeffs_ci <- exp(confint(bird.quasip))
cbind(coeffs, coeffs_ci)
```

```
## coeffs 2.5 % 97.5 %
## (Intercept) 106.259866 87.187307 128.808562
## relief 1.404328 1.301813 1.514927
```

A pseudo-R² can be calculated as follows (Extending the Linear Model with R, Julian J. Faraway, p. 59):

\(R^2_{pseudo} = 1-\frac{Residual\ deviance}{Null\ deviance}\)

```
## Null deviance: 2921.1 on 60 degrees of freedom
## Residual deviance: 1299.6 on 59 degrees of freedom
1 - 1299.6/2921.1
```

`## [1] 0.5550991`
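The same value can be computed directly from the fitted model object:

```
# Pseudo R-squared from the model's deviances
pseudo_r2 <- function(mod) 1 - mod$deviance / mod$null.deviance
pseudo_r2(bird.poisson)  # ~0.555
```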

```
efct <- effect("relief", mod = bird.quasip, xlevels = list(relief = seq(0, 4, length.out = 100)))
efct <- as.data.frame(efct)
p <- ggplot() +
geom_point(data = dat, aes(x = relief, y = Y)) +
geom_ribbon(data = efct, aes(x = relief, ymin = lower, ymax = upper), alpha = 0.3) +
geom_line(data = efct, aes(x = relief, y = fit))
print(p)
```

```
library(gridExtra)

plot1 <- ggplot(dat, aes(x = relief, y = Y)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "poisson"), col = "red")
plot2 <- ggplot(dat, aes(x = relief, y = Y)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "quasipoisson"), col = "red")
grid.arrange(plot1, plot2, ncol = 2)
```

```
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
```
