Recap GLMs

A generalization of linear models that lets us choose freely

  1. how the expected response is linked to the linear predictor (link function) and
  2. which data-generating distribution is assumed (data distribution).

We covered two models:

  • Log link and Gaussian distribution (ozone~temp)
  • Logit link and binomial distribution (mortality~concentration)

In R, GLMs are constructed using the glm() function.

The glm Function

Generalized linear models can be fitted in R using the glm function, which is similar to the lm function for fitting linear models. The arguments of a glm call are as follows:

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = glm.control(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)

The family argument takes (the name of) a family function which specifies

  • the link function
  • the variance function

The exponential family functions available in R are

  • binomial(link = "logit")
  • gaussian(link = "identity")
  • Gamma(link = "inverse")
  • inverse.gaussian(link = "1/mu2")
  • poisson(link = "log")
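
Each of these returns a family object that bundles the link function, its inverse, and the variance function. A minimal sketch of how to inspect one:

fam <- binomial(link = "logit")
fam$link            # "logit"
fam$linkfun(0.5)    # link function: logit(0.5) = 0
fam$linkinv(0)      # inverse link: 0.5
fam$variance(0.5)   # variance function: mu * (1 - mu) = 0.25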

Today’s session

GLMs to analyze:

  • Binary response data with logit link and binomial distribution (known as logistic regression)
  • Count data with log link and Poisson distribution

Logistic regression

Today we use a dataset where the response can take only two values. The data indicate whether trees were infested by a bark beetle or not, depending on the temperature anomaly (z-transformed temperatures, with values > 0 indicating hotter than average):

dat <- read.csv("Data/infestation.csv")
head(dat)
##   infested temp_anomaly
## 1        1   -1.0785729
## 2        1    1.2728951
## 3        1    1.8699789
## 4        1    0.6327722
## 5        0    0.3342014
## 6        1    0.8407184
ggplot(dat, aes(x = temp_anomaly, y = infested)) +
  geom_point()

Similar problems include: dead/alive, presence/absence, conversion/conservation, etc.

We want to model the probability of infestation given a certain temperature anomaly. The logit link is well suited here: its inverse is bounded between 0 and 1 and allows for a quick transition between both extremes.

The regression thus can be written as:

\(Pr(infestation = 1) = logit^{-1}(\beta_0 + \beta_1 * temp) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * temp)}}\)
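
R's plogis() implements this inverse logit (we will use it again below for the confidence intervals); a quick check of the formula:

# Inverse logit by hand vs. R's built-in plogis()
inv_logit <- function(x) 1 / (1 + exp(-x))
inv_logit(0)                            # 0.5
all.equal(inv_logit(1.5), plogis(1.5))  # TRUE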

In our case the data are binomially distributed with a single trial per observation. Hence, we can relate the probability of infestation modelled through the logit link to our data using a binomial distribution with one trial, which translates into the Bernoulli distribution (a special case of the binomial distribution with only two outcomes):

\(infestation \sim Bernoulli(logit^{-1}(\beta_0 + \beta_1 * temp))\)

This model is called binomial logistic regression; there are also multinomial logistic regression (three or more outcomes) and ordinal logistic regression (ordered outcomes).

Fitting a logistic regression model to our data:

infested.bin <- glm(infested ~ temp_anomaly, data = dat, family = binomial(link = "logit"))

And plotting the regression line:

prediction <- data.frame(temp_anomaly = seq(-2, 2, length.out = 100))
prediction$fitted <- predict(object = infested.bin, newdata = prediction, type = "response")

ggplot() +
  geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
  geom_line(data = prediction, aes(x = temp_anomaly, y = fitted))

Confidence intervals

To calculate the confidence intervals we first compute them on the link scale and then back-transform them (with plogis(), the inverse logit) onto the response scale:

# Predict on the link scale, where standard errors are approximately normal
pred <- predict(object = infested.bin, newdata = prediction, 
                type = "link", se.fit = TRUE)

# Approximate 95% interval on the link scale: fit +/- 2 standard errors
prediction$fitted_link <- pred$fit
prediction$lower_link <- prediction$fitted_link - 2 * pred$se.fit
prediction$upper_link <- prediction$fitted_link + 2 * pred$se.fit

# Back-transform to the probability scale with the inverse logit
prediction$link <- plogis(prediction$fitted_link)
prediction$lower <- plogis(prediction$lower_link)
prediction$upper <- plogis(prediction$upper_link)

And add them to the plot:

ggplot() +
  geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
  geom_ribbon(data = prediction, aes(x = temp_anomaly, ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(data = prediction, aes(x = temp_anomaly, y = fitted))

effects package

Alternatively, we can use the effect() function, which does the transformation automatically:

library(effects)

efct <- effect("temp_anomaly", 
               mod = infested.bin, 
               xlevels = list(temp_anomaly = seq(-2, 2, length.out = 100)))
efct <- as.data.frame(efct)

ggplot() +
  geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
  geom_ribbon(data = efct, aes(x = temp_anomaly, ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(data = efct, aes(x = temp_anomaly, y = fit))

dose.p()

We can estimate the temperature anomaly at which the infestation probability is 50% using the dose.p() function from the MASS package.

library(MASS)
temp_anomaly_p50 <- dose.p(infested.bin, p=0.5)
temp_anomaly_p50
##                Dose        SE
## p = 0.5: -0.3744093 0.2372779
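
Since logit(0.5) = 0, the 50% point is simply where the linear predictor crosses zero; we can reproduce the estimate by hand:

# Solve beta0 + beta1 * x = 0 for x
-coef(infested.bin)[1] / coef(infested.bin)[2]
## matches the Dose value above (approx. -0.374)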

Interpretation

It is not straightforward to interpret coefficients in logistic regression, because the change in y for a one unit change in x is non-linear (though linear on the link scale!). However, besides plotting, there are some ways of making the regression coefficients interpretable.

The divide-by-4 rule by Gelman and Hill (2007) exploits the fact that the logistic curve is steepest at its center (which lies at zero given scaled predictors).

As a rough estimate, we can take logistic regression coefficients and divide them by 4 to get an upper bound on the predictive difference corresponding to a unit difference in x. This works because the slope of the logistic curve is \(\beta_1 p(1 - p)\), which is maximized at \(p = 0.5\), where it equals \(\beta_1/4\). The bound is a good approximation near the midpoint of the logistic curve, where probabilities are close to 0.5.

Gelman and Hill (2007): Data Analysis Using Regression and Multilevel/Hierarchical Models (page 82).

In our example, the regression coefficient is 2.18. Dividing 2.18 by 4 yields 0.54. We can thus say that a one standard deviation increase in temperature anomaly increases the infestation probability by no more than 54 percentage points.

summary(infested.bin)
## 
## Call:
## glm(formula = infested ~ temp_anomaly, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8638  -0.6022   0.1966   0.4441   1.8595  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    0.8154     0.5766   1.414  0.15734   
## temp_anomaly   2.1778     0.7809   2.789  0.00529 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40.381  on 29  degrees of freedom
## Residual deviance: 23.488  on 28  degrees of freedom
## AIC: 27.488
## 
## Number of Fisher Scoring iterations: 5
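
The same rough estimate can be computed directly from the fitted model:

# Divide-by-4 rule applied to the fitted coefficient
coef(infested.bin)["temp_anomaly"] / 4
## approx. 0.54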

Odds ratios

Another way to interpret logistic regression coefficients is in terms of odds ratios.

If a success has the probability \(p\) and failure the probability of \((1 − p)\), then \(p/(1 − p)\) is called the odds. Thus, the odds of success are defined as the ratio of the probability of success over the probability of failure.

An odds of 1 is equivalent to a probability of 0.5, that is, equally likely outcomes (1:1). An odds of 0.5 is equivalent to a probability of 0.33, indicating a 1:2 chance of success. An odds of 2 is equivalent to a probability of 0.67, indicating a 2:1 chance of success.
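
These conversions are easy to verify in R; a minimal sketch (the helper names odds() and prob() are ours):

odds <- function(p) p / (1 - p)    # probability -> odds
prob <- function(o) o / (1 + o)    # odds -> probability

odds(0.5)   # 1: equally likely outcomes
odds(1/3)   # 0.5: a 1:2 chance of success
prob(2)     # 0.667: a 2:1 chance of success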

  • The ratio of two odds is called an odds ratio: \(\frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}\).

  • The odds ratio indicates the preference of outcome A over outcome B.

Example:

  • We observed a plant species in 6 out of 10 plots in landscape A and in 4 out of 10 plots in landscape B.

  • This gives us probabilities of 0.6 and 0.4. The odds ratio of observing the plant in landscape A over observing it in landscape B is (0.6/0.4) / (0.4/0.6) = 2.25.

  • Hence, there is a 2.25 times higher chance of observing the plant in landscape A compared to landscape B (verified in the sketch after this list).

  • The logistic regression coefficients give the change in log odds (recall the link function \(ln(p/(1 - p))\)).

  • When exponentiated, logistic regression coefficients can be interpreted as odds ratios:

\(OR = \frac{odds(x + 1)}{odds(x)} = \frac{e^{\beta_0 + \beta_1 * (x + 1)}}{e^{\beta_0 + \beta_1 * x}} = e^{\beta_1}\)

  • For continuous predictors, the coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

  • For categorical predictors, the coefficient estimate describes the change in the log odds for each level compared to the base level.
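
The plant example from above, verified in R:

# 6/10 plots in landscape A, 4/10 plots in landscape B
p_A <- 0.6
p_B <- 0.4
(p_A / (1 - p_A)) / (p_B / (1 - p_B))
## [1] 2.25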

Infestation example

Applying this to the infestation model:

exp(coef(infested.bin)[2])
## temp_anomaly 
##     8.826474

The odds ratio indicates that with a one SD increase in temperature anomaly, the odds of infestation (over no infestation) are 8.83 times higher.
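
A confidence interval for the odds ratio can be obtained by exponentiating the profile-likelihood interval from confint(), analogous to the Poisson example below:

# CI for the odds ratio of temp_anomaly
exp(confint(infested.bin, parm = "temp_anomaly"))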

Poisson regression

The following data set shows bird counts over a topographic relief gradient:

dat <- read.table("Data/bird_counts.txt", sep = ";", header = TRUE)
head(dat)
##   relief   Y
## 1  0.576 129
## 2  0.576 166
## 3  0.576 137
## 4  0.576 105
## 5  0.576 155
## 6  0.576 178
ggplot(dat, aes(x = relief, y = Y)) + 
  geom_point()

ggplot(dat, aes(x = relief, y = Y)) + scale_y_continuous(trans='log') +
  geom_point()

The data are bounded to non-negative values, so a log link is recommended:

\(counts_{expected} = log^{-1}(\beta_0 + \beta_1 * relief)\)

As the data can only take non-negative integers (0, 1, 2, …), we need a discrete data distribution. For count models, the Poisson distribution is a good choice, yielding the final model:

\(counts \sim Poisson(log^{-1}(\beta_0 + \beta_1 * relief))\)

bird.poisson <- glm(Y ~ relief, data = dat, family = poisson(link = "log"))
summary(bird.poisson) 
## 
## Call:
## glm(formula = Y ~ relief, family = poisson(link = "log"), data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.1980   -2.1407   -0.0021    2.1986   10.7204  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 4.665888   0.021577  216.24   <2e-16 ***
## relief      0.339559   0.008382   40.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2921.1  on 60  degrees of freedom
## Residual deviance: 1299.6  on 59  degrees of freedom
## AIC: 1739.7
## 
## Number of Fisher Scoring iterations: 4

Dispersion

  • Dispersion quantifies how much higher (or lower) the observed variance is relative to the variance expected under the model.

  • For the Poisson distribution the expected value equals the variance, i.e. the variance increases 1:1 with the mean. Hence, dispersion = 1.

  • In other words, when we fit a Poisson model, we expect the variance to increase with the mean by a factor of 1.

  • If the variance increases at a lower rate, the data are underdispersed.

  • If the variance increases at a higher rate, the data are overdispersed.

Consequences of overdispersion

With a correct mean model we still obtain consistent parameter estimates, but overdispersion leads to:

  • underestimation of the standard errors of the parameter estimates.

  • selection of overly complex models.

Overdispersion

To get a rough estimate of overdispersion, divide the residual deviance by the residual degrees of freedom. The ratio should be close to 1; values much larger (> 2) indicate overdispersion, values clearly below 1 (< 0.6) indicate underdispersion.

## Residual deviance: 1299.6  on 59  degrees of freedom
summary(bird.poisson)$deviance / summary(bird.poisson)$df.residual
## [1] 22.02709
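
An alternative estimate uses the Pearson residuals; it corresponds to the dispersion parameter reported by the quasi-Poisson fit below:

# Dispersion estimated from Pearson residuals
sum(residuals(bird.poisson, type = "pearson")^2) / df.residual(bird.poisson)
## approx. 21.28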

Our model appears to be overdispersed, indicating:

  1. bad model fit (omitted terms/variables, incorrect relationship/link, outliers)

  2. the variation of Y is greater than predicted by the model

Remedy:

  1. Change distribution/link-function

  2. Change optimization (quasi-likelihood)

quasi-Poisson model

For a quasi-Poisson regression, the variance is assumed to be a linear function of the mean; for negative binomial regression, it is a quadratic function.

bird.quasip <- glm(Y ~ relief, data = dat, family = quasipoisson)
summary(bird.quasip) 
## 
## Call:
## glm(formula = Y ~ relief, family = quasipoisson, data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.1980   -2.1407   -0.0021    2.1986   10.7204  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.66589    0.09953  46.878  < 2e-16 ***
## relief       0.33956    0.03866   8.782 2.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 21.27853)
## 
##     Null deviance: 2921.1  on 60  degrees of freedom
## Residual deviance: 1299.6  on 59  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
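
As noted above, negative binomial regression assumes a quadratic mean-variance relationship; a sketch using glm.nb() from MASS (already loaded):

# Negative binomial alternative to the quasi-Poisson model
bird.nb <- glm.nb(Y ~ relief, data = dat)
summary(bird.nb)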

Parameter estimates

The coefficients of the Poisson model are easy to interpret, as they act multiplicatively when back-transformed to the response scale. That is, with a one unit increase in x, the expected count is multiplied by the exponentiated coefficient. For our example, we expect a 1.4-fold (or 40%) increase in mean bird counts with a one unit increase in relief. You can use confint() to compute confidence intervals for the parameters.

# Poisson
coeffs <- exp(coef(bird.poisson))
coeffs_ci <- exp(confint(bird.poisson))
cbind(coeffs, coeffs_ci) 
##                 coeffs      2.5 %     97.5 %
## (Intercept) 106.259866 101.846790 110.836066
## relief        1.404328   1.381446   1.427589
# quasi-Poisson
coeffs <- exp(coef(bird.quasip))
coeffs_ci <- exp(confint(bird.quasip))
cbind(coeffs, coeffs_ci) 
##                 coeffs     2.5 %     97.5 %
## (Intercept) 106.259866 87.187307 128.808562
## relief        1.404328  1.301813   1.514927

\(R^2_{pseudo}\)

Extending the Linear Model with R, Julian J. Faraway (p. 59):

\(R^2_{pseudo} = 1-\frac{Residual\ deviance}{Null\ deviance}\)

##     Null deviance: 2921.1  on 60  degrees of freedom
## Residual deviance: 1299.6  on 59  degrees of freedom

1 - 1299.6/2921.1
## [1] 0.5550991
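
The same value can be computed directly from the model object:

# Pseudo R^2 from the stored deviances
1 - bird.poisson$deviance / bird.poisson$null.deviance
## approx. 0.555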

Fit model with effects package

efct <- effect("relief", mod = bird.quasip, xlevels = list(relief = seq(0, 4, length.out = 100)))
efct <- as.data.frame(efct)

p <- ggplot() +
  geom_point(data = dat, aes(x = relief, y = Y)) +
  geom_ribbon(data = efct, aes(x = relief, ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(data = efct, aes(x = relief, y = fit))
print(p)

Plot fitted values with ggplot

library(gridExtra)
plot1 <- ggplot(dat, aes(x = relief, y = Y)) + 
                geom_point() + 
                geom_smooth(method = "glm", method.args=list(family = "poisson"), col = "red") 

plot2 <- ggplot(dat, aes(x = relief, y = Y)) + 
                geom_point() + 
                geom_smooth(method = "glm", method.args=list(family = "quasipoisson"), col = "red")

grid.arrange(plot1, plot2, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'


Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.