# Recap GLMs

A generalization of linear models that allows us to choose freely

1. which link function relates the linear predictor to the expected value (link function), and
2. which data generating distribution is assumed (data distribution).

We covered two models:

• Log link and Gaussian distribution (ozone~temp)
• Logit link and binomial distribution (mortality~concentration)

In R, GLMs are constructed using the glm() function.

# The glm Function

Generalized linear models can be fitted in R using the glm function, which is similar to the lm function for fitting linear models. The arguments to a glm call are as follows:

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = glm.control(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)

The family argument takes (the name of) a family function which specifies

• the link function
• the variance function

The exponential family functions available in R are

• binomial(link = "logit")
• gaussian(link = "identity")
• Gamma(link = "inverse")
• inverse.gaussian(link = "1/mu2")
• poisson(link = "log")
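A family object is more than a label: it bundles the link function, its inverse, and the variance function that glm() uses during fitting. A quick sketch inspecting binomial():

```r
fam <- binomial(link = "logit")

fam$linkfun(0.5)   # link function: log(0.5 / (1 - 0.5)) = 0
fam$linkinv(0)     # inverse link: 1 / (1 + exp(-0)) = 0.5
fam$variance(0.5)  # variance function: mu * (1 - mu) = 0.25
```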

# Today’s session

GLMs to analyze:

• Binary response data with logit link and binomial distribution (known as logistic regression)
• Count data with log link and Poisson distribution

# Logistic regression

Today we use a dataset where the response can take only two values. The data indicate whether trees were infested by a bark beetle or not, depending on the temperature anomaly (z-transformed temperatures, with values > 0 indicating hotter than average):

dat <- read.csv("Data/infestation.csv")
head(dat)
##   infested temp_anomaly
## 1        1   -1.0785729
## 2        1    1.2728951
## 3        1    1.8699789
## 4        1    0.6327722
## 5        0    0.3342014
## 6        1    0.8407184
ggplot(dat, aes(x = temp_anomaly, y = infested)) +
geom_point()

Similar problems include: dead/alive, presence/absence, conversion/conservation, etc.

We want to model the probability of infestation given a certain temperature anomaly. The logit link seems well suited, as its inverse is bounded between 0 and 1 and allows for a quick transition between the two extremes. The regression can thus be written as:

$$Pr(infestation = 1) = logit^{-1}(\beta_0 + \beta_1 * temp) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * temp)}}$$
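The right-hand side can be checked numerically: a hand-written inverse logit agrees with R's built-in plogis() (the logistic CDF), which we will use for back-transformation later on:

```r
inv_logit <- function(x) 1 / (1 + exp(-x))

x <- c(-2, 0, 2)
inv_logit(x)  # probabilities bounded between 0 and 1
plogis(x)     # identical: plogis() is the inverse logit
```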

In our case the data are binomially distributed. Hence, we can relate the probability of infestation, modelled through the logit link, to our data using a binomial distribution with one trial, which translates into the Bernoulli distribution (a special case of the binomial distribution with only two outcomes):

$$infestation \sim Bernoulli(logit^{-1}(\beta_0 + \beta_1 * temp))$$

This model is called binomial logistic regression; there are also multinomial logistic regression (three or more outcomes) and ordinal logistic regression (ordered outcomes).
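To see that such a model is recoverable from data, here is a small simulation sketch (made-up coefficients 0.8 and 2.2, not the course data): generate Bernoulli outcomes with rbinom() and refit with glm().

```r
set.seed(1)
n <- 500
temp <- rnorm(n)                    # standardized temperature anomaly
p <- plogis(0.8 + 2.2 * temp)       # inverse-logit probabilities
y <- rbinom(n, size = 1, prob = p)  # Bernoulli draws (binomial with 1 trial)

fit <- glm(y ~ temp, family = binomial(link = "logit"))
coef(fit)  # estimates lie near the true values 0.8 and 2.2
```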

Fitting a logistic regression model to our data:

infested.bin <- glm(infested ~ temp_anomaly, data = dat, family = binomial(link = "logit"))

And plotting the regression line:

prediction <- data.frame(temp_anomaly = seq(-2, 2, length.out = 100))
pred <- predict(infested.bin, newdata = prediction, se.fit = TRUE)  # predictions on the link scale
prediction$fitted_link <- pred$fit
prediction$lower_link <- prediction$fitted_link - 2 * pred$se.fit
prediction$upper_link <- prediction$fitted_link + 2 * pred$se.fit
prediction$fitted <- plogis(prediction$fitted_link)
prediction$lower <- plogis(prediction$lower_link)
prediction$upper <- plogis(prediction$upper_link)

And add them to the plot:

ggplot() +
geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
geom_ribbon(data = prediction, aes(x = temp_anomaly, ymin = lower, ymax = upper), alpha = 0.3) +
geom_line(data = prediction, aes(x = temp_anomaly, y = fitted))

## effects package

Alternatively, we can use the effect() function, which does the transformation automatically:

library(effects)
efct <- effect("temp_anomaly", mod = infested.bin, xlevels = list(temp_anomaly = seq(-2, 2, length.out = 100)))
efct <- as.data.frame(efct)

ggplot() +
geom_point(data = dat, aes(x = temp_anomaly, y = infested)) +
geom_ribbon(data = efct, aes(x = temp_anomaly, ymin = lower, ymax = upper), alpha = 0.3) +
geom_line(data = efct, aes(x = temp_anomaly, y = fit))

## dose.p()

We can estimate the temperature anomaly at which the infestation probability is 50% using dose.p() from the MASS package.

library(MASS)
temp_anomaly_p50 <- dose.p(infested.bin, p = 0.5)
temp_anomaly_p50
##                Dose        SE
## p = 0.5: -0.3744093 0.2372779

## Interpretation

It is not straightforward to interpret coefficients in logistic regression, because the change in y given a one unit change in x is non-linear (though linear on the link scale!). However, there are - besides plotting - some ways of making the regression coefficients interpretable.

The divide by 4 rule by Gelman and Hill (2007) uses the fact that the logistic curve is steepest at its center (which is zero given scaled predictors). As a rough estimate, we can take logistic regression coefficients and divide them by 4 to get an upper bound of the predictive difference corresponding to a unit difference in x. This upper bound is a good approximation near the midpoint of the logistic curve, where probabilities are close to 0.5.
Gelman and Hill (2007): Data Analysis Using Regression and Multilevel/Hierarchical Models (page 82).

In our example, the regression coefficient was 2.18. Dividing 2.18 by 4 yields 0.54. We can thus say that with an increase in temperature anomaly by one standard deviation, there is an expected increase in infestation probability of no more than 54 percentage points.

summary(infested.bin)
## 
## Call:
## glm(formula = infested ~ temp_anomaly, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8638  -0.6022   0.1966   0.4441   1.8595  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    0.8154     0.5766   1.414  0.15734   
## temp_anomaly   2.1778     0.7809   2.789  0.00529 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40.381  on 29  degrees of freedom
## Residual deviance: 23.488  on 28  degrees of freedom
## AIC: 27.488
## 
## Number of Fisher Scoring iterations: 5

## Odds ratios

Another way to interpret logistic regression coefficients is in terms of odds ratios. If a success has the probability $$p$$ and failure the probability $$(1 - p)$$, then $$p/(1 - p)$$ is called the odds. Thus, the odds of success are defined as the ratio of the probability of success over the probability of failure.

• An odds of 1 is equivalent to a probability of 0.5, that is, equally likely outcomes (1:1).
• An odds of 0.5 is equivalent to a probability of 0.33, indicating a 1:2 chance of success.
• An odds of 2 is equivalent to a probability of 0.66, indicating a 2:1 chance of success.

• The ratio of two odds is called an odds ratio: $$\frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}$$.
• The odds ratio indicates the preference of outcome A over outcome B.

Example:

• We observed a plant species in 6 out of 10 plots in landscape A and in 4 out of 10 plots in landscape B.
• This gives us probabilities of 0.6 and 0.4.
• The odds ratio of observing the plant in landscape A over observing it in landscape B is (0.6/0.4) / (0.4/0.6) = 2.25.
• Hence, there is a 2.25 times higher chance of observing the plant in landscape A compared to landscape B.

• The logistic regression coefficients give the change in log odds (recall the link function $$ln(p/(1-p))$$).
• When exponentiated, logistic regression coefficients can be interpreted as odds ratios:

$$OR = \frac{odds(x + 1)}{odds(x)} = \frac{e^{\beta_0 + \beta_1 * (x + 1)}}{e^{\beta_0 + \beta_1 * x}} = e^{\beta_1}$$

• For continuous predictors, the coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
• For categorical predictors, the coefficient estimate describes the change in the log odds for each level compared to the base level.

## Infestation example

exp(coef(infested.bin))
## temp_anomaly 
##     8.826474

The odds ratio indicates that with a one SD change in temperature anomaly there is an 8.83 times higher chance of infestation over no infestation.

# Poisson regression

The following data set shows bird counts over a topographic relief gradient:

dat <- read.table("Data/bird_counts.txt", sep = ";", header = TRUE)
head(dat)
##   relief   Y
## 1  0.576 129
## 2  0.576 166
## 3  0.576 137
## 4  0.576 105
## 5  0.576 155
## 6  0.576 178

ggplot(dat, aes(x = relief, y = Y)) +
geom_point()

ggplot(dat, aes(x = relief, y = Y)) +
scale_y_continuous(trans = 'log') +
geom_point()

The data is bounded to positive values; a log link is thus recommended:

$$counts_{expected} = log^{-1}(\beta_0 + \beta_1 * relief)$$

As the data can only take natural numbers (0, 1, 2, …), we need a discrete data distribution.
For count models, the Poisson distribution is a good choice, yielding the final model of:

$$counts \sim Poisson(log^{-1}(\beta_0 + \beta_1 * relief))$$

bird.poisson <- glm(Y ~ relief, data = dat, family = poisson(link = "log"))
summary(bird.poisson)
## 
## Call:
## glm(formula = Y ~ relief, family = poisson(link = "log"), data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.1980   -2.1407   -0.0021    2.1986   10.7204  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 4.665888   0.021577  216.24   <2e-16 ***
## relief      0.339559   0.008382   40.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2921.1  on 60  degrees of freedom
## Residual deviance: 1299.6  on 59  degrees of freedom
## AIC: 1739.7
## 
## Number of Fisher Scoring iterations: 4

## Dispersion

• Dispersion quantifies how much higher (or lower) the observed variance is relative to the variance expected under the model.
• For the Poisson distribution the expected value is equal to the variance, i.e. the variance increases 1:1 with the mean. Hence, dispersion = 1.
• In other words, when we fit a Poisson model, we expect the variance to increase with the mean by a factor of 1.
• If the variance increases at a lower rate, the data are underdispersed.
• If the variance increases at a higher rate, the data are overdispersed.

## Consequences of overdispersion

With a correct mean model we still have consistent estimates of the parameters, but overdispersion leads to:

• underestimation of the standard errors of the parameter estimates.
• selection of overly complex models.

## Overdispersion

To get a rough estimate of overdispersion, divide the residual deviance by the residual degrees of freedom. The ratio should be close to 1. Values much larger (> 2) indicate overdispersion (values below 0.6 indicate underdispersion).
## Residual deviance: 1299.6 on 59 degrees of freedom

summary(bird.poisson)$deviance / summary(bird.poisson)$df.residual
##  22.02709
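A quick simulation (illustrative values, not the bird data) confirms how this ratio behaves: it stays near 1 for genuinely Poisson counts and rises far above 1 for counts with extra variance.

```r
set.seed(42)
n <- 200
mu <- 20

y_pois <- rpois(n, lambda = mu)           # Poisson: variance equals mean
fit_pois <- glm(y_pois ~ 1, family = poisson)
fit_pois$deviance / fit_pois$df.residual  # close to 1

y_over <- rnbinom(n, mu = mu, size = 2)   # extra variance: mu + mu^2/2
fit_over <- glm(y_over ~ 1, family = poisson)
fit_over$deviance / fit_over$df.residual  # much larger than 1
```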

Our model appears to be overdispersed, indicating that the variation of Y is greater than predicted by the model.

Remedy:

1. Change the assumed data distribution (e.g. negative binomial).
2. Change the optimization (quasi-likelihood).

# quasi-Poisson model

For a quasi-Poisson regression, the variance is assumed to be a linear function of the mean; for negative binomial regression, it is a quadratic function.
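Writing the three mean-variance assumptions out numerically makes the difference concrete (the dispersion phi = 21.3 and the negative binomial size theta = 2 are illustrative values, not fitted ones):

```r
mu <- c(1, 10, 100)

var_poisson <- mu                # Poisson: variance = mean
phi <- 21.3                      # illustrative dispersion parameter
var_quasip <- phi * mu           # quasi-Poisson: linear in the mean
theta <- 2                       # illustrative negative binomial size
var_negbin <- mu + mu^2 / theta  # negative binomial: quadratic in the mean

rbind(var_poisson, var_quasip, var_negbin)
```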

bird.quasip <- glm(Y ~ relief, data = dat, family = quasipoisson)
summary(bird.quasip) 
##
## Call:
## glm(formula = Y ~ relief, family = quasipoisson, data = dat)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -11.1980   -2.1407   -0.0021    2.1986   10.7204
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.66589    0.09953  46.878  < 2e-16 ***
## relief       0.33956    0.03866   8.782 2.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 21.27853)
##
##     Null deviance: 2921.1  on 60  degrees of freedom
## Residual deviance: 1299.6  on 59  degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4

# Parameter estimates

The coefficients of the Poisson model are easy to interpret, as they become multiplicative when back-transformed to the response scale. That is, with a one unit increase in x, the expected count is multiplied by the exponentiated coefficient. For our example, we expect a 1.4 times (or 40%) increase in mean bird counts with a one unit increase in relief. You can use confint() to compute confidence intervals for the parameters.

# Poisson
coeffs <- exp(coef(bird.poisson))
coeffs_ci <- exp(confint(bird.poisson))
cbind(coeffs, coeffs_ci) 
##                 coeffs      2.5 %     97.5 %
## (Intercept) 106.259866 101.846790 110.836066
## relief        1.404328   1.381446   1.427589
# quasi-Poisson
coeffs <- exp(coef(bird.quasip))
coeffs_ci <- exp(confint(bird.quasip))
cbind(coeffs, coeffs_ci) 
##                 coeffs     2.5 %     97.5 %
## (Intercept) 106.259866 87.187307 128.808562
## relief        1.404328  1.301813   1.514927
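The multiplicative reading can be verified by hand with the (rounded) Poisson estimates from above:

```r
b0 <- 4.666  # intercept, rounded from the Poisson summary
b1 <- 0.340  # relief coefficient, rounded

mu_at <- function(x) exp(b0 + b1 * x)  # expected count at relief = x

mu_at(1) / mu_at(0)  # ratio for a one unit increase in relief
exp(b1)              # identical: additive on the link scale = multiplicative on the response scale
```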

# $$R^2_{pseudo}$$

Extending the Linear Model with R, Julian J. Faraway (p. 59):

$$R^2_{pseudo} = 1-\frac{Residual\ deviance}{Null\ deviance}$$

##     Null deviance: 2921.1  on 60  degrees of freedom
## Residual deviance: 1299.6  on 59  degrees of freedom

1 - 1299.6/2921.1
##  0.5550991
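The same quantity can be read directly off a fitted glm object; a sketch on simulated counts (made-up coefficients, not the bird data):

```r
set.seed(7)
x <- runif(100, 0, 4)
y <- rpois(100, lambda = exp(4.7 + 0.34 * x))  # simulated Poisson counts

fit <- glm(y ~ x, family = poisson)
1 - fit$deviance / fit$null.deviance  # pseudo R-squared, between 0 and 1
```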

# Fit model with effects package

efct <- effect("relief", mod = bird.quasip, xlevels = list(relief = seq(0, 4, length.out = 100)))
efct <- as.data.frame(efct)

p <- ggplot() +
geom_point(data = dat, aes(x = relief, y = Y)) +
geom_ribbon(data = efct, aes(x = relief, ymin = lower, ymax = upper), alpha = 0.3) +
geom_line(data = efct, aes(x = relief, y = fit))
print(p)

# Plot fitted values with ggplot

library(gridExtra)
plot1 <- ggplot(dat, aes(x = relief, y = Y)) +
geom_point() +
geom_smooth(method = "glm", method.args=list(family = "poisson"), col = "red")

plot2 <- ggplot(dat, aes(x = relief, y = Y)) +
geom_point() +
geom_smooth(method = "glm", method.args=list(family = "quasipoisson"), col = "red")

grid.arrange(plot1, plot2, ncol = 2)
## geom_smooth() using formula 'y ~ x'
## geom_smooth() using formula 'y ~ x'  