# Recap last session

• ‘Support Vector Machine’ (SVM) Classifier (library(e1071); svm())
• Train and Test data (sample(); subset())
• Feature preparation (scale())
• Assessing model performance (predict(); table(pred, test_y))
• Hyperparameter tuning (tune())

# Today’s session

• From LM to GLM
• The GLM in R
• Logistic regression
• Binomial model

# Generalized linear models

In the linear model, the mean response $$E[y_i]$$ is modeled by a linear function of the explanatory variables $$x_{ik}$$; the observed response additionally includes an error term $$\epsilon_i$$,

\begin{aligned}y_i = E[y_i] + \epsilon_i = \beta_{0} + \beta_{1} x_{i1} + ... + \beta_{k} x_{ik} + \epsilon_i\end{aligned}
where $$i$$ indexes the observations and $$k$$ is the number of explanatory variables.

We assume that the errors $$\epsilon_i$$ are independent and identically distributed such that

\begin{aligned} E(\epsilon_i) = 0 \\ Var(\epsilon_i) = \sigma^2 \end{aligned}

The random error term can also be written as

\begin{aligned} \epsilon_i \sim N(0, \sigma^2)\end{aligned}

This can also be written in terms of $$y_i$$, with a linear predictor as a mean response and variance $$\sigma^2$$.

\begin{aligned} y_i \sim N(E[y_i], \sigma^2)\end{aligned}

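This generative view can be checked directly in R. The sketch below (simulated data with hypothetical coefficients $$\beta_0 = 2$$, $$\beta_1 = 0.5$$, $$\sigma = 1$$) draws $$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$$ and recovers the parameters with lm():

```r
# Simulate data following the linear-model assumptions:
# y_i ~ N(beta0 + beta1 * x_i, sigma^2)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, mean = 0, sd = 1)  # epsilon_i ~ N(0, 1)

fit <- lm(y ~ x)
coef(fit)  # estimates should be close to the true values 2 and 0.5
```
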
# The glm Function

Generalized linear models can be fitted in R using the glm function, which is similar to the lm function for fitting linear models. The arguments to a glm call are as follows

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = glm.control(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)

## Family Argument

The family argument takes (the name of) a family function which specifies

• the link function
• the variance function

The exponential family functions available in R are

• binomial(link = "logit")
• gaussian(link = "identity")
• Gamma(link = "inverse")
• inverse.gaussian(link = "1/mu^2")
• poisson(link = "log")

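Each family function returns an object bundling the link function, its inverse, and the variance function; these components can be inspected directly. A quick sketch using the binomial family:

```r
fam <- binomial(link = "logit")

fam$linkfun(0.5)   # link (logit): log(0.5 / 0.5) = 0
fam$linkinv(0)     # inverse link: 1 / (1 + exp(-0)) = 0.5
fam$variance(0.5)  # variance function: mu * (1 - mu) = 0.25
```
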
## Extractor functions

The glm function returns an object of class c("glm", "lm"). There are several glm or lm methods available for accessing/displaying components of the glm object, including:

• residuals()
• fitted()
• predict()
• coef()
• deviance()
• formula()
• summary()

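As a quick illustration (using the built-in mtcars data rather than a course data set), the extractor functions work on any fitted glm object:

```r
# Logistic regression: transmission type (am, 0/1) vs. weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)          # estimated coefficients
deviance(fit)      # residual deviance
head(fitted(fit))  # fitted probabilities (on the response scale)
formula(fit)       # model formula
```
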
# From LM to GLM

Remember the airquality regression problem:

\begin{aligned}y_i = \beta_{0} + \beta_{1} Temp + \beta_2 Wind + \beta_3 Temp:Wind + \epsilon_i\end{aligned}

We found that the residuals were not well distributed, because of ozone concentration being bounded to positive values. We transformed the response to overcome this problem:

\begin{aligned}log(y_i) = \beta_{0} + \beta_{1} Temp + \beta_2 Wind + \beta_3 Temp:Wind + \epsilon_i\end{aligned}

However, transforming the response can make the interpretation of the model challenging, and the back-transformation of standard errors and confidence intervals can be difficult.

The idea behind GLMs is that we transform our mean response via a so-called link-function $$g()$$. Compare with the equation above. This is different to transforming individual $$y_i$$’s.

\begin{aligned}g(E[y_i]) = \beta_{0} + \beta_{1} Temp + \beta_2 Wind + \beta_3 Temp:Wind\end{aligned}

Note that there is no additive error term on this scale: in a GLM, the randomness enters through the assumed data distribution, not through $$\epsilon_i$$. The transformed mean response is often written as $$\eta_i$$.

$$\eta_i = g(E[y_i])$$

The inverse-link function is used to backtransform the equation to the mean response:

$$E[y_i] = g^{-1}(\eta_i)$$

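For example, with a log link the inverse link is the exponential; applying $$g^{-1}$$ after $$g$$ recovers the mean response. A minimal numeric check:

```r
g     <- log  # link function
g_inv <- exp  # inverse link

mu  <- 3.7          # some mean response E[y_i] (arbitrary example value)
eta <- g(mu)        # value on the linear-predictor scale
all.equal(g_inv(eta), mu)  # TRUE: back-transformation recovers E[y_i]
```
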
For a simple linear regression the link function is the so-called identity link, that is $$g(E[y_i]) = E[y_i] = \eta_i$$.

Further, we also have to assume a data distribution, which for linear regression is Gaussian with zero mean and a variance of $$\sigma^2$$. The linear model can thus be written as:

$$y_i \sim N(E[y_i], \sigma^2)$$

Running a simple linear regression using glm() and comparing to lm() produces identical results:

fit.lm <- lm(Ozone ~ Temp * Wind, data = na.omit(airquality))

fit.glm <- glm(Ozone ~ Temp * Wind, data = na.omit(airquality), family = gaussian(link = "identity"))
coef(fit.lm)
## (Intercept)        Temp        Wind   Temp:Wind
## -239.891827    4.000528   13.597458   -0.217276
coef(fit.glm)
## (Intercept)        Temp        Wind   Temp:Wind
## -239.891827    4.000528   13.597458   -0.217276

Now we want to account for the non-linear relationship. Instead of transforming the response $$log(y_i)$$, we now define a log link-function linking the linear model to the mean response $$log(E[y_i])$$:

fit.glm <- glm(Ozone ~ Temp, data = na.omit(airquality), family = gaussian(link = "log"))

The link function is $$g(E[y_i]) = log(E[y_i]) = \eta_i$$ and the inverse is $$g^{-1}(\eta_i) = e^{\eta_i}$$. The data distribution is still assumed to be normal and the model can also be written as:

$$y_i \sim N(e^{\eta_i}, \sigma^2)$$

Notice the difference to a linear model with log-transformed response:

$$y_i \sim e^{N(\eta_i, \sigma^2)}$$

The models are not identical. Most noticeable are the larger and increasing confidence bands of the simple linear model, which come from transforming and back-transforming the individual observations including their variance.

Advantages of the GLM approach:

• Fitted values (and confidence intervals) are calculated in the original data space, not in a transformed data space
• We can select any link function fitting many data problems (bounded to positive, bounded to [0,1], …)
• We can specify the data distribution (think of counts, success/failure, rates, …)

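The two prediction scales can be seen with predict(): type = "link" returns $$\eta_i$$, while type = "response" returns $$g^{-1}(\eta_i)$$. A sketch using the log-link airquality model from above:

```r
aq <- na.omit(airquality)  # built-in data set
fit.glm <- glm(Ozone ~ Temp, data = aq, family = gaussian(link = "log"))

eta <- predict(fit.glm, type = "link")      # linear predictor eta_i
mu  <- predict(fit.glm, type = "response")  # mean response E[y_i]
all.equal(unname(mu), unname(exp(eta)))     # TRUE: response = exp(link)
```
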
# Logistic Regression

We make use of a new data set:

dat <- read.table("Data/logistic.txt", sep = "\t", header = TRUE)

head(dat)
##   logdose  n dead product
## 1     0.1 18    0       A
## 2     0.2 16    4       A
## 3     0.3 21    7       A
## 4     0.4 19    6       A
## 5     0.5 16    5       A
## 6     0.6 20   10       A
ggplot(dat, aes(x = logdose, y = dead / n, col = product)) +
geom_point() +
scale_color_brewer(palette = "Set1") # See colorbrewer.org for palettes

We now aim to model the mortality rate (bounded between 0 and 1) in response to the dose of a toxin. To do so, we combine the dead and surviving counts using cbind():

fit1 <- glm(cbind(dead, n - dead) ~ logdose, data = dat, family = binomial(link = "logit"))
summary(fit1)
##
## Call:
## glm(formula = cbind(dead, n - dead) ~ logdose, family = binomial(link = "logit"),
##     data = dat)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -3.14262  -0.56909   0.06357   0.83762   2.29354
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.5980     0.1555  -10.27   <2e-16 ***
## logdose       2.3179     0.1699   13.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 311.241  on 50  degrees of freedom
## Residual deviance:  72.387  on 49  degrees of freedom
## AIC: 223.39
##
## Number of Fisher Scoring iterations: 4

As link function we choose the logit link (keeping the mean response bounded between 0 and 1), and we assume the data to be binomially distributed (k successes given n trials).

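In R, the logit link and its inverse are available as qlogis() and plogis(); the inverse logit maps any value of the linear predictor back into [0, 1]:

```r
qlogis(0.5)           # logit: log(0.5 / 0.5) = 0
plogis(0)             # inverse logit: 1 / (1 + exp(-0)) = 0.5
plogis(c(-Inf, Inf))  # bounded: 0 and 1
```
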
We can estimate the goodness of fit using the pseudo-$$R^2$$:

pseudo_r2 <- 1 - fit1$deviance / fit1$null.deviance
pseudo_r2
## [1] 0.7674258

We can now fit the model and overlay the fitted line on the scatterplot:

dat$fitted1 <- predict(fit1, type = "response") # Alternatively type = "link"

ggplot(dat, aes(x = logdose, y = dead / n)) +
geom_point() +
geom_line(data = dat, aes(x = logdose, y = fitted1))

Let us see if the mortality varies by toxin:

fit2 <- glm(cbind(dead, n - dead) ~ logdose * product, data = dat, family = binomial(link = "logit"))
summary(fit2)
##
## Call:
## glm(formula = cbind(dead, n - dead) ~ logdose * product, family = binomial(link = "logit"),
##     data = dat)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.39800  -0.60734   0.09695   0.65920   1.69300
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -2.0410     0.3034  -6.728 1.72e-11 ***
## logdose            2.8768     0.3432   8.383  < 2e-16 ***
## productB           1.2920     0.3845   3.360  0.00078 ***
## productC          -0.3079     0.4308  -0.715  0.47469
## logdose:productB  -1.6457     0.4220  -3.899 9.65e-05 ***
## logdose:productC   0.4318     0.4898   0.882  0.37793
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 311.241  on 50  degrees of freedom
## Residual deviance:  42.305  on 45  degrees of freedom
## AIC: 201.31
##
## Number of Fisher Scoring iterations: 4

We can estimate the goodness of fit using the pseudo-$$R^2$$:

pseudo_r2 <- 1 - fit2$deviance / fit2$null.deviance
pseudo_r2
## [1] 0.8640758

We can fit the model for different product levels and overlay this on the scatterplot:

library(effects)
dat.effects <- effect("logdose:product", mod = fit2, xlevels = list(logdose = seq(0, 2, length.out = 100)))
dat.effects <- as.data.frame(dat.effects)

ggplot() +
geom_point(data = dat, aes(x = logdose, y = dead / n, col = product)) +
geom_ribbon(data = dat.effects, aes(x = logdose, ymin = lower, ymax = upper, fill = product), alpha = 0.2) +
geom_line(data = dat.effects, aes(x = logdose, y = fit, col = product)) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1")

# Residual analysis

Several kinds of residuals can be defined for GLMs:

• response: $$r_i = y_i - \hat{\mu}_i$$
• Pearson ($$r_i$$ divided by the square root of the variance function)
• deviance

These definitions are all equivalent for Normal models. Deviance residuals are the default used in R, since they reflect the same criterion as used in the fitting.

We can check the residuals using the residuals() or rstandard() functions.

Model fit1: formula = cbind(dead, n - dead) ~ logdose

plot(fit1$fitted.values, rstandard(fit1, type = 'pearson'))
abline(h = 0)

Model fit2: formula = cbind(dead, n - dead) ~ logdose * product

plot(fit2$fitted.values, rstandard(fit2, type = 'pearson'))
abline(h = 0)
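The relationship between response and Pearson residuals can also be verified by hand. A sketch using the built-in mtcars data (since the toxin data file is not bundled): for a binomial model with a 0/1 response, the Pearson residual is the response residual divided by $$\sqrt{\hat{\mu}_i(1 - \hat{\mu}_i)}$$.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

r_resp <- residuals(fit, type = "response")  # y_i - mu_hat_i
mu     <- fitted(fit)
r_pear <- r_resp / sqrt(mu * (1 - mu))       # divide by sqrt of the variance function

all.equal(r_pear, residuals(fit, type = "pearson"))  # TRUE
```
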