- 'Support Vector Machine' (SVM) classifier (`library(e1071); svm()`)
- Train and test data (`sample(); subset()`)
- Feature preparation (`scale()`)
- Assessing model performance (`predict(); table(pred, test_y)`)
- Hyperparameter tuning (`tune()`)
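
As a refresher, the workflow above can be sketched end-to-end on the built-in `iris` data (a hypothetical stand-in for the course data; the split size is arbitrary):

```r
# Minimal SVM workflow sketch on iris (assumed example data)
library(e1071)

set.seed(42)
idx   <- sample(seq_len(nrow(iris)), size = 100) # sample() for the train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

train[, 1:4] <- scale(train[, 1:4]) # scale() for feature preparation
test[, 1:4]  <- scale(test[, 1:4])

fit  <- svm(Species ~ ., data = train)  # train the classifier
pred <- predict(fit, newdata = test)    # predict on held-out data
table(pred, test$Species)               # confusion matrix
```

`tune()` from the same package would wrap this fit in a grid search over hyperparameters such as `cost` and `gamma`.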

- From LM to GLM
- The GLM in R
- Logistic regression
- Binomial model

In the linear model, the response \(y_i\) is modeled as a linear function of explanatory variables \(x_{ik}\) plus an error term \(\epsilon_i\),

\[\begin{aligned}y_i = \beta_{0} + \beta_{1} x_{i1} + ... + \beta_{k} x_{ik} + \epsilon_i\end{aligned}\]

where \(i\) is the observation and \(k\) is the number of explanatory variables.

We assume that the errors \(\epsilon_i\) are independent and identically distributed such that

\[\begin{aligned} E(\epsilon_i) = 0 \\ Var(\epsilon_i) = \sigma^2 \end{aligned}\]

Assuming normally distributed errors, the random error term can also be written as

\[\begin{aligned} \epsilon_i \sim N(0, \sigma^2)\end{aligned}\]

Equivalently, the model can be written in terms of the distribution of \(y_i\), with the linear predictor as the mean response and variance \(\sigma^2\):

\[\begin{aligned} y_i \sim N(E[y_i], \sigma^2)\end{aligned}\]
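
To make this concrete, a small simulation (with assumed values \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\)) draws data from exactly this model and recovers the coefficients with `lm()`:

```r
# Simulate y_i ~ N(beta0 + beta1 * x_i, sigma^2) and refit
set.seed(1)
n <- 1000
x <- runif(n, 0, 10)
y <- rnorm(n, mean = 2 + 0.5 * x, sd = 1) # E[y_i] = 2 + 0.5 x_i, sigma = 1
fit <- lm(y ~ x)
coef(fit) # estimates close to the true values (2, 0.5)
```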

Generalized linear models can be fitted in `R` using the `glm()` function, which is similar to the `lm()` function for fitting linear models. The arguments to a `glm()` call are as follows:

```
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,
    etastart, mustart, offset, control = glm.control(...), model = TRUE,
    method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)
```

The `family` argument takes (the name of) a family function, which specifies

- the link function
- the variance function

The exponential-family functions available in R are:

- `binomial(link = "logit")`
- `gaussian(link = "identity")`
- `Gamma(link = "inverse")`
- `inverse.gaussian(link = "1/mu^2")`
- `poisson(link = "log")`
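
Each of these family functions returns an object that bundles the link, its inverse, and the variance function, which can be inspected directly:

```r
# A family object exposes its link and variance components
fam <- binomial(link = "logit")
fam$linkfun(0.5)  # logit(0.5) = 0
fam$linkinv(0)    # inverse logit of 0 = 0.5
fam$variance(0.5) # binomial variance function mu * (1 - mu) = 0.25
```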

The `glm` function returns an object of class `c("glm", "lm")`. There are several `glm` or `lm` methods available for accessing/displaying components of the fitted object, including:

- `residuals()`
- `fitted()`
- `predict()`
- `coef()`
- `deviance()`
- `formula()`
- `summary()`
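
A quick tour of these accessors on a fitted gaussian GLM (using the built-in `mtcars` data as a stand-in):

```r
# Fit a gaussian glm and query it with the accessor methods
fit <- glm(mpg ~ wt, data = mtcars, family = gaussian)
coef(fit)                                   # estimated coefficients
head(fitted(fit))                           # fitted mean responses
head(residuals(fit))                        # residuals
deviance(fit)                               # residual deviance (the RSS here)
formula(fit)                                # mpg ~ wt
predict(fit, newdata = data.frame(wt = 3))  # prediction for a new observation
```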

Remember the airquality regression problem:

\[\begin{aligned}y_i = \beta_{0} + \beta_{1} Temp + \beta_2 Wind + \beta_3 Temp:Wind + \epsilon_i\end{aligned}\]

We found that the residuals were not well behaved, because the ozone concentration is bounded to positive values. We transformed the response to overcome this problem:

\[\begin{aligned}log(y_i) = \beta_{0} + \beta_{1} Temp + \beta_2 Wind + \beta_3 Temp:Wind + \epsilon_i\end{aligned}\] However, transforming the response makes the interpretation of the model challenging, and back-transforming standard errors and confidence intervals can be difficult.

The idea behind GLMs is that we transform the **mean response** via a so-called *link function* \(g()\). Compare with the equation above: this is different from transforming the individual \(y_i\)'s.

\[\begin{aligned}g(E[y_i]) = \beta_{0} + \beta_{1} Temp + \beta_2 Wind + \beta_3 Temp:Wind\end{aligned}\] Note that there is no error term on the transformed scale; the randomness enters through the data distribution. The transformed mean response is often written as \(\eta_i\).

\(\eta_i = g(E[y_i])\)

The inverse-link function is used to back-transform the equation to the mean response:

\(E[y_i] = g^{-1}(\eta_i)\)

For a simple linear regression, the link function is the so-called *identity* link, \(g(E[y_i]) = E[y_i]\), so that \(\eta_i = E[y_i]\).
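
The \(g()\)/\(g^{-1}()\) pair behind any link name can be inspected with `make.link()`:

```r
# make.link() returns the link function (linkfun) and its inverse (linkinv)
lk <- make.link("log")
lk$linkfun(exp(2)) # g(mu) = log(mu), gives 2
lk$linkinv(2)      # g^{-1}(eta) = exp(eta), gives exp(2)

id <- make.link("identity")
id$linkfun(5)      # the identity link leaves the mean response unchanged
```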

Further, we also have to assume a data distribution, which for linear regression is Gaussian with mean \(E[y_i]\) and variance \(\sigma^2\). The linear model can thus be written as:

\(y_i \sim N(E[y_i], \sigma^2)\)

Running a simple linear regression using `glm()` and comparing it to `lm()` produces identical results:

```
fit.lm <- lm(Ozone ~ Temp * Wind, data = na.omit(airquality))
fit.glm <- glm(Ozone ~ Temp * Wind, data = na.omit(airquality), family = gaussian(link = "identity"))
```

`coef(fit.lm)`

```
## (Intercept)        Temp        Wind   Temp:Wind
## -239.891827    4.000528   13.597458   -0.217276
```

`coef(fit.glm)`

```
## (Intercept)        Temp        Wind   Temp:Wind
## -239.891827    4.000528   13.597458   -0.217276
```
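
The equivalence can also be checked programmatically by refitting both models:

```r
# lm() and glm() with an identity link and gaussian family agree exactly
fit.lm  <- lm(Ozone ~ Temp * Wind, data = na.omit(airquality))
fit.glm <- glm(Ozone ~ Temp * Wind, data = na.omit(airquality),
               family = gaussian(link = "identity"))
all.equal(coef(fit.lm), coef(fit.glm))                     # TRUE
all.equal(unname(fitted(fit.lm)), unname(fitted(fit.glm))) # TRUE
```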

Now we want to account for the non-linear relationship. Instead of transforming the response to \(log(y_i)\), we define a log link function linking the linear predictor to the mean response, \(log(E[y_i])\):

`fit.glm <- glm(Ozone ~ Temp, data = na.omit(airquality), family = gaussian(link = "log"))`

The link function is \(g(E[y_i]) = log(E[y_i]) = \eta_i\) and its inverse is \(g^{-1}(\eta_i) = e^{\eta_i}\). The data distribution is still assumed to be normal, and the model can also be written as:

\(y_i \sim N(e^{\eta_i}, \sigma^2)\)

Notice the difference to a linear model with a log-transformed response:

\(y_i \sim e^{N(\eta_i, \sigma^2)}\)

The models are not identical. Most noticeable are the larger and increasing confidence bands of the simple linear model, which come from transforming and back-transforming the individual observations, including their variance.
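
A sketch of this comparison on the same data (`newdat` is an assumed grid of temperatures): the GLM predicts the mean on the original scale, while naively exponentiating the log-response model's predictions recovers the median rather than the mean:

```r
# Log link vs. log-transformed response on the airquality data
aq <- na.omit(airquality)
fit.glm <- glm(Ozone ~ Temp, data = aq, family = gaussian(link = "log"))
fit.lm  <- lm(log(Ozone) ~ Temp, data = aq)

newdat <- data.frame(Temp = seq(60, 95, by = 5))
mu.glm <- predict(fit.glm, newdata = newdat, type = "response") # inverse link applied once
mu.lm  <- exp(predict(fit.lm, newdata = newdat))                # naive back-transform
cbind(newdat, mu.glm, mu.lm)
```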

- Fitted values (and confidence intervals) are calculated on the original data scale, not on a transformed scale
- We can select a link function to suit many data problems (bounded to positive values, bounded to [0, 1], …)
- We can specify the data distribution (think of counts, success/failure, rates, …)

We make use of a new data set:

```
dat <- read.table("Data/logistic.txt", sep = "\t", header = TRUE)
head(dat)
```

```
##   logdose  n dead product
## 1     0.1 18    0       A
## 2     0.2 16    4       A
## 3     0.3 21    7       A
## 4     0.4 19    6       A
## 5     0.5 16    5       A
## 6     0.6 20   10       A
```

```
ggplot(dat, aes(x = logdose, y = dead / n, col = product)) +
  geom_point() +
  scale_color_brewer(palette = "Set1") # See colorbrewer.org for palettes
```

We now aim to model the mortality rate (bounded between 0 and 1) in response to the dose of a toxin. To do so, we combine the dead and surviving counts using `cbind()`:

```
fit1 <- glm(cbind(dead, n - dead) ~ logdose, data = dat, family = binomial(link = "logit"))
summary(fit1)
```

```
##
## Call:
## glm(formula = cbind(dead, n - dead) ~ logdose, family = binomial(link = "logit"),
##     data = dat)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -3.14262  -0.56909   0.06357   0.83762   2.29354
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.5980     0.1555  -10.27   <2e-16 ***
## logdose       2.3179     0.1699   13.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 311.241  on 50  degrees of freedom
## Residual deviance:  72.387  on 49  degrees of freedom
## AIC: 223.39
##
## Number of Fisher Scoring iterations: 4
```

As link function we choose the logit link (which maps the mean response, bounded between 0 and 1, to the whole real line), and we assume the data to be binomially distributed (\(k\) successes in \(n\) trials).
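
Written out, the logit link and its inverse are as follows; R's built-ins `qlogis()` and `plogis()` implement exactly these:

```r
# The logit link maps (0, 1) to the real line; its inverse maps back
p   <- 0.73
eta <- log(p / (1 - p))   # logit: g(p) = log(p / (1 - p))
qlogis(p)                 # the same value via R's built-in logit
exp(eta) / (1 + exp(eta)) # inverse logit, back to 0.73
plogis(eta)               # the same via R's built-in inverse logit
```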

We can estimate the goodness of fit using the pseudo-\(R^2\):

```
pseudo_r2 <- 1 - fit1$deviance / fit1$null.deviance
pseudo_r2
```

`## [1] 0.7674258`

We can now overlay the fitted curve on the scatterplot:

```
dat$fitted1 <- predict(fit1, type = "response") # Alternatively type = "link"
ggplot(dat, aes(x = logdose, y = dead / n)) +
  geom_point() +
  geom_line(data = dat, aes(x = logdose, y = fitted1))
```
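
`type = "response"` and `type = "link"` are related by the inverse link. A small made-up binomial data set (since the course file is not generally available) illustrates this:

```r
# Predictions on the link scale vs. the response scale
toy <- data.frame(dose = 1:6, dead = c(1, 2, 4, 7, 9, 10), n = rep(10, 6))
fit <- glm(cbind(dead, n - dead) ~ dose, data = toy, family = binomial)

eta <- predict(fit, type = "link")     # linear predictor eta_i
mu  <- predict(fit, type = "response") # fitted probabilities g^{-1}(eta_i)
all.equal(unname(mu), unname(plogis(eta))) # TRUE
```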

Let us see if the mortality varies by toxin:

```
fit2 <- glm(cbind(dead, n - dead) ~ logdose * product, data = dat, family = binomial(link = "logit"))
summary(fit2)
```

```
##
## Call:
## glm(formula = cbind(dead, n - dead) ~ logdose * product, family = binomial(link = "logit"),
##     data = dat)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.39800  -0.60734   0.09695   0.65920   1.69300
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -2.0410     0.3034  -6.728 1.72e-11 ***
## logdose            2.8768     0.3432   8.383  < 2e-16 ***
## productB           1.2920     0.3845   3.360  0.00078 ***
## productC          -0.3079     0.4308  -0.715  0.47469
## logdose:productB  -1.6457     0.4220  -3.899 9.65e-05 ***
## logdose:productC   0.4318     0.4898   0.882  0.37793
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 311.241  on 50  degrees of freedom
## Residual deviance:  42.305  on 45  degrees of freedom
## AIC: 201.31
##
## Number of Fisher Scoring iterations: 4
```

We can estimate the goodness of fit using the pseudo-\(R^2\):

```
pseudo_r2 <- 1 - fit2$deviance / fit2$null.deviance
pseudo_r2
```

`## [1] 0.8640758`

We can fit the model for the different `product` levels and overlay the predictions on the scatterplot:

```
library(effects)
dat.effects <- effect("logdose:product", mod = fit2,
                      xlevels = list(logdose = seq(0, 2, length.out = 100)))
dat.effects <- as.data.frame(dat.effects)
ggplot() +
  geom_point(data = dat, aes(x = logdose, y = dead / n, col = product)) +
  geom_ribbon(data = dat.effects, aes(x = logdose, ymin = lower, ymax = upper, fill = product), alpha = 0.2) +
  geom_line(data = dat.effects, aes(x = logdose, y = fit, col = product)) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")
```

Several kinds of residuals can be defined for GLMs:

- response: \(r_i = y_i - \hat{\mu}_i\)
- Pearson (\(r_i\) divided by square root of the variance function)
- deviance

These definitions are all equivalent for Normal models. Deviance residuals are the default used in R, since they reflect the same criterion as used in the fitting.
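
This equivalence for Normal models is easy to verify on a gaussian fit (again using `mtcars` as a stand-in):

```r
# For a gaussian glm the three residual types coincide
fit <- glm(mpg ~ wt, data = mtcars, family = gaussian)
r_resp <- residuals(fit, type = "response")
r_pear <- residuals(fit, type = "pearson")
r_dev  <- residuals(fit, type = "deviance")
all.equal(r_resp, r_pear) # TRUE
all.equal(r_resp, r_dev)  # TRUE
```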

We can check the residuals using the `residuals()` or `rstandard()` functions.

Model `fit1`: `formula = cbind(dead, n - dead) ~ logdose`

```
plot(fit1$fitted.values, rstandard(fit1, type = "pearson"))
abline(h = 0)
```

Model `fit2`: `formula = cbind(dead, n - dead) ~ logdose * product`

```
plot(fit2$fitted.values, rstandard(fit2, type = "pearson"))
abline(h = 0)
```

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.