Recap last session

  • Comparative bivariate analysis of means with t-tests (t.test())
  • Comparative bivariate analysis of variances with F-tests (var.test())
  • Testing for variance homogeneity with the Barlett’s test (barlett.test())
  • Fitting an ANOVA model (aov()) -> augmentation of lm()
  • Computing summary tables for aov() model fits (model.tables())
  • Comparing multiple means with “Tukey’s ‘Honest Significant Difference’” method (TukeyHSD())

Today’s session

  • Multiple regression in R
  • Understanding interactions
  • Model-selection

Packages

library(ggplot2)
library(MuMIn)
library(GGally)

Airquality dataset

Let’s import the airquality dataset that we used in previous sessions.

# load airquality dataset
airq <- read.table("data/airquality.txt", sep = ",", dec = ".", header = TRUE)

# remove observations with missing data
airq <- na.omit(airq) 

# log-transform Ozone
airq$logOzone <- log(airq$Ozone)

# convert the numeric Month column into a factor variable
airq$Month <- factor(airq$Month, levels = 5:9,
                                 labels = c("May", "Jun", "Jul", "Aug", "Sep"))

# print first 5 rows of the dataset on screen
head(airq)
##   ID Ozone Solar Wind Temp Month Day logOzone
## 1  1    41   190  7.4   67   May   1 3.713572
## 2  2    36   118  8.0   72   May   2 3.583519
## 3  3    12   149 12.6   74   May   3 2.484907
## 4  4    18   313 11.5   62   May   4 2.890372
## 7  7    23   299  8.6   65   May   7 3.135494
## 8  8    19    99 13.8   59   May   8 2.944439

Exploratory analysis

Before constructing potentially complex linear models, it is important to visually explore your data and ask questions like:

  1. What is the relationship between my response and predictor variables. Do I need to transform variables?

  2. What is the relationship among my predictor variables? Do I need all? Do I need to account for co-linearity?

You can evaluate the correlation between all variables by applying the cor() function to a data frame. With the methods argument, you can chose between “pearson” (default), “kendall”, and “spearman” correlation coefficient.

cor(airq[, 1:4])
##                ID      Ozone       Solar       Wind
## ID     1.00000000  0.1375667 -0.08457572 -0.1803616
## Ozone  0.13756674  1.0000000  0.34834169 -0.6124966
## Solar -0.08457572  0.3483417  1.00000000 -0.1271835
## Wind  -0.18036162 -0.6124966 -0.12718345  1.0000000

You can obtain pairwise scatterplots between all the variables in the data set using the function ggpairs() in the GGally package. Alternatively, use the pairs() function in base R.

# library(GGally)
ggpairs(data = airq, columns = 1:4)

Multiple regression formula

Specification of a multiple regression analysis is done by setting up a model formula with + between the explanatory variables. The + means, we are constructing a model were the predictors have an additive effect on the response variable.

lm1 <- lm(logOzone ~ Temp + Solar + Wind, data = airq)

We can add more complex terms, i.e. polynomials, “on-the-fly” using the I() function. The I() function “protects” the calculation inside the I() function call, such that operators like +, -, * and ^ are not misinterpreted as formula operators.

lm2 <- lm(logOzone ~ Temp + Solar + I(Solar^2) + Wind, data = airq)

You can use the update() function to add or remove predictors from a model:

# add a term from model lm2
lm3 <- update(lm2, . ~ . + I(Temp^2)) 
# remove a term from model lm3
lm2 <- update(lm3, . ~ . - I(Temp^2))

Dummy variables

A mathematical model can only work with numbers. So, how does R deal with categorical variables? Categorical predictor variables are internally recoded into multiple binary dummy variables, where each dummy variable switches a single factor level on (1) or off (0). For example, the variable Month in the airquality dataset has five factor levels: May, Jun, Jul, Aug, Sep. To illustrate the point, let’s recode the factor variable manually.

airq$dummyJun <- ifelse(airq$Month=="Jun", 1, 0)
airq$dummyJul <- ifelse(airq$Month=="Jul", 1, 0)
airq$dummyAug <- ifelse(airq$Month=="Aug", 1, 0)
airq$dummySep <- ifelse(airq$Month=="Sep", 1, 0)

head(airq[!duplicated(airq$Month), -1])
##     Ozone Solar Wind Temp Month Day logOzone dummyJun dummyJul dummyAug dummySep
## 1      41   190  7.4   67   May   1 3.713572        0        0        0        0
## 38     29   127  9.7   82   Jun   7 3.367296        1        0        0        0
## 62    135   269  4.1   84   Jul   1 4.905275        0        1        0        0
## 93     39    83  6.9   81   Aug   1 3.663562        0        0        1        0
## 124    96   167  6.9   91   Sep   1 4.564348        0        0        0        1

We do not need to create a dummy for month May since month May is automatically coded when all other dummy variables are 0. So, if a variable has n factors, you need n-1 dummy variables. That also means the more factor levels the more parameters your model needs to estimate. Let’s build a linear regression model between logOzone and temperature and month using the dummy variables.

lm4 <- lm(logOzone ~ Temp + dummyJun + dummyJul + dummyAug + dummySep, data = airq)
summary(lm4)
## 
## Call:
## lm(formula = logOzone ~ Temp + dummyJun + dummyJul + dummyAug + 
##     dummySep, data = airq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26925 -0.29321  0.03879  0.34206  1.50462 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.138938   0.546860  -3.911 0.000163 ***
## Temp         0.074715   0.008037   9.296 2.32e-15 ***
## dummyJun    -0.468759   0.243768  -1.923 0.057192 .  
## dummyJul    -0.244664   0.214692  -1.140 0.257045    
## dummyAug    -0.293629   0.217549  -1.350 0.180009    
## dummySep    -0.387591   0.179445  -2.160 0.033053 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5748 on 105 degrees of freedom
## Multiple R-squared:  0.5793, Adjusted R-squared:  0.5593 
## F-statistic: 28.92 on 5 and 105 DF,  p-value: < 2.2e-16

Note, there are different names for dummy variables, indicator variables is another common term.

Now, as I mentioned before, the dummy coding is actually done internally when you include a factor variable as predictor. You do not need to do the coding manually. So, before we discuss what the output means, let’s check out the ‘automated’ code version without manual dummy coding (recall, variable Month is a factor variable):

lm5 <- lm(logOzone ~ Temp + Month, data = airq)
summary(lm5)
## 
## Call:
## lm(formula = logOzone ~ Temp + Month, data = airq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26925 -0.29321  0.03879  0.34206  1.50462 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.138938   0.546860  -3.911 0.000163 ***
## Temp         0.074715   0.008037   9.296 2.32e-15 ***
## MonthJun    -0.468759   0.243768  -1.923 0.057192 .  
## MonthJul    -0.244664   0.214692  -1.140 0.257045    
## MonthAug    -0.293629   0.217549  -1.350 0.180009    
## MonthSep    -0.387591   0.179445  -2.160 0.033053 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5748 on 105 degrees of freedom
## Multiple R-squared:  0.5793, Adjusted R-squared:  0.5593 
## F-statistic: 28.92 on 5 and 105 DF,  p-value: < 2.2e-16

The model results are the same as using manually coded dummy variables. You see the factor Month was recoded into dummy variables, though they are now named MonthJun, MonthJul, MonthAug, MonthSep. The first level (May) is automatically selected as the base level (e.g. if all other dummy variables are 0).

Intercept dummy variables

With the previous model, we fitted a linear regression model using Temperature and Month as predictors. In other words, we let the linear relationship between Temperature and logOzone vary with Month. More precisely, we let only the intercept of the line vary with Month. To illustrate how we can estimate the intercepts associated with the different levels of Month, we need to go back to our linear model equation:

\(logOzone = \beta_0 + \beta_1Temp + \beta_2MonthJun + \beta_3MonthJul + \beta_4MonthAug + \beta_5Sep + \epsilon\)

Since the dummy variables are binary, i.e. either 0 or 1, the equation above reduces to this equation for month May:

\(logOzone = \beta_0 + \beta_1Temp + \beta_2*0 + \beta_3*0 + \beta_4*0 + \beta_5*0\)

\(logOzone = \beta_0 + \beta_1Temp\)

..or for Month July:

\(logOzone = \beta_0 + \beta_1Temp + \beta_2*0 + \beta_3*1 + \beta_4*0 + \beta_5*0\)

\(logOzone = \beta_0 + \beta_1Temp + \beta_3\)

Since \(\beta_0\) and \(\beta_3\) are constants, we can rearrange as follows:

\(logOzone = (\beta_0 + \beta_3) + \beta_1Temp\)

To estimate the different intercepts, we replace the \(\beta's\) with the corresponding parameter outputs from summary(lm5), e.g. \(\beta_0\) is (Intercept), \(\beta_2\) is MonthJun, \(\beta_3\) is MonthJul, etc. For May the intercept of the regression line is \(\beta_0\) = -2.138938 and for month July the intercept is \(\beta_0+\beta_3\) = -2.138938 + -0.244664 = -2.383602 (for month August \(\beta_0+\beta_4\) etc.).

coef(lm5) # access model coefficients (beta's)
## (Intercept)        Temp    MonthJun    MonthJul    MonthAug    MonthSep 
## -2.13893784  0.07471496 -0.46875932 -0.24466375 -0.29362919 -0.38759055
lm5_intercept_May <- coef(lm5)['(Intercept)']
lm5_intercept_May
## (Intercept) 
##   -2.138938
lm5_intercept_Jul <- coef(lm5)['(Intercept)'] + coef(lm5)['MonthJul']
lm5_intercept_Jul
## (Intercept) 
##   -2.383602
lm5_slope <- coef(lm5)['Temp']
lm5_slope
##       Temp 
## 0.07471496

You see, the only parameter that changes with Month in our model is the intercept and we can calculate the intercepts based on \(\beta_0\) and the \(\beta\)’s from our dummy variables. Let’s plot the regression lines for May and July based on the calculated intercepts.

ggplot(airq, aes(x = Temp, y = logOzone)) + 
  geom_point(aes(col = Month)) +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")) +
  geom_abline(slope=lm5_slope, intercept=lm5_intercept_May, col="#E41A1C") +
  geom_abline(slope=lm5_slope, intercept=lm5_intercept_Jul, col="#4DAF4A")

Interaction terms

So far, we only included an additive effect of Month, i.e. we allowed only the intercept to vary with month. However, what if the effect is not additive? In other words, what if the effect of Month on Ozone is not constant? Perhaps, the effect of Month varies also with Temperature, in which case we need to also let the slope of the line vary with Month not just the intercept. To include such interactions between predictor variables, we can include interaction terms.

In model formulas in R, interaction terms are generated using the colon operator: var1:var2.

Usually, the interaction term is included in addition to the additive effect:

respone ~ var1+var2 + var1:var2

Rather than writing both terms, you can use the shorthand: var1*var2. Hence, the following formula describes the same model as the formula above.

respone ~ var1*var2

Airquality example

Let’s try out an interaction between Temperature and Month with our airquality dataset.

lm6 <- lm(logOzone ~ Temp * Month, data = airq)
summary(lm6)
## 
## Call:
## lm(formula = logOzone ~ Temp * Month, data = airq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19906 -0.26980  0.00246  0.35058  1.51166 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.76430    1.19238  -2.318   0.0224 *  
## Temp           0.08413    0.01786   4.711 7.89e-06 ***
## MonthJun       2.42798    2.33970   1.038   0.3019    
## MonthJul      -3.95983    2.45826  -1.611   0.1103    
## MonthAug       0.11327    1.87091   0.061   0.9518    
## MonthSep       1.13404    1.54120   0.736   0.4635    
## Temp:MonthJun -0.03845    0.03123  -1.231   0.2211    
## Temp:MonthJul  0.04233    0.03121   1.357   0.1779    
## Temp:MonthAug -0.00680    0.02477  -0.275   0.7843    
## Temp:MonthSep -0.02107    0.02187  -0.963   0.3377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.568 on 101 degrees of freedom
## Multiple R-squared:  0.6048, Adjusted R-squared:  0.5696 
## F-statistic: 17.18 on 9 and 101 DF,  p-value: < 2.2e-16

The model output shows the parameter estimates for the base Intercept, for Temperature, and for Month coded as dummy variables. The parameter names with the : sign are the interaction terms.

Estimating dummy effects

To estimate how the slopes vary with Month, we need to go back to our model formula and combine the parameters as we did when estimating the intercepts from additive dummy model. The parameters for the interaction terms are \(\beta_6\)-\(\beta_9\).

\(logOzone = \beta_0 + \beta_1*Temp + \beta_2*Jun + \beta_3*Jul + ... + \beta_5*Sep + \beta_6*Temp*Jun + \beta_7*Temp*Jul + ... + \beta_9*Temp*Sep\)

So, if we are interested in the relationship between Temperature and logOzone for May, the above equation reduces to:

\(logOzone = \beta_0 + \beta_1Temp\)

As May is our base factor level, all other \(\beta\)s get multiplied by zeros and therefore eliminated. The intercept of month May is \(\beta_0\) and the slope is \(\beta_1\).

If we are interested in the relationship between Temperature and Ozone for July (or some other month), the equation reduces as follows, where \(\beta_3\) estimates the additive term and \(\beta_6\) estimates the interaction term. Note, unlike in an R formula, where the \(*\) sign symbolizes the inclusion of both, additive and interaction terms, here \(*\) truly represents the multiplicative sign.

\(logOzone = \beta_0 + \beta_1Temp + \beta_3Jul + \beta_7*Temp*Jul\)

which we can rearrange as follows:

\(logOzone = (\beta_0 + \beta_3) + (\beta_1 + \beta_7)*Temp\)

where the intercept is \((\beta_0 + \beta_3)\) and the slope is \((\beta_1 + \beta_7)\).

Back to the example

The coefficients of our airquality model with interaction term are as follows:

coef(lm6) # access model coefficients (beta's)
##   (Intercept)          Temp      MonthJun      MonthJul      MonthAug      MonthSep Temp:MonthJun 
##  -2.764304030   0.084124863   2.427979206  -3.959831889   0.113272053   1.134041099  -0.038447329 
## Temp:MonthJul Temp:MonthAug Temp:MonthSep 
##   0.042334206  -0.006799668  -0.021065369

To estimate ozone concentration for month May, we grab the intercept and slope term:

\(logOzone_{May} = -2.76 + 0.08 * Temp\)

To estimate ozone concentration for month July, we have to add the parameters for the July intercept offset (-3.96) and the slope offset from the interaction term (0.04):

\(logOzone_{July} = (-2.76 - 3.96) + (0.08 + 0.04) * Temp\)

We can do this directly in R and plot the corresponding regression lines:

lm6_intercept_May <- coef(lm6)['(Intercept)']
lm6_intercept_Jul <- coef(lm6)['(Intercept)'] + coef(lm6)['MonthJul']
lm6_slope_May <- coef(lm6)['Temp']
lm6_slope_Jul <- coef(lm6)['Temp'] + coef(lm6)['Temp:MonthJul']

ggplot(airq, aes(x = Temp, y = logOzone)) +
  geom_point(aes(col = Month)) +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")) +
  geom_abline(slope=lm6_slope_May, intercept=lm6_intercept_May, col="#E41A1C") +
  geom_abline(slope=lm6_slope_Jul, intercept=lm6_intercept_Jul, col="#4DAF4A")

effects package

As always, there already exists a package doing all the hard coding-work for us. In this case, we make use of the effect() function in the effects package:

library(effects)

efct <- effect(term = "Temp:Month", mod = lm6)
efct <- as.data.frame(efct) # convert list to data.frame
efct$Month <- factor(efct$Month, levels = c("May", "Jun", "Jul", "Aug", "Sep"))
head(efct)
##   Temp Month      fit        se    lower    upper
## 1   60   May 2.283188 0.1635384 1.958772 2.607604
## 2   70   May 3.124436 0.1320779 2.862429 3.386443
## 3   80   May 3.965685 0.2681741 3.433700 4.497670
## 4   90   May 4.806934 0.4360770 3.941874 5.671993
## 5  100   May 5.648182 0.6100672 4.437973 6.858392
## 6   60   Jun 2.404327 0.5038101 1.404903 3.403751

Note: confidence intervals are calculated automatically. You can change the level of the confidence intervals setting the argument confint. See ?effect.

Plotting the regression on the log-scale:

ggplot(airq) +
  geom_point(aes(x = Temp, y = logOzone, col = Month)) +
  geom_line(data = efct, aes(x = Temp, y = fit, col = Month)) +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00"))

And confidence intervals to the plot as ribbons using geom_ribbon():

ggplot(airq) +
  geom_point(aes(x = Temp, y = logOzone, col = Month)) +
  geom_ribbon(data = efct, aes(x = Temp, ymin = lower, ymax = upper, fill = Month), alpha = 0.2) +
  geom_line(data = efct, aes(x = Temp, y = fit, col = Month)) +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")) +
  scale_fill_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00"))

Converting predictions back to the linear scale:

efct <- effect(term = "Temp:Month", mod = lm6) 
efct <- as.data.frame(efct)
efct$Month <- factor(efct$Month, levels = c("May", "Jun", "Jul", "Aug", "Sep"))

ggplot(airq, aes(x = Temp, y = Ozone, col = Month)) +
  geom_point() +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")) +
  geom_line(data = efct, aes(y = exp(fit)))

Increasing the simulation interval:

efct <- effect(term = "Temp:Month", mod = lm6, 
               xlevels = list(Temp = seq(57, 97, length.out = 100))) # Changing the x-levels
efct <- as.data.frame(efct)
efct$Month <- factor(efct$Month, levels = c("May", "Jun", "Jul", "Aug", "Sep"))

ggplot(airq, aes(x = Temp, y = Ozone, col = Month)) +
  geom_point() +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")) +
  geom_line(data = efct, aes(y = exp(fit)))

And finally with confidence intervals:

ggplot(airq) +
  geom_point(aes(x = Temp, y = Ozone, col = Month)) +
  geom_ribbon(data = efct, aes(x = Temp, ymin = exp(lower), ymax = exp(upper), fill = Month), alpha = 0.2) +
  geom_line(data = efct, aes(x = Temp, y = exp(fit), col = Month)) +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")) +
  scale_fill_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00"))

Interactions continuous variables

We can also specify an interaction between two continuous variables:

lm7 <- lm(logOzone ~ Temp * Wind, data = airq)
summary(lm7)
## 
## Call:
## lm(formula = logOzone ~ Temp * Wind, data = airq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.31151 -0.28199  0.05219  0.34752  1.12276 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.380182   1.306868  -1.821   0.0714 .  
## Temp         0.081641   0.015953   5.118 1.37e-06 ***
## Wind         0.135132   0.115136   1.174   0.2431    
## Temp:Wind   -0.002509   0.001465  -1.713   0.0896 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5475 on 107 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.6002 
## F-statistic: 56.04 on 3 and 107 DF,  p-value: < 2.2e-16

(Intercept): logOzone concentration at 0 windspeed and 0 degrees temperature

Temp: Increasing temperature by 1 degree F at a windspeed of 0 increases logOzone concentration by 0.08.

Wind: Increasing windspeed by 1 mph is associated with an increase in logOzone concentration of 0.14, when temperature is 0.

Temp:Wind: The effect of temperature on logOzone concentration decreases by -0.003 for every unit increase in windspeed. For example, at zero windspeed, the effect (slope) of temperature on logOzone concentration is 0.08. At a windpeed of 1 mph, the effect decreases to 0.082 - 0.003 = 0.077.

Let’s start with the formula showing the additive terms for temperature (\(\beta_1Temp\)) and wind (\(\beta_2Wind\)) and the interaction term between temperature and wind (\(\beta_3tw\)):

\[logOzone = \beta_0 +\beta_1Temp +\beta_2Wind +\beta_3Temp*Wind + \epsilon\]

We want to find the regression line between temperature and ozone for different levels of wind speed. We can rearrange the equation as follows:

\[y = (\beta_0 +\beta_2Wind) + (\beta_1 +\beta_3Wind) * Temp + \epsilon\]

Now we can say that the intercept for temperature is \((\beta_0 + \beta_2Wind)\) and the slope is \((\beta_1 + \beta_3Wind)\).

We can graph the interaction using the effects package:

efct <- effect("Temp:Wind", mod = lm7, 
               xlevels = list(Temp = seq(57, 97, length.out = 100)))

efct <- as.data.frame(efct)

ggplot(airq) +
  geom_point(aes(x = Temp, y = Ozone, col = Wind)) +
  geom_line(data = efct, aes(x = Temp, y = exp(fit), col = Wind, group = Wind))

We can also change the values of the modulating variable within the xlevels argument:

efct <- effect("Temp:Wind", mod = lm7, 
               xlevels = list(Temp = seq(57, 97, length.out = 100),
                              Wind = c(0, 10, 20, 30)))

efct <- as.data.frame(efct)

ggplot(airq) +
  geom_point(aes(x = Temp, y = Ozone, col = Wind)) +
  geom_line(data = efct, aes(x = Temp, y = exp(fit), col = Wind, group = Wind))

For interpretation, it is often much better to scale the predictors to unit scale (via z-transformation):

airq[, 3:5] <- scale(airq[, 3:5]) # z-transforming Solar, Wind, Temp

lm8 <- lm(logOzone ~ Temp * Wind, data = airq)
summary(lm8)
## 
## Call:
## lm(formula = logOzone ~ Temp * Wind, data = airq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.31151 -0.28199  0.05219  0.34752  1.12276 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.37401    0.05744  58.739  < 2e-16 ***
## Temp         0.54037    0.06017   8.980 1.03e-14 ***
## Wind        -0.21364    0.06017  -3.551 0.000572 ***
## Temp:Wind   -0.08507    0.04967  -1.713 0.089649 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5475 on 107 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.6002 
## F-statistic: 56.04 on 3 and 107 DF,  p-value: < 2.2e-16

Given the interpretation shown in the previous slides, the coefficient for Temp now indicates the increase in logOzone with an increase in temperature by 1 standard deviation (9.5 F) at mean windspeed (9.9 mph).

Again plotting the regression model:

efct <- effect("Temp:Wind", mod = lm8, 
               xlevels = list(Temp = seq(-2, 2, length.out = 100)))

efct <- as.data.frame(efct)

ggplot(airq) +
  geom_point(aes(x = Temp, y = Ozone, col = Wind)) +
  geom_line(data = efct, aes(x = Temp, y = exp(fit), col = Wind, group = Wind))

Model selection

AIC

The Akaike Information Criterion (AIC) balances the goodness of fit and a penalty for model complexity. AIC is defined such that the smaller the value of AIC the better the model. Compare models based on the Akaike Information Criterion:

AIC(lm1, lm2, lm3)
##     df      AIC
## lm1  5 170.8251
## lm2  6 168.5865
## lm3  7 168.0743

What is the advantage of AIC over \(R^2\)?

AIC(lm1, lm2, lm3)
##     df      AIC
## lm1  5 170.8251
## lm2  6 168.5865
## lm3  7 168.0743
c(summary(lm1)$r.squared, summary(lm2)$r.squared, summary(lm3)$r.squared)
## [1] 0.6644239 0.6769963 0.6842246
c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared, summary(lm3)$adj.r.squared)
## [1] 0.6550153 0.6648075 0.6691876

Stepwise elimination

R has several functions for performing model searches, e.g. based on the Akaike information criterion. Such functions make the process of model selection efficient, particularly when there are lots of predictor variables. From Tobias you learned manual backward and forward model selection:

Backward elimination:

  • Start with a full model including potential transformations and polynomials
  • Work your way backwards by eliminating non-significant variables
  • Compare models using AIC and \(R_{adj}\)

Forward elimination:

  • Start with a single predictor
  • Add variables and assess their significance
  • Compare models using AIC and \(R_{adj}\)

One advantage of doing model reductions by hand is that you can impose some logical structure on the process, based on your understanding of the data and the underlying process.


Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.