- Multiple linear regression (`lm(resp ~ pred_1 + ... + pred_n)`)
- Analyzing interactions (`library(effects); effect(term = "pred_1:pred_2", mod = model)`)
- Model selection using the ‘Akaike Information Criterion’ (AIC) (`AIC(model_1, model_2, model_3)`)
- Exhaustive model comparison (`library(MuMIn); dredge()`)
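As a minimal recap, the model-selection calls above might look like this in practice (a sketch using the built-in `mtcars` data set as a stand-in):

```r
# Fit three nested linear models on the built-in mtcars data
model_1 <- lm(mpg ~ wt, data = mtcars)
model_2 <- lm(mpg ~ wt + hp, data = mtcars)
model_3 <- lm(mpg ~ wt * hp, data = mtcars)  # includes the wt:hp interaction
# Compare the models by AIC (lower is better)
AIC(model_1, model_2, model_3)
```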

- ‘Machine Learning’ in a nutshell
- ‘Support Vector Machine’ (SVM) as an example ML algorithm
- Train and Test data, feature preparation
- Assessing model performance
- Hyperparameter tuning

*“A type of artificial intelligence that enables computers to learn from large data sets.”* (Alpaydin, 2016)

*“Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.”* (https://en.wikipedia.org/wiki/Machine_learning)

Machine learning algorithms build a mathematical model based on sample data, known as “training data”.

The performance of “trained” models is evaluated using “test data”.

Test and training data are typically obtained by splitting an observational data set.

Aim: make predictions or decisions without being explicitly programmed to perform the task.
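The train/test split described above can be sketched in a few lines of R (using the built-in `iris` data as a stand-in):

```r
# A common way to obtain test and training data: a random 80/20 split
set.seed(42)                        # for reproducibility
n <- nrow(iris)                     # 150 observations
test_idx <- sample(n, round(0.2 * n))
test_data  <- iris[test_idx, ]      # 30 observations for evaluation
train_data <- iris[-test_idx, ]     # 120 observations for model fitting
```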

- ML tools are often based on statistical techniques, e.g. linear regression
- ML automatically creates a predictor from input variables (possibly very many)
- ML approaches typically lack uncertainty/error quantification
- Bootstrap/leave-one-out approaches can quantify model performance indirectly
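Such an indirect performance estimate can be sketched as leave-one-out cross-validation (illustrated here with a simple linear regression on `mtcars`):

```r
# Leave-one-out: refit the model n times, each time predicting
# the single held-out observation
errors <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit  <- lm(mpg ~ wt, data = mtcars[-i, ])
  pred <- predict(fit, newdata = mtcars[i, ])
  mtcars$mpg[i] - pred
})
sqrt(mean(errors^2))  # leave-one-out RMSE
```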

- ML tools require little (but not zero!) *a priori* knowledge of data characteristics
- Many ML tools do not provide means to check how a specific result relates to the input data (‘black box’)
- Adequate preparation and optimization of input data is crucial for many ML algorithms

ML algorithms are typically applied to large (many observations) and complex (many variables) data sets.

Problems tackled by ML algorithms:

- Classification
- Regression
- Pattern recognition
- Identification of drivers / feature importances

- Artificial Neural Networks
- Random Forests
- Support Vector Machines

Support Vector Machines (SVM) are a set of related supervised learning methods used for classification and regression.

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

An SVM training algorithm is a

- non-probabilistic,
- binary,
- linear classifier,

although methods such as Platt scaling exist to use SVM in a probabilistic classification setting.
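In the `e1071` package used below, such a probabilistic interpretation is available via Platt scaling by setting `probability = TRUE` (a sketch on a two-class subset of `iris`; `e1071` must be installed):

```r
library(e1071)
# Reduce iris to two classes for a binary SVM
dat <- droplevels(subset(iris, Species != "setosa"))
# Fit with probability estimates enabled (Platt scaling)
fit <- svm(Species ~ ., data = dat, probability = TRUE)
pred <- predict(fit, dat[1:3, ], probability = TRUE)
attr(pred, "probabilities")  # class membership probabilities per observation
```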

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

In practice, multiple SVMs can be combined to analyze multi-class data (often implemented from scratch in higher level functions).

The key elements of the Support Vector Machine concept are outlined in Fig. 1.

- Identify the best-fitting hyperplane between two groups (‘positives’ vs. ‘negatives’, blue and green points in Fig. 1)
- Create two ‘support vectors’ (dashed lines in Fig. 1) running through the closest data points
- Maximize the perpendicular distance between the hyperplane (red line in Fig. 1) and the closest data points

- Use the hyperplane for predictive binary classification, i.e. a new data point is either
- left of the hyperplane -> class 1, or
- right of the hyperplane -> class -1

We are given a training dataset of \(n\) points in the form \[(\vec{x_1},y_1),...,(\vec{x_n},y_n)\] where the \(y_i\) are either \(1\) or \(-1\), each indicating the class to which the point \(\vec{x_i}\) belongs. Each \(\vec{x_i}\) is a \(p\)-dimensional real vector.

We want to find the **“maximum-margin hyperplane”** that divides the group of points \(\vec{x_i}\) for which \(y_i=1\) from the group of points for which \(y_i=-1\), which is defined so that the distance between the hyperplane and the nearest point \(\vec{x_i}\) from either group is maximized.

Any hyperplane (red line in Fig. 1) can be represented by \[\langle \vec{w},\vec{x}\rangle-b=0 \] where \(\vec{w}\) is the weight vector normal to the hyperplane and \(\vec{x}\) is a \(p\)-dimensional input vector. \(\langle \cdot,\cdot\rangle\) denotes the dot product of two vectors.

Any \(\vec{x}\) that satisfies the above equation lies on the hyperplane.

The perpendicular distance from the hyperplane to the origin is defined by \[\frac{b}{\|\vec{w}\|}\]
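The resulting decision rule is simply the sign of \(\langle\vec{w},\vec{x}\rangle-b\). As a plain R sketch with hypothetical (made-up) weights:

```r
# Hypothetical 2-D hyperplane: w = (1, -1), b = 0.5
w <- c(1, -1)
b <- 0.5
# Classify by which side of the hyperplane a point falls on
classify <- function(x) ifelse(sum(w * x) - b > 0, 1, -1)
classify(c(2, 0))  # 1: point lies on the positive side
classify(c(0, 2))  # -1: point lies on the negative side
```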

- **High Dimensionality:** SVM is an effective tool in high-dimensional spaces, which is particularly applicable to document classification and sentiment analysis, where the dimensionality can be extremely large.
- **Memory Efficiency:** Since only a subset of the training points is used in the actual decision process of assigning new members, only these points need to be stored in memory (and calculated upon) when making decisions.
- **Versatility:** Class separation is often highly non-linear. The ability to apply new kernels allows substantial flexibility for the decision boundaries, leading to greater classification performance.

- **Kernel Parameter Selection:** SVMs are very sensitive to the choice of the kernel parameters. In situations where the number of features for each object exceeds the number of training data samples, SVMs can perform poorly.
- **Non-Probabilistic:** Since the classifier works by placing objects above and below a classifying hyperplane, there is no direct probabilistic interpretation for group membership. However, one potential metric to determine the “effectiveness” of the classification is how far from the decision boundary the new point is.

```
# Define random seed for reproducibility
set.seed(5555)
# 20 observations with 2 features each
x = matrix(rnorm(40), 20, 2)
# Class labels: 10 times -1, then 10 times 1
y = rep(c(-1, 1), c(10, 10))
# Shift class 1 to make the two groups (partially) separable
x[y == 1,] = x[y == 1,] + 1
plot(x, col = y + 3, pch = 19)
```

```
library(e1071)
dat = data.frame(x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
```

```
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 5
```

`plot(svmfit, dat)`

The following is only for illustration of the decision boundaries and typically not included in actual SVM analysis projects.

```
# Extract linear coefficients of the SVM model
beta = drop(t(svmfit$coefs) %*% x[svmfit$index,])
beta0 = svmfit$rho
# Create a grid showing the domains of class -1 vs. class 1
make.grid = function(x, n = 75) {
  grange = apply(x, 2, range)
  x1 = seq(from = grange[1, 1], to = grange[2, 1], length = n)
  x2 = seq(from = grange[1, 2], to = grange[2, 2], length = n)
  expand.grid(X1 = x1, X2 = x2)
}
xgrid = make.grid(x)
ygrid = predict(svmfit, xgrid)
# Plot grid, data points and vectors
plot(xgrid, col = c("red", "blue")[as.numeric(ygrid)], pch = 20, cex = .2)
points(x, col = y + 3, pch = 19)
points(x[svmfit$index,], pch = 5, cex = 2)
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 - 1) / beta[2], -beta[1] / beta[2], lty = 2)
abline((beta0 + 1) / beta[2], -beta[1] / beta[2], lty = 2)
```

Consider a dataset containing two classes which form concentric rings in 2-dimensional data space \(\mathbb{R}^2\).

They are not linearly separable.

We may add artificial data dimensions that represent functions applied to \(x_1\) and \(x_2\).

For example, using a quadratic function for the transformation we get \([x_1, x_2] \mapsto [x_1, x_2, {x_1}^2 + {x_2}^2]\)

The resulting 3-dimensional data set is easily linearly separable by a hyperplane.

Thus, provided that we work in this \(\mathbb{R}^3\) space, we can train a linear SVM classifier that successfully finds a good decision boundary.
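This transformation can be sketched in a few lines of R (synthetic ring data; the radii and noise level are made up for illustration):

```r
# Two concentric rings: not linearly separable in R^2
set.seed(1)
theta <- runif(200, 0, 2 * pi)
r <- rep(c(1, 3), each = 100)            # inner and outer ring radii
x1 <- r * cos(theta) + rnorm(200, sd = .1)
x2 <- r * sin(theta) + rnorm(200, sd = .1)
# Added third dimension: x1^2 + x2^2 (~ squared distance from origin)
x3 <- x1^2 + x2^2
# In R^3 the two classes separate by a plane, e.g. x3 = 4
table(inner_ring = r == 1, below_plane = x3 < 4)
```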

Kernels are similarity functions, i.e. they generate a measure of similarity for two inputs.

In SVMs, kernels allow constructing the hyperplanes for support and decision boundary vectors in an artificial higher-dimensional data space.

Owing to the mathematical procedure based on vector dot products, the results can be transformed back to the original data space efficiently.

For the following, let \(\vec{x_i}, \vec{x_j} \in \mathbb{R}^N\) be rows from the dataset \(X\).

- **Polynomial Kernel**: \((\gamma \cdot \langle \vec{x_i} , \vec{x_j} \rangle + r)^d\)
- **Radial Basis Function (RBF) Kernel**: \(\exp(-\gamma \cdot \lVert \vec{x_i} - \vec{x_j} \rVert ^2)\), where \(\gamma > 0\)
- **Sigmoid Kernel**: \(\tanh(\langle \vec{x_i}, \vec{x_j} \rangle + r)\)
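To see how a kernel acts as a similarity measure, the RBF kernel can be written out directly (a sketch; the value of \(\gamma\) is arbitrary here):

```r
# RBF kernel: similarity of two vectors, 1 for identical inputs,
# decaying towards 0 with squared Euclidean distance
rbf_kernel <- function(xi, xj, gamma = 0.5) {
  exp(-gamma * sum((xi - xj)^2))
}
rbf_kernel(c(1, 2), c(1, 2))  # identical inputs -> similarity 1
rbf_kernel(c(1, 2), c(4, 6))  # distant inputs -> similarity near 0
```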

Let’s look at a data set that is not linearly separable …

```
# Clean up a bit ...
rm(x, y)
# Load environment from RData file
load(file = "data/ESL.mixture.rda")
names(ESL.mixture)
```

`## [1] "x" "y" "xnew" "prob" "marginal" "px1" "px2" "means"`

```
attach(ESL.mixture)
plot(x, col = y + 1)
```

```
# Convert the response variable to a factor
dat = data.frame(y = factor(y), x)
# Train an SVM model. Note the kernel being set to 'radial'.
fit = svm(y ~ ., data = dat, scale = FALSE, kernel = "radial", cost = 5)
```

```
# Create grid background
xgrid = expand.grid(X1 = px1, X2 = px2)
ygrid = predict(fit, xgrid)
# Plot grid and data points
plot(xgrid, col = as.numeric(ygrid), pch = 20, cex = .2)
points(x, col = y + 1, pch = 19)
# Visualize the decision boundary
func = predict(fit, xgrid, decision.values = TRUE)
func = attributes(func)$decision
contour(px1, px2, matrix(func, 69, 99), level = 0, add = TRUE)
```

Measurements of sonar reflection characteristics for rock (‘R’) and metal (‘M’). Columns represent different angles at which the sonar chirps hit the objects.

```
sonar.data <- read.csv("data/sonar.all-data", stringsAsFactors = T)
head(sonar.data, 1)
```

```
## X0.0200 X0.0371 X0.0428 X0.0207 X0.0954 X0.0986 X0.1539 X0.1601 X0.3109 X0.2111 X0.1609 X0.1582
## 1 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918 0.6552
## X0.2238 X0.0645 X0.0660 X0.2273 X0.3100 X0.2999 X0.5078 X0.4797 X0.5783 X0.5071 X0.4328 X0.5550
## 1 0.6919 0.7797 0.7464 0.9444 1 0.8874 0.8024 0.7818 0.5212 0.4052 0.3957 0.3914
## X0.6711 X0.6415 X0.7104 X0.8080 X0.6791 X0.3857 X0.1307 X0.2604 X0.5121 X0.7547 X0.8537 X0.8507
## 1 0.325 0.32 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182
## X0.6692 X0.6097 X0.4943 X0.2744 X0.0510 X0.2834 X0.2825 X0.4256 X0.2641 X0.1386 X0.1051 X0.1343
## 1 0.3835 0.1057 0.184 0.197 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.053 0.0742
## X0.0383 X0.0324 X0.0232 X0.0027 X0.0065 X0.0159 X0.0072 X0.0167 X0.0180 X0.0084 X0.0090 X0.0032 R
## 1 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.014 0.0049 0.0052 0.0044 R
```

```
sonar.n <- nrow(sonar.data)
# Use 20 % of the total data as test data
test_portion <- 0.2
test_n <- round(sonar.n * test_portion)
# Randomly draw test rows; the remaining rows form the training set
test_samples <- sample(sonar.n, test_n)
test_df <- sonar.data[test_samples,]
train_df <- sonar.data[-test_samples,]
# Separate the predictors (X) from the response column (R)
train_X <- subset(train_df, select = -R)
train_y <- train_df$R
test_X <- subset(test_df, select = -R)
test_y <- test_df$R
```

```
model <- svm(train_X, train_y)
print(model)
```

```
##
## Call:
## svm.default(x = train_X, y = train_y)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 131
```

`summary(model)`

```
##
## Call:
## svm.default(x = train_X, y = train_y)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 131
##
## ( 59 72 )
##
##
## Number of Classes: 2
##
## Levels:
## M R
```

```
pred <- predict(model, train_X)
pred_test <- predict(model, test_X)
# Confusion matrix table for train data
# (how well does the model perform describing the data it was trained on?)
table(pred, train_y)
```

```
## train_y
## pred M R
## M 89 1
## R 3 73
```

```
# Confusion matrix table for test data
# (how well does the model perform classifying unknown data?)
table(pred_test, test_y)
```

```
## test_y
## pred_test M R
## M 15 6
## R 4 16
```

```
n_correct <- length(pred_test[pred_test == test_y])
n_test <- length(pred_test)
paste('Correct [%]: ', n_correct / n_test * 100)
```

`## [1] "Correct [%]: 75.609756097561"`

Hyperparameters control the setup of machine learning algorithms. Typically, they control how many resources the algorithm will invest to find an optimal solution in the learning process, and how much complexity is allowed in the solutions.

The former type of hyperparameter is governed by a trade-off between computing performance and model quality, the latter by finding an optimal balance between model simplification and overfitting.

In the following we use the SVM hyperparameters ‘cost’, which trades off misclassification of training examples against simplicity of the decision surface, and ‘gamma’, which controls the influence radius of individual samples.

```
ranges_conf <- list(cost=10^(-1:3), gamma=c(.001,.002,.005,.01,.02,.05,.1,.2,.5,1,2))
svm_tune <- tune(svm, train.x=train_X, train.y=train_y, ranges=ranges_conf)
print(svm_tune)
```

```
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 100 0.01
##
## - best performance: 0.1323529
```

```
best_model <- svm_tune$best.model
print(best_model)
```

```
##
## Call:
## best.tune(method = svm, train.x = train_X, train.y = train_y, ranges = ranges_conf)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 100
##
## Number of Support Vectors: 101
```

```
pred_best <- predict(best_model, train_X)
# Confusion matrix table for train data using tuned model
table(pred_best, train_y)
```

```
## train_y
## pred_best M R
## M 92 0
## R 0 74
```

```
pred_best_test <- predict(best_model, test_X)
# Confusion matrix table for test data using tuned model
table(pred_best_test, test_y)
```

```
## test_y
## pred_best_test M R
## M 18 3
## R 1 19
```

```
n_correct <- length(pred_best_test[pred_best_test == test_y])
n_test <- length(pred_best_test)
paste('Correct [%]: ', n_correct / n_test * 100)
```

`## [1] "Correct [%]: 90.2439024390244"`

- Feature selection: which predictors are actually relevant?
- Feature preparation: transformation, normalization
- Split into test and train subsets
- Model training and hyperparameter tuning
- Check model performance using the test data set
- Iteratively repeat the previous steps as necessary

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.