Recap last session

  • Multiple linear regression (lm(resp ~ pred_1 + ... + pred_n))
  • Analyzing interactions (library(effects); effect(term = "pred_1:pred_2", mod = model))
  • Model selection using the ‘Akaike Information Criterion’ (AIC) (AIC(model_1, model_2, model_3))
  • Exhaustive model comparison (library(MuMIn); dredge())

Goals of today’s session:

  • ‘Machine Learning’ in a nutshell
  • ‘Support Vector Machine’ (SVM) as an example ML algorithm
  • Train and Test data, feature preparation
  • Assessing model performance
  • Hyperparameter tuning

Machine Learning

Definitions

“A type of artificial intelligence that enables computers to learn from large data sets.” (Alpaydin, 2016)

“Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.” (https://en.wikipedia.org/wiki/Machine_learning)

Key concepts

Machine learning algorithms build a mathematical model based on sample data, known as “training data”.

The performance of “trained” models is evaluated using “test data”.

Test and training data are typically obtained by splitting an observational data set.

Aim: make predictions or decisions without being explicitly programmed to perform the task.

Differentiating ML from statistics

  • ML tools are often based on statistical techniques, e.g. linear regression
  • ML automatically creates a predictor from input variables (possibly very many of them)
  • ML approaches typically lack built-in uncertainty/error quantification
    • Bootstrap or leave-one-out approaches can quantify model performance indirectly (see the sketch after this list)
  • ML tools require little (but not no!) a priori knowledge of data characteristics
  • Many ML tools do not provide means to check how a specific result relates to the input data (‘black box’)
  • Adequate preparation and optimization of input data is crucial for many ML algorithms
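
As a rough illustration of such an indirect performance check, the sketch below estimates classification accuracy by leave-one-out cross-validation; the iris data and the SVM classifier (via the e1071 package introduced later in this session) are placeholders only.

# Leave-one-out sketch: refit the model n times, each time leaving out one
# observation and checking whether it is classified correctly
library(e1071)

hits <- sapply(seq_len(nrow(iris)), function(i) {
  fit <- svm(Species ~ ., data = iris[-i, ])      # train without observation i
  predict(fit, iris[i, ]) == iris$Species[i]      # was observation i classified correctly?
})

mean(hits)  # leave-one-out estimate of classification accuracy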

Domains of application

ML algorithms are typically applied to large (many observations) and complex (many variables) data sets.

Problems tackled by ML algorithms:

  • Classification
  • Regression
  • Pattern recognition
  • Identification of drivers / feature importances

Prominent ML approaches

  • Artificial Neural Networks
  • Random Forests
  • Support Vector Machines

Support Vector Machine

Concept

Support Vector Machines (SVM) are a set of related supervised learning methods used for classification and regression.

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

An SVM training algorithm is a

  • non-probabilistic,
  • binary,
  • linear classifier,

although methods such as Platt scaling exist to use SVM in a probabilistic classification setting.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

In practice, multiple binary SVMs can be combined to analyze multi-class data (this is often handled transparently by higher-level functions).
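
For example, e1071’s svm() (used throughout this session) accepts a response with more than two classes directly and combines binary SVMs internally via one-against-one voting; a minimal sketch using the built-in iris data:

library(e1071)

# iris$Species has three classes; svm() builds the required binary SVMs
# internally and combines their votes
fit_iris <- svm(Species ~ ., data = iris)
table(predict(fit_iris), iris$Species)  # training-data confusion matrix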

Approach

The key elements of the Support Vector Machine concept are outlined in Fig. 1.

  • Identify the best-fitting hyperplane between two groups (‘positives’ vs. ‘negatives’, blue and green points in Fig. 1)
    • Construct two ‘support vectors’ (dashed lines in Fig. 1) running through the closest data points
    • Maximize the perpendicular distance between the hyperplane (red line in Fig. 1) and the closest data points
  • Use the hyperplane for predictive binary classification, i.e. a new data point is either
    • left of the hyperplane -> class 1, or
    • right of the hyperplane -> class -1

Fig. 1. Key elements of the Support Vector Machine concept.

Math 1: Training data

We are given a training dataset of \(n\) points in the form \[(\vec{x_1},y_1),...,(\vec{x_n},y_n)\] where the \(y_i\) are either \(1\) or \(-1\), each indicating the class to which the point \(\vec{x_i}\) belongs. Each \(\vec{x_i}\) is a \(p\)-dimensional real vector.

We want to find the “maximum-margin hyperplane” that divides the group of points \(\vec{x_i}\) for which \(y_i=1\) from the group of points for which \(y_i=-1\), which is defined so that the distance between the hyperplane and the nearest point \(\vec{x_i}\) from either group is maximized.

Math 2: Hyperplanes

Any hyperplane (red line in Fig. 1) can be represented by \[\langle \vec{w},\vec{x}\rangle-b=0 \] where \(\vec{w}\) is the vector of weights normal to the hyperplane and \(\vec{x}\) is a \(p\)-dimensional input vector. \(\langle \cdot,\cdot\rangle\) denotes the dot product of two vectors.

Any \(\vec{x}\) that satisfies above equation lies on the hyperplane.

The offset of the hyperplane from the origin along the normal vector \(\vec{w}\) is given by \[\frac{b}{\|\vec{w}\|}\]
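
Combining the two pieces above, finding the maximum-margin hyperplane can be stated as the standard (hard-margin) constrained optimization problem \[\min_{\vec{w},\,b} \; \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i \left( \langle \vec{w},\vec{x_i}\rangle - b \right) \ge 1, \quad i = 1,\dots,n.\] The distance between the two support vectors is \(\frac{2}{\|\vec{w}\|}\), so minimizing \(\|\vec{w}\|\) maximizes the margin.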

Advantages

  • High Dimensionality: SVM is an effective tool in high-dimensional spaces, which is particularly applicable to document classification and sentiment analysis where the dimensionality can be extremely large.

  • Memory Efficiency: Since only a subset of the training points are used in the actual decision process of assigning new members, just these points need to be stored in memory (and calculated upon) when making decisions.

  • Versatility: Class separation is often highly non-linear. The ability to apply new kernels allows substantial flexibility for the decision boundaries, leading to greater classification performance.

Disadvantages

  • Kernel Parameter Selection: SVMs are very sensitive to the choice of the kernel parameters. In situations where the number of features for each object exceeds the number of training data samples, SVMs can perform poorly.

  • Non-Probabilistic: Since the classifier works by placing objects above and below a classifying hyperplane, there is no direct probabilistic interpretation for group membership. However, one potential metric to determine the “effectiveness” of the classification is how far from the decision boundary the new point is.
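
A small sketch of this distance-based measure with e1071 (the package used in the examples below; the data here are purely illustrative): predict() can return the signed decision value of each point, whose magnitude indicates how far the point lies from the decision boundary.

library(e1071)

# Illustrative two-class data
set.seed(1)
x <- matrix(rnorm(40), 20, 2)
y <- factor(rep(c(-1, 1), each = 10))
x[y == 1, ] <- x[y == 1, ] + 2

fit <- svm(x, y, kernel = "linear")

# Signed decision values: the sign determines the predicted class,
# the magnitude how far the point is from the separating hyperplane
pred <- predict(fit, x, decision.values = TRUE)
head(attr(pred, "decision.values"))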

Linearly separable data example

Create example data

# Define random seed 
set.seed(5555)

x = matrix(rnorm(40), 20, 2)
y = rep(c(-1, 1), c(10, 10))
x[y == 1,] = x[y == 1,] + 1

plot(x, col = y + 3, pch = 19)

Train a SVM model

library(e1071)
dat = data.frame(x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  5

SVM classification plot

plot(svmfit, dat)

Visualize support and decision boundary vectors

The following code only illustrates the decision boundaries; it is typically not part of an actual SVM analysis project.

# Extract linear coefficients of the SVM model
beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho

# Create a grid showing the domains of class -1 vs. class 1 
make.grid = function(x, n = 75) {
  grange = apply(x, 2, range)
  x1 = seq(from = grange[1,1], to = grange[2,1], length = n)
  x2 = seq(from = grange[1,2], to = grange[2,2], length = n)
  expand.grid(X1 = x1, X2 = x2)
}

xgrid = make.grid(x)
ygrid = predict(svmfit, xgrid)

# Plot grid, data points and vectors
plot(xgrid, col = c("red", "blue")[as.numeric(ygrid)], pch = 20, cex = .2)
points(x, col = y + 3, pch = 19)
points(x[svmfit$index,], pch = 5, cex = 2)
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 - 1) / beta[2], -beta[1] / beta[2], lty = 2)
abline((beta0 + 1) / beta[2], -beta[1] / beta[2], lty = 2)

The ‘Kernel Trick’

Example data that are not linearly separable

Consider a dataset containing two classes which form concentric rings in 2-dimensional data space \(\mathbb{R}^2\).

They are not linearly separable.

Transformation to higher dimension

We may add artificial data dimensions that represent functions applied to \(x_1\) and \(x_2\).

For example, using a quadratic function for the transformation we get \([x_1, x_2] \mapsto [x_1, x_2, {x_1}^2 + {x_2}^2]\).

The resulting 3-dimensional data set is easily linearly separable by a hyperplane.

Thus, provided that we work in this \(\mathbb{R}^3\) space, we can train a linear SVM classifier that successfully finds a good decision boundary.
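
A small sketch of this idea with simulated ring data (all names and values below are illustrative only):

# Two concentric rings: not linearly separable in the original 2D space
set.seed(1)
n  <- 100
r  <- c(runif(n, 0, 1), runif(n, 2, 3))   # radii of inner and outer ring
a  <- runif(2 * n, 0, 2 * pi)             # random angles
x1 <- r * cos(a)
x2 <- r * sin(a)
cls <- rep(c(1, 2), each = n)

plot(x1, x2, col = cls, pch = 19)         # no straight line separates the two rings

# Add the quadratic dimension x3 = x1^2 + x2^2
x3 <- x1^2 + x2^2
plot(x3, col = cls, pch = 19)             # the classes now separate along x3
abline(h = 2)                             # e.g. the plane x3 = 2 separates them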

Kernels

Kernels are similarity functions, i.e. they generate a measure of similarity for two inputs.

In SVMs, kernels make it possible to construct the hyperplanes for the support and decision boundary vectors in an artificial, higher-dimensional data space.

Because the underlying mathematics relies only on vector dot products, the results can be transformed back to the original data space efficiently.

Common Kernel functions

For the following, let \(\vec{x_i}, \vec{x_j} \in \mathbb{R}^N\) be rows from the dataset \(X\).

  • Polynomial Kernel: \((\gamma \cdot \langle \vec{x_i}, \vec{x_j} \rangle + r)^d\)
  • Radial Basis Function (RBF) Kernel: \(\exp(-\gamma \cdot \lVert \vec{x_i} - \vec{x_j} \rVert^2)\), where \(\gamma > 0\)
  • Sigmoid Kernel: \(\tanh(\langle \vec{x_i}, \vec{x_j} \rangle + r)\)
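
Written out in R, the three kernels above could look as follows (a sketch only; xi and xj are numeric vectors, gamma, r and d are the kernel hyperparameters):

poly_kernel    <- function(xi, xj, gamma, r, d) (gamma * sum(xi * xj) + r)^d
rbf_kernel     <- function(xi, xj, gamma)       exp(-gamma * sum((xi - xj)^2))
sigmoid_kernel <- function(xi, xj, r)           tanh(sum(xi * xj) + r)

# In e1071::svm() the corresponding choices are kernel = "polynomial",
# "radial" and "sigmoid"; the hyperparameters are passed via the arguments
# gamma, coef0 and degree.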

Non-linear example data

Let’s look at a data set that is not linearly separable …

# Clean up a bit ...
rm(x, y)

# Load environment from RData file
load(file = "data/ESL.mixture.rda")
names(ESL.mixture)
## [1] "x"        "y"        "xnew"     "prob"     "marginal" "px1"      "px2"      "means"
attach(ESL.mixture)

plot(x, col = y + 1)

Train a non-linear SVM classifier

# Convert the response variable to factor
dat = data.frame(y = factor(y), x)

# Train a SVM model. Note the kernel being set to 'radial'.
fit = svm(factor(y) ~ ., data = dat, scale = FALSE, kernel = "radial", cost = 5)

Visualize support and decision boundary vectors

# Create grid background
xgrid = expand.grid(X1 = px1, X2 = px2)
ygrid = predict(fit, xgrid)

# Plot grid and data points
plot(xgrid, col = as.numeric(ygrid), pch = 20, cex = .2)
points(x, col = y + 1, pch = 19)

# Vizualize decision boundary
func = predict(fit, xgrid, decision.values = TRUE)
func = attributes(func)$decision
contour(px1, px2, matrix(func, 69, 99), level = 0, add = TRUE)

SVM with multidimensional data

Multidimensional example

Measurements of sonar reflection characteristics for rock (‘R’) and metal (‘M’) objects. Each column represents the energy of the returned sonar chirp within a particular frequency band.

sonar.data <- read.csv("data/sonar.all-data", stringsAsFactors = T)
head(sonar.data, 1)
##   X0.0200 X0.0371 X0.0428 X0.0207 X0.0954 X0.0986 X0.1539 X0.1601 X0.3109 X0.2111 X0.1609 X0.1582
## 1  0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337  0.2872  0.4918  0.6552
##   X0.2238 X0.0645 X0.0660 X0.2273 X0.3100 X0.2999 X0.5078 X0.4797 X0.5783 X0.5071 X0.4328 X0.5550
## 1  0.6919  0.7797  0.7464  0.9444       1  0.8874  0.8024  0.7818  0.5212  0.4052  0.3957  0.3914
##   X0.6711 X0.6415 X0.7104 X0.8080 X0.6791 X0.3857 X0.1307 X0.2604 X0.5121 X0.7547 X0.8537 X0.8507
## 1   0.325    0.32  0.3271  0.2767  0.4423  0.2028  0.3788  0.2947  0.1984  0.2341  0.1306  0.4182
##   X0.6692 X0.6097 X0.4943 X0.2744 X0.0510 X0.2834 X0.2825 X0.4256 X0.2641 X0.1386 X0.1051 X0.1343
## 1  0.3835  0.1057   0.184   0.197  0.1674  0.0583  0.1401  0.1628  0.0621  0.0203   0.053  0.0742
##   X0.0383 X0.0324 X0.0232 X0.0027 X0.0065 X0.0159 X0.0072 X0.0167 X0.0180 X0.0084 X0.0090 X0.0032 R
## 1  0.0409  0.0061  0.0125  0.0084  0.0089  0.0048  0.0094  0.0191   0.014  0.0049  0.0052  0.0044 R

Create test and train subsets

sonar.n <- nrow(sonar.data)

# Use 20 % of the total data as test data
test_portion <- 0.2
test_n <- round(sonar.n * test_portion)
train_n <- sonar.n - test_n

test_samples <- sample(nrow(sonar.data), test_n)

test_df <- sonar.data[test_samples,]
train_df <- sonar.data[-test_samples,]

train_X <- subset(train_df, select = -R)
train_y <- train_df$R

test_X  <- subset(test_df, select = -R)
test_y  <- test_df$R
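
Note that sample() draws a different random subset on every run, so the exact split (and all numbers derived from it below) will vary. A minimal sketch for a reproducible split (the seed value is arbitrary):

set.seed(42)                                       # fix the random number generator
test_samples <- sample(nrow(sonar.data), test_n)   # now reproducible across runs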

Setup initial SVM model

model <- svm(train_X, train_y)
print(model)
## 
## Call:
## svm.default(x = train_X, y = train_y)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  131
summary(model)
## 
## Call:
## svm.default(x = train_X, y = train_y)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  131
## 
##  ( 59 72 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  M R

Check initial SVM model’s performance

pred <- predict(model, train_X)
pred_test <- predict(model, test_X)

# Confusion matrix table for train data
# (how well does the model perform describing the data it was trained on?)
table(pred, train_y)
##     train_y
## pred  M  R
##    M 89  1
##    R  3 73
# Confusion matrix table for test data
# (how well does the model perform classifying unknown data?)
table(pred_test, test_y)
##          test_y
## pred_test  M  R
##         M 15  6
##         R  4 16
n_correct <- length(pred_test[pred_test == test_y])
n_test <- length(pred_test)

paste('Correct [%]: ', n_correct / n_test * 100)
## [1] "Correct [%]:  75.609756097561"

Hyperparameter Tuning

Hyperparameters control the setup of Machine Learning algorithms. Typically, they determine how many resources the algorithm invests in finding an optimal solution during learning, and how much complexity is allowed in the solutions.

The former type of hyperparameter is governed by a trade-off between computing effort and model quality, the latter by finding an optimal balance between model simplification and overfitting.

In the following we tune the SVM hyperparameters ‘cost’, which trades off misclassification of training examples against simplicity of the decision surface, and ‘gamma’, which controls the radius of influence of individual training samples.

ranges_conf <- list(cost=10^(-1:3), gamma=c(.001,.002,.005,.01,.02,.05,.1,.2,.5,1,2))

svm_tune <- tune(svm, train.x=train_X, train.y=train_y, ranges=ranges_conf)
print(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   100  0.01
## 
## - best performance: 0.1323529
best_model <- svm_tune$best.model
print(best_model)
## 
## Call:
## best.tune(method = svm, train.x = train_X, train.y = train_y, ranges = ranges_conf)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  101

Check the tuned model

pred_best <- predict(best_model, train_X)
# Confusion matrix table for train data using tuned model
table(pred_best, train_y)
##          train_y
## pred_best  M  R
##         M 92  0
##         R  0 74
pred_best_test <- predict(best_model, test_X)
# Confusion matrix table for test data using tuned model
table(pred_best_test, test_y)
##               test_y
## pred_best_test  M  R
##              M 18  3
##              R  1 19
n_correct <- length(pred_best_test[pred_best_test == test_y])
n_test <- length(pred_best_test)

paste('Correct [%]: ', n_correct / n_test * 100)
## [1] "Correct [%]:  90.2439024390244"

Machine Learning analysis work flow

  • Feature selection: which predictors are actually relevant?
  • Feature preparation: transformation, normalization
  • Split into test and train subsets
  • Model training and hyperparameter tuning
  • Check model performance using the test data set
  • Iteratively repeat the previous steps as necessary (see the sketch below)
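
A minimal sketch of this work flow in R (the iris data, the chosen split ratio and the tuning ranges are placeholders; assumes the e1071 package used above):

library(e1071)

# Feature preparation: normalize predictors, keep the factor response
df <- data.frame(class = iris$Species, scale(iris[, 1:4]))

# Split into train and test subsets (80 % / 20 %)
set.seed(1)
test_idx <- sample(nrow(df), round(0.2 * nrow(df)))
train_df <- df[-test_idx, ]
test_df  <- df[test_idx, ]

# Model training and hyperparameter tuning (10-fold cross validation by default)
tuned <- tune(svm, class ~ ., data = train_df,
              ranges = list(cost = 10^(-1:2), gamma = c(0.01, 0.1, 1)))

# Check model performance on the test data set
pred <- predict(tuned$best.model, test_df)
table(pred, test_df$class)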

Copyright © 2020 Humboldt-Universität zu Berlin. Department of Geography.