
Introduction

Recently we started to integrate the spatial partitioning methods of sperrorest into mlr. mlr is a unified interface for all kinds of modeling (similar to caret), currently supporting more than 100 modeling packages. Compared to sperrorest, it shines by making hyperparameter tuning easy. For now, only the most common spatial partitioning approach has been integrated: it uses k-means clustering to create spatially disjoint training and test partitions (see Brenning (2012)). This blog post gives practical guidance on performing nested spatial cross-validation using mlr. An SVM was chosen as the algorithm because it is widely known and always needs to be tuned to achieve good performance. Although the tuning effect for models with meaningful default hyperparameter values (e.g. Random Forest) is often marginal, one should always conduct hyperparameter tuning, as the specific effect is not known a priori.

Everything in mlr is built upon the following steps (a minimal sketch follows the list):

  • create a task (describes the data),
  • tune hyperparameters (optional),
  • model training,
  • prediction on a new dataset.
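To make these steps concrete, here is a minimal (non-spatial) sketch of the generic mlr workflow using the built-in iris data. It only illustrates the four steps above; the spatial specifics follow below.

library("mlr")

# 1. create a task (describes the data)
task = makeClassifTask(data = iris, target = "Species")

# 2. hyperparameter tuning is optional and omitted in this minimal sketch

# 3. model training
lrn = makeLearner("classif.ksvm")
model = train(lrn, task)

# 4. prediction on new data (here simply the first rows of iris again)
pred = predict(model, newdata = iris[1:5, ])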

When model performance is to be estimated, this procedure is conducted hundreds of times to reduce the variance among different training/prediction arrangements. It essentially mimics applying the fitted model to hundreds of different, unknown test datasets, returning a different performance value for each run. In reality, however, there is usually no in-situ data for that many test datasets. That is why the existing dataset is split into subsets for which validation information is available.

Creating a task

The “Task” in mlr essentially stores all information about your dataset: the type of the response variable, whether it is a supervised or unsupervised problem (and, if supervised, whether it is a classification or regression one), whether the “Task” is spatial or non-spatial (i.e. whether it contains a spatial reference in the form of coordinates), and much more.

The example “Task” for spatial applications in mlr uses the ecuador dataset. This dataset from Jannes Muenchow also ships with sperrorest. Since the integration is quite new, we need to use the development version of mlr from GitHub.

devtools::install_github("mlr-org/mlr")

library("mlr")

First, a task needs to be created. Luckily, the built-in datasets of mlr already come in a task structure. Check the Task section of the mlr-tutorial if you need to build one from scratch. To make it a spatial task, you need to provide the coordinates of the dataset via the coordinates argument of the task. They will later be used for the spatial partitioning of the dataset in the spatial cross-validation.
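For illustration, building such a spatial task from scratch could look roughly like the following sketch. It assumes the ecuador data frame from sperrorest with coordinate columns “x” and “y” and the response “slides”; treat the column names as assumptions and adjust them to your data.

data("ecuador", package = "sperrorest")

# separate the coordinates from the predictors/response
coords = ecuador[, c("x", "y")]
df = ecuador[, setdiff(names(ecuador), c("x", "y"))]

task_scratch = makeClassifTask(data = df, target = "slides",
                               coordinates = coords)

In the remainder of this post we simply use the ready-made spatial.task that ships with mlr: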

data("spatial.task")
spatial.task
#> Supervised task: ecuador
#> Type: classif
#> Target: slides
#> Observations: 751
#> Features:
#>    numerics     factors     ordered functionals 
#>          10           0           0           0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Has coordinates: TRUE
#> Classes: 2
#> FALSE  TRUE 
#>   251   500 
#> Positive class: TRUE

We are dealing with a supervised classification problem with 751 observations and 10 numeric predictors; the response is “slides”, with 500 positive and 251 negative observations.

Now that the task has been created, we need to set up everything for the nested spatial cross-validation. We will repeat a five-fold cross-validation 100 times at the outer level (where the performance is estimated) and again use a five-fold cross-validation for partitioning at the inner level (where the hyperparameter tuning is conducted).

Specification of the learner

The SVM implementation in the kernlab package is used in this example.

Create the learner:

learner_svm = makeLearner("classif.ksvm", predict.type = "prob")

The syntax is always the same in mlr: The prefix always specifies the task type, e.g. “classif” or “regr”. There are more options to choose from, see Task types.

As the response variable in this example is binary, we set predict.type = "prob" to tell the algorithm that we want to predict probabilities.
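As a quick aside, you can list all hyperparameters a learner exposes, together with their types and defaults:

getParamSet("classif.ksvm")
# shows, among others, C and sigma, which we will tune below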

Setting the tuning method and its space

For the tuning, two things need to be done:

  1. Selection of the tuning method
  2. Setting of the tuning space limits

There are multiple options in mlr, see the section on tuning in the mlr-tutorial.

Random search is usually a good choice as it outperforms grid search in high-dimensional settings (i.e. if a lot of hyperparameters have to be optimized) and has no disadvantages in low-dimensional cases (Bergstra and Bengio 2012).

tuning_method = makeTuneControlRandom(maxit = 50)

We set a budget of 50, which means that 50 different combinations of C and sigma are checked for each fold of the CV. Depending on your computational power you are of course free to use more. There is no rule of thumb because it always depends on the size of the tuning space and the characteristics of the algorithm, but I generally recommend not to use fewer than 50.
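For comparison only (not used in this post), the corresponding grid search control would be set up as follows; resolution defines how many values are tried per hyperparameter:

tuning_method_grid = makeTuneControlGrid(resolution = 10L)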

Next, the tuning limits of the hyperparameters C and sigma are defined:

ps = makeParamSet(makeNumericParam("C", lower = -5, upper = 15,
                                   trafo = function(x) 2 ^ x),
                  makeNumericParam("sigma", lower = -15, upper = 3,
                                   trafo = function(x) 2 ^ x))

As with the tuning budget, no scientifically backed-up limits exist within which the tuning should be conducted. Here the limits recommended by Hsu, Chang, and Lin (2003) are used. The problem of specifying tuning limits applies to all machine-learning models. Jakob Richter is working on a valuable project that establishes a database showing which limits were chosen by other users (Richter 2017). This can then serve as a reference point when searching for tuning limits for an algorithm.
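Note that the limits above are specified on a log2 scale and are only converted to the actual values by the trafo function. A quick check shows the ranges that are effectively searched:

2 ^ c(-5, 15)  # C ranges from 0.03125 to 32768
2 ^ c(-15, 3)  # sigma ranges from about 3.05e-05 to 8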

Setting the resampling method

Now that the tuning method and its tuning space have been defined, the resampling method for the cross-validation needs to be set.

This needs to be done twice:

  1. For the performance estimation level (outer level)
  2. For the tuning level (inner level) in which the hyperparameters for the outer level are optimized

In both cases we want to use spatial cross-validation (SpCV). First, we set it for the tuning level.

inner = makeResampleDesc("SpCV", iters = 5)

This setting means the following: five folds will be used (iters = 5) and, since no repetition argument was specified, only one repetition is performed. Note that “SpCV” always uses a single repetition; if more repetitions are needed, “SpRepCV” has to be used (see below). With this setting, every random hyperparameter combination is applied once to each fold (5 in total). The performance of all folds is stored, and the combination with the best mean value across all folds is chosen as the “winner”.

It will then be used on the respective fold of the performance estimation level (which still needs to be specified). For this level, again five folds will be used, but this time the arrangement will be repeated 100 times to reduce the variance introduced by partitioning. Note that this time “SpRepCV” needs to be chosen and the iters argument changes to folds.

outer = makeResampleDesc("SpRepCV", folds = 5, reps = 100)

Next, a wrapper function is needed that uses the specified learner and tells mlr that hyperparameter tuning should be performed. This function is then plugged into the actual resample() call of mlr. Luckily, mlr already comes with an integrated wrapper function:

wrapper_ksvm = makeTuneWrapper(learner_svm, resampling = inner, par.set = ps, 
                               control = tuning_method, show.info = TRUE,
                               measures = list(auc))

Executing everything in parallel

Now everything is ready to execute the nested cross-validation. To reduce runtime, we want to run it in parallel. mlr comes with integrated parallelization via the parallelMap package. It lets you choose not only the type of parallelism (“socket”, “multicore”, etc.) but also the level that should be parallelized (here the tuning level is selected, i.e. “mlr.tuneParams”). Choosing the parallelization level determines whether the hyperparameter tuning is parallelized (the chosen tuning iterations) or the performance estimation at the outer level (the total number of folds).

Two seeds need to be set here:

  1. A seed for the creation of the resampling partitions. set.seed() can be used here as the partitions are created before the parallel processes start.
  2. For the seeding of the tested hyperparameter settings, mc.set.seed = TRUE needs to be used as this step is executed in parallel.

Additionally, the extract = getTuneResult option stores the tuning result of every fold of the performance estimation level so that the chosen hyperparameters can be inspected later on.

library(parallelMap)

parallelStart(mode = "multicore", level = "mlr.tuneParams", 
              cpus = 4, mc.set.seed = TRUE)

set.seed(12345)
resa_svm_spatial <- resample(wrapper_ksvm, spatial.task,
                             resampling = outer, extract = getTuneResult,
                             show.info = TRUE, measures = list(auc))

parallelStop()

# Resampling: repeated spatial cross-validation
# Measures:             auc
# [Tune] Started tuning learner classif.ksvm for parameter set:
#          Type len Def    Constr Req Tunable Trafo
# C     numeric   -   - -12 to 15   -    TRUE     Y
# sigma numeric   -   -  -15 to 6   -    TRUE     Y
# With control class: TuneControlRandom
# Imputation value: -0
# Mapping in parallel: mode = multicore; cpus = 4; elements = 50.
# [Tune] Result: C=617; sigma=0.000755 : auc.test.mean=0.6968159
# [Resample] iter 1:    0.5404151
# 
# [...]

The output of the first fold tells the following:

The best setting of C and sigma out of the 50 tested is:

[Tune] Result: C=617; sigma=0.000755 : auc.test.mean=0.6968159

Subsequently, these two values were used to fit a model on the training set of the first fold at the performance estimation level. This model then achieved an AUROC value of 0.5404151 on the respective test set. Only this value counts towards the performance of the model. The AUROC value from the tuning level (0.6968159) does not contribute to the final performance estimate! It only means that the winning hyperparameter setting reached this value during tuning. Apparently, there can be major differences between the performance achieved during tuning and at the performance estimation level.

This procedure is then applied to the remaining 499 folds (5 folds * 100 repetitions).
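Afterwards, the aggregated and per-fold results can be pulled out of the returned ResampleResult object, for example:

resa_svm_spatial$aggr                 # mean AUROC over all 500 outer folds
head(resa_svm_spatial$measures.test)  # AUROC of each outer fold
resa_svm_spatial$extract[[1]]         # tuning result (chosen C/sigma) of fold 1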

A personal note

I have experienced that a lot of people are confused about cross-validation: what its purpose is and how it relates to the actual prediction one wants to make on a new dataset:

Everything that has been shown here is only a performance estimate based on 500 different training + prediction runs. The results can be visualized in a boxplot (or similar), or one can take the median or mean value to make a statement about the average performance of the model.
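For instance, a base R boxplot of the per-fold AUROC values and their median could be produced like this (a ggplot2 version works just as well):

boxplot(resa_svm_spatial$measures.test$auc,
        ylab = "AUROC (outer folds)")
median(resa_svm_spatial$measures.test$auc)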

The actual step of training a model and predicting on a new dataset (with the aim of creating a spatial prediction map) is a completely different one. There, a single model is trained on the complete dataset. The hyperparameters are again estimated in a new tuning procedure, and the winning setting is used for the training. There is no relation to the tuning part of the cross-validation.
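A sketch of this final step could look as follows: tune once on the full dataset, set the winning hyperparameters, train a single model and predict on the new data. Here new_data is a placeholder for whatever dataset (e.g. a raster stack converted to a data frame) you want to predict on.

tune_final = tuneParams(learner_svm, task = spatial.task, resampling = inner,
                        par.set = ps, control = tuning_method,
                        measures = list(auc))

learner_final = setHyperPars(learner_svm, par.vals = tune_final$x)
model_final = train(learner_final, spatial.task)

# pred = predict(model_final, newdata = new_data)  # new_data: placeholder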

References

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13 (February): 281–305. http://dl.acm.org/citation.cfm?id=2188385.2188395.

Brenning, Alexander. 2012. “Spatial Cross-Validation and Bootstrap for the Assessment of Prediction Rules in Remote Sensing: The R Package Sperrorest.” In 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE. https://doi.org/10.1109/igarss.2012.6352393.

Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. 2003. A Practical Guide to Support Vector Classification. Department of Computer Science, National Taiwan University, Taipei 106, Taiwan.

Richter, Jakob. 2017. mlrHyperopt: Easy Hyperparameter Optimization with mlr and mlrMBO. https://doi.org/10.5281/zenodo.896269.