4.5 Non-Linear Graphs

The Graphs seen so far all have a linear structure. Some POs may have multiple input or output channels. These make it possible to create non-linear Graphs with alternative paths taken by the data.

Possible types are:

  • Branching: Splitting of a node into several paths, e.g. useful when comparing multiple feature-selection methods (PCA, filters). Only one path will be executed.
  • Copying: Splitting of a node into several paths, all of which are executed (sequentially). Parallel execution is not yet supported.
  • Stacking: Single Graphs are stacked onto each other, i.e. the output of one Graph is the input for another. In machine learning this means that the prediction of one Graph is used as input for another Graph.

4.5.1 Branching & Copying

The PipeOpBranch and PipeOpUnbranch POs make it possible to specify multiple alternative paths. Only one is actually executed, the others are ignored. The active path is determined by a hyperparameter. This concept makes it possible to tune alternative preprocessing paths (or learner models).

PipeOp(Un)Branch is initialized either with the number of branches, or with a character-vector indicating the names of the branches. If names are given, the “branch-choosing” hyperparameter becomes more readable. In the following, we set three options:

  1. Doing nothing (“nop”)
  2. Applying a PCA
  3. Scaling the data

It is important to “unbranch” again after “branching”, so that the outputs are merged into one result object.

In the following we first create the branched graph and then show what happens if the “unbranching” is not applied:

Without “unbranching” one creates the following graph:
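A minimal sketch of such an open-ended graph, assuming a recent mlr3pipelines version that provides the po() shorthand. The variable names (opts, graph_open) are illustrative and not taken from the original code:

```r
library(mlr3)
library(mlr3pipelines)

# Names of the three alternative paths; they become the possible values
# of the "selection" hyperparameter of the branch PipeOp.
opts = c("nop", "pca", "scale")
branch = po("branch", options = opts)

# Branch into three parallel PipeOps, but do not merge them again.
graph_open = branch %>>% gunion(list(po("nop"), po("pca"), po("scale")))

# The graph ends in three separate output channels, one per path.
print(graph_open$output$name)
```

Because no PipeOpUnbranch follows, the graph has one output channel per path instead of a single result.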

Now when “unbranching”, we obtain the following results:
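The following sketch closes the branches with PipeOpUnbranch and selects the PCA path. It assumes the po() shorthand and the iris example task, and the variable names are illustrative:

```r
library(mlr3)
library(mlr3pipelines)

opts = c("nop", "pca", "scale")
graph = po("branch", options = opts) %>>%
  gunion(list(po("nop"), po("pca"), po("scale"))) %>>%
  po("unbranch", options = opts)

# The active path is chosen via the hyperparameter of the branch PipeOp;
# inside a Graph it is prefixed with the PipeOp id ("branch").
graph$param_set$values$branch.selection = "pca"

# Training returns a list with a single element: the task that went
# through the selected path only.
task = tsk("iris")
out = graph$train(task)[[1]]
print(out$feature_names)
```

Since branch.selection is an ordinary hyperparameter, it can be included in a tuning search space like any other.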

The same can be achieved using a shorter notation:

## Graph with 5 PipeOps:
##        ID         State        sccssors       prdcssors
##    branch <<UNTRAINED>> no_op,pca,scale                
##     no_op <<UNTRAINED>>        unbranch          branch
##       pca <<UNTRAINED>>        unbranch          branch
##     scale <<UNTRAINED>>        unbranch          branch
##  unbranch <<UNTRAINED>>                 no_op,pca,scale
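A graph with the same PipeOp ids as in the printout above could be built as follows. This is a sketch that assumes the ppl("branch", ...) pipeline helper of recent mlr3pipelines versions; the original code may have used a different call:

```r
library(mlr3)
library(mlr3pipelines)

# The list names become the names of the branch options. The id "no_op"
# is set explicitly so that it matches the printed graph.
graph = ppl("branch", list(
  no_op = po("nop", id = "no_op"),
  pca = po("pca"),
  scale = po("scale")
))
print(graph)
```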

4.5.2 Model Ensembles

We can leverage the different operations presented to connect POs. This allows us to form powerful graphs.

Before we go into details, we split the task into train and test indices.
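A sketch of such a split, assuming the iris task and a 120/30 partition. The seed and the names train_idx and test_idx are illustrative choices:

```r
library(mlr3)

set.seed(1)  # assumption: any seed works, it only makes the split reproducible
task = tsk("iris")

# 120 rows for training, the remaining 30 for testing
train_idx = sample(seq_len(task$nrow), 120)
test_idx = setdiff(seq_len(task$nrow), train_idx)
```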

4.5.2.1 Bagging

We first examine Bagging, introduced by Breiman (1996). The basic idea is to create multiple predictors and then aggregate them into a single, more powerful predictor.

“… multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets” (Breiman 1996)

Bagging then aggregates a set of predictors by averaging (regression) or majority vote (classification). The idea behind bagging is that a set of weak but different predictors can be combined to arrive at a single, better predictor.

We can achieve this by downsampling our data before training a learner, repeating this e.g. 10 times and then performing a majority vote on the predictions.

First, we create a simple pipeline that uses PipeOpSubsample before a PipeOpLearner is trained:
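As a sketch, assuming a decision tree as the base learner and a subsampling fraction of 0.7 (both are illustrative choices, not prescribed by the text):

```r
library(mlr3)
library(mlr3pipelines)

# Subsample 70% of the rows, then fit a decision tree on the subsample.
single_pred = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))
print(single_pred)
```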

We can now copy this operation 10 times using greplicate.
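A sketch of the replication step. It assumes the ppl("greplicate", ...) helper of recent mlr3pipelines versions; older versions exported a plain greplicate() function instead:

```r
library(mlr3)
library(mlr3pipelines)

single_pred = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))

# Create 10 parallel copies of the two-step pipeline. Each copy gets
# its own ids so that the PipeOps remain distinguishable.
pred_set = ppl("greplicate", graph = single_pred, n = 10L)
print(length(pred_set$pipeops))
```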

Afterwards we need to aggregate the 10 pipelines to form a single model:
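For classification, the aggregation can be sketched with PipeOpClassifAvg, which combines the incoming predictions by (weighted) majority vote or by averaging probabilities, depending on the predict type. The subsampling fraction and base learner are illustrative assumptions:

```r
library(mlr3)
library(mlr3pipelines)

single_pred = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))
pred_set = ppl("greplicate", graph = single_pred, n = 10L)

# Merge the 10 prediction outputs into a single prediction.
bagging = pred_set %>>% po("classifavg", innum = 10)
print(nrow(bagging$output))
```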

Now we can plot again to see what happens:

This pipeline can again be used in conjunction with GraphLearner in order for Bagging to be used like a Learner:

## <PredictionClassif> for 30 observations:
##     row_id     truth  response
##          6    setosa    setosa
##         21    setosa    setosa
##         22    setosa    setosa
## ---                           
##        136 virginica virginica
##        138 virginica virginica
##        147 virginica virginica
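A prediction object like the one above could be produced as follows. This is a sketch that assumes the iris task with a random 120/30 split; the seed and variable names are illustrative, so the selected rows will generally differ from the printout:

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)  # assumption: only for reproducibility
task = tsk("iris")
train_idx = sample(seq_len(task$nrow), 120)
test_idx = setdiff(seq_len(task$nrow), train_idx)

single_pred = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))
bagging = ppl("greplicate", graph = single_pred, n = 10L) %>>%
  po("classifavg", innum = 10)

# Wrap the graph so that it behaves like any other Learner.
baglrn = GraphLearner$new(bagging)
baglrn$train(task, train_idx)
pred = baglrn$predict(task, test_idx)
print(pred)
```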

In conjunction with different Backends, this can be a very powerful tool. In cases when the data does not fully fit in memory, one can obtain a fraction of the data for each learner from a DataBackend and then aggregate predictions over all learners.

4.5.2.2 Stacking

Stacking (Wolpert 1992) is another technique that can improve model performance. The basic idea behind stacking is the use of predictions from one model as features for a subsequent model to possibly improve performance.

As an example we can train a decision tree and use the predictions from this model in conjunction with the original features in order to train an additional model on top.

To limit overfitting, we do not train the next level on predictions that the learner made for its own training data. Instead, we use out-of-fold predictions, i.e. predictions for observations the model has not seen during fitting. To do all this, we can use PipeOpLearnerCV.

PipeOpLearnerCV performs cross-validation on the training data, fitting a model in each fold. Each of the models is then used to predict on the out-of-fold data. As a result, we obtain predictions for every data point in our input data.

We first create a “level 0” learner, which is used to extract a lower level prediction. We clone() the learner object to obtain a copy of the learner and then set a custom id for the PipeOp.
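A sketch of this step, assuming a decision tree as the level 0 learner. The id "rpart_cv" and the variable names are illustrative:

```r
library(mlr3)
library(mlr3pipelines)

base_lrn = lrn("classif.rpart")

# Wrap a copy of the learner so that training emits cross-validated
# predictions instead of a fitted model only.
lrn_0 = po("learner_cv", base_lrn$clone())
lrn_0$id = "rpart_cv"

# Training the PipeOp on its own returns a task whose features are the
# out-of-fold predictions, one row per input row.
task = tsk("iris")
out = lrn_0$train(list(task))[[1]]
print(out$feature_names)
```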

We use PipeOpNOP in combination with gunion, in order to send the unchanged Task to the next level. There it is combined with the predictions from our decision tree learner.
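As a sketch (variable names are illustrative), the two parallel paths can be placed side by side like this:

```r
library(mlr3)
library(mlr3pipelines)

lrn_0 = po("learner_cv", lrn("classif.rpart"))
lrn_0$id = "rpart_cv"

# Two parallel paths: cross-validated predictions and the untouched task.
level_0 = gunion(list(lrn_0, po("nop")))
print(level_0$output$name)
```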

Afterwards, we want to concatenate the predictions from PipeOpLearnerCV and the original Task using PipeOpFeatureUnion:
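A sketch of the concatenation, assuming two incoming channels (one for the predictions, one for the original features):

```r
library(mlr3)
library(mlr3pipelines)

lrn_0 = po("learner_cv", lrn("classif.rpart"))
lrn_0$id = "rpart_cv"
level_0 = gunion(list(lrn_0, po("nop")))

# Column-bind the features arriving on both channels into one task.
combined = level_0 %>>% po("featureunion", innum = 2)
print(combined$output$name)
```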

Now we can train another learner on top of the combined features:
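The complete stack can be sketched as follows, again assuming the iris task with an illustrative 120/30 split and a second decision tree as the top-level learner:

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)  # assumption: only for reproducibility
task = tsk("iris")
train_idx = sample(seq_len(task$nrow), 120)
test_idx = setdiff(seq_len(task$nrow), train_idx)

base_lrn = lrn("classif.rpart")
lrn_0 = po("learner_cv", base_lrn$clone())
lrn_0$id = "rpart_cv"

# Level 0 predictions plus original features, then a learner on top.
stack = gunion(list(lrn_0, po("nop"))) %>>%
  po("featureunion", innum = 2) %>>%
  po("learner", base_lrn$clone())

stacklrn = GraphLearner$new(stack)
stacklrn$train(task, train_idx)
pred = stacklrn$predict(task, test_idx)
print(pred)
```

The graph has two input channels; when it is used as a GraphLearner, the same task is passed to both of them.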

In this vignette, we showed a very simple use-case for stacking. In many real-world applications, stacking is done for multiple levels and on multiple representations of the dataset. On a lower level, different preprocessing methods can be defined in conjunction with several learners. On a higher level, we can then combine those predictions in order to form a very powerful model.

4.5.2.3 Multilevel Stacking

In order to showcase the power of mlr3pipelines, we will show a more complicated stacking example.

In this case, we train a glmnet and 2 different rpart models (some of which transform their inputs using PipeOpPCA) on our task in “level 0” and concatenate their predictions with the original features (via gunion). The result is then passed on to “level 1”, where we copy the concatenated features 3 times and feed this task into an rpart and a glmnet model. Additionally, we keep a version of the “level 0” output (via PipeOpNOP) and pass this on to “level 2”. In “level 2” we simply concatenate all “level 1” outputs and train a final decision tree.

In the following examples, use <lrn>$param_set$values$<param_name> = <param_value> to set hyperparameters for the different learners.
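The following sketch shows this pattern at three places: on a Learner, on a PipeOpLearnerCV, and on a Graph, where parameter names are prefixed with the id of the PipeOp. The maxdepth values and the ids are illustrative:

```r
library(mlr3)
library(mlr3pipelines)

# On the Learner itself
rprt = lrn("classif.rpart", predict_type = "prob")
rprt$param_set$values$maxdepth = 5L

# On a PipeOp wrapping the learner (the PipeOp holds its own copy)
lrn_cv = po("learner_cv", rprt, id = "rpart_cv_1")
lrn_cv$param_set$values$maxdepth = 3L

# On a Graph: "<PipeOp id>.<param_name>"
g = po("pca") %>>% lrn_cv
g$param_set$values$rpart_cv_1.maxdepth = 1L
print(g$pipeops$rpart_cv_1$param_set$values$maxdepth)
```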

## Warning: Package 'glmnet' required but not installed for Learner
## 'classif.glmnet'

And we can again call .$train and .$predict:
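A reduced sketch of such a multilevel stack is given below. To avoid the glmnet dependency it uses only rpart models, so it is a simplified variant of the architecture described above, not the original code. All ids, the seed and the split are illustrative assumptions:

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)  # assumption: only for reproducibility
task = tsk("iris")
train_idx = sample(seq_len(task$nrow), 120)
test_idx = setdiff(seq_len(task$nrow), train_idx)

rprt = lrn("classif.rpart", predict_type = "prob")

# Level 0: two cross-validated trees (one on PCA-rotated inputs)
# plus the original features passed through unchanged.
lrn_0 = po("learner_cv", rprt, id = "rpart_cv_1")
lrn_0$param_set$values$maxdepth = 5L
lrn_1 = po("pca", id = "pca1") %>>% po("learner_cv", rprt, id = "rpart_cv_2")
lrn_1$param_set$values$rpart_cv_2.maxdepth = 1L
level_0 = gunion(list(lrn_0, lrn_1, po("nop", id = "NOP1")))

# Level 1: concatenate, copy twice, train one more cross-validated tree
# and keep the level 0 output alongside it.
level_1 = level_0 %>>%
  po("featureunion", innum = 3, id = "u1") %>>%
  po("copy", outnum = 2) %>>%
  gunion(list(
    po("learner_cv", rprt, id = "rpart_cv_l1"),
    po("nop", id = "NOP_l1")
  ))

# Level 2: concatenate all level 1 outputs and train a final tree.
level_2 = level_1 %>>%
  po("featureunion", innum = 2, id = "u2") %>>%
  po("learner", rprt, id = "rpart_l2")

multistack = GraphLearner$new(level_2)
multistack$train(task, train_idx)
pred = multistack$predict(task, test_idx)
print(pred)
```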

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer: 123–40.

Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59. https://doi.org/10.1016/S0893-6080(05)80023-1.