Understanding Bootstrap Aggregation (Bagging) in Machine Learning

1 Introduction

Bagging, short for Bootstrap Aggregation, is a powerful ensemble learning method that reduces the variance of a prediction by averaging the predictions of models fit to many bootstrap samples of the training data.

The technique is most effective when the underlying model has high variance and is prone to overfitting.

2 Objective

  • Variance Reduction Through Averaging: Bagging strategically averages predictions over bootstrap samples, effectively curbing variance.
  • Leveraging Statistical Connections: Beyond simple averaging, the bootstrap mean that bagging computes is closely connected to an approximate posterior average, which helps explain why the method works.

3 Examples

3.1 Regression problem

  • Training data \(\mathbf{Z}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}\)

  • Fitting a model to the data and obtaining the prediction \(\hat{f}(x)\) at input \(x\)

  • For each bootstrap sample \(\mathbf{Z}^{* b}, b=1,2, \ldots, B\), we fit our model, giving prediction \(\hat{f}^{* b}(x)\).

    • \(\mathbf{Z}^{*}=\left\{\left(x_{1}^{*}, y_{1}^{*}\right),\left(x_{2}^{*}, y_{2}^{*}\right), \ldots,\left(x_{N}^{*}, y_{N}^{*}\right)\right\}\)
    • \(\left(x_i^*, y_i^*\right) \sim \hat{\mathcal{P}}\) where \(\hat{\mathcal{P}}\) is the empirical distribution putting equal probability \(1/N\) on each of the data points \((x_i, y_i)\).
  • The bagging estimate is \[ \hat{f}_{\mathrm{bag}}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat{f}^{* b}(x) \]

  • \(\hat{f}_{\mathrm{bag}}(x)\) is a Monte Carlo estimate of the true bagging estimate \(\mathrm{E}_{\hat{\mathcal{P}}} \hat{f}^{*}(x)\)

  • The bagged estimate will differ from the original estimate \(\hat{f}(x)\) only when the latter is a nonlinear or adaptive function of the data.

    • For example, a B-spline smoother is linear in the data if we fix the inputs; hence if we sample using the parametric bootstrap, it gives \[ \hat{f}_{\mathrm{bag}}(x) \rightarrow \hat{f}(x) \text { as } B \rightarrow \infty \]
  • For a regression tree, \(\hat{f}(x)\) denotes the tree's prediction at input vector \(x\); the bagged estimate is the average prediction at \(x\) over the \(B\) bootstrap trees (see the code sketch below).
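
The recipe above is easy to implement directly. Below is a minimal sketch in Python, assuming NumPy and scikit-learn are available; the toy data, tree settings, and names such as `f_bag` are illustrative rather than canonical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy training data Z = {(x_i, y_i)}, i = 1, ..., N (illustrative)
N = 200
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=N)

B = 100  # number of bootstrap samples
trees = []
for b in range(B):
    # Z*b: sample N points with replacement from the empirical distribution P-hat
    idx = rng.integers(0, N, size=N)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def f_bag(x_new):
    """Bagging estimate: average the B bootstrap-tree predictions at x_new."""
    x_new = np.atleast_2d(x_new)
    preds = np.stack([t.predict(x_new) for t in trees])  # shape (B, n_points)
    return preds.mean(axis=0)

print(f_bag([[0.5], [1.5]]))  # bagged predictions at two query points
```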

3.2 K-class classification

Consider \(\hat{f}(x)\), an indicator-vector function whose value is a single one and \(K-1\) zeroes, so the classifier is \[ \hat{G}(x)=\arg \max _k \hat{f}(x). \] The bagged estimate \(\hat{f}_{\mathrm{bag}}(x)\) is then a \(K\)-vector \(\left[p_1(x), p_2(x), \ldots, p_K(x)\right]\), with \(p_k(x)\) equal to the proportion of trees predicting class \(k\) at \(x\). The bagged classifier selects the class with the most "votes" from the \(B\) trees: \[ \hat{G}_{\mathrm{bag}}(x)=\arg \max _k \hat{f}_{\mathrm{bag}}(x)=\arg \max _k \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{* b}(x). \]
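
A minimal sketch of this voting scheme, assuming NumPy and scikit-learn; the synthetic three-class data and the helper name `bagged_proportions` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy K = 3 class problem (illustrative)
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
N, K, B = len(y), 3, 50

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)  # bootstrap sample Z*b
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_proportions(x_new):
    """p_k(x): proportion of the B trees that predict class k at x_new."""
    votes = np.concatenate([t.predict(np.atleast_2d(x_new)) for t in trees])
    return np.bincount(votes.astype(int), minlength=K) / B

p = bagged_proportions(X[0])
print(p, "-> G_bag(x) =", p.argmax())  # class with the most votes
```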

4 Properties

4.1 Bagging will often decrease mean-squared error

  • Consider the ideal (population-level) aggregate estimator, \(f_{\mathrm{ag}}(x) = \mathrm{E}_{\mathcal{P}} \hat{f}^*(x)\), where samples are drawn from the actual population \(\mathcal{P}\) rather than from the training data.
  • \(\hat{f}^{*}(x)\) is the prediction at \(x\) from a model fit to a sample drawn from \(\mathcal{P}\). \[ \begin{aligned}\mathrm{E}_{\mathcal{P}}\left[Y-\hat{f}^*(x)\right]^2 & =\mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)+f_{\mathrm{ag}}(x)-\hat{f}^*(x)\right]^2 \\& =\mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)\right]^2+\mathrm{E}_{\mathcal{P}}\left[\hat{f}^*(x)-f_{\mathrm{ag}}(x)\right]^2 \\& \geq \mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)\right]^2 .\end{aligned} \]
  • The cross term vanishes because \(\mathrm{E}_{\mathcal{P}} \hat{f}^*(x)=f_{\mathrm{ag}}(x)\); the additional term on the right-hand side is the variance of \(\hat{f}^*(x)\) around its mean \(f_{\mathrm{ag}}(x)\).
  • Consequently, true population aggregation never increases mean squared error; a quick Monte Carlo check is sketched below.
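
The following sketch checks the inequality numerically, assuming NumPy and scikit-learn; the population model, the tree depth, and names such as `draw_from_population` are hypothetical choices made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def draw_from_population(n):
    """Draw (X, Y) from a hypothetical population P: Y = sin(X) + noise."""
    X = rng.uniform(-3, 3, size=(n, 1))
    return X, np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)

x0 = np.array([[1.0]])            # fixed query point x
B = 200
preds = []
for _ in range(B):                # each fit sees a fresh sample drawn from P
    X, y = draw_from_population(100)
    preds.append(DecisionTreeRegressor(max_depth=3).fit(X, y).predict(x0)[0])
preds = np.array(preds)

f_ag = preds.mean()               # Monte Carlo estimate of f_ag(x) = E_P f*(x)

# Independent draws of Y at x0 from the same population
Y = np.sin(1.0) + rng.normal(scale=0.5, size=10_000)
mse_single = ((Y[:, None] - preds[None, :]) ** 2).mean()  # E_P [Y - f*(x)]^2
mse_agg = ((Y - f_ag) ** 2).mean()                        # E_P [Y - f_ag(x)]^2
print(f"single-fit MSE {mse_single:.3f} >= aggregated MSE {mse_agg:.3f}")
```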

4.2 Bagging's Effectiveness with High-Variance, Low-Bias Models

Bagging is especially effective for high-variance, low-bias procedures, such as trees.

  • Regression: Repeatedly fit the same regression tree to bootstrapped versions of the training data and average the results.
  • Classification: Form a committee of trees, where each tree contributes a vote for the predicted class (a short code sketch follows this list).
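
Both recipes are also available through scikit-learn's bagging wrappers; the sketch below assumes scikit-learn is installed and uses synthetic data purely for illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Regression: average B = 100 trees, each fit to a bootstrap sample
Xr, yr = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0).fit(Xr, yr)

# Classification: a committee of B = 100 trees whose predictions are combined at each x
Xc, yc = make_classification(n_samples=300, n_features=5, random_state=0)
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0).fit(Xc, yc)

print(reg.predict(Xr[:3]), clf.predict(Xc[:3]))
```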

4.3 Bagging Effect: Caution with Poor Classifiers

Bagging can improve a good classifier, but it can also make a bad classifier worse.

4.3.1 Simple Example

Consider a scenario where \(Y=1\) for all \(x\).

If a classifier \(\hat{G}(x)\) predicts \(Y=1\) with a probability of 0.4 and \(Y=0\) with a probability of 0.6 for all \(x\), the misclassification error of \(\hat{G}(x)\) is 0.6.

However, the bagged classifier's error tends to 1.0 as \(B\) grows (simulated below).

Classes: \(k=0,1\)

\[ \begin{aligned} \hat{G}_{\mathrm{bag}}(x) & =\arg\max_{k}\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\\ & =\arg\max_{k}\left(\begin{array}{c}\frac{1}{B}\sum_{b=1}^{B}I\left(G_{b}^{*}(x)=0\right)\\\frac{1}{B}\sum_{b=1}^{B}I\left(G_{b}^{*}(x)=1\right)\end{array}\right)\\ & \rightarrow\arg\max_{k}\left(\begin{array}{c}0.6\\0.4\end{array}\right)\\ & =0 \end{aligned}\]

  • Misclassification error rate: \(e=0.6\)
  • \(S_1(x)=\sum_{b=1}^B I\left(G_b^*(x)=1\right)\sim \operatorname{Bin}(B, 1-e)\)
  • \(S_1(x) \rightarrow E(S_1(x))=B(1-e)=0.4 B < B/2\)
  • \(\operatorname{Pr}\left(S_1 < B/2\right) \rightarrow 1\) as \(B\) gets large.
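
A small simulation makes the point concrete; it assumes only NumPy and treats the \(B\) votes as independent Bernoulli draws, exactly as in the binomial argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
e, B, n_points = 0.6, 101, 10_000   # per-classifier error, number of votes, query points

# Each of the B classifiers independently predicts class 1 with probability 1 - e = 0.4,
# so the number of votes for class 1 is S_1(x) ~ Bin(B, 1 - e).
S1 = rng.binomial(B, 1 - e, size=n_points)
bagged_pred = (S1 > B / 2).astype(int)          # majority vote G_bag(x)

# The truth is Y = 1 for every x
print("single-classifier error:", e)                          # 0.6
print("bagged-classifier error:", (bagged_pred != 1).mean())  # ~ 1.0
```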

4.3.2 Why Bagged Trees Are Not a Consensus of Independent Weak Learners

  • Bagged trees are not independent: they are grown on overlapping bootstrap samples of the same training data, so their predictions are correlated.
  • The "Wisdom of Crowds" principle suggests that the collective knowledge of a diverse and independent group surpasses that of any individual. This concept is effectively harnessed through voting.
  • Random forests address this limitation by diminishing the correlation among sampled trees.

4.3.3 Illustration of Combining Weak Learners

Consider a two-class example where the Bayes optimal decision at \(x\) is \(G(x)=1\).

  • Each weak learner \(G_b^*\) has an error-rate \(e_b=e<0.5\).
  • Let \(S_1(x)=\sum_{b=1}^B I\left(G_b^*(x)=1\right)\) represent the consensus vote for class 1.
  • Assuming independence among weak learners, \(S_1(x) \sim \operatorname{Bin}(B, 1-e)\).
  • As \(B\) grows, \(\operatorname{Pr}\left(S_1 > B/2\right) \rightarrow 1\) (simulated below).
    • By the Law of Large Numbers, \[ S_1(x) \rightarrow E(S_1(x))=B(1-e) > B/2 \]
    • since \(e < 0.5\).
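
The same binomial argument can be checked numerically; the sketch below assumes only NumPy and independent weak learners with error \(e = 0.45\), an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
e, n_trials = 0.45, 20_000   # each weak learner is only slightly better than chance

for B in (1, 11, 101, 1001):
    S1 = rng.binomial(B, 1 - e, size=n_trials)   # S_1(x) ~ Bin(B, 1 - e)
    print(B, (S1 > B / 2).mean())                # fraction of trials where the majority vote is correct
```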

4.4 Bagging's Impact on Model Structure

  • Bagging can alter the simple structure of a model; for instance, a bagged tree ceases to resemble a tree.
  • More stable procedures like nearest neighbors are typically less affected by bagging.
