Understanding Bootstrap Aggregation (Bagging) in Machine Learning

1 Introduction

Bagging, short for Bootstrap Aggregation, is a powerful ensemble learning method that reduces the variance of a prediction by averaging the predictions of models fit to many bootstrap samples of the training data.

The technique is most effective when the underlying model has high variance and is prone to overfitting.

2 Objective

  • Variance Reduction Through Averaging: Bagging strategically averages predictions over bootstrap samples, effectively curbing variance.
  • Leveraging Statistical Connections: Beyond simple averaging, the bootstrap mean that bagging computes is closely connected to an approximate posterior average, which helps explain why the method works.

3 Examples

3.1 Regression problem

  • Training data \(\mathbf{Z}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}\)

  • Fitting a model to the data and obtaining the prediction \(\hat{f}(x)\) at input \(x\)

  • For each bootstrap sample \(\mathbf{Z}^{* b}, b=1,2, \ldots, B\), we fit our model, giving prediction \(\hat{f}^{* b}(x)\).

    • \(\mathbf{Z}^{*}=\left\{\left(x_{1}^{*}, y_{1}^{*}\right),\left(x_{2}^{*}, y_{2}^{*}\right), \ldots,\left(x_{N}^{*}, y_{N}^{*}\right)\right\}\)
    • \(\left(x_i^*, y_i^*\right) \sim \hat{\mathcal{P}}\) where \(\hat{\mathcal{P}}\) is the empirical distribution putting equal probability \(1/N\) on each of the data points \((x_i, y_i)\).
  • The bagging estimate is \[ \hat{f}_{\mathrm{bag}}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat{f}^{* b}(x) \]

  • \(\hat{f}_{\mathrm{bag}}(x)\) is a Monte Carlo estimate of the true bagging estimate \(\mathrm{E}_{\hat{\mathcal{P}}} \hat{f}^{*}(x)\)

  • The bagged estimate will differ from the original estimate \(\hat{f}(x)\) only when the latter is a nonlinear or adaptive function of the data.

    • For example, a B-spline smoother is linear in the data if we fix the inputs; hence if we sample using the parametric bootstrap, it gives \[ \hat{f}_{\mathrm{bag}}(x) \rightarrow \hat{f}(x) \text { as } B \rightarrow \infty \]
  • For a regression tree, \(\hat{f}(x)\) denotes the tree's prediction at input vector \(x\); the bagged estimate is the average prediction at \(x\) over the \(B\) bootstrap trees (see the code sketch below).
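
The recipe above is easy to implement directly. Below is a minimal sketch in Python, assuming NumPy and scikit-learn are available; the toy data, tree settings, and names such as `f_bag` are illustrative rather than canonical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy training data Z = {(x_i, y_i)}, i = 1, ..., N (illustrative)
N = 200
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=N)

B = 100  # number of bootstrap samples
trees = []
for b in range(B):
    # Z*b: sample N points with replacement from the empirical distribution P-hat
    idx = rng.integers(0, N, size=N)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def f_bag(x_new):
    """Bagging estimate: average the B bootstrap-tree predictions at x_new."""
    x_new = np.atleast_2d(x_new)
    preds = np.stack([t.predict(x_new) for t in trees])  # shape (B, n_points)
    return preds.mean(axis=0)

print(f_bag([[0.5], [1.5]]))  # bagged predictions at two query points
```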

3.2 K-class classification

Consider \(\hat{f}(x)\), an indicator-vector function whose value is a single one and \(K-1\) zeroes, so the classifier is \[ \hat{G}(x)=\arg \max _k \hat{f}(x). \] The bagged estimate \(\hat{f}_{\mathrm{bag}}(x)\) is then a \(K\)-vector \(\left[p_1(x), p_2(x), \ldots, p_K(x)\right]\), with \(p_k(x)\) equal to the proportion of trees predicting class \(k\) at \(x\). The bagged classifier selects the class with the most "votes" from the \(B\) trees: \[ \hat{G}_{\mathrm{bag}}(x)=\arg \max _k \hat{f}_{\mathrm{bag}}(x)=\arg \max _k \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{* b}(x). \]
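
A minimal sketch of this voting scheme, assuming NumPy and scikit-learn; the synthetic three-class data and the helper name `bagged_proportions` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy K = 3 class problem (illustrative)
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
N, K, B = len(y), 3, 50

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)  # bootstrap sample Z*b
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_proportions(x_new):
    """p_k(x): proportion of the B trees that predict class k at x_new."""
    votes = np.concatenate([t.predict(np.atleast_2d(x_new)) for t in trees])
    return np.bincount(votes.astype(int), minlength=K) / B

p = bagged_proportions(X[0])
print(p, "-> G_bag(x) =", p.argmax())  # class with the most votes
```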

4 Properties

4.1 Bagging will often decrease mean-squared error

  • Consider the ideal (population-level) aggregate estimator, \(f_{\mathrm{ag}}(x) = \mathrm{E}_{\mathcal{P}} \hat{f}^*(x)\), where samples are drawn from the actual population \(\mathcal{P}\) rather than from the training data.
  • \(\hat{f}^{*}(x)\) is the prediction at \(x\) from a model fit to a sample drawn from \(\mathcal{P}\). \[ \begin{aligned}\mathrm{E}_{\mathcal{P}}\left[Y-\hat{f}^*(x)\right]^2 & =\mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)+f_{\mathrm{ag}}(x)-\hat{f}^*(x)\right]^2 \\& =\mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)\right]^2+\mathrm{E}_{\mathcal{P}}\left[\hat{f}^*(x)-f_{\mathrm{ag}}(x)\right]^2 \\& \geq \mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)\right]^2 .\end{aligned} \]
  • The cross term vanishes because \(\mathrm{E}_{\mathcal{P}} \hat{f}^*(x)=f_{\mathrm{ag}}(x)\); the additional term on the right-hand side is the variance of \(\hat{f}^*(x)\) around its mean \(f_{\mathrm{ag}}(x)\).
  • Consequently, true population aggregation never increases mean squared error; a quick Monte Carlo check is sketched below.
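
The following sketch checks the inequality numerically, assuming NumPy and scikit-learn; the population model, the tree depth, and names such as `draw_from_population` are hypothetical choices made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def draw_from_population(n):
    """Draw (X, Y) from a hypothetical population P: Y = sin(X) + noise."""
    X = rng.uniform(-3, 3, size=(n, 1))
    return X, np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)

x0 = np.array([[1.0]])            # fixed query point x
B = 200
preds = []
for _ in range(B):                # each fit sees a fresh sample drawn from P
    X, y = draw_from_population(100)
    preds.append(DecisionTreeRegressor(max_depth=3).fit(X, y).predict(x0)[0])
preds = np.array(preds)

f_ag = preds.mean()               # Monte Carlo estimate of f_ag(x) = E_P f*(x)

# Independent draws of Y at x0 from the same population
Y = np.sin(1.0) + rng.normal(scale=0.5, size=10_000)
mse_single = ((Y[:, None] - preds[None, :]) ** 2).mean()  # E_P [Y - f*(x)]^2
mse_agg = ((Y - f_ag) ** 2).mean()                        # E_P [Y - f_ag(x)]^2
print(f"single-fit MSE {mse_single:.3f} >= aggregated MSE {mse_agg:.3f}")
```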

4.2 Bagging's Effectiveness with High-Variance, Low-Bias Models

Bagging is especially effective for high-variance, low-bias procedures, such as trees.

  • Regression: Repeatedly fit the same regression tree to bootstrapped versions of the training data and average the results.
  • Classification: Form a committee of trees, where each tree contributes a vote for the predicted class (a short code sketch follows this list).
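
Both recipes are also available through scikit-learn's bagging wrappers; the sketch below assumes scikit-learn is installed and uses synthetic data purely for illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Regression: average B = 100 trees, each fit to a bootstrap sample
Xr, yr = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0).fit(Xr, yr)

# Classification: a committee of B = 100 trees whose predictions are combined at each x
Xc, yc = make_classification(n_samples=300, n_features=5, random_state=0)
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0).fit(Xc, yc)

print(reg.predict(Xr[:3]), clf.predict(Xc[:3]))
```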

4.3 Bagging Effect: Caution with Poor Classifiers

Bagging can improve a good classifier, but it can also make a bad classifier worse.

4.3.1 Simple Example

Consider a scenario where \(Y=1\) for all \(x\).

If a classifier \(\hat{G}(x)\) predicts \(Y=1\) with a probability of 0.4 and \(Y=0\) with a probability of 0.6 for all \(x\), the misclassification error of \(\hat{G}(x)\) is 0.6.

However, the bagged classifier's error tends to 1.0 as \(B\) grows (simulated below).

Classes: \(k=0,1\)

\[ \begin{aligned} \hat{G}_{\mathrm{bag}}(x) & =\arg\max_{k}\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\\ & =\arg\max_{k}\left(\begin{array}{c}\frac{1}{B}\sum_{b=1}^{B}I\left(G_{b}^{*}(x)=0\right)\\\frac{1}{B}\sum_{b=1}^{B}I\left(G_{b}^{*}(x)=1\right)\end{array}\right)\\ & \rightarrow\arg\max_{k}\left(\begin{array}{c}0.6\\0.4\end{array}\right)\\ & =0 \end{aligned}\]

  • Misclassification error rate: \(e=0.6\)
  • \(S_1(x)=\sum_{b=1}^B I\left(G_b^*(x)=1\right)\sim \operatorname{Bin}(B, 1-e)\)
  • \(S_1(x) \rightarrow E(S_1(x))=B(1-e)=0.4 B < B/2\)
  • \(\operatorname{Pr}\left(S_1 < B/2\right) \rightarrow 1\) as \(B\) gets large.
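
A small simulation makes the point concrete; it assumes only NumPy and treats the \(B\) votes as independent Bernoulli draws, exactly as in the binomial argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
e, B, n_points = 0.6, 101, 10_000   # per-classifier error, number of votes, query points

# Each of the B classifiers independently predicts class 1 with probability 1 - e = 0.4,
# so the number of votes for class 1 is S_1(x) ~ Bin(B, 1 - e).
S1 = rng.binomial(B, 1 - e, size=n_points)
bagged_pred = (S1 > B / 2).astype(int)          # majority vote G_bag(x)

# The truth is Y = 1 for every x
print("single-classifier error:", e)                          # 0.6
print("bagged-classifier error:", (bagged_pred != 1).mean())  # ~ 1.0
```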

4.3.2 Why Bagged Trees Are Not a Consensus of Independent Weak Learners

  • Bagged trees are not independent: they are grown on overlapping bootstrap samples of the same training data, so their predictions are correlated.
  • The "Wisdom of Crowds" principle suggests that the collective knowledge of a diverse and independent group surpasses that of any individual. This concept is effectively harnessed through voting.
  • Random forests address this limitation by diminishing the correlation among sampled trees.

4.3.3 Illustration of Combining Weak Learners

Consider a two-class example where the Bayes optimal decision at \(x\) is \(G(x)=1\).

  • Each weak learner \(G_b^*\) has an error-rate \(e_b=e<0.5\).
  • Let \(S_1(x)=\sum_{b=1}^B I\left(G_b^*(x)=1\right)\) represent the consensus vote for class 1.
  • Assuming independence among weak learners, \(S_1(x) \sim \operatorname{Bin}(B, 1-e)\).
  • As \(B\) grows, \(\operatorname{Pr}\left(S_1 > B/2\right) \rightarrow 1\) (simulated below).
    • By the Law of Large Numbers, \[ S_1(x) \rightarrow E(S_1(x))=B(1-e) > B/2 \]
    • since \(e < 0.5\).
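
The same binomial argument can be checked numerically; the sketch below assumes only NumPy and independent weak learners with error \(e = 0.45\), an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
e, n_trials = 0.45, 20_000   # each weak learner is only slightly better than chance

for B in (1, 11, 101, 1001):
    S1 = rng.binomial(B, 1 - e, size=n_trials)   # S_1(x) ~ Bin(B, 1 - e)
    print(B, (S1 > B / 2).mean())                # fraction of trials where the majority vote is correct
```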

4.4 Bagging's Impact on Model Structure

  • Bagging can alter the simple structure of a model; for instance, a bagged tree ceases to resemble a tree.
  • More stable procedures like nearest neighbors are typically less affected by bagging.
