Understanding Bootstrap Aggregation (Bagging) in Machine Learning
1 Introduction
Bagging, short for Bootstrap Aggregation, is a powerful ensemble learning method that reduces prediction variance by averaging models fit to bootstrap samples of the training data.
This technique is especially effective when the base model is prone to high variance or overfitting.
2 Objective
- Variance Reduction Through Averaging: Bagging strategically averages predictions over bootstrap samples, effectively curbing variance.
- Leveraging Statistical Connections: Beyond simple averaging, the bootstrap mean is closely related to a posterior average, which gives bagging a Bayesian interpretation.
3 Examples
3.1 Regression problem
Training data \(\mathbf{Z}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}\)
We fit a model to the training data and obtain the prediction \(\hat{f}(x)\) at input \(x\).
For each bootstrap sample \(\mathbf{Z}^{* b}, b=1,2, \ldots, B\), we fit our model, giving prediction \(\hat{f}^{* b}(x)\).
- \(\mathbf{Z}^{*}=\left\{\left(x_{1}^{*}, y_{1}^{*}\right),\left(x_{2}^{*}, y_{2}^{*}\right), \ldots,\left(x_{N}^{*}, y_{N}^{*}\right)\right\}\)
- \(\left(x_i^*, y_i^*\right) \sim \hat{\mathcal{P}}\) where \(\hat{\mathcal{P}}\) is the empirical distribution putting equal probability \(1/N\) on each of the data points \((x_i, y_i)\).
The bagging estimate is \[ \hat{f}_{\mathrm{bag}}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat{f}^{* b}(x) \]
\(\hat{f}_{\mathrm{bag}}(x)\) is a Monte Carlo estimate of the true bagging estimate \(\mathrm{E}_{\hat{\mathcal{P}}} \hat{f}^{*}(x)\), approaching it as \(B \rightarrow \infty\).
The bagged estimate will differ from the original estimate \(\hat{f}(x)\) only when the latter is a nonlinear or adaptive function of the data.
- For example, a B-spline smoother is linear in the data if we fix the inputs; hence, if we sample using the parametric bootstrap, \[ \hat{f}_{\mathrm{bag}}(x) \rightarrow \hat{f}(x) \text { as } B \rightarrow \infty \]
For a regression tree, \(\hat{f}(x)\) denotes the tree's prediction at input vector \(x\), and the bagged estimate is the average prediction at \(x\) over the \(B\) trees.
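The loop below sketches this procedure in Python. It is a minimal illustration rather than code from the text: the synthetic sine data and the use of scikit-learn's DecisionTreeRegressor as the base learner are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy training data Z = {(x_i, y_i)}: a noisy sine curve (assumed for illustration).
rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=N)

def bagged_tree_predict(X_train, y_train, X_test, B=100, seed=0):
    """Average the predictions of B regression trees, each fit to a bootstrap sample Z^{*b}."""
    rng = np.random.default_rng(seed)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        # Draw Z^{*b}: N indices sampled with replacement, i.e. from the empirical distribution P-hat.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        preds[b] = DecisionTreeRegressor().fit(X_train[idx], y_train[idx]).predict(X_test)
    # Monte Carlo estimate of E_{P-hat} f*(x), i.e. the bagging estimate f_bag(x).
    return preds.mean(axis=0)

X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
f_bag = bagged_tree_predict(X, y, X_new, B=100)
```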
3.2 K-class classification
Consider \(\hat{f}(x)\), an indicator-vector function whose value is a single one and \(K-1\) zeroes, with classifier \[ \hat{G}(x)=\arg \max _k \hat{f}(x) \] Then the bagged estimate \(\hat{f}_{\mathrm{bag}}(x)\) is a \(K\)-vector \(\left[p_1(x), p_2(x), \ldots, p_K(x)\right]\), with \(p_k(x)\) equal to the proportion of trees predicting class \(k\) at \(x\). The bagged classifier selects the class with the most "votes" from the \(B\) trees: \[ \hat{G}_{\mathrm{bag}}(x)=\arg \max _k \hat{f}_{\mathrm{bag}}(x)=\arg \max _k \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{* b}(x) \]
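A corresponding sketch of the voting scheme follows. It assumes integer class labels \(0, \ldots, K-1\) and scikit-learn's DecisionTreeClassifier as the base learner; neither choice comes from the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_classify(X_train, y_train, X_test, K, B=100, seed=0):
    """Return the vote proportions p_k(x) and the majority-vote class G_bag(x) at each test point."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_test), K))
    for b in range(B):
        # Fit one tree to a bootstrap sample and record its indicator vector at each x.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        pred = DecisionTreeClassifier().fit(X_train[idx], y_train[idx]).predict(X_test)
        votes[np.arange(len(X_test)), pred] += 1
    p = votes / B                       # f_bag(x) = [p_1(x), ..., p_K(x)]
    return p, p.argmax(axis=1)          # G_bag(x) = argmax_k p_k(x)
```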
4 Properties
4.1 Bagging will often decrease mean-squared error
- Consider the optimal bagging estimator, \(f_{\mathrm{ag}}(x) = \mathrm{E}_{\mathcal{P}} \hat{f}^*(x)\), where bootstrap samples are drawn from the actual population \(\mathcal{P}\) rather than the dataset.
- \(\hat{f}^{*}(x)\) represents predictions derived from bootstrap samples drawn from \(\mathcal{P}\). \[ \begin{aligned}\mathrm{E}_{\mathcal{P}}\left[Y-\hat{f}^*(x)\right]^2 & =\mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)+f_{\mathrm{ag}}(x)-\hat{f}^*(x)\right]^2 \\& =\mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)\right]^2+\mathrm{E}_{\mathcal{P}}\left[\hat{f}^*(x)-f_{\mathrm{ag}}(x)\right]^2 \\& \geq \mathrm{E}_{\mathcal{P}}\left[Y-f_{\mathrm{ag}}(x)\right]^2 .\end{aligned} \]
- The cross term vanishes because \(\mathrm{E}_{\mathcal{P}} \hat{f}^{*}(x)=f_{\mathrm{ag}}(x)\); the extra term on the right-hand side is the variance of \(\hat{f}^{*}(x)\) around its mean \(f_{\mathrm{ag}}(x)\).
- Consequently, true population aggregation never increases mean squared error. Bagging only approximates this ideal by drawing from \(\hat{\mathcal{P}}\) rather than \(\mathcal{P}\), but in practice it often reduces the error substantially, as the simulation below illustrates.
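The short simulation below illustrates this effect empirically. It assumes a noisy-sine data-generating process and fully grown regression trees as the high-variance base learner; both are illustrative choices, not part of the argument above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
f_true = np.sin(X_test[:, 0])           # assumed true regression function

single_mse, bagged_mse = [], []
for rep in range(20):                   # repeat over independent training sets
    X = rng.uniform(-3, 3, size=(100, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=100)
    # A single deep tree: low bias, high variance.
    single = DecisionTreeRegressor().fit(X, y).predict(X_test)
    # Bagged version: average B trees fit to bootstrap samples of the same data.
    preds = []
    for b in range(50):
        idx = rng.integers(0, len(X), size=len(X))
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X_test))
    bagged = np.mean(preds, axis=0)
    single_mse.append(np.mean((single - f_true) ** 2))
    bagged_mse.append(np.mean((bagged - f_true) ** 2))

print(f"single tree MSE: {np.mean(single_mse):.3f}, bagged MSE: {np.mean(bagged_mse):.3f}")
```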
4.2 Bagging's Effectiveness with High-Variance, Low-Bias Models
Bagging demonstrates exceptional efficacy, particularly with high-variance, low-bias procedures, such as trees.
- Regression: Repeatedly fit the same regression tree to bootstrapped versions of the training data and average the results.
- Classification: Form a committee of trees where each tree contributes a vote for the predicted class.
4.3 Bagging effect: Caution with Poor Classifiers
Bagging can make a good classifier better, but it can make a bad classifier worse.
4.3.1 Simple Example
Consider a scenario where \(Y=1\) for all \(x\).
If a classifier \(\hat{G}(x)\) predicts \(Y=1\) with a probability of 0.4 and \(Y=0\) with a probability of 0.6 for all \(x\), the misclassification error of \(\hat{G}(x)\) is 0.6.
However, the bagged classifier's error is 1.0.
Classes: \(k=0,1\)
\[ \begin{aligned} \hat{G}_{\mathrm{bag}}(x) & =\arg\max_{k}\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\\ & =\arg\max_{k}\left(\begin{array}{c}\frac{1}{B}\sum_{b=1}^{B}I\left(G_{b}^{*}(x)=0\right)\\\frac{1}{B}\sum_{b=1}^{B}I\left(G_{b}^{*}(x)=1\right)\end{array}\right)\\ & \rightarrow\arg\max_{k}\left(\begin{array}{c}0.6\\0.4\end{array}\right)\\ & =0 \end{aligned}\]
- Misclassification error rate: \(e=0.6\)
- \(S_1(x)=\sum_{b=1}^B I\left(G_b^*(x)=1\right)\sim \operatorname{Bin}(B, 1-e)\)
- By the Law of Large Numbers, \(S_1(x)/B \rightarrow 1-e = 0.4 < 1/2\).
- Hence \(\operatorname{Pr}\left(S_1(x)<B/2\right) \rightarrow 1\) as \(B\) gets large, so the bagged classifier predicts class 0 everywhere and its misclassification error tends to 1; the numerical check below makes this concrete.
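This limit is easy to verify from the binomial model above. The snippet assumes SciPy is available and uses odd \(B\) so the vote cannot tie.

```python
from scipy.stats import binom

e = 0.6                                  # base classifier predicts class 1 with probability 1 - e = 0.4
for B in (11, 51, 101, 501, 1001):       # odd B, so the vote cannot tie
    p_correct = binom.sf(B // 2, B, 1 - e)   # Pr(S_1 > B/2): bagged classifier predicts class 1
    print(B, 1 - p_correct)              # misclassification error of the bagged classifier -> 1
```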
4.3.2 Why Bagged Trees Are Not a Consensus of Independent Weak Learners
- Bagged trees lack independence.
- The "Wisdom of Crowds" principle suggests that the collective knowledge of a diverse and independent group surpasses that of any individual. This concept is effectively harnessed through voting.
- Random forests address this limitation by diminishing the correlation among sampled trees.
4.3.3 Illustration of Combining Weak Learners
Consider a two-class example where the Bayes optimal decision at \(x\) is \(G(x)=1\).
- Each weak learner \(G_b^*\) has error rate \(e_b=e<0.5\).
- Let \(S_1(x)=\sum_{b=1}^B I\left(G_b^*(x)=1\right)\) represent the consensus vote for class 1.
- Assuming independence among weak learners, \(S_1(x) \sim \operatorname{Bin}(B, 1-e)\).
- As \(B\) grows, \(\operatorname{Pr}\left(S_1(x)>B/2\right) \rightarrow 1\).
- By the Law of Large Numbers, \(S_1(x)/B \rightarrow 1-e > 1/2\), since \(e<0.5\); the consensus vote therefore picks the correct class with probability tending to one, as the computation below illustrates.
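The same binomial computation as before, now with an assumed error rate \(e=0.3\), shows the consensus probability tending to one.

```python
from scipy.stats import binom

e = 0.3                              # each independent weak learner errs with probability e < 0.5
for B in (11, 51, 101, 501):         # odd B avoids ties
    print(B, binom.sf(B // 2, B, 1 - e))   # Pr(S_1 > B/2) -> 1 as B grows
```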
4.4 Bagging's Impact on Model Structure
- Bagging can alter the simple structure of a model; for instance, a bagged tree ceases to resemble a tree.
- More stable procedures like nearest neighbors are typically less affected by bagging.