Boosting 1 - AdaBoost.M1 Overview

A series of blog posts on Boosting Methods

1 Introduction

Boosting is a powerful machine learning technique that transforms a collection of "weak" classifiers into a strong, unified model.

A "weak" classifier performs only slightly better than random guessing, but by combining their outputs, boosting creates a "committee" of classifiers that collectively produce highly accurate predictions.

At its core, boosting improves the performance of weak classifiers by iteratively re-fitting them, each round focusing more heavily on the data points that are hardest to classify.

2 AdaBoost.M1

AdaBoost.M1 is a prominent implementation of the boosting technique for binary classification.

It adjusts weights to emphasize misclassified examples, effectively combining weak classifiers into a robust model.

While initially designed for two-class problems, AdaBoost.M1 can also be extended to multi-class tasks.

2.1 Two-class Problem

  • Output variable \(Y \in\{-1,1\}\)
  • Predictor variables \(X\)
  • Classifier \(G(X)\) produces a prediction in \(\{-1,1\}\).

2.2 Weighted Error Rate

The typical error rate on the training sample is \[ \overline{\operatorname{err}}=\frac{1}{N} \sum_{i=1}^{N} I\left(y_{i} \neq G\left(x_{i}\right)\right) \]

The expected error rate on future predictions is \(\mathrm{E}_{XY}\, I(Y \neq G(X))\).

In AdaBoost, each training observation carries a weight \(w_i\), and classifiers are evaluated by the weighted error rate below; increasing the weights of hard-to-classify examples guides subsequent classifiers to focus on them. \[ \operatorname{err}_{m}=\frac{\sum_{i=1}^{N} w_{i} I\left(y_{i} \neq G_{m}\left(x_{i}\right)\right)}{\sum_{i=1}^{N} w_{i}} \]
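As a minimal illustration of the difference, the two error rates can be computed with NumPy as follows (the arrays y_toy, preds, and w are made-up toy values, not taken from the dataset used later):

import numpy as np

y_toy = np.array([1, 1, -1, -1, 1])          # true labels
preds = np.array([1, -1, -1, 1, 1])          # predictions from a weak classifier G_m
w = np.array([0.1, 0.3, 0.2, 0.3, 0.1])      # observation weights

err_plain = np.mean(preds != y_toy)                      # unweighted error rate: 0.4
err_weighted = np.sum(w * (preds != y_toy)) / np.sum(w)  # weighted error rate err_m: 0.6

print(err_plain, err_weighted)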

2.3 Weighted Majority Vote Classifier (Prediction)

\[G(x)=\operatorname{sign}\left( \sum_{m=1}^{M} \alpha_{m} G_{m}(x)\right)\]

where \(\alpha_1,\ldots,\alpha_M>0\) are weights computed by the boosting algorithm; they control the contribution of each corresponding \(G_{m}(x)\).
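In code, this weighted majority vote could look like the following minimal sketch (clfs and alphas are assumed names, not part of any library; clfs is a list of fitted classifiers that output \(\{-1,1\}\) labels and alphas their corresponding weights):

import numpy as np

def weighted_majority_vote(clfs, alphas, X):
    # sum of alpha_m * G_m(x) over the committee, then take the sign
    votes = sum(alpha * clf.predict(X) for clf, alpha in zip(clfs, alphas))
    return np.sign(votes)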

2.4 Key Idea

  • Observation weights: observations misclassified at step \(m\) \((y_i \neq G_m(x_i))\) have their weights \(w_i\) increased for the next step by the factor \(\exp(\alpha_m)>1\) (assuming \(\operatorname{err}_m < 1/2\)): \[ w_i \leftarrow w_i \cdot \exp\left[\alpha_m \cdot I(y_i \neq G_m(x_i))\right] \]
  • Classifier weights: give higher influence to the more accurate classifiers (a short numerical example follows this list): \[\alpha_{m}=\log \left(\left(1-\operatorname{err}_{m}\right) / \operatorname{err}_{m}\right)\]
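As a quick numerical check: if \(\operatorname{err}_m = 0.3\), then \[ \alpha_m = \log\left(\frac{1-0.3}{0.3}\right) \approx 0.847, \qquad \exp(\alpha_m) \approx 2.33, \] so observations misclassified by \(G_m\) have their weights roughly multiplied by 2.33 before the next round, while correctly classified observations keep their current weights.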

3 Algorithm AdaBoost.M1

  1. Initialize the observation weights \(w_{i}=1 / N, i=1,2, \ldots, N\)

  2. For \(m=1\) to \(M\):

    1. Fit a classifier \(G_{m}(x)\) to the training data using weights \(w_{i}\)

    2. Compute weighted error rate

      \[ \operatorname{err}_{m}=\frac{\sum_{i=1}^{N} w_{i} I\left(y_{i} \neq G_{m}\left(x_{i}\right)\right)}{\sum_{i=1}^{N} w_{i}} \]

    3. Compute classifier weight \(\alpha_{m}=\log \left(\left(1-\operatorname{err}_{m}\right) / \operatorname{err}_{m}\right)\).

      • a larger error yields a smaller \(\alpha_m\)
      • this gives higher influence to the more accurate classifiers
    4. Set \(w_{i} \leftarrow w_{i} \cdot \exp \left[\alpha_{m} \cdot I\left(y_{i} \neq G_{m}\left(x_{i}\right)\right)\right], i=1,2, \ldots, N\)

      • For misclassified observations, the weights increase, since \(\exp(\alpha_m) > 1\) when \(\operatorname{err}_m < 1/2\)
      • Observations misclassified by \(G_m(x)\) have their weights scaled by the factor \(\exp\left(\alpha_m\right)\)
  3. Output

    \[ G(x)=\operatorname{sign}\left[\sum_{m=1}^{M} \alpha_{m} G_{m}(x)\right]. \]
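Before the full implementations in the next section, the following minimal sketch traces a single boosting round on tiny made-up data, mapping each numbered step above to code (all variable names are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data; step 1: initialize the observation weights uniformly
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, 1, -1, -1])
w = np.ones(len(y)) / len(y)

# Step 2.1: fit a weak classifier (a decision stump) using the weights
stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
pred = stump.predict(X)

# Step 2.2: weighted error rate
err = np.sum(w * (pred != y)) / np.sum(w)

# Step 2.3: classifier weight
alpha = np.log((1 - err) / err)

# Step 2.4: scale up the weights of misclassified observations
w = w * np.exp(alpha * (pred != y))

print(err, alpha, w)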

4 Code

4.1 Code from Package

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1) Low-dimensional dataset (case 1 in the results below)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=6)

# 2) High-dimensional dataset with redundant features (case 2 in the results below)
# X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)

y = np.where(y == 0, -1, 1)  # change labels from {0, 1} to {-1, 1}, matching Section 2.1


# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


simple_tree = DecisionTreeClassifier(max_depth=1)
# Note: 'SAMME.R' has been deprecated/removed in recent scikit-learn releases;
# on newer versions, drop the algorithm argument (or pass algorithm='SAMME').
ada_boost = AdaBoostClassifier(simple_tree, n_estimators=50, algorithm='SAMME.R', random_state=42)

# Train models
ada_boost.fit(X_train, y_train)
simple_tree.fit(X_train, y_train)

# Test models
y_pred_ada = ada_boost.predict(X_test)
y_pred_tree = simple_tree.predict(X_test)

# Calculate and print the accuracy and error
accuracy_ada = accuracy_score(y_test, y_pred_ada)
error_rate_ada = 1 - accuracy_ada
accuracy_tree = accuracy_score(y_test, y_pred_tree)
error_rate_tree = 1 - accuracy_tree

print(f'AdaBoost Accuracy: {accuracy_ada:.2f}')
print(f'AdaBoost Error Rate: {error_rate_ada:.2f}')
print(f'Decision Tree Accuracy: {accuracy_tree:.2f}')
print(f'Decision Tree Error Rate: {error_rate_tree:.2f}')

  1. When the feature dimension is low (n_features=2, n_informative=2, n_redundant=0, random_state=6), AdaBoost shows slightly better accuracy: its decision boundary correctly classifies more samples (see the plotting sketch after this list).

    AdaBoost Accuracy: 0.91
    AdaBoost Error Rate: 0.09
    Decision Tree Accuracy: 0.89
    Decision Tree Error Rate: 0.11

  2. When the feature dimension is high, especially with redundant features (n_features=20, n_informative=15, n_redundant=5), AdaBoost shows a significant advantage.

    AdaBoost Accuracy: 0.82
    AdaBoost Error Rate: 0.18
    Decision Tree Accuracy: 0.66
    Decision Tree Error Rate: 0.34
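To see the decision boundaries mentioned in case 1, a small plotting sketch such as the following could be added after the code above (this assumes the 2-feature version of X is active; the grid resolution and figure size are arbitrary choices):

# Plot decision boundaries of AdaBoost vs. a single stump on the 2-feature data
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, model, title in zip(axes, [ada_boost, simple_tree], ['AdaBoost', 'Decision stump']):
    Z = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)          # predicted class regions
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=15, edgecolor='k')
    ax.set_title(title)
plt.show()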

4.2 Code from Scratch

class AdaBoost:
    def __init__(self, n_clf=50):
        self.n_clf = n_clf

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize the observation weights uniformly
        w = np.ones(n_samples) / n_samples
        self.clfs = []
        self.alpha = []

        for _ in range(self.n_clf):
            # Fit a decision stump using the current weights
            clf = DecisionTreeClassifier(max_depth=1)
            clf.fit(X, y, sample_weight=w)
            predictions = clf.predict(X)

            # Weighted error rate
            error = w @ (predictions != y) / np.sum(w)

            # A classifier with error > 0.5 simply gets a negative alpha below,
            # which flips its vote in the ensemble and updates the weights as if
            # the reversed classifier had been used, so no explicit flip is needed.
            #
            # alpha here is half the alpha_m of Section 3; the 0.5 scaling keeps
            # the updates smaller and, combined with the margin-based weight
            # update below, leaves the sign-based ensemble prediction unchanged.
            alpha = 0.5 * np.log((1 - error) / error)

            self.alpha.append(alpha)
            self.clfs.append(clf)

            # Margin form of the weight update; up to the normalization below it
            # matches w_i <- w_i * exp(alpha_m * I(y_i != G_m(x_i))) from Section 3
            w *= np.exp(-alpha * y * predictions)
            w /= np.sum(w)  # normalize so the weights sum to 1

    def predict(self, X):
        # Weighted majority vote: sign of the alpha-weighted sum of predictions
        clf_preds = np.array([alpha * clf.predict(X) for clf, alpha in zip(self.clfs, self.alpha)])
        y_pred = np.sign(np.sum(clf_preds, axis=0))
        return y_pred



# Reuse the train/test split from Section 4.1 (2-feature dataset)
ada_boost = AdaBoost(n_clf=50)
simple_tree = DecisionTreeClassifier(max_depth=1)

# train model
ada_boost.fit(X_train, y_train)
simple_tree.fit(X_train, y_train)

# test model
y_pred_ada = ada_boost.predict(X_test)
y_pred_tree = simple_tree.predict(X_test)

accuracy_ada = accuracy_score(y_test, y_pred_ada)
error_rate_ada = 1 - accuracy_ada
accuracy_tree = accuracy_score(y_test, y_pred_tree)
error_rate_tree = 1 - accuracy_tree

print(f'AdaBoost Accuracy: {accuracy_ada:.2f}')
print(f'AdaBoost Error Rate: {error_rate_ada:.2f}')
print(f'Decision Tree Accuracy: {accuracy_tree:.2f}')
print(f'Decision Tree Error Rate: {error_rate_tree:.2f}')

# AdaBoost Accuracy: 0.91
# AdaBoost Error Rate: 0.09
# Decision Tree Accuracy: 0.89
# Decision Tree Error Rate: 0.11

5 Reference

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. https://doi.org/10.1007/978-0-387-84858-7