Table of Contents
- Simple explanation of AdaBoost
- Step-by-step understanding of how AdaBoost works
- Number of weak learners required
- Bias and variance tradeoff in AdaBoost
- Parameter optimization in AdaBoost
- Feature importance in AdaBoost
SIMPLE EXPLANATION OF ADABOOST
AdaBoost combines an ensemble of weak learners to create a strong learner.
Weak learners are models that achieve accuracy just above random chance on a classification problem.
The most suitable, and therefore most common, base learners used with AdaBoost are decision trees with one level. Because these trees are so short and contain only one decision for classification, they are often called decision stumps.
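To make the idea of a decision stump concrete, here is a minimal sketch in plain Python for a single numeric feature. The helper names fit_stump and stump_predict and the toy data are assumptions made for illustration; they are not part of any library.

```python
def fit_stump(xs, ys):
    """Find the threshold and polarity that misclassify the fewest points.

    xs: list of numeric feature values; ys: list of labels in {-1, 1}.
    Hypothetical helper, for illustration only.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            # The single decision: predict `polarity` at or below the
            # threshold and the opposite class above it.
            preds = [polarity if x <= threshold else -polarity for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    return best[1], best[2]


def stump_predict(x, threshold, polarity):
    """Apply the one-level decision to a single value."""
    return polarity if x <= threshold else -polarity


# Toy data (assumed): small values are class 1, large values are class -1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, 1, -1, -1, -1]

threshold, polarity = fit_stump(xs, ys)
predictions = [stump_predict(x, threshold, polarity) for x in xs]
print(threshold, polarity, predictions)
```

On real data a single stump is rarely this accurate, which is exactly why AdaBoost combines many of them.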
AdaBoost is a sequential learner. You first run all of your training data through a weak learner and try to classify it.
Then, in the next iteration, you give more weight to the training examples that were classified incorrectly, so that the next weak learner does a better job of predicting those examples.
We iterate M times, each time fitting the base learner to the training set with the updated weights.
The final prediction is a weighted sum of the predictions of the M learners.
STEP BY STEP UNDERSTANDING OF HOW ADABOOST WORKS
Let’s start with a data set that has d dimensions (features) and whose output takes one of two classes, -1 and 1.
1) Initialize the weights for the first weak learner: when creating the first decision tree model, we treat all the training examples as equal and set each weight to 1/N, where N is the number of training examples.
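2) Error of the weak learner: after we train the weak learner, we calculate its classification error.
This is calculated from the formula:
error = (N - correct) / N
where correct is the number of correctly classified training examples and N is the total number of training examples.
Since this is the first weak learner that we are training, let's call its error e1.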
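The first two steps can be sketched in a few lines of plain Python. The function names and the toy labels below are assumptions for illustration; the error is simply the fraction of training examples the weak learner gets wrong.

```python
def initial_weights(n_examples):
    """Every training example starts with the same weight, 1/N."""
    return [1.0 / n_examples] * n_examples


def error_rate(y_true, y_pred):
    """Fraction of examples that were misclassified."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Toy labels (assumed): the weak learner gets one of five examples wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]

weights = initial_weights(len(y_true))
e1 = error_rate(y_true, y_pred)
print(weights, e1)
```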
3) Calculate the weight for the m_th weak classifier: the final prediction we make from AdaBoost is a weighted combination of the predictions from the different weak learners.
The weight each classifier gets in the final result, which we call theta_m, is based on its error.
If the accuracy of the classifier is higher than 50%, the weight is positive. The higher the accuracy, the larger the weight given to the classifier.
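The text above does not spell out how the classifier weight is computed. A common choice in the binary (discrete) AdaBoost formulation is theta_m = 0.5 * ln((1 - e_m) / e_m), where e_m is the error of the m_th weak learner. The sketch below assumes that formula; the function name and the small clipping constant eps are illustrative additions to avoid dividing by zero.

```python
import math


def classifier_weight(error, eps=1e-10):
    """Weight of a weak classifier, assuming theta = 0.5 * ln((1 - e) / e)."""
    # Clip the error away from 0 and 1 so the logarithm stays finite.
    error = min(max(error, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - error) / error)


for e in (0.1, 0.2, 0.5, 0.7):
    print(e, round(classifier_weight(e), 4))
```

With this formula, an error of exactly 50% gives a weight of zero, errors below 50% give a positive weight that grows as the error shrinks, and errors above 50% give a negative weight.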
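4) Update the weights of the training examples:
As discussed above, AdaBoost is a sequential model: the next weak learner we train will try to do a better job on the training examples the previous weak learner got wrong.
In each iteration, update the weight for each data point as:
w_(m+1)(i) = w_m(i) * exp(-theta_m * y(i) * f_m(x(i))) / Z_m
where y(i) is the true label of example i, f_m(x(i)) is the prediction of the m_th weak learner for that example, theta_m is the weight of the m_th weak classifier, and Z_m is a normalization factor that ensures the sum of all instance weights is equal to 1.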
For a misclassified example, the “exp” term in the numerator is greater than 1 (y*f is always -1 for a misclassified example, and theta_m is positive).
So the misclassified example is given a higher weight in the next iteration.
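Here is a small sketch of this reweighting step, assuming the usual update in which each weight is multiplied by exp(-theta_m * y * f) and then divided by the normalization factor Z_m. The function name and the toy numbers are illustrative assumptions.

```python
import math


def update_weights(weights, y_true, y_pred, theta):
    """Reweight examples: misclassified ones (y * f == -1) gain weight."""
    unnormalized = [
        w * math.exp(-theta * y * f)
        for w, y, f in zip(weights, y_true, y_pred)
    ]
    z = sum(unnormalized)  # Z_m: makes the new weights sum to 1
    return [w / z for w in unnormalized]


# Toy setup (assumed): five equally weighted examples, the second is wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
theta = 0.5 * math.log((1 - 0.2) / 0.2)  # classifier weight for error 0.2

new_weights = update_weights(weights, y_true, y_pred, theta)
print([round(w, 3) for w in new_weights])
```

In this toy case the single misclassified example ends up carrying half of the total weight, so the next weak learner is strongly pushed to get it right.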
After M iterations, we get the final prediction by summing up the weighted predictions of all the classifiers.
5) Using the newly weighted training data set, we train a new weak learner.
Its error is now computed using the weights of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
This is the weighted misclassification rate, where w(i) is the weight of training instance i and terror(i) is the prediction error for training instance i, which is 1 if it is misclassified and 0 if it is correctly classified.
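The weighted error can be written directly from the formula above. This is a minimal sketch; the function name and the toy weights are assumptions for illustration.

```python
def weighted_error(weights, y_true, y_pred):
    """error = sum(w(i) * terror(i)) / sum(w)."""
    # terror(i) is 1 for a misclassified instance and 0 otherwise.
    terror = [1 if t != p else 0 for t, p in zip(y_true, y_pred)]
    return sum(w * e for w, e in zip(weights, terror)) / sum(weights)


# Toy weights (assumed): the second example carries half of the total weight.
weights = [0.125, 0.5, 0.125, 0.125, 0.125]
y_true = [1, 1, -1, -1, 1]

# Getting the heavy example wrong is costly...
err_heavy = weighted_error(weights, y_true, [1, -1, -1, -1, 1])
# ...while getting a light example wrong costs much less.
err_light = weighted_error(weights, y_true, [-1, 1, -1, -1, 1])
print(err_heavy, err_light)
```

Both predictions make exactly one mistake, but the weighted error differs because the mistakes fall on examples with different weights.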
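6) Repeat the error calculation, the classifier weight calculation, the weight update and the training of the next weak learner until M weak learners have been trained.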
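We sum together all the weak learners using the following formula in AdaBoost:
F(x) = sum over m = 1..M of theta_m * f_m(x)
The final equation for classification can be represented as:
prediction(x) = sign(F(x)) = sign(sum over m = 1..M of theta_m * f_m(x))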
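Putting the steps together, here is a compact sketch of the whole procedure in plain Python, using weighted decision stumps on a single feature. It assumes the common binary formulation (classifier weight 0.5 * ln((1 - e) / e), exponential reweighting, prediction by the sign of the weighted sum). All function names and the toy data are illustrative assumptions, not a reference implementation.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def stump_predict(x, threshold, polarity):
    return polarity if x <= threshold else -polarity


def adaboost_fit(xs, ys, n_rounds):
    """Train n_rounds stumps; returns a list of (theta, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        theta = 0.5 * math.log((1.0 - err) / err)  # classifier weight
        ensemble.append((theta, threshold, polarity))
        # Increase the weight of misclassified examples, then normalize.
        preds = [stump_predict(x, threshold, polarity) for x in xs]
        weights = [w * math.exp(-theta * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the weighted sum of the weak learners' predictions."""
    score = sum(theta * stump_predict(x, t, p) for theta, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data (assumed): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 2, 3):
    model = adaboost_fit(xs, ys, rounds)
    correct = sum(adaboost_predict(model, x) == y for x, y in zip(xs, ys))
    print(rounds, "rounds:", correct, "of", len(xs), "correct")
```

No single stump can classify this toy data perfectly, but a weighted vote of a few stumps can, which is the core idea of boosting.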
NUMBER OF WEAK LEARNERS REQUIRED
The number of weak learners required in the ensemble depends on the error term of each weak learner.
If the error of each weak learner is close to 50%, a larger number of weak learners is required, because each one improves only slightly on random guessing.
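One rough way to quantify this uses the classical training-error bound for binary AdaBoost: after M rounds the training error is at most the product of 2 * sqrt(e_m * (1 - e_m)) over the rounds. The sketch below makes the simplifying assumption that every weak learner has the same weighted error, and asks how many rounds are needed before the bound drops below a target. The function name and the target value are illustrative, and the result concerns the bound on training error, not test error.

```python
import math


def rounds_needed(error, target=0.01):
    """Rounds until the training-error bound falls below `target`.

    Assumes every weak learner has the same weighted error (< 0.5).
    """
    factor = 2.0 * math.sqrt(error * (1.0 - error))  # per-round shrink factor
    return math.ceil(math.log(target) / math.log(factor))


for e in (0.30, 0.40, 0.45, 0.49):
    print(e, rounds_needed(e))
```

The closer the error is to 0.5, the closer the per-round factor is to 1, so the number of rounds grows very quickly.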
BIAS AND VARIANCE TRADEOFF IN ADABOOST
AdaBoost is a collection of many weak learners, each of which has high bias. By combining a large number of weak learners, AdaBoost tries to reduce the bias.
PARAMETER OPTIMIZATION IN ADABOOST (FROM Sklearn)
These are the important parameters that you need to take care of when tuning an AdaBoost model.
base_estimator : object, optional (default=DecisionTreeClassifier; this parameter is named estimator from scikit-learn 1.2 onward)
The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.
There is no a-priori best answer. You need to grid search to determine the tree depth.
n_estimators : integer, optional (default=50)
The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
Train many, many weak learners. Then look at a test-error vs. number of estimators curve to find the optimal number.
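A sketch of this error-versus-estimators curve with scikit-learn is shown below. It relies on AdaBoostClassifier.staged_score, which yields the score after each boosting iteration. The synthetic data set, the split and the value of 200 estimators are assumptions for illustration, and the default base estimator (a decision stump) is used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumed), split into a training and a held-out set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train more weak learners than we expect to need.
model = AdaBoostClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out error after 1, 2, ..., n weak learners.
test_errors = [1.0 - s for s in model.staged_score(X_test, y_test)]

# Number of estimators with the lowest held-out error (first one on ties).
best_n = min(range(len(test_errors)), key=test_errors.__getitem__) + 1
print(best_n, test_errors[best_n - 1])
```

Plotting test_errors against the number of estimators gives the curve described above. In practice it is safer to pick the number on a validation set and keep the test set for the final check.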
learning_rate : float, optional (default=1.)
Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.
- learning_rate is the contribution of each model to the weights and defaults to 1. Reducing the learning rate means the weights are increased or decreased by a smaller amount, forcing the model to train more slowly (but sometimes resulting in better performance scores).
Smaller is better, but the smaller the learning rate, the more weak learners you will have to fit. During initial modeling and EDA, set the learning rate rather large (0.01 for example). Then, when fitting your final model, set it very small (0.0001 for example), fit many, many weak learners, and run the model overnight.
- Choose a relatively high learning rate. A value of 0.1 generally works, but somewhere between 0.05 and 0.2 should work for different problems (note that the AdaBoost default in sklearn is 1).
- Determine the optimum number of trees for this learning rate. This should range around 40-70. Remember to choose a value on which your system can work fairly fast, because it will be used for testing various scenarios and determining the tree parameters.
- Tune tree-specific parameters for the chosen learning rate and number of trees. Note that different parameters can be chosen to define a tree.
- Lower the learning rate and increase the estimators proportionally to get more robust models.
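The tuning steps above can be sketched with a small grid search in scikit-learn. The data set and the grid values are assumptions for illustration. Tree-specific parameters such as the depth are left out here because the name of the base-estimator parameter depends on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumed) standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid: a few learning rates and numbers of weak learners.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2, 1.0],
    "n_estimators": [40, 70, 150],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Because of the trade-off between learning_rate and n_estimators, it usually makes sense to search over both together rather than one at a time.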
FEATURE IMPORTANCE IN ADABOOST
There are multiple ways to determine relative feature importance. The one built into AdaBoost is described here.
AdaBoost’s feature importance is derived from the feature importance provided by its base classifier. Assuming you use a Decision Tree as the base classifier, the AdaBoost feature importance is the average of the feature importances provided by each Decision Tree. In sklearn this average is weighted by the weight of each tree in the ensemble.
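The following sketch checks this on synthetic data with scikit-learn. It recomputes the importance by hand as an average of the per-tree importances, weighted by estimator_weights_, and compares it with feature_importances_. The data set and settings are assumptions for illustration, and the weighting reflects my understanding of scikit-learn's behaviour, so treat the comparison as a sanity check rather than a specification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic data (assumed); the default base estimator is a decision stump.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# One row of feature importances per fitted tree.
per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])

# Weight of each fitted tree in the ensemble.
tree_weights = model.estimator_weights_[: len(model.estimators_)]

# Weighted average of the per-tree importances.
manual = (tree_weights[:, None] * per_tree).sum(axis=0) / tree_weights.sum()

print(np.round(manual, 3))
print(np.round(model.feature_importances_, 3))
```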