I’m reading a book about the “boosting”: the machine learning idea that a better algorithm can be built by combining multiple instances of a simple algorithm, each instance trained on the same set of training data but with different weights assigned to each case.

The very basic concept itself goes along the same way as I’ve been doing manually for the Bayesian system: after a system is trained on a set of training cases, re-test it on the same set of data and find what cases got diagnosed wrong (the proper word for it, as it turns out, is “training error”). Then try to improve the recognition of these cases. But unlike me the boosting does it in an automated way: it doesn’t try to change the underlying algorithm, instead it simply produces the second training table, and the end result is produced by running the expert system with both tables then combining their results in a weighted fashion. Boosting can be done for hundreds of rounds, producing the systems that run the basic algorithm hundreds of times. For Bayesian algorithms, I think it might be possible to combine these many tables into one instead of running them separately, although I’m not quite sure about that.

There is also a limitation on the boosting: looks like it’s mostly sorted out for the cases where there are only two mutually exclusive hypotheses. Apparently it starts having problems even with multiple mutually exclusive hypotheses, let alone the overlapping ones.

Well, getting back to the book, it’s “Boosting”, written by the inventors of the concept in general and of the AdaBoost (Adaptive Boost) algorithm in particular, Shapire and Freund.

I’ve spent some time understanding the AdaBoost algorithm, and I feel that it can be expressed in a simpler form. The authors’ definition of the algorithm is like this (minus the mathematical characters that I don’t know how to write in HTML and wrote in words or similar Latin characters):

Given: (x_{1}, y_{1}), ..., (x_{m}, y_{m}) where x_{i}belongs to X, y_{i}belongs to {-1, +1}. Initialize: D_{1}(i) = 1/m for i = 1, ..., m. For t = 1, ..., T: * Train the basic algorithm using the distribution D_{t}. * Get weak hypothesis h_{t}: X -> {-1, +1}. * Aim: select h_{t}to minimalize the weighted error: e_{t}=Pr_{i~Dt}[h_{t}(x_{i}) != y_{i}]. * Choose Alpha_{t}= 1/2 * ln((1-e_{t})/e_{t}). * Update, for i = 1, ..., m: D_{t+1}(i) = D_{t}(i)/Z_{t}* { exp(-Alpha_{t}) if h_{t}(x_{i})=y_{i}; exp(Alpha_{t}) if h_{t}(x_{i}) != y_{i}} = D_{t}(i)*exp(-Alpha_{t}*y_{i}*h_{t}(x_{i})) / Z_{t}where Z_{t}is a normalization factor (chosen so that D_{t+1}will be a distribution). Output the final hypothesis: H(x) = sign(sum for t=1,...,T (Alpha_{t}*h_{t}(x)) ).

Some explanations are in order. The pairs (x_{i}, y_{i}) are the training cases. x_{i} represents the whole set of symptoms, y_{i} represents the outcome (in Bayesian terms, we could also say “hypotheses” but here the word hypothesis has a different meaning). Only two outcomes are possible, and unlike the Bayesian tradition of representing them as 0 and 1, here they are represented as -1 and +1. This representation allows the clever trick in the formula for D_{t+1}(i), replacing the conditional choice of the sign for Alpha_{t} with multiplication of (y_{i}*h_{t}(x_{i})). If these values don’t match, the result will be -1, if they do match then 1. The concept of hypothesis here is different from the Bayesian one, here the “hypothesis” means the whole basic algorithm with a computed training table. h_{t}(x_{i}) means the computation of the outcome by this trained algorithm by feeding the symptoms of a training case into it.

The epic expression **Pr**_{i~Dt}[h_{t}(x_{i}) != y_{i}] means the weighted probability of a training case outcome being computed incorrectly when it’s fed back into the basic algorithm with the current round’s training table (the training table is computed independently in each round based only on the set of training cases and on the distribution of their weights for this round). Basically, it’s the relation of the weight of incorrectly computed cases to the total weight of all cases. The underlying basic algorithm tries to make up its table to minimize this number to its best ability. But since it’s not perfect and might be far from perfect, the number of erroneous cases is unlikely to become 0.

There are T rounds of boosting. Each round trains the same basic algorithm on the same set of training cases but assigns different weights to the importance of these cases, according to D_{t}. The weights in D_{t} are adjusted for each round such as to sum up to 1. Z_{t} is thus the “normalization factor”: the sum of values in D_{t+1} before this adjustment. From the algorithmic standpoint, there is no good reason to do this normalization: weights are weights, who cares if they sum up to 1 or not, this algorithm doesn’t depend on it in any way. There is a reason why the authors use the Z_{t} as it is, because it turns up in the proofs about the properties of the algorithm. But here it’s easier to place it into the calculation of e_{t}.

Note also that Alpha_{t} is first computed as a logarithm, and then an exponent is taken on it. This logarithm and exponent cancel each other out except for the sign. This and the manipulation of the sign by (y_{i}*h_{t}(x_{i})) look mathematically cute but confuse the hell out of it. Well, they have a reason for using a logarithm because they use this Alpha_{t} to prove some properties, but like Z_{t} in this place it only complicates things.

Getting rid of all these complications, here is the simplified version:

Given: (x_{1}, y_{1}), ..., (x_{m}, y_{m}) where x_{i}belongs to X, y_{i}belongs to {-1, +1}. Initialize: D_{1}(i) = 1 for i = 1, ..., m. For t = 1, ..., T: * Train the basic algorithm using the weights D_{t}: * Get weak hypothesis h_{t}: X -> {-1, +1} where the basic algorithm aims to minimalize the weighted error e_{t}: Wgood_{t}= 0; Wbad_{t}= 0; for i = 1, ..., m { if h_{t}(x_{i}) = y_{i}{ Wgood_{t}+= D_{t}(i) } else { Wbad_{t}+= D_{t}(i) } } e_{t}= Wbad_{t}/ (Wgood_{t}+ Wbad_{t}) * Update, for i = 1, ..., m { if h_{t}(x_{i}) != y_{i}; { D_{t+1}(i) = D_{t}(i) * sqrt((1-e_{t})/e_{t}) } else { D_{t+1}(i) = D_{t}(i) * sqrt(e_{t}/(1-e_{t})) } } Output the final hypothesis: H(x) = sign(sum for t=1,...,T (ln(sqrt((1-e_{t})/e_{t}))*h_{t}(x)) ).

I think it becomes much easier to understand in this form. In particular, it becomes much easier to see an important property: each round adjusts the weights such as if the last table were used on them, it would produce the weighted error of exactly 50%, which would mean that it has no idea what is going on. The following round is then forced to come up with the new solution and scrape up a new way to do some differentiation. This is where the “Adaptive” in the algorithm’s name comes from. But they say that this is not how they’ve come up with this formula, they say that they’ve come up with this particular value of Alpha_{t} in order to minimize Z_{t}. I don’t understand yet, how exactly they’ve done it. I’ve tried to compute it and I get a different answer. I must be making an error somewhere, need to think more about it. And they try to minimize Z_{t} because it represents the upper bound on the training error (again, I can follow the explanation but I can’t tell on my own yet, exactly why, will need to read this section a couple more times).

I don’t understand yet the motivation for choosing the per-round multipliers in H(x) either. I can see some things about it. For example, the logarithm nicely handles the case of e_{t} > 0.5. If more than half the cases get misclassified by the basic algorithm, this situation is easy to improve: just flip the sign of the outcomes, and suddenly less than half of them will be misclassified. In this case the value under the logarithm will be less than 1, and the value of the logarithm will be negative, thus automatically flipping the sign! But I don’t know why the logarithm is the right choice the rest of the time.

We could also replace (1-e_{t})/e_{t} with Wgood_{t}/Wbad_{t}, and e_{t}/(1-e_{t}) with Wbad_{t}/Wgood_{t} but this might be detracting too much from the properties of e_{t}.