The C4.5 decision tree split criterion
In a previous post, we worked through an example of building a decision tree using the ID3 algorithm, a precursor to the ubiquitous C4.5 decision tree algorithm.
In this post, we'll dig into the split criterion of the C4.5 algorithm itself, again doing some hand calculations to make sure we truly understand it, rather than just throwing data into a black-box implementation of the algorithm.
The data
We'll use the same data that was used in the previous post on the ID3 algorithm so we can compare our results. The dataset is as follows:
Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
---|---|---|---|---|---|
D1 | Sunny | Hot | High | Weak | No |
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
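To make the hand calculations below easy to check as we go, here is the same dataset as a small Python structure. This is just a sketch for this post: the list-of-dicts layout and the variable name `data` are choices made here, not anything prescribed by C4.5.

```python
# The PlayTennis dataset as a list of records (D1..D14 in order).
# Each dict maps an attribute name to its value; "PlayTennis" is the label.
data = [
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
]
```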
Splitting criterion
ID3 used an information gain criterion to decide which attribute to branch on when building the decision tree. In equation form:
\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]
where H is the entropy of a set, defined by:
\[ H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]
The information gain measures the reduction in entropy produced by a hypothetical split on an attribute. We choose to split on the attribute with the highest information gain.
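As a sanity check on the numbers we compute by hand below, these two formulas can be written directly in Python. This is a minimal sketch assuming the `data` list defined above; the helper names `entropy` and `information_gain` are just illustrative choices for this post.

```python
from collections import Counter
from math import log2

def entropy(rows, label="PlayTennis"):
    """H(S): entropy of the class label distribution over a set of rows."""
    counts = Counter(row[label] for row in rows)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, label="PlayTennis"):
    """G(S, A): reduction in entropy from partitioning rows on an attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, label)
    return entropy(rows, label) - remainder
```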
In contrast, C4.5 uses a so-called gain ratio criterion. The gain ratio criterion is preferable because ID3's information gain criterion strongly favors features with many outcomes. For example, if we were to consider the Day column as a feature (we had only treated it as a row label until now), it would have the highest information gain; but splitting on it would produce a shallow, broad tree with 14 leaves, one branch per instance. Such a tree is not very useful, as it would be unlikely to generalize well to unseen data.
The gain ratio criterion ameliorates this problem by normalizing the gain by the entropy of the split itself, which penalizes features with many outcomes in favor of those with fewer.
The gain ratio is defined as follows:
\[ G_r(S,A) = \frac{G(S,A)}{Split(S,A)} \]
where \(Split(S,A)\) is the split information, defined as:
\[ Split(S,A) = -\sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} \log_2 \frac{|S_{\nu}|}{|S|} \]
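Continuing the Python sketch, the split information and the gain ratio could look like this (again, the names `split_info` and `gain_ratio` are just for this post, and they build on the `entropy` and `information_gain` helpers above):

```python
from collections import Counter   # same imports as the previous snippet
from math import log2

def split_info(rows, attribute):
    """Split(S, A): entropy of the partition sizes induced by an attribute."""
    total = len(rows)
    counts = Counter(row[attribute] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(rows, attribute, label="PlayTennis"):
    """G_r(S, A): information gain normalized by the split information."""
    return information_gain(rows, attribute, label) / split_info(rows, attribute)
```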
Computing splits using gain ratio
The entropy of the full dataset is:
\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} = 0.94 \]
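If you are following along with the Python sketch, this number is reproduced by the hypothetical `entropy` helper:

```python
print(round(entropy(data), 2))  # 0.94
```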
Then we'll compute hypothetical splits on each of the features.
- Outlook
The information gain on the Outlook attribute is:
\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Sunny} - \left[ \frac{4}{14} (-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}) \right]_{Overcast} - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Rain} = 0.246 \]
And the split information is:
\[ Split(S, Outlook) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577 \]
Therefore the gain ratio is \(\frac{0.246}{1.577}=0.156\) for the Outlook attribute.
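The same three numbers drop out of the sketch functions defined earlier:

```python
# Cross-check the Outlook numbers computed by hand above
# (tiny differences versus the hand-rounded values are rounding error).
print(round(information_gain(data, "Outlook"), 3))  # ~0.247
print(round(split_info(data, "Outlook"), 3))        # ~1.577
print(round(gain_ratio(data, "Outlook"), 3))        # ~0.156
```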
- Temperature
\[ G(S, Temperature) = H(S) - \frac{4}{14} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right) - \frac{6}{14} \left( -\frac{4}{6} \log_2 \frac{4}{6} - \frac{2}{6} \log_2 \frac{2}{6} \right) - \frac{4}{14} \left( -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \right) = 0.029 \]
\[ Split(S, Temperature) = -\frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} - \frac{4}{14} \log_2 \frac{4}{14} = 1.557 \]
And the gain ratio for the Temperature attribute is \(\frac{0.029}{1.557} = 0.019\).
- Humidity
\[ G(S, Humidity) = H(S) - \frac{7}{14} \left( -\frac{3}{7} \log_2 \frac{3}{7} - \frac{4}{7} \log_2 \frac{4}{7} \right) - \frac{7}{14} \left( -\frac{6}{7} \log_2 \frac{6}{7} - \frac{1}{7} \log_2 \frac{1}{7} \right) = 0.152 \]
\[ Split(S, Humidity) = -\frac{7}{14} \log_2 \frac{7}{14} - \frac{7}{14} \log_2 \frac{7}{14} = 1 \]
The gain ratio is \(\frac{0.152}{1} = 0.152\).
- Wind
\[ G(S, Wind) = H(S) - \frac{8}{14} \left( -\frac{6}{8} \log_2 \frac{6}{8} - \frac{2}{8} \log_2 \frac{2}{8} \right) - \frac{6}{14} \left( -\frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} \right) = 0.048 \]
\[ Split(S, Wind) = -\frac{8}{14} \log_2 \frac{8}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 0.985 \]
And the gain ratio is \(\frac{0.048}{0.985}=0.0487\).
Branching decision based on gain ratio
Comparing the four candidate attributes, the maximum gain ratio belongs to Outlook (0.156), just ahead of Humidity (0.152). On this dataset, then, the gain ratio criterion chooses the same root split as the information gain criterion did; the normalization by split information matters most when the candidate attributes differ widely in their number of outcomes, as in the Day example above.
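Putting everything together, the whole comparison can be reproduced with a short loop over the sketch functions defined earlier, which also makes it easy to experiment with other datasets:

```python
candidates = ["Outlook", "Temperature", "Humidity", "Wind"]

# Print gain and gain ratio side by side for each candidate attribute.
for attribute in candidates:
    print(f"{attribute:12s}  gain={information_gain(data, attribute):.3f}  "
          f"gain_ratio={gain_ratio(data, attribute):.3f}")

# The root split is the attribute with the largest gain ratio.
best = max(candidates, key=lambda a: gain_ratio(data, a))
print("Split on:", best)
```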