Tuesday, May 26, 2015

The C4.5 decision tree split criterion

In a previous post, we worked through an example of building a decision tree using the ID3 algorithm, a precursor to the ubiquitous C4.5 decision tree algorithm.

In this post, we are going to work through the split criterion of the C4.5 algorithm itself, doing the hand calculations ourselves rather than just throwing data into a black-box implementation of the algorithm.

The data

We'll use the same data that was used in the previous post on the ID3 algorithm so we can compare our results. The dataset is as follows:

Table 1: Some data about weather and whether we will play a sport, say, tennis (adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59).
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Splitting criterion

ID3 used an information gain criterion in deciding how to build branches in the decision tree. In equation form:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

where H is the entropy of a set, defined by:

\[ H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]

where \(p_i\) is the fraction of examples in \(S\) that belong to class \(i\).

The information gain is the gain in information due to the hypothetical split on an attribute. We choose to split on the attribute that has the highest information gain.
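To make these two definitions concrete, here is a minimal Python sketch of entropy and information gain. The function names (entropy, information_gain) and the representation of the dataset as a list of dicts are choices made for illustration, not part of any particular library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S), in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())

def information_gain(rows, attribute, target):
    """G(S, A): entropy of the target labels minus the weighted
    entropy of each subset produced by splitting on `attribute`."""
    gain = entropy([row[target] for row in rows])
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain
```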

In contrast, C4.5 uses a so-called gain ratio criterion. The gain ratio is superior because ID3's information gain criterion strongly favors features with many outcomes. For example, if we were to consider the Day column as a feature (until now we have treated it as a row label rather than a feature), it would have the highest information gain; but splitting on it would produce a shallow, broad tree with 14 leaves: one branch for each instance. Such a tree is not very useful, as it would likely not generalize well to unseen data.

The gain ratio criterion ameliorates this problem by normalizing the gain by how finely the feature partitions the data, favoring a feature with fewer outcomes over one with more.

The gain ratio is defined as follows:

\[ G_r(S,A) = \frac{G(S,A)}{Split(S,A)} \]

where \(Split(S,A)\) is the split information, defined as:

\[ Split(S,A) = -\sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} \log_2 \frac{|S_{\nu}|}{|S|} \]
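Notice that \(Split(S,A)\) is just the entropy of the partition that the attribute itself induces on the data, ignoring the class labels. Continuing the sketch above (again, the names are mine):

```python
def split_info(rows, attribute):
    """Split(S, A): entropy of the value distribution of `attribute`
    itself, ignoring the class labels."""
    return entropy([row[attribute] for row in rows])

def gain_ratio(rows, attribute, target):
    """G_r(S, A) = G(S, A) / Split(S, A)."""
    return information_gain(rows, attribute, target) / split_info(rows, attribute)
```

A real implementation would also need to guard against the case where the split information is zero (an attribute with a single value), which would otherwise divide by zero.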

Computing splits using gain ratio

The entropy of the full dataset is:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} = 0.94 \]
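As a quick sanity check, the entropy helper sketched above reproduces this value from the label counts in Table 1 (5 No, 9 Yes):

```python
labels = ["No"] * 5 + ["Yes"] * 9  # class counts from Table 1
print(round(entropy(labels), 3))   # 0.94
```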

Then we'll compute hypothetical splits on each of the features.

  • Outlook

The information gain on the Outlook attribute is:

\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Sunny} - \left[ \frac{4}{14} (-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}) \right]_{Overcast} - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Rain} = 0.246 \]

And the split information is:

\[ Split(S, Outlook) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577 \]

Therefore the gain ratio is \(\frac{0.246}{1.577}=0.156\) for the Outlook attribute.

  • Temperature

\[ G(S, Temperature) = H(S) - \frac{4}{14} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right) - \frac{6}{14} \left( -\frac{4}{6} \log_2 \frac{4}{6} - \frac{2}{6} \log_2 \frac{2}{6} \right) - \frac{4}{14} \left( -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \right) = 0.029 \]

\[ Split(S, Temperature) = -\frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} - \frac{4}{14} \log_2 \frac{4}{14} = 1.557 \]

And the gain ratio for the Temperature attribute is \(\frac{0.029}{1.557} = 0.019\).

  • Humidity

\[ G(S, Humidity) = H(S) - \frac{7}{14} \left( -\frac{3}{7} \log_2 \frac{3}{7} - \frac{4}{7} \log_2 \frac{4}{7} \right) - \frac{7}{14} \left( -\frac{6}{7} \log_2 \frac{6}{7} - \frac{1}{7} \log_2 \frac{1}{7} \right) = 0.152 \]

\[ Split(S, Humidity) = -\frac{7}{14} \log_2 \frac{7}{14} - \frac{7}{14} \log_2 \frac{7}{14} = 1 \]

The gain ratio is \(\frac{0.152}{1} = 0.152\); note that a perfectly balanced two-way split has a split information of exactly 1 bit, so the gain is unchanged by the normalization.

  • Wind

\[ G(S, Wind) = H(S) - \frac{8}{14} \left( -\frac{6}{8} \log_2 \frac{6}{8} - \frac{2}{8} \log_2 \frac{2}{8} \right) - \frac{6}{14} \left( -\frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} \right) = 0.048 \]

\[ Split(S, Wind) = -\frac{8}{14} \log_2 \frac{8}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 0.985 \]

And the gain ratio is \(\frac{0.048}{0.985}=0.0487\).
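Putting it all together, the short script below uses the helpers sketched earlier to recompute the gain ratio of every attribute over the 14 rows of Table 1, reproducing the hand calculations above:

```python
columns = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
data = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
rows = [dict(zip(columns, r)) for r in data]

for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(gain_ratio(rows, attr, "PlayTennis"), 3))
# Outlook 0.156
# Temperature 0.019
# Humidity 0.152
# Wind 0.049
```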

Branching decision based on gain ratio

So we see that the maximum gain ratio among the candidate attributes belongs to the Outlook feature (0.156). For this dataset, that is the same attribute selected by the information gain criterion, so both criteria produce the same root split here. The normalization still has a visible effect: under raw information gain Outlook (0.246) was well ahead of Humidity (0.152), whereas under the gain ratio criterion the two are nearly tied (0.156 vs. 0.152), because Outlook's three-way split is penalized by its larger split information.
