Tuesday, May 26, 2015

The C4.5 decision tree split criterion

In a previous post, we worked through an example of building a decision tree using the ID3 algorithm, a precursor to the ubiquitous C4.5 decision tree algorithm.

In this post, we are going to work through the split criterion of the C4.5 algorithm itself, doing the hand calculations ourselves rather than just throwing data into a black-box implementation of the algorithm.

The data

We'll use the same data that was used in the previous post on the ID3 algorithm so we can compare our results. The dataset is as follows:

Table 1: Some data about weather and whether we will play a sport, say, tennis (adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59).
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Splitting criterion

ID3 used an information gain criterion in deciding how to build branches in the decision tree. In equation form:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

where H is the entropy of a set, defined by:

\[ H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]

where \(p_i\) is the fraction of examples in \(S\) that belong to class \(i\).

The information gain is the gain in information due to the hypothetical split on an attribute. We choose to split on the attribute that has the highest information gain.
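To make these two definitions concrete, here is a minimal Python sketch of entropy and information gain. The function names (entropy, information_gain) and the representation of the dataset as a list of dicts are choices made for illustration, not part of any particular library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S), in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())

def information_gain(rows, attribute, target):
    """G(S, A): entropy of the target labels minus the weighted
    entropy of each subset produced by splitting on `attribute`."""
    gain = entropy([row[target] for row in rows])
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain
```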

In contrast, C4.5 uses a so-called gain ratio criterion. The gain ratio is superior because ID3's information gain criterion strongly favors features with many outcomes. For example, if we were to consider the Day column as a feature (until now we have treated it as a row label rather than a feature), it would have the highest information gain; but splitting on it would produce a shallow, broad tree with 14 leaves: one branch for each instance. Such a tree is not very useful, as it would likely not generalize well to unseen data.

The gain ratio criterion ameliorates this problem by normalizing the gain by how finely the feature partitions the data, favoring a feature with fewer outcomes over one with more.

The gain ratio is defined as follows:

\[ G_r(S,A) = \frac{G(S,A)}{Split(S,A)} \]

where \(Split(S,A)\) is the split information, defined as:

\[ Split(S,A) = -\sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} \log_2 \frac{|S_{\nu}|}{|S|} \]
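Notice that \(Split(S,A)\) is just the entropy of the partition that the attribute itself induces on the data, ignoring the class labels. Continuing the sketch above (again, the names are mine):

```python
def split_info(rows, attribute):
    """Split(S, A): entropy of the value distribution of `attribute`
    itself, ignoring the class labels."""
    return entropy([row[attribute] for row in rows])

def gain_ratio(rows, attribute, target):
    """G_r(S, A) = G(S, A) / Split(S, A)."""
    return information_gain(rows, attribute, target) / split_info(rows, attribute)
```

A real implementation would also need to guard against the case where the split information is zero (an attribute with a single value), which would otherwise divide by zero.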

Computing splits using gain ratio

The entropy of the full dataset is:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} = 0.94 \]
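As a quick sanity check, the entropy helper sketched above reproduces this value from the label counts in Table 1 (5 No, 9 Yes):

```python
labels = ["No"] * 5 + ["Yes"] * 9  # class counts from Table 1
print(round(entropy(labels), 3))   # 0.94
```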

Then we'll compute hypothetical splits on each of the features.

  • Outlook

The information gain on the Outlook attribute is:

\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Sunny} - \left[ \frac{4}{14} (-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}) \right]_{Overcast} - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Rain} = 0.246 \]

And the split information is:

\[ Split(S, Outlook) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577 \]

Therefore the gain ratio is \(\frac{0.246}{1.577}=0.156\) for the Outlook attribute.

  • Temperature

\[ G(S, Temperature) = H(S) - \frac{4}{14} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right) - \frac{6}{14} \left( -\frac{4}{6} \log_2 \frac{4}{6} - \frac{2}{6} \log_2 \frac{2}{6} \right) - \frac{4}{14} \left( -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \right) = 0.029 \]

\[ Split(S, Temperature) = -\frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} - \frac{4}{14} \log_2 \frac{4}{14} = 1.557 \]

And the gain ratio for the Temperature attribute is \(\frac{0.029}{1.557} = 0.019\).

  • Humidity

\[ G(S, Humidity) = H(S) - \frac{7}{14} \left( -\frac{3}{7} \log_2 \frac{3}{7} - \frac{4}{7} \log_2 \frac{4}{7} \right) - \frac{7}{14} \left( -\frac{6}{7} \log_2 \frac{6}{7} - \frac{1}{7} \log_2 \frac{1}{7} \right) = 0.152 \]

\[ Split(S, Humidity) = -\frac{7}{14} \log_2 \frac{7}{14} - \frac{7}{14} \log_2 \frac{7}{14} = 1 \]

The gain ratio is \(\frac{0.152}{1} = 0.152\); note that a perfectly balanced two-way split has a split information of exactly 1 bit, so the gain is unchanged by the normalization.

  • Wind

\[ G(S, Wind) = H(S) - \frac{8}{14} \left( -\frac{6}{8} \log_2 \frac{6}{8} - \frac{2}{8} \log_2 \frac{2}{8} \right) - \frac{6}{14} \left( -\frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} \right) = 0.048 \]

\[ Split(S, Wind) = -\frac{8}{14} \log_2 \frac{8}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 0.985 \]

And the gain ratio is \(\frac{0.048}{0.985}=0.0487\).
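Putting it all together, the short script below uses the helpers sketched earlier to recompute the gain ratio of every attribute over the 14 rows of Table 1, reproducing the hand calculations above:

```python
columns = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
data = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
rows = [dict(zip(columns, r)) for r in data]

for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(gain_ratio(rows, attr, "PlayTennis"), 3))
# Outlook 0.156
# Temperature 0.019
# Humidity 0.152
# Wind 0.049
```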

Branching decision based on gain ratio

So we see that the maximum gain ratio among the candidate attributes belongs to the Outlook feature (0.156). For this dataset, that is the same attribute selected by the information gain criterion, so both criteria produce the same root split here. The normalization still has a visible effect: under raw information gain Outlook (0.246) was well ahead of Humidity (0.152), whereas under the gain ratio criterion the two are nearly tied (0.156 vs. 0.152), because Outlook's three-way split is penalized by its larger split information.
