Monday, May 25, 2015

Building an ID3 decision tree "by hand"

Here we will work out an example of the ID3 decision tree algorithm 'by hand'. ID3 is the precursor to the popular C4.5 decision tree algorithm and is a good place to start in understanding how to build a decision tree. We will use the example of trying to decide whether we should play tennis given certain weather conditions. This example is adapted from several similar examples used to illustrate decision trees, such as those in T. Mitchell's text "Machine Learning" and J.R. Quinlan's "C4.5: Programs for Machine Learning".

Defining the data

First we will define the data that we'll be working with. It can be seen in the following table.

Table 1: Some data about weather and whether we will play a sport, say tennis (adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

ID3 Information Gain

Then we will calculate the information gain of the candidate attributes. This is the metric ID3 uses to decide which feature to split on. Each such split makes up a node of our decision tree, telling us which branch to traverse given a specific instance (data point) of feature values.

Information gain is calculated as the difference between the entropy of the parent set and the entropy of the child subsets produced by splitting on an attribute, with each child weighted by the proportion of examples that take that attribute value. That's a mouthful. A math expression will do it more justice than words can…

Entropy is defined as:

\[ H(S) = -\sum_i^n p_i \log_2 p_i \]

Here we are summing across the \(n\) discrete values that the outcome can take within the set of examples \(S\), where \(p_i\) is the proportion of examples in \(S\) that take the \(i\)-th value.

The information gain is defined as:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

Thus, the information gain tells us the difference between the entropy of the 'current', unsplit data and the entropy after we have partitioned on a specific feature, with each partition weighted by the proportion of examples it contains. We then choose the split that results in the maximum information gain (maximum reduction in entropy) to define a branching of our decision tree.
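Before grinding through the arithmetic, it may help to see the two formulas as code. Here is a minimal sketch in Python; the helper names (`entropy`, `information_gain`) and the dict-per-row data layout are my own choices, not anything prescribed by ID3:

from collections import Counter
from math import log2

def entropy(labels):
    """H(S): entropy of a list of class labels (e.g. the PlayTennis column)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="PlayTennis"):
    """G(S, A): parent entropy minus the weighted entropy of each subset
    produced by splitting the rows (dicts of feature -> value) on `attribute`."""
    gain = entropy([row[target] for row in rows])
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Quick sanity check against the hand calculation below:
# 9 "Yes" and 5 "No" examples give an entropy of about 0.94.
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # 0.94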

The first split

So let's go through the information gain calculations in detail for all of the possible feature splits in the original dataset (ignoring "Day", since it is more a row label than a feature).

We first need the entropy of the original dataset:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} \]

This is because we have 5 examples in which our outcome feature, "PlayTennis", is "No" and 9 in which it is "Yes". The numeric value is:

round(-(5/14) * log(5/14, 2) - (9/14) * log(9/14, 2), 3)
0.94

Now we'll compute the information gain that would be achieved by splitting on the first feature, Outlook:

  • Outlook

\[ G(S, Outlook) = H(S) - \frac{|S_{Sunny}|}{|S|} H(S_{Sunny}) - \frac{|S_{Overcast}|}{|S|} H(S_{Overcast}) - \frac{|S_{Rain}|}{|S|} H(S_{Rain}) \]

\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} \left(-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}\right) \right]_{Sunny} - \left[ \frac{4}{14} \left(-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}\right) \right]_{Overcast} - \left[ \frac{5}{14} \left(-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}\right) \right]_{Rain} \]

For example, if we look at just the "Sunny" value that Outlook can take: of the 5 examples (out of 14 total) in which Outlook is "Sunny", 2 have a PlayTennis value of "Yes" and 3 have a PlayTennis value of "No". We are always interested in the information gain of a feature relative to the outcome variable on which we are trying to make a prediction with our model.

Therefore, the information gain value for Outlook is:

round(0.94 - (5/14) * (-(3/5)*log(3/5,2)-(2/5)*log(2/5,2)) - (4/14) * (-(4/4)*log(4/4,2) - 0) - (5/14) * (-(3/5)*log(3/5,2) - (2/5)*log(2/5,2)), 3)
0.246

Similarly for the other features:

  • Temperature
round(0.94 - (4/14)*(-(2/4)*log(2/4,2)-(2/4)*log(2/4,2)) - (6/14)*(-(4/6)*log(4/6,2)-(2/6)*log(2/6,2)) - (4/14)*(-(3/4)*log(3/4,2) - (1/4)*log(1/4,2)),3)
0.029
  • Humidity
round(0.94 - (7/14)*(-(3/7)*log(3/7,2) - (4/7)*log(4/7,2)) - (7/14)*(-(6/7)*log(6/7,2) - (1/7)*log(1/7,2)), 3)
0.152
  • Wind
round(0.94 - (8/14)*(-(6/8)*log(6/8,2) - (2/8)*log(2/8,2)) - (6/14)*(-(3/6)*log(3/6,2) - (3/6)*log(3/6,2)), 3)
0.048

So the values for information gain are 0.246, 0.029, 0.152 and 0.048 for the features Outlook, Temperature, Humidity and Wind, respectively. We will choose the feature with the maximum information gain on which to make our first decision tree split, which in our case is Outlook.
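We can double-check these numbers programmatically. The following sketch re-encodes Table 1 as a list of dicts and repeats the `entropy` and `information_gain` helpers from the earlier sketch so that it runs on its own:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="PlayTennis"):
    gain = entropy([row[target] for row in rows])
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Table 1 as (Outlook, Temperature, Humidity, Wind, PlayTennis) tuples.
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
ROWS = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

# Information gain of each candidate attribute on the full dataset.
for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(DATA, attr), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048

The only (tiny) discrepancy is Outlook, 0.247 here versus 0.246 above, which comes from using the rounded 0.94 for \(H(S)\) in the hand calculation; either way Outlook has the largest gain and wins the first split.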

Our initial decision tree looks like this:

So we can say with certainty that if the outlook is overcast, we will play tennis since all values of outlook==overcast result in a "Yes" value of "PlayTennis". If the outlook is rain or sunny, we still don't have a definite answer as to whether we play tennis. Therefore, we continue splitting on other attributes for each of those nodes of our decision tree.
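One hypothetical way to write the tree down at this point is as a nested dict, with a placeholder marking the branches that still need further splitting (the full sketch at the end of the post builds exactly this kind of structure):

# Partial tree after the first split: the Overcast branch is already pure,
# while the Sunny and Rain branches still need another attribute.
partial_tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Sunny": "?",   # 2 Yes / 3 No -- split again below
        "Rain": "?",    # 3 Yes / 2 No -- split again
    }
}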

Another branch

At the risk of excess verbosity, we'll create one more branch of the decision tree explicitly to further illustrate the process.

We'll work with the "Outlook==Sunny" branch in which there are 2 "Yes" cases for "PlayTennis" and 3 "No" cases. The reduced set of data (those rows with Outlook==Sunny) we are working with now is therefore:

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
D1   Sunny    Hot          High      Weak    No
D2   Sunny    Hot          High      Strong  No
D8   Sunny    Mild         High      Weak    No
D9   Sunny    Cool         Normal    Weak    Yes
D11  Sunny    Mild         Normal    Strong  Yes

So now we have a total entropy of:

\[ H(S_{Sunny}) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} \]

Which is:

round(-(2/5)*log(2/5,2) - (3/5)*log(3/5,2),3)
0.971

Computing the information gain on the features:

  • Temperature
round(0.971 - (2/5) * ( -0 - (2/2)*log(2/2,2) ) - (2/5) * ( -(1/2)*log(1/2,2) - (1/2)*log(1/2,2) ) - (1/5) * ( -(1/1)*log(1/1,2) - 0), 3)
0.571
  • Humidity
round(0.971 - (3/5) * ( -0 - (3/3)*log(3/3,2) ) - (2/5) * ( -(2/2)*log(2/2,2) - 0 ), 3)
0.971
  • Wind
round(0.971 - (3/5) * ( -(1/3)*log(1/3,2) - (2/3)*log(2/3,2) ) - (2/5) * ( -(1/2)*log(1/2,2) - (1/2)*log(1/2,2) ), 3)
0.02
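These three values can also be recovered from the earlier sketch by filtering `DATA` down to the Outlook==Sunny rows (this continues that snippet, so `DATA`, `entropy`, and `information_gain` are assumed to already be defined):

# Keep only the five rows with Outlook == "Sunny" and redo the gain calculation.
sunny = [row for row in DATA if row["Outlook"] == "Sunny"]

print(round(entropy([row["PlayTennis"] for row in sunny]), 3))  # 0.971
for attr in ["Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(sunny, attr), 3))
# Temperature 0.571, Humidity 0.971, Wind 0.02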

Therefore we will split on Humidity, since it has the highest information gain (in fact its gain equals the entropy of this subset, meaning the split separates the Sunny examples perfectly: High → "No", Normal → "Yes"). The resulting update to our decision tree looks like this:

The final decision tree

Skipping over the calculations for the last split (the Outlook==Rain branch, which splits on Wind), the final decision tree is as follows: Outlook at the root; Overcast → "Yes"; Sunny → Humidity (High → "No", Normal → "Yes"); Rain → Wind (Weak → "Yes", Strong → "No").
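Finally, to tie everything together, here is a minimal recursive ID3 sketch that automates the procedure we just walked through. It continues from the earlier snippet (so `DATA` and `information_gain` are assumed to be in scope), handles only categorical attributes, and includes none of the later C4.5 refinements such as gain ratio, continuous attributes, or pruning; running it on `DATA` should print a nested dict matching the tree just described.

from collections import Counter

def id3(rows, attributes, target="PlayTennis"):
    """Return a nested dict {attribute: {value: subtree_or_label}}, choosing at
    each node the attribute with the highest information gain."""
    labels = [row[target] for row in rows]
    # Base case 1: all remaining examples share one label -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left to split on -> majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the maximum information gain.
    best = max(attributes, key=lambda attr: information_gain(rows, attr, target))
    remaining = [attr for attr in attributes if attr != best]
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

print(id3(DATA, ["Outlook", "Temperature", "Humidity", "Wind"]))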
