Monday, May 25, 2015

Building an ID3 decision tree "by hand"

Here we will work out an example of the ID3 decision tree algorithm 'by hand'. ID3 is the precursor to the popular C4.5 decision tree algorithm and is a good place to start in understanding how to build a decision tree. We will use the example of trying to decide whether we should play tennis given certain weather conditions. This example is adapted from several similar examples used to illustrate decision trees, such as those in T. Mitchell's text "Machine Learning" and J.R. Quinlan's "C4.5: Programs for Machine Learning".

Defining the data

First we will define the data that we'll be working with. It can be seen in the following table.

Table 1: Some data about weather and whether we will play a sport, say tennis (adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

ID3 Information Gain

Then we will calculate the information gain of the candidate attributes. This is the metric ID3 uses to decide which feature to split on. Each such split makes up a node of our decision tree, telling us which branch to traverse given a specific instance (data point) of feature values.

Information gain is calculated as the difference between the entropy of the parent set and the entropy of the child subsets produced by splitting on an attribute, with each child weighted by the proportion of examples that take that attribute value. That's a mouthful. A math expression will do it more justice than words can…

Entropy is defined as:

\[ H(S) = -\sum_i^n p_i \log_2 p_i \]

Here we are summing across the \(n\) discrete values that the outcome can take within the set of examples \(S\), where \(p_i\) is the proportion of examples in \(S\) that take the \(i\)-th value.

The information gain is defined as:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

Thus, the information gain tells us the difference between the entropy of the 'current', unsplit data and the entropy after we have partitioned on a specific feature, with each partition weighted by the proportion of examples it contains. We then choose the split that results in the maximum information gain (maximum reduction in entropy) to define a branching of our decision tree.
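Before grinding through the arithmetic, it may help to see the two formulas as code. Here is a minimal sketch in Python; the helper names (`entropy`, `information_gain`) and the dict-per-row data layout are my own choices, not anything prescribed by ID3:

from collections import Counter
from math import log2

def entropy(labels):
    """H(S): entropy of a list of class labels (e.g. the PlayTennis column)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="PlayTennis"):
    """G(S, A): parent entropy minus the weighted entropy of each subset
    produced by splitting the rows (dicts of feature -> value) on `attribute`."""
    gain = entropy([row[target] for row in rows])
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Quick sanity check against the hand calculation below:
# 9 "Yes" and 5 "No" examples give an entropy of about 0.94.
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # 0.94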

The first split

So let's go through the information gain calculations in detail for all of the possible feature splits in the original dataset (ignoring "Day", since it is more a row label than a feature).

We first need the entropy of the original dataset:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} \]

This is because we have 5 examples in which our outcome feature, "PlayTennis", is "No" and 9 in which it is "Yes". The numeric value is:

round(-(5/14) * log(5/14, 2) - (9/14) * log(9/14, 2), 3)
0.94

Now we'll compute the information gain that would be achieved by splitting on the first feature, Outlook:

  • Outlook

\[ G(S, Outlook) = H(S) - \frac{|S_{Sunny}|}{|S|} H(S_{Sunny}) - \frac{|S_{Overcast}|}{|S|} H(S_{Overcast}) - \frac{|S_{Rain}|}{|S|} H(S_{Rain}) \]

\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} \left(-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}\right) \right]_{Sunny} - \left[ \frac{4}{14} \left(-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}\right) \right]_{Overcast} - \left[ \frac{5}{14} \left(-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}\right) \right]_{Rain} \]

For example, if we look at just the "Sunny" value that Outlook can take: of the 5 examples (out of 14 total) in which Outlook is "Sunny", 2 have a PlayTennis value of "Yes" and 3 have a PlayTennis value of "No". We are always interested in the information gain of a feature relative to the outcome variable on which we are trying to make a prediction with our model.

Therefore, the information gain value for Outlook is:

round(0.94 - (5/14) * (-(3/5)*log(3/5,2)-(2/5)*log(2/5,2)) - (4/14) * (-(4/4)*log(4/4,2) - 0) - (5/14) * (-(3/5)*log(3/5,2) - (2/5)*log(2/5,2)), 3)
0.246

Similarly for the other features:

  • Temperature
round(0.94 - (4/14)*(-(2/4)*log(2/4,2)-(2/4)*log(2/4,2)) - (6/14)*(-(4/6)*log(4/6,2)-(2/6)*log(2/6,2)) - (4/14)*(-(3/4)*log(3/4,2) - (1/4)*log(1/4,2)),3)
0.029
  • Humidity
round(0.94 - (7/14)*(-(3/7)*log(3/7,2) - (4/7)*log(4/7,2)) - (7/14)*(-(6/7)*log(6/7,2) - (1/7)*log(1/7,2)), 3)
0.152
  • Wind
round(0.94 - (8/14)*(-(6/8)*log(6/8,2) - (2/8)*log(2/8,2)) - (6/14)*(-(3/6)*log(3/6,2) - (3/6)*log(3/6,2)), 3)
0.048

So the values for information gain are 0.246, 0.029, 0.152 and 0.048 for the features Outlook, Temperature, Humidity and Wind, respectively. We will choose the feature with the maximum information gain on which to make our first decision tree split, which in our case is Outlook.
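We can double-check these numbers programmatically. The following sketch re-encodes Table 1 as a list of dicts and repeats the `entropy` and `information_gain` helpers from the earlier sketch so that it runs on its own:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="PlayTennis"):
    gain = entropy([row[target] for row in rows])
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Table 1 as (Outlook, Temperature, Humidity, Wind, PlayTennis) tuples.
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
ROWS = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

# Information gain of each candidate attribute on the full dataset.
for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(DATA, attr), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048

The only (tiny) discrepancy is Outlook, 0.247 here versus 0.246 above, which comes from using the rounded 0.94 for \(H(S)\) in the hand calculation; either way Outlook has the largest gain and wins the first split.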

Our initial decision tree looks like this:

So we can say with certainty that if the outlook is overcast, we will play tennis since all values of outlook==overcast result in a "Yes" value of "PlayTennis". If the outlook is rain or sunny, we still don't have a definite answer as to whether we play tennis. Therefore, we continue splitting on other attributes for each of those nodes of our decision tree.
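One hypothetical way to write the tree down at this point is as a nested dict, with a placeholder marking the branches that still need further splitting (the full sketch at the end of the post builds exactly this kind of structure):

# Partial tree after the first split: the Overcast branch is already pure,
# while the Sunny and Rain branches still need another attribute.
partial_tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Sunny": "?",   # 2 Yes / 3 No -- split again below
        "Rain": "?",    # 3 Yes / 2 No -- split again
    }
}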

Another branch

At the risk of excess verbosity, we'll create one more branch of the decision tree explicitly to further illustrate the process.

We'll work with the "Outlook==Sunny" branch in which there are 2 "Yes" cases for "PlayTennis" and 3 "No" cases. The reduced set of data (those rows with Outlook==Sunny) we are working with now is therefore:

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
D1   Sunny    Hot          High      Weak    No
D2   Sunny    Hot          High      Strong  No
D8   Sunny    Mild         High      Weak    No
D9   Sunny    Cool         Normal    Weak    Yes
D11  Sunny    Mild         Normal    Strong  Yes

So now we have a total entropy of:

\[ H(S_{Sunny}) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} \]

Which is:

round(-(2/5)*log(2/5,2) - (3/5)*log(3/5,2),3)
0.971

Computing the information gain on the features:

  • Temperature
round(0.971 - (2/5) * ( -0 - (2/2)*log(2/2,2) ) - (2/5) * ( -(1/2)*log(1/2,2) - (1/2)*log(1/2,2) ) - (1/5) * ( -(1/1)*log(1/1,2) - 0), 3)
0.571
  • Humidity
round(0.971 - (3/5) * ( -0 - (3/3)*log(3/3,2) ) - (2/5) * ( -(2/2)*log(2/2,2) - 0 ), 3)
0.971
  • Wind
round(0.971 - (3/5) * ( -(1/3)*log(1/3,2) - (2/3)*log(2/3,2) ) - (2/5) * ( -(1/2)*log(1/2,2) - (1/2)*log(1/2,2) ), 3)
0.02
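These three values can also be recovered from the earlier sketch by filtering `DATA` down to the Outlook==Sunny rows (this continues that snippet, so `DATA`, `entropy`, and `information_gain` are assumed to already be defined):

# Keep only the five rows with Outlook == "Sunny" and redo the gain calculation.
sunny = [row for row in DATA if row["Outlook"] == "Sunny"]

print(round(entropy([row["PlayTennis"] for row in sunny]), 3))  # 0.971
for attr in ["Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(sunny, attr), 3))
# Temperature 0.571, Humidity 0.971, Wind 0.02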

Therefore we will split on Humidity, since it has the highest information gain (in fact its gain equals the entropy of this subset, meaning the split separates the Sunny examples perfectly: High → "No", Normal → "Yes"). The resulting update to our decision tree looks like this:

The final decision tree

Skipping over the calculations for the last split (the Outlook==Rain branch, which splits on Wind), the final decision tree is as follows: Outlook at the root; Overcast → "Yes"; Sunny → Humidity (High → "No", Normal → "Yes"); Rain → Wind (Weak → "Yes", Strong → "No").
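Finally, to tie everything together, here is a minimal recursive ID3 sketch that automates the procedure we just walked through. It continues from the earlier snippet (so `DATA` and `information_gain` are assumed to be in scope), handles only categorical attributes, and includes none of the later C4.5 refinements such as gain ratio, continuous attributes, or pruning; running it on `DATA` should print a nested dict matching the tree just described.

from collections import Counter

def id3(rows, attributes, target="PlayTennis"):
    """Return a nested dict {attribute: {value: subtree_or_label}}, choosing at
    each node the attribute with the highest information gain."""
    labels = [row[target] for row in rows]
    # Base case 1: all remaining examples share one label -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left to split on -> majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the maximum information gain.
    best = max(attributes, key=lambda attr: information_gain(rows, attr, target))
    remaining = [attr for attr in attributes if attr != best]
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

print(id3(DATA, ["Outlook", "Temperature", "Humidity", "Wind"]))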
