Tuesday, May 26, 2015

The C4.5 decision tree split criterion


In a previous post, we worked through an example of building a decision tree using the ID3 algorithm, a precursor to the ubiquitous C4.5 decision tree algorithm.

In this post, we are going to work through the split criterion of the C4.5 algorithm itself, doing some more hand calculations so that we truly understand it rather than just throwing some data into a black-box implementation of the algorithm.

The data

We'll use the same data that was used in the previous post on the ID3 algorithm so we can compare our results. The dataset is as follows:

Table 1: Some data about weather and whether we will play a sport. Say, tennis (adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59)
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Splitting criterion

ID3 used an information gain criterion in deciding how to build branches in the decision tree. In equation form:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

where H is the entropy of a set, defined by:

\[ H(S) = -\sum_i^n p_i \log_2 p_i \]

The information gain is the gain in information due to the hypothetical split on an attribute. We choose to split on the attribute that has the highest information gain.

In contrast, C4.5 uses a so-called gain ratio criterion. The gain ratio addresses a weakness of ID3: the information gain criterion strongly favors features with many outcomes. For example, if we were to consider the Day column as a feature (we had only considered it to be a row label rather than a feature until now), it would have the highest information gain; but splitting on it would produce a shallow, broad tree with 14 leaves, one branch for each instance. This is not a very useful decision tree, as it would likely not generalize well to unseen data.

The gain ratio criterion ameliorates this problem by dividing the information gain by a quantity that grows with the number of outcomes, so that, all else being equal, a feature with fewer outcomes is favored over one with more.

The gain ratio is defined as follows:

\[ G_r(S,A) = \frac{G(S,A)}{Split(S,A)} \]

where \(Split(S,A)\) is the split information, defined as:

\[ Split(S,A) = -\sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} \log_2 \frac{|S_{\nu}|}{|S|} \]
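
These definitions map almost one-to-one onto a few lines of R. The following is a minimal sketch (not from the original derivation; the function names entropy, info.gain, split.info and gain.ratio are simply illustrative choices), written for a vector of attribute values and a vector of class labels:

# Entropy of a vector of class labels (H above).
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]             # treat 0 * log2(0) as 0 by dropping empty levels
  -sum(p * log2(p))
}

# Information gain G(S, A): parent entropy minus the weighted child entropies.
info.gain <- function(attr, labels) {
  w <- table(attr) / length(attr)                          # |S_v| / |S|
  entropy(labels) - sum(w * tapply(labels, attr, entropy))
}

# Split information: the entropy of the partition induced by the attribute.
split.info <- function(attr) entropy(attr)

# Gain ratio G_r(S, A).
gain.ratio <- function(attr, labels) info.gain(attr, labels) / split.info(attr)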

Computing splits using gain ratio

The entropy of the full dataset is:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} = 0.94 \]

Then we'll compute hypothetical splits on each of the features.

  • Outlook

The information gain on the Outlook attribute is:

\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Sunny} - \left[ \frac{4}{14} (-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}) \right]_{Overcast} - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Rain} = 0.246 \]

And the split information is:

\[ Split(S, Outlook) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577 \]

Therefore the gain ratio is \(\frac{0.246}{1.577}=0.156\) for the Outlook attribute.

  • Temperature

\[ G(S, Temperature) = H(S) - \frac{4}{14} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right) - \frac{6}{14} \left( -\frac{4}{6} \log_2 \frac{4}{6} - \frac{2}{6} \log_2 \frac{2}{6} \right) - \frac{4}{14} \left( -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \right) = 0.029 \]

\[ Split(S, Temperature) = -\frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} - \frac{4}{14} \log_2 \frac{4}{14} = 1.557 \]

And the gain ratio for the Temperature attribute is \(\frac{0.029}{1.557} = 0.019\).

  • Humidity

\[ G(S, Humidity) = H(S) - \frac{7}{14} \left( -\frac{3}{7} \log_2 \frac{3}{7} - \frac{4}{7} \log_2 \frac{4}{7} \right) - \frac{7}{14} \left( -\frac{6}{7} \log_2 \frac{6}{7} - \frac{1}{7} \log_2 \frac{1}{7} \right) = 0.152 \]

\[ Split(S, Humidity) = -\frac{7}{14} \log_2 \frac{7}{14} - \frac{7}{14} \log_2 \frac{7}{14} = 1 \]

The gain ratio is \(\frac{0.152}{1} = 0.152\).

  • Wind

\[ G(S, Wind) = H(S) - \frac{8}{14} \left( -\frac{6}{8} \log_2 \frac{6}{8} - \frac{2}{8} \log_2 \frac{2}{8} \right) - \frac{6}{14} \left( -\frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} \right) = 0.048 \]

\[ Split(S, Wind) = -\frac{8}{14} \log_2 \frac{8}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 0.985 \]

And the gain ratio is \(\frac{0.048}{0.985}=0.0487\).

Branching decision based on gain ratio

Comparing the four candidates, the gain ratios are 0.156 for Outlook, 0.019 for Temperature, 0.152 for Humidity, and 0.049 for Wind. The maximum gain ratio is therefore achieved by Outlook (with Humidity a close second), so in this example the gain ratio criterion happens to choose the same first split as ID3's information gain criterion. The two criteria do not always agree, though: the gain ratio penalizes attributes with many outcomes much more heavily, because their split information is large.
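
To double-check these hand calculations, here is a short verification sketch (again, not part of the original derivation). It rebuilds Table 1 as a data frame, with column names mirroring the table, and applies the gain.ratio helper sketched in the definitions section above:

# Table 1 as a data frame (rows D1..D14, in order).
tennis <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No")

# Gain ratio of each candidate attribute with respect to PlayTennis;
# these should come out near 0.156, 0.019, 0.152 and 0.049.
round(sapply(tennis[c("Outlook", "Temperature", "Humidity", "Wind")],
             gain.ratio, labels = tennis$PlayTennis), 3)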

Monday, May 25, 2015

Building an ID3 decision tree "by hand"


Here we will work out an example of the ID3 decision tree algorithm 'by hand'. This is the precursor to the popular C4.5 decision tree algorithm and is a good place to start to understand how to build a decision tree. We will use the example of trying to decide whether we should play tennis given certain weather conditions. This example is adapted from several similar examples used to illustrate decision trees, such as those in T. Mitchell's text "Machine Learning" and J.R. Quinlan's "C4.5: Programs for Machine Learning".

Defining the data

First we will define the data that we'll be working with. It can be seen in the following table.

Table 1: Some data about weather and whether we will play a sport. Say, tennis (adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59)
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

ID3 Information Gain

Then we will calculate the information gain for each of the candidate attributes. This is the metric ID3 uses to decide which feature to split on. These splits make up the nodes of our decision tree, telling us which branch to traverse given a specific instance (data point) of feature values.

Information gain is calculated as the difference between the entropy of the parent set and the entropy of the child subsets created by splitting on an attribute, with each child weighted by the proportion of examples that fall into it. That's a mouthful. A math expression will do it more justice than words can…

Entropy is defined as:

\[ H(S) = -\sum_i^n p_i \log_2 p_i \]

Here we sum over the \(n\) discrete values that occur in the set \(S\), where \(p_i\) is the proportion of examples taking value \(i\).

The information gain is defined as:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

Thus, the information gain tells us the difference between the entropy of the 'current', unsplit data and the entropy after we have partitioned on a specific feature, with each partition weighted by its share of the examples. We then choose the split that yields the maximum information gain (the maximum reduction in entropy) to define a branching of our decision tree.
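
These two formulas translate almost directly into R. Here is a minimal sketch (not from the original post; the names entropy and info.gain are just illustrative choices) that we can use to check the repeated hand calculations below:

# Entropy of a vector of class labels.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]            # treat 0 * log2(0) as 0 by dropping empty levels
  -sum(p * log2(p))
}

# Information gain G(S, A) of splitting on `attr` with respect to `labels`.
info.gain <- function(attr, labels) {
  w <- table(attr) / length(attr)                          # |S_v| / |S|
  entropy(labels) - sum(w * tapply(labels, attr, entropy))
}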

The first split

So let's go through the calculations for information gain in detail on all of the possible feature splits in the original dataset (and ignoring "Day" since it is more a row label than a feature).

We first need the entropy of the original dataset:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} \]

Since we have 5 examples in which our outcome feature, "PlayTennis", is "No" and 9 where it is "Yes". The numeric value is:

round(-(5/14) * log(5/14, 2) - (9/14) * log(9/14, 2), 3)
0.94

Now we'll compute the information gain that would be achieved by splitting on the first feature, Outlook:

  • Outlook

\[ G(S, Outlook) = H(S) - \frac{|S_{Sunny}|}{|S|} H(S_{Sunny}) - \frac{|S_{Overcast}|}{|S|} H(S_{Overcast}) - \frac{|S_{Rain}|}{|S|} H(S_{Rain}) \] \[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Sunny} - \left[ \frac{4}{14} (-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}) \right]_{Overcast} - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Rain} \]

For example, look at just the "Sunny" value that Outlook can take: of the 5 instances (out of 14) where Outlook is "Sunny", 3 have a PlayTennis value of "Yes" and 2 have "No". We are always interested in the information gain of a feature relative to the outcome variable we are trying to predict with our model.

Therefore, the information gain value for Outlook is:

round(0.94 - (5/14) * (-(3/5)*log(3/5,2)-(2/5)*log(2/5,2)) - (4/14) * (-(4/4)*log(4/4,2) - 0) - (5/14) * (-(3/5)*log(3/5,2) - (2/5)*log(2/5,2)), 3)
0.246

Similarly for the other features:

  • Temperature
round(0.94 - (4/14)*(-(2/4)*log(2/4,2)-(2/4)*log(2/4,2)) - (6/14)*(-(4/6)*log(4/6,2)-(2/6)*log(2/6,2)) - (4/14)*(-(3/4)*log(3/4,2) - (1/4)*log(1/4,2)),3)
0.029
  • Humidity
round(0.94 - (7/14)*(-(3/7)*log(3/7,2) - (4/7)*log(4/7,2)) - (7/14)*(-(6/7)*log(6/7,2) - (1/7)*log(1/7,2)), 3)
0.152
  • Wind
round(0.94 - (8/14)*(-(6/8)*log(6/8,2) - (2/8)*log(2/8,2)) - (6/14)*(-(3/6)*log(3/6,2) - (3/6)*log(3/6,2)), 3)
0.048

So the values for information gain are 0.246, 0.029, 0.152 and 0.048 for the features Outlook, Temperature, Humidity and Wind, respectively. We will choose the feature with the maximum information gain for our first decision tree split, which in our case is Outlook.
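
As a sanity check, the same four numbers can be reproduced with the helpers sketched above; the following sketch rebuilds Table 1 as a data frame (the tennis name and its columns simply mirror the table) and picks the attribute with the largest gain:

# Table 1 as a data frame (rows D1..D14, in order).
tennis <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No")

# Information gain of each candidate attribute with respect to PlayTennis.
gains <- sapply(tennis[c("Outlook", "Temperature", "Humidity", "Wind")],
                info.gain, labels = tennis$PlayTennis)
round(gains, 3)          # roughly 0.246, 0.029, 0.152 and 0.048
names(which.max(gains))  # "Outlook", the feature we split on first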

Our initial decision tree therefore has Outlook at the root, with one branch for each of its values: Sunny, Overcast, and Rain.

So we can say with certainty that if the outlook is overcast, we will play tennis since all values of outlook==overcast result in a "Yes" value of "PlayTennis". If the outlook is rain or sunny, we still don't have a definite answer as to whether we play tennis. Therefore, we continue splitting on other attributes for each of those nodes of our decision tree.

Another branch

At the risk of excess verbosity, we'll create one more branch of the decision tree explicitly to further illustrate the process.

We'll work with the "Outlook==Sunny" branch in which there are 2 "Yes" cases for "PlayTennis" and 3 "No" cases. The reduced set of data (those rows with Outlook==Sunny) we are working with now is therefore:

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D11 Sunny Mild Normal Strong Yes

So now we have a total entropy of:

\[ H(S_{Sunny}) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} \]

Which is:

round(-(2/5)*log(2/5,2) - (3/5)*log(3/5,2),3)
0.971

Computing the information gain on the features:

  • Temperature
0.971 - (2/5) * ( -0 - (2/2)*log(2/2,2) ) - (2/5) * ( -(1/2)*log(1/2,2) - (1/2)*log(1/2,2) ) - (1/5) * ( -(1/1)*log(1/1,2) - 0)
0.571
  • Humidity
0.971 - (3/5) * ( -0 - (3/3)*log(3/3,2) ) - (2/5) * ( -(2/2)*log(2/2,2) - 0 )
0.971
  • Wind
0.971 - (3/5) * ( -(1/3)*log(1/3,2) - (2/3)*log(2/3,2) ) - (2/5) * ( -(1/2)*log(1/2,2) - (1/2)*log(1/2,2) )
0.0200224995673063

Therefore we will split on Humidity since it has the highest information gain. The updated decision tree now tests Humidity under the Sunny branch: Humidity==High leads to a "No" leaf and Humidity==Normal leads to a "Yes" leaf.
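
With the tennis data frame and the info.gain helper from the earlier sketches still in the workspace, these Sunny-branch gains can be confirmed in a couple of lines:

sunny <- subset(tennis, Outlook == "Sunny")
round(sapply(sunny[c("Temperature", "Humidity", "Wind")],
             info.gain, labels = sunny$PlayTennis), 3)
# roughly 0.571, 0.971 and 0.020, matching the hand calculations above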

The final decision tree

Skipping over the calculations for the last split (the Rain branch, which splits on Wind), the final decision tree is: Outlook at the root; under Sunny, a test on Humidity (High gives "No", Normal gives "Yes"); under Overcast, a "Yes" leaf; and under Rain, a test on Wind (Strong gives "No", Weak gives "Yes").

Thursday, May 21, 2015

K-Fold Cross Validation with Decision Trees in R


1 K-Fold Cross Validation with Decision Trees in R   decision_trees machine_learning

1.1 Overview

We are going to go through an example of a k-fold cross validation experiment using a decision tree classifier in R.

K-fold cross-validation is a method for obtaining a robust estimate of the error of a trained classification model.

When we train a predictive model, we want it not only to be accurate on the data used to train it, but also to generalize to samples it has not yet been presented with. A common technique for checking this generalizability is to split the data into a training set and a test set. The model is trained on the training split and then evaluated on the test split, to check that it has not merely learned to be accurate on the training data (overfitting).
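
As a quick illustration of that idea, a simple holdout split on the iris data might look like the sketch below (this snippet is not part of the original experiment, and the 70/30 proportion is an arbitrary choice):

library(rpart)
data(iris)
set.seed(1)                                   # arbitrary seed, for reproducibility
train.idx  <- sample(nrow(iris), size = 0.7 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test  <- iris[-train.idx, ]
holdout.model <- rpart(Species ~ ., data = iris.train, method = "class")
holdout.pred  <- predict(holdout.model, newdata = iris.test, type = "class")
mean(holdout.pred == iris.test$Species)       # accuracy on the held-out 30%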

Similarly, in k-fold cross-validation we split the data into k equally sized subsamples. Then, for each of the k partitions, we hold out the \(i^{th}\) partition, train the model on the other \(k-1\) partitions, and test it on the held-out \(i^{th}\) partition. We then average the error over the test results of all k rounds of training and testing.
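
To make the partitioning step concrete, here is a minimal sketch of one way to build the k subsamples in R; it is the same shuffle-and-cut trick used in the cross-validation loops below:

data(iris)
k <- 10
set.seed(123)
# Shuffle the row indices, cut the shuffled values into k equal-width bins,
# and split the data frame on those bins to get k roughly equal folds.
folds <- split(iris, cut(sample(1:nrow(iris)), breaks = k))
sapply(folds, nrow)   # each of the 10 folds holds about 15 rows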

1.2 Naive Training/Testing

To begin, we will show a naive implementation of a train/test process. In this example, we don't split between the training data and the testing data.

Here, we are training the model on the full dataset.

library(rpart)
data(iris)
rpart.model <- rpart(Species~., data=iris, method="class")
print(rpart.model)
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
  2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
    6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *

And now we test on the same dataset. From this, we obtain a confusion matrix.

rcart.prediction <- predict(rpart.model, newdata=iris, type="class")
confusion.matrix <- table(iris$Species, rcart.prediction)
print(confusion.matrix)
          rcart.prediction
           setosa versicolor virginica
setosa         50          0         0
versicolor      0         49         1
virginica       0          5        45

The resulting accuracy is as follows:

accuracy.percent <- 100*sum(diag(confusion.matrix))/sum(confusion.matrix)
print(paste("accuracy:",accuracy.percent,"%"))
[1] "accuracy: 96 %"

Pretty good. But our model could very well be overfit. If we were to obtain new measurements for each of these species, our accuracy might not be as good, because the model fits the data it was trained on well but may not generalize to new data.

1.3 Using k-fold cross-validation to train and test the model

So let's use k-fold cross-validation to obtain a more generalizable model.

library(plyr)
library(rpart)
set.seed(123)
form <- "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width"
folds <- split(iris, cut(sample(1:nrow(iris)),10))
errs <- rep(NA, length(folds))

for (i in 1:length(folds)) {
 test <- ldply(folds[i], data.frame)
 train <- ldply(folds[-i], data.frame)
 tmp.model <- rpart(form , train, method = "class")
 tmp.predict <- predict(tmp.model, newdata = test, type = "class")
 conf.mat <- table(test$Species, tmp.predict)
 errs[i] <- 1-sum(diag(conf.mat))/sum(conf.mat)
}
print(sprintf("average error using k-fold cross-validation: %.3f percent", 100*mean(errs)))
[1] "average error using k-fold cross-validation: 7.333 percent"

So there we have it. K-fold cross-validation in action. We can see that the error increased compared with simply training and testing on the same data, which suggests that the earlier error estimate was optimistically biased by overfitting.

1.4 k-fold cross-validation with C5.0

Let's do the same thing but with a different decision tree algorithm, C5.0. This is an update of J. Ross Quinlan's popular C4.5 algorithm.

library(C50)
library(plyr)
errs.c50 <- rep(NA, length(folds))
form <- "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width"
folds <- split(iris, cut(sample(1:nrow(iris)),10))
for (i in 1:length(folds)) {
 test <- ldply(folds[i], data.frame)
 train <- ldply(folds[-i], data.frame)
 tmp.model <- C5.0(as.formula(form), train)
 tmp.predict <- predict(tmp.model, newdata=test)
 conf.mat <- table(test$Species, tmp.predict)
 errs.c50[i] <- 1 - sum(diag(conf.mat))/sum(conf.mat)
}

print(sprintf("average error using k-fold cross validation and C5.0 decision tree algorithm: %.3f percent", 100*mean(errs.c50)))
[1] "average error using k-fold cross validation and C5.0 decision tree algorithm: 6.000 percent"