C4.5 Decision Trees with Missing Data
Overview
We often encounter data with missing values; and various machine learning algorithms handle missing data with varying levels of grace. C4.5 is an algorithm that is advertised to be able to handle missing data since there is 'built-in' support for missing values.
In this post, we will walk through exactly how the C4.5 decision tree algorithm deals with missing values. We will do this by working out how the branches are created in detail.
Data
We'll work with the same data that we have used in previous posts on decision tree algorithms so that we are able to compare behaviors, but with the important difference that some of the data is missing (the missing values are marked with "?"). It is repeated below for ease of reference.
This data consists of weather conditions that we would like to use to decide whether we play tennis or choose not to play tennis on a given day. Thus "PlayTennis" is our outcome feature that we would like to build a classifier to predict. And we will continue to ignore the "Day" column since it is being used mainly as a row label rather than as an attribute that we would like to use in our model.
Original (non-missing) weather data
The original dataset (without missing values) is as follows:
Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
---|---|---|---|---|---|
D1 | Sunny | Hot | High | Weak | No |
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
Weather data with missing values
Here is the same data, but with missing values. Values were randomly removed, which is equivalent to the Missing Completely At Random missing data mechanism (missing data mechanisms will be the subject of a future post).
Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
---|---|---|---|---|---|
D1 | Sunny | Hot | High | Weak | No |
D2 | ? | Hot | High | Strong | No |
D3 | ? | ? | High | ? | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | ? | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | ? | Mild | High | ? | No |
D9 | ? | Cool | Normal | Weak | Yes |
D10 | ? | ? | Normal | ? | Yes |
D11 | ? | Mild | Normal | ? | Yes |
D12 | Overcast | Mild | ? | Strong | Yes |
D13 | Overcast | Hot | ? | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
Altered gain and split info calculations
In C4.5, information gain and split info are used to calculate the gain ratio value (see other posts on C4.5 for definitions of gain ratio and split info). However, in the presences of missing values, J.R. Quinlan defines an altered version of these quantities to make the splitting criteria robust to missing data.
Information gain is simply weighted by the proportion of missing values on a particular attribute:
\[ G_m(S,A) = \frac{|S_{A}| - |M_{A}|}{|S_{A}|} G(S,A) \]
Where \(|M_{A}|\) is the number of missing values for attribute \(A\).
The new entropy value is calculated with only the non-missing values:
\[ H(S) = -\frac{4}{8} \log_2 \frac{4}{8} - \frac{4}{8} \log_2 \frac{4}{8} = 1 \]
So using the updated definition of information gain on the Outlook attribute would proceed as follows (using the data from the table with the missing values):
\[ G_m(S, Outlook) = \frac{8}{14} \left[ H(S) - \frac{1}{8}\left( -\frac{1}{1} \log_2 \frac{1}{1} - 0 \right)_{Sunny} - \frac{3}{8} \left( -\frac{3}{3} \log_2 \frac{3}{3} - 0 \right)_{Overcast} - \frac{4}{8} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right)_{Rain} \right] = \frac{8}{14} ( 1 - \frac{1}{2} ) = 0.286 \]
Split info is calculated the same as before but with the missing values considered a separate state that an attribute can take.
\[ Split_m(S, Outlook) = -\frac{1}{14} \log_2 \frac{1}{14} - \frac{3}{14} \log_2 \frac{3}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 1.659 \]
Therefore the gain ratio for Outlook would be the ratio of the updated information gain to the updated split info: \(\frac{0.286}{1.659}=0.172\).
Doing similar calculations for the rest of the attributes yields the following values:
Attribute | Gain Ratio |
---|---|
Outlook | 0.172 |
Temperature | 0.028 |
Humidity | 0.081 |
Wind | 0.066 |
We pick the attribute with the highest gain ratio on which to define a new branc in the decision tree, which in this case is Outlook. This initial tree looks – at least structurally – the same as the one constructed without missing values:
But there is an ambiguity that we must consider when choosing which rows are assigned to which branches of the tree.
Partitioning the training set
Now that we have made the decision to split based on the Outlook attribute, we need to decide which of the outcome values each row is 'assigned to'. This is a clear choice for those rows without missing values. But we face a problem when it comes to deciding based on the rows that contain missing values for a particular attribute.
For example, when we look at the D2 row we see that we do not know what the Outlook vale is. So which branch do we follow?
The way that Quinlan dealt with this problem was to assign these ambiguous cases to all of the newly created descendant nodes with a weight that is proportional to the number of non-missing instances of each classification. If we take Outlook==Sunny as an example, we have one example for which we know Outlook is Sunny and its class is PlayTennis==No. However, we have \(6\) instances in which we do not know the Outlook, 4 of which are PlayTennis==Yes and 2 of which are PlayTennis==No. We simply assign a fractional weight value of the number of Outlook==Sunny instances to the number of other known instances for Outlook (\(\frac{1}{8}\)) to each of these rows with unknown Outlook.
Classifying an instance with missing values
When we come across an instance that we would like to classify that has missing values, we explore all possibilities that the missing value can take when we reach the decision split in our classification tree. Then once we reach the leaves of the tree for each of the possible paths, we choose the path with the highest probability value given the weighted instaces of PlayTennis==Yes and PlayTennis==No at each branch.
Can you give an example for the last section: "classifying an instance with missing values"? It's hard to understand
ReplyDeleteHow are you getting 1.659? I punched that same formula in Excel and I'm getting 1.778. Here is my formula (just copy and paste in Excel):
ReplyDelete= (-LOG(1/14,2) * 1/14 - LOG(3/14,2) * 3/14 - LOG(4/14,2) * 4/14 - LOG(6/14,2) * 6/14)
Um isn't your overall entropy with non missing values for outlook wrong? you have 5 yes and 3 no. so how come 4/8?
ReplyDeleteDo you know of any python implementation that correctly takes missing values into account? Is the Weka/Java's J4.8 the only correct implementation?
ReplyDelete