Friday, September 11, 2015

Shell-Based ETL for Algorithm Change Tests

Introduction

As promised, I will be adding a little ETL into the mix of my posts. What is ETL you ask? Well, it stands for "Extract Transform Load" and is shorthand for what some might call "data munging".

Throughout the course of data science/analytics work, we deal with a lot of data. And it is not always in the right format or the right place for us to distill insights from it. This is where ETL comes in: it is the process of 'extracting' data from wherever and in whatever form it came, 'transforming' it, and then 'loading' it into the form and place you need in order to perform your analysis.

There are several software packages devoted solely to the process of ETL. I'll get into some of those in the future. But the minimal tools available at the shell (yes, I'm assuming you are using some version of Linux/Unix/OSX) can be surprisingly powerful. In this post, I'll show a short example of a couple of shell commands that are key to doing simple one-off data-munging tasks.

The scenario

I had been working on a custom data analytics tool that relies on an iterative optimization method. I was writing custom code to improve the way that the tool found solutions to the optimization problem. Thus, I could use the tool to give me information about the 'before' and 'after': that is, 'before I added the new functionality' and 'after I added the new functionality'. I wanted to verify that adding the new functionality resulted in a significant improvement in the analytics tool.

In the end, the functionality was trying to 'keep a value low' (sorry for being vague, but I cannot reveal exactly what I was doing due to confidentiality constraints). Since the tool was iterative, I knew that its output included the iteration number (which we will call I) as well as the quantity that we were trying to keep low, which we'll call Y for brevity. If my custom update were a success, the value we are trying to keep low would be significantly lower at the end of the process than if my added code were not there. The tool also output a bunch of irrelevant information that I wanted to filter out.

Using 'grep'

I had to identify text in the output that would act as a signal for my munging method to extract only the pertinent lines and not all of the other mess. This is a job for the shell utility 'grep', which simply selects (or filters out) lines of text that match a regular expression. In my case, I noticed that the characters '>>>' appeared on each line that contained both I (the iteration number) and Y. Therefore a simple grep expression would suffice to eliminate all of the lines that do not contain '>>>', like so:

./run_analytics_tool | grep '>>>'

Note that I am running the analytics tool with './run_analytics_tool' and then 'piping' its output into 'grep' to select only the lines that contain the string '>>>'. What this gives me is output that looks something like this:

>>>I,X1,X2,X3,X4,Y
>>>1,18.837,18.279,1,1,1
>>>2,18.557,18.279,1.94,1,1.04
>>>3,18.232,14.239,2.8,2,1.04
>>>4,18.079,14.239,3.18,2,1.28
>>>5,16.312,12.705,3.48,3,2.12
>>>6,14.266,12.705,3.48,3,3.28
>>>7,13.368,10.339,3.44,4,4.28
>>>8,12.761,10.339,3.78,4,4.86
>>>9,12.24,10.339,4.42,4,4.72
>>>10,11.234,10.339,4.74,4,4
>>>11,10.732,8.851,5.3,5,3.96
>>>12,10.972,8.851,6.4,5,4.04
>>>13,10.46,8.851,5.9,5,4.94
>>>14,9.537,8.851,4.84,5,5.8
>>>15,9.378,8.851,5.46,5,5.76
>>>16,9.401,8.851,6.06,6,5.76
>>>17,9.412,8.851,6.86,6,5.66
>>>18,9.386,8.851,7.52,6,5.7
>>>19,9.444,8.851,9.06,6,5.66
>>>20,9.444,8.851,9.06,6,5.66

We don't care about the X1-X4 values, only the I and Y values. So we need some way to select specific columns from a stream of comma-separated values. This is where 'awk' comes in.

Using 'awk'

Awk is a tool that allows us to select delimited columns and rows and operate on them. It is an extremely useful ETL tool.

In our case, we need to simply select certain columns (columns 1 and 6) from a stream of comma-delimited values. The way that we can do this is as follows:

./run_analytics_tool | grep '>>>' | awk -F "," '{OFS=","} {print $1,$6;}'

This gives us the following:

>>>I,Y
>>>1,1
>>>2,1.04
>>>3,1.04
>>>4,1.28
>>>5,2.12
>>>6,3.28
>>>7,4.28
>>>8,4.86
>>>9,4.72
>>>10,4
>>>11,3.96
>>>12,4.04
>>>13,4.94
>>>14,5.8
>>>15,5.76
>>>16,5.76
>>>17,5.66
>>>18,5.7
>>>19,5.66
>>>20,5.66

Here I am using the awk switch "-F" to define the incoming delimiter as a comma rather than the default whitespace. Similarly, I am setting a comma as the "output field separator" (thus OFS) with the first section of the awk program in single quotes, '{OFS=","}'. Then I am telling awk to print just the first and sixth values of each line in the stream ('{print $1,$6;}'). Note that, again, I am 'piping' values from grep in a cascading manner to be fed into awk. Now there is a problem: our first column has a pesky string at the beginning, the very one we used to our advantage for filtering with grep: '>>>'. We'd like to remove that. For this task, 'sed' is our tool…

Using 'sed'

Sed, the 'stream editor', is a replacement tool. It allows us to operate on a stream of text and transform it by matching a regular expression and replacing what is matched with something else. Here we need to perform the simple task of matching '>>>' on each line and replacing it with nothing, i.e. the empty string ''. We do this as follows:

./run_analytics_tool | grep '>>>' | awk -F "," '{OFS=","} {print $1,$6;}' | sed 's/>>>//'

Again, I am piping the text from the output of awk to the input of sed. This sed expression says to 'substitute' (thus 's') the value between the first two forward slashes ('>>>') with the value between the second and third forward slashes (the empty string). What this gives us is what we wanted:

I,Y
1,1
2,1.04
3,1.04
4,1.28
5,2.12
6,3.28
7,4.28
8,4.86
9,4.72
10,4
11,3.96
12,4.04
13,4.94
14,5.8
15,5.76
16,5.76
17,5.66
18,5.7
19,5.66
20,5.66

Now to see if there is a difference

What we ultimately wanted was a way to test whether the change made a difference in our analytics tool. As such, we would like to generate a set of independent 'before' and 'after' samples that we can test statistically. We do this by wrapping all of what we've done until now into a loop and writing the results to files:

for i in {1..100}
do
 ./run_analytics_tool | grep '^>>>' | awk -F "," '{OFS=","} {print $1,$6;}' | sed 's/>>>//' > test${i}_before.csv;
done

## make the change to the analytics tool codebase here
for i in {1..100}
do
 ./run_analytics_tool | grep '^>>>' | awk -F "," '{OFS=","} {print $1,$6;}' | sed 's/>>>//' > test${i}_after.csv;
done

This gives us 100 samples of the analytics tool run. We do this with the analytics tool before adding the new functionality and after, giving us a total of 200 samples with two treatments. Since this is an iterative tool, we cannot assume that the values found in the iterations themselves are independent, so we will take the last value of Y for each of our independent runs (each iteration of the for loop). Then we'll do a statistical test on those using R to see if the change ended up doing what we wanted to, which was to keep the value Y lower at the end of the iteration runs than if my change weren't implemented in the code.

To do this we will import the data into R from the files that we created (the tool's actual column names, generation and ave.complexity, correspond to the I and Y used above, and are what appear in the code below):

library(plyr)
df.before <- ldply(list.files(pattern="test.*_before.csv"),
       function(filename) {
        dum = read.csv(filename)
        dum$filename = filename
        return(dum)
       })
df.after <-  ldply(list.files(pattern="test.*_after.csv"),
       function(filename) {
        dum = read.csv(filename)
        dum$filename = filename
        return(dum)
       })

## get it into a single data frame
## but preserve an indicator of which 'treatment' it came from
df.all <- rbind(cbind(df.before, treatment="before"),
    cbind(df.after,  treatment="after"))

Since we are only comparing the end iterations, let's just extract those and drop the rest:

df.all.last <- df.all[df.all$generation==max(df.all$generation),]
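
Note that this assumes every run reached the same final iteration count. If the runs could stop at different iterations, a safer variant is to take the last row per run. A minimal sketch using the plyr package we already loaded and the filename column we attached above:

library(plyr)
## keep the row with the largest iteration count within each run and treatment
df.all.last <- ddply(df.all, c("filename", "treatment"),
                     function(d) d[which.max(d$generation),
                                   c("generation", "ave.complexity")])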

And we will assume normality for purposes of illustration and do a t-test to determine whether the change resulted in an improvement in the analytics tool:

t.test(ave.complexity~treatment, df.all.last)


 Welch Two Sample t-test

data:  ave.complexity by treatment
t = 22.301, df = 118.55, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 6.718873 8.028327
sample estimates:
mean in group before  mean in group after 
             10.6382               3.2646

Conclusion

In this post, we demonstrated basic usage of shell-based 'ETL tools': grep, awk and sed. We used the scenario in which we made a change to an algorithm within an analytics tool. This tool had a good deal of useful information in its output as well as a lot of junk that we didn't care about. We filtered out the junk and obtained information that was useful using munging methods. From this information we compared the analytics tool before the algorithm change and after and demonstrated that the change resulted in a significant improvement to the tool.

Wednesday, August 19, 2015

Data Tables in R


Introduction

My posts have been pretty machine learning and data mining heavy until now. And although machine learning and data mining are arguably the most important aspects of what we have come to call 'data science', there are a couple of other components that we cannot ignore: how we store data and how we 'munge' or manipulate data. I promise I will get to these aspects of data science in future posts. As a segue into those topics, today's post is about a package that I've recently discovered: a convenient way to store data in R (well, at least in memory) and get useful aggregate information from it.

Usually we work with the 'data frame' class in R to store the data that we are working with in memory. Data frames are one of the reasons that I choose to do my analytics in R versus another environment such as Matlab (although the folks at Mathworks - the makers of Matlab - have added a data frame-like table type in recent years, it doesn't match the power of the R version). Data frames are useful because - among other reasons - we can work with data in a natural way. Columns are named and typed as numeric, character or factor values. And we can store mixed data in data frames - not all columns are required to be the same type (as is the case for matrices).

However, I've recently discovered the 'data.table' library. I was looking for a quick way to operate on grouped data in a data frame and data tables seemed to fit the bill. I use plyr all the time for this type of operation; and plyr is very good at it (if you don't use it, I would recommend starting). But I was doing something very simple (finding min and max by a factor) and the data.table library came up in a search. And it looks like a package that I will probably start incorporating into my set of go-to libraries in R.

Oh, and I know there is a newer version of plyr (dplyr), but I haven't had time to make the switch. Someday…

Example Data

I'll generate some simple example data for demonstration purposes. The data will start its life as a data frame. It is just two columns: a numeric column of uniformly-distributed random values and a column of groups, randomly chosen:

df <- data.frame(x=runif(20),
     y=sample(c("A","B"), size=20, replace=TRUE))
print(df[order(df$y),])
0.970504403812811 A
0.64111499930732 A
0.915523204486817 A
0.510243171826005 A
0.0864250543527305 A
0.104028517380357 A
0.230543906101957 A
0.274059027433395 A
0.825932029169053 A
0.151608517859131 B
0.0283026716206223 B
0.606306979199871 B
0.0992447617463768 B
0.313594420906156 B
0.610857902327552 B
0.588633166160434 B
0.141868675360456 B
0.140498693101108 B
0.219244148349389 B
0.664096125168726 B

The plyr version

So what I would usually do in plyr to find, say, the min value for each group is something like this:

library(plyr)
df.agg <- ddply(df, c("y"), summarize, min.val.per.group=min(x))
A 0.0864250543527305
B 0.0283026716206223

As you can see, this gives us a new data frame with the minimum values for each group, A and B. A similar thing could be done for the max values.

Don't get me wrong, I love plyr and will continue to use it. But data.table offers a different, intriguing approach to working with grouped frames of data.

The data.table version

In contrast to data frames, data tables maintain a "key" that can be made up of the columns of the table.

First we'll convert our original data frame to a data table. Then we set the key to be the "y" column (remember, the key can actually be a combination of columns if desired). Lastly, we use the tables() function to print out the tables that we have access to in memory (just our dt table right now) and some info on that table, such as its key.

library(data.table)
dt <- as.data.table(df)
setkey(dt, y)
tables()
NAME NROW NCOL MB COLS KEY
dt     20    2  1  x,y   y

Having a key gives us access to a simple and concise means by which we can group information on the table as well as filter it. If we only want the "A" group values on our table, we can filter it in this way:

dt["A",]
0.970504403812811 A
0.64111499930732 A
0.915523204486817 A
0.510243171826005 A
0.0864250543527305 A
0.104028517380357 A
0.230543906101957 A
0.274059027433395 A
0.825932029169053 A

Also note that the comma is optional:

dt["A"]
0.970504403812811 A
0.64111499930732 A
0.915523204486817 A
0.510243171826005 A
0.0864250543527305 A
0.104028517380357 A
0.230543906101957 A
0.274059027433395 A
0.825932029169053 A

And we can perform actions on the groups using the second index into our table. So if we want to take the min of the "A" group, we can do it like this:

dt["A", min(x)]
0.0864250543527305

But we can also get the minimum value for each key (in our case the key is just the y column, which delineates our groups) succinctly by using the by keyword in the second index of the data table:

dt[, min(x), by=y]
A 0.0864250543527305
B 0.0283026716206223

And notice that we don't have to address our columns explicitly as columns of the table (e.g. dt$x or dt$y), making things even more concise.
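
Since the original itch that led me to data.table was finding the min and max by a factor, here is a quick sketch of getting both in one shot by returning a list in the j expression (using the column names from our example):

dt[, list(min.val=min(x), max.val=max(x)), by=y]
A 0.0864250543527305 0.970504403812811
B 0.0283026716206223 0.664096125168726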

Conclusion

We only went through a brief and simple introduction of the data.table library in R. But hopefully that is enough to make you interested in pursuing this fairly simple but useful library further. I, for one, will be continuing to use it in my daily routine of analytics work thanks to its simplicity and ability to operate on groups of rows in such a succinct way.

Tuesday, August 18, 2015

Association Rule Mining and APriori

Introduction

As you may already know, association rule mining is the process of discovering rules in data based on the frequent co-occurrence of items in the dataset. It is one of the more well-known techniques in data mining; primarily as the result of demonstrated successes in applications to discovering grocery items that individuals are likely to purchase together in a process usually referred to as market basket analysis.

The general idea is that we take a preferably large set of data and winnow it down to only those sets of items/features that co-occur frequently and in some cases imply one another. From these sets we form implication rules such as \(A \land B \land C \rightarrow D\).

In this post, I will discuss some of the metrics used in discovering rules and will briefly discuss one of the more common algorithms used in association rule mining: the APriori algorithm.

Example Setup

To introduce association rule mining, I will work from an example dataset of grocery items that were purchased by a hypothetical set of customers. The data itself can be seen in the following table.

transaction id apples oranges soda diapers candy bar bread milk butter beer
1   1   1         1
2 1     1     1    
3       1   1 1 1 1
4 1 1 1       1    
5 1 1 1 1 1   1    
6     1     1     1
7   1 1   1       1
8     1 1         1
9 1     1   1     1
10 1     1         1
11       1   1     1
12     1   1 1 1    
13 1 1   1   1 1   1
14     1 1     1    
15   1           1  

In association rule mining, the term for a row of data is a transaction. Here we have a set of 15 transactions in a grocery store. The items that were purchased are marked with a 1 and those that were not purchased by a particular customer are left blank.

We have a simple example here in order to introduce the concepts; however, we do not need to limit ourselves to a set of binary relations of {X = 1} or {X = empty} (where X can be any of the items, apples, oranges, soda, etc.). We could just as easily have allowed our X to take on numerical values if we were dealing with a different set of columns in our table. The one thing that is required is that our data is somehow able to be coaxed into a finite set of values. For example if our feature columns are real-valued, we would need to split them at boundaries so that we can run association rule mining algorithms on them.

In order to discover which of these items tend to be purchased together and which items may lead to another item being purchased we will introduce means for testing their frequency of co-occurrence in the following section.

Finding Frequent Sets

To find the frequent sets we often use the concepts of support, confidence and lift. These are the basic measures that determine which rules we can form.

Support tells us how often a particular item (or set of items) is found in the transactions. It can be measured in two ways: the number of transactions in which the itemset occurs, or the fraction of transactions in which it occurs. These are known as the absolute and relative support, respectively.

Confidence builds on the support measurement and gives us information about potential implication between items. That is, it lets us start to make inferences about whether the presence of one set of items implies that another item or set of items is likely to be present in the same type of transaction. Confidence is simply the support of a set of items divided by the support of a subset of those items. E.g. for items X, Y and Z, the confidence that Z is present when X and Y are found in a transaction is calculated as \(\frac{supp(X,Y,Z)}{supp(X,Y)}\) (with supp() referring to the support value), i.e. the ratio of the support of X, Y and Z to the support of just X and Y. Parallels are often drawn between confidence and conditional probability since their definitions are very similar (the conditional probability of event X given event Y is defined as \(P(X|Y) = \frac{P(X,Y)}{P(Y)}\), and similarly the conditional probability of event Z given events X and Y is \(P(Z|X,Y)=\frac{P(X,Y,Z)}{P(X,Y)}\)). The difference lies in the fact that in calculating confidence we are concerned solely with frequencies rather than probabilities.

The last quantity that we will define before doing actual calculations is lift. Lift is defined as the ratio of the support of items X and Y and the 'independent' support of items X and Y (if you would forgive the reference to the terminology of probability theory). Thus, the lift is defined mathematically as:

$$lift(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \cdot supp(Y)}$$

The lift gives us an indication of how often items are found together versus how often we would expect to find them together if they occurred independently.

Example Calculations

Let's calculate the (relative) support for each of the individual items first. There are 15 transactions in total, which gives us the following support for each item: apples=6/15, oranges=6/15, soda=7/15, diapers=10/15, candy bar=3/15, bread=6/15, milk=7/15, butter=2/15 and beer=9/15. That was easy enough for our 9 items. Now let's consider the support for two-item sets (items in groupings of 2). There are \({9 \choose 2}=36\) different ways of combining our items in groups of two. That's a lot to list, so we'll just do a couple of them. Let's look at soda and candy bars for starters. Out of the 15 transactions, we have 3 instances in which soda and candy bars are bought together. If we look at apples and candy bars, we have 1 out of 15 transactions in which both are found together. How about beer and diapers? Both are found together in 7 of the 15 transactions.

These are rough indications of how often some items are found together, but on their own they do not tell us much. Even though soda and candy bars are found together in only 3 of the 15 transactions, it could be that the two items do have a connection but simply are not bought by many customers. Indeed, if we look at the confidence calculation for soda given candy bars, we have that the confidence is \(\frac{supp(soda, candy bar)}{supp(candy bar)} = \frac{3}{3} = 1.0\). Confidence has a range of 0.0 to 1.0. Thus with a confidence of 1.0, we would (if the transaction set were bigger) say that we have a strong belief that people who buy a candy bar are likely to also buy a soda. How about the frequency of people buying beer given that they have diapers in their shopping cart? The confidence is \(\frac{7}{10}=0.7\), which is a pretty high value as well. Therefore, we might infer that someone buying diapers (and so presumably with a baby at home) has a tendency to also buy beer. In fact it is this relation that has been used to popularize association rule mining techniques: in an actual application of association rule mining, this beer and diaper relationship was discovered. If you think about it, it makes sense. A screaming baby at home might make a frustrated mother or father want a cold beer to take the edge off.

Let's look at the lift of the soda/candy bar and diaper/beer relationships. The lift of the soda/candy bar relationship is \(\frac{3/15}{(7/15)\cdot(3/15)}=2.14\) and the lift of the diaper/beer relationship is \(\frac{7/15}{(10/15)\cdot(9/15)}=1.17\). Generally, the higher the lift, the stronger the relation between the items.
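
To make the arithmetic concrete, here is a quick sketch of these three measures in R, plugging in the counts we read off the table for the diapers/beer pair:

n.trans   <- 15   # total number of transactions
n.diapers <- 10   # transactions containing diapers
n.beer    <- 9    # transactions containing beer
n.both    <- 7    # transactions containing both diapers and beer

supp.both <- n.both / n.trans       # relative support of {diapers, beer}
conf      <- n.both / n.diapers     # confidence of diapers => beer: 0.7
lift      <- supp.both / ((n.diapers / n.trans) * (n.beer / n.trans))   # about 1.17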

There are many other calculations that we can do to find relationships between sets: conviction, leverage, collective strength to name a few. We'll stop with these three basic calculations for this post.

You may have noticed that we have something of a combinatorial explosion as we want to calculate frequency quantities of increasing set size. For a 'group' size of 1, we only had to do \(k=9\) support calculations (one for each 'feature' or basket item). For groups of two, we had to calculate \({9 \choose 2}\) support values; and groups of 3 would require \({9 \choose 3}\) calculations. Imagine if we had more than 9 available grocery items. This is where calculations of combinations of items start to become expensive. As such, we need algorithms that do this efficiently.

A Priori

The A Priori algorithm is a way to efficiently find frequent itemsets in large transaction sets with many possible items/features. It relies on a monotonicity property of itemsets called the 'downward closure lemma'. In simple terms: if an itemset is not frequent, then no superset of it can be frequent either. This prunes the number of candidate combinations at larger set sizes. For example, we can see that the itemset {butter, candy bar} has zero occurrences; therefore, we do not need to calculate the frequency of any superset of it, such as {butter, candy bar, apple}, since we already know there are no such combinations. This keeps the number of large-cardinality itemsets we must examine manageable. In practice, we also typically limit the maximum size of the itemsets that we explore.
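
Of course, for a real dataset we would not do these calculations by hand. Here is a minimal sketch using the arules package in R; the tiny transaction list is made up purely for illustration (it is not the table above), and the thresholds are arbitrary:

library(arules)

## a few made-up transactions, each a vector of items
trans.list <- list(c("soda", "candy bar"),
                   c("diapers", "beer", "bread"),
                   c("diapers", "beer"),
                   c("apples", "milk"),
                   c("soda", "candy bar", "diapers", "beer"))
trans <- as(trans.list, "transactions")

## mine rules meeting minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6))
inspect(rules)   # each rule is reported with its support, confidence and lift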

Conclusion

We went over a brief summary of some key concepts of association rule mining, one of the most popular data mining strategies. In future posts, we will go into more depth about some of the other metrics used to find frequent itemsets in a transaction database, and explore the influence of missing data on itemset discovery.

Thursday, July 2, 2015

Comparing white box models for wine classification

Introduction

This will be a bit more of an 'applied' post. We'll compare so-called 'white box' models on the wine dataset from the well-known UCI machine learning repository (http://archive.ics.uci.edu/ml/).

A white box model is a machine learning model whose inner workings are visible to us after we have trained it. That is, if we are building a classification model, we can see how a classification decision is being made: how the features are used in determining which class a particular sample should be placed in. This is in contrast to a black box model, which arrives at a classification decision without giving us any insight into how it made that determination (an artificial neural network is a good example of a black box model).

Decision tree, decision list and rule-based algorithms are typical examples of white box models. When we build a decision tree model, we can classify a new instance by starting at the root of the tree and tracing down its branches, following the conditions at each branching point, until we come to a leaf, which gives us the class of the instance. Similarly, with lists and rules we can find the rules or list entries whose conditions are satisfied by the instance that we would like to classify. The trees produced by many decision tree algorithms, such as C4.5 (J48), can also be transformed into a set of rules.

Being able to see how features are used in classifying a dataset gives us insight into what features are important in determining the class of that dataset. Hence these models give us an intuitive understanding of the dataset. As such, the UCI wine dataset is a nice application of these types of algorithms because we can try to see what qualities in wine are important in determining the output class, which is a score for the wine's quality. We will stick to just the red wine dataset in this example, but the same methods can be used on the white wine dataset or the combination of the two.

Tools

In this post, I will be breaking from the mold and using Weka (http://www.cs.waikato.ac.nz/ml/weka/) rather than my usual analytics tool, R. When I do a set of comparisons between various machine learning algorithms, I find that Weka is the easiest tool to quickly work through a large set of algorithms with basic parameter settings to compare their behavior. It is a very handy tool for doing things quickly and when you don't need to worry about automation of your analysis processes (of course I am strictly speaking to the UI tool, but Weka is actually a machine learning/data mining API in Java for those who are inclined to build software around its algorithms).

Analysis

We'll be approaching the wine dataset from the perspective of classifying the samples with respect to their quality. Thus, I have forced the quality feature to be nominal rather than numeric in the ARFF file (ARFF is a Weka-specific file format – for more information on ARFF see here http://www.cs.waikato.ac.nz/ml/weka/arff.html). The relevant line of the ARFF file defining the nominal quality feature is as follows:

@attribute quality {'3','4','5','6','7','8'}

And I placed single quotes around each of the quality data values in the data instances (for ways to do this efficiently, stay tuned for some posts on Extract-Transform-Load methods in the near future).
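
One quick way to do the conversion without hand-editing the file, sketched in R: the file name below and the use of the foreign package's write.arff are assumptions on my part (RWeka provides a similar function), and making quality a factor is what causes it to be written as a nominal attribute.

library(foreign)

## the UCI red wine file is semicolon-delimited
wine <- read.csv("winequality-red.csv", sep = ";")

## convert the quality score to a factor so it becomes a nominal ARFF attribute
wine$quality <- as.factor(wine$quality)
write.arff(wine, file = "winequality-red.arff")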

First, we'll look at the accuracy of various white-box classification algorithms. I've compiled a set of accuracies in the following table. These accuracies are the result of 10-fold cross validation.

name in WEKA    classification accuracy (%)    percent improvement over 'identity rule' (%)
ZeroR 42.5891 0.
ConjunctiveRule 56.2226 32.011712
JRip 56.2226 32.011712
OneR 54.6592 28.340820
PART 61.3508 44.052821
Ridor 52.4703 23.201242
BFTree 60.2877 41.556642
DecisionStump 55.3471 29.956022
FT 57.2858 34.508125
J48 61.4759 44.346558
J48graft 62.789 47.429741
LADTree 59.162 38.913478
LMT 60.4128 41.850380
NBTree 58.5366 37.445027
RandomTree 62.414 46.549234
REPTree 58.4115 37.151290
SimpleCART 60.1626 41.262905
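
The 'percent improvement' column is simply each classifier's relative accuracy gain over the ZeroR baseline; here is a quick sketch of that calculation in R:

baseline <- 42.5891                  # ZeroR accuracy
acc      <- 56.2226                  # e.g. ConjunctiveRule
100 * (acc - baseline) / baseline    # 32.011712, as in the table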

…and those values in graphical form (using R for this part :)):

library(ggplot2)
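## df is assumed to hold the accuracy table above, read in with read.csv()
## (hence the dot-mangled, X.-prefixed column names below)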
p <- ggplot(df, aes(X.name.in.WEKA., X.percent.improvement.over..identity.rule.., 
     col=X.name.in.WEKA.)) 
p + geom_point(size=6, pch=9) +
 theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="none") +
  geom_abline(slope=0, intercept=0) +
   xlab("Algorithm") +
    ylab("Improvement over 'identity rule'")

Discussion

We can see that the highest-accuracy classifier is the grafted J48 algorithm (note that J48 is Weka's implementation of C4.5). Note also that the plot compares not the absolute accuracies of the algorithms but their percent improvement over the 'identity rule' algorithm (ZeroR), which simply assigns every instance to the most commonly occurring class. It is a good baseline for comparison because it is about the simplest 'classifier' we could use, so we take it as the minimum accuracy that we should be able to achieve.

The decision tree models tend to be quite large: the best classifier, grafted J48, has a tree size of 693 nodes with 347 leaves. So for the sake of brevity (and in the spirit of Occam's Razor), we'll examine the makeup of a smaller model, the JRip rule learner.

JRip

There are 6 levels of quality in the wine data: a minimum quality value of 3 and a maximum of 8. The JRip classifier gives us a set of rules (listed in full at the end of this section) whose conditions are boolean comparisons of the features against numeric thresholds and whose consequent is the class label. The first such rule gives us an indication of whether a wine will be of the highest quality (8): if the alcohol level is at least medium to high (sommeliers would call this medium, medium plus or high alcohol), the sulphates are between 0.69 and 0.74 and the chlorides are at least 0.06. A similar antecedent can be seen in the second rule with different alcohol and sulphate levels but without the chlorides. Together these rules give the impression that a slightly higher amount of sulphates (between 0.82 and 0.86 rather than 0.69 to 0.74) may reduce the need for the higher level of chlorides seen in the lower-sulphate rule.

On the other side of the quality spectrum, there is a single rule for wines with a quality value of 4, which is pretty low (note that there are no rules at all for the lowest quality value of 3). It states that if the volatile acidity is at least 1.02 (the rule also carries a redundant condition of at least 0.755) and the fixed acidity is at most 7.5, then the quality value is 4.

Now to try to interpret these. If we had a sommelier at our disposal, we could at this stage present what we've found to them to see if they make intuitive sense with their domain knowledge. In the absence of a sommelier, we'll rely on some quick internet searches for interpretation.

Sulphates are sometimes added to wine to increase acidity (Modern Winemaking, P. Jakisch). The particular sulphate measured in this dataset is potassium sulphate, which is a fertilizer. If we look at a summary of the sulphates distribution in the data (below), the rule indicates that a medium to medium-high amount of sulphates occurs in high-quality wines. Somewhat counterintuitive, but it could be that this level of fertilizer is ideal for growing the best grapes. The viticulture publication "Sulphate of Potash and Wine Grape" by the Tessenderlo fertilizer manufacturer (http://www.tessenderlo.com/) states that "Potassium, delivered in the form of sulphate of potash (SOP) is always preferable, notably because of its beneficial role in the formation of sugars and organoleptic constituents, the contents of which will determine the quality of the wine." So we have at least a tenuous link between wine quality and sulphates. Again, a domain expert such as a vintner or sommelier would be the person to collaborate with to interpret these findings.

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Chlorides are components of salts; this dataset describes levels of sodium chloride, or table salt. The presence of sodium chloride in wine can signify that a wine came from a vineyard near the sea coast or that salt was added (via the International Organization of Viticulture). I could find no direct ties between salt and quality. Therefore we may have discovered an unknown relation… or a specious one!

(alcohol >= 12.5) and (sulphates >= 0.69) and (sulphates <= 0.74) and (chlorides >= 0.06) => quality=8 (7.0/2.0)
(alcohol >= 12.6) and (sulphates >= 0.82) and (sulphates <= 0.86) => quality=8 (7.0/3.0)
(volatile.acidity >= 0.755) and (volatile.acidity >= 1.02) and (fixed.acidity <= 7.5) => quality=4 (9.0/3.0)
(alcohol >= 10.5) and (volatile.acidity <= 0.37) and (sulphates >= 0.73) and (density >= 0.9976) => quality=7 (18.0/3.0)
(alcohol >= 10.5) and (sulphates >= 0.73) and (alcohol >= 11.7) => quality=7 (78.0/27.0)
(alcohol >= 10.5) and (volatile.acidity <= 0.37) and (density <= 0.99536) and (volatile.acidity >= 0.28) and (citric.acid >= 0.34) and (residual.sugar >= 2.1) => quality=7 (19.0/3.0)
(alcohol >= 11) and (total.sulfur.dioxide <= 15) and (alcohol >= 11.6) and (sulphates >= 0.59) => quality=7 (17.0/3.0)
(alcohol >= 10.5) and (volatile.acidity <= 0.37) and (pH <= 3.27) and (alcohol >= 11.1) and (alcohol <= 11.7) => quality=7 (21.0/7.0)
(alcohol >= 10.3) and (free.sulfur.dioxide >= 13) and (density <= 0.9962) and (citric.acid <= 0.07) and (density >= 0.99522) => quality=6 (34.0/3.0)
(alcohol >= 10.033333) and (alcohol >= 11.4) and (residual.sugar <= 2.4) => quality=6 (118.0/31.0)
(sulphates >= 0.58) and (alcohol >= 10.3) => quality=6 (327.0/144.0)
(sulphates >= 0.59) and (total.sulfur.dioxide <= 28) and (volatile.acidity <= 0.55) => quality=6 (88.0/27.0)
(volatile.acidity <= 0.545) and (alcohol >= 9.9) => quality=6 (113.0/51.0)
 => quality=5 (743.0/229.0)

We can proceed in a similar way for each of the white box classification algorithms. It is clear that either having some experience in the data domain or having a domain expert to guide you can be a great help in interpreting these models.

Conclusion

We have investigated the application of white box classification models to the UCI repository wine dataset. We determined the accuracy of several classifiers and investigated the structure of some of the simpler of the white box models produced. Maybe in the future we'll go into more depth in investigating the commonalities between these classifiers and see whether they predict using similar values and where they differ.

Tuesday, June 2, 2015

C4.5 Decision Trees with Missing Data


Overview

We often encounter data with missing values; and various machine learning algorithms handle missing data with varying levels of grace. C4.5 is an algorithm that is advertised to be able to handle missing data since there is 'built-in' support for missing values.

In this post, we will walk through exactly how the C4.5 decision tree algorithm deals with missing values. We will do this by working out how the branches are created in detail.

Data

We'll work with the same data that we have used in previous posts on decision tree algorithms so that we are able to compare behaviors, but with the important difference that some of the data is missing (the missing values are marked with "?"). It is repeated below for ease of reference.

This data consists of weather conditions that we would like to use to decide whether we play tennis or choose not to play tennis on a given day. Thus "PlayTennis" is our outcome feature that we would like to build a classifier to predict. And we will continue to ignore the "Day" column since it is being used mainly as a row label rather than as an attribute that we would like to use in our model.

Original (non-missing) weather data

The original dataset (without missing values) is as follows:

Table 1: Weather data and an outcome variable, "PlayTennis", that we would like to predict. From T. Mitchell, "Machine Learning", Chapter 3, p. 59.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Weather data with missing values

Here is the same data, but with missing values. Values were randomly removed, which is equivalent to the Missing Completely At Random missing data mechanism (missing data mechanisms will be the subject of a future post).

Table 2: Weather/tennis data with missing values.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 ? Hot High Strong No
D3 ? ? High ? Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool ? Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 ? Mild High ? No
D9 ? Cool Normal Weak Yes
D10 ? ? Normal ? Yes
D11 ? Mild Normal ? Yes
D12 Overcast Mild ? Strong Yes
D13 Overcast Hot ? Weak Yes
D14 Rain Mild High Strong No

Altered gain and split info calculations

In C4.5, information gain and split info are used to calculate the gain ratio value (see other posts on C4.5 for definitions of gain ratio and split info). However, in the presence of missing values, J.R. Quinlan defines altered versions of these quantities to make the splitting criterion robust to missing data.

Information gain is simply weighted by the proportion of non-missing values of a particular attribute:

\[ G_m(S,A) = \frac{|S_{A}| - |M_{A}|}{|S_{A}|} G(S,A) \]

Where \(|M_{A}|\) is the number of missing values for attribute \(A\).

The entropy value is calculated using only the rows with non-missing values for the attribute. For Outlook, the 8 rows with a known Outlook contain 5 Yes and 3 No, giving:

\[ H(S) = -\frac{5}{8} \log_2 \frac{5}{8} - \frac{3}{8} \log_2 \frac{3}{8} = 0.954 \]

So using the updated definition of information gain on the Outlook attribute would proceed as follows (using the data from the table with the missing values):

\[ G_m(S, Outlook) = \frac{8}{14} \left[ H(S) - \frac{1}{8}\left( -\frac{1}{1} \log_2 \frac{1}{1} - 0 \right)_{Sunny} - \frac{3}{8} \left( -\frac{3}{3} \log_2 \frac{3}{3} - 0 \right)_{Overcast} - \frac{4}{8} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right)_{Rain} \right] = \frac{8}{14} \left( 0.954 - \frac{1}{2} \right) = 0.26 \]

Split info is calculated the same as before but with the missing values considered a separate state that an attribute can take.

\[ Split_m(S, Outlook) = -\frac{1}{14} \log_2 \frac{1}{14} - \frac{3}{14} \log_2 \frac{3}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 1.788 \]

Therefore the gain ratio for Outlook would be the ratio of the updated information gain to the updated split info: \(\frac{0.26}{1.788}=0.145\).

Doing similar calculations for the rest of the attributes yields the following values:

Attribute Gain Ratio
Outlook 0.145
Temperature 0.028
Humidity 0.081
Wind 0.066

We pick the attribute with the highest gain ratio on which to define a new branch in the decision tree, which in this case is Outlook. This initial tree looks, at least structurally, the same as the one constructed without missing values.

But there is an ambiguity that we must consider when choosing which rows are assigned to which branches of the tree.

Partitioning the training set

Now that we have made the decision to split based on the Outlook attribute, we need to decide which branch each row is 'assigned to'. This is a clear choice for the rows without missing values, but we face a problem when a row is missing a value for the attribute on which we are splitting.

For example, when we look at the D2 row we see that we do not know what the Outlook value is. So which branch do we follow?

The way that Quinlan dealt with this problem was to assign these ambiguous cases to all of the newly created descendant nodes, each with a weight proportional to the number of non-missing instances that follow that branch. If we take Outlook==Sunny as an example, we have one instance for which we know Outlook is Sunny, and its class is PlayTennis==No. However, we have 6 instances in which we do not know the Outlook, 4 of which are PlayTennis==Yes and 2 of which are PlayTennis==No. To each of these rows with unknown Outlook, we assign in the Sunny branch a fractional weight equal to the ratio of known Outlook==Sunny instances to all instances with a known Outlook (\(\frac{1}{8}\)).
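
Concretely, using the counts above (1 Sunny, 3 Overcast and 4 Rain among the 8 rows with a known Outlook), each row with a missing Outlook is sent down all three branches with the fractional weights

\[ w_{Sunny} = \frac{1}{8}, \qquad w_{Overcast} = \frac{3}{8}, \qquad w_{Rain} = \frac{4}{8} \]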

Classifying an instance with missing values

When we come across an instance that we would like to classify that has missing values, we explore all possibilities that the missing value can take when we reach the corresponding decision split in our classification tree. Then, once we reach the leaves of the tree for each of the possible paths, we choose the path with the highest probability value given the weighted instances of PlayTennis==Yes and PlayTennis==No at each branch.

Tuesday, May 26, 2015

The C4.5 decision tree split criterion


In a previous post, we worked through an example of building a decision tree using the ID3 algorithm, a precursor to the ubiquitous C4.5 decision tree algorithm.

In this post, we are going to work through the split criterion of the C4.5 algorithm itself by doing some more hand calculations, to ensure that we truly understand it rather than just throwing some data into a black-box implementation of the algorithm.

The data

We'll use the same data that was used in the previous post on the ID3 algorithm so we can compare our results. The dataset is as follows:

Table 1: Some data about weather and whether we will play a sport (say, tennis). Adapted from T. Mitchell, "Machine Learning", Chapter 3, p. 59.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Splitting criterion

ID3 used an information gain criterion in deciding how to build branches in the decision tree. In equation form:

\[ G(S,A) = H(S) - \sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} H(S_{\nu}) \]

where H is the entropy of a set, defined by:

\[ H(S) = -\sum_i^n p_i \log_2 p_i \]

The information gain is the gain in information due to the hypothetical split on an attribute. We choose to split on the attribute that has the highest information gain.

In contrast, C4.5 uses a so-called gain ratio criterion. The gain ratio criterion addresses the fact that ID3's information gain criterion strongly favors features with many outcomes. For example, if we were to consider the Day column as a feature (we had only considered it to be a row label rather than a feature until now), it would have the highest information gain; but using it as a splitting criterion would result in a shallow, broad tree with 14 leaves: one branch for each instance. This is not a very useful decision tree as it likely would not generalize well on unseen data.

The gain ratio criterion ameliorates this problem by normalizing the gain by the split information, which penalizes features with many outcomes and so favors a feature with fewer outcomes over one with more.

The gain ratio is defined as follows:

\[ G_r(S,A) = \frac{G(S,A)}{Split(S,A)} \]

Where \(Split(S,A)\) is the split info defined as:

\[ Split(S,A) = -\sum_{\nu \in A} \frac{|S_{\nu}|}{|S|} \log_2 \frac{|S_{\nu}|}{|S|} \]
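
To make these definitions concrete, here is a small sketch of the gain ratio calculation in R (the weather data frame and its column names are assumed to match Table 1):

## entropy of a vector of class labels (or of attribute values)
entropy <- function(v) {
  p <- table(v) / length(v)
  p <- p[p > 0]
  -sum(p * log2(p))
}

## gain ratio of splitting on `attribute`, following the definitions above
gain.ratio <- function(data, attribute, class.col = "PlayTennis") {
  S <- data[[class.col]]
  A <- data[[attribute]]
  ## information gain: H(S) minus the weighted entropies of the subsets S_v
  weighted <- sapply(split(S, A), function(s) (length(s) / length(S)) * entropy(s))
  gain <- entropy(S) - sum(weighted)
  ## split info: the entropy of the partition induced by the attribute
  gain / entropy(A)
}

## e.g. gain.ratio(weather, "Outlook") should come out near 0.156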

Computing splits using gain ratio

The entropy of the full dataset is:

\[ H(S) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{9}{14} \log_2 \frac{9}{14} = 0.94 \]

Then we'll compute hypothetical splits on each of the features.

  • Outlook

The information gain on the Outlook attribute is:

\[ G(S, Outlook) = H(S) - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Sunny} - \left[ \frac{4}{14} (-\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4}) \right]_{Overcast} - \left[ \frac{5}{14} (-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}) \right]_{Rain} = 0.246 \]

And the split information is:

\[ Split(S, Outlook) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577 \]

Therefore the gain ratio is \(\frac{0.246}{1.577}=0.156\) for the Outlook attribute.

  • Temperature

\[ G(S, Temperature) = H(S) - \frac{4}{14} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right) - \frac{6}{14} \left( -\frac{4}{6} \log_2 \frac{4}{6} - \frac{2}{6} \log_2 \frac{2}{6} \right) - \frac{4}{14} \left( -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \right) = 0.029 \]

\[ Split(S, Temperature) = -\frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} - \frac{4}{14} \log_2 \frac{4}{14} = 1.557 \]

And the gain ratio for the Temperature attribute is \(\frac{0.029}{1.557} = 0.019\).

  • Humidity

\[ G(S, Humidity) = H(S) - \frac{7}{14} \left( -\frac{3}{7} \log_2 \frac{3}{7} - \frac{4}{7} \log_2 \frac{4}{7} \right) - \frac{7}{14} \left( -\frac{6}{7} \log_2 \frac{6}{7} - \frac{1}{7} \log_2 \frac{1}{7} \right) = 0.152 \]

\[ Split(S, Humidity) = -\frac{7}{14} \log_2 \frac{7}{14} - \frac{7}{14} \log_2 \frac{7}{14} = 1 \]

The gain ratio is \(\frac{0.152}{1} = 0.152\)

  • Wind

\[ G(S, Wind) = H(S) - \frac{8}{14} \left( -\frac{6}{8} \log_2 \frac{6}{8} - \frac{2}{8} \log_2 \frac{2}{8} \right) - \frac{6}{14} \left( -\frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} \right) = 0.048 \]

\[ Split(S, Wind) = -\frac{8}{14} \log_2 \frac{8}{14} - \frac{6}{14} \log_2 \frac{6}{14} = 0.985 \]

And the gain ratio is \(\frac{0.048}{0.985}=0.0487\).

Branching decision based on gain ratio

So we see that the maximum gain ratio among the candidate attributes is on the Outlook feature (0.156), with Humidity close behind (0.152). In this case the gain ratio criterion therefore picks the same attribute as the information gain criterion used by ID3, and the root split is unchanged; the two criteria can still diverge on datasets where a feature with many outcomes (like the Day column discussed above) inflates the raw information gain.