Wednesday, August 19, 2015

Data Tables in R

Introduction

My posts have been pretty machine learning and data mining heavy until now. And although machine learning and data mining are arguably the most important aspects of what we have come to call 'data science', there are a couple of other components that we cannot ignore: how we store data and how we 'munge' or manipulate data. I promise I will get to these aspects of data science in future posts. As a segue into these topics, today's post will be on a package that I've recently discovered that provides a useful way to store data in R (well, at least in memory) and get useful aggregate information from it.

Usually we work with the 'data frame' class in R to store the data that we are working with in memory. Data frames are one of the reasons that I choose to do my analytics in R versus another environment such as Matlab (although the folks at Mathworks - the makers of Matlab - have added a data frame-like type to their matrices in recent years, it doesn't match the power of the R version). Data frames are useful because - amongst other reasons - we can work with data in a natural way. Columns are named and typed as numeric, character or factor values. And we can store mixed data in data frames - not all columns are required to be the same type (as is the case for matrices).
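
As a quick illustration of that flexibility, here is a toy data frame of my own (not part of the example data used below) with a different type in each column:

people <- data.frame(name = c("Ann", "Bob"),      # character (stored as a factor by default)
                     age = c(34L, 29L),           # integer
                     member = c(TRUE, FALSE))     # logical
str(people)   # each column keeps its own type, which a matrix cannot do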

However, I've recently discovered the 'data.table' library. I was looking for a quick way to operate on grouped data in a data frame and data tables seemed to fit the bill. I use plyr all the time for this type of operation; and plyr is very good at it (if you don't use it, I would recommend starting). But I was doing something very simple (finding min and max by a factor) and the data.table library came up in a search. And it looks like a package that I will probably start incorporating into my set of go-to libraries in R.

Oh, and I know there is a newer version of plyr (dplyr), but I haven't had time to make the switch. Someday…

Example Data

I'll generate some simple example data for demonstration purposes. The data will start its life as a data frame. It is just two columns: a numeric column of uniformly-distributed random values and a column of groups, randomly chosen:

df <- data.frame(x=runif(20),
                 y=sample(c("A","B"), size=20, replace=TRUE))
print(df[order(df$y),])
                 x y
 0.970504403812811 A
  0.64111499930732 A
 0.915523204486817 A
 0.510243171826005 A
0.0864250543527305 A
 0.104028517380357 A
 0.230543906101957 A
 0.274059027433395 A
 0.825932029169053 A
 0.151608517859131 B
0.0283026716206223 B
 0.606306979199871 B
0.0992447617463768 B
 0.313594420906156 B
 0.610857902327552 B
 0.588633166160434 B
 0.141868675360456 B
 0.140498693101108 B
 0.219244148349389 B
 0.664096125168726 B

The plyr version

So what I would usually do in plyr to find, say, the min value for each group is something like this:

library(plyr)
df.agg <- ddply(df, c("y"), summarize, min.val.per.group=min(x))
print(df.agg)
  y  min.val.per.group
1 A 0.0864250543527305
2 B 0.0283026716206223

As you can see, this gives us a new data frame with the minimum values for each group, A and B. A similar thing could be done for the max values.
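
For completeness, here is a hedged sketch of computing both aggregates in a single ddply call (the output column names here are just my own choices):

library(plyr)
df.range <- ddply(df, c("y"), summarize,
                  min.val.per.group = min(x),
                  max.val.per.group = max(x))
print(df.range)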

Don't get me wrong, I love plyr and will continue to use it. But data.tables offer a different and intriguing approach to working with grouped data.

The data.table version

In contrast to data frames, data tables maintain a "key" that can be made up of the columns of the table.

First we'll convert our original data frame to a data table. Then we set the key to be the "y" column (remember, the key can actually be a combination of columns if desired). Lastly, we use the tables() function to print out the tables that we have access to in memory (just our dt table right now) and some info on that table, such as its key.

library(data.table)
dt <- as.data.table(df)
setkey(dt, y)
tables()
NAME NROW NCOL MB COLS KEY
  dt   20    2  1  x,y   y
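
As an aside (not something this post's example needs), a key spanning multiple columns is set the same way; this is just a sketch:

setkey(dt, y, x)   # key on both columns; rows are sorted by y, then x
key(dt)            # "y" "x"
setkey(dt, y)      # restore the single-column key used in the rest of this post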

Having a key gives us access to a simple and concise means by which we can group information on the table as well as filter it. If we only want the "A" group values on our table, we can filter it in this way:

dt["A",]
                 x y
 0.970504403812811 A
  0.64111499930732 A
 0.915523204486817 A
 0.510243171826005 A
0.0864250543527305 A
 0.104028517380357 A
 0.230543906101957 A
 0.274059027433395 A
 0.825932029169053 A

Also note that the comma is optional:

dt["A"]
                 x y
 0.970504403812811 A
  0.64111499930732 A
 0.915523204486817 A
 0.510243171826005 A
0.0864250543527305 A
 0.104028517380357 A
 0.230543906101957 A
 0.274059027433395 A
 0.825932029169053 A

And we can perform actions on the groups using the second index into our table. So if we want to take the min of the "A" group, we can do it like this:

dt["A", min(x)]
0.0864250543527305

But we can also get the minimum value for each key (in our case the key is just the y column, which delineates our groups) succinctly by adding the by argument to the data table's bracket syntax:

dt[, min(x), by=y]
   y                 V1
1: A 0.0864250543527305
2: B 0.0283026716206223

And notice that we don't have to address our columns explicitly as columns of the table (e.g. dt$x or dt$y), making things even more concise.
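
If we want named result columns or several aggregates at once, we can return a list in the second index; a small sketch (the column names are my own):

dt[, list(min.val = min(x), max.val = max(x), n = .N), by = y]   # .N is the number of rows in each group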

Conclusion

We only went through a brief and simple introduction to the data.table library in R. But hopefully that is enough to make you interested in pursuing this fairly simple but useful library further. I, for one, will continue to use it in my daily analytics work thanks to its simplicity and its ability to operate on groups of rows in such a succinct way.

Tuesday, August 18, 2015

Association Rule Mining and Apriori

Introduction

As you may already know, association rule mining is the process of discovering rules in data based on the frequent co-occurrence of items in the dataset. It is one of the more well-known techniques in data mining, primarily as a result of demonstrated successes in discovering grocery items that individuals are likely to purchase together, a process usually referred to as market basket analysis.

The general idea is that we take a preferably large set of data and winnow it down to only those sets of items/features that co-occur frequently and in some cases imply one another. From these sets we form implication rules such as \(A \land B \land C \rightarrow D\).

In this post, I will discuss some of the metrics used in discovering rules and will briefly cover one of the more common algorithms used in association rule mining: the Apriori algorithm.

Example Setup

To introduce association rule mining, I will work from an example dataset of grocery items that were purchased by a hypothetical set of customers. The data itself can be seen in the following table.

| transaction id | apples | oranges | soda | diapers | candy bar | bread | milk | butter | beer |
|---|---|---|---|---|---|---|---|---|---|
| 1  |   | 1 |   | 1 |   |   |   |   | 1 |
| 2  | 1 |   |   | 1 |   |   | 1 |   |   |
| 3  |   |   |   | 1 |   | 1 | 1 | 1 | 1 |
| 4  | 1 | 1 | 1 |   |   |   | 1 |   |   |
| 5  | 1 | 1 | 1 | 1 | 1 |   | 1 |   |   |
| 6  |   |   | 1 |   |   | 1 |   |   | 1 |
| 7  |   | 1 | 1 |   | 1 |   |   |   | 1 |
| 8  |   |   | 1 | 1 |   |   |   |   | 1 |
| 9  | 1 |   |   | 1 |   | 1 |   |   | 1 |
| 10 | 1 |   |   | 1 |   |   |   |   | 1 |
| 11 |   |   |   | 1 |   | 1 |   |   | 1 |
| 12 |   |   | 1 |   | 1 | 1 | 1 |   |   |
| 13 | 1 | 1 |   | 1 |   | 1 | 1 |   | 1 |
| 14 |   |   | 1 | 1 |   |   | 1 |   |   |
| 15 |   | 1 |   |   |   |   |   | 1 |   |

In association rule mining, the term for a row of data is a transaction. Here we have a set of 15 transactions in a grocery store. The items that were purchased are marked with a 1 and those that were not purchased by a particular customer are left blank.

We have a simple example here in order to introduce the concepts; however, we do not need to limit ourselves to a set of binary relations of {X = 1} or {X = empty} (where X can be any of the items: apples, oranges, soda, etc.). We could just as easily have allowed our X to take on numerical values if we were dealing with a different set of columns in our table. The one thing that is required is that our data can somehow be coaxed into a finite set of values. For example, if our feature columns are real-valued, we would need to split them at boundaries (i.e. discretize them) so that we can run association rule mining algorithms on them.

In order to discover which of these items tend to be purchased together, and which items may lead to another item being purchased, we will introduce measures of their frequency of co-occurrence in the following section.

Finding Frequent Sets

To find these frequent sets of items we often use the concepts of support, confidence and lift. These are the basic measures that determine which rules we can form.

Support tells us how often a particular item is found in the transactions. It is simply the fraction of transactions (or the raw number of transactions) in which a single item or a set of items is found. These two ways of measuring support - the number of times the set occurs and the fraction of transactions in which it occurs - are known as the absolute and relative support, respectively.
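
Written out (with \(T\) denoting the set of transactions; the subscript labels are just my own shorthand), the two flavors of support for an itemset \(X\) are:

$$supp_{abs}(X) = |\{t \in T : X \subseteq t\}|, \qquad supp_{rel}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$$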

Confidence uses the support measurement as a basis and gives us information about potential implications between occurrences of items. What this means is that we can start to make inferences about whether the existence of a set of items implies that another item or set of items is likely to exist in the same type of transaction. Confidence is simply the support of a set of items divided by the support of a subset of those items. E.g. for items X, Y and Z, the confidence that Z is present when X and Y are also found in a transaction is calculated as \(\frac{supp(X,Y,Z)}{supp(X,Y)}\) (with supp() referring to the support value), i.e. the ratio of the support of X, Y and Z to the support of just X and Y. Parallels are often drawn between confidence and conditional probability since their definitions are very similar (i.e. the conditional probability of event X given event Y is defined as \(P(X|Y) = \frac{P(X,Y)}{P(Y)}\), or similarly the conditional probability of event Z given events X and Y is \(P(Z|X,Y)=\frac{P(X,Y,Z)}{P(X,Y)}\)). The difference lies in the fact that in calculating confidence, we are concerned solely with frequencies rather than requiring probabilities.
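
For a general rule \(X \Rightarrow Z\) (with X and Z being itemsets), this can be written compactly in the same notation as the lift formula below:

$$conf(X \Rightarrow Z) = \frac{supp(X \cup Z)}{supp(X)}$$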

The last quantity that we will define before doing actual calculations is lift. Lift is defined as the ratio of the support of items X and Y together to the product of their individual supports - the support we would expect if X and Y were 'independent' (if you would forgive the reference to the terminology of probability theory). Thus, lift is defined mathematically as:

$$lift(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \cdot supp(Y)}$$

The lift gives us an indication of how often items are found together versus how often we would expect to find them together if they occurred independently. A lift near 1 suggests no real relationship, while values well above 1 suggest that the items tend to occur together.

Example Calculations

Let's calculate the (relative) support for each of the individual items first. There are 15 transactions in total, which gives us the following support for each of the individual items (apples=6/15, oranges=6/15, soda=7/15, diapers=10/15, candy bar=3/15, bread=6/15, milk=7/15, butter=2/15 and beer=9/15). It was pretty easy to calculate all of the support values for our 9 items. Now let's consider the support for two-item sets (items in groupings of 2). There are \({9 \choose 2}=36\) different ways of combining our items in groups of two. That's a lot to list - so we'll just do a couple of them. Let's look at soda and candy bars for starters. Out of the 15 transactions, we have 3 instances in which both soda and candy bars are bought together. If we look at apples and candy bars, we have 1 out of 15 examples in which both are found together. How about beer and diapers? 7 out of the 15 transactions contain both beer and diapers.

We now have rough indications of how often some items are found together, but support alone does not let us draw many conclusions: even though soda and candy bars are found together in only 3 of the 15 transactions, it could still be that the items have a strong connection but are simply not bought by a lot of customers. Indeed, if we look at the confidence calculation for soda given candy bars, we have that the confidence is \(\frac{supp(\text{soda}, \text{candy bar})}{supp(\text{candy bar})} = \frac{3}{3} = 1.0\). Confidence has a range of 0.0 to 1.0. Thus, with a confidence of 1.0, we would (if the transaction set were bigger) say that we have a strong belief that people who have bought a candy bar are likely to also buy a soda. How about the frequency of people buying beer given that they have diapers in their shopping cart? The confidence is \(\frac{7}{10}=0.7\), which is a pretty high value as well. Therefore, we might infer that there is a tendency for someone to buy beer if they have a baby at home. In fact, it is this relation that has been used to popularize association rule mining techniques: in an actual application of association rule mining, this beer and diaper relationship was discovered. If you think about it, it makes sense. A screaming baby at home might make a frustrated mother or father want a cold beer to take the edge off.

Let's look at the lift of the soda/candy bar and diaper/beer relationships. The lift of the soda/candy bar relationship is \(\frac{3/15}{(7/15)\cdot(3/15)}=2.14\) and the lift of the diaper/beer relationship is \(\frac{7/15}{(10/15)\cdot(9/15)}\approx 1.17\). Generally, the higher the lift, the stronger the relationship between the items.
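
These numbers are easy to check programmatically. Here is a minimal base-R sketch (my own re-encoding of the transaction table above; the object and column names are arbitrary choices) that reproduces the support, confidence and lift values:

# Re-encode the transaction table as a 0/1 matrix (rows = transactions)
items <- c("apples", "oranges", "soda", "diapers", "candy.bar",
           "bread", "milk", "butter", "beer")
trans <- matrix(0, nrow = 15, ncol = length(items),
                dimnames = list(NULL, items))
purchased <- list(
  c("oranges", "diapers", "beer"),
  c("apples", "diapers", "milk"),
  c("diapers", "bread", "milk", "butter", "beer"),
  c("apples", "oranges", "soda", "milk"),
  c("apples", "oranges", "soda", "diapers", "candy.bar", "milk"),
  c("soda", "bread", "beer"),
  c("oranges", "soda", "candy.bar", "beer"),
  c("soda", "diapers", "beer"),
  c("apples", "diapers", "bread", "beer"),
  c("apples", "diapers", "beer"),
  c("diapers", "bread", "beer"),
  c("soda", "candy.bar", "bread", "milk"),
  c("apples", "oranges", "diapers", "bread", "milk", "beer"),
  c("soda", "diapers", "milk"),
  c("oranges", "butter"))
for (i in seq_along(purchased)) trans[i, purchased[[i]]] <- 1

# Relative support of each single item (e.g. diapers = 10/15)
colSums(trans) / nrow(trans)

# Support, confidence and lift for the rule {diapers} => {beer}
supp.both    <- mean(trans[, "diapers"] == 1 & trans[, "beer"] == 1)   # 7/15
supp.diapers <- mean(trans[, "diapers"] == 1)                          # 10/15
supp.beer    <- mean(trans[, "beer"] == 1)                             # 9/15
supp.both / supp.diapers                  # confidence = 0.7
supp.both / (supp.diapers * supp.beer)    # lift = about 1.17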

There are many other calculations that we can do to find relationships between sets: conviction, leverage, collective strength to name a few. We'll stop with these three basic calculations for this post.

You may have noticed that we have something of a combinatorial explosion as we want to calculate frequency quantities of increasing set size. For a 'group' size of 1, we only had to do \(k=9\) support calculations (one for each 'feature' or basket item). For groups of two, we had to calculate \({9 \choose 2}\) support values; and groups of 3 would require \({9 \choose 3}\) calculations. Imagine if we had more than 9 available grocery items. This is where calculations of combinations of items start to become expensive. As such, we need algorithms that do this efficiently.
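
To put concrete numbers on that growth, a couple of quick calculations in R:

choose(9, 2)          # 36 pairs of items
choose(9, 3)          # 84 triples
sum(choose(9, 1:9))   # 511 non-empty itemsets possible from just 9 items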

Apriori

The Apriori algorithm is a way to efficiently find frequent itemsets in large transaction sets with many possible items/features. It relies on a monotonicity property of itemsets called the 'downward closure lemma'. In simple terms: if an itemset of lower cardinality is not frequent, then no higher-cardinality itemset that contains it can be frequent either. This narrows the number of potential combinations of items that have to be examined for larger group sizes. So, for example, we can see that the itemset {butter, candy bar} has zero occurrences. Therefore, we do not need to calculate the frequency of any superset of this itemset, such as {butter, candy bar, apples}. We already know that there are no such combinations. This allows us to prune away the combinations of large-cardinality itemsets. In practice, we also typically limit the maximum size of the groups of items that we would like to explore.
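
For reference (this post does not actually use it), the arules package in R provides an implementation of Apriori. A minimal, hedged sketch, reusing the trans matrix from the sketch above and with support/confidence thresholds picked arbitrarily for this toy data:

library(arules)
trans.obj <- as(trans == 1, "transactions")   # logical item matrix -> transactions
rules <- apriori(trans.obj,
                 parameter = list(support = 0.2, confidence = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))             # discovered rules, ordered by lift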

Conclusion

We went over a brief summary of some key concepts of association rule mining, one of the most popular data mining strategies. In future notes, we will go into more depth about some of the other metrics used to find frequent itemsets in a transaction database, and explore the influence of missing data on itemset discovery.