Wednesday, August 19, 2015

Data Tables in R

Introduction

My posts have been pretty heavy on machine learning and data mining until now. And although machine learning and data mining are arguably the most important aspects of what we have come to call 'data science', there are a couple of other components we cannot ignore: how we store data and how we 'munge' or manipulate data. I promise I will get to these aspects of data science in future posts. As a segue into those topics, today's post is on a package I've recently discovered that offers a useful way to store data in R (well, at least in memory) and get useful aggregate information from it.

Usually we work with the 'data frame' class in R to store the data we are working with in memory. Data frames are one of the reasons I choose to do my analytics in R rather than another environment such as Matlab (although the folks at Mathworks - the makers of Matlab - have added a data frame-like type in recent years, it doesn't match the power of the R version). Data frames are useful because - amongst other reasons - they let us work with data in a natural way. Columns are named and typed as numeric, character or factor values. And we can store mixed data in data frames - not all columns are required to be the same type (as is the case for matrices).
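For instance, here's a quick sketch of that contrast (the column names are just illustrative):

```r
# A data frame holds mixed column types side by side; a matrix
# silently coerces everything to one common type (here, character).
df.mixed <- data.frame(id=1:3,
                       name=c("a", "b", "c"),
                       score=c(0.5, 0.7, 0.9),
                       stringsAsFactors=FALSE)
sapply(df.mixed, class)  # integer, character, numeric

m <- as.matrix(df.mixed)
class(m[1, 1])           # "character" -- the numbers were coerced
```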

However, I've recently discovered the 'data.table' package. I was looking for a quick way to operate on grouped data in a data frame, and data tables seemed to fit the bill. I use plyr all the time for this type of operation, and plyr is very good at it (if you don't use it, I would recommend starting). But I was doing something very simple (finding the min and max by a factor) and the data.table package came up in a search. And it looks like a package that I will probably start incorporating into my set of go-to libraries in R.

Oh, and I know plyr has a successor (dplyr), but I haven't had time to make the switch. Someday…

Example Data

I'll generate some simple example data for demonstration purposes. The data will start its life as a data frame. It is just two columns: a numeric column of uniformly-distributed random values and a column of groups, randomly chosen:

df <- data.frame(x=runif(20),
                 y=sample(c("A","B"), size=20, replace=TRUE))
print(df[order(df$y),])
                x y
0.970504403812811 A
0.64111499930732 A
0.915523204486817 A
0.510243171826005 A
0.0864250543527305 A
0.104028517380357 A
0.230543906101957 A
0.274059027433395 A
0.825932029169053 A
0.151608517859131 B
0.0283026716206223 B
0.606306979199871 B
0.0992447617463768 B
0.313594420906156 B
0.610857902327552 B
0.588633166160434 B
0.141868675360456 B
0.140498693101108 B
0.219244148349389 B
0.664096125168726 B

The plyr version

So what I would usually do in plyr to find, say, the min value for each group is something like this:

library(plyr)
df.agg <- ddply(df, c("y"), summarize, min.val.per.group=min(x))
df.agg
y min.val.per.group
A 0.0864250543527305
B 0.0283026716206223

As you can see, this gives us a new data frame with the minimum values for each group, A and B. A similar thing could be done for the max values.
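For example, here's the max version with max() swapped in for min() (the data is regenerated here, so the actual values will differ from the output above):

```r
library(plyr)

# Regenerate the example data (random, so values differ from above)
df <- data.frame(x=runif(20),
                 y=sample(c("A","B"), size=20, replace=TRUE))

# Same ddply pattern as the min, with max() swapped in
df.agg.max <- ddply(df, c("y"), summarize, max.val.per.group=max(x))
print(df.agg.max)
```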

Don't get me wrong, I love plyr and will continue to use it. But data.table offers an intriguingly different approach to working with grouped frames of data.
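For the curious, here's roughly what the same min-by-group operation looks like in dplyr. I haven't made the switch yet, so treat this as a sketch rather than battle-tested code (again regenerating the data, so values will differ):

```r
library(dplyr)

# Regenerate the example data (random, so values differ from above)
df <- data.frame(x=runif(20),
                 y=sample(c("A","B"), size=20, replace=TRUE))

# Pipe the data frame through a group-by, then summarise each group
df %>%
  group_by(y) %>%
  summarise(min.val.per.group = min(x))
```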

The data.table version

In contrast to data frames, data tables can maintain a "key" made up of one or more of the table's columns.

First we'll convert our original data frame to a data table. Then we set the key to be the "y" column (remember, the key can actually be a combination of columns if desired). Lastly, we use the tables() function to print out the tables that we have access to in memory (just our dt table right now) and some info on that table, such as its key.

library(data.table)
dt <- as.data.table(df)
setkey(dt, y)
tables()
     NAME NROW NCOL MB COLS KEY
[1,]   dt   20    2  1  x,y   y

Having a key gives us access to a simple and concise means by which we can group information on the table as well as filter it. If we only want the "A" group values on our table, we can filter it in this way:

dt["A",]
                x y
0.970504403812811 A
0.64111499930732 A
0.915523204486817 A
0.510243171826005 A
0.0864250543527305 A
0.104028517380357 A
0.230543906101957 A
0.274059027433395 A
0.825932029169053 A

Also note that the comma is optional:

dt["A"]
                x y
0.970504403812811 A
0.64111499930732 A
0.915523204486817 A
0.510243171826005 A
0.0864250543527305 A
0.104028517380357 A
0.230543906101957 A
0.274059027433395 A
0.825932029169053 A

And we can perform actions on the groups using the second index into our table. So if we want to take the min of the "A" group, we can do it like this:

dt["A", min(x)]
[1] 0.0864250543527305

But we can also get the minimum value within each group succinctly by using the by argument in the second index into the data table (in our case we group by the y column, which is also our key; the unnamed aggregate gets the default column name V1):

dt[, min(x), by=y]
y                 V1
A 0.0864250543527305
B 0.0283026716206223

And notice that we don't have to address the columns explicitly as columns of the table (e.g. dt$x or dt$y), which makes things even more concise.
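One nice consequence is that several aggregates can be computed in a single call by returning a list in the second index; .N is data.table's built-in row count per group (the column names min.val and max.val below are my own):

```r
library(data.table)

# Regenerate the example data as a keyed data table
dt <- data.table(x=runif(20),
                 y=sample(c("A","B"), size=20, replace=TRUE))
setkey(dt, y)

# Several aggregates at once: a list in j yields one row per group
dt[, list(min.val=min(x), max.val=max(x), n=.N), by=y]
```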

Conclusion

We've only gone through a brief and simple introduction to the data.table package in R. But hopefully that is enough to make you interested in pursuing this fairly simple but useful package further. I, for one, will continue to use it in my daily analytics work, thanks to its simplicity and its ability to operate on groups of rows in such a succinct way.
