Introduction
As promised, I will be adding a little ETL into the mix of my posts. What is ETL, you ask? Well, it stands for "Extract, Transform, Load" and is shorthand for what some might call "data munging".
Throughout the course of data science/analytics work, we deal with a lot of data. And it is not always in the right format or the right place for us to distill insights from it. This is where ETL comes in: it is the process of 'extracting' data from whatever location and form it came in, 'transforming' it, and then 'loading' it into the form and place you need in order to perform your analysis.
There are several software packages available that are devoted solely to the process of ETL. I'll get into some of those in the future. But the minimal tools available at the shell (yes, I'm assuming you are using some version of Linux/Unix/OSX) can be surprisingly powerful. In this post, I'll show a short example of a couple of shell commands that are key to doing simple one-off data-munging tasks.
The scenario
I had been working on a custom data analytics tool that relies on an iterative optimization method. I was writing custom code to improve the way that the tool found solutions to the optimization problem. Thus, I could use the code to give me information about the 'before' and 'after': that is, 'before I added the new functionality' and 'after I added the new functionality'. I wanted to verify that adding the new functionality resulted in a significant improvement in the analytics tool.
In the end, the functionality was trying to 'keep a value low' (sorry for being vague, but I cannot reveal exactly what I was doing due to confidentiality constraints). Since the method was iterative, I knew that the tool's output included the iteration number (which we will call I) as well as the quantity that we were 'trying to keep low', which we'll call Y for brevity. If my custom update was a success, the value we are trying to keep low would be significantly lower at the end of the process than if my added code were not there. The tool also output a bunch of irrelevant information that I wanted to filter out.
Using 'grep'
I would have to identify text in the output that would act as a signal for my munging method to extract only the pertinent lines and not all of the other mess. This is a job for the shell utility 'grep'. It simply either selects or filters out lines of text that match a regular expression. In my case, I noticed that the characters '>>>' were on each line that output both I (the iteration number value) and Y. Therefore a simple grep expression would suffice to eliminate all of the lines that do not contain '>>>', like so:
./run_analytics_tool | grep '>>>'
Note that I am running the analytics tool with './run_analytics_tool' and then I am 'piping' the output into 'grep' to select only the lines that contain the string '>>>'. What this gives me is output that looks something like this:
>>>I,X1,X2,X3,X4,Y
>>>1,18.837,18.279,1,1,1
>>>2,18.557,18.279,1.94,1,1.04
>>>3,18.232,14.239,2.8,2,1.04
>>>4,18.079,14.239,3.18,2,1.28
>>>5,16.312,12.705,3.48,3,2.12
>>>6,14.266,12.705,3.48,3,3.28
>>>7,13.368,10.339,3.44,4,4.28
>>>8,12.761,10.339,3.78,4,4.86
>>>9,12.24,10.339,4.42,4,4.72
>>>10,11.234,10.339,4.74,4,4
>>>11,10.732,8.851,5.3,5,3.96
>>>12,10.972,8.851,6.4,5,4.04
>>>13,10.46,8.851,5.9,5,4.94
>>>14,9.537,8.851,4.84,5,5.8
>>>15,9.378,8.851,5.46,5,5.76
>>>16,9.401,8.851,6.06,6,5.76
>>>17,9.412,8.851,6.86,6,5.66
>>>18,9.386,8.851,7.52,6,5.7
>>>19,9.444,8.851,9.06,6,5.66
>>>20,9.444,8.851,9.06,6,5.66
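As an aside, grep can just as easily work in the other direction: the -v flag inverts the match. A minimal sketch (not what we want here, but handy when the junk, rather than the signal, is the part you can identify by a pattern):

./run_analytics_tool | grep -v '>>>'    # keep only the lines that do NOT contain '>>>'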
We don't care about the X1-X4 values, only the I and Y values. So we need some way to select only specific columns of comma-separated values. This is where 'awk' comes in.
Using 'awk'
Awk is a tool that allows us to select delimited columns and rows and operate on them. It is an extremely useful ETL tool.
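To give a flavor of the 'operate on them' part, here is a minimal sketch (purely illustrative; we won't need it below) that computes the mean of the Y column, i.e. the sixth field, across the iterations of a single run of the tool. NR > 1 skips the header line:

./run_analytics_tool | grep '>>>' | awk -F "," 'NR > 1 { total += $6; n++ } END { print "mean Y =", total / n }'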
In our case, we need to simply select certain columns (columns 1 and 6) from a stream of comma-delimited values. The way that we can do this is as follows:
./run_analytics_tool | grep '>>>' | awk -F "," '{OFS=","} {print $1,$6;}'
This gives us the following:
>>>I,Y
>>>1,1
>>>2,1.04
>>>3,1.04
>>>4,1.28
>>>5,2.12
>>>6,3.28
>>>7,4.28
>>>8,4.86
>>>9,4.72
>>>10,4
>>>11,3.96
>>>12,4.04
>>>13,4.94
>>>14,5.8
>>>15,5.76
>>>16,5.76
>>>17,5.66
>>>18,5.7
>>>19,5.66
>>>20,5.66
Here I am using the awk switch "-F" to say that we want to define the incoming delimiter as a comma rather than the default of a space. Similarly, I am defining a comma as the output field separator (thus OFS) with the first section of the awk command in single quotes: '{OFS=","}'. Then I am telling awk to print just the first and sixth values for each line in the stream ( '{print $1,$6;}' ). Note that, again, I am 'piping' values from grep in a cascading manner to be fed into awk.

Now there is a problem: our first column has a pesky string at the beginning, the very one we used to our advantage for filtering with grep: '>>>'. We'd like to remove that. For this task, 'sed' is our tool…
Using 'sed'
Sed, the 'stream editor', is a replacement tool. It allows us to operate on a stream of text and transform it by matching a regular expression and replacing whatever is matched with something else. Here we need to perform the simple task of matching '>>>' on each line and replacing it with nothing, i.e. the empty string: ''. We do this as follows:
./run_analytics_tool | grep '>>>' | awk -F "," '{OFS=","} {print $1,$6;}' | sed 's/>>>//'
Again, I am piping the text from the output of awk to the input of sed. What this sed expression says is to 'substitute' (thus the 's') the pattern between the first two forward slashes (>>>) with the value between the second and third forward slashes (here, the empty string). What this gives us is what we wanted:
I,Y
1,1
2,1.04
3,1.04
4,1.28
5,2.12
6,3.28
7,4.28
8,4.86
9,4.72
10,4
11,3.96
12,4.04
13,4.94
14,5.8
15,5.76
16,5.76
17,5.66
18,5.7
19,5.66
20,5.66
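One detail worth knowing: by default, sed's 's' command replaces only the first match on each line, which is all we need here since '>>>' appears once per line. If the pattern could occur more than once, you would add the 'g' (global) flag. A tiny sketch on a made-up string:

echo '>>>a >>>b' | sed 's/>>>//'     # prints: a >>>b   (first match only)
echo '>>>a >>>b' | sed 's/>>>//g'    # prints: a b      (every match)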
Now to see if there is a difference
What we ultimately wanted was a way to test whether the change made a difference in our analytics tool. As such, we would like to generate a set of independent 'before' and 'after' samples that we can test statistically. The way we do this is by wrapping everything we've done so far in a loop and writing the results out to files. This is done in the following way:
for i in {1..100}; do
  ./run_analytics_tool | grep '^>>>' | awk -F "," '{OFS=","} {print $1,$6;}' | sed 's/>>>//' > test${i}_before.csv
done

## make the change to the analytics tool codebase here

for i in {1..100}; do
  ./run_analytics_tool | grep '^>>>' | awk -F "," '{OFS=","} {print $1,$6;}' | sed 's/>>>//' > test${i}_after.csv
done
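A quick sanity check that the loops produced what we expect never hurts; a minimal sketch:

ls test*_before.csv | wc -l    # should print 100
ls test*_after.csv | wc -l     # likewise, after the second batch
head -n 3 test1_before.csv     # the header row plus the first two iterations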
This gives us 100 samples of the analytics tool run. We do this with the analytics tool before adding the new functionality and after, giving us a total of 200 samples with two treatments. Since this is an iterative tool, we cannot assume that the values found in the iterations themselves are independent, so we will take the last value of Y for each of our independent runs (each iteration of the for loop). Then we'll do a statistical test on those using R to see if the change ended up doing what we wanted to, which was to keep the value Y lower at the end of the iteration runs than if my change weren't implemented in the code.
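As an aside, since we only need the final row of each file, we could even grab those rows at the shell with 'tail'; a minimal sketch:

for f in test*_before.csv; do tail -n 1 "$f"; done

For the comparison itself, though, it is more convenient to do the filtering and the statistical test together.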
To do this we will import the data into R from the files that we created:
library(plyr)

df.before <- ldply(list.files(pattern="test.*_before.csv"), function(filename) {
  dum = read.csv(filename)
  dum$filename = filename
  return(dum)
})

df.after <- ldply(list.files(pattern="test.*_after.csv"), function(filename) {
  dum = read.csv(filename)
  dum$filename = filename
  return(dum)
})

## get it into a single data frame
## but preserve an indicator of which 'treatment' it came from
df.all <- rbind(cbind(df.before, treatment="before"),
                cbind(df.after, treatment="after"))
Since we are only comparing the end iterations, let's just extract those and drop the rest:
df.all.last <- df.all[df.all$I == max(df.all$I),]
And we will assume normality for purposes of illustration and do a t-test to determine whether the change resulted in an improvement in the analytics tool:
t.test(Y ~ treatment, data = df.all.last)
        Welch Two Sample t-test

data:  Y by treatment
t = 22.301, df = 118.55, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 6.718873 8.028327
sample estimates:
mean in group before  mean in group after
             10.6382               3.2646
Conclusion
In this post, we demonstrated basic usage of the shell-based 'ETL tools' grep, awk and sed. We used a scenario in which we made a change to an algorithm within an analytics tool. This tool had a good deal of useful information in its output as well as a lot of junk that we didn't care about. We filtered out the junk and extracted the useful information with these munging methods. From that information, we compared the analytics tool before and after the algorithm change and demonstrated that the change resulted in a significant improvement to the tool.