Thursday, July 2, 2015

Comparing white box models for wine classification

Introduction

This will be a bit more of an 'applied' post. We'll compare so-called 'white box' models on the wine dataset from the well-known UCI machine learning repository (http://archive.ics.uci.edu/ml/).

A white box model is a machine learning model whose inner workings are visible to us after we have trained it. That is, if we are building a classification model, we can see how a classification decision is made: how the features are used in determining which class a particular sample should be placed in. This is in contrast to a black box model, which arrives at a classification decision without offering any insight into how it made that determination (an artificial neural network is a good example of a black box model).

Decision tree, decision list and rule-based algorithms are typical examples of white box models. When we build a decision tree model, we can classify a new instance by starting at the root of the tree and tracing down its branches, following the condition at each split, until we reach a leaf, which gives us the class of the instance. Similarly, with decision lists and rule sets, we classify an instance by finding a rule whose conditions the instance satisfies. The models produced by many decision tree algorithms, such as C4.5 (J48), can also be transformed from a tree into a set of rules.
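To make this concrete, here is a minimal sketch in R of how a decision list classifies an instance: rules are tried in order and the first rule whose condition holds determines the class. The thresholds are loosely borrowed from the JRip rules we'll see later in this post, and the sample's feature values are made up purely for illustration.

# A decision list as an ordered set of (condition, class) pairs.
# The first rule whose condition holds determines the class;
# the final catch-all rule supplies the default class.
rules <- list(
  list(test = function(x) x$alcohol >= 12.5 && x$sulphates >= 0.69, class = "8"),
  list(test = function(x) x$volatile.acidity >= 1.02,               class = "4"),
  list(test = function(x) TRUE,                                     class = "5")
)

classify <- function(x, rules) {
  for (r in rules) {
    if (r$test(x)) return(r$class)
  }
}

# A made-up wine sample, for illustration only
classify(list(alcohol = 13.0, sulphates = 0.70, volatile.acidity = 0.30), rules)  # "8"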

Being able to see how features are used in classifying a dataset gives us insight into what features are important in determining the class of that dataset. Hence these models give us an intuitive understanding of the dataset. As such, the UCI wine dataset is a nice application of these types of algorithms because we can try to see what qualities in wine are important in determining the output class, which is a score for the wine's quality. We will stick to just the red wine dataset in this example, but the same methods can be used on the white wine dataset or the combination of the two.

Tools

In this post, I will break from the usual mold and use Weka (http://www.cs.waikato.ac.nz/ml/weka/) rather than my go-to analytics tool, R. When I compare a range of machine learning algorithms, I find that Weka is the easiest tool for quickly working through a large set of algorithms with basic parameter settings and comparing their behavior. It is very handy when you want results quickly and don't need to worry about automating your analysis process (of course I am speaking strictly of the UI tool here; Weka is also a machine learning/data mining API in Java for those who are inclined to build software around its algorithms).
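As an aside, if you would rather stay in R, the RWeka package wraps many of the same learners. I used the Weka UI for the comparisons in this post, but a minimal scripted sketch (assuming the red wine data are in a data frame called wine whose quality column is a factor) would look something like this:

library(RWeka)   # R interface to Weka's Java implementations

# Assumes `wine` is a data frame of the red wine data with a factor `quality` column
j48  <- J48(quality ~ ., data = wine)    # Weka's C4.5 decision tree learner
jrip <- JRip(quality ~ ., data = wine)   # Weka's RIPPER rule learner

print(jrip)                                   # inspect the learned rule list
evaluate_Weka_classifier(j48, numFolds = 10)  # 10-fold cross-validated accuracy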

Analysis

We'll be approaching the wine dataset from the perspective of classifying the samples with respect to their quality. Thus, I have forced the quality feature to be nominal rather than numeric in the ARFF file (ARFF is a Weka-specific file format – for more information on ARFF see here http://www.cs.waikato.ac.nz/ml/weka/arff.html). The relevant line of the ARFF file defining the nominal quality feature is as follows:

@attribute quality {'3','4','5','6','7','8'}

And I placed single quotes around each of the quality data values in the data instances (for ways to do this efficiently, stay tuned for some posts on Extract-Transform-Load methods in the near future).
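If you would rather not hand-edit the ARFF file, one way to generate it is from R with the RWeka package. This is a sketch, assuming the semicolon-separated winequality-red.csv file from UCI; note that write.arff may quote the nominal values slightly differently than the hand-edited line above.

library(RWeka)   # provides read.arff() and write.arff()

# Read the UCI red wine data (the file is semicolon-separated)
wine <- read.csv("winequality-red.csv", sep = ";")

# Turn the numeric quality score into a nominal (factor) class attribute
wine$quality <- factor(wine$quality)

# Write out an ARFF file that Weka can load directly
write.arff(wine, file = "winequality-red.arff")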

First, we'll look at the accuracy of various white-box classification algorithms. I've compiled a set of accuracies in the following table. These accuracies are the result of 10-fold cross validation.

name in WEKA       classification accuracy (%)   percent improvement over 'identity rule' (%)
ZeroR              42.5891                        0.000000
ConjunctiveRule    56.2226                       32.011712
JRip               56.2226                       32.011712
OneR               54.6592                       28.340820
PART               61.3508                       44.052821
Ridor              52.4703                       23.201242
BFTree             60.2877                       41.556642
DecisionStump      55.3471                       29.956022
FT                 57.2858                       34.508125
J48                61.4759                       44.346558
J48graft           62.789                        47.429741
LADTree            59.162                        38.913478
LMT                60.4128                       41.850380
NBTree             58.5366                       37.445027
RandomTree         62.414                        46.549234
REPTree            58.4115                       37.151290
SimpleCART         60.1626                       41.262905

…and those values in graphical form (using R for this part :)):

library(ggplot2)

# df is assumed to hold the table above, e.g. exported to CSV and read back
# with something like: df <- read.csv("weka_accuracies.csv")  (file name hypothetical)
# The awkward column names (X.name.in.WEKA., etc.) are what read.csv() makes
# of the original table headers.
p <- ggplot(df, aes(X.name.in.WEKA., X.percent.improvement.over..identity.rule..,
                    col = X.name.in.WEKA.))
p + geom_point(size = 6, pch = 9) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none") +
  geom_abline(slope = 0, intercept = 0) +   # baseline: no improvement over the 'identity rule'
  xlab("Algorithm") +
  ylab("Improvement over 'identity rule'")

Discussion

We can see that the highest-accuracy classifier is the grafted J48 tree (J48 is Weka's implementation of C4.5). Note that the plot compares not the absolute accuracies of the algorithms but their percent improvement over the 'identity rule', which in Weka is the ZeroR classifier: it simply assigns every instance to the most commonly occurring class. This makes a good baseline, since it is about the simplest 'classification' we could do without any more sophisticated technique, and we take its accuracy as the minimum we should be able to achieve.
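For concreteness, the improvement column in the table is just the relative gain over the ZeroR accuracy; for JRip, for example:

# Percent improvement of JRip over the ZeroR baseline (values from the table)
baseline <- 42.5891
jrip_acc <- 56.2226
(jrip_acc - baseline) / baseline * 100   # ~32.01%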

The decision tree models tend to be quite large. The best classifier, J48graft, has a tree of size 693 with 347 leaves. So for the sake of brevity (and in the spirit of Occam's Razor), we'll examine the makeup of a smaller model, the JRip rule set.

JRip

There are 6 levels of quality in the red wine data: a minimum quality value of 3 and a maximum of 8. The JRip classifier gives us an ordered set of rules whose antecedents are boolean comparisons of features against numeric thresholds and whose consequent is the class label. The first rule (shown in the full rule list below) indicates when a wine will be of the highest quality (8): the alcohol level is at least 12.5% (sommeliers would call this medium, medium plus or high alcohol), the sulphates are between 0.69 and 0.74, and the chlorides are at least 0.06. A similar antecedent appears in the second rule with different alcohol and sulphate thresholds but without the chlorides. Together these rules give the impression that a slightly higher amount of sulphates (between 0.82 and 0.86 rather than 0.69 and 0.74) may reduce the need for the higher level of chlorides seen in the lower-sulphate rule.
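The numbers Weka prints after each rule, e.g. (7.0/2.0) for this first rule, report how many training instances the rule covers and how many of those it misclassifies. We can check the coverage of the first rule directly in R (a sketch, assuming the red wine data frame wine with the column names used in the rule output):

# Instances matched by the first quality-8 rule
covered <- subset(wine, alcohol >= 12.5 & sulphates >= 0.69 &
                        sulphates <= 0.74 & chlorides >= 0.06)
nrow(covered)            # Weka reports 7 covered instances
table(covered$quality)   # two of them have a quality other than 8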

On the other side of the quality spectrum, there is a single rule for wines with a quality value of 4, which is quite low (note that there are no rules at all for the lowest quality value of 3). It states that if the volatile acidity is at least 1.02 (the rule also carries the condition >= 0.755, which is redundant given >= 1.02) and the fixed acidity is at most 7.5, then the quality value is 4.

Now to try to interpret these. If we had a sommelier at our disposal, we could present what we've found and see whether it squares with their domain knowledge. In the absence of a sommelier, we'll rely on some quick internet searches for interpretation.

Sulphates are sometimes added to wine to increase acidity (Modern Winemaking, P. Jackisch). The particular sulphate measured in this dataset is potassium sulphate, which is also used as a fertilizer. If we look at a summary of the sulphates distribution in the data (below), the rule's range of 0.69 to 0.74 runs from just above the median (0.62) to around the third quartile (0.73), i.e. a medium to medium-high amount of sulphates occurs in the high-quality wines. Somewhat counterintuitive, but it could be that this level is ideal for growing the best grapes. The viticulture publication "Sulphate of Potash and Wine Grape" by the fertilizer manufacturer Tessenderlo (http://www.tessenderlo.com/) states that "Potassium, delivered in the form of sulphate of potash (SOP) is always preferable, notably because of its beneficial role in the formation of sugars and organoleptic constituents, the contents of which will determine the quality of the wine." So we have at least a tenuous link between sulphates and wine quality. Again, a domain expert such as a vintner or sommelier would be the person to collaborate with to interpret these findings.
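That distribution summary is just R's summary() of the sulphates column (assuming, as before, a data frame called wine):

summary(wine$sulphates)   # five-number summary plus the mean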

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Chlorides are components of salts; this dataset measures the level of sodium chloride, i.e. table salt. The presence of sodium chloride in wine can indicate that the wine came from a vineyard near the sea coast or that salt was added (via the International Organization of Viticulture). I could find no direct ties between salt and quality, so we may have discovered an unknown relation… or a specious one! For reference, the full JRip rule set as output by Weka is below:

(alcohol >= 12.5) and (sulphates >= 0.69) and (sulphates <= 0.74) and (chlorides >= 0.06) => quality=8 (7.0/2.0)
(alcohol >= 12.6) and (sulphates >= 0.82) and (sulphates <= 0.86) => quality=8 (7.0/3.0)
(volatile.acidity >= 0.755) and (volatile.acidity >= 1.02) and (fixed.acidity <= 7.5) => quality=4 (9.0/3.0)
(alcohol >= 10.5) and (volatile.acidity <= 0.37) and (sulphates >= 0.73) and (density >= 0.9976) => quality=7 (18.0/3.0)
(alcohol >= 10.5) and (sulphates >= 0.73) and (alcohol >= 11.7) => quality=7 (78.0/27.0)
(alcohol >= 10.5) and (volatile.acidity <= 0.37) and (density <= 0.99536) and (volatile.acidity >= 0.28) and (citric.acid >= 0.34) and (residual.sugar >= 2.1) => quality=7 (19.0/3.0)
(alcohol >= 11) and (total.sulfur.dioxide <= 15) and (alcohol >= 11.6) and (sulphates >= 0.59) => quality=7 (17.0/3.0)
(alcohol >= 10.5) and (volatile.acidity <= 0.37) and (pH <= 3.27) and (alcohol >= 11.1) and (alcohol <= 11.7) => quality=7 (21.0/7.0)
(alcohol >= 10.3) and (free.sulfur.dioxide >= 13) and (density <= 0.9962) and (citric.acid <= 0.07) and (density >= 0.99522) => quality=6 (34.0/3.0)
(alcohol >= 10.033333) and (alcohol >= 11.4) and (residual.sugar <= 2.4) => quality=6 (118.0/31.0)
(sulphates >= 0.58) and (alcohol >= 10.3) => quality=6 (327.0/144.0)
(sulphates >= 0.59) and (total.sulfur.dioxide <= 28) and (volatile.acidity <= 0.55) => quality=6 (88.0/27.0)
(volatile.acidity <= 0.545) and (alcohol >= 9.9) => quality=6 (113.0/51.0)
 => quality=5 (743.0/229.0)

We can proceed in a similar way for each of the white box classification algorithms. It is clear that having some experience in the data domain, or a domain expert to guide you, is a great help in interpreting these models.
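If you want to script that inspection rather than clicking through the Weka UI, a rough sketch with RWeka (same assumptions about the wine data frame as earlier) could loop over a few of the learners and print each model:

library(RWeka)

# Fit several of the white box learners and print their structure.
# Only learners bundled with RWeka are used here; others from the table
# (BFTree, Ridor, etc.) may need to be installed through Weka's package
# manager depending on your Weka version.
learners <- list(OneR = OneR, JRip = JRip, PART = PART, J48 = J48)
models   <- lapply(learners, function(f) f(quality ~ ., data = wine))

for (name in names(models)) {
  cat("====", name, "====\n")
  print(models[[name]])   # the learned rule set or tree for this algorithm
}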

Conclusion

We have investigated the application of white box classification models to the UCI red wine dataset. We determined the accuracy of several classifiers and examined the structure of one of the simpler white box models produced (the JRip rule set). Maybe in the future we'll go into more depth, investigating the commonalities between these classifiers to see whether they predict using similar feature values and where they differ.