Thursday, May 21, 2015

K-Fold Cross Validation with Decision Trees in R


1 K-Fold Cross Validation with Decision Trees in R   decision_trees machine_learning

1.1 Overview

We are going to go through an example of a k-fold cross validation experiment using a decision tree classifier in R.

K-fold cross validation is a method for obtaining a robust estimate of the error of a trained classification model.

When we train a predictive model, we want that model not only to be accurate on the data used to train it, but also to generalize to samples it has not yet seen. A common technique for ensuring this generalizability is to split the data into a training set and a test set. The model is trained on the training split and then evaluated on the test split, to make sure it did not simply learn to be accurate on the training data (i.e., overfit).
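
For example, a simple random train/test split in R might look like the sketch below; the 70/30 proportion, the seed, and the variable names are just illustrative choices.

set.seed(42)                                               # for reproducibility
train.idx <- sample(nrow(iris), floor(0.7 * nrow(iris)))   # random 70% of the row indices
train.data <- iris[train.idx, ]                            # training split
test.data  <- iris[-train.idx, ]                           # held-out test split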

Similarly, in k-fold cross validation we split the data into k equally sized partitions (folds). For each of the k partitions, we hold out the \(i^{th}\) partition, train our model on the other \(k-1\) partitions, and test on the held-out \(i^{th}\) partition. We then average the error over the test results of all k rounds of training/testing.
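
In other words, if \(e_i\) denotes the misclassification error measured on the held-out \(i^{th}\) partition, the cross-validated error estimate is the average

\[ \bar{e} = \frac{1}{k} \sum_{i=1}^{k} e_i \]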

1.2 Naive Training/Testing

To begin, we will show a naive implementation of a train/test process. In this example, we don't split the data into separate training and testing sets.

Here, we are training the model on the full dataset.

library(rpart)
data(iris)
rpart.model <- rpart(Species~., data=iris, method="class")
print(rpart.model)
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
  2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
    6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *

And now we test on the same dataset. From this, we obtain a confusion matrix.

rcart.prediction <- predict(rpart.model, newdata=iris, type="class")
confusion.matrix <- table(iris$Species, rcart.prediction)
print(confusion.matrix)
          rcart.prediction
           setosa versicolor virginica
setosa         50          0         0
versicolor      0         49         1
virginica       0          5        45

The resulting accuracy is as follows:

accuracy.percent <- 100*sum(diag(confusion.matrix))/sum(confusion.matrix)
print(paste("accuracy:",accuracy.percent,"%"))
[1] "accuracy: 96 %"

Pretty good. But our model could very well be overfit. If we were to obtain new measurements for each of these species, our accuracy might not be as good, because the model fits the data it was trained on but does not necessarily generalize to new data.

1.3 Using k-fold cross-validation to train and test the model

So let's use k-fold cross-validation to obtain a more trustworthy estimate of the model's error on new data.

library(plyr)
library(rpart)
set.seed(123)
form <- "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width"
# randomly assign each row of iris to one of 10 equally sized folds
folds <- split(iris, cut(sample(1:nrow(iris)), 10))
errs <- rep(NA, length(folds))

for (i in 1:length(folds)) {
 # hold out the i-th fold for testing and train on the remaining folds
 test <- ldply(folds[i], data.frame)
 train <- ldply(folds[-i], data.frame)
 tmp.model <- rpart(as.formula(form), data = train, method = "class")
 tmp.predict <- predict(tmp.model, newdata = test, type = "class")
 # misclassification error on the held-out fold
 conf.mat <- table(test$Species, tmp.predict)
 errs[i] <- 1 - sum(diag(conf.mat))/sum(conf.mat)
}
print(sprintf("average error using k-fold cross-validation: %.3f percent", 100*mean(errs)))
[1] "average error using k-fold cross-validation: 7.333 percent"

So there we have it: k-fold cross-validation in action. The estimated error is higher than what we obtained by simply training and testing on the same data (roughly 7.3% versus 4%), which suggests that the naive estimate was optimistically biased because the model was overfit to the very data it was evaluated on.
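
If we want a feel for how stable this estimate is, we can also look at the spread of the individual fold errors rather than just their mean, for example:

print(errs)                                                            # per-fold misclassification errors
print(sprintf("standard deviation of fold errors: %.3f percent", 100*sd(errs)))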

1.4 k-fold cross-validation with C5.0

Let's do the same thing but with a different decision tree algorithm, C5.0. This is an update of J. Ross Quinlan's popular C4.5 algorithm.

library(C50)
library(plyr)
form <- "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width"
# build a fresh set of 10 folds, as before
folds <- split(iris, cut(sample(1:nrow(iris)), 10))
errs.c50 <- rep(NA, length(folds))
for (i in 1:length(folds)) {
 test <- ldply(folds[i], data.frame)
 train <- ldply(folds[-i], data.frame)
 tmp.model <- C5.0(as.formula(form), data = train)
 tmp.predict <- predict(tmp.model, newdata = test)
 # misclassification error on the held-out fold
 conf.mat <- table(test$Species, tmp.predict)
 errs.c50[i] <- 1 - sum(diag(conf.mat))/sum(conf.mat)
}

print(sprintf("average error using k-fold cross validation and C5.0 decision tree algorithm: %.3f percent", 100*mean(errs.c50)))
[1] "average error using k-fold cross validation and C5.0 decision tree algorithm: 6.000 percent"
