Comparing Classification Algorithms for Handwritten Digits
This is a simple exercise comparing several classification methods for identifying handwritten digits. The data are summarized in a prior post. Last time, I only considered classifying the digits “2” and “3”; this time I keep all ten.
To compare methods, I rely on a couple of utility functions, which.response and crossval, which can be found elsewhere.
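The real definitions live in that other post; as a rough, hypothetical sketch of what a crossval-style utility might look like (the actual function’s interface may differ):

```r
# Hypothetical sketch of a crossval-style utility (the real one is defined
# elsewhere and may differ): split the rows of x into nfold random folds,
# fit on each training portion with fit.fun, and return the mean held-out
# misclassification rate. fit.fun(train.x, train.y, test.x) returns labels.
crossval <- function(x, y, fit.fun, nfold = 5) {
  folds <- sample(rep(seq_len(nfold), length.out = nrow(x)))
  errs <- sapply(seq_len(nfold), function(f) {
    pred <- fit.fun(x[folds != f, ], y[folds != f], x[folds == f, ])
    mean(pred != y[folds == f])
  })
  mean(errs)
}
```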
k-Nearest Neighbors
In order to cross-validate the results for a range of $k$ values, I will use a helper function.
Using the crossval function, along with knn.range, I try $k \in \{1, 2, 5, 10\}$ to look for a general trend.
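A minimal sketch of how knn.range might be written, reusing the hypothetical crossval sketch above and placeholder objects train.x / train.y for the digit data (the post’s actual helper may be structured differently):

```r
library(class)

# Hypothetical stand-in for knn.range: cross-validated error for each k.
knn.range <- function(x, y, ks = c(1, 2, 5, 10)) {
  sapply(ks, function(k) {
    crossval(x, y, function(tr.x, tr.y, te.x) knn(tr.x, te.x, tr.y, k = k))
  })
}

knn.range(train.x, train.y)
```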
I will now try the best $k$ on my training and my test data and record the error rates. The training error rate should be trivially zero when using $k=1$ nearest neighbors, as long as there are no exact ties between points belonging to different classes. In this case, the probability of that is negligible.
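A sketch of that step with the same placeholder names; best.k should be whichever value the cross-validation favored:

```r
best.k <- 1  # placeholder: the k that minimized the cross-validated error

# Training error (predicting the training set from itself) and test error.
train.pred <- knn(train = train.x, test = train.x, cl = train.y, k = best.k)
test.pred  <- knn(train = train.x, test = test.x,  cl = train.y, k = best.k)
mean(train.pred != train.y)  # ~0 when best.k = 1
mean(test.pred  != test.y)
```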
Linear Discriminant Analysis
LDA assumes that the various classes have the same covariance structure. The data cloud within each group should have the same basic shape. We cannot see the 256-dimensional data cloud, but we can visualize any two-dimensional projection of it, which should have that same property. The first two principal components for a sample of the data are plotted below.
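A sketch of that plot, assuming the 256 pixel columns sit in a matrix train.x with labels train.y; the sample size of 1000 is an arbitrary choice:

```r
set.seed(1)
idx <- sample(nrow(train.x), 1000)

# Project a sample of digits onto the first two principal components
# and color the points by digit class.
pc  <- prcomp(train.x[idx, ])
cls <- as.factor(train.y[idx])
plot(pc$x[, 1], pc$x[, 2], col = cls, pch = 20,
     xlab = "First principal component", ylab = "Second principal component")
legend("topright", legend = levels(cls), col = seq_along(levels(cls)),
       pch = 20, cex = 0.7)
```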
The assumption of equal covariance structures is not plausible. We will try LDA anyway for comparison to the other methods.
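A sketch of the LDA fit using MASS::lda, again with placeholder data objects (the post’s actual code may route through crossval or other helpers):

```r
library(MASS)

# Fit LDA on the training digits and compute training and test error rates.
lda.fit   <- lda(x = train.x, grouping = train.y)
lda.train <- predict(lda.fit, train.x)$class
lda.test  <- predict(lda.fit, test.x)$class
mean(lda.train != train.y)
mean(lda.test  != test.y)
```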
Quadratic Discriminant Analysis
QDA discards the equal covariance assumption of LDA, but it estimates so many parameters that it has high variance. Another drawback is that QDA algorithms are susceptible to numerical errors, as seen below.
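A sketch of the QDA fit with MASS::qda. QDA needs each class’s covariance estimate to be full rank, so a small amount of noise (“jitter”) is added to break exact collinearity among the pixel columns; the noise scale here is an arbitrary choice:

```r
library(MASS)

# qda() can fail with a rank-deficiency error when a class has constant or
# exactly collinear pixel columns, so add a tiny jitter to the training pixels.
set.seed(1)
train.x.j <- train.x + rnorm(length(train.x), sd = 1e-6)

qda.fit  <- qda(x = train.x.j, grouping = train.y)
qda.test <- predict(qda.fit, test.x)$class
mean(qda.test != test.y)
```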
Regularized Discriminant Analysis
Technically, it’s redundant to do RDA in addition to LDA and QDA, because LDA and QDA are special cases of it. I could have just started with RDA, letting the $\lambda$ parameter vary over $[0, 1]$. In the klaR package’s rda function, $\lambda$ represents the amount of weight given to the pooled covariance matrix. Because I have already looked at $\lambda=1$ (LDA) and $\lambda=0$ (QDA), I will use cross-validation for a range of $\lambda$ values between these extremes. My helper function rda.predict comes in handy.
Just like QDA, RDA will have problems with exactly multicollinear columns, so we will reuse the jittered data.
Now, I can use the best-performing $\lambda$ to try classifying the full training and test data.
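A sketch of that search using klaR’s rda directly (in place of my rda.predict helper), reusing the jittered training matrix from the QDA step and a simple hold-out split instead of full cross-validation; the grid of $\lambda$ values and the hold-out fraction are arbitrary choices:

```r
library(klaR)

# Hold out 20% of the (jittered) training data to pick lambda, with gamma
# fixed at 0 so only the pooled-vs-individual covariance weight varies.
set.seed(1)
hold    <- sample(nrow(train.x.j), floor(0.2 * nrow(train.x.j)))
lambdas <- seq(0.1, 0.9, by = 0.2)

cv.err <- sapply(lambdas, function(lam) {
  fit <- rda(x = train.x.j[-hold, ], grouping = train.y[-hold],
             gamma = 0, lambda = lam)
  mean(predict(fit, train.x.j[hold, ])$class != train.y[hold])
})
best.lambda <- lambdas[which.min(cv.err)]

# Refit on all of the training data with the chosen lambda.
rda.fit  <- rda(x = train.x.j, grouping = train.y, gamma = 0, lambda = best.lambda)
rda.test <- predict(rda.fit, test.x)$class
mean(rda.test != test.y)
```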
Logistic Regression
Finally, logistic regression makes fewer assumptions about the data. Let’s see how it does by using the nnet package’s multinom function and my lr.predict helper function.
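A sketch of the multinom fit (rather than my lr.predict wrapper), with the same placeholder data objects; the weight limit has to be raised because ten classes times 257 coefficients exceeds nnet’s default:

```r
library(nnet)

# Multinomial logistic regression on the 256 pixel features. The default
# MaxNWts (1000) is too small for 10 classes x 257 coefficients.
train.df <- data.frame(digit = factor(train.y), train.x)
test.df  <- data.frame(test.x)

lr.fit  <- multinom(digit ~ ., data = train.df,
                    MaxNWts = 5000, maxit = 200, trace = FALSE)
lr.test <- predict(lr.fit, newdata = test.df, type = "class")
mean(lr.test != test.y)
```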
Comparing the Methods
For this problem, k-nearest neighbors was the most successful in predicting the groups of the test data. This may have to do with the fact that it makes the fewest assumptions. The other (parametric) methods are based on models that may be inappropriate for this problem.