This is a simple exercise that uses splines to reduce a model to a simpler basis in order to better classify phoneme data. The data and a description are available on the Elements of Statistical Learning textbook’s website. It is further described in section 5.2.3 of the textbook.

The data come from sound clips of people speaking a variety of phonemes: “aa,” “ao,” “dcl,” “iy,” and “sh.” There are 256 features, corresponding to a digitized periodogram reading at equally spaced frequencies.

Only Two Phonemes

First, we will consider distinguishing between the two phonemes “aa” and “ao.”

Raw Data

As seen in the example plot above, each raw data vector has 256 highly correlated features. Our first attempt at a model allows a different coefficient at each frequency. We will find that it overfits the training data and does a poor job of predicting the test data. Notice that I’m relying on which.response, which can be found in an earlier post.
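The original post's R code (including the `which.response` helper) is not reproduced here. As a minimal sketch of the idea, the following Python code fits an unregularized logistic regression with one coefficient per frequency, using synthetic stand-in data with 256 correlated features; the data generator, sample sizes, and all names are assumptions, not the post's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p = 256                      # features, as in the phoneme data
freq = np.linspace(0, 1, p)

def make_data(n):
    # Synthetic stand-in for the two phoneme classes: a smooth
    # class-dependent mean plus correlated (random-walk) noise.
    y = rng.integers(0, 2, n)
    mean = np.where(y[:, None] == 1,
                    np.sin(2.0 * np.pi * freq),
                    np.sin(2.2 * np.pi * freq))
    noise = np.cumsum(rng.normal(size=(n, p)), axis=1) * 0.3
    return mean + noise, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(200)

# One free coefficient per frequency; a large C makes the
# ridge penalty negligible, so this is essentially unregularized.
raw_fit = LogisticRegression(C=1e4, max_iter=5000).fit(X_tr, y_tr)
print("train accuracy:", raw_fit.score(X_tr, y_tr))
print("test accuracy: ", raw_fit.score(X_te, y_te))
```

With more features (256) than training samples (200), the fit can separate the training data almost perfectly, so train accuracy far exceeds test accuracy, which is the overfitting the post describes.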

Using Splines

Now, we will try logistic regression again, this time using a range of spline bases.
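The post's spline code is not shown; the following sketch illustrates the filtering idea from ESL section 5.2.3: rather than 256 free coefficients, constrain them to lie in a low-dimensional spline basis by replacing X with XH, where H is a 256 × df basis matrix evaluated at the frequencies. I use a truncated-power cubic basis here as an assumption — the book uses natural cubic splines — and the synthetic data, sample sizes, and df grid are all placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p = 256
freq = np.linspace(0, 1, p)

def make_data(n):
    # Stand-in for the phoneme periodograms: smooth class means
    # plus correlated (random-walk) noise.
    y = rng.integers(0, 2, n)
    mean = np.where(y[:, None] == 1,
                    np.sin(2.0 * np.pi * freq),
                    np.sin(2.2 * np.pi * freq))
    return mean + np.cumsum(rng.normal(size=(n, p)), axis=1) * 0.3, y

def cubic_spline_basis(x, df):
    """Truncated-power cubic spline basis with df columns (df >= 4).
    Columns: 1, x, x^2, x^3, then (x - k_j)_+^3 at df - 4 interior knots."""
    knots = np.quantile(x, np.linspace(0, 1, df - 2)[1:-1])
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(200)

# Try a range of basis sizes; each fit has only df parameters.
for df in (4, 8, 12, 16):
    H = cubic_spline_basis(freq, df)               # (256, df)
    fit = LogisticRegression(max_iter=2000).fit(X_tr @ H, y_tr)
    print(f"df={df:2d}  test accuracy: {fit.score(X_te @ H, y_te):.3f}")
```

Projecting onto the basis acts as a regularizer: the coefficient curve over frequency is forced to be smooth, which is why the spline fits generalize better below.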

Let’s glance at the composition of the best-performing spline basis functions.
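One way to inspect a fitted spline basis is to map the df spline-space coefficients back to the frequency scale: the implied per-frequency coefficient curve is β = Hθ. A small sketch, using a placeholder θ rather than actual fitted values (the basis construction and df = 8 are assumptions):

```python
import numpy as np

p, df = 256, 8
x = np.linspace(0, 1, p)

# Truncated-power cubic basis with df - 4 interior knots at quantiles.
knots = np.quantile(x, np.linspace(0, 1, df - 2)[1:-1])
H = np.column_stack([np.ones(p), x, x**2, x**3] +
                    [np.clip(x - k, 0, None) ** 3 for k in knots])

# Placeholder spline-space coefficients; in practice these would be
# the fitted logistic-regression coefficients.
theta = np.linspace(-0.3, 0.4, df)

beta = H @ theta   # smooth coefficient curve over the 256 frequencies
print(beta.shape)  # one implied coefficient per frequency
```

Plotting `beta` against frequency shows how few degrees of freedom the model actually spends, compared with the jagged 256-coefficient raw fit.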

Comparison

The regularized logistic fit outperformed the raw fit in predicting test data. The only exception was when using 4 degrees of freedom, which presumably regularized too heavily. The raw fit, however, performed better on the training data because its many parameters let it overfit.

All Five Phonemes

Next, I repeat the process, this time considering all five phonemes.

Raw Data

The following code relies on lr.predict, which can be found elsewhere.
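The `lr.predict` helper from the original post is not shown here. As a hedged stand-in, here is the multinomial analogue of the raw fit using scikit-learn, again on synthetic five-class data (the generator, sample sizes, and class structure are all placeholders for the real phoneme data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
p = 256
freq = np.linspace(0, 1, p)
phonemes = ["aa", "ao", "dcl", "iy", "sh"]

def make_data(n):
    # Stand-in for five phoneme classes: class-specific smooth
    # means plus correlated (random-walk) noise.
    y = rng.integers(0, 5, n)
    mean = np.sin((2.0 + 0.2 * y[:, None]) * np.pi * freq)
    return mean + np.cumsum(rng.normal(size=(n, p)), axis=1) * 0.3, y

X_tr, y_tr = make_data(400)
X_te, y_te = make_data(400)

# Multinomial logistic regression on the raw 256 features:
# one coefficient vector per class.
raw_fit = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("train:", raw_fit.score(X_tr, y_tr))
print("test: ", raw_fit.score(X_te, y_te))
```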

Using Splines

For simplicity, I will only use the spline basis that worked best from before.
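A sketch of that step: fit the multinomial model on the spline-filtered features XH rather than X. The df = 12 value, basis construction, and synthetic data below are assumptions for illustration; the post does not restate which basis won.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
p = 256
freq = np.linspace(0, 1, p)

def make_data(n):
    # Synthetic five-class stand-in, as in the raw-data sketch.
    y = rng.integers(0, 5, n)
    mean = np.sin((2.0 + 0.2 * y[:, None]) * np.pi * freq)
    return mean + np.cumsum(rng.normal(size=(n, p)), axis=1) * 0.3, y

X_tr, y_tr = make_data(400)
X_te, y_te = make_data(400)

# Single truncated-power cubic basis; df = 12 is a placeholder
# for the best-performing basis from the two-phoneme experiment.
df = 12
knots = np.quantile(freq, np.linspace(0, 1, df - 2)[1:-1])
H = np.column_stack([np.ones(p), freq, freq**2, freq**3] +
                    [np.clip(freq - k, 0, None) ** 3 for k in knots])

spline_fit = LogisticRegression(max_iter=5000).fit(X_tr @ H, y_tr)
print("train:", spline_fit.score(X_tr @ H, y_tr))
print("test: ", spline_fit.score(X_te @ H, y_te))
```

Each class now gets only df coefficients instead of 256, so the model is far more constrained than the raw multinomial fit.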

Comparison

The regularized logistic fit again outperformed the raw fit in predicting test data. And again, the raw fit worked better on the training data because of overfitting.