This is a simple exercise that uses splines to reduce a model to a smaller basis in order to better classify phoneme data. The data and a description are available on the Elements of Statistical Learning textbook’s website, and the problem is further described in section 5.2.3 of the textbook.

The data come from sound clips of people speaking a variety of phonemes: “aa,” “ao,” “dcl,” “iy,” and “sh.” There are 256 features, corresponding to a digitized periodogram reading at equally spaced frequencies.

## Only Two Phonemes

First, we will consider distinguishing between the two phonemes “aa” and “ao.”

### Raw Data

Each raw data vector has 256 highly correlated features. Our first attempt fits a model with a separate coefficient at each frequency. We will find that it overfits the training data and does a poor job of predicting the test data. Notice that I’m relying on `which.response`, which can be found in an earlier post.
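The post’s fitting code isn’t reproduced here, so as a minimal sketch of this step: unpenalized logistic regression with one coefficient per frequency. Everything below — the synthetic stand-in data, the `fit_logistic` helper — is hypothetical and not the post’s actual code; it only illustrates the “one coefficient per frequency” model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the phoneme data: n samples of 256
# periodogram values, with a smooth underlying coefficient curve.
n, p = 200, 256
freq = np.linspace(0.0, 1.0, p)
true_beta = np.sin(2 * np.pi * freq)        # smooth "true" coefficient curve
X = rng.normal(size=(n, p))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

def fit_logistic(X, y, steps=500, lr=0.01):
    """Plain logistic regression by gradient ascent: one free coefficient
    per column of X, which is what lets this model overfit."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += lr * X.T @ (y - prob) / len(y)
    return beta

beta_hat = fit_logistic(X, y)
print(beta_hat.shape)  # one coefficient for each of the 256 frequencies
```

With 256 free parameters and only a couple hundred training samples, the fitted coefficient curve will be jagged and tuned to the training noise, which is exactly the overfitting described above.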

### Using Splines

Now, we will try logistic regression again, this time using a range of spline bases.
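The key idea, following ESL section 5.2.3, is to force the 256 frequency coefficients to lie on a smooth curve spanned by a small spline basis, so the logistic fit only estimates `df` parameters. A minimal sketch, assuming a natural cubic spline basis with knots at quantiles (the `natural_spline_basis` helper below is my own illustration, not the post’s code):

```python
import numpy as np

def natural_spline_basis(x, df):
    """df-column natural cubic spline basis (ESL eqs. 5.4-5.5),
    with df knots placed at quantiles of x."""
    knots = np.quantile(x, np.linspace(0.0, 1.0, df))
    def d(k):
        return (np.clip(x - knots[k], 0.0, None) ** 3
                - np.clip(x - knots[-1], 0.0, None) ** 3) / (knots[-1] - knots[k])
    cols = [np.ones_like(x), x] + [d(k) - d(df - 2) for k in range(df - 2)]
    return np.column_stack(cols)

freq = np.arange(256, dtype=float)      # the 256 frequency positions
H = natural_spline_basis(freq, df=12)   # 256 x 12 basis matrix
print(H.shape)

# A raw n x 256 feature matrix X is then reduced to X @ H (n x 12);
# logistic regression is fit on those 12 columns, and the implied
# per-frequency coefficient curve is H @ theta.
```

Sweeping `df` over a range of values gives the family of spline fits compared below; each value trades flexibility for smoothness.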

Let’s glance at the composition of the best-performing spline basis functions.

### Comparison

The spline-regularized logistic fit outperformed the raw fit in predicting the test data. The only exception was the basis with 4 degrees of freedom, which presumably regularized too heavily. The raw fit performed better on the training data, however, because its many parameters allow it to overfit.

## All Five Phonemes

Next, I repeat the process, this time considering all five phonemes.

### Raw Data

The following code relies on `lr.predict`, which can be found elsewhere.
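Since `lr.predict` lives in another post, here is a hypothetical stand-in for the five-class version of the raw fit: multinomial (softmax) logistic regression with one coefficient per class per frequency. The data below are synthetic placeholders, not the phoneme data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: n samples, 256 frequencies, 5 classes
# (the five phonemes: aa, ao, dcl, iy, sh).
n, p, k = 300, 256, 5
X = rng.normal(size=(n, p))
y = rng.integers(0, k, size=n)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, y, k, steps=300, lr=0.1):
    """Multinomial logistic regression by gradient ascent:
    a full column of coefficients per class."""
    Y = np.eye(k)[y]                       # one-hot targets
    W = np.zeros((X.shape[1], k))
    for _ in range(steps):
        P = softmax(X @ W)
        W += lr * X.T @ (Y - P) / len(y)
    return W

W = fit_softmax(X, y, k)
pred = np.argmax(softmax(X @ W), axis=1)   # predicted phoneme index per sample
print(W.shape, pred.shape)
```

This raw model now has 256 × 5 free parameters, so the overfitting seen in the two-class case only gets worse.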

### Using Splines

For simplicity, I will only use the spline basis that worked best from before.
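Combining the two ideas, a sketch of the multiclass spline fit: project the 256 frequencies through the spline basis, then run the softmax regression on the reduced columns. As before, the helper and the data are illustrative assumptions (I use `df=12` as a placeholder for whichever basis won above).

```python
import numpy as np

rng = np.random.default_rng(2)

def natural_spline_basis(x, df):
    """df-column natural cubic spline basis with knots at quantiles of x."""
    knots = np.quantile(x, np.linspace(0.0, 1.0, df))
    def d(k):
        return (np.clip(x - knots[k], 0.0, None) ** 3
                - np.clip(x - knots[-1], 0.0, None) ** 3) / (knots[-1] - knots[k])
    return np.column_stack([np.ones_like(x), x]
                           + [d(k) - d(df - 2) for k in range(df - 2)])

n, p, k, df = 300, 256, 5, 12              # df=12 is a placeholder choice
X = rng.normal(size=(n, p))                # synthetic stand-in data
y = rng.integers(0, k, size=n)

H = natural_spline_basis(np.arange(p, dtype=float), df)
XH = X @ H                                 # n x df reduced design matrix
XH = XH / XH.std(axis=0)                   # scale columns for stable steps

Y = np.eye(k)[y]                           # one-hot targets
W = np.zeros((df, k))
for _ in range(300):
    Z = XH @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    W += 0.1 * XH.T @ (Y - P) / n

print(W.shape)  # df coefficients per class instead of 256
```

Each class now gets `df` spline coefficients rather than 256 free ones, which is what buys the better test-set behavior reported below.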

### Comparison

The regularized logistic fit again outperformed the raw fit in predicting test data. And again, the raw fit worked better on the training data because of overfitting.