Smoothing Data
Why Would Anyone Smooth Data?
You’ve used an imperfect instrument to measure $Y$ at a range of $x$ values. You believe that your instrument produces independent errors $\epsilon_i$, each with mean zero and variance $\sigma^2$. In other words, your model is

$$Y_i = g(x_i) + \epsilon_i$$

for some unknown function $g$.
Here is your data:
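(The original dataset and code aren’t reproduced here, so the sketch below is a hypothetical reconstruction: the function g, the x grid, and the random seed are all stand-ins, with the noise standard deviation set to 0.2 so the error variance is 0.04 as described below.)

```r
set.seed(1)
g <- function(x) sin(x / 4) + x / 20   # stand-in for the unknown smooth function
x <- seq(1, 25, by = 0.5)              # x grid chosen so that x[31] == 16
Y <- g(x) + rnorm(length(x), mean = 0, sd = 0.2)   # variance 0.04, as in the text

plot(x, Y, pch = 16, col = "grey30")
points(x[31], Y[31], pch = 16, col = "red")   # the observation at x = 16
```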
Now I want you to estimate $g(16)$ from your data. One obvious estimator is $Y_{31}$ (the red point on the plot) because it is the output of your instrument when $x$ was set to $16$. It is unbiased because the errors have mean zero. Its variance is $0.04$, so that is the expected squared error of this estimate. If we had several observations at $x=16$, then we would probably average them to get a better estimate. In this case, we don’t have any more observations at $x=16$, but we do have some nearby observations. Could we let those neighboring $Y$ values inform our estimate of $g(16)$?
Answering this question requires thinking about the mechanism that is generating the data. Is the true relationship between $x$ and $Y$ likely to be “jumpy” or “smooth”? Most real-world phenomena behave smoothly, although there are some exceptions. You can look at your data and see if it seems like it might consist of random deviations from a smooth curve. (Note: I’m using the term “smooth” informally in this article.) That certainly seems to be the case with our current data.
Notice that this is not all that different from other fitting procedures. Often people assume that there is a linear relationship between $x$ and $Y$, and they will perform linear regression. Likewise, they might try a quadratic relationship or an exponential relationship. In smoothing, we are saying that we can’t identify any simple relationship, but the data does seem to follow some smooth curve.
My favorite function for smoothing is smooth.spline. Let’s see what it estimates $g$ to be.
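A minimal sketch of the fit, using the stand-in data from above (the plotting details are guesses at what the original code did):

```r
fit <- smooth.spline(x, Y)

plot(x, Y, pch = 16, col = "grey30")
lines(predict(fit, seq(min(x), max(x), length.out = 200)), col = "blue", lwd = 2)

predict(fit, 16)$y   # the smooth curve's estimate of g(16)
```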
To me, it is very believable that the true relationship between $x$ and $Y$ follows something like this smooth curve and that the data points deviate from it due to random noise. If I had to estimate $g(16)$, I would not hesitate to use the value of our smooth curve over the $Y$ value that I observed at $x=16$. And in this case, I would be right to do so.
Let’s compare the performance of our raw $Y$ values with the performance of the smooth fit curve in estimating $g$. Assume we were trying to estimate $g$ at all of the $x$-values where we have taken measurements.
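Here’s one way to do that comparison in R, assuming (as with the simulated stand-in data above) that we actually know the true $g$:

```r
smooth_hat <- predict(fit, x)$y   # smoothed value at each observed x

mean((Y - g(x))^2)            # raw observations as estimates of g
mean((smooth_hat - g(x))^2)   # smoothing-spline values as estimates of g
```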
Using the smooth values as your estimates would cut your average squared error by more than a third, in this case.
You might be curious just how smooth you should make your fit curve. A good way to decide is to try a range of possible smoothness values and use cross-validation to see which level of smoothness best predicts the left-out data points. The smooth.spline function does this by default and returns the fit at the best smoothness it finds.
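As a quick sketch on the stand-in data, you can inspect the smoothness that smooth.spline chose, switch to ordinary leave-one-out cross-validation, or override the choice entirely:

```r
fit_gcv <- smooth.spline(x, Y)             # generalized cross-validation (the default)
fit_cv  <- smooth.spline(x, Y, cv = TRUE)  # ordinary leave-one-out cross-validation
c(fit_gcv$spar, fit_cv$spar)               # chosen smoothness, on the spar scale

# Forcing a much larger spar gives a visibly over-smoothed curve.
fit_stiff <- smooth.spline(x, Y, spar = 1.2)
```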
Let me caution you not to smooth indiscriminately. As this example showed, smoothing makes a lot of sense in some situations, but you need to think carefully about your data before deciding to smooth it. You don’t want to accidentally erase an important pattern in your data by smoothing over it!
Estimating Derivatives
Another application of smoothing is estimating derivatives of $g$ from the data. I’ve posted an article about this before. Recently, a British physics student contacted me for help estimating second derivatives from his data. I did a little work for him, so I decided I may as well share it on this blog.
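His dataset isn’t included here, so the sketch below assumes a CSV with Wavelength and Transmittance columns (the column names come from the discussion that follows; the file name is made up):

```r
spec <- read.csv("spectrum.csv")   # hypothetical file name
head(spec)                         # expected columns: Wavelength, Transmittance

plot(spec$Wavelength, spec$Transmittance, pch = ".",
     xlab = "Wavelength", ylab = "Transmittance")
```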
The low wavelength points look pretty wild; let’s zoom in on them.
It looks sort of like three separate strands of data, and indeed our dataset has three Transmittance values for each Wavelength.
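One way to separate them, assuming the three readings for each wavelength sit in consecutive rows of the file, is to tag each row with a series index and color the plot by it:

```r
# Assumes the three readings for each wavelength are consecutive rows.
spec$series <- rep(1:3, length.out = nrow(spec))

plot(spec$Wavelength, spec$Transmittance, pch = ".",
     col = c("red", "green3", "blue")[spec$series],
     xlab = "Wavelength", ylab = "Transmittance")
```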
Beautiful! Clearly we have three separate data series that have been merged. And the differences between the series at low wavelengths are much larger than the differences at the higher wavelengths. Let’s estimate derivatives separately for each of these series. Then, we will combine those derivative estimates at the end.
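A sketch of the per-series fits (the original code isn’t shown, so the details are assumptions):

```r
by_series <- split(spec, spec$series)
fits <- lapply(by_series, function(d) smooth.spline(d$Wavelength, d$Transmittance))

plot(spec$Wavelength, spec$Transmittance, pch = ".",
     xlab = "Wavelength", ylab = "Transmittance")
wl <- sort(unique(spec$Wavelength))
for (f in fits) lines(predict(f, wl), lwd = 2)
```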
Notice that these smooth curves don’t follow the data very well at small wavelengths. Let’s make a separate smooth fit for that region.
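For example, something like this, reusing the 660 split point that comes up later in the post:

```r
# Fit the low-wavelength region separately for each series.
low <- lapply(by_series, function(d) d[d$Wavelength < 660, ])
fits_low <- lapply(low, function(d) smooth.spline(d$Wavelength, d$Transmittance))
```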
Putting these two smooth fits together, we can compute second derivative estimates for the full wavelength range for each series of data.
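A sketch of that computation, producing the d2 object referred to below (its exact structure here, a list with one vector of second derivatives per series, is an assumption):

```r
# d2: one vector of second-derivative estimates per series, using the
# low-wavelength fit below 660 and the full-range fit at or above 660.
d2 <- lapply(1:3, function(i) {
  c(predict(fits_low[[i]], wl[wl < 660],  deriv = 2)$y,
    predict(fits[[i]],     wl[wl >= 660], deriv = 2)$y)
})
```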
Let’s plot the second derivative estimates for each data series.
Those plots aren’t very revealing because the low-wavelength second derivatives are so much greater in magnitude than those in the high-wavelength region. Let’s try that again, this time splitting into two regions. I’ll use smooth.spline to compute an overall estimate of the second derivative and add it to these plots.
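Here is one way that might look. The text doesn’t spell out what the “overall estimate” is fit to; this sketch takes it to be a spline fit to the pooled data from all three series in each region, which is an assumption:

```r
# "Overall" second-derivative estimate: a spline fit to the pooled data from
# all three series in each region (an assumption about what "overall" means).
overall_low  <- smooth.spline(spec$Wavelength[spec$Wavelength < 660],
                              spec$Transmittance[spec$Wavelength < 660])
overall_high <- smooth.spline(spec$Wavelength[spec$Wavelength >= 660],
                              spec$Transmittance[spec$Wavelength >= 660])

op <- par(mfrow = c(1, 2))
for (region in c("low", "high")) {
  keep <- if (region == "low") wl < 660 else wl >= 660
  ov   <- if (region == "low") overall_low else overall_high
  plot(range(wl[keep]), range(sapply(d2, `[`, keep)), type = "n",
       xlab = "Wavelength", ylab = "Second derivative", main = region)
  for (i in 1:3) lines(wl[keep], d2[[i]][keep], col = i + 1)   # per-series estimates
  lines(predict(ov, wl[keep], deriv = 2), lwd = 2)             # overall estimate
}
par(op)
```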
Now d2 contains some pretty good second derivative estimates. However, these estimates aren’t quite as refined as they could be. One could improve on this a little by subdividing the original data into more regions. I only used two regions, splitting at a wavelength of $660 \text{ cm}^{-1}$, but clearly the change in variability is quite gradual. Each smooth.spline should be run on a region in which the variability is fairly consistent.