The Null Lineup
There’s a useful informal technique for hypothesis testing that I call “the null lineup.” To decide whether or not your data came from some particular null distribution, you generate a handful of null datasets, create a plot for each, and place your data’s plot in a lineup with the null plots. If your data stands out strongly from the rest, then you have reason to believe that the null hypothesis is false. Otherwise, you don’t. Keep reading to see a couple of examples.
Looking at Quantile Plots
How do you decide whether or not your data could be normally distributed? Students of data analysis are typically told to create a normal quantile plot and check that the points are roughly linear. To me, that instruction is so vague that it is practically worthless. Of course the points won’t be perfectly linear, due to randomness. How linear should the points be? Well, that depends on the sample size; you should expect a larger sample to produce a more linear plot. Still, this is not specific enough to be useful to a data analyst. What you need to decide is whether your normal quantile plot would be atypical for a normal sample of that size.
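To make the sample-size effect concrete, here is a small sketch in base R. It scores the straightness of a normal quantile plot by the correlation between the sorted sample and the theoretical normal quantiles. That straightness measure, the helper name qq_correlation, the sample sizes, and the number of replications are all illustrative choices of mine, not anything prescribed by the technique.

```r
# Straightness of a normal quantile plot, measured as the correlation
# between the sorted sample and the theoretical normal quantiles.
# (qq_correlation is a hypothetical helper name used for illustration.)
qq_correlation <- function(x) {
  theoretical <- qnorm(ppoints(length(x)))
  cor(theoretical, sort(x))
}

set.seed(1)  # arbitrary seed, only for reproducibility

# Average straightness over 200 simulated normal samples of each size.
sizes <- c(10, 100, 1000)
mean_cor <- sapply(sizes, function(n) {
  mean(replicate(200, qq_correlation(rnorm(n))))
})
names(mean_cor) <- sizes
print(round(mean_cor, 4))
```

The averages should creep toward 1 as the sample size grows, which is the sense in which a fixed standard of "somewhat linear" cannot work for every sample size.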
Here’s an informal trick to help you decide. Compare your normal quantile plot to the normal quantile plots of a handful of randomly generated samples that really are normal. The code below shows an example of this technique.
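The following is a minimal sketch of such a lineup in base R, not necessarily the exact code behind the original figures. The helper names make_lineup and plot_lineup, the sample size of 50, and the lineup size of 9 are assumptions made for illustration. The null samples are drawn with the same mean and standard deviation as the data, so that only the shape of the plot can give the real panel away.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 50                # assumed sample size
my_data <- rnorm(n)    # stand-in for "your data"; here it really is normal

# Build a lineup: your data plus (size - 1) simulated normal samples with the
# same mean and standard deviation, placed at a random position.
make_lineup <- function(x, size = 9) {
  nulls <- replicate(size - 1, rnorm(length(x), mean(x), sd(x)),
                     simplify = FALSE)
  position <- sample(size, 1)   # panel that will hold the real data
  samples <- append(nulls, list(x), after = position - 1)
  list(samples = samples, answer = position)
}

# Draw one normal quantile plot per panel, labeled only by panel number.
plot_lineup <- function(lineup) {
  size <- length(lineup$samples)
  rows <- ceiling(sqrt(size))
  old_par <- par(mfrow = c(rows, ceiling(size / rows)), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))
  for (i in seq_len(size)) {
    qqnorm(lineup$samples[[i]], main = paste("Plot", i), xlab = "", ylab = "")
  }
}

lineup <- make_lineup(my_data)
plot_lineup(lineup)
answer <- lineup$answer   # do not print this until you have studied the plots
```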
At this point, you do not know which of the plots corresponds to your data. Look at the lineup. Do any of the plots seem to stand out from the rest? Once you have made up your mind about this question, look at the answer variable to see which plot is from your data. If that plot blends in with the rest, then it is plausible that your data came from a normal distribution. To be more precise, we should not reject the null hypothesis that our data is normal. (Technically, this isn’t the same thing as asserting that our data is normal, but in practice, we typically do proceed as if the data were normal.)
Of course, in this case our data was generated by rnorm, just like the rest of the lineup, so we should certainly expect it to blend in.
Now, let’s see an example of data that is not generated from a normal distribution.
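As a stand-in for that example, here is a sketch that places a right-skewed sample in a lineup of normal samples. The choice of an exponential distribution (rexp), the sample size, and the 3-by-3 layout are my assumptions for illustration.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

n <- 50
skewed_data <- rexp(n)   # assumed example of non-normal (right-skewed) data

size <- 9
answer <- sample(size, 1)   # panel that will hold the real data

old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in seq_len(size)) {
  panel <- if (i == answer) {
    skewed_data
  } else {
    # Null sample: normal with the same mean and sd as the data.
    rnorm(n, mean(skewed_data), sd(skewed_data))
  }
  qqnorm(panel, main = paste("Plot", i), xlab = "", ylab = "")
}
par(old_par)
```

With data this skewed, the real panel will often show a visibly curved pattern, though with a small sample it may not be obvious in every run. As before, pick the odd one out before checking answer.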
Pattern or Noise?
I recently had an exchange with a friend who is doing data analysis for a business. Her supervisor told her that a certain price in their industry has a 7 to 12 year period. She was tasked with fitting a model to this cyclic data, and she asked me for a little guidance.
Before proposing a model or estimating parameters, I wanted to make sure that there really is a cyclic pattern to the data. Prices tend to fluctuate in a manner similar to geometric Brownian motion (GBM). I used GBM (with a little Gaussian noise added) as a “patternless” null control against which to compare my friend’s data.
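A sketch of this kind of lineup follows. My friend’s data is not reproduced here, so prices is a hypothetical placeholder that is itself simulated. The helper name simulate_gbm and every parameter value (drift, volatility, noise level, series length) are assumptions chosen only for illustration. The paths use the standard discretization in which log increments are normal with mean (mu - sigma^2 / 2) * dt and standard deviation sigma * sqrt(dt).

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Simulate one geometric Brownian motion path with additive Gaussian noise.
# All parameter defaults are illustrative assumptions.
simulate_gbm <- function(n_steps, s0 = 100, mu = 0.02, sigma = 0.15,
                         dt = 1, noise_sd = 1) {
  log_increments <- rnorm(n_steps, (mu - sigma^2 / 2) * dt, sigma * sqrt(dt))
  path <- s0 * exp(cumsum(log_increments))
  path + rnorm(n_steps, 0, noise_sd)
}

n_years <- 40
# Hypothetical placeholder: in practice this would be the observed price series.
prices <- simulate_gbm(n_years)

size <- 9
answer <- sample(size, 1)   # panel that will hold the real series

old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in seq_len(size)) {
  series <- if (i == answer) prices else simulate_gbm(n_years, s0 = prices[1])
  plot(series, type = "l", main = paste("Plot", i), xlab = "", ylab = "")
}
par(old_par)
```

In a real analysis you would probably want to estimate the drift and volatility from the log returns of the observed series, so that the null paths are comparable to the data in everything except the suspected cycle.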
None of these figures stand out to me as much more periodic-looking than the rest. Perhaps there are other theoretical reasons or other evidence that this industry should be periodic, but this particular dataset doesn’t convince me of it.
Final Thoughts
Here’s a problem you might run into when trying to use this technique. What if you’ve already looked at your data before creating the lineup? Well, then you will probably recognize it in the lineup, which will bias your judgment. I suggest showing the lineup to a peer who has not already seen the data and asking for his or her impressions.
It is very easy to see patterns in noise. This null lineup trick is a nice informal way to help you decide whether your “pattern” is real.