As my thesis partner and I gathered up the evidence we had collected, it began to dawn on us — as well as on our thesis advisers — that we didn’t have enough for ordinary, “normal” statistics. Our chief adviser, an Air Force colonel, and his captain assistant were on the faculty at the Air Force Institute of Technology (AFIT), where my partner and I were both seeking a master’s degree in logistics management.
We had traveled to the Warner Robins Air Logistics Center in Georgia to talk with a group of supply-chain managers and to administer a survey. We were trying to find out if they adapted their behavior based on what the Air Force expected of them. Our problem, we later came to understand, was a paucity of data. Not a problem, said our advisers. We could instead use non-parametric statistics; we just had to take care in how we framed our conclusions and to state clearly our level of confidence in the results.
Shopping for Stats
In the end, I think our thesis held up pretty well. Most of the conclusions we reached rang true and matched both common sense and the emerging consensus in logistics management based on Goldratt’s Theory of Constraints. But the work we did to prove our claims mathematically, with page after page of computer output, sometimes felt like voodoo. To be sure, we were careful not to put too much faith in the numbers, not to “put too much weight on the saw,” but in some ways it seemed as though we were shopping for equations that proved our point.
I bring up this story from the previous century only to let you know that I am in no way a mathematician or a statistician. However, I still use statistics in my work. Oddly enough, when I left AFIT I simultaneously left the military (because of the “draw-down” of the early ’90s) and never worked in the logistics field again. I spent the next 24 years working in information technology. Still, my statistical background from AFIT has come in handy in things like data correlation, troubleshooting, reporting, data mining, etc.
We spent little, if any, time at AFIT learning about Bayes’ Theorem (BT). Looking back, I think we might have done better in our thesis by chucking our esoteric non-parametric voodoo and replacing it with Bayesian statistics. I first encountered BT around the turn of the century, when I was spending a great deal of time both managing a mail server and maintaining an email interface program written in the most hideous dialect of C the world has ever produced.
As you probably know, one of the first email spam filters that actually worked relied on Bayesian inference. The biggest problem at the time was finding a balance between letting the bad stuff in and keeping the good stuff out. A false positive can cause almost as much damage as an unblocked chunk of malware. “Didn’t you get my email?” asks the customer who took his business elsewhere.
Back then I didn’t give a damn about the philosophical difference between the frequentist statistics they taught us at AFIT and this up-and-coming rival. Nor did I care about their little feud. All that mattered at the time was that it worked. We still had false positives, but they were fewer. And it could learn, which was pretty impressive at the time. (We used SpamAssassin, by the way.)
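To make the idea concrete, here is a minimal sketch of Bayes’ Theorem applied to a single word in an email. It is an illustration only, not how SpamAssassin or any other real filter is implemented; the function name and every probability below are made-up assumptions.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' Theorem.

    All three inputs are hypothetical estimates that a filter would
    normally learn from mail already sorted into spam and non-spam ("ham").
    """
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence


# Assumed numbers: the word shows up in 40% of spam, 1% of ham,
# and half of all incoming mail is spam.
p = posterior_spam(p_word_given_spam=0.40, p_word_given_ham=0.01, p_spam=0.50)
print(f"P(spam | word) = {p:.3f}")
```

The “learning” part amounts to re-estimating those input probabilities as more mail gets labeled, and the false-positive balance comes down to how high the posterior has to be before a message is blocked.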
The feud didn’t make a lot of sense to me at the time. It seemed more like a turf war than anything else. I mean, aren’t we all trying to discover the same thing? Isn’t probability the same, no matter which tools you use?
Well, actually no. It isn’t. The world is the same. The data are the same. But the two camps’ perspectives and their outputs are fundamentally different.
What do we mean by “probability”?
I’m a visual sort of person, so I often need to see something explained graphically before I get past that phase where I’m only repeating what I’ve memorized and reach that point where I can feel it in my gut. For example, in the video below, Jake VanderPlas, a scientist (astronomer) at the University of Washington, imagines the following conversation.
Statistician: “95% of such confidence intervals in repeated experiments will contain the true value.”
Scientist: “So there’s a 95% chance that the value is in this interval?”
Statistician: “No: you see, parameters by definition can’t vary, so referring to chance in this context is meaningless. The 95% refers to the interval itself.”
Scientist: “Oh, so there’s a 95% chance that the value is in this interval?”
And so they go round and round. You may recall writing papers in which you had to be extremely precise about saying what your stats actually proved. But other than professional statisticians, who really has a clear understanding of “confidence intervals in repeated experiments”? The reason the scientists (and the rest of us) keep getting it wrong is that the Bayesian description of probability is more natural.
The big takeaway is this: Frequentist results are not the same as Bayesian results, and if you don’t understand that, you may fundamentally misinterpret what one or the other is saying. Frequentism is not wrong; it’s just measuring a different thing. In fact, the two approaches have different definitions of probability.
As VanderPlas explains:
You’ve got to remember this: in general, when someone gives you a frequentist confidence interval, it’s not 95% likely to contain the true value. This is a Bayesian interpretation of a frequentist construct, and it happens all the time. You have to be really, really careful about this.
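The statistician’s claim about “repeated experiments” can be checked with a small simulation. The sketch below is not from VanderPlas’s talk; it assumes normally distributed measurements with a known standard deviation, and the true mean, sample size, and number of experiments are arbitrary choices.

```python
import math
import random


def coverage(true_mean=10.0, sigma=2.0, n=25, experiments=2000, seed=0):
    """Fraction of 95% confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    # Half-width of a 95% z-interval when sigma is known.
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(experiments):
        # One "experiment": draw n measurements and build one interval.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = sum(sample) / n
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / experiments


rate = coverage()
print(f"{rate:.1%} of the intervals contained the true mean")
```

The true mean never moves. What varies from one run to the next is the interval, and about 95% of those intervals land on top of the fixed value. The 95% describes the procedure, not any single interval it produces.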
So now we understand that the Bayesian and the frequentist are talking about two different things. But for me it really hits home when I see it explained graphically.
The Bayesian imagines one credible region, with many possible parameter values that fall either inside or outside it. The frequentist sees one parameter, a sort of Platonic ideal, with multiple confidence intervals that may or may not overlap that parameter. I invite you to watch and enjoy this fun video, in which I think VanderPlas absolutely nails it.
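For contrast, here is a rough sketch of the Bayesian side of that picture: the data are held fixed and the parameter is what varies. It assumes a coin-flip experiment with a flat prior and approximates the posterior on a grid; the function name and the 7-heads-in-10-flips data are hypothetical.

```python
def credible_interval(heads, flips, mass=0.95, grid_size=10001):
    """Central credible interval for a coin's probability of heads.

    Assumes a flat prior, so the posterior is proportional to the
    binomial likelihood evaluated at each candidate value.
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalised posterior weight for every candidate parameter value.
    weights = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
    total = sum(weights)
    tail = (1 - mass) / 2
    cumulative = 0.0
    lower = upper = None
    for t, w in zip(thetas, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = t
        if cumulative >= 1 - tail:
            upper = t
            break
    return lower, upper


low, high = credible_interval(heads=7, flips=10)
print(f"95% credible interval: {low:.2f} to {high:.2f}")
```

Here it is fair to say that, given these data and this prior, there is a 95% chance the parameter lies in the interval, which is the statement the scientist in the dialogue keeps trying to make.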