“Hey, Briggs. I saw your take on the leaked New Zealand vaccine data. Interesting. But why didn’t you use [Insert My Favorite Statistics Model Here]?” [Blog, Substack mirror.]
I’ll tell you why not. Because models, in the way you’re thinking of them, aren’t necessary.
In fact, all of you should stop using so many models! And you certainly shouldn’t trust models produced by others. Reagan had it backwards. It is not “Trust but verify”. It is “Verify then trust.”
My dear friends, you and I have, over the course of many years, examined hundreds upon hundreds of models, all of them bad, produced by the biggest names in science and in the best institutions. Shouldn’t this catalog of horrors have imbued in you by now a reflexive distrust of models, as they are usually found?
So no formal models in the sense you are thinking, unless absolutely necessary. When is that? Let me illustrate with something everybody understands: sports.
We always begin with a question of interest. Like, “Who won the Lions-Chargers game?”
Now, I ask you, how would you go about answering that? With a model!, say scientists. And what is the first step in modeling? Right: gathering data.
I didn’t know the answer, so I searched the standard woke search engine and they gave me this: “Detroit 41, Chargers 38.”
This is our data. Or, rather, part of it. We also need tacit premises, like the rules of football, the dates in question, and things like this, premises researchers scarcely ever write formally into their models. Which means they usually forget these premises are there, and when they go to employ their models they commit all manner of offences against thought.
But let that pass. Suppose we have the correct premises related to the question. We have the data, which is part of any formal model. The next step is to make math of it.
How about a parameterized bi-variate Poisson? If you recall, a Poisson is a model that gives a probability to non-negative integers, zero included, just like football scores. A bi-variate version handles two such numbers at once, one for each of our teams.
We could do this model as frequentists, fitting the parameters, and then staring at them for insight. If we’re lucky, we’ll be able to flash our wee Ps at our audience, and they will be in awe. Or we could be Bayesians, but then we need to think about “priors” (more models) on the parameters. It can be done. Software makes it a breeze.
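For the curious, the arithmetic behind this Poisson business is no great mystery. Here is a toy sketch, with invented past scores, assuming (purely for illustration) independent Poissons for each team rather than the full bi-variate machinery; the frequentist maximum-likelihood "fit" of a Poisson mean is nothing but the sample average:

```python
import math

def poisson_pmf(k, lam):
    """Probability a Poisson with mean lam produces the integer k."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Hypothetical past scores for the two teams (made-up data, illustration only)
lions_scores = [41, 24, 31, 20]
chargers_scores = [38, 17, 28, 24]

# The MLE for a Poisson mean is just the sample average
lam_lions = sum(lions_scores) / len(lions_scores)
lam_chargers = sum(chargers_scores) / len(chargers_scores)

# Under the (illustrative) independence assumption, the joint probability
# of a 41-38 game is the product of the two marginal probabilities
p_game = poisson_pmf(41, lam_lions) * poisson_pmf(38, lam_chargers)
```

All of which is a lot of machinery to answer a question we already knew the answer to.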
You don’t like the Poisson? Then how about a time series cohort model? What we do is gather more data, on previous games and for other teams. Then we chart, for different team cohorts, the course of the season using an autoregressive integrated moving average model. ARIMA, as it’s called in the trade. Still have to make the frequentist-Bayesian choice. But whatever.
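The autoregressive idea at the heart of ARIMA is also simple enough to see in a few lines. A toy sketch, fitting only the AR(1) piece by ordinary least squares on made-up weekly scores (a real ARIMA adds differencing and moving-average terms, and a package would do all this for you):

```python
# Made-up weekly scores for one team (illustration only)
scores = [24, 27, 21, 31, 28, 35, 30, 38]

# Regress each score on the previous one: s_t ~ a + b * s_{t-1}
x = scores[:-1]
y = scores[1:]
n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Closed-form least-squares slope and intercept
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

# A one-step-ahead "guess" at next week's score
forecast = a + b * scores[-1]
```

Again: only worth the bother if you want to guess scores you have not yet seen.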
On the other hand, this is 2023! Why use these stuffy old parameterizations? We now have machine learning and artificial intelligence! The real stuff having been corrupted beyond measure.
How about a version of CART, then? We have scads of data on ticket prices, who bought them and so on, who sat where. We have tons more on the athletes’ statuses, their prior performance stats, and on and on, seemingly forever. But computers are big these days and handle all this with ease.
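CART itself is nothing but recursive splitting. A toy sketch of its atom, a single-split "decision stump" searched exhaustively over invented data (a real CART implementation recurses and uses impurity measures rather than raw error):

```python
# Invented data (illustration only): points scored, and whether the team won
prior_points = [14, 17, 20, 24, 28, 31, 35, 41]
won =          [0,  0,  0,  1,  0,  1,  1,  1]

def stump_error(threshold):
    """Misclassifications if we predict 'win' whenever points > threshold."""
    preds = [1 if p > threshold else 0 for p in prior_points]
    return sum(p != w for p, w in zip(preds, won))

# CART-style exhaustive search over candidate split points
best_split = min(prior_points, key=stump_error)
```

Feed it enough columns and the computer will happily split forever; whether any of it answers your question is another matter.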
“Uh, okay, but we know the score. We can just look.”
EXACTLY!
This “just looking” still relies on the veracity of the data source, and our senses, and all that, as all models do. That can never be escaped.
Doc invents a new pill to cure the screaming willies. You either have it or you don’t: there are no gradations. He gave one batch of people his new drug. Four out of five were cured. He gave another batch a placebo. Two out of four were cured.
Which group did better?
If you’ve had formal statistics training your first instinct will be to model this. To “discover” what happened. But we already know! We don’t need to discover.
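The "just looking" here is, in its entirety, two divisions and a comparison. A sketch, using the counts from the story:

```python
# The counts from the story: no model, just arithmetic
drug_cured, drug_total = 4, 5
placebo_cured, placebo_total = 2, 4

drug_rate = drug_cured / drug_total          # 4 out of 5
placebo_rate = placebo_cured / placebo_total # 2 out of 4

better_group = "drug" if drug_rate > placebo_rate else "placebo"
```

That is the whole analysis the question demands.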
“But Briggs, the new pill might not be better. Other things could have caused the difference.”
Indeed. But who was claiming cause? The question was which group did better, which team won. We are not saying why one did better. We can never learn cause from such paltry data as this. I went out of my way to insist I was not claiming cause in the NZ data.
All we had to do in the NZ data was look. Nothing more was needed. Unless we wanted to make a stab at quantifying the small departure from uniformity in people who got only 1 or 2 shots then died (review the triangle plots if you can’t recall).
In sports, the old scores will do, unless we wanted to guess future (or unknown) scores. Then we need a model. With the drug, looking was fine, unless we wanted to guess about future outcomes. Then we need a model.
If we want to know cause, of cures and non-cures, we have to do a monumental amount of work, investigating biochemical pathways, genetics, patient characteristics, and on and on. Stupid simple statistical models cannot provide this information, though many, alas, believe they can. Hence so many bad models.
In “the game”, we do not know which team is better based only on the score. The score may have been the result of a bad call by an official, say, or any number of other things. Scientists would say the result is “due to chance”, not understanding “chance” can’t cause anything. In any case, the score alone does not tell us why the score was what it was.
In the NZ data, all we had to do was count. Cause was out of the question. But cause can be had, and in the way I suggested: by having NZ release the individual health records of those whom we suspect, but cannot prove, were vaccine injured.
It’s a good bet NZ does not release all their data. And it’s an excellent bet—you can make money with this one—that people, especially Experts, will believe they know the right answer anyway.
Subscribe or donate to support this site and its wholly independent host using credit card click here. Or use the paid subscription at Substack. Cash App: $WilliamMBriggs. For Zelle, use my email: matt@wmbriggs.com, and please include yours so I know who to thank.
No I don't think you understand though. I have a preconceived outcome and you didn't find that, so I'd like you to keep looking until you do. Have you considered principal component analysis? Perhaps with some averaging of old data once you've used the most recent data to verify via relative error (because r^2 is shite) it can be used for extrapola-- oh sorry I appear to have entered the wrong room. Could you tell me where the tree ring proxy climate science room is please, I appear to be lost.
Fantastic post, William! Opting for simplicity over convoluted statistical models is the wise path. The shift towards "Verify then trust" resonates, steering away from the blind adherence of "Trust but verify." Your point about "chance can't cause anything" is a powerful reminder of the limitations in attributing causation to randomness.
Having abandoned models for the most part over the years, I've found that they often fall short in delivering accurate results, burdened by unmet assumptions. The straightforward approach of just looking at the data remains the most effective strategy. It's a bit disheartening that the scientific community remains beholden to the almighty p-values despite all their pitfalls. Here's to embracing simplicity and evidence over complex statistical entanglements!