Class 52: The Problem And Origin Of Parameters
Unobservable parameters abound in probability models. Why? Where do they come from? Are they needed? This is our first hardcore math lesson. You must read the written lesson today. WARNING for those reading the email version! The text below might appear to be gibberish. If so, it means the LaTeX did not render in the emails. I’m working on this. Meanwhile, please click on the headline and read the post on the site itself. Thank you.
Video
Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty
HOMEWORK: Given below; see end of lecture.
Lecture
This is an excerpt from Chapter 8 of Uncertainty.
There has been an inordinate and unfortunate fascination with the unobservable parameters found inside most probability models. Parameters relate the X to the Y, but are understood in an ad hoc fashion. Since models are often selected through custom or ignorance of alternatives (and recall we’re talking about actual and not ideal practice), the purposes of parameters are not well considered, to say the least. Most statistical practice, frequentist or Bayesian, revolves solely around parameters, which has led to the harmful misconception that the parameters are themselves the X, and that the X are causal. P-values, confidence intervals, posterior distributions, hypothesis tests, and other classic measures of model “fit” are abused with shocking frequency and with destructive force. Probability leakage is the least of these problems; mis-ascribed causality the worst. It’s time for it to stop. People want to know Pr(Y|X): tell them that, and not about some mysterious parameters.
Parameters arise from considering measurement. All measurement is finite and discrete, regardless of how the universe might happen to be (I use universe in the philosophical sense of all that exists). Measurement drives the X, which in turn are probative of Y. Parameters are not necessary when all is finite and discrete, but they may be used for mathematical convenience. Their origin, however, must first be understood. Parameters-as-approximations arise from taking a finite, discrete measurement process (which all measurements are) to the limit. The interpretation of parameters in this context then becomes natural. This area, as will soon be clear, is wide open for research. Below, I’ll show how parameters arise in a familiar set up; how they come about in others is mostly an open question.
Where do parameters come from? Here is one example, which originates with Laplace, and which necessitates some mathematics, which, I remind us, are not our main purpose here. The parallels to Solomonoff’s approach (cited in Chapter 5) will be obvious to those familiar with algorithmic information theory. Begin with the premise E that before us is an urn which contains N objects, objects which can take one of two states. From this language we infer N is finite, which is no restriction at all, because N can be very large indeed. Call the states “success” and “failure”, or “1” and “0”, if you like. From this we deduce there can be anywhere from 0 to N successes. Given these premises—and no others—or rather this evidence E, we deduce the probability that there are M = i successes, for i = 0, …, N, is 1/(N+1). No number of successes (or failures) is more likely than any other.
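Written out, that deduction is just

$$
\Pr(M = i \mid E) = \frac{1}{N+1}, \qquad i = 0, 1, \ldots, N;
$$

for instance, with N = 3 each of the four possible counts of successes (0, 1, 2, or 3) has probability 1/4.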
Now suppose we reach in and grab a sample of size n. In this sample there will be n1 successes and n0 failures, so that n1+n0=n. To say something about these observations, we want the probability of j successes in n draws, without replacement, where the urn holds some number i of successes in total. It will also be helpful to rewrite, or rather parameterize, this by considering Nθ, where θ=i/N, which is the fraction of successes. Note that θ is observable. The probability is (with the obvious restrictions on j):

$$
\Pr(j \mid n, N\theta, N, E) = \frac{\binom{N\theta}{j}\binom{N - N\theta}{n - j}}{\binom{N}{n}},
$$
which is a hypergeometric. We are still interested in the fraction θ (out of all N) of successes. Since we saw n1 successes so far, θ must be at least as large as n1/N, but it might be larger. We can use Bayes’s theorem to write (again, with the obvious restrictions on j)

$$
\Pr\left(\theta = \tfrac{i}{N} \,\middle|\, n_1, n_0, N, E\right) = \frac{\Pr(n_1 \mid \theta = i/N, n, N, E)\,\Pr(\theta = i/N \mid N, E)}{\sum_{k=0}^{N} \Pr(n_1 \mid \theta = k/N, n, N, E)\,\Pr(\theta = k/N \mid N, E)}.
$$
This is the posterior “parameter” distribution on θ, which turns out to be

$$
\Pr\left(\theta = \tfrac{i}{N} \,\middle|\, n_1, n_0, N, E\right) = \frac{\binom{N\theta}{n_1}\binom{N - N\theta}{n_0}}{(N+1)\binom{N}{n}\binom{n}{n_1}\,\beta(n_1+1, n_0+1)},
$$
where β() denotes the beta function.
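A quick numerical check of the finite result may help. The following sketch is not from Uncertainty; the numbers N = 20, n1 = 3, n0 = 2 are merely illustrative. It computes the discrete posterior directly from the uniform premise and the hypergeometric via Bayes’s theorem, and confirms it agrees with the closed form above.

```python
# Minimal sketch (assumed example numbers, not from the book): compute the discrete
# posterior Pr(theta = i/N | n1, n0, N, E) two ways and check they agree.
from math import comb, factorial

N, n1, n0 = 20, 3, 2
n = n1 + n0

def hypergeom_likelihood(i):
    # Pr(n1 successes in n draws without replacement | i successes in an urn of N).
    # math.comb returns 0 when the lower index exceeds the upper, which handles
    # the "obvious restrictions" automatically.
    return comb(i, n1) * comb(N - i, n0) / comb(N, n)

# (1) Bayes: uniform prior 1/(N+1) over i = 0, ..., N, then normalize.
prior = 1.0 / (N + 1)
unnormalized = [hypergeom_likelihood(i) * prior for i in range(N + 1)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# (2) Closed form quoted above; beta(n1+1, n0+1) = n1! n0! / (n+1)! for integers.
beta_fn = factorial(n1) * factorial(n0) / factorial(n + 1)
closed_form = [comb(i, n1) * comb(N - i, n0)
               / ((N + 1) * comb(N, n) * comb(n, n1) * beta_fn)
               for i in range(N + 1)]

assert abs(sum(posterior) - 1.0) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(posterior, closed_form))
print("posterior mass peaks at theta =", max(range(N + 1), key=lambda i: posterior[i]) / N)
```

With these illustrative numbers the posterior mass peaks at θ = 0.6, i.e. at the observed fraction n1/n, as one would hope.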
Here is where parameters arise. Following Ross (p. 180) in showing how the hypergeometric is related to the binomial for large samples, let N→∞ in the equation above. The result is

$$
\Pr(\theta \mid n_1, n_0, E) = \frac{\theta^{n_1}(1-\theta)^{n_0}}{\beta(n_1+1, n_0+1)},
$$
which is the standard beta distribution posterior on θ when the prior on θ is “flat”, i.e. equal to a beta distribution with parameters α=β=1.
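The approach to that limit can also be seen numerically. This second sketch (again with assumed illustrative numbers n1 = 3, n0 = 2) scales the discrete posterior mass at θ = i/N by N and compares it to the beta density; the largest gap shrinks as N grows.

```python
# Minimal sketch (assumed numbers, not from the book): as N grows, the discrete
# posterior mass at theta = i/N, scaled by N, approaches the beta density
# theta^n1 (1 - theta)^n0 / beta(n1+1, n0+1).
from math import comb, factorial

n1, n0 = 3, 2
n = n1 + n0
beta_fn = factorial(n1) * factorial(n0) / factorial(n + 1)  # beta(n1+1, n0+1)

def beta_density(t):
    return t**n1 * (1 - t)**n0 / beta_fn

for N in (20, 200, 2000, 20000):
    # Discrete posterior: C(i, n1) C(N-i, n0) / C(N+1, n+1); scale by N to compare.
    worst_gap = max(
        abs(N * comb(i, n1) * comb(N - i, n0) / comb(N + 1, n + 1) - beta_density(i / N))
        for i in range(N + 1)
    )
    print(N, worst_gap)  # the largest gap over all i shrinks as N increases
```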
We started with hard-and-fast observable propositions and a finite number of successes and failures, considered what happens when their number is expanded in a specific way towards infinity, and ended up with unobservable parameters. As Jack Aubrey would say, Ain’t you amazed? The key is that we don’t really need the infinite version of the model; the finite one worked just fine, albeit harder to calculate for large N. But then there are no arguments over where the prior for θ came from. It arises naturally. This small demonstration is like de Finetti’s representation theorem (see below), only it also gives the prior instead of saying only that one exists.
What does the parameter θ mean? With a finite N—which will always be true of all real-world situations—it was the total fraction of successes (given the premises). This is sensible and measurable, at least in theory. Whether anybody ever measures all N mentioned in the premises is another matter. θ is discrete: it can take only the values 0/N, 1/N, …, N/N, and no value in this set is impossible; at least, not on the evidence we have assumed. At the limit, θ is continuous and can take any value in the unit interval. Which is to say, it can take none of them, not empirically, because as Keynes said, in the long run we shall all be dead: infinity can never be reached. The parameter is no longer the fraction of successes, only something like it. But what? The mind should boggle at imagining the ratio of infinite successes in infinite chances; indeed, I cannot imagine it. I can only picture large finite approximations to it. This θ is not, as it is often called, “the probability of success.” We already deduced the probability of success given the premises. So what is it? An index with a complex definition involving limits, a definition so complex that its niceties are forgotten and people speak of it as if it were its finite cousin, that is, as if it were a probability.
Notice very carefully that the parameter-solution is an approximation. We don’t need it. Though calculating the discrete, finite probability may be cumbersome, we have the exact result. We don’t need to quarrel about the prior, impropriety, ignorance, non-informativity, or anything else, because everything has been deduced from the premises. This situation is also well behaved: approaching the limit (in a certain specified way) produced a familiar result, and the continuous-valued parameter ties nicely to the finite-sample result, keeping roughly the same meaning. I have no idea whether this will be true for all stock distributions in our cookbook, but we have great reason to doubt it. In his book, Jaynes (Chapter 15) shows how the so-called marginalization paradox disappears when one very carefully tracks how one heads off to infinity. Buffon’s needle paradox is another well known example where the path matters.