Hung Jury: The Verdict On Uncertainty
This invited paper (which I forgot to post!) appeared in the festschrift for Hung T. Nguyen, previously at New Mexico State University and now at Chiang Mai University, to celebrate his entry into the World Hall of Fame of Statistics. Download an official copy. I leave out all the references below. Yesterday I accidentally scheduled two posts at once, and this was one of them. Apologies for the duplicate mailing.
A Bang Up Time
Seven months before Lee Harvey Oswald became famous for his encounter with President Kennedy, it is claimed he popped off his Mannlicher-Carcano rifle at the head of one Major General Edwin Walker, at Walker's residence. Oswald's political ties might have been the motive. According to Smithsonian Magazine, "Walker was a stark anti-communist voice and an increasingly strident critic of the Kennedy's, whose strong political stances had him pushed out of the army in 1961."
This incident was cited in an early work of Hung's, with Irwin Goodman, Uncertainty Models for Knowledge-Based Systems: A Unified Approach to the Measurement of Uncertainty. The example given in this book is just as relevant today as it was then to the understanding of uncertainty.
In deciding the culpability of Oswald in the assassination attempt upon General Walker, an expert ballistics analysis group indicated "could have come, and even perhaps a little stronger, to say that it probably came from this ... [gun]", while the FBI investigating team, as a matter of policy, avoiding the category of "probable" identification, refused to come to a conclusion [256]. Other corroborative evidence included a written note, also requiring an expert verification of authenticity, and verbal testimony of witnesses. Based upon this combination of evidence, the Warren Commission concluded that the suspect was guilty.
To conclude a suspect's guilt is to make a decision. Decisions are based on probabilities. And probabilities are calculated with respect to the evidence, and only the evidence, deemed probative to the decision. Picking which evidence is considered probative is itself often a matter of a decision, one perhaps external to the situation, as when a judge in a trial deems certain evidence admissible or inadmissible.
It should be clear that each of these steps is logically independent of the others, even if there are practical overlaps, as with a judge ruling a piece of relevant evidence inadmissible because of a technicality. It is also obvious that the standard classical methods used to form probabilities and make decisions are inadequate to this sequence. Yet these kinds of situations and decisions are extremely common and form the bulk of reasoning people use to go about their daily business. Everything from deciding whether to invest---in anything, from a stock to a new umbrella---to making inferences about people's behavior based on common interactions, to guessing which team will win, to jurors deciding questions of guilt.
For instance, there is no way to shoehorn p-values, the classical way of simultaneously forming a probability and making a one-size-fits-all decision, into "economic" decisions of the kind found in assessing guilt or innocence. P-values first assess the probability of an event not of interest, then conflate that probability with an event which is of interest, then make a decision designed to fit all situations, regardless of the consequences. This will be made clearer in the examples below. It does not make sense to use p-values when other measures designed to do the exact job asked of them are available and superior in every way. Hung has been one of the major forces pushing p-values into the failed bin of history.
P-values rely on standard frequentist probability. It's becoming more obvious that ordinary probability in frequentist theory is inadequate for many, or even most, real-life decisions, especially economic decisions based on the outmoded idea of "rational actors". In order to use frequentist theory, an "event" has to be embedded, or embeddable, in a unique infinite sequence. Probability in frequentist theory is defined as the limit of relative frequencies in infinite sequences. No infinite sequences, no probability. In what sequence do we embed the General Walker shooting to form a probability of the proposition or event "Oswald took the shot"? All men who took shots at generals? All white men? All communist men? All men who took shots at officers of any rank? All men who took shots at other men of any kind? At women too? All those who used rifles and not guns? Bows and arrows, too? Only in America? Any country? Only at night? Only in Spring? Since a certain date?
To make frequentist probability work an infinite sequence is required; a merely long one won't do. In some physical cases, it might make sense to speak of "very long" sequences, but for many events important to people, it does not. Unique or finite events are ruled out by fiat in frequentist theory. And even when events are tacitly embedded in sequences, where little thought is given to the precise character of that sequence, frequentist probability can fail. The well-known example of context effects produced by question order in surveys shows commutativity failing: the answers to two questions asked in one order do not match the answers to the same questions asked in the other.
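To see concretely what the frequentist definition demands, here is a minimal sketch, my own illustration and not anything from Hung's work: probability as the limiting relative frequency of an event in an ever-longer run of repeatable trials. The 0.5 chance and the trial counts are assumptions made purely for the demonstration.

```python
# Minimal sketch of the frequentist picture: probability as the limiting
# relative frequency in an ever-longer sequence of repeatable trials.
# The 0.5 chance and the trial counts are illustrative assumptions only.
import random

random.seed(1)

def relative_frequency(n_trials, p=0.5):
    """Relative frequency of 'success' in n_trials independent trials."""
    successes = sum(random.random() < p for _ in range(n_trials))
    return successes / n_trials

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: relative frequency = {relative_frequency(n):.4f}")

# The relative frequency settles toward 0.5 only as n grows without bound.
# For a unique event---"Oswald took the shot at General Walker"---there is
# no repeatable trial to count, hence no sequence in which to take a limit.
```

The point is not the simulation itself but that a unique event offers nothing to simulate.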
Hung has been at the forefront of quantum probability as a replacement for ordinary frequentist probability, especially when applied to human events such as economic actions. This isn't the place to review quantum probability, but I do hope to show through two small examples the inadequacy of classical probability to certain human events. And no event is more human than a trial by jury. Forming probabilities of guilt or innocence in individual trials, and then making decisions whether to judge guilt or innocence, are acts entirely unfit to analysis by ordinary statistical methods. Especially in the face of constantly shifting evidence, unquantifiable complexities, and the ambiguity of language, where jury members hold "fuzzy" notions of terms, another area in which Hung has made fundamental contributions.
New Evidence
Consider an example, similar to the Oswald scenario, provided by Larry Laudan, a philosopher who writes on jury trials. He investigates the traditional Western instruction to jurors that they must start with a belief in the defendant's innocence, what this means for probability, and why ordinary probability is not up to the task of modeling these situations.
Judging a man guilty or innocent, or at least not guilty, is a decision, an act. It is not probability. Like all decisions it uses probability. The probability formed depends on the evidence assumed or believed by each juror, first individually and finally corporately. Probability is the deduction, not always quantified, of the proposition of interest from the set of assumed evidence. In this case the proposition is "He's guilty."
When jurors are empaneled they enter with minds full of chaos. Some might have already formed high probabilities of guilt of the defendant ("Just look at him!"); some will have formed low ("I like his eyes"). All will have different assumed background evidence, much of it loose and unformed. But it is still evidence probative to the question of Guilt. Yet most, we imagine, will accept the proposition given by a judge that "There's more evidence about guilt that you have not yet heard." Adding that to what's in the jurors' minds, perhaps after subtracting some wayward or irrelevant beliefs based on the judge's other orders ("You are to ignore the venue"), some jurors might form a low initial probability of Guilt.
Now no juror at this point is ever asked to turn his probability into the decision Guilty or Not Guilty. Each could, though. Some do. Many jurors, and many citizens reading of trials in the news, do just that. There is nothing magical that turns the evidence at the final official decision into the "real probability". Decisions could be and are made at any time. It is only that the law states only one decision counts, the one directed by the judge at the trial's end.
What's going on in a juror's mind (I speak from experience) is nearly constantly shifting. One moment a juror believes or accepts this set of evidence, the next moment maybe something entirely different. Jurors are almost always ready to judge based on the probability they've formed at any instant. "He was near the school? He's Guilty!" Hidden is the step which moves from probability to decision; but it's still there, and must be. Then they hear some new evidence and they alter the probability and the decision to "Not Guilty." The judge may tell jurors to ignore a piece of evidence, and maybe jurors can or maybe they can't. (Hence the frequent "tricks" used by attorneys to plant evidence in jurors' minds ruled inadmissible.) Some jurors see a certain mannerism in the defendant, or even the defendant's lawyer, and interpret it in a certain way; some don't see it at all. And so on.
At trial's end, jurors retire to their room with what they started with: minds full of augmented chaos---a directed chaos now. The direction is honed by the discussion jurors have with each other. They will try to agree on two things: a set of evidence, and the non-quantified probability of "Guilty" deduced from it. This won't be precisely identical for each juror, because the set of evidence considered can never be precisely identical, but the agreed-to evidence will be shared, and the probability is calculated with respect to that, even if individual jurors differ from the corporate assessment. After the probability is formed, then comes the decision based on the probability. Decisions are above probability. They account for thinking about being right and wrong, and what consequences flow from that. Each juror might come to a high probability of Guilty, but they might decide Not Guilty because they think the law is stupid, or "too harsh", or in other ways deplorable. The opposite may also happen.
That's the scheme. This still doesn't account for the judge's initial directive of "presuming innocence". Jurors hear "You must presume the defendant innocent." That can be taken as a judgement, i.e. a decision of innocence, or a command to clear the mind of evidence probative to the question of guilt. Or both. If it's a decision, it is nothing but a formality. Jurors don't get a vote at the beginning of a trial anyway, so hearing they would have to vote Not Guilty at the commencement of the trial, were they allowed to vote, isn't much beyond legal theater. If it is a decision (by the judge), then conditional on that decision, every juror would and must also judge the probability of Guilt to be 0. Therefore, the judge's command is properly taken as a guide for jurors to ignore all non-official evidence.
Again, if it's a command by the judge to clear the mind, or a command to at least implant the evidence "I don't know all the evidence, but I know more is on its way", then to the extent each juror obeys this command, it is treated as a piece of evidence, and therefore forms part of each juror's total evidence, which itself implies a (non-quantified) probability for each juror.
This means the command is not a "Bayesian prior" per se. A "prior" is a probability, and probability is the deduction from a set of evidence. That the judge's command is used in forming a probability (of course very informally) does make it prior evidence, though: prior to the trial itself. Thus priors, which will certainly be formed in the mind of each juror, are formed with the set of evidence still allowed by the judge, or by evidence jurors find pleasing.
Probabilities are eventually changed, or "updated". But this does not necessarily mean in a Bayesian sense. Bayes is not necessary; Bayes's theorem, that is. The theorem is only a helpful way to chop evidence into computable bits. What's always wanted in any and all situations is the probability represented by this schematic equation:
Pr(Y | All probative evidence),
where Y represents the proposition of interest; here Y = "Guilty". All Bayes does is help to partition the "All probative evidence" into smaller chunks so that numerical estimates can be reached. Numerical probabilities won't be had in jury trials, however. And certainly almost no juror will know how to use a complicated formula to form probabilities. Quantum probability, for instance, might be used by researchers after the fact, in modeling juror behavior, but what's going on inside the minds of jurors is anything but math.
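As a rough illustration of that chopping---and only an illustration; the numbers below are invented and no juror computes anything like this---here is a sketch in which the total evidence is fed to Bayes's theorem one chunk at a time, while the target remains Pr(Y | all probative evidence):

```python
# A sketch of the "chopping" Bayes's theorem performs: the total evidence
# E1 & E2 & E3 is handled one chunk at a time, but the target is always
# Pr(Y | all probative evidence). All numbers are invented placeholders.
def update(prior, likelihood_if_y, likelihood_if_not_y):
    """One Bayes step: Pr(Y | new chunk & evidence so far)."""
    numerator = likelihood_if_y * prior
    return numerator / (numerator + likelihood_if_not_y * (1 - prior))

pr_y = 0.5  # schematic starting point given background evidence only
chunks = [
    ("ballistics report", 0.8, 0.3),
    ("handwritten note", 0.7, 0.4),
    ("witness testimony", 0.6, 0.5),
]
for name, like_y, like_not_y in chunks:
    pr_y = update(pr_y, like_y, like_not_y)
    print(f"after {name}: Pr(Y | evidence so far) = {pr_y:.3f}")
```

However the evidence is chopped, the result is the same probability conditional on the whole of it; the partition is bookkeeping, not metaphysics.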
The reader can well imagine what would happen if the criminal justice system adopted a set value, such as 0.95, above which Guilt must be decided. Some judges, understanding the dire consequences which could result from this hyper-numeracy, have banned the use of formal mathematical probability arguments, such as Bayes's theorem.
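To make plain why a single cutoff cannot serve every decision, here is a hedged sketch, with loss numbers invented purely for illustration, in which the decision follows from comparing expected losses; the implied threshold then moves with the consequences rather than sitting fixed at 0.95:

```python
# Sketch of why one fixed cutoff like 0.95 cannot serve every decision:
# the threshold implied by an expected-loss comparison depends on the
# consequences. Loss values below are arbitrary placeholders.
def decide(pr_guilty, loss_false_conviction, loss_false_acquittal):
    """Convict only if the expected loss of convicting is the smaller."""
    expected_loss_convict = (1 - pr_guilty) * loss_false_conviction
    expected_loss_acquit = pr_guilty * loss_false_acquittal
    return "Guilty" if expected_loss_convict < expected_loss_acquit else "Not Guilty"

# The break-even probability is loss_fc / (loss_fc + loss_fa).
print(decide(0.90, loss_false_conviction=10, loss_false_acquittal=1))   # threshold ~0.909 -> Not Guilty
print(decide(0.95, loss_false_conviction=10, loss_false_acquittal=1))   # threshold ~0.909 -> Guilty
print(decide(0.95, loss_false_conviction=100, loss_false_acquittal=1))  # threshold ~0.990 -> Not Guilty
```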
Laudan says the judge's initial command is "an instruction about [the jurors'] probative attitudes". I agree with that, in the sense just stated. But Laudan amplifies: "asking a juror to begin a trial believing that defendant did not commit a crime requires a doxastic act that is probably outside the jurors' control. It would involve asking jurors to strongly believe an empirical assertion for which they have no evidence whatsoever."
That jurors have "no evidence whatsoever" is false, and not even close to true. For instance, I, like many jurors, walked into my last trial with the thought, "The guy probably did it because he was arrested and is on trial." That is positive evidence for Guilty. I had lots of other thought-evidence, as did each other juror. Surely some jurors came in thinking Not Guilty for any number of other reasons, which is to say other evidence. The name of the crime itself, taken in its local context, is always taken as evidence by jurors. Each juror could commit, as I said, his "doxastic act" (his decision, which is not his probability) at any time. Only his decision doesn't count until the end.
Laudan further says:
asking jurors to believe that defendant did not commit the crime seems a rather strange and gratuitous request to make since at no point in the trial will jurors be asked to make a judgment whether defendant is materially innocent. The key decision they must make at the end of the trial does not require a determination of factual innocence. On the contrary, jurors must make a probative judgment: has it been proved beyond a reasonable doubt that defendant committed the crime? If they believe that the proof standard has been satisfied, they issue a verdict of guilty. If not, they acquit him. It is crucial to grasp that an acquittal entails nothing about whether defendant committed the crime, [sic]
We have already seen how each juror forms his probability and then decision based on the evidence; that's Laudan's "probative judgement". That evidence could very well start with the evidence provided by the judge's command; or, rather, the evidence left in each juror's mind after clearing away the debris as ordered by the judge. Thus Laudan's "at no point" also fails. Many jurors, through the fuzziness of language, take the vote of Not Guilty to mean exactly "He didn't do it!"---by which they mean they believe the defendant is innocent. Anybody who has served on a jury can verify this. Some jurors might say, of course, they're not sure, not convinced of the defendant's innocence, even though they vote that way. To insist that "an acquittal entails nothing about whether defendant committed the crime" is just false---except in a narrow, legal sense. It is a mistake to think every decision every person makes is based on extreme probabilities (i.e. 0 or 1).
Laudan says "Legal jurisprudence itself makes clear that the presumption of innocence must be glossed in probatory terms." That's true, and I agree the judge's statement is often taken as theater, part of the ritual of the trial. But it can, and in the manner I showed, be taken as evidence, too.
It seems Laudan is not a Bayesian (and neither am I):
Bayesians will of course be understandably appalled at the suggestion here that, as the jury comes to see and consider more and more evidence, they must continue assuming that defendant did not commit the crime until they make a quantum leap and suddenly decide that his guilt has been proven to a very high standard. This instruction makes sense if and only if we suppose that the court is not referring to belief in the likelihood of material innocence (which will presumably gradually decline with the accumulation of more and more inculpatory evidence) but rather to a belief that guilt has been proved.
As I see it, the presumption of innocence is nothing more than an instruction to jurors to avoid factoring into their calculations the fact that he is on trial because some people in the legal system believe him to be guilty. Such an instruction may be reasonable or not (after all, roughly 80% of those who go to trial are convicted and, given what we know about false conviction rates, that clearly means that the majority of defendants are guilty). But I'm quite prepared to have jurors urged to ignore what they know about conviction rates at trial and simply go into a trial acknowledging that, to date, they have seen no proof of defendant's culpability.
I can't say what Bayesians would be appalled by, though the ones I have known have strong stomachs. That Bayesians see an accumulation of evidence leading to a point seems to me to be exactly what Bayesians do think, though. How to think of the initial instruction (command), we have already seen.
I agree that the judge's command is used "to avoid factoring into their calculations the fact that he is on trial because some people in the legal system believe him to be guilty." That belief is evidence, though, which he just said jurors didn't have. Increasing the probability of Guilty because the defendant is on trial is what many jurors do. Even Laudan does that. That's why he quotes that "80%". The judge's command (sometimes) removes this evidence, sometimes not. In his favor, Laudan may be using evidence as synonymous with true statements of reality. I do not and instead call it the premises the jury believes true. After all, some lawyers and witnesses have been known to lie about evidence.
Laudan reasons in a frequentist fashion, but we have seen how that theory fails here. Jury trials are thus perfect at illuminating the weakness of frequentism as a theory or definition of the probability people actually use in real-life decisions. Again, in frequentist theory, probabilities are defined by the limiting relative frequency of positive (guilty) measurements embedded in infinite sequences of positive and negative (guilty and not guilty) measurements.
No real-life trial is part of an exact, unique, no-dispute, no-possibility-of-other infinite sequence, just as the Walker shooting was not. Something more complex is happening in the minds of jurors as they form probabilities than merely tallying whether this or that piece of evidence adds to the count in an infinite sequence.
Old Evidence
When jurors hear a piece of evidence, it is new evidence. However, they come stocked (in their minds) with what we can call old evidence. We have seen that mixing the two presents no difficulty. However, some say there is a definite problem in understanding old evidence and how it fits into probability, specifically probability when using Bayes's theorem. We shall see here that there is no problem, and that probability always works.
Howson and Urbach's is an influential book showing many of the errors of frequentism, though it introduced a few new ones due to its emphasis on subjectivity; i.e. the theory that probability is always subjective. If probability were subjective, then probability would depend on how many scoops of ice cream the statistician had before modeling. There is also, under the heading of subjectivity, the so-called problem of old evidence.
The so-called problem is this, quoting from Howson:
The 'old evidence problem' is reckoned to be a problem for Bayesian analyses of confirmation in which evidence E confirms hypothesis H just in case Pr(H|E) > Pr(H). It is reckoned to be a problem because in such classic examples as the rate of advance of Mercury's perihelion (M) supposedly confirming general relativity (GR), the evidence had been known before the theory was proposed; thus, before GR was developed Pr(M) was and remained equal to 1, and Bayes's Theorem tells us that therefore Pr(GR|M) = Pr(GR). The failure is all the more embarrassing since M was not used by Einstein in constructing his theory...
The biggest error, found everywhere in uses of classical probability, is to write down only part of the evidence one has for a proposition, and then to allow that information to "float", so that one falls prey to an equivocation fallacy. It is seen in this description of the so-called problem. How this happens will become clear below.
A step in classical hypothesis testing is to choose a statistic, here following Kadane, d(X), the distribution of which is known when a certain hypothesis H, which nobody believes is true, is true, i.e. when the "null" is true. The p-value is the probability of more extreme values of d(X) given this belief. The philosopher of statistics Deborah Mayo quotes Kadane as saying the probability statement Pr(d(X) >= 1.96) = .025 "is a statement about d(X) before it is observed. After it is observed, the event d(X) >= 1.96 either happened or did not happen and hence has probability either one or zero (2011, p. 439)."
Mayo, following Glymour, then argues that if
the probability of the data x is 1, then Pr(x|H) also is 1, but then Pr(H|x) = Pr(H)Pr(x|H)/Pr(x) = Pr(H), so there is no boost in probability for a hypothesis or model arrived at after x. So does that mean known data doesn't supply evidence for H? (Known data are sometimes said to violate temporal novelty: data are temporally novel only if the hypothesis or claim of interest came first.) If it's got probability 1, this seems to be blocked. That's the old evidence problem. Subjective Bayesianism is faced with the old evidence problem if known evidence has probability 1, or so the argument goes.
There are a number of difficulties with this reasoning. To write "Pr(d(X) >= 1.96)" is strictly to make a mistake. The proposition "d(X) >= 1.96" has no probability. Nothing has a probability. Just as all logical arguments require premises, so do all probabilities. They are missing here, and they are later supplied in different ways, which is when equivocation occurs and the "problem" enters.
In other words, we need a right-hand side. We might write
Pr(d(X) >= 1.96 | H) [1]
where H is some compound, complex proposition that supplies information about the observable d(X), and what the (here anyway) ad hoc probability model for d(X) is. If this model allows quantification, we can calculate a value for [1]. Unless that model insists "d(X) >= 1.96" is impossible or certain, the probability will be non-extreme (i.e. not 0 or 1).
Suppose we actually observe some d(X_o) (o-for-observed). We can calculate
Pr(d(X) >= d(X_o) | H) [2],
and unless d(X_o) is impossible or certain (given H), then again we'll calculate some non-extreme number. Equation [2] is almost identical to [1], but with a possibly different number than 1.96 for d(X_o). The following equation is not the same:
Pr( 1.96 >= 1.96 | H) [3],
which indeed has a probability of 1. Of course it does! "I observed what I observed" is a tautology where knowledge of H is irrelevant. The problem comes in where to put the actual observation, on the right or the left hand side.
Take the standard evidence of a coin flip, the proposition C = "Two-sided object which when flipped must show one of h or t"; then Pr(h | C) = 1/2, where h is the proposition "A head will be observed". One would not say, because one just observed a tail on an actual flip, that suddenly Pr(h | C) = 0. Pr(h | C) = 1/2 because that 1/2 is deduced from C about h.
However, and this is the key, Pr(I saw an h | I saw an h & C) = 1, and Pr(A new h | I saw an h & C) = 1/2. It is not different from 1/2 because C says nothing about how to add evidence of new flips. In other words, Pr(h | C) stays 1/2 forever, regardless of what data is seen. There is nothing about data among the conditions. The same is true for any proposition, such as knowing about the theory of general relativity above, or in mathematical theorems. It may be true that at some later date new evidence for some proposition is learned, but this in no way changes the probability of the proposition given the old evidence, and only the old evidence. The probability of the proposition can indeed change given the old plus the new evidence, but this probability is in no way the same as the probability of the proposition given only the old evidence. Thus the so-called problem of old evidence is only a problem because of sloppy or careless notation. Probability was never in any danger.
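A trivial sketch of the coin example, purely illustrative: the 1/2 is deduced from C alone, and since C contains no premise linking old flips to new ones, conditioning on an observed flip changes nothing.

```python
# The 1/2 is deduced from C alone: C supplies only the list of possibilities.
possibilities_given_C = ["h", "t"]

pr_h_given_C = possibilities_given_C.count("h") / len(possibilities_given_C)
print(pr_h_given_C)  # 0.5

# Adding the premise "I saw an h" does not change the set of possibilities
# for a new flip, because C says nothing about how old flips bear on new ones.
possibilities_given_saw_h_and_C = possibilities_given_C  # unchanged
pr_new_h = possibilities_given_saw_h_and_C.count("h") / len(possibilities_given_saw_h_and_C)
print(pr_new_h)  # still 0.5
```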
Suppose, for ease, d() is "multiply by 1" and H says X follows a standard normal. Then
Pr(X > 1.96 | H) = 0.025, [4]
If an X of (say) 0.37 is observed, then what does [4] equal? The same. But this is not [4]:
Pr(0.37 > 1.96 | H) = 0, [5]
but only because H includes, as it always does, tacit and implicit knowledge of math and grammar, on which "0.37 > 1.96" is plainly false.
Or we might try this:
Pr(X > 1.96 | I saw an old X = 0.37 & H) = 0.025, [6]
The answer is also the same, because H, like C, says nothing about how to take the old X and modify the model of X.
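A quick numerical check of [4] through [6], under the assumption that H says X is standard normal; scipy's normal survival function is used here only for convenience, and any normal tail routine would do:

```python
# Numerical check of [4]-[6], assuming H says X is standard normal.
from scipy.stats import norm

# [4]: Pr(X > 1.96 | H) -- a statement about X, deduced from H alone.
print(norm.sf(1.96))  # ~0.025

# [5] is a different proposition entirely: "0.37 > 1.96" is false by
# arithmetic, so its probability is 0 on H's tacit knowledge of math.
print(float(0.37 > 1.96))  # 0.0

# [6]: conditioning also on "I saw an old X = 0.37" changes nothing,
# because H says nothing about how an old observation modifies the model.
print(norm.sf(1.96))  # still ~0.025
```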
Now there are problems in this equation, too:
Pr(H|x) = Pr(H)Pr(x|H) / Pr(x) = Pr(H), [7]
There is no such thing as "Pr(x)", nor does "Pr(H)" exist, and we have already seen it is false that "Pr(x|H) = 1". This is because nothing has a probability. Probability does not exist. Probability, like logic, is a measure of a proposition of interest with respect to premises. If there are no premises, there is no logic and no probability. Thus we can never write, for any H, Pr(H). Better notation is:
Pr(H|xME) = Pr(x|HME)Pr(H|ME)/Pr(x|ME), [8]
where M is a proposition specifying information about the ad hoc parameterized probability model, H is usually a proposition saying something about one or more of the parameters of M, but it could also be a statement about the observable itself, and x is a proposition about some observable number. And E is a compound proposition that includes assumptions about all the obvious things.
There is no sense in which Pr(x|HME) or Pr(x|ME) equals 1 (unless we can deduce that via H or M), before or after any observation. To say so is to swap in an incorrect probability formulation, as in [5] above.
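Here is a toy, fully conditioned version of [8], with an invented two-hypothesis model M (a coin either fair or two-headed), H = "the coin is two-headed", x = "a head was observed", and E carrying the background assumptions. It is nobody's real model; it only shows that Pr(x|ME) is deduced from M and E by total probability and need not be 1 merely because x was observed.

```python
# Toy version of [8]: every probability carries its premises on the right.
pr_H_given_ME = 0.5            # Pr(H | M E): E says the two hypotheses start equal
pr_x_given_HME = 1.0           # Pr(x | H M E): a two-headed coin must show heads
pr_x_given_notH_ME = 0.5       # Pr(x | ~H M E): a fair coin shows heads half the time

# Total probability: Pr(x | M E) -- deduced from M and E, not equal to 1.
pr_x_given_ME = pr_x_given_HME * pr_H_given_ME + pr_x_given_notH_ME * (1 - pr_H_given_ME)

# Bayes, equation [8]: Pr(H | x M E)
pr_H_given_xME = pr_x_given_HME * pr_H_given_ME / pr_x_given_ME

print(pr_x_given_ME)    # 0.75
print(pr_H_given_xME)   # ~0.667
```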
There is therefore no old evidence problem. There are many self-created problems, though, due to incorrect bookkeeping and faulty notation, which lead to equivocation fallacies. This solution to the so-called old evidence problem is thus yet another argument against hypothesis testing.
What we always want is what we wanted above in the first equation; i.e. Pr(Y | All probative evidence), where Y is the relevant proposition of actual interest, such as "Guilty" or "Buy now" and so on and so forth.
The Future
It is a very interesting time in probability and statistics. We are at a point similar to the 1980s, when Bayesian statistics was being rediscovered, as it were. Yet we have roughly a century of methods developed for use in classical hypothesis testing. These methods are relied on by scientists, economists, governments, and regulatory agencies everywhere. They do not know of anything else. Hypothesis testing in particular is given far too much authority. The classical methods in use all contain fatal flaws, especially in the understanding of what hypothesis testing and probability are.
We therefore need a comprehensive new program to replace all these older, failing methods with new ones which respect the way people actually act and make decisions. Work being led by our celebrant will, it is hoped, change the entire practice in the field within the next decade.