Have you ever wondered how scientists arrive at published conclusions about the causes of diseases or the efficacy of a vaccine? Given the current pandemic and ongoing global vaccination efforts, this question is increasingly being asked by ordinary people. This post is a short introduction to the science behind studies of diseases, particularly rare outbreaks like Covid-19. I will keep the technical aspects to a minimum and illustrate the key ideas with an example.
All empirical science, whether medicine or economics, relies on samples of a population. The population is usually too large to study in full, so scientists make do with samples, which are subsets of the population. Further, no study based on samples can truly be called scientific unless the sample data are analysed through the lens of rigorous statistical tools; otherwise, the conclusions are little more than guesswork.
Students of statistics are (or should be) familiar with the ubiquitous t-test, introduced in 1908 by the English statistician William Sealy Gosset, writing under the pen name “Student”, and later developed by the British statistician Ronald Fisher (1890-1962). The t-test is the best known of the so-called classical tests of statistical inference, the basis on which scientists judge the truth or falsity of a hypothesis. Importantly, t-tests and others like them rely on the notion that events are repeatable, so that with a sufficiently large sample, the data collected follow a nice bell-shaped curve, which enables concrete statements of probability to be made.
But, and this is the crucial thing, what if events are not repeatable, as with unique or “once-off” cases? Such events occur frequently in medicine, the current Covid-19 pandemic being a good example. Until early last year, we had never encountered anything like Covid-19, and that makes scientific studies of it problematic.
Enter the work of Thomas Bayes (1701-1761), an English clergyman, philosopher and statistician who, in the 1740s, developed the rule of statistical inference that is named after him. Whereas classical probability theory describes long sequences of repeatable events (and thus can say nothing about unique cases), Bayes’ rule offers a pragmatic alternative: take an educated guess, then update that guess in the light of new data. Overlooked by classical statisticians for much of the two and a half centuries since its publication, Bayes’ rule is now widely used to study unique cases in medicine such as the Middle East respiratory syndrome and now, Covid-19.
For the rest of this post, I will introduce this remarkable rule of inference, keeping the discussion as non-technical as possible (though it is clearly not possible to discuss statistics without some jargon).
Let’s start from the very beginning. The word ‘inference’ indicates that nothing is certain about the population being studied. The best you can do is to formulate a hunch about the population (we call this a hypothesis) and collect a reasonably large sample in the hope that it “mirrors”, in essential ways, the population characteristics (e.g., disease traits). Then you apply the machinery of statistical inference, with the associated mathematics, to judge whether your hypothesis is true or false.
If a hypothesis is not true, it must be false, which means there is not one, but two hypotheses in any inferential analysis: a base-case hypothesis known as the null hypothesis, and an alternative hypothesis, the opposite of the base case. I will use the notation H0 for the null hypothesis and H1 for the alternative hypothesis. The goal of any statistical inference, including Bayesian inference, is to infer with high confidence which of these two hypotheses is correct.
Now, let’s move on to Bayes’ rule. The key idea in Bayes’ rule is the notion of statistical odds: the ratio of the probability of one hypothesis to the probability of the other. The higher the odds of a hypothesis being true, the more confidence you have in that hypothesis. Words like “odds” and “confidence” pepper Bayesian statistics.
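Without further ado, here is Bayes’ equation for updating an initial guess in the light of new information, where D stands for the data:

$$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{P(H_1)}{P(H_0)} \times \frac{P(D \mid H_1)}{P(D \mid H_0)}$$

In words: posterior odds = prior odds × Bayes factor.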
The goal of Bayesian inference is to arrive at the posterior odds, that is, the odds of the alternative hypothesis relative to the null hypothesis once the data have been taken into account. The posterior odds is the first term of the above formula, to the left of the equals sign. To compute the posterior odds, you start with an educated guess of the odds of the two hypotheses. This is the first term on the right, after the equals sign. It is called the prior odds, or prior uncertainty about the two hypotheses. The higher the prior odds, the stronger your subjective belief that H1 is more probable than H0. After the initial guess comes the updating of the prior odds. This is captured by the last term, which is known as the Bayes Factor (BF). A BF > 1 implies that the posterior odds are greater than the prior odds and hence that the data provide evidence in support of H1. A BF < 1 implies the opposite, namely that the data count against H1. Lastly, if BF = 1, the data provide no evidence either way.
In short, the above equation says that the change from prior to posterior odds is brought about by the Bayes factor, a predictive updating factor, with the data playing the crucial role of providing the information for the update.
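For readers who like to see the rule as code, here is a minimal Python sketch of the update described above. The function names posterior_odds and describe_evidence are my own choices for illustration, and the numbers at the bottom are made up purely to show the mechanics.

```python
def posterior_odds(prior_odds: float, bayes_factor: float) -> float:
    """Posterior odds of H1 versus H0: prior odds times the Bayes factor."""
    if prior_odds < 0 or bayes_factor < 0:
        raise ValueError("odds and Bayes factors cannot be negative")
    return prior_odds * bayes_factor


def describe_evidence(bayes_factor: float) -> str:
    """Put the three cases BF > 1, BF < 1 and BF = 1 into words."""
    if bayes_factor > 1:
        return "the data support H1"
    if bayes_factor < 1:
        return "the data count against H1"
    return "the data provide no evidence either way"


# Made-up illustration: even prior odds (1 to 1), and data that are
# three times as likely under H1 as under H0.
print(posterior_odds(prior_odds=1.0, bayes_factor=3.0))
print(describe_evidence(3.0))
```

With even prior odds, the posterior odds simply equal the Bayes factor, which is why the Bayes factor is often read as the weight of the evidence on its own.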
Here is a medical example to illustrate how to calculate posterior odds.
Marfan syndrome is a genetic disease of connective tissue that occurs in 1 in every 15,000 people. The main ocular features of Marfan syndrome include bilateral ectopia lentis (lens dislocation), myopia and retinal detachment. About 70% of people with Marfan syndrome have at least one of these ocular features; only 7% of people without Marfan syndrome do. (We don’t guarantee the accuracy of these numbers, but they will work perfectly well for our example.) The question to be answered: If a person has at least one of these ocular features, what are the odds that they have Marfan syndrome?
Answer: Our hypotheses are: M (the person has Marfan syndrome) and Mc (the person does not have Marfan syndrome). The c here stands for the complement, or opposite, of M. The data is: F (the person has at least one ocular feature). The prior probabilities of M and Mc are 1/15,000 and 14,999/15,000 respectively. Hence the prior odds is (1/15,000)/(14,999/15,000) = 1/14,999 ≈ 0.000067. We also know the probability of F (the data) given M and given Mc, which we write as P(F|M) and P(F|Mc) respectively. The numbers are 0.7 and 0.07, and therefore the Bayes Factor is 0.7/0.07 = 10.
That is all we need to calculate the posterior odds!
The answer is 0.000067 × 10 = 0.00067, or odds of roughly 1 to 1,500.
This is a very small number, which is a good thing in this case. It is small even though the Bayes Factor is large, because the prior odds are so tiny. The low posterior odds must surely come as a relief to folks who report having lens dislocation. Hopefully, it is also a relief for people who suspect they have been infected with other rare diseases such as SARS and Covid-19.
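If you would like to check the Marfan arithmetic yourself, the following Python sketch reproduces it with exact fractions. The variable names are my own, and the last step, converting the posterior odds into a probability via odds / (1 + odds), is a standard identity that I have added for convenience.

```python
from fractions import Fraction

# Numbers taken from the Marfan example above.
p_m = Fraction(1, 15000)            # P(M): prevalence of Marfan syndrome
p_f_given_m = Fraction(7, 10)       # P(F|M): ocular feature given Marfan
p_f_given_mc = Fraction(7, 100)     # P(F|Mc): ocular feature given no Marfan

prior_odds = p_m / (1 - p_m)                  # exactly 1/14999
bayes_factor = p_f_given_m / p_f_given_mc     # exactly 10
post_odds = prior_odds * bayes_factor         # exactly 10/14999

# Added step: turn odds back into a probability.
post_prob = post_odds / (1 + post_odds)       # exactly 10/15009

print("prior odds:           ", prior_odds, "=", float(prior_odds))
print("Bayes factor:         ", bayes_factor)
print("posterior odds:       ", post_odds, "=", float(post_odds))
print("posterior probability:", post_prob, "=", float(post_prob))
```

Because the posterior odds are so small, the posterior probability is almost identical to them: about 10 in 15,009, still well under one tenth of one percent.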
For a fascinating and non-technical introduction to the story of Bayes’ rule, see Sharon Bertsch McGrayne, The Theory That Would Not Die, Yale University Press, 2011.
Read a review here: https://www.ams.org/notices/201205/rtx120500657p.pdf