
Bayesian Series: Modelling Uncertainty


The first framework I want to share to help us start thinking about data problems is the Bayesian framework. Suppose we have a dataset (D) which we know depends on a parameter (θ) in some known or unknown way. A Bayesian model provides a rigorous way of inferring this parameter. Instead of just giving a single estimate (like a "best guess"), it produces a full distribution that shows which parameter values are likely, and how uncertain those estimates are.

This entire framework is built on a theorem from the 18th century. An English Presbyterian minister (and statistician) named Thomas Bayes formulated a rule that describes how our knowledge of a parameter (θ) is updated when we are given new data (D):

P(θ | D) = P(D | θ) P(θ) / P(D)

This simple-looking equation is the engine of the entire Bayesian framework. Let me illustrate it with a real-world example. Imagine you take a medical test for a rare disease (it affects 1 in 10,000 people). The test is 99% accurate. You test positive. Do you have the disease? Your "best guess" (a single estimate) is "Yes" (the test is 99% accurate!). However, a Bayesian model combines this with your prior knowledge (the disease is very rare) and gives you a full distribution of possibilities, which shows the actual probability is only about 1%. Bayes' Rule teaches us how to combine these two pieces of information (the prior and the data) correctly. Now, let's look at the main components of that equation. In this series, P will represent a Probability Mass Function (for discrete variables) or a Probability Density Function (for continuous variables).
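Before breaking down those components, here is a minimal sketch of the 1% calculation in Python. It assumes that "99% accurate" means both a 99% true-positive rate and a 99% true-negative rate, which is a simplification of how real tests are specified.

```python
# Bayes' rule for the rare-disease example.
# Assumption: "99% accurate" means P(positive | disease) = 0.99
# and P(positive | no disease) = 0.01.

prior = 1 / 10_000            # P(disease): the disease is rare
sensitivity = 0.99            # P(positive | disease)
false_positive_rate = 0.01    # P(positive | no disease)

# Evidence: the total probability of testing positive, averaged over both hypotheses
evidence = sensitivity * prior + false_positive_rate * (1 - prior)

# Posterior: the probability of having the disease given a positive test
posterior = sensitivity * prior / evidence
print(f"P(disease | positive) = {posterior:.4f}")  # ~0.0098, i.e. about 1%
```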

P(θ) is the Prior Distribution, which encodes our a priori knowledge about the parameter before we see the data. In our example, the prior was our knowledge that the disease is very rare (P(disease) = 1/10,000). There is a skill to picking the right prior distribution for your problem, which I will explore in the next post.

P(D | θ) is the Likelihood, which is our "data-generating model". It defines the process by which our parameter generates the data D. In our example, this was the test's accuracy (P(positive | disease) = 0.99).

P(θ | D) is the Posterior Distribution, which is the "holy grail" we are after. It is our updated knowledge about the parameter after seeing the data, and it is the result of combining our prior with our likelihood. In our example, this was the true probability that you had the disease given your positive test (P(disease | positive)).

P(D) is the Evidence, sometimes referred to as the Marginal Likelihood. This term represents the total probability of observing the data D, averaged across all possible values of the parameter θ. It is calculated by an integral that averages the likelihood times the prior over all possible values of θ:

P(D) = ∫ P(D | θ) P(θ) dθ

This integral is the central computational challenge in Bayesian statistics. For almost any model of realistic complexity, it is a high-dimensional integral that is mathematically intractable to solve directly. When we do parameter inference, however, we usually set it aside, because P(D) evaluates to a single number (called the normalization constant) that doesn't depend on our parameter θ. More precisely, it doesn't change the shape of the posterior distribution; it only ensures that the posterior integrates to one (by scaling it). This is why you will often see the rule written in its more practical, proportional form:

P(θ | D) ∝ P(D | θ) P(θ)

Which simply says that our "holy grail", the posterior, is proportional to the product of the likelihood and the prior. The evidence has its place, for example when we want to compare two different models, but to set up a Bayesian model we only need to focus on the three quantities that are functions of θ.
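To see this proportionality at work, here is a minimal sketch in Python of a hypothetical coin-flip model (not from the disease example): we evaluate likelihood times prior on a grid of θ values and then rescale, which is exactly the job the evidence does.

```python
import numpy as np

# Hypothetical example: infer a coin's probability of heads, theta,
# after observing 7 heads in 10 flips.
heads, flips = 7, 10

# A grid of candidate parameter values and its spacing
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# Prior: uniform over (0, 1), i.e. every theta equally plausible a priori
prior = np.ones_like(theta)

# Likelihood: probability of the observed data for each theta (binomial, up to a constant)
likelihood = theta**heads * (1 - theta)**(flips - heads)

# Unnormalized posterior: likelihood times prior
unnormalized = likelihood * prior

# The evidence is just the number that rescales the posterior so it integrates to one
evidence = np.sum(unnormalized) * dtheta
posterior = unnormalized / evidence

print("Posterior mean:", np.sum(theta * posterior) * dtheta)  # ~0.67
```

On a one-dimensional grid like this, computing the evidence is trivial; the trouble described above starts when θ has many dimensions and the grid explodes in size.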

Putting It All Together

Under the Bayesian framework, the scientist first specifies the likelihood and the prior. In principle, either one can be specified before the other, but specifying the likelihood first is usually better, because the structure of the prior can then be chosen to suit that likelihood (this is an expert trick we'll cover later). While the likelihood and the prior can usually be written in a simple form, the posterior is often complex and cannot be written in closed form. Calculating the evidence term P(D) is mathematically hard, so we can't solve the equation directly. Because of this, we must use computational strategies to approximate the posterior distribution. In future posts, I will explore exactly how we do that.
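As a small preview of those strategies, here is a minimal sketch of one of the simplest: a random-walk Metropolis sampler. Notice that it only ever needs the unnormalized posterior (likelihood times prior), which is exactly why the proportional form above is so useful. The model and numbers are the same hypothetical coin-flip toy as before, just to keep the code runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 7 heads out of 10 coin flips (same toy model as above)
heads, flips = 7, 10

def log_unnormalized_posterior(theta):
    """log(likelihood * prior) for a coin-flip model with a uniform prior."""
    if theta <= 0 or theta >= 1:
        return -np.inf  # outside the prior's support
    return heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

samples = []
theta = 0.5  # starting point
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)  # propose a small random step
    # Accept the step with probability min(1, posterior ratio)
    if np.log(rng.random()) < log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta):
        theta = proposal
    samples.append(theta)

# Discard the first few thousand samples as burn-in, then summarize
print("Posterior mean estimate:", np.mean(samples[5_000:]))  # ~0.67
```

The collection of accepted samples approximates the posterior distribution itself, so any summary we want (means, intervals, probabilities) can be read off from it.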

See you in the next post!