
In the last post, we specified the full Bayesian model — the prior and the likelihood. We saw that when the prior is conjugate with the likelihood, the posterior falls into a known family and we can write it down directly. But we also noted that most real-world problems aren't this clean. The posterior is often a complex, high-dimensional object that we can't express in closed form.
So what do we do? We sample.
Why We Need Sampling
Recall from the first post that the main obstacle is the evidence term P(D) — that high-dimensional integral that averages the likelihood times the prior over all possible parameter values. For most models of realistic complexity, this integral is mathematically intractable. We can't solve it analytically, which means we can't write down the posterior directly.
But here's the key insight: we don't actually need the full closed-form posterior to do useful inference. What we need is a collection of representative samples from it. If we can generate enough samples that are distributed according to the posterior, we can use those samples to estimate anything we want — means, credible intervals, predictions, you name it. The samples become our representation of the posterior.
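To make this concrete: once we have samples, summarizing the posterior reduces to simple arithmetic on a list of numbers. Here's a minimal sketch using synthetic stand-in samples (since we haven't built a sampler yet) — in practice the list would come from one of the MCMC methods below.

```python
import random

random.seed(0)
# Stand-in for real posterior samples of a single parameter.
samples = [random.gauss(2.0, 0.5) for _ in range(10_000)]

# Posterior mean estimate: just the sample average.
post_mean = sum(samples) / len(samples)

# 95% credible interval estimate: the 2.5th and 97.5th sample percentiles.
s = sorted(samples)
lo, hi = s[int(0.025 * len(s))], s[int(0.975 * len(s))]
```

The same pattern extends to predictions: push each sample through the model and summarize the resulting outputs.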
The question, then, is: how do you draw samples from a distribution you can't write down? This is the problem that Markov Chain Monte Carlo (MCMC) methods solve.
The Posterior Landscape
Before we talk about how to sample, it helps to think about where we're sampling from. The posterior is a distribution over the parameter space. You can think of this space as a landscape: a surface with peaks where certain parameter values are highly probable, valleys where they are not, and potentially ridges, plateaus, and multiple modes.
Our goal is to traverse this landscape in a way that visits all the regions where the posterior has meaningful probability mass, and visits them in proportion to how probable they are. A region with a tall peak should receive more visits than a region with a low plateau.
This idea has a formal name: ergodicity. A sampler is ergodic if, given enough time, it will visit every region of the parameter space in proportion to its posterior probability. This is the theoretical guarantee that makes MCMC work. Without it, our samples would be a biased snapshot of the landscape rather than a faithful representation of the full posterior.
What a Good Sampler Should Do
So what separates a good sampler from a bad one? Two things, primarily.
The first is efficient exploration. The sampler should move through the posterior landscape without getting stuck. It shouldn't waste time wandering through low-probability regions, and it shouldn't get trapped on a single peak when there are others to visit. It needs to be bold enough to propose big moves, but smart enough not to jump blindly into improbable territory.
The second is low autocorrelation. Each sample in the chain depends on the previous one (that's the "Markov Chain" in MCMC). But ideally, we want our samples to be as close to independent as possible. If consecutive samples are highly correlated (because the sampler is taking tiny steps), you might have 10,000 samples that contain the information content of only a few hundred. Low autocorrelation means each sample is genuinely adding new information about the posterior.
These two objectives are deeply related: a sampler that explores efficiently will naturally produce less correlated samples. But in practice, there are real trade-offs between them, which is where the choice of sampler becomes important.
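Autocorrelation is easy to measure. Here's a sketch (function name is mine) comparing the lag-1 autocorrelation of a "sticky" chain, where each value is a small perturbation of the last, against independent draws:

```python
import random

def lag1_autocorr(chain):
    # Correlation between consecutive samples; near 0 is ideal.
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1)) / n
    return cov / var

rng = random.Random(0)

# A "sticky" chain: each value is a small step from the previous one.
sticky = [0.0]
for _ in range(5000):
    sticky.append(0.95 * sticky[-1] + rng.gauss(0.0, 1.0))

# Independent draws, for comparison.
independent = [rng.gauss(0.0, 1.0) for _ in range(5000)]
```

The sticky chain's autocorrelation sits near 0.95, while the independent draws sit near 0 — which is exactly the "10,000 samples worth only a few hundred" effect in miniature.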
Convergence — How Do We Know We've Explored Enough?
Every MCMC chain has to start somewhere — an initial guess for the parameter values. The samples generated early on will be influenced by that starting point, and we need the chain to "forget" where it started. This initial stretch of biased samples is called the burn-in period, and we typically discard it.
But how do we know the chain has actually converged, i.e. that it's faithfully representing the posterior rather than still wandering toward it?
The most common diagnostic is the R-hat statistic (sometimes called the Gelman-Rubin diagnostic). The idea is simple: run multiple chains from different starting points. If all the chains have converged to the same distribution, they should look statistically similar to each other. R-hat compares the variance within each chain to the variance between chains. If R-hat is close to 1 (typically below 1.01), that's evidence that the chains have converged. If it's substantially larger, the chains are still disagreeing, and you need more samples.
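The computation itself is short. Here's a sketch of the basic (non-split) R-hat for several chains of a single parameter; production tools like ArviZ use a more robust rank-normalized, split-chain version, so treat this as illustrative:

```python
import math
import random

def r_hat(chains):
    # Gelman-Rubin: compare between-chain and within-chain variance.
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * within + between / n
    return math.sqrt(var_hat / within)

rng = random.Random(0)
# Four chains that have all converged to the same distribution.
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains stuck in different regions of the space.
bad = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0, 1, 2, 3)]
```

For the `good` chains R-hat lands very close to 1; for the `bad` chains the between-chain variance dominates and R-hat is well above the 1.01 threshold.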
Trace plots are the visual companion to R-hat. A trace plot shows the sampled parameter values over time (one line per chain). In a well-converged run, the trace plot looks like "hairy caterpillars" — the chains overlap, mix freely, and oscillate around stable values. If you see chains that are drifting, stuck, or separated from each other, that's a red flag.
Sampler Trade-Offs
There is no universally "best" sampler. Each one makes different trade-offs, and the right choice depends on the structure of your posterior.
Speed vs. convergence quality. Some samplers are fast per iteration but require many iterations to converge. Others are slower per step but produce higher-quality samples that converge more quickly.
Generality vs. specialization. Some samplers work on virtually any posterior, but may be inefficient for specific shapes. Others are designed for particular posterior structures and perform brilliantly there, but fail elsewhere.
Simplicity vs. tuning. Some samplers are nearly plug-and-play. Others require careful tuning of hyperparameters (step sizes, proposal widths, etc.), and performance can be sensitive to these choices.
With that in mind, let's walk through the major families of MCMC samplers. (Note that I provide a demo of these samplers in this project.)
The Samplers
Metropolis-Hastings — The Foundation
Metropolis-Hastings (MH) is the grandfather of MCMC methods. The idea is elegant in its simplicity. From your current position in the parameter space, you propose a new position by drawing from a proposal distribution (often a simple distribution like a Gaussian centered at your current location). Then you decide whether to accept or reject that proposal.
The acceptance decision is based on a ratio: how probable is the proposed position relative to the current one, under the posterior? If the proposal is more probable, you always accept it. If it's less probable, you accept it with probability equal to that ratio. This means the sampler can occasionally move "downhill," which is essential for not getting permanently stuck on a single peak.
The beauty of this scheme is that you only ever need to evaluate the ratio of posterior values. The intractable evidence term P(D) cancels out (since it appears in both the numerator and the denominator). This is why the proportional form of Bayes' Rule — posterior ∝ likelihood × prior — is all we need.
The Metropolis algorithm is a special case of MH where the proposal distribution is symmetric (the probability of proposing a move from A to B is the same as from B to A). This simplifies the acceptance ratio and was historically the first version proposed. Metropolis-Hastings generalizes it to allow asymmetric proposals.
The main challenge with MH is tuning the proposal distribution. If the proposed steps are too small, the sampler explores slowly and produces highly correlated samples. If the steps are too large, most proposals land in low-probability regions and get rejected. Finding the right step size is an art, and it becomes increasingly difficult in high-dimensional spaces.
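Here's a minimal random-walk Metropolis sketch for one parameter. The target is a standard normal log-density, standing in for a real unnormalized log-posterior (log likelihood + log prior). Notice that the acceptance test only ever uses a *difference* of log-densities — this is the evidence term cancelling out:

```python
import math
import random

def log_post(x):
    # Unnormalized log-posterior; a standard normal standing in
    # for log(likelihood * prior) of a real model.
    return -0.5 * x * x

def metropolis(n_samples, step=1.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    out = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)   # symmetric Gaussian proposal
        lp_prop = log_post(prop)
        # Accept with probability min(1, p(prop)/p(x)); on the log
        # scale the normalizing constant never appears.
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out.append(x)
    return out
```

Try varying `step`: very small values give near-certain acceptance but a crawling, highly autocorrelated chain; very large values get rejected constantly — the tuning problem described above.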
Gibbs Sampling — One Parameter at a Time
Gibbs Sampling is a special case of Metropolis-Hastings, but it feels quite different in practice. Instead of proposing a joint move across all parameters simultaneously, it updates one parameter at a time, holding all the others fixed.
For each parameter, Gibbs samples from the full conditional distribution — the distribution of that parameter given the current values of all the other parameters and the data. If these full conditionals are known distributions (which happens when conjugacy is present — a callback to our last post), each update is a direct draw. No accept/reject step needed, and every proposal is accepted.
This makes Gibbs particularly attractive for models with conjugate structure, where the full conditionals have closed-form expressions. The sampler is simple, easy to implement, and requires no tuning.
The downside is that updating one parameter at a time can be very slow when parameters are highly correlated. The sampler ends up taking small, axis-aligned steps through a space that would be better navigated diagonally. In high-dimensional, correlated posteriors, Gibbs can struggle considerably.
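As an illustration, consider a zero-mean bivariate normal target with correlation ρ. Both full conditionals are themselves normal, so each update is a direct draw — no accept/reject step anywhere. A sketch:

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    # Target: zero-mean bivariate normal, unit variances, correlation rho.
    # Full conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically
    # for y | x, so every update is an exact draw.
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0
    out = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)
        y = rng.gauss(rho * x, sd)
        out.append((x, y))
    return out
```

Push ρ toward 1 and the same code mixes far more slowly: the conditional standard deviation √(1 − ρ²) shrinks, so the axis-aligned steps shrink with it — the pathology described above.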
Slice Sampling — Sampling by Slicing
Slice Sampling takes a different approach. Instead of proposing a point and deciding whether to accept it, it introduces an auxiliary variable that effectively "slices" the posterior at a certain height, and then samples uniformly within that slice.
The intuition is this: imagine you draw a horizontal line at a random height under the posterior curve. The set of parameter values where the posterior is above that line forms a "slice." You then pick a point uniformly at random from within that slice. This guarantees that you sample more often from high-probability regions (where the slices are wide) and less often from low-probability regions (where the slices are narrow).
The elegance of Slice Sampling is that it requires no tuning of a step size. It adapts automatically to the local scale of the distribution. This makes it a nice default choice when you want something more robust than basic Metropolis-Hastings but don't want to invest in the machinery required by more advanced methods.
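Here's a sketch of a one-dimensional slice sampler using Neal's stepping-out procedure, targeting an unnormalized log-density (again a standard normal). The width `w` is only an initial guess — the step-out loop grows the interval and the shrink loop tightens it, which is the sense in which the method adapts to local scale:

```python
import math
import random

def slice_sample(log_f, x0, n_samples, w=1.0, seed=0):
    # 1-D slice sampler with stepping-out (Neal, 2003), simplified:
    # no cap on the number of step-out expansions.
    rng = random.Random(seed)
    x = x0
    out = []
    for _ in range(n_samples):
        # Slice height: a uniform draw under the density at x, in log space.
        log_y = log_f(x) + math.log(rng.random())
        # Step out: widen [L, R] until both ends fall below the slice.
        L = x - w * rng.random()
        R = L + w
        while log_f(L) > log_y:
            L -= w
        while log_f(R) > log_y:
            R += w
        # Shrink: sample uniformly in [L, R], pulling the interval
        # inward whenever a point lands off the slice.
        while True:
            x_new = L + (R - L) * rng.random()
            if log_f(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                L = x_new
            else:
                R = x_new
        out.append(x)
    return out
```

Note there is no acceptance ratio to tune — every iteration produces a valid move.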
Elliptical Slice Sampling — Slice Sampling for Gaussian Priors
Elliptical Slice Sampling is a specialized variant designed for models where the prior is Gaussian (or multivariate normal). This is a common setting, as Gaussian priors show up frequently in spatial models, Gaussian processes, and many other contexts.
Where standard Slice Sampling constructs its slices without regard to the geometry of the prior, Elliptical Slice Sampling exploits the known elliptical geometry of the Gaussian prior. It proposes new points along an ellipse defined by the current sample and a draw from the prior, then uses the slice sampling logic to find an acceptable point on that ellipse.
The result is a sampler that is tuning-free (like standard Slice Sampling) and highly efficient for the specific class of models it targets. It's a clean example of the generality vs. specialization trade-off — it's narrower in scope than MH, but within its domain, it performs exceptionally well.
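A sketch for a single parameter with a N(0, 1) prior and a Gaussian likelihood, chosen so the exact posterior is known and the output can be sanity-checked. The shrinking-bracket loop is the slice-sampling logic, applied along the ellipse through the current point and a fresh prior draw:

```python
import math
import random

def elliptical_slice(log_lik, n_samples, prior_sd=1.0, f0=0.0, seed=0):
    # Elliptical slice sampling (Murray, Adams & MacKay, 2010),
    # specialized to one parameter with a N(0, prior_sd^2) prior.
    rng = random.Random(seed)
    f = f0
    out = []
    for _ in range(n_samples):
        nu = rng.gauss(0.0, prior_sd)                  # draw from the prior
        log_y = log_lik(f) + math.log(rng.random())    # slice height
        theta = rng.uniform(0.0, 2.0 * math.pi)
        t_min, t_max = theta - 2.0 * math.pi, theta
        while True:
            # Point on the ellipse through f and nu.
            f_prop = f * math.cos(theta) + nu * math.sin(theta)
            if log_lik(f_prop) > log_y:
                f = f_prop
                break
            # Shrink the angle bracket toward the current point and retry.
            if theta < 0.0:
                t_min = theta
            else:
                t_max = theta
            theta = rng.uniform(t_min, t_max)
        out.append(f)
    return out
```

With a N(0, 1) prior and a N(f | 1, 0.5²) likelihood, conjugacy gives an exact posterior of N(0.8, 0.2) — a handy check that the sampler is doing its job. Note there is no step size anywhere in the code.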
Hamiltonian Monte Carlo — Using the Gradient
Hamiltonian Monte Carlo (HMC) is where the samplers start getting sophisticated. HMC borrows an idea from physics: treat the parameter space as a physical system, where the negative log-posterior plays the role of a "potential energy surface," and introduce a fictitious "momentum" variable for each parameter.
The sampler then simulates the dynamics of a particle rolling across this energy surface. The particle naturally spends more time in low-energy regions (high posterior probability) and less time in high-energy regions (low posterior probability), which is exactly the behavior we want. And because the simulation follows the gradient of the posterior, the proposals are informed by the local shape of the distribution, not just random perturbations.
This is the key advantage of HMC: it makes large, directed moves through the parameter space while maintaining a high acceptance rate. The result is dramatically lower autocorrelation and much more efficient exploration, especially in high-dimensional problems where vanilla MH struggles.
The trade-off is cost. Each HMC step requires computing the gradient of the log-posterior, which is more expensive than the simple density evaluations used by MH. HMC also requires tuning two parameters: the step size (how far each simulation step goes) and the number of steps (how long the trajectory runs). Poor choices can lead to inefficient sampling or numerical instability.
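Here's a one-parameter HMC sketch, once more targeting a standard normal. The leapfrog integrator is where the gradient enters, and the Metropolis correction at the end accounts for numerical integration error; `step` and `n_steps` are exactly the two tuning knobs discussed above:

```python
import math
import random

def hmc(log_p, grad_log_p, n_samples, step=0.1, n_steps=20, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    out = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                  # fresh momentum each iteration
        x_new, p_new = x, p
        # Leapfrog integration: half momentum step, alternating full
        # steps, closing half momentum step.
        p_new += 0.5 * step * grad_log_p(x_new)
        for _ in range(n_steps - 1):
            x_new += step * p_new
            p_new += step * grad_log_p(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_log_p(x_new)
        # Metropolis correction: accept based on the change in total
        # energy H = -log_p(x) + p^2 / 2 (kinetic + potential).
        h_cur = -log_p(x) + 0.5 * p * p
        h_new = -log_p(x_new) + 0.5 * p_new * p_new
        if math.log(rng.random()) < h_cur - h_new:
            x = x_new
        out.append(x)
    return out
```

Because the leapfrog integrator nearly conserves energy, acceptance rates stay high even though each proposal travels a long way — the "large, directed moves" advantage in action.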
NUTS — The No-U-Turn Sampler
The No-U-Turn Sampler (NUTS) is the practical, self-tuning extension of HMC. It solves the trajectory length problem automatically by running the simulation until the particle starts to "turn around" — that is, starts moving back toward where it came from. This eliminates the need to manually specify the number of steps.
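The "turning around" test itself is simple: NUTS stops extending the trajectory when the momentum at either endpoint points back along the line connecting the two endpoints. A sketch of just that criterion (the function name and argument layout are mine):

```python
def has_u_turned(x_minus, x_plus, p_minus, p_plus):
    # Stop when the momentum at either end of the trajectory has a
    # negative dot product with the displacement between the endpoints,
    # i.e. further simulation would move the ends back toward each other.
    dx = [b - a for a, b in zip(x_minus, x_plus)]
    dot_minus = sum(d * p for d, p in zip(dx, p_minus))
    dot_plus = sum(d * p for d, p in zip(dx, p_plus))
    return dot_minus < 0.0 or dot_plus < 0.0
```

The full algorithm wraps this check in a recursive trajectory-doubling scheme, but the stopping rule is the heart of the "no U-turn" idea.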
NUTS also typically includes adaptive tuning of the step size during the warm-up phase. The result is a sampler that captures all of HMC's advantages — large, gradient-informed moves, low autocorrelation, efficient exploration — with minimal manual tuning. This is why NUTS has become the default sampler in modern probabilistic programming frameworks like Stan and PyMC.
It is slower per iteration than simpler methods, but the quality of the samples it produces — and the speed of convergence — more than compensate. For most practitioners working with complex models, NUTS is the go-to.
Choosing a Sampler
There is no one-size-fits-all answer. The right sampler depends on the structure of your posterior, the dimensionality of your parameter space, and the computational budget you have available. A model with conjugate structure might call for Gibbs. A model with a Gaussian prior might benefit from Elliptical Slice Sampling. A complex, high-dimensional model with no special structure will likely need HMC or NUTS.
The key is to understand the trade-offs each sampler makes, and to use convergence diagnostics — R-hat, trace plots, effective sample size — to verify that whatever you chose is actually working.
In the next post, we'll shift from parameter inference to a different kind of question: how do we compare entire models against each other?
See you in the next post!