Patterns on the complex floor: A fun little example of simulation-based experimentation in mathematics

Calling it a “fun little example” might sound disrespectful, but it’s not. Examples are not hard to come by, but “fun” and “little” are special. Just as it’s said that it can take a lot of work to write something concise, it can take a lot of understanding to demonstrate an important nontrivial point with an example that is fun and little.

As regular readers know, we’ve been pushing simulation-based experimentation (formerly called fake-data simulation) for a long time. I think simulation-based experimentation is usually the best way, by far, of understanding statistical procedures.

One thing I haven’t though about so much is how the same idea can work in math—even for problems that have no inherent probabilistic structure.

The general idea goes as follows. Suppose you have a mathematical conjecture. It might be true or it might be false, you don’t know, that’s why it’s a conjecture, not a theorem. You can try to prove it for some special cases and you can also look for counterexamples in various places. This discussion of cases and places suggests a meta-model in which there is some space of possible values. Consider a space Theta and a conjecture C(theta). The conjecture is a true theorem if C(theta) is true for all theta in Theta.

At this point you can do simulated-data experimentation by sampling theta from some distribution and evaluating C at each sampled value. If C is true for enough values, you can think about making probabilistic statements about your theorem, but even if this is not the case you can learn from the pattern of truth.

I was moved to think about all this after reading this great post by John Cook demonstrating the use of this idea. Cook’s post is great for two reasons:

1. His example is just on the border of triviality. As noted at the top of this post, “near-trivial” could sound bad, but I’m intending it to be a compliment. Trivial examples are fine too, but this one has enough complexity to be non-trivial while being simple enough that once you see the pattern, it’s kinda clear. It’s similar to our golf example (see section 10 here). Here’s a relevant discussion: Why we kept the trig in golf: Mathematical simplicity is not always the same as conceptual simplicity.

2. He doesn’t just demonstrate simulation-based experimentation, he also goes through its workflow. I illustrate through some excerpts from Cook’s post:

I [Cook] wrote some Python code to test this, and to my great surprise, the identity often holds. In my first experiment, it held something like 84% of the time for random inputs. I expected it would either rarely hold, or hold say half the time (e.g. in a half-plane).

My first thought was to plot 1,000 random points, green dots where the identity holds and red stars where it does not. This produced the following image.

I like how he presents results in the context of his expectations.

Cook continues:

Since I’m sampling uniformly throughout the square, there’s no reason to plot both where the identity holds and where it doesn’t. So for my next plot I just plotted where the identity fails.

The dots on those graphs are too large, but, hey, nobody’s perfect! For some reason lots of people like to make graphs with dots that are too large.

More from Cook:

That’s a little clearer. To make it clearer, I rerun the plot with 10,000 samples. (That’s the total number of random samples tested, about 16% of which are plotted.)

Then to improve the resolution even more, I increased the number of samples to 1,000,000.

It might not be too hard to work out analytically the regions where the identity holds and doesn’t hold. The blue regions above look sorta like parabolas, which makes sense because square roots are involved. And these parabola-like curves has ziggurat-like embellishments, which makes sense because floors are involved.

I like that—the connection between the graph and the theoretical model.

And there’s still more, including a final graph but I’ll let you read the whole thing for that. My point here is we learn not just from the example but also from the workflow, the steps that got us there.

P.S. I scheduled this post for this day because the example is a fun present all wrapped up, also because it seems that Cook is Christian so he might appreciate this gesture. Blogs are so cool. It’s Christmas every day, with strangers offering us amazing and unexpected gifts from all over the world. The internet is more than just people screaming at each other, it’s also people going to the trouble of preparing these delicious treats and sharing them with anyone who cares to stop by.

Lee Wilkinson

Lee Wilkinson is most famous for his book, The Grammar of Graphics, whose ideas were implemented in the widely used R package ggplot2 as well as Tableau, the popular commercial graphics program, and all sorts of other places. Arguably as important as the book itself is its title. The idea that graphics has a “grammar”—that’s a real breakthrough. It’s related to ideas in statistics (graphs as comparisons, graphs as model checks, graphs as exploration and being surprised, with surprise being implicitly defined relative to expectations and modeling) and to ideas in computer science (here I’m not so familiar with the literature, but I’m thinking of the idea of a graph being defined not by what it looks like but rather by the steps used to create it). The modern era of statistical graphics was begun by Tukey in the 1960s and 1970s and continued by Cleveland, Tufte, and others in the 1980s. I think of Wilkinson as the key figure in the next wave of work in this area: once the general messages (exploratory data analysis is important; graphs can be clear and beautiful) had been absorbed, there was space for new ways of thinking about the process of creating statistical graphics. Along these lines, I think it was valuable that Lee straddled the worlds of academia and commerce.

I did not know Lee well—I’m not sure exactly how many times we actually spoke, but it was less than ten times—but I considered him a friend. We corresponded by email from time to time. He was a sincere and thoughtful person, both soft-spoken and lively, if that is possible. He could be critical (see for example here) but only because he cared so much about statistics and its applications that it bothered him when people were being annoyingly stupid. I think it’s safe to say that Lee’s ideas will continue to influence data analysis for many decades to come.

Interpreting apparently sparse fitted models

Hannes Margraf writes:

I would like your opinion on an emerging practice in machine learning for materials science.

The idea is to find empirical relationships between a complex material property (say the critical temperature of a superconductor) and simple descriptors of the constituent elements (say atomic radii and electronegativities). Typically, a longish list of these ‘primary’ descriptors is collected. Subsequently a much larger number of ‘derived’ features are generated, by combining the primary ones and some non-linear functions. So primary descriptors A,B can be combined to yield exp(A)/B and many other combinations. Finally, lasso or similar techniques are used to find a compact linear regression model for the target property (a*exp(A)/B+b*sin(C) or whatnot).

The main application of this approach is to quite small datasets (e.g. 10-50 datapoints). I’m kind of unsure what to think of this. I would personally just use some type of regularized non-linear regression with the primary features here (e.g. GPR). Supposedly, the lasso approach is more interpretable though because you can see what features get selected (i.e. how the non-linearity is introduced). But it also feels very garden-of-forking-paths-like to me.

I know that you’ve talked positively about lasso before, so I wonder what your take on this is.

It’s hard for me to answer this with any sophistication, as it’s been a long long time since I’ve worked in materials science—my most recent publication in that area appeared 34 years ago—so I’ll stick to generalities. First, lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model. So I’d prefer to think of lasso, horseshoe, etc. as regularizers, not as devices for selecting or finding or discovering a sparse model. To put it another way, I don’t take the shrink-all-the-way-to-zero thing seriously. Rather, I interpret the fitted model as an approximation to a fit with more continuous partial pooling.

Design choices and user experience in Stan and TensorFlow Probability

(This post is by Charles, not Andrew)

I’ve been a Stan developer for a bit more than 5 years now (!!) and this summer, I interned at Google Research with the team behind TensorFlow Probability (TFP). I wanted to study a new probabilistic programming language and gain some perspective on how Stan does things. I’ll compare the two languages but rather than focus on performance, I want to talk about design and user experience. Which is fun because this aspect gets a bit lost in academic writing.

As an applied statistician, I love the Stan user experience because the focus is on modeling. Stan’s default inference is general-purpose and can be used on a very wide range of models. Other than tweaking a few tuning parameters, I usually rely on Stan’s default inference engine, which is a dynamic Hamiltonian Monte Carlo sampler. This means I can spend more time worrying about the model and whether or not it gives me the insights I seek. Being able to use “sampling” statements in the model block, i.e.

y ~ normal(theta, sigma)

is really nice because this is how I write models on paper. No graphs, no names chosen from a menu of options, no formula. For me, nothing beats writing down the generative process. Pedagogically this is a little bit confusing because any statement in Stan’s model block is not an actual sampling statement, since nothing gets sampled. Rather it’s a call to increment the log of the joint density or, in Stan’s jargon, the “target”. So maybe we should only teach the equivalent statement

target += normal_lpdf(y | theta, sigma)?

I’m not sure. Currently I teach both. I also like Stan’s coding blocks. Since a Bayesian model is specified by a joint distribution over observed variables and parameters, declaring variables in a data and parameter block further resonates with my statistical formulation of the model. There’s also another advantage: the blocks each have a different computation cost which is helpful to think about when writing computationally intensive models (see for example the discussion in this paper).

TFP acts more as a Python package than its own language. Within the same script, we’ll process the data, specify our model, run our inference, and do all the post-processing. This doesn’t have the pedagogical appeal of building blocks, nor its rigidity. Based on the tutorials I worked through, there are no pseudo-sampling statements, which means less clarity in terms of what model we’re writing down but also less confusion in terms of what computation takes place. Matt Hoffman tells me TFP has something akin to Stan’s sampling but it has less of a DSL (domain specific language) feel to it. When writing the model in TFP we can use the many functions available in Python with one super important caveat: the code needs to be differentiable, if we use gradient-based algorithms (which we probably do). With that in mind, TFP is designed to be on top of two automatic differentiation libraries: TensorFlow (surprise, surprise) and Jax. Stan has its own autodiff library so *broadly speaking* everything you write in Stan is safely differentiable. I was advised to use Jax which is pretty awesome — the documentation itself is a fascinating read. One compelling feature is that Jax supports many NumPy functions. And NumPy is great because of all the mathematical operations it offers and because it seamlessly parallelizes these operations across CPU cores. Jax takes NumPy a step further by parallelizing operations across accelerators, such as GPUs, and making a lot of these functions differentiable and therefore amiable to gradient-based inference. For those who are familiar with NumPy, this is great. For me there was a bit of a learning curve.

It’s worth pointing out that there is quite a bit of overlap between the functions Jax and Stan support but there are differences which reflect the motivating problems behind each project. Over the years, Stan’s support for ODEs has become very good and quite broad. Jax offers more specific ODE support, which works well for certain problems (e.g. Neural ODEs), less so in other cases. Note that there are pending pull requests extending Jax’s ODE support. But let’s focus on design here. Stan’s ODEs report error messages when they fail: for instance the numerical integrator reaches the maximum number of steps. Jax on the other hand reports… nothing. This is because apparently exceptions don’t exist on XLA (accelerated linear algebra, see this discussion on GitHub). Not having error messages caused me a lot of debugging grief. Overall I did find Stan did a better job issuing error and warning messages. Another example is divergent transitions: Stan automatically warns you when divergences occur. In TFP, you have to dig them out, which means you need to already know what a divergent transition is, and that may not be the case for many users.

Ok, this is where things get fun. TFP asks you to specify your inference algorithm. If you call tfp.mcmc you need to define the transition kernel of your Markov chain and usually, you should also write down an adaptive scheme. This takes a few lines of code, even if you use TFP’s built-in features. You then probably want to specify a “trace” function which records various properties of your sampler across iterations, such as whether a divergent transition occurs. This is more effort than relying on Stan’s very good defaults. But it also makes it easier to try out new transition kernels and moreover new inference algorithms! How easy is it to do this in Stan? It’s hard and this is not what Stan was designed for. That said, the underlying C++ code is open-source, well-written and well documented, and so it can be hacked. People do it. I’ve done it. And this leads to very interesting forks of Stan.

The question is: as an applied Statistician, do I resent not being able to specify my own inference algorithm in Stan? This short answer is occasionally. I’m biased because many of my scientific collaborations begin with people not being able to fit their model in Stan. Often we do eventually manage to do it by reparameterizing the model, baking more information into the priors, rewriting the ODE integrator, etc. Working out a new inference algorithm and efficiently implementing it is a bigger leap. My research does focus on the problems that frustrate our inference and for which we want to develop better methods. But this takes time. In the meantime, I believe there is a strong case to be made for an algorithm that has been stress-tested throughout the years, theoretically well-studied, and which empirically has worked for many problems.


On the other hand, as a researcher in the field of statistical algorithms, I do enjoy the modularity TFP provides and having the ability to easily try out new methods. TFP also has “experimental” features which are not fully supported but can nevertheless be tried out. I pitched the idea of including features explicitly labeled as “experimental” in Stan, as a middle-ground between making available some of the tech we’re developing and being upfront about what remains ongoing research. This gets into questions of what Stan’s mission is and how much we trust users to take heed of our warning messages. Long story short, for the time being, experimental features won’t be part of Stan’s official releases (except ADVI, which does issue a warning message).

Comparing TFP and Stan reminds me a little bit of the historical conflict between Microsoft and Apple, that is a modular approach versus end-to-end control designed to create a seamless user experience, at least on the algorithmic front. In the past I’ve approached this question by thinking about specific problems, the performance benefits of using specialized versus general techniques, and whether it makes sense to have a universal inference algorithm, or at least a universal default. But I think a big component of software development is design and user experience. How do “everyday scientists”, if you’re willing to accept this term, interact with the software? What do they want to focus on? What balance between guidance and freedom works best for a broad audience? Now when it comes to model specification, both Stan and TFP champion modularity and giving the users a large control over their model, which we could contrast to other packages out there. This too is a design choice and one we could ponder about.

Drawing Maps of Model Space with Modular Stan

This post is by Ryan Bernstein. I’m a PhD candidate at Columbia with Andrew and Jeannette Wing.

Right now, each probabilistic program we write represents a single model. As we tackle a modeling task, we try one model after another, and old programs pile up in our “models” folder or git history. What if instead of describing one model at a time, we focused on drawing a map of our model space? What if a program represented not a single probabilistic model, but a whole network of models and their rich interconnections, a region of model space that grew as we explored?

If we drew maps of model space as we experimented, it would be easier for others to follow our work. We could annotate regions of our map, like “These models rely on a big assumption,” or “Here, there be dragons convergence issues.” We could draw our path of exploration, from our first experiment to our final model and all of the false turns along the way. We could document, “This way was too slow”, or “This way gave strange prior predictions”, or “I didn’t have time to explore this way, maybe you should.” Issues like the Garden of Forking Paths and P-hacking make it clear that how and why a model was grown is necessary context for assessing its trustworthiness; model space maps would act as living documentation of that context.

Having a map would also help us navigate through model space as we explore. We could orient ourselves in model space by contrasting our current model to its immediate neighbors: we could say things like, in that direction, a posterior mean trends upwards, or calibration issues appear, or predictive accuracy improves.

Once we sketch out a region of model space, we could deploy robot servants to help us understand it. They could scout ahead for models that are accurate, simple, or robust. They could also check that our conclusions don’t depend too much on any one turn we made along our route, ala the Garden of Forking Paths. Maps enable automation, and automation can save us time and catch our errors.

So what kind of program can represent a growing network of models while remaining concise and readable?

I’ve developed a simple metaprogramming feature called ‘swappable modules’1 that can extend languages like Stan to do just that. Each ‘modular Stan program’ can compile down to a whole set of normal, ‘concrete’ Stan programs. For now we’ll treat model space as a discrete2 network of models and we’ll assume that the data is consistent across the network.

First I’ll introduce ‘modular Stan’ alongside a trivial example, and then I’ll give two more interesting case studies. You can play along at home with my prototype compiler and interactive visualizations at http://ryanbe.me/modular-stan.html. You can load each example with the buttons on the lower left, or you can write your own modular program. Expect plenty of bugs!

A way to program maps

Suppose we have some data x that we believe to be normally distributed. Let’s write a modular Stan program to represent a region of the model space of x.

Modular Stan programs start with a base, and then they are built up by adding one piece, or module, at a time. The base is part of every model we produce, so it should only contain things we’re sure won’t change: the shape of the data, the key inferred quantities, and a basic outline of the model. Here is the base for our `x` modular program:

data {
    int N;
    vector[N] x;
}
model {
    x ~ normal(Mean(), Stddev());
}

So far this is like a Stan program except that it has two “holes” in it, Mean and Stddev, that represent as-yet undefined behavior.

Next, we write a module to fill in each hole. A module can contain arbitrary code and declare new parameters like a miniature Stan program, and it can also return a value.

Here’s a module to fill in the Mean hole:

module "standard" Mean() {
    return 0;
}

This reads as: define a module called "standard" that fills the Mean hole with 0.

Here’s a module to fill Stddev:

module "standard" Stddev() {
    return 1;
}

Our modular Stan program up to this point is equivalent to a single concrete Stan program:

data {
    int N;
    vector[N] x;
}
model {
    x ~ normal(0, 1);
}

This isn’t very interesting, so let’s explore the model space a little bit more. We can explore by simply adding more modules.

Let’s try modeling the mean of x as a random variable. We add a module called "normal" to fill the Mean hole with a new parameter that has a prior:

module "normal" Mean() {
    parameters {
        real mu;
    }
    mu ~ normal(0, 1);
    return mu;
}

If we use "normal" to fill Mean and "standard" to fill Stddev, the concrete Stan program will be:

data {
    int N;
    vector[N] x;
}
parameters {
  real mu;
}
model {
    mu ~ normal(0, 1);
    x ~ normal(mu, 1);
}

Since we now have two ways to fill the holes, our modular Stan program is equivalent to a set of two concrete Stan programs.

Now suppose we also want to model the standard deviation, but we aren’t sure whether or not to use an informative prior (sorry for the contrived example). We can leave that decision as another hole:

module "lognormal" Stddev() {
    parameters {
        real<lower=0> sigma;
    }
    sigma ~ lognormal(0, StddevInformative());
    return sigma;
}
module "yes" StddevInformative() {
    return 1;
}
module "no" StddevInformative() {
    return 100;
}

StddevInformative is a hole nested inside the module "lognormal". In this way, modular programs are hierarchical as well as branching. We can view modular programs as trees3:

Our modular program now represents a set of six concrete models, one for each way the set of holes can be filled: two ways for Mean ("standard" and "normal") times three ways for Stddev ("standard", "lognormal" with StddevInformative: "yes", and "lognormal" with StddevInformative: "no".).

How do we build a network from this set of programs? There’s a natural definition: we draw an edge between two models if they differ by only one module4. Each neighbor is then one ‘decision’ away5. Here is the network of models for this trivial example:

A representation of the network of models for the “x” example. Edges are annotated with “hole” that differs between the two models. The prototype website has an interactive version of this visualization.

‘Golf’ Example: A travelogue through model space

I’ve rewritten Andrew’s golf case study as a modular program to show what it would look like to write a case study as a “travelogue” of development through model space.

In the case study, the modeling task is to understand a golfer’s chance of sinking a shot given their distance from the hole. Like before, we’ll start by writing an outline of the problem that will be constant for the whole model space:

data {
  int J;        // Number of distances
  vector[J] x;  // Distances
  int n[J];     // Number of shots at each distance
  int y[J];     // Number of successful shots at each distance
}
model {
  y ~ NSuccesses(n, PSuccess(x));
}

We start with two holes: NSuccesses is the distribution of the number of successful shots given the number of attempts n and the probability of success; PSuccess is the probability of success given the distance from the hole x.

A good way to count successes is the Binomial distribution, so let’s add a module called “binomial” to fill NSuccesses:

module "binomial" NSuccesses(y | n, p) {
  y ~ binomial(n, p);
}

Now we need a module for PSuccess to map the distance of a shot x to the probability of success. We start with a logit function:

module "logistic" PSuccess(x) {
  parameters {
    real a;
    real b;
  }
  return logit(a + b*x);
}

We’ve now defined logistic regression, the first model from the case study:

data {
  int J;        // Number of distances
  vector[J] x;  // Distances
  int n[J];     // Number of shots at each distance
  int y[J];     // Number of successful shots at each distance
}
parameters {
  real a;
  real b;
}
model {
  y ~ binomial(n, logit(a + b*x));
}

In this fashion, we can add a new module for each step of Andrew’s logic in the case study. See the full modular program here: http://ryanbe.me/modular-stan.html?example=golf.

Let’s use my prototype graphical interface to visualize the resulting modular program. Here are the parts of the interface:

1. Modular program editor with compile and load buttons. 2. Interactive module tree. 3. Interactive network of models. 4. Selected concrete Stan program, if any. 5. Label and documentation editor for the selected concrete program, if any.

We can select modules to narrow down the network of models to a single concrete model, as shown in the figure at the top of this post.

And finally we can see the case study as a “travelogue” that documents a path through the network of models:

‘Birthday’ Example: Exploring model space with “scouting robots”

To demonstrate automation on the network of models, I’ve rewritten the Birthday case study from as a modular program. I collaborated with Hyunji Moon to build this demo.

The goal of the birthday case study is to understand trends in the number of births over time. The authors use a Gaussian process time series model, and they experiment with including a few different types of trends (e.g. day-of-week, day-of-year, holidays, seasonal) and other modeling decisions (e.g. the weighting structure of days-of-week scores and the choice of hierarchical priors on day-of-year scores). When each trend inclusion or other modeling decision is represented as an independent choice of module, the modular program represents 120 models.

You can load the example with the full source code at http://ryanbe.me/modular-stan.html?example=birthday.

Here is the network of modules:

Suppose we want to find a model that does a reasonable job of fitting our time-series birthday data. That’s a lot of possibilities to explore manually! So let’s enlist a friendly robot to search the network for promising models.

We send off our robot with instructions for a greedy search:

“Start here.
Score the neighbors of every model you visit.
Move to the highest scoring model you’ve seen so far.
If you’re already at the highest scoring model, stop searching.”

This is like discrete gradient descent. We equip our robot with a scoring method: Expencted Log-Predictive Density (ELPD) is a reasonable choice that approximates goodness-of-fit.

Here is the data our robot collects:

The red annotations show the ELPD scores for the assessed models. The robot traveled along the arrow path from [START] to [GOAL].

In this case the robot ended its search at the case study’s final model. This optimization amounts to simple symbolic regression on probabilistic programs.

While greedy maximization of ELPD is naive and we shouldn’t blindly trust it to give us our final model, we can at least use it to find promising neighborhoods. I hope this serves as a proof of concept to open the door for other, smarter network-wide algorithms.

Future work

I hope that model space “maps” will give us new “navigation” tools for development, enable automatic search and validation, and help others to understand and reuse our work.

There’s a ton of opportunity for development in this area, such as:

  • What other algorithms can run on a network or neighborhood of models? What other scores would be useful to optimize?
  • How should we decorate network edges to support model navigation? What about for special cases like causal inference?
  • Can we validate modules themselves6?
  • Can we build libraries of modules?
  • Can we learn to synthesize modules?
  • Can we automatically ensemble or stack models in a network?
  • Can we include automated model transformations in the network in addition to module-swapping transformations?
  • If we’re fitting models in a network one after another, can we use their relatedness to sample more efficiently?
  • Are there extensions of the module-swapping system or other metaprogramming features that would be worth the added complexity7?

If you have any questions or ideas, please leave a comment or reach me directly at ryan[dot]bernstein[at]columbia.edu!

Footnotes:

1
A more technical programming languages term might be, ‘module system with non-deterministic functor application’.

2
We can treat the model space as discrete by unifying models that differ only by the values of continuous hyperparameters. Those hyperparameters can be optimized (or modeled) independently from the network of models approach.

3
It is possible for modular programs to not form trees when two modules contain the same hole, but all modular programs form DAGs.

4
Strictly speaking, to account for holes like StddevInformative that aren’t always in use, two models are neighbors if they differ by one subtree of modules.

5
Akin to a Hamming graph.

6
To paraphrase Bob’s 2021 ProbProg talk: modularity is the elephant in the room, but its big challenge is testing modules in isolation. The next best alternative may be to test within a modeling context by a sort of swap-one-out validation.

7
For example: a special type of “hole” that’s filled with a collection of modules and returns their values as an array. This would be useful for cases like including a subset of features in a regression or components of a Gaussian process.

It’s so hard to compare the efficiency of MCMC samplers

I thought I’d share a comment I made on Perry de Valpine‘s post on the NIMBLE blog,

Perry was posting about a paper that tried to compare efficiency of Stan, NIMBLE, and JAGS for some linear modeling problems. You can find the link in Perry’s post if you really care and then the paper links to their GitHub. Perry was worried they were misrepresenting NIMBLE and users would get the wrong idea. We decided to take the approach of established companies and simply ignore this kind of criticism. But this time I couldn’t resist as it’s not really about us.

Here’s my comment (lightly edited):

Comparing systems on an equal footing is a well-nigh impossible task, which is why we shy away from doing it in Stan. The polite thing to do with these kinds of comparisons is to send your code to the devs of each system for tuning. That’s what we did for our autodiff eval paper.

I don’t like this paper’s evaluation any more than you do! I’d like to see an evaluation with (a) arithmetic on an equal footing, (b) the kinds of priors we actually use in our applied work, and (c) something higher dimensional than p = 100 (as in p = 10,000 or even p = 100,000 like in the genomics regressions I’m working on now). Then the evaluation I care about is time to ESS = 100 as measured by our conservative cross-chain ESS evaluations that also allow for antithetical sampling (Stan can produce samples whose ESS is higher than the number of iterations; may estimators just truncate at the number of iterations because they don’t understand ESS and its relation to square error through the MCMC CLT). The problem with this kind of eval is that we want to represent actual practice but also minimize warmup to put systems in as favorable a light as possible. In simple GLMs like these, Stan usually only needs 100 or maybe 200 iterations of warmup compared to harder models. So if you use our default of 1000 warmup iterations and then run sampling until you hit ESS = 100, then you’ve wasted a lot of time in too much iteration. But in practice, you don’t know if you can get away with less warmup (though you can use something like iterative deepening algorithms to probe deeper, they’re not built in yet).

One way to get around this sensitivity to warmup is to evaluate ESS/second after warmup (or what you might call “burnin” if you’re still following BUGS’s terminology). But given that we rarely need more than ESS = 100 and want to run at least 4 parallel chains to debug convergence, that’s not many iterations and you start getting a lot of variance in the number of iterations it takes to get there. And things are even more sensitive to getting adaptation right. And also, I don’t think ESS/second after warmup is the metric practitioners care about unless they’re trying to evaluate tail statistics, at which point they should be seriously considering control variates rather than more sampling.

In other words, this is a really hard problem.

I then read Perry’s follow up and couldn’t help myself. I actually looked at their Stan code. Then I had a follow up comment.

I just read their source code. It’s not exactly Stan best practices for statistics or computation. For instance, in mixture_model.stan, there are redundant data computations per iteration (line 25), redundant distributions (also line 25), inefficient function calls (line 31), and a conjugate parameterization inducing extra work like sqrt (line 23). Then in AFT_non_informative.stan, they use very weakly informative priors (so it’s misnamed), there are missing constraints on constrained variables (lines 17, 32), redundant computation of subexpressions and of constants (lines 26, 27), missing algebraic reductions (also lines 26 and 27), redundant initialization and setting (lines 22/23 and 26/27), redundant computations (line 32).

The worst case for efficiency is in their coding of linear models where they use a loop rather than a matrix multiply (LinearModel_conjugate.stan, lines 30–32, which also violates every naming convention in the world by mixing underscore separators and camel case). This code also divides by 2 everywhere when it should be multiplying by 0.5 and it also has a bunch of problems like the other code of missing constraints (this one’s critical—`sigma` needs to be constrained to be greater than 0).

Then when we look at LinearModel_non_informative_hc.stan, things get even worse. It combines the problems of LinearModel_conjugate with two really bad ones for performance: not vectorizing the normal distribution and needlessly truncating the Cauchy distribution. These would add up to at least a factor of 2 and probably much more.

And of course, none of these exploited within-chain parallelization or GPU. And no use of sufficient stats in the conjugate cases like Linear_model_conjugate.stan.

Simulation-based inference and approximate Bayesian computation in ecology and population genetics

Paul Kedrosky asks:

Have you written anything on approximate Bayesian computation? It is seemingly all the rage in ecology and population genetics, and this recent paper uses it heavily to come to some heretical conclusions.

I passed this over to Lizzie who wrote:

Fun paper! Unfortunately, I too do not know much about this myself but I have two examples where I have heard of using it, both of which I think have some similar issues of exponentially increasing number of possible branches taken (e.g., Coalescent theory of pop gen and Lévy flight models). I don’t know the math of these models well enough to know what exactly makes the likelihood so expensive to calculate and I don’t really see authors explaining that on quick glance. My very gut reaction is that these are inherently non-identifiable models ….

Lizzie also pointed to figure 1 in this paper which I’ve reproduced above. And she asked, “What makes something ‘approximate’ Bayesian? And is it different than being ‘approximately’ Bayesian?” I replied that, except in some very simple models, all inferences, Bayesian or otherwise, are approximate, and all computations are approximate too, in the sense that you can’t really get exact draws from the posterior distribution or closed forms for posterior expectations or intervals.

And Lizzie connected us to Geoff Legault, who wrote:

The paper is also a mystery to me, but I do think ABC methods, or more broadly, simulation-based inference can be useful if done carefully and with full awareness of its many limitations.

In ecology, there are many “models” that are easily simulated, that in principle have likelihood/probability functions, but those likelihoods are not available or they cannot possibly be computed. For example, certain kinds of stochastic, spatially-explicit population models would require working with transition matrices that are too large for any computer that currently exists or will ever exist (probably). I gather some models in physics also such intractable likelihoods. In those cases, the idea is to use the simulator to either approximate the likelihood or simply to learn more about your model. Most applications of simulation-based inference that I’ve seen opt for the latter: parameter values are sampled from a prior distribution, data is simulated with them, distances between several summary statistics of simulated and real data are computed, and those (parameter) samples with distances below some tolerance level are retained. To me, the reliance on distances between summary statistics feels much more “approximate” than the usual approximations we accept when making most inferences.

In addition to the paper Lizzie linked, I highly recommend this recent review of simulation-based inference.

And John Wakeley adds:

Here’s ABC in a nutshell from the perspective of population genetics. Ideally you want the likelihood of the full data, but this is too complicated to compute analytically. It is even too complicated to compute using simulations because the data have too many dimensions and you might not even be quite sure what to measure. An example would be the whole genomes of a sample of individuals. Also, you are interested in particular model(s) with particular parameters which might not be enough to determine all aspects of the data. These issues are of course in common to all sorts of inference problems. In ABC, there are two or three ways in which approximations are made. The first, also in common with many inference methods, is dimension reduction. You choose some quantities to calculate from the full data, ones which capture what you think are important axes of variation, relevant to your model(s) and parameters. The second is to choose a tolerance level, which you’ll use to estimate the probability of the reduced data by a rejection method. Specifically, for a given set of parameter values, you compute the fraction of simulation replicates which produce values of these quantities close enough to your observed ones to be acceptable. Obviously this requires some finesse: too loose a tolerance means tossing out much to the information you’ve got, and too tight a tolerance can put you back in situation you were trying to avoid in the first place (i.e. if almost no simulation replicates will give exactly the observed values of your quantities). A wrinkle in population genetics, due to genomes being more-or-less digital, is that most of the commonly used quantities are discrete rather than continuous, so tolerances should be chosen accordingly but that’s not necessarily a simple matter depending on the particular quantities. The third point of approximation is the choice of the number of simulation replicates. I suppose this could also bring up subtleties, e.g. the unevenness of errors in estimating the likelihoods for different sets of parameters values.

So there you have it. This discussion makes me think that there could be some useful ways to extend our Bayesian workflow to simulation-based models.

Webinar: The Current State and Evolution of Stan

This post is by Eric.

Next Thursday, Rok Češnovar is stopping by to talk to us about recent developments in and around Stan. You can register here.

Abstract

We will present the current state of the Stan ecosystem, highlight some of the advances that improved the performance of Stan in the past few years, and discuss what is still to come in the not-so-distant future. You will hear about the core modules of Stan, how they all fit together, and how the various interfaces bring that core to life in different ways to make your Stan models run. If your models have been compiling and running faster recently and you were wondering why, we will present the improvements that were made in Stan over the last few years that sped up those pesky gradient evaluations. If you feel that your models are still running slowly, we will provide a few tips on how to identify the computational (non-modeling) reasons why that may be the case, and what are some of the things you can do to speed up computation.

About the speaker

Rok Češnovar is a Stan developer and a PhD student at the University of Ljubljana who is interested in making Bayesian inference faster using parallelism on GPUs and multi-core CPUs. He joined the Stan development team while working on adding GPU support with the team of prof. Štrumbelj at UL, but quickly became interested in almost all aspects of Stan. In addition to working on adding the GPU support, he has co-authored the cmdstanr R package, added profiling for Stan, and implemented the language side of reduce_sum and the new ODE interface. He has also been managing and coordinating Stan & CmdStan releases for the last two years.

The video is now available here.

Comparison of MCMC effective sample size estimators

Aki writes:

tl;dr

In the case of well mixing chains, the functions ess_basic and mcse_quantile in the posterior package are the most accurate methods. The ess_basic package computes sum of autocorrelations using the truncation rule of Geyer (1992), and mcse_quantile uses in addition an inverse approach for quantile Monte Carlo standard error. In the case of well separated modes, the posterior package also provides appropriately small effective sample size indicating that the chains are not mixing.

What should effective sample size estimates look like when we have chains that are not mixing? Here’s what Aki says:

If the modes are well separated so that between-chain variance is much bigger than within-chain variances, each chain is mostly indicating the loaction of the mode and not worth much more than one observation. The posterior package combines Geyer truncated autocorrelation estimates with multi-chain convergence diagnostic R-hat. If the chains are not mixing, then the draws within each chain are considered to be more correlated within the chain and the multi-chain ESS estimate gets smaller. If the modes are well separated, multi-chain ESS is close to the number of chains. Arguably even this is an overestimate as increasing the number of non-mixing chains does not provide reliable estimate of the relative masses of the masses.

This is a big deal. If you estimate effective sample size separately for each chain and then average the chains, then you’ll be much too optimistic when your chains aren’t mixing. That’s why we combine between and within-chain information as described in BDA but which is done more effectively in our recent paper, Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC.

Finally, here’s the background from Aki:

The MCMC effective sample size (ESS) and Monte Carlo standard error (MCSE) estimated for one chain includes estimation of the correlation between the iterations, for example, using autocorrelation time (or spectral density at frequency zero). As a finite number of MCMC draws are available the autocorrelation estimates for bigger lags are noisier. In our paper, we wrote about autocorrelation estimators:

In our experiments, Geyer’s (1992) truncation had superior stability compared to flat-top (Doss et al., 2014) and slug-sail (Vats and Knudson, 2018) lag window approaches.

This notebook includes additional experiments . . .

Again, here’s the link to Aki’s notebook.

Approximate leave-future-out cross validation for Bayesian time series models

Paul Bürkner, Jonah Gabry, and Aki Vehtari write:

One of the common goals of time series analysis is to use the observed series to inform predictions for future observations. In the absence of any actual new data to predict, cross-validation can be used to estimate a model’s future predictive accuracy, for instance, for the purpose of model comparison or selection. Exact cross-validation for Bayesian models is often computationally expensive, but approximate cross-validation methods have been developed, most notably methods for leave-one-out cross-validation (LOO-CV). If the actual prediction task is to predict the future given the past, LOO-CV provides an overly optimistic estimate because the information from future observations is available to influence predictions of the past. To properly account for the time series structure, we can use leave-future-out cross-validation (LFO-CV). Like exact LOO-CV, exact LFO-CV requires refitting the model many times to different subsets of the data. Using Pareto smoothed importance sampling, we propose a method for approximating exact LFO-CV that drastically reduces the computational costs while also providing informative diagnostics about the quality of the approximation.

A useful generalization, I’d say!

More should be possible, although I’m not quite sure how. One concern with leave-future-out is that you’re evaluating model fit for the early points in the series in a different way than how you’re evaluating later points in the series. Which makes me think that this sort of marginal evaluation could be done in other settings as well. Maybe one way to understand what’s possible would be to apply this mixed marginal/conditional predictive idea (something sliding between prior and posterior predictive checking) for the simpler problem of regression with independent data points. In some sense, leave-one-out or other forms of cross validation are already intermediate steps along this continuum, so maybe this is a good opportunity to explore these ideas further. From the other direction, we can consider problems such as spatial and network models in which there is no ordering, and cross-validation is related to the pseudo-likelihood idea of Besag (1974).

The Current State of Undergraduate Bayesian Education and Recommendations for the Future

This post is by Aki

Mine Dogucu and Jingchen Hu arxived in September a paper The Current State of Undergraduate Bayesian Education and Recommendations for the Future with an abstract

With the advances in tools and the rise of popularity, Bayesian statistics is becoming more important for undergraduates. In this study, we surveyed whether an undergraduate Bayesian course is offered or not in our sample of 152 high-ranking research universities and liberal arts colleges. For each identified Bayesian course, we examined how it fits into the institution’s undergraduate curricula, such as majors and prerequisites. Through a series of course syllabi analyses, we explored the topics covered and their popularity in these courses, the adopted teaching and learning tools, such as software. This paper presents our findings on the current practices of Bayesian education at the undergraduate level. Based on our findings, we provide recommendations for programs that may consider offering Bayesian education to their students.

It’s not mentioned in the abstract, but based on the references “U.S. News (2021a), ‘2021 Best national university rankings’” and “U.S. News (2021b), ‘National liberal arts colleges’” the current state is only about US research universities and liberal arts colleges.

Anyway, I highly recommend checking out their paper as in addition to the survey of the state, they make recommendations for the future:

  1. Expand the access to Bayesian courses
  2. Make Bayesian courses a part of the majors
  3. Balance statistics with computing
  4. Use a variety of assessments

It’s easy for me to agree with these, as I’ve been doing these for years: 1) I’m making my course available for everyone, 2) my course is optional in Data science BSc major (and some other majors), and required in Machine learning / AI MSc major at Aalto University (the course is quite advanced, and we’re missing an intermediate course at BSc level, so that’s why it’s only optional at BSc level), 4) I use a variety of assessments. Especially, I like the subpoints in the recommendation 3:

  1. Introduce simulation-based learning early in the course.
  2. Encourage students to write self-coded MCMC algorithms for relatively simple multiparameter models
  3. If the course puts equal emphasis on computing and modeling, consider adopting one of the popular probabilistic programming languages for Bayesian model estimation through MCMC (e.g., JAGS and Stan).
  4. If the course has a slightly stronger emphasis on modeling over computing, consider introducing one of the wrapper packages for Stan for its simpler posterior summary procedure (e.g., rstanarm and brms).

I have been including all these in my course (except JAGS), so I’m already in the future :D

I’m also happy to see that the most commonly required or recommended book is still 8 year old “Bayesian Data Analysis”.

Dogucu and Hu end their paper with

Last but not least, we invite current and future Bayesian educators to join the undergraduate Bayesian education network, an online community that fosters discussions of undergraduate Bayesian education.

My slides and paper submissions for Prob Prog 2021

Prob Prog 2021 just ended. Prob Prog is the big (250 registered attendees, and as many as 180 actually online at one point) probabilistic programming conference. It’s a very broadly scoped conference.

The online version this year went very smoothly. It ran a different schedule every day to accommodate different time zones. So I wound up missing the Thursday talks other than the posters because of the early start. There was a nice amount of space between sessions to hang out in the break rooms and chat.

Given that there’s no publication for this conference, I thought I’d share my slides here. The talks should go up on YouTube at some point.

Slides: What do we need from a PPL to support Bayesian workflow?

There was a lot of nice discussion around bits of workflow we don’t really discuss in the paper or book: how to manage file names for multiple models, how to share work among distributed teammates, how to put models into production and keep them updated for new data. In my talk, I brought up issues others have to deal with like privacy or intellectual property concerns.

My main focus was on modularity. After talking to a bunch of people after my talk, I still don’t think we have any reasonable methodology as a field to test out components of a probabilistic program that are between the level of a density we can unit test and a full model we can subject to our whole battery of workflow tests. How would we go about just testing a custom GP prior or spatio-temporal model component? There’s not even a way to represent such a module in Stan, which was the motivation for Maria Gorinova‘s work on SlicStan. Ryan Bernstein (a Stan developer and Gelman student) is also working on IDE-like tools that provide a new language for expressing a range of models.

Then Eli Bingham (of Pyro fame) dropped the big question: is there any hope we could use something like these PPLs to develop a scalable, global climate model? Turns out that we don’t even know how they vet the incredibly complicated components of these models. Just the soil carbon models are more complicated than most of the PK/PD models we fit and they’re one of the simplest parts of these models.

I submitted two abstracts this year and then they invited me to do a plenary session and I decided to focus on the first.

Paper submission 1: What do we need from a probabilistic programming language to support Bayesian workflow?

Paper submission 2: Lambdas, tuples, ragged arrays, and complex numbers in Stan

P.S. Andrew: have you considered just choosing another theme at random? It’s hard to imagine it’d be harder to read than this one.

New blog formatting

We needed to update the blog because the old theme was no longer being maintained by WordPress, and we were having security problems and issues with the comment screening. So we replaced it with this new theme which was having some issues. We patched it so now it’s functional, but I’ve been told it isn’t so great on phones.

I just wanted to let you know that we’re working on putting together something that looks a little better, less whitespace, things like that. It’s not as easy as you might think, in particular because we’ve got tons of content that you want to read, also we want to make the commenting experience as easy as possible under the constraints of security and not getting overwhelmed by spam. Thanks for your patience!

Webinar: Kernel Thinning and Stein Thinning

This post is by Eric.

Tomorrow, we will be hosting Lester Mackey from Microsoft Research. You can register here.

Abstract

This talk will introduce two new tools for summarizing a probability distribution more effectively than independent sampling or standard Markov chain Monte Carlo thinning:

  • Given an initial n point summary (for example, from independent sampling or a Markov chain), kernel thinning finds a subset of only square-root n points with comparable worst-case integration error across a reproducing kernel Hilbert space.
  • If the initial summary suffers from biases due to off-target sampling, tempering, or burn-in, Stein thinning simultaneously compresses the summary and improves the accuracy by correcting for these biases.

These tools are especially well-suited for tasks that incur substantial downstream computation costs per summary point like organ and tissue modeling in which each simulation consumes 1000s of CPU hours.

About the speaker

Lester Mackey is a Principal Researcher at Microsoft Research, where he develops machine learning methods, models, and theory for large-scale learning tasks driven by applications from healthcare, climate forecasting, and the social good.  Lester moved to Microsoft from Stanford University, where he was an assistant professor of Statistics and (by courtesy) of Computer Science.  He earned his Ph.D. in Computer Science and MA in Statistics from UC Berkeley and his BSE in Computer Science from Princeton University.  He co-organized the second place team in the Netflix Prize competition for collaborative filtering, won the Prize4Life ALS disease progression prediction challenge, won prizes for temperature and precipitation forecasting in the yearlong real-time Subseasonal Climate Forecast Rodeo, and received best paper and best student paper awards from the ACM Conference on Programming Language Design and Implementation and the International Conference on Machine Learning.

Cross validation and model assessment comparing models with different likelihoods? The answer’s in Aki’s cross validation faq!

Nick Fisch writes:

After reading your paper “Practical Bayesian model evaluation using leave-one-outcross-validation and WAIC”, I am curious as to whether the criteria WAIC or PSIS-LOO can be used to compare models that are fit using different likelihoods? I work in fisheries assessment, where we are frequently fitting highly parameterized nonlinear models to multiple data sources using MCMC (generally termed “integrated fisheries assessments”). If I build two models that solely differ in the likelihood specified for a specific data source (one Dirichlet, the other Multinomial), would WAIC or loo be able to distinguish these or must I use some other method to compare the models (such as goodness of fit, sensitivity, etc). I should note that the posterior distribution will be the unnormalized posterior distribution in these cases.

My response: for discrete data I think you’d just want to work with the log probability of the observed outcome (log p), and it would be fine if the families of models are different. I wasn’t sure what was the best solution with continuous variables, so I forwarded the question to Aki, who wrote:

This question is answered in my [Aki’s] cross validation FAQ:

12 Can cross-validation be used to compare different observation models / response distributions / likelihoods?

First to make the terms more clear, p(y∣θ) as a function of y is an observation model and p(y∣θ) as a function of θ is a likelihood. It is better to ask “Can cross-validation be used to compare different observation models?”

– You can compare models given different discrete observation models and it’s also allowed to have different transformations of y as long as the mapping is bijective (the probabilities will the stay the same).
– You can’t compare densities and probabilities directly. Thus you can’t compare model given continuous and discrete observation models, unless you compute probabilities in intervals from the continuous model (also known as discretising continuous model).
– You can compare models given different continuous observation models, but you have exactly the same y (loo functions in rstanarm and brms check that the hash of y is the same). If y is transformed, then the Jacobian of that transformation needs to be included. There is an example of this in mesquite case study.

It is better to use cross-validation than WAIC as the computational approximation in WAIC fails more easily and it’s more difficult to diagnose when it fails.

P.S. Nick Fisch is a Graduate Research Fellow in Fisheries and Aquatic Sciences at the University of Florida. How cool is that? I’m expecting to hear very soon from Nick Beef at the University of Nebraska.

ProbProg 2021 is Oct 20–22 and registration is open

ProbProb 2021, the probabilistic programming conference, is coming up in two weeks and registration is only US$25. It’s entirely online still this year and runs from 10.30 am Wednesday, 20 October to 2.30 pm Friday 22 October EDT (eastern daylight time, aka Boston time).

Here’s the schedule of talks and posters. On Friday at noon, I’ll be talking about Bayesian workflow and what probabilistic programming languages need in order to support it. I’ll survey a range of PPLs, which I conclude all lack key features required for seamless workflow support. Perhaps surprisingly, I conclude that BUGS is the best of a bad bunch for workflow when dealing with problems that can be expressed in its language and where Gibbs sampling is effective. Most other systems, including Stan, PyMC3, Pyro, and all the ML toolkits, require rewriting models for different uses like simulation-based calibration, posterior predictive checks, and cross-validation. Part III of the Stan User’s guide shows how to code these model variants in Stan. I’ll try to squeeze in some details of how we hope to extend Stan to better support workflow.

Hope to see you there!

posteriordb: a database of Bayesian posterior inference

Mans Magnusson, Aki Vehtari, Paul Buerkner, and others put together this corpus which we and others can use to evaluate Bayesian inference algorithms. They write:

What is posteriordb?

posteriordb is a set of posteriors, i.e. Bayesian statistical models and data sets, reference implementations in probabilistic programming languages, and reference posterior inferences in the form of posterior samples.

Why use posteriordb?

posteriordb is designed to test inference algorithms across a wide range of models and data sets. Applications include testing for accuracy, speed, and scalability. posteriordb can be used to test new algorithms being developed or deployed as part of continuous integration for ongoing regression testing algorithms in probabilistic programming frameworks.

posteriordb also makes it easy for students and instructors to access various pedagogical and real-world examples with precise model definitions, well-curated data sets, and reference posteriors.

posteriordb is framework agnostic and easily accessible from R and Python.

For more details regarding the use cases of posteriordb, see doc/use_cases.md. . . .

Using posteriordb

To simplify the use of posteriordb, there are convenience functions both in Python and in R. To use R, see the posteriordb-r repository, and to use Python, see the posteriordb-python repository.

This is important, and it’s growing:

Contributing

We are happy with any help in adding posteriors, data, and models to the database! See CONTRIBUTING.md for the details on how to contribute.

Great stuff.

Bayesian workflow for disease transmission modeling in Stan!

Léo Grinsztajn, Elizaveta Semenova, Charles Margossian, and Julien Riou write:

This tutorial shows how to build, fit, and criticize disease transmission models in Stan, and should be useful to researchers interested in modeling the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic and other infectious diseases in a Bayesian framework. Bayesian modeling provides a principled way to quantify uncertainty and incorporate both data and prior knowledge into the model estimates. Stan is an expressive probabilistic programming language that abstracts the inference and allows users to focus on the modeling. As a result, Stan code is readable and easily extensible, which makes the modeler’s work more transparent. Furthermore, Stan’s main inference engine, Hamiltonian Monte Carlo sampling, is amiable to diagnostics, which means the user can verify whether the obtained inference is reliable. In this tutorial, we demonstrate how to formulate, fit, and diagnose a compartmental transmission model in Stan, first with a simple susceptible-infected-recovered model, then with a more elaborate transmission model used during the SARS-CoV-2 pandemic. We also cover advanced topics which can further help practitioners fit sophisticated models; notably, how to use simulations to probe the model and priors, and computational techniques to scale-up models based on ordinary differential equations.

I love this workflow stuff!

Simulation-based calibration: Some challenges and directions for future research

Simulation-based calibration is the idea of checking Bayesian computation using the following steps, originally from Samantha Cook’s Ph.D. thesis, later appearing in this article by Cook et al., and then elaborated more recently by Talts et al.:

1. Take one draw of the vector of parameters theta_tilde from the prior distribution, p(theta).

2. Take one draw of the data vector y_tilde from the data model, p(y|theta_tilde).

3. Take a bunch of posterior draws of theta from p(theta|y_tilde). This is the part of the computation that typically needs to be checked. Call these draws theta^1,…,theta^L.

4. For each scalar component of theta or quantity of interest h(theta), compute the quantile of the true value h(theta_tilde) within the distribution of values h(theta^l). For a continuous parameter or summary, this quantile will take on one of the values 0/L, 1/L, …, 1.

If all the computations above are correct, then the result of step 4 should be uniformly distributed over the L+1 possible values. To do simulation-based computation, repeat the above 4 steps N times independently and then check that the distribution of the quantiles in step 4 is approximately uniform.

Connection to simulated-data experimentation

Simulation-based calibration is a special case of the more general activity of simulated-data experimentation. In simulation-based calibration, you’re comparing the assumed true value theta_tilde to the posterior simulations theta. More generally, we can do all sorts of experiments. For example, simulated-data experimentation is useful when designing a study: you make some assumptions, simulate some fake data, and then see how accurate the parameter estimates are. The goal here is not necessarily to check that the posterior inference is working well—indeed, there’s no requirement that the inference be Bayesian at all—but rather just to see what might happen under some scenarios. Here’s a discussion from earlier this year on why this can be so useful, and here’s a simple example.

Another application of simulated-data experimentation is to assess the bias in some existing estimation method. For example, suppose we’re concerned about selection bias in an experiment. We can simulate fake data under some assumed true parameter values and some assumed selection model, then do a simple uncorrected estimate and compare this inference with the assumed parameter values. This is similar to simulation-based calibration but not quite the same thing, because we’re not trying to assess whether the fit is calibrated; rather, we’re using the simulation to assess the bias of the estimate (given some assumptions).

A similar idea arises when assessing the bias of a computational algorithm. For example, variational inference can underestimate uncertainty of local parameters in a hierarchical model. One way to get a sense of this inferential bias is to simulate data from the assumed model, then fit using variational inference, check coverage of the resulting 50% intervals (for example), and loop this to see how often these intervals contain the true value. Or this could be done in other ways; the point is that this is similar to, but not the same as, straight simulation based calibration as outlined at the top of this post.

Perhaps most important is the use of simulated-data experimentation in modeling workflow. As Martin Modrák put it: simulation-based calibration was presented originally as validating an algorithm given that you trust your model, but in practice we are typically interested in validating a model implementation, given that you trust your sampling algorithm. Or maybe we want to be checking both the model and the computation.

Problems with simulation-based calibration

OK, now back to the specific idea of using simulated data to check calibration of parameter inferences. There are some problems with the method of Cook et al. (2006) and Talts et al. (2021):

– The method can be computationally expensive. If you loop the above steps 1-4 many times, then you’ll need to do posterior inference—step 3—lots of times. You might not want to occupy your cluster with N=1000 parallel simulations.

– The method is a hypothesis test: it’s checking whether the simulations are exactly correct (to within the accuracy of the test). But our simulations are almost never exact. HMC/Nuts is great, but it’s still only approximate. So there’s an awkwardness to the setup in that we’re testing a hypothesis we don’t expect to be true, and we’re assuming this will work because of some slop in the uncertainty based on a finite number of replications.

– Sometimes we want to test an approximate method that’s not even simulation consistent, something like ADVI or Pathfinder. Simulation-based calibration still seems like a good general idea, but it won’t quite work as stated because we’re not expecting perfect coverage.

– In practice, we don’t necessarily care about the calibration of a procedure everywhere. In a particular application, we typically care about calibration in the neighborhood of the data. If the model being fit has a weak prior distribution, simulation-based calibration drawing from the prior (as in steps 1-4 above) might miss the areas of parameter and data space that we really care about.

– Because of the above concerns, we commonly perform simulation-based calibration using a simpler approach, using a single fixed value theta_tilde set to a reasonable value, where “reasonable” is defined based on our understanding of how the model will be used. This has the virtue of testing the computation where we care about it, but now that we’re no longer averaging over the prior, we can’t make any general statements about calibration, even if the posterior simulations are perfect.

Possible solutions

So, what to do? I’m not sure. But I have a few thoughts.

First, in practice we can often learn a lot from just a single simulated dataset, that is, N=1. Modeling or computational problems are often so severe that they show up from just one simulation, for example we’ll get parameter estimates blowing up or getting stuck somewhere far away from where they should be. In my workflow, I’m often doing this sort of informal experimentation where I set the parameter vector to some reasonable value, simulate fake data, fit the model, and check that the parameter inferences are in the ballpark. It would be good to semi-automate this process so that it can be easy to do in the cmdstanR and cmdstanPy environments, as well as in rstanarm and brms.

Second, we’re often fitting hierarchical models, and then we can follow a hybrid approach. For a hierarchical model with hyperparameters phi and local parameters alpha, we can set phi_tilde to a reasonable value and draw alpha_tilde from its distribution, p(alpha|phi_tilde). If the number of groups is large, there can be enough internal replication in alpha that much can be learned about coverage from just a single simulated dataset—although we have to recognize that this can all depend strongly on phi, so it might make sense to repeat this for a few values of phi.

Third, when checking the computation of a model fit to a particular dataset, I have the following idea. Start by fitting the model, which gives you posterior simulations of theta. Then, for each of several random draws of theta from the posterior, simulate a new dataset y_rep, re-fit the model to this y_rep, and check the coverage of these inferences with respect to the posterior draw of theta. This has the form of simulation-based calibration except that we’re drawing theta from the posterior, not the prior. This makes a lot of sense to me—after all, it’s the posterior distribution that I care about—but we’re no longer simply drawing from the joint distribution, so we can’t expect the same coverage. On the other hand, if we had really crappy coverage, that would be a problem, right? I’m thinking we should try to understand this in the context of some simple examples.

Fourth, I’d like any version of simulation-based calibration to be set up as a measure of miscalibration rather than as a hypothesis test. My model here is R-hat, which is a potential scale reduction factor and which is never expected to be exactly 1.00000.

Fifth, this last idea connects to the use of simulation-based calibrations for explicitly approximate computations such as ADVI or Pathfinder, where the goal should not be to check if the method are calibrated but rather to measure the extent of the miscalibration. One measure that we could try is ((computed posterior mean) – (true parameter value)) / (computed posterior sd). Or maybe that’s not quite right, I’m not sure. The point is that we want a measure of how bad is the fit, not a yes/no hypothesis test.

To summarize, I see three clear research directions:

(a) Adjusting for the miscalibration that will arise when we’re no longer averaging over the prior (because the number of replication draws N is not large or because we want to focus on the posterior or some other area of interest of parameter space).

(b) Coming up with an interpretable measure of miscalibration rather than framing as a test of the hypothesis of perfect calibration.

(c) Incorporating this into workflow so that it’s convenient and not computationally expensive.