
This post is by Ryan Bernstein. I’m a PhD candidate at Columbia with Andrew and Jeannette Wing.
Right now, each probabilistic program we write represents a single model. As we tackle a modeling task, we try one model after another, and old programs pile up in our “models” folder or git history. What if instead of describing one model at a time, we focused on drawing a map of our model space? What if a program represented not a single probabilistic model, but a whole network of models and their rich interconnections, a region of model space that grew as we explored?
If we drew maps of model space as we experimented, it would be easier for others to follow our work. We could annotate regions of our map, like “These models rely on a big assumption,” or “Here, there be dragons convergence issues.” We could draw our path of exploration, from our first experiment to our final model and all of the false turns along the way. We could document, “This way was too slow”, or “This way gave strange prior predictions”, or “I didn’t have time to explore this way, maybe you should.” Issues like the Garden of Forking Paths and P-hacking make it clear that how and why a model was grown is necessary context for assessing its trustworthiness; model space maps would act as living documentation of that context.
Having a map would also help us navigate through model space as we explore. We could orient ourselves in model space by contrasting our current model to its immediate neighbors: we could say things like, in that direction, a posterior mean trends upwards, or calibration issues appear, or predictive accuracy improves.
Once we sketch out a region of model space, we could deploy robot servants to help us understand it. They could scout ahead for models that are accurate, simple, or robust. They could also check that our conclusions don’t depend too much on any one turn we made along our route, ala the Garden of Forking Paths. Maps enable automation, and automation can save us time and catch our errors.
So what kind of program can represent a growing network of models while remaining concise and readable?
I’ve developed a simple metaprogramming feature called ‘swappable modules’ that can extend languages like Stan to do just that. Each ‘modular Stan program’ can compile down to a whole set of normal, ‘concrete’ Stan programs. For now we’ll treat model space as a discrete network of models and we’ll assume that the data is consistent across the network.
First I’ll introduce ‘modular Stan’ alongside a trivial example, and then I’ll give two more interesting case studies. You can play along at home with my prototype compiler and interactive visualizations at http://ryanbe.me/modular-stan.html. You can load each example with the buttons on the lower left, or you can write your own modular program. Expect plenty of bugs!
A way to program maps
Suppose we have some data x that we believe to be normally distributed. Let’s write a modular Stan program to represent a region of the model space of x.
Modular Stan programs start with a base, and then they are built up by adding one piece, or module, at a time. The base is part of every model we produce, so it should only contain things we’re sure won’t change: the shape of the data, the key inferred quantities, and a basic outline of the model. Here is the base for our `x` modular program:
data {
int N;
vector[N] x;
}
model {
x ~ normal(Mean(), Stddev());
}
So far this is like a Stan program except that it has two “holes” in it, Mean and Stddev, that represent as-yet undefined behavior.
Next, we write a module to fill in each hole. A module can contain arbitrary code and declare new parameters like a miniature Stan program, and it can also return a value.
Here’s a module to fill in the Mean hole:
module "standard" Mean() {
return 0;
}
This reads as: define a module called "standard" that fills the Mean hole with 0.
Here’s a module to fill Stddev:
module "standard" Stddev() {
return 1;
}
Our modular Stan program up to this point is equivalent to a single concrete Stan program:
data {
int N;
vector[N] x;
}
model {
x ~ normal(0, 1);
}
This isn’t very interesting, so let’s explore the model space a little bit more. We can explore by simply adding more modules.
Let’s try modeling the mean of x as a random variable. We add a module called "normal" to fill the Mean hole with a new parameter that has a prior:
module "normal" Mean() {
parameters {
real mu;
}
mu ~ normal(0, 1);
return mu;
}
If we use "normal" to fill Mean and "standard" to fill Stddev, the concrete Stan program will be:
data {
int N;
vector[N] x;
}
parameters {
real mu;
}
model {
mu ~ normal(0, 1);
x ~ normal(mu, 1);
}
Since we now have two ways to fill the holes, our modular Stan program is equivalent to a set of two concrete Stan programs.
Now suppose we also want to model the standard deviation, but we aren’t sure whether or not to use an informative prior (sorry for the contrived example). We can leave that decision as another hole:
module "lognormal" Stddev() {
parameters {
real<lower=0> sigma;
}
sigma ~ lognormal(0, StddevInformative());
return sigma;
}
module "yes" StddevInformative() {
return 1;
}
module "no" StddevInformative() {
return 100;
}
StddevInformative is a hole nested inside the module "lognormal". In this way, modular programs are hierarchical as well as branching. We can view modular programs as trees:

Our modular program now represents a set of six concrete models, one for each way the set of holes can be filled: two ways for Mean ("standard" and "normal") times three ways for Stddev ("standard", "lognormal" with StddevInformative: "yes", and "lognormal" with StddevInformative: "no".).
How do we build a network from this set of programs? There’s a natural definition: we draw an edge between two models if they differ by only one module. Each neighbor is then one ‘decision’ away. Here is the network of models for this trivial example:

A representation of the network of models for the “x” example. Edges are annotated with “hole” that differs between the two models. The prototype website has an interactive version of this visualization.
‘Golf’ Example: A travelogue through model space
I’ve rewritten Andrew’s golf case study as a modular program to show what it would look like to write a case study as a “travelogue” of development through model space.
In the case study, the modeling task is to understand a golfer’s chance of sinking a shot given their distance from the hole. Like before, we’ll start by writing an outline of the problem that will be constant for the whole model space:
data {
int J; // Number of distances
vector[J] x; // Distances
int n[J]; // Number of shots at each distance
int y[J]; // Number of successful shots at each distance
}
model {
y ~ NSuccesses(n, PSuccess(x));
}
We start with two holes: NSuccesses is the distribution of the number of successful shots given the number of attempts n and the probability of success; PSuccess is the probability of success given the distance from the hole x.
A good way to count successes is the Binomial distribution, so let’s add a module called “binomial” to fill NSuccesses:
module "binomial" NSuccesses(y | n, p) {
y ~ binomial(n, p);
}
Now we need a module for PSuccess to map the distance of a shot x to the probability of success. We start with a logit function:
module "logistic" PSuccess(x) {
parameters {
real a;
real b;
}
return logit(a + b*x);
}
We’ve now defined logistic regression, the first model from the case study:
data {
int J; // Number of distances
vector[J] x; // Distances
int n[J]; // Number of shots at each distance
int y[J]; // Number of successful shots at each distance
}
parameters {
real a;
real b;
}
model {
y ~ binomial(n, logit(a + b*x));
}
In this fashion, we can add a new module for each step of Andrew’s logic in the case study. See the full modular program here: http://ryanbe.me/modular-stan.html?example=golf.
Let’s use my prototype graphical interface to visualize the resulting modular program. Here are the parts of the interface:

1. Modular program editor with compile and load buttons. 2. Interactive module tree. 3. Interactive network of models. 4. Selected concrete Stan program, if any. 5. Label and documentation editor for the selected concrete program, if any.
We can select modules to narrow down the network of models to a single concrete model, as shown in the figure at the top of this post.
And finally we can see the case study as a “travelogue” that documents a path through the network of models:

‘Birthday’ Example: Exploring model space with “scouting robots”
To demonstrate automation on the network of models, I’ve rewritten the Birthday case study from as a modular program. I collaborated with Hyunji Moon to build this demo.
The goal of the birthday case study is to understand trends in the number of births over time. The authors use a Gaussian process time series model, and they experiment with including a few different types of trends (e.g. day-of-week, day-of-year, holidays, seasonal) and other modeling decisions (e.g. the weighting structure of days-of-week scores and the choice of hierarchical priors on day-of-year scores). When each trend inclusion or other modeling decision is represented as an independent choice of module, the modular program represents 120 models.
You can load the example with the full source code at http://ryanbe.me/modular-stan.html?example=birthday.
Here is the network of modules:

Suppose we want to find a model that does a reasonable job of fitting our time-series birthday data. That’s a lot of possibilities to explore manually! So let’s enlist a friendly robot to search the network for promising models.
We send off our robot with instructions for a greedy search:
“Start here.
Score the neighbors of every model you visit.
Move to the highest scoring model you’ve seen so far.
If you’re already at the highest scoring model, stop searching.”
This is like discrete gradient descent. We equip our robot with a scoring method: Expencted Log-Predictive Density (ELPD) is a reasonable choice that approximates goodness-of-fit.
Here is the data our robot collects:

The red annotations show the ELPD scores for the assessed models. The robot traveled along the arrow path from [START] to [GOAL].
In this case the robot ended its search at the case study’s final model. This optimization amounts to simple symbolic regression on probabilistic programs.
While greedy maximization of ELPD is naive and we shouldn’t blindly trust it to give us our final model, we can at least use it to find promising neighborhoods. I hope this serves as a proof of concept to open the door for other, smarter network-wide algorithms.
Future work
I hope that model space “maps” will give us new “navigation” tools for development, enable automatic search and validation, and help others to understand and reuse our work.
There’s a ton of opportunity for development in this area, such as:
- What other algorithms can run on a network or neighborhood of models? What other scores would be useful to optimize?
- How should we decorate network edges to support model navigation? What about for special cases like causal inference?
- Can we validate modules themselves?
- Can we build libraries of modules?
- Can we learn to synthesize modules?
- Can we automatically ensemble or stack models in a network?
- Can we include automated model transformations in the network in addition to module-swapping transformations?
- If we’re fitting models in a network one after another, can we use their relatedness to sample more efficiently?
- Are there extensions of the module-swapping system or other metaprogramming features that would be worth the added complexity?
If you have any questions or ideas, please leave a comment or reach me directly at ryan[dot]bernstein[at]columbia.edu!
Footnotes: