“The most discriminatory federal judges” update

Christian Smith writes:

Thanks for your commentary on the white paper about judges, which I wrote earlier this year with some colleagues [Nicholas Goldrosen, Maria-Veronica Ciocanel, Rebecca Santorella, Chad Topaz, and Shilad Sen]. I just wanted to let you know that we substantially altered that paper in light of reasonable feedback, replacing the original OSF entry with a much humbler one. This Twitter thread explains further. If you’re up for clarifying in your blog post that the quoted excerpt is no longer in the white paper, I would appreciate that a lot.

Here’s the new version, which has the following note on page 1:

A previous version of this work included estimates on individually identified judges. Thanks to helpful feedback, we no longer place enough credence in judge-specific estimates to make sufficiently confident statements on any individual judge. We encourage others not to rely upon results from earlier versions of this work.

As I wrote in my earlier post, I don’t have anything to say on the substance of this work, but I’ll again share my methodological comments, generic advice of the sort I’ve given many times before, involving workflow, or the trail of breadcrumbs:

What I want to see is graphs of the data and fitted model. For each judge, make a scatterplot with a dot for each defendant they sentenced. Y-axis is the length of sentence they gave, x-axis is the predicted length based on the regression model excluding judge-specific factors. Use four colors of dots for white, black, hispanic, and other defendants, and then also plot the fitted line. You can make each of these graphs pretty small and still see the details, which allows a single display showing lots of judges. Order them in decreasing order of estimated sentencing bias.

What should we expect in comparing human causal inference to Bayesian models?

This is Jessica. In a recent visualization paper with Alex Kale and Yifan Wu, we looked at how well people could draw causal inferences from graphs. This seemed interesting since talking about doing exploratory visual analysis, we often casually allude to how people use plots to investigate possible causal relationships. Visual analysis software makes it easy to flip through different filters and add and remove variables to plots, which might seem well intended to support assessing causation. But, there isn’t much research assessing how well graphical formats, either static or interactive, help people assess the compatibility of data and various causal models. So, we set out to see how well visualizations of contingency table data like one could easily generate in a tool help people assess cause and effect. 

We did a few online experiments that involved showing people data and asking them to allocate belief across a set of possible data generating processes, which we summarized for them as DAGs. For example, in a first experiment on a relatively straightforward causal attribution task, we plotted contingency data like the bar charts below, and asked them to allocate probability across the two DAGs to the left. Specifically, we asked  “How much do you believe in each of the causal explanations described below?” and told them to imagine having 100 votes to allocate across the two possible explanations. They were told to imagine that these explanations are the only plausible ones, and that there is no reason to believe a priori that one explanation is more likely than the other. 

We evaluated people’s responses against posterior probabilities calculated using what has been called causal support, basically posterior log odds of one data generating process (or possibly a set of them that share some structure) over a set of alternative data generating processes, given a dataset.  

We also did a second experiment where the task was harder: there were four possible DAGs, same nodes as those above, but they had to judge potential confounding of the treatment effect from a gene. 


We modeled results from both experiments using a linear in log odds model, estimating both sensitivity to changes in the ground truth level of causal support (i.e., underlying log likelihood ratio), and bias in perceived causal support.

Here’s a plot of sensitivity (LLO slopes, where 1 means the ideal, one-to-one relationship between users’ responses and causal support) for the first experiment. People were not very sensitive with any of the visualizations we tested.

 

Some of the visualization conditions we tested were interactive (e.g., interactive bar charts where clicking on a bar in one chart filtered to only those observations in the other chart, or collapsible table-style bar charts or icon arrays). But interacting alone didn’t necessarily help people with the probability judgment, since some visualizations, like the collapsible ones, were less likely to lead to interactions to show counterfactuals.

In the conditions where the sensitivity was greatest, we saw a tendency among users to be more sensitive to causal support at negative values of delta p, a parameter we varied along with sample size to generate the stimuli. Delta p captures the difference in the proportion of people with disease in each data set depending on whether they received treatment, and when negative evidence is toward disconfirming the treatment effect. We also saw a strong tendency to discount sample size, which we’ve seen in other work comparing people’s inferences from graphs to Bayesian model predictions. It could be something cognitive like non-belief in the law of large numbers, or something related to logarithmic perception. We also saw a fair amount of bias in probability allocations when we looked at the average response when there is no signal (causal support is 0), again across conditions. 

For the confounding detection task, we asked people to describe how they used the graphs, as we often do in these kinds of experiments to try to better understand the results. While many (about 45% of 519) gave uninformative responses, of the remainder, about 80% described using what should have been an adequate strategy. So, it doesn’t seem we can write off the results as driven by people not knowing what they were supposed to be looking for, though it should be noted that these experiments were run on Mechanical Turk, not an analyst population, so definitely worth following up with a study on analysts. 

At any rate, these results were interesting to us because they suggest that visualizations don’t necessarily make it easy for people to assess causal explanations, despite how much we might like to tout the value of exploratory visual analysis and interactivity for building intuitions about causal relationships. If we see something similar with more experienced visual analysts, we might consider building in causal modeling into visual analysis tools, to help analysts check their intuitions about causality against a statistical estimate. This is along the lines of the kind of tools Andrew and I suggest here, and something we are currently thinking about in my lab. But, this work suggests to me there’s still a number of unanswered questions around how much an approach like causal support should align with human judgments about possible generating processes. 

For example, the fact that people seem more sensitive to disconfirming evidence wasn’t something we had expected. It reminds me a bit of the idea in an older paper that Andrew blogged a few months ago. The paper distinguishes trying to identify where there may be cause and effect using covariance or probabilistic contrasts, where one relies on and seeks out direct information about covaration between elements of an event, versus hypothesizing some possible mechanisms and looking for information about whether the target event satisfies the necessary preconditions. The paper does some experiments to look at whether people seem to seek information in a way that supports the latter (finding that they do). They describe the difference in approaches as a difference between empirical generalization driving the analysis, versus theoretical constructs or principles that hold as long as their preconditions are met. The way we presented a set of possible DAGs, and the causal support model in general, aligns with the causal mechanism approach. It’s possible people in our experiment sought disconfirming evidence as the easiest way to test a hypothesis that some process generated the data. But which form do existing visual analysis tools tend to better support? Maybe there are ways to make tests for confirming/disconfirming easier for people, through different visualizations or more support for representing models in tools. 

One tricky part is that, at least when using causal support, we have to assume that people reason about a finite set of mechanisms they believe are the plausible ones. I wonder how natural this is. If I imagine about being shown some data on co-varying factors, like shark attacks and ice cream sales, and asked to think of possible explanatory variables, I’m not sure how long I’d have to brainstorm before I’d feel confident concluding that I had throughly explored the set of possible explanations. Would my probability allocations across those ever look like I had assumed the set was complete? Even if they did, how easy is it for a person to translate their internal degrees of belief in each model to a set of probabilities that sum to 1? Maybe there’s a way to benchmark that amount of noise that we should expect in experiments like this simply from the difficulty of this translation, or to improve the elicitation through some kind of incentive structure. Suggestions or related work welcome! 

“Accounting Theory as a Bayesian Discipline”

David Johnstone writes:

The Bayesian logic of probability, evidence and decision is the presumed rule of reasoning in analytical models of accounting disclosure. Any rational explication of the decades-old accounting notions of “information content”, “value relevance”, “decision useful”, and possibly conservatism, is inevitably Bayesian. By raising some of the probability principles, paradoxes and surprises in Bayesian theory, intuition in accounting theory about information, and its value, can be tested and enhanced. Of all the branches of the social sciences, accounting information theory begs Bayesian insights. This monograph lays out the main logical constructs and principles of Bayesianism, and relates them to important contributions in the theoretical accounting literature. The approach taken is essentially “old-fashioned” normative statistics, building on the expositions of Demski, Ijiri, Feltham and other early accounting theorists who brought Bayesian theory to accounting theory. Some history of this nexus, and the role of business schools in the development of Bayesian statistics in the 1950–1970s, is described. Later developments in accounting, especially noisy rational expectations models under which the information reported by firms is endogenous, rather than unaffected or “drawn from nature”, make the task of Bayesian inference more difficult yet no different in principle. The information user must still revise beliefs based on what is reported. The extra complexity is that users must allow for the firm’s perceived disclosure motives and other relevant background knowledge in their Bayesian models. A known strength of Bayesian modelling is that subjective considerations are admitted and formally incorporated. Allowances for perceived self-interest or biased reporting, along with any other apparent signal defects or “information uncertainty”, are part and parcel of Bayesian information theory.

I don’t know anything about accounting, except that I used to have a convenient approach to doing my taxes where every one in awhile I’d send the IRS a check, based on my guess of what I owed them. They’d keep track of things so that each year my tax bill would account for my payments. Once I didn’t send them enough and they fined me something like $50.

In any case, it makes sense to me that uncertainty should be considered fundamental to accounting. I’ve always thought of accounting as arithmetic, but real-world accounting is full of uncertainty and combination of different sources of information, so, yeah, Bayes.

Did this study really identify “the most discriminatory federal judges”?

Christian Smith, Nicholas Goldrosen, Maria-Veronica Ciocanel, Rebecca Santorella, Chad Topaz, and Shilad Sen write:

In the aggregate, racial inequality in criminal sentencing is an empirically well- established social problem. Yet, data limitations have made it impossible to determine and name the most racially discriminatory federal judges. The authors use a new, large-scale database to determine and name the observed federal judges who impose the harshest sentence length penalties on Black and Hispanic defendants. . . . While acknowledging limitations of unobserved cases and variables, the authors find evidence that several judges give Black and Hispanic defendants double the sentences they give observationally equivalent white defendants.

They fit a multilevel model! That makes me happy.

I heard about this from Jeff Lax and David Hogg, who pointed me to a post on twitter by law professor Jonah Gelbach disputing the above claims. Gelbach writes, “the data are incomplete . . . a match rate of less than 50% . . . the match rate varies substantially across districts . . . endogeneity concerns . . . The dependent variable is specified as the log of 1 plus the sentence length . . .” The criticisms are a mixed bag—for example, at one point Gelbach writes, “One concern is that judges w/few defendants will have higher-variance random slope estimates, raising the possibility that the results would be an artifact of estimating lots of effects & then picking largest values, which are more likely to happen w/judges having few cases.”—but he’s actually getting things backward here, for reasons discussed by Phil and me in our 1999 paper, All maps of parameter estimates are misleading.

At this point, I think Jeff was hoping I’d adjudicate and share my own conclusion. But I don’t want to! For two reasons.

1. It takes work. Some things don’t take much work at all. Reading Alexey Guzey’s criticisms of Why We Sleep and then reading the relevant parts of Why We Sleep—it’s pretty clear what’s going on. Reading the overblown claims of John Gottman followed by the breakdown by journalist Laurie Abraham, again, this wasn’t a hard case to judge (although it seems that Abraham has made her own mistakes). Similarly with beauty-and-sex-ratios, ages-ending-in-9, and various other bits of junk science—serious flaws were immediately apparent to me.

2. The “send it to Andy” approach to judging tough statistics questions doesn’t scale.

So instead I’m going to give some generic advice of the sort I’ve given many times before, involving workflow, or the trail of breadcrumbs. What I want to see is graphs of the data and fitted model. For each judge, make a scatterplot with a dot for each defendant they sentenced. Y-axis is the length of sentence they gave, x-axis is the predicted length based on the regression model excluding judge-specific factors. Use four colors of dots for white, black, hispanic, and other defendants, and then also plot the fitted line. You can make each of these graphs pretty small and still see the details, which allows a single display showing lots of judges. Order them in decreasing order of estimated sentencing bias.

This won’t answer all our questions, but it’s a start. With these graphs in hand, you’ll be able to more carefully go through the different concerns with the study.

Also, it’s kinda wack that in their Tables 4 and 5, which cover individual judges, they just give point estimates and no uncertainties. What’s the point of fitting a big-ass model and then not presenting uncertainties??

Separate from all of this is the leap from a statistical pattern to actual discrimination (whatever that means, exactly). I’m not getting into that here.

It seems that Smith et al. are making two claims: first a general statement that blacks and hispanics are given slightly longer sentences than whites and others, on average; second a particular claim about the judges on the top of their list. I’ll just say this: if their general claim of aggregate bias is correct, then of course there will be variation, with some judges more biased than others. As everyone here recognizes, bias (statistical or otherwise) can come from some combination of the judge, the cases he or she sees, and the judge’s institutional setting.

P.S. Update here.

Use stacking rather than Bayesian model averaging.

Carlos Parada writes:

If I do a spike-and-slab regression in Stan (or Turing.jl, I guess; I don’t believe Stan allows for spike-and-slab priors), will that give equivalent results to doing Bayesian Model Averaging of two models, one of which includes a predictor variable and one of which doesn’t? (Assuming equivalent priors.) Mostly I’m somewhat concerned I picked the wrong side after I learned that BMA isn’t asymptotically efficient. Or, less jokingly, I’ve learned to expect that any time a Bayesian procedure does worse than a frequentist procedure, it’s because you’re using the wrong Bayesian procedure. I’m not sure what the alternative Bayesian procedure is. (Leave-one-out cross-validation doesn’t count!)

I know that BMA will successfully maximize the log-probability assigned to the correct model, assuming the model is in the candidate set. I guess there’s no reason this has to produce equivalent outcomes to the ones I’d get by minimizing the expected K-L divergence. But the idea of averaging models using anything other than Bayesian updating still rubs me the wrong way. As the kind of person who came to Bayes for mostly philosophical reasons, it feels ugly — what happened to Savage’s Dutch Book arguments, or Cox’s theorem?

I have two responses. First, if you want to select a subset of predictors, I recommend the horseshoe model. Second, I recommend stacking rather than Bayesian model averaging; see this paper. Regarding Parada’s question: realistically, if you’re thinking about model averaging at all, you’re going to be in the M-open setting in which you don’t think that any of the candidate models is a good description of the underlying process. As we discuss in the stacking paper, so-called BMA is designed for the M-closed setting in which one of the candidate models is correct, and it doesn’t always work so well in M-open.

Stan for the climate!

Thom Laepple writes:

I would like to draw your attention to the opening of two senior scientist and two postdoc positions in the Polar Terrestrial Environmental Systems section to complement the Earth System Diagnostics Team at the AWI (http://www.awi.de/) research centre in Potsdam, Germany
We are offering
One position for a statistician (with tenure track option) to optimally interpret complex and small sample size environmental datasets (such as paleoclimate data)
One position for a senior scientist on theoretical and stochastic modelling of Earth System processes (climate & paleoclimate archives)
Further, we are offering two two year PostDoc positions related to the ERC Starting Grant SPACE (Space time structure of climate change) of T. Laepple
Postdoc in ice-core research / statistical data analysis (m/f/d)
Ideally, the candidate has experience with time-series analysis in the time and spectral domain and ice-core data.
Postdoc in tree-ring research / tree ring data analysis
Ideally, the candidate has experience in the analysis of large-scale tree ring networks

Here’s some description from the above links:

You will develop and apply statistical tools to optimally interpret complex and small sample size environmental datasets. One example is the reconstruction of climate fields from sparse paleoclimate proxy measurements from ice-cores and sediment cores. You will further teach and consult researchers at AWI on statistical analysis. Ideally, you will successfully apply for a third-party funded junior research group to further boost the topic of climate/environmental data [email protected]

You will:
– Develop and apply Bayesian calibrations of environmental proxies to translate raw proxy ‘sensor’ data into physical/biological quantities
– Develop and apply field reconstructions from sparse and time uncertain data that allows to integrate proxy ‘sensor’ data and climate fields.
– Implement prototypes of digital tools such as proxy forward models / artificial core generators that can be integrated in O2A
– Design optimal sampling and measurement procedures.

Looks interesting.

Webinar: Predicting future forest tree communities and winegrowing regions with Stan

This post is by Eric.

On Friday, Elizabeth Wolkovich from the University of British Columbia is stopping by to talk to us about her work. You can register here.

Abstract

Climate change is having large impacts on natural and agricultural systems around the globe. Mitigating the worst consequences requires models that mechanistically predict changes. Towards that goal, the Temporal Ecology Lab works on models to better predict the most reported biological impact – shifts in phenology, the timing of recurring life history events such as leafout, and flowering. Here I review three major areas of research where Bayesian inference has been critical to my lab’s insights and advances: declining plant sensitivity to warming temperatures over time and space,  mismatches between critical species interactions (for example, plants and pollinators), and shifting winegrowing regions with warming.

About the speaker

Elizabeth Wolkovich is an Associate Professor and Canada Research Chair at the University of British Columbia where she runs the Temporal Ecology Lab. Her research focuses on understanding how climate change shapes plants and plant communities, with a focus on shifts in the timing of seasonal development. She is particularly interested in how climate change will affect different winegrape varieties, and how shifting varieties may help growers adapt to warming.

Another example to trick Bayesian inference

This post is by Yuling, not Andrew.

We have been talking about how Bayesian inference can be flawed. Particularly, we have argued that discrete model comparison and model averaging using marginal likelihood can often go wrong, unless you have a strong assumption on the model being correct, except models are never correct. The contrast between “discrete Bayesian model comparison kinda does not work” and “Bayesian inference is the only coherent inference paradigm” bothers me. Indeed, similar counterexamples that trick discrete Bayesian model comparison can also trick discrete/continuous Bayesian inference.

Consider a toy example. We are making inferences on the location parameter in a normal model y~ normal (mu, 1) with one observation y=0. We have a restricted parameter space that only contains four points: mu belongs to {-1, 0, 1, 3}. We pose a uniform prior on mu. The inference on mu is ok: the posterior probabilities of these four points are (.27, .45, .27, .005). It is not perfect but we can interpret it as a discrete approximation to the more desired posterior in the continuous model. So far so good.

But now I want to compromise my inference. So I expand the parameter space: mu belongs to {-1, 0, 1, 3.0000, 3.0001, 3.0002, … 3.1000}; some 1000 points are inserted between 3 and 3.1. Now I still use a flat prior and make the inference. The inference is “wrong”: the posterior of mu is concentrated at 3. Indeed, the Bayesian estimate is much worse than the MLE, mu=0.

You may say it is the fault of the non-informative prior. But a naive informative prior, such as target += normal_lpdf (mu | 0, 1) still does not solve the problem. The prior needs to adjust for both the height and the volume, or what Mackay would call the Occam factor.

You may say it is the fault of the discrete parameter space, and we cannot fit this discrete model in Stan anyway. But no, similar problems can exist in continuous models. Consider the model y=0 ~ norma (mu, sigma) with a slightly constrained parameter space: |mu|<10, sigma < |mu|+1, and a flat prior on this constrained parameter space. Then the posterior inference of mu peaks at 10 or -10, which is again worse than the MLE, mu=0. Further, when changing sigma < |mu|+1 to sigma < |mu|^2+1, or to sigma < exp(|mu|)+1, the posterior will change greatly even though all priors amount to some effectively flat prior around the parameter values that we care about. Adding an extra informative prior regularization target += normal_lpdf (mu | 0, 1) solves part of the problem, but the bias from this artificially designed parameter space still carries.

The point is that the parameter space and its measure are silent subjective decisions designed by the modeler. We typically work with the counting measure on discrete space or the euclidean space with the boreal measure by default, but such assumption is context-dependent, and may be potentially falsified in a larger workflow—The Bayes rule won’t do it for you automatically.

P.S. These two examples are both artificial and is easy to fix. But the general pattern can exist in more practical examples. As Andrew pointed out in the comment, we can model the degree of freedom in a student-t likelihood as a parameter, or we may reparameterize the model into (nu, sigma*nu/(nu-2)) such that the df becomes a transformed parameter. Similarly, when modeling covariance matrix, we can model each element of the covariance matrix, model the matrix square root of it, the correlation times the diagonal, the matrix square root of the correlation times the diagonal, or the Cholesky decomposition of the correlation… Reparameterization changes the default prior and changes the “distance” between parameter values—But there are always other potential parameterizations even if we do not perform them.

Andrew pointed out two future directions. One is that we can rescue discrete model companions by designing better priors on models: to place smaller priors on similar models. I think the “dilution prior” has done that. It is related to “understanding prior in the likelihood”, and also related to the Jefferys prior that keeps the reparameterization invariant, although the latter is not always desired on other occasions.

I should also make it clear that the examples I have here can be categorized into “model misspecification”. Well, they are correct models in the classical sense: one parameter is the true value, such that with infinitely many data, everything works fine. But these models are also clearly poorly designed and a standard posterior predictive check (PPC) will manifest the poor fit.

Interpreting apparently sparse fitted models

Hannes Margraf writes:

I would like your opinion on an emerging practice in machine learning for materials science.

The idea is to find empirical relationships between a complex material property (say the critical temperature of a superconductor) and simple descriptors of the constituent elements (say atomic radii and electronegativities). Typically, a longish list of these ‘primary’ descriptors is collected. Subsequently a much larger number of ‘derived’ features are generated, by combining the primary ones and some non-linear functions. So primary descriptors A,B can be combined to yield exp(A)/B and many other combinations. Finally, lasso or similar techniques are used to find a compact linear regression model for the target property (a*exp(A)/B+b*sin(C) or whatnot).

The main application of this approach is to quite small datasets (e.g. 10-50 datapoints). I’m kind of unsure what to think of this. I would personally just use some type of regularized non-linear regression with the primary features here (e.g. GPR). Supposedly, the lasso approach is more interpretable though because you can see what features get selected (i.e. how the non-linearity is introduced). But it also feels very garden-of-forking-paths-like to me.

I know that you’ve talked positively about lasso before, so I wonder what your take on this is.

It’s hard for me to answer this with any sophistication, as it’s been a long long time since I’ve worked in materials science—my most recent publication in that area appeared 34 years ago—so I’ll stick to generalities. First, lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model. So I’d prefer to think of lasso, horseshoe, etc. as regularizers, not as devices for selecting or finding or discovering a sparse model. To put it another way, I don’t take the shrink-all-the-way-to-zero thing seriously. Rather, I interpret the fitted model as an approximation to a fit with more continuous partial pooling.

Eliciting expert knowledge in applied machine learning

This is Jessica. I was going to blog this about elicitation a few days ago, and then before I got to publishing it Aki brought up elicitation of priors for Bayesian analysis. Elicitation is a topic I started thinking about a few years ago with Yea Seul Kim, where we were focusing mostly on graphical elicitation of prior beliefs for visualization interaction or analysis. The new paper by Aki and others is a great summary of the many hard questions one can run into in eliciting knowledge. As Tony O’Hagan, who’s written extensively on elicitation, has said “Eliciting expert knowledge carefully, and as scientifically as possible, is not simply a matter of sitting down with one or more experts and asking them to tell us what they think.” But while we know the elicitation process matters, it can be very difficult to evaluate whether you’ve gotten the “right” beliefs from someone. There’s undoubtedly some effect of asking them in the first place, and a danger that the elicitation process hallucinates an unwarranted amount of detail or bias in representing their knowledge. For me it’s been the kind of topic where the more I work on it, the less confident I feel that it is working.   

Anyway, prior elicitation is just one relatively well studied form of elicitation. Dan Kerrigan, Enrico Bertini and I recently looked at a sample of papers dealing with applied machine learning papers whose modeling contributions involve integrating knowledge gained from domain experts. There’s some precedent for looking at elicitation in what are called “expert systems,” which tends to be associated with developing knowledge bases and rule-based approaches to mimic expert decision making, popular in the 80s and into the 90s. We consider knowledge elicitation broadly in the context of fully automated and mixed initiative or human-in-the-loop predictive models, where there’s often some acknowledgment of the need to work with domain experts to make sure the models solve the right problem and gain intrinsic trust in the sense of aligning with the experts’ causal models, etc. 

Eliciting knowledge from domain experts can play an important role throughout the machine learning process, from correctly specifying the task to evaluating model results. However, knowledge elicitation is also fraught with challenges. In this work, we consider why and how machine learning researchers elicit knowledge from experts in the model development process. We develop a taxonomy to characterize elicitation approaches according to the elicitation goal, elicitation target, elicitation process, and use of elicited knowledge. We analyze the elicitation trends observed in 28 papers with this taxonomy and identify opportunities for adding rigor to these elicitation approaches. We suggest future directions for research in elicitation for machine learning by highlighting avenues for further exploration and drawing on what we can learn from elicitation research in other fields.

We looked at a set of papers (28) that were published in ML or related areas in the last 25 years and specifically mentioned domain knowledge elicitation. In getting to these we sifted through a much larger set of ML related papers that suggested some integration of domain knowledge, but ruled out at least as many as we report on because they didn’t motivate and describe the elicitation of expert knowledge or they didn’t demonstrate the use of the elicited knowledge in a model or system. (PS If you have paper suggestions please post them in the comments, we’re interested in growing the list for future reference).   

Domain knowledge can play many roles in an ML pipeline, so among those papers that did describe elicitation in some detail, we differentiate the high level goal for each described form of elicitation, including problem definition (figuring out the task, how humans currently do it, what the relevant data sources are, how to evaluate performance), feature engineering (feature relevance, transformations, etc.), model development (eliciting rules, constraints, or feature relationships), and model evaluation, not very prevalent in the sample but cases where domain expertise was used to assess the model’s performance and validate its results to improve the model. 

We also break down elicitation processes into the target of elicitation (e.g., instances, feature relevance, model constraints), process details like the medium (e.g., meetings or interviews, computer app), whether the elicitation prompt is well-defined (e.g., are there clear questions or tasks that are described as being put to the expert), whether any context given to the expert to establish common ground is described (which might include describing elicitation goals, model details, notable instances, or other information to “grease the wheel”), how well structured the prompts are, how constrained the experts’ responses are, if potentially unreliable or erroneous responses were accounted for in modeling (e.g., were expert labels treated as probabilistic?), and whether or not extracted knowledge was validated (checked for consistency or reliability in some way, even if informally like by recapitulating to the expert what one learned from an interview; notably this was absent in 75% of the papers we looked at). We look at whether there’s any pre-processing required before integrating into model development, and if so, whether it’s described, and how well defined the use of the information is in the model development pipeline. Was there a predefined and unambiguous process for applying the expert knowledge in the model development pipeline? Examples where the use of the elicited knowledge is well defined would include, for example, eliciting priors on edges for a Bayesian network, incorporating a monotonicity constraint to capture the expert’s mental model of how a feature behaves, or asking experts to label certain instances for training the model. An example where it’s not would include reporting on holding workshops with domain experts like doctors to understand what they found challenging about certain diagnoses but then not saying how exactly the information that was gained informed the model.  

Here’s a diagram showing how things break down in terms of goals, different types of process details, and use of the elicited info. We have to recall that these are the papers that provided enough detail to code in the first place, but overall it seems there’s a fair number of cases where the medium used to collect information isn’t specified, and tendencies to overlook how the expert is guided to think about the goals or model details as they provide information (context), to treat the elicited knowledge as given rather than formally account for uncertainty (unreliable responses not accounted for), and to not worry about validating responses. Sankey diagram showing patterns in elicitation processes in applied ml papers

Maybe the informalness of elicitation processes that these results suggest (at least when we’re not talking about more conventional uses we saw like eliciting labels or feature relevance) isn’t that surprising, since ML has traditionally de-emphasized the human tweaking and interfacing parts. It’s still relatively recently in the history of AI/ML that ideas like intelligence augmentation, human-in-the-loop, etc. have become mainstream. Maybe the best of the old expert systems elicitation approaches should be dusted off and covered more in ML curricula.  

If a bigger analysis were to find similar patterns, how important it is to increase the emphasis on more rigorous elicitation in ML? We argue that more rigor (and more development and validation of elicitation focused methods or tools) would be a worthy investment, for reasons of efficiency and transparency. If you’re not doing any validation or thinking carefully about prompting, establishment of common ground, etc. it seems easy to end up with some valid elicited knowledge and some hallucinated bits. And whenever the integration of domain knowledge is used as a selling point for the research contribution, it’s fair to expect a reproducible process that defines the target of elicitation, how the knowledge will be used, and how responses will be evaluated, rather than figuring it all out as we go. 

Summer internships at Flatiron Institute’s Center for Computational Mathematics

We’re hiring a crew of summer interns again this summer. We are looking for both undergraduates and graduate students. Here’s the ad.

I’m afraid the pay is low, but to make up for it, we cover travel, room, and most board (3 meals/day, 5 days/week). Also, there’s a large cohort of interns every summer across the five institutes at Flatiron (biology, astrophysics, neuroscience, quantum physics, and math), so there are plenty of peers with whom to socialize. Another plus is that we’re in a great location, on Fifth Avenue just south of the Flatiron Building (in the Flatiron neighborhood, which is a short walk to NYU in Greenwich Village and Google in Chelsea as well as to Times Square and the Hudson River Park).

If you’re interested in working on stats, especially applied Bayesian stats, Bayesian methodology, or Stan, please let me know via email at [email protected] so that I don't miss your application. We have two other Stan devs here, Yuling Yao (postdoc) and Brian Ward (software engineer).

We're also hiring full-time permanent research scientists at both the junior level and senior level, postdocs, and software engineers. For more on those jobs, see my previous post on jobs at Flatiron. That post has lots of nice photos of the office, which is really great. Or check out Google's album of photos.

Prior knowledge elicitation: The past, present, and future

Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami write

Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem, in principle. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.

Why are we not widely using prior elicitation? We analyze the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing under-studied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.

See the full 60 page paper in arXiv

(This post is by Aki, one of the many co-authors of the paper)

Drawing Maps of Model Space with Modular Stan

This post is by Ryan Bernstein. I’m a PhD candidate at Columbia with Andrew and Jeannette Wing.

Right now, each probabilistic program we write represents a single model. As we tackle a modeling task, we try one model after another, and old programs pile up in our “models” folder or git history. What if instead of describing one model at a time, we focused on drawing a map of our model space? What if a program represented not a single probabilistic model, but a whole network of models and their rich interconnections, a region of model space that grew as we explored?

If we drew maps of model space as we experimented, it would be easier for others to follow our work. We could annotate regions of our map, like “These models rely on a big assumption,” or “Here, there be dragons convergence issues.” We could draw our path of exploration, from our first experiment to our final model and all of the false turns along the way. We could document, “This way was too slow”, or “This way gave strange prior predictions”, or “I didn’t have time to explore this way, maybe you should.” Issues like the Garden of Forking Paths and P-hacking make it clear that how and why a model was grown is necessary context for assessing its trustworthiness; model space maps would act as living documentation of that context.

Having a map would also help us navigate through model space as we explore. We could orient ourselves in model space by contrasting our current model to its immediate neighbors: we could say things like, in that direction, a posterior mean trends upwards, or calibration issues appear, or predictive accuracy improves.

Once we sketch out a region of model space, we could deploy robot servants to help us understand it. They could scout ahead for models that are accurate, simple, or robust. They could also check that our conclusions don’t depend too much on any one turn we made along our route, ala the Garden of Forking Paths. Maps enable automation, and automation can save us time and catch our errors.

So what kind of program can represent a growing network of models while remaining concise and readable?

I’ve developed a simple metaprogramming feature called ‘swappable modules’1 that can extend languages like Stan to do just that. Each ‘modular Stan program’ can compile down to a whole set of normal, ‘concrete’ Stan programs. For now we’ll treat model space as a discrete2 network of models and we’ll assume that the data is consistent across the network.

First I’ll introduce ‘modular Stan’ alongside a trivial example, and then I’ll give two more interesting case studies. You can play along at home with my prototype compiler and interactive visualizations at http://ryanbe.me/modular-stan.html. You can load each example with the buttons on the lower left, or you can write your own modular program. Expect plenty of bugs!

A way to program maps

Suppose we have some data x that we believe to be normally distributed. Let’s write a modular Stan program to represent a region of the model space of x.

Modular Stan programs start with a base, and then they are built up by adding one piece, or module, at a time. The base is part of every model we produce, so it should only contain things we’re sure won’t change: the shape of the data, the key inferred quantities, and a basic outline of the model. Here is the base for our `x` modular program:

data {
    int N;
    vector[N] x;
}
model {
    x ~ normal(Mean(), Stddev());
}

So far this is like a Stan program except that it has two “holes” in it, Mean and Stddev, that represent as-yet undefined behavior.

Next, we write a module to fill in each hole. A module can contain arbitrary code and declare new parameters like a miniature Stan program, and it can also return a value.

Here’s a module to fill in the Mean hole:

module "standard" Mean() {
    return 0;
}

This reads as: define a module called "standard" that fills the Mean hole with 0.

Here’s a module to fill Stddev:

module "standard" Stddev() {
    return 1;
}

Our modular Stan program up to this point is equivalent to a single concrete Stan program:

data {
    int N;
    vector[N] x;
}
model {
    x ~ normal(0, 1);
}

This isn’t very interesting, so let’s explore the model space a little bit more. We can explore by simply adding more modules.

Let’s try modeling the mean of x as a random variable. We add a module called "normal" to fill the Mean hole with a new parameter that has a prior:

module "normal" Mean() {
    parameters {
        real mu;
    }
    mu ~ normal(0, 1);
    return mu;
}

If we use "normal" to fill Mean and "standard" to fill Stddev, the concrete Stan program will be:

data {
    int N;
    vector[N] x;
}
parameters {
  real mu;
}
model {
    mu ~ normal(0, 1);
    x ~ normal(mu, 1);
}

Since we now have two ways to fill the holes, our modular Stan program is equivalent to a set of two concrete Stan programs.

Now suppose we also want to model the standard deviation, but we aren’t sure whether or not to use an informative prior (sorry for the contrived example). We can leave that decision as another hole:

module "lognormal" Stddev() {
    parameters {
        real<lower=0> sigma;
    }
    sigma ~ lognormal(0, StddevInformative());
    return sigma;
}
module "yes" StddevInformative() {
    return 1;
}
module "no" StddevInformative() {
    return 100;
}

StddevInformative is a hole nested inside the module "lognormal". In this way, modular programs are hierarchical as well as branching. We can view modular programs as trees3:

Our modular program now represents a set of six concrete models, one for each way the set of holes can be filled: two ways for Mean ("standard" and "normal") times three ways for Stddev ("standard", "lognormal" with StddevInformative: "yes", and "lognormal" with StddevInformative: "no".).

How do we build a network from this set of programs? There’s a natural definition: we draw an edge between two models if they differ by only one module4. Each neighbor is then one ‘decision’ away5. Here is the network of models for this trivial example:

A representation of the network of models for the “x” example. Edges are annotated with “hole” that differs between the two models. The prototype website has an interactive version of this visualization.

‘Golf’ Example: A travelogue through model space

I’ve rewritten Andrew’s golf case study as a modular program to show what it would look like to write a case study as a “travelogue” of development through model space.

In the case study, the modeling task is to understand a golfer’s chance of sinking a shot given their distance from the hole. Like before, we’ll start by writing an outline of the problem that will be constant for the whole model space:

data {
  int J;        // Number of distances
  vector[J] x;  // Distances
  int n[J];     // Number of shots at each distance
  int y[J];     // Number of successful shots at each distance
}
model {
  y ~ NSuccesses(n, PSuccess(x));
}

We start with two holes: NSuccesses is the distribution of the number of successful shots given the number of attempts n and the probability of success; PSuccess is the probability of success given the distance from the hole x.

A good way to count successes is the Binomial distribution, so let’s add a module called “binomial” to fill NSuccesses:

module "binomial" NSuccesses(y | n, p) {
  y ~ binomial(n, p);
}

Now we need a module for PSuccess to map the distance of a shot x to the probability of success. We start with a logit function:

module "logistic" PSuccess(x) {
  parameters {
    real a;
    real b;
  }
  return logit(a + b*x);
}

We’ve now defined logistic regression, the first model from the case study:

data {
  int J;        // Number of distances
  vector[J] x;  // Distances
  int n[J];     // Number of shots at each distance
  int y[J];     // Number of successful shots at each distance
}
parameters {
  real a;
  real b;
}
model {
  y ~ binomial(n, logit(a + b*x));
}

In this fashion, we can add a new module for each step of Andrew’s logic in the case study. See the full modular program here: http://ryanbe.me/modular-stan.html?example=golf.

Let’s use my prototype graphical interface to visualize the resulting modular program. Here are the parts of the interface:

1. Modular program editor with compile and load buttons. 2. Interactive module tree. 3. Interactive network of models. 4. Selected concrete Stan program, if any. 5. Label and documentation editor for the selected concrete program, if any.

We can select modules to narrow down the network of models to a single concrete model, as shown in the figure at the top of this post.

And finally we can see the case study as a “travelogue” that documents a path through the network of models:

‘Birthday’ Example: Exploring model space with “scouting robots”

To demonstrate automation on the network of models, I’ve rewritten the Birthday case study from as a modular program. I collaborated with Hyunji Moon to build this demo.

The goal of the birthday case study is to understand trends in the number of births over time. The authors use a Gaussian process time series model, and they experiment with including a few different types of trends (e.g. day-of-week, day-of-year, holidays, seasonal) and other modeling decisions (e.g. the weighting structure of days-of-week scores and the choice of hierarchical priors on day-of-year scores). When each trend inclusion or other modeling decision is represented as an independent choice of module, the modular program represents 120 models.

You can load the example with the full source code at http://ryanbe.me/modular-stan.html?example=birthday.

Here is the network of modules:

Suppose we want to find a model that does a reasonable job of fitting our time-series birthday data. That’s a lot of possibilities to explore manually! So let’s enlist a friendly robot to search the network for promising models.

We send off our robot with instructions for a greedy search:

“Start here.
Score the neighbors of every model you visit.
Move to the highest scoring model you’ve seen so far.
If you’re already at the highest scoring model, stop searching.”

This is like discrete gradient descent. We equip our robot with a scoring method: Expencted Log-Predictive Density (ELPD) is a reasonable choice that approximates goodness-of-fit.

Here is the data our robot collects:

The red annotations show the ELPD scores for the assessed models. The robot traveled along the arrow path from [START] to [GOAL].

In this case the robot ended its search at the case study’s final model. This optimization amounts to simple symbolic regression on probabilistic programs.

While greedy maximization of ELPD is naive and we shouldn’t blindly trust it to give us our final model, we can at least use it to find promising neighborhoods. I hope this serves as a proof of concept to open the door for other, smarter network-wide algorithms.

Future work

I hope that model space “maps” will give us new “navigation” tools for development, enable automatic search and validation, and help others to understand and reuse our work.

There’s a ton of opportunity for development in this area, such as:

  • What other algorithms can run on a network or neighborhood of models? What other scores would be useful to optimize?
  • How should we decorate network edges to support model navigation? What about for special cases like causal inference?
  • Can we validate modules themselves6?
  • Can we build libraries of modules?
  • Can we learn to synthesize modules?
  • Can we automatically ensemble or stack models in a network?
  • Can we include automated model transformations in the network in addition to module-swapping transformations?
  • If we’re fitting models in a network one after another, can we use their relatedness to sample more efficiently?
  • Are there extensions of the module-swapping system or other metaprogramming features that would be worth the added complexity7?

If you have any questions or ideas, please leave a comment or reach me directly at ryan[dot]bernstein[at]columbia.edu!

Footnotes:

1
A more technical programming languages term might be, ‘module system with non-deterministic functor application’.

2
We can treat the model space as discrete by unifying models that differ only by the values of continuous hyperparameters. Those hyperparameters can be optimized (or modeled) independently from the network of models approach.

3
It is possible for modular programs to not form trees when two modules contain the same hole, but all modular programs form DAGs.

4
Strictly speaking, to account for holes like StddevInformative that aren’t always in use, two models are neighbors if they differ by one subtree of modules.

5
Akin to a Hamming graph.

6
To paraphrase Bob’s 2021 ProbProg talk: modularity is the elephant in the room, but its big challenge is testing modules in isolation. The next best alternative may be to test within a modeling context by a sort of swap-one-out validation.

7
For example: a special type of “hole” that’s filled with a collection of modules and returns their values as an array. This would be useful for cases like including a subset of features in a regression or components of a Gaussian process.

It’s so hard to compare the efficiency of MCMC samplers

I thought I’d share a comment I made on Perry de Valpine‘s post on the NIMBLE blog,

Perry was posting about a paper that tried to compare efficiency of Stan, NIMBLE, and JAGS for some linear modeling problems. You can find the link in Perry’s post if you really care and then the paper links to their GitHub. Perry was worried they were misrepresenting NIMBLE and users would get the wrong idea. We decided to take the approach of established companies and simply ignore this kind of criticism. But this time I couldn’t resist as it’s not really about us.

Here’s my comment (lightly edited):

Comparing systems on an equal footing is a well-nigh impossible task, which is why we shy away from doing it in Stan. The polite thing to do with these kinds of comparisons is to send your code to the devs of each system for tuning. That’s what we did for our autodiff eval paper.

I don’t like this paper’s evaluation any more than you do! I’d like to see an evaluation with (a) arithmetic on an equal footing, (b) the kinds of priors we actually use in our applied work, and (c) something higher dimensional than p = 100 (as in p = 10,000 or even p = 100,000 like in the genomics regressions I’m working on now). Then the evaluation I care about is time to ESS = 100 as measured by our conservative cross-chain ESS evaluations that also allow for antithetical sampling (Stan can produce samples whose ESS is higher than the number of iterations; may estimators just truncate at the number of iterations because they don’t understand ESS and its relation to square error through the MCMC CLT). The problem with this kind of eval is that we want to represent actual practice but also minimize warmup to put systems in as favorable a light as possible. In simple GLMs like these, Stan usually only needs 100 or maybe 200 iterations of warmup compared to harder models. So if you use our default of 1000 warmup iterations and then run sampling until you hit ESS = 100, then you’ve wasted a lot of time in too much iteration. But in practice, you don’t know if you can get away with less warmup (though you can use something like iterative deepening algorithms to probe deeper, they’re not built in yet).

One way to get around this sensitivity to warmup is to evaluate ESS/second after warmup (or what you might call “burnin” if you’re still following BUGS’s terminology). But given that we rarely need more than ESS = 100 and want to run at least 4 parallel chains to debug convergence, that’s not many iterations and you start getting a lot of variance in the number of iterations it takes to get there. And things are even more sensitive to getting adaptation right. And also, I don’t think ESS/second after warmup is the metric practitioners care about unless they’re trying to evaluate tail statistics, at which point they should be seriously considering control variates rather than more sampling.

In other words, this is a really hard problem.

I then read Perry’s follow up and couldn’t help myself. I actually looked at their Stan code. Then I had a follow up comment.

I just read their source code. It’s not exactly Stan best practices for statistics or computation. For instance, in mixture_model.stan, there are redundant data computations per iteration (line 25), redundant distributions (also line 25), inefficient function calls (line 31), and a conjugate parameterization inducing extra work like sqrt (line 23). Then in AFT_non_informative.stan, they use very weakly informative priors (so it’s misnamed), there are missing constraints on constrained variables (lines 17, 32), redundant computation of subexpressions and of constants (lines 26, 27), missing algebraic reductions (also lines 26 and 27), redundant initialization and setting (lines 22/23 and 26/27), redundant computations (line 32).

The worst case for efficiency is in their coding of linear models where they use a loop rather than a matrix multiply (LinearModel_conjugate.stan, lines 30–32, which also violates every naming convention in the world by mixing underscore separators and camel case). This code also divides by 2 everywhere when it should be multiplying by 0.5 and it also has a bunch of problems like the other code of missing constraints (this one’s critical—`sigma` needs to be constrained to be greater than 0).

Then when we look at LinearModel_non_informative_hc.stan, things get even worse. It combines the problems of LinearModel_conjugate with two really bad ones for performance: not vectorizing the normal distribution and needlessly truncating the Cauchy distribution. These would add up to at least a factor of 2 and probably much more.

And of course, none of these exploited within-chain parallelization or GPU. And no use of sufficient stats in the conjugate cases like Linear_model_conjugate.stan.

Simulation-based inference and approximate Bayesian computation in ecology and population genetics

Paul Kedrosky asks:

Have you written anything on approximate Bayesian computation? It is seemingly all the rage in ecology and population genetics, and this recent paper uses it heavily to come to some heretical conclusions.

I passed this over to Lizzie who wrote:

Fun paper! Unfortunately, I too do not know much about this myself but I have two examples where I have heard of using it, both of which I think have some similar issues of exponentially increasing number of possible branches taken (e.g., Coalescent theory of pop gen and Lévy flight models). I don’t know the math of these models well enough to know what exactly makes the likelihood so expensive to calculate and I don’t really see authors explaining that on quick glance. My very gut reaction is that these are inherently non-identifiable models ….

Lizzie also pointed to figure 1 in this paper which I’ve reproduced above. And she asked, “What makes something ‘approximate’ Bayesian? And is it different than being ‘approximately’ Bayesian?” I replied that, except in some very simple models, all inferences, Bayesian or otherwise, are approximate, and all computations are approximate too, in the sense that you can’t really get exact draws from the posterior distribution or closed forms for posterior expectations or intervals.

And Lizzie connected us to Geoff Legault, who wrote:

The paper is also a mystery to me, but I do think ABC methods, or more broadly, simulation-based inference can be useful if done carefully and with full awareness of its many limitations.

In ecology, there are many “models” that are easily simulated, that in principle have likelihood/probability functions, but those likelihoods are not available or they cannot possibly be computed. For example, certain kinds of stochastic, spatially-explicit population models would require working with transition matrices that are too large for any computer that currently exists or will ever exist (probably). I gather some models in physics also such intractable likelihoods. In those cases, the idea is to use the simulator to either approximate the likelihood or simply to learn more about your model. Most applications of simulation-based inference that I’ve seen opt for the latter: parameter values are sampled from a prior distribution, data is simulated with them, distances between several summary statistics of simulated and real data are computed, and those (parameter) samples with distances below some tolerance level are retained. To me, the reliance on distances between summary statistics feels much more “approximate” than the usual approximations we accept when making most inferences.

In addition to the paper Lizzie linked, I highly recommend this recent review of simulation-based inference.

And John Wakeley adds:

Here’s ABC in a nutshell from the perspective of population genetics. Ideally you want the likelihood of the full data, but this is too complicated to compute analytically. It is even too complicated to compute using simulations because the data have too many dimensions and you might not even be quite sure what to measure. An example would be the whole genomes of a sample of individuals. Also, you are interested in particular model(s) with particular parameters which might not be enough to determine all aspects of the data. These issues are of course in common to all sorts of inference problems. In ABC, there are two or three ways in which approximations are made. The first, also in common with many inference methods, is dimension reduction. You choose some quantities to calculate from the full data, ones which capture what you think are important axes of variation, relevant to your model(s) and parameters. The second is to choose a tolerance level, which you’ll use to estimate the probability of the reduced data by a rejection method. Specifically, for a given set of parameter values, you compute the fraction of simulation replicates which produce values of these quantities close enough to your observed ones to be acceptable. Obviously this requires some finesse: too loose a tolerance means tossing out much to the information you’ve got, and too tight a tolerance can put you back in situation you were trying to avoid in the first place (i.e. if almost no simulation replicates will give exactly the observed values of your quantities). A wrinkle in population genetics, due to genomes being more-or-less digital, is that most of the commonly used quantities are discrete rather than continuous, so tolerances should be chosen accordingly but that’s not necessarily a simple matter depending on the particular quantities. The third point of approximation is the choice of the number of simulation replicates. I suppose this could also bring up subtleties, e.g. the unevenness of errors in estimating the likelihoods for different sets of parameters values.

So there you have it. This discussion makes me think that there could be some useful ways to extend our Bayesian workflow to simulation-based models.

Webinar: The Current State and Evolution of Stan

This post is by Eric.

Next Thursday, Rok Češnovar is stopping by to talk to us about recent developments in and around Stan. You can register here.

Abstract

We will present the current state of the Stan ecosystem, highlight some of the advances that improved the performance of Stan in the past few years, and discuss what is still to come in the not-so-distant future. You will hear about the core modules of Stan, how they all fit together, and how the various interfaces bring that core to life in different ways to make your Stan models run. If your models have been compiling and running faster recently and you were wondering why, we will present the improvements that were made in Stan over the last few years that sped up those pesky gradient evaluations. If you feel that your models are still running slowly, we will provide a few tips on how to identify the computational (non-modeling) reasons why that may be the case, and what are some of the things you can do to speed up computation.

About the speaker

Rok Češnovar is a Stan developer and a PhD student at the University of Ljubljana who is interested in making Bayesian inference faster using parallelism on GPUs and multi-core CPUs. He joined the Stan development team while working on adding GPU support with the team of prof. Štrumbelj at UL, but quickly became interested in almost all aspects of Stan. In addition to working on adding the GPU support, he has co-authored the cmdstanr R package, added profiling for Stan, and implemented the language side of reduce_sum and the new ODE interface. He has also been managing and coordinating Stan & CmdStan releases for the last two years.

The video is now available here.

Comparison of MCMC effective sample size estimators

Aki writes:

tl;dr

In the case of well mixing chains, the functions ess_basic and mcse_quantile in the posterior package are the most accurate methods. The ess_basic package computes sum of autocorrelations using the truncation rule of Geyer (1992), and mcse_quantile uses in addition an inverse approach for quantile Monte Carlo standard error. In the case of well separated modes, the posterior package also provides appropriately small effective sample size indicating that the chains are not mixing.

What should effective sample size estimates look like when we have chains that are not mixing? Here’s what Aki says:

If the modes are well separated so that between-chain variance is much bigger than within-chain variances, each chain is mostly indicating the loaction of the mode and not worth much more than one observation. The posterior package combines Geyer truncated autocorrelation estimates with multi-chain convergence diagnostic R-hat. If the chains are not mixing, then the draws within each chain are considered to be more correlated within the chain and the multi-chain ESS estimate gets smaller. If the modes are well separated, multi-chain ESS is close to the number of chains. Arguably even this is an overestimate as increasing the number of non-mixing chains does not provide reliable estimate of the relative masses of the masses.

This is a big deal. If you estimate effective sample size separately for each chain and then average the chains, then you’ll be much too optimistic when your chains aren’t mixing. That’s why we combine between and within-chain information as described in BDA but which is done more effectively in our recent paper, Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC.

Finally, here’s the background from Aki:

The MCMC effective sample size (ESS) and Monte Carlo standard error (MCSE) estimated for one chain includes estimation of the correlation between the iterations, for example, using autocorrelation time (or spectral density at frequency zero). As a finite number of MCMC draws are available the autocorrelation estimates for bigger lags are noisier. In our paper, we wrote about autocorrelation estimators:

In our experiments, Geyer’s (1992) truncation had superior stability compared to flat-top (Doss et al., 2014) and slug-sail (Vats and Knudson, 2018) lag window approaches.

This notebook includes additional experiments . . .

Again, here’s the link to Aki’s notebook.

Is There a Replication Crisis in Finance?

Lasse Heje Pedersen writes:

Inspired by in part by your work on hierachical models, we have analyzed the evidence on research in financial economics, overturning the claims in prior papers that this field faces a replication crisis. Indeed, the power of the hierarchical model relative to frequentist multiple-testing adjustment along with simple improvements (e.g., leaving out findings that were never significant in the first place) turns out to make a huge difference here.

We also make simulations that show how the hierachical model reduces the false discovery rate while sacrificing little power. Comments very welcome!

The research article, by Theis Ingerslev Jensen, Bryan Kelly, and Lasse Heje Pedersen, begins:

Several papers argue that financial economics faces a replication crisis because the majority of studies cannot be replicated or are the result of multiple testing of too many factors. We develop and estimate a Bayesian model of factor replication, which leads to different conclusions. The majority of asset pricing factors: (1) can be replicated, (2) can be clustered into 13 themes, the majority of which are significant parts of the tangency portfolio, (3) work out-of-sample in a new large data set covering 93 countries, and (4) have evidence that is strengthened (not weakened) by the large number of observed factors.

I don’t know anything about the topic and I only glanced at the article, but I’m sharing because of our general interest in the topic of reproducibility in research.

Learn Stan, get a job doing Bayesian modeling. How cool is that??

Tom Vladeck writes:

About a year ago, you posted our want-ad on your blog. Because of that, we hired at least one (and maybe two, but my memory of where one of the candidates came from is a bit hazy) really great contractors — both of whom are on the Stan development team: Adam Haber and Rok Cesnovar.

We’re back in the market, this time for a full-time hire whose responsibilities will be less to build the model than to run it — e.g. set the priors for each customer, work with the draws, generate forecasts, etc.

That’s great! Here’s the ad:

We’re hiring a data scientist with experience in bayesian inference to help build Recast, a company to build a next-gen media mix model. This person will work directly with clients to understand their marketing activity, customize our core model for each client, and to develop forecasts and scenario analyses.

The market is huge and for such a small company we’re getting substantial inbound interest. We’ve also got a few people on the Stan development team helping us build the model. We’re out of the heavy R&D stage and are now actively growing the company. Early paying customers include Away, Harry’s, Mockingbird, and others. Roughly $500 million dollars of annual media spend is already modeled on our platform.

This is an opportunity to help build a big business and work at the cutting edge of causal inference, statistical computing, and marketing data science.

Here’s the JD. You can apply by emailing me directly at [email protected] Let me know why this is interesting to you and drop in any relevant github / linkedin / other links.

It’s great to see our work being useful to people out there in the world. That’s what free software is all about.

Approximate leave-future-out cross validation for Bayesian time series models

Paul Bürkner, Jonah Gabry, and Aki Vehtari write:

One of the common goals of time series analysis is to use the observed series to inform predictions for future observations. In the absence of any actual new data to predict, cross-validation can be used to estimate a model’s future predictive accuracy, for instance, for the purpose of model comparison or selection. Exact cross-validation for Bayesian models is often computationally expensive, but approximate cross-validation methods have been developed, most notably methods for leave-one-out cross-validation (LOO-CV). If the actual prediction task is to predict the future given the past, LOO-CV provides an overly optimistic estimate because the information from future observations is available to influence predictions of the past. To properly account for the time series structure, we can use leave-future-out cross-validation (LFO-CV). Like exact LOO-CV, exact LFO-CV requires refitting the model many times to different subsets of the data. Using Pareto smoothed importance sampling, we propose a method for approximating exact LFO-CV that drastically reduces the computational costs while also providing informative diagnostics about the quality of the approximation.

A useful generalization, I’d say!

More should be possible, although I’m not quite sure how. One concern with leave-future-out is that you’re evaluating model fit for the early points in the series in a different way than how you’re evaluating later points in the series. Which makes me think that this sort of marginal evaluation could be done in other settings as well. Maybe one way to understand what’s possible would be to apply this mixed marginal/conditional predictive idea (something sliding between prior and posterior predictive checking) for the simpler problem of regression with independent data points. In some sense, leave-one-out or other forms of cross validation are already intermediate steps along this continuum, so maybe this is a good opportunity to explore these ideas further. From the other direction, we can consider problems such as spatial and network models in which there is no ordering, and cross-validation is related to the pseudo-likelihood idea of Besag (1974).